---
title: "Web scraping with task_tracker"
date: 2019-06-14T14:31:42-04:00
draft: true
tags: ["scraping", "task_tracker"]
author: simon987
---

I built a tool to simplify long-running scraping processes.
**task_tracker** is a simple job queue with a web frontend. This is a short demo of a common use-case.

Let's start with a simple script I use to aggregate data from Spotify's API:

{{< highlight python >}}
import json
import os
import sqlite3
import sys

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

spotify = spotipy.Spotify(...)
dbfile = "spotify.db"  # local results database


def search_artist(name, mbid):

    # Surround with quotes to get exact matches
    name = '"' + name + '"'

    with open(os.devnull, 'w') as null:
        # Silence spotipy's stdout...
        sys.stdout = null
        res = spotify.search(name, type="artist", limit=20)
        sys.stdout = sys.__stdout__

    with sqlite3.connect(dbfile) as conn:
        conn.execute("INSERT INTO artist (mbid, query, data) VALUES (?,?,?)",
                     (mbid, name, json.dumps(res)))
        conn.commit()
{{< /highlight >}}

I need to call `search_artist()` about 350,000 times, and I don't want to bother with setting up multithreading, error handling and keeping the script up to date on an arbitrary server, so let's integrate it into the tracker.

## Configuring the task_tracker project

My usual workflow is to create one project per script. I pushed the script to a [Gogs](https://gogs.io/) instance and created the project. This also works with GitHub/Gitea.

{{< figure src="/tt/new_project.png" title="New task_tracker project">}}

Once the webhook is set up, **task_tracker** stays in sync with the repository, and its workers are made aware of new changes instantly. This is not something we have to worry about, since **task_tracker_drone** takes care of deploying and updating the projects in real time with no additional configuration.

{{< figure src="/tt/hook.png" title="Gogs webhook configuration">}}

The final configuration step is to set the *project secret*, which we will use to store authentication details. Only workers to which we have given explicit `ASSIGN` permission will be able to read this information.

{{< figure src="/tt/secret.png" title="Project secret settings">}}

## Writing the worker script

**task_tracker_drone** works by passing the task object and the project secret as command-line arguments to the executable file called `run` in the root of the git repository. It also expects a JSON object telling it whether the task was processed successfully, and if there are additional actions that need to be executed:

{{< highlight json >}}
{
    "result": 1,
    "logs": [
        {"message": "This is an 'ERROR' (level 3) message that will be saved in the tracker", "level": 3}
    ],
    "tasks": [
        {
            "project": 1,
            "recipe": "This task will be submitted to the tracker",
            "..."
        }
    ]
}
{{< /highlight >}}

*(See [LogLevel constants](https://github.com/simon987/task_tracker_drone/blob/master/src/tt_drone/api.py#L12))*

This is what the body of the final worker script looks like. The program expects the task object and project secret as arguments, and it outputs the result object to stdout.
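In this project's case, the secret is just a JSON object holding the Spotify API credentials. A minimal sketch with placeholder values (the keys match what the script below reads):

{{< highlight json >}}
{
    "CLIENT_ID": "<your Spotify client id>",
    "CLIENT_SECRET": "<your Spotify client secret>"
}
{{< /highlight >}}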
{{< highlight python >}}
import json
import sys
import traceback

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# search_artist() from the first snippet is defined in this same file

try:
    # This script is called like this:
    # python ./run.py "<task json>" "<secret json>"
    task_str = sys.argv[1]
    task = json.loads(task_str)

    secret_str = sys.argv[2]
    secret = json.loads(secret_str)

    CLIENT_ID = secret["CLIENT_ID"]
    CLIENT_SECRET = secret["CLIENT_SECRET"]

    client_credentials_manager = SpotifyClientCredentials(client_id=CLIENT_ID,
                                                          client_secret=CLIENT_SECRET)
    spotify = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

    # This job's recipe is an array of name & mbid pairs
    recipe = json.loads(task["recipe"])

    for job in recipe:
        search_artist(job["name"], job["mbid"])

except Exception as e:
    print(json.dumps({
        # Tell task_tracker that this task failed; it will be re-attempted later
        "result": 1,
        # Send the full stack trace; it will be available in the logs page
        "logs": [
            {"message": str(e) + traceback.format_exc(), "level": 3}
        ]
    }))
    sys.exit(2)

print(json.dumps({
    "result": 0,
}))
{{< /highlight >}}

{{< figure src="/tt/perms.png" title="Private projects require approval">}}
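As the comments above mention, each task's recipe is a JSON array of name & mbid pairs. For illustration, a recipe could look like this (placeholder values):

{{< highlight json >}}
[
    {"name": "Artist A", "mbid": "00000000-0000-0000-0000-000000000001"},
    {"name": "Artist B", "mbid": "00000000-0000-0000-0000-000000000002"}
]
{{< /highlight >}}

Packing many pairs into each recipe keeps the total number of tasks manageable for the 350,000 queries.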