mirror of
https://github.com/simon987/dataarchivist.net.git
synced 2025-04-19 17:16:42 +00:00
Little bit more work on tt_1
This commit is contained in:
parent
320048b60c
commit
653705b07d
@ -2,11 +2,14 @@
|
|||||||
title: "Web scraping with task_tracker"
|
title: "Web scraping with task_tracker"
|
||||||
date: 2019-06-14T14:31:42-04:00
|
date: 2019-06-14T14:31:42-04:00
|
||||||
draft: true
|
draft: true
|
||||||
|
tags: ["scraping", "task_tracker"]
|
||||||
author: simon987
|
author: simon987
|
||||||
---
|
---
|
||||||
|
|
||||||
|
I built a tool to simplify long-running scraping tasks processes. **task_tracker** is a simple job queue
|
||||||
|
with a web frontend. This is a simple demo of a common use-case.
|
||||||
|
|
||||||
|
Let's start with a simple script I use to aggregate data from Spotify's API:
|
||||||
|
|
||||||
{{<highlight python >}}
|
{{<highlight python >}}
|
||||||
import spotipy
|
import spotipy
|
||||||
@ -20,17 +23,30 @@ def search_artist(name, mbid):
|
|||||||
name = '"' + name + '"'
|
name = '"' + name + '"'
|
||||||
|
|
||||||
with open(os.devnull, 'w') as null:
|
with open(os.devnull, 'w') as null:
|
||||||
# Silence spotipy's stdout
|
# Silence spotipy's stdout...
|
||||||
stdout = sys.stdout
|
|
||||||
sys.stdout = null
|
sys.stdout = null
|
||||||
res = spotify.search(name, type="artist", limit=20)
|
res = spotify.search(name, type="artist", limit=20)
|
||||||
sys.stdout = stdout
|
sys.stdout = sys.__stdout__
|
||||||
|
|
||||||
with sqlite3.connect(dbfile) as conn:
|
with sqlite3.connect(dbfile) as conn:
|
||||||
conn.execute("INSERT INTO artist (mbid, query, data) VALUES (?,?,?)", (mbid, name, json.dumps(res)))
|
conn.execute("INSERT INTO artist (mbid, query, data) VALUES (?,?,?)", (mbid, name, json.dumps(res)))
|
||||||
conn.commit()
|
conn.commit()
|
||||||
{{</highlight>}}
|
{{</highlight>}}
|
||||||
|
|
||||||
|
I need to call `search_artist()` about 350000 times and I don't want to bother setting up multithreading, error handling and
|
||||||
|
keeping the script up to date on an arbitrary server so let's integrate it in the tracker.
|
||||||
|
|
||||||
|
My usual workflow is to create a project per script. I pushed the script to a [Gogs](https://gogs.io/) instance and created the project.
|
||||||
|
This also works with Github/Gitea.
|
||||||
|
{{< figure src="/tt/new_project.png" title="New task_tracker project">}}
|
||||||
|
|
||||||
|
After the Webhook is setup, **task\_tracker** will stay in sync with the repository its workers will be made aware of the new changes
|
||||||
|
instantly. This is not something we have to worry about since our **task_tracker_drone** takes care of deploying and updating the projects
|
||||||
|
in real time with no additional configuration.
|
||||||
|
|
||||||
|
{{< figure src="/tt/hook.png" title="Gogs webhook configuration">}}
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
{{<highlight python >}}
|
{{<highlight python >}}
|
||||||
try:
|
try:
|
||||||
@ -70,7 +86,5 @@ print(json.dumps({
|
|||||||
|
|
||||||
|
|
||||||
|
|
||||||
{{< figure src="/tt/new_project.png" title="New task_tracker project">}}
|
|
||||||
{{< figure src="/tt/secret.png" title="Project secret settings">}}
|
{{< figure src="/tt/secret.png" title="Project secret settings">}}
|
||||||
{{< figure src="/tt/hook.png" title="Gogs webhook configuration">}}
|
|
||||||
{{< figure src="/tt/perms.png" title="Private project require approval">}}
|
{{< figure src="/tt/perms.png" title="Private project require approval">}}
|
||||||
|
Loading…
x
Reference in New Issue
Block a user