diff --git a/content/about/index.md b/content/about/index.md
index be11b51..8202a45 100644
--- a/content/about/index.md
+++ b/content/about/index.md
@@ -2,7 +2,6 @@
 title: "About"
 date: 2019-09-13T09:30:47-04:00
 draft: false
-author: "simon987"
 ---

 Source code of this website can be found [here](https://github.com/simon987/dataarchivist.net).
diff --git a/content/posts/cf_1.md b/content/posts/cf_1.md
index 5f36e46..03b24f5 100644
--- a/content/posts/cf_1.md
+++ b/content/posts/cf_1.md
@@ -34,7 +34,7 @@ for item, board in scanner.all_posts():

 ## Deduplication
 To avoid publishing the same item twice, the application keeps track of what items were visited in its **state**.
-Items that have the same `last_modified`, `reply_count` or `timestamp` value as the state doesn't need to be visited again.
+Items that have the same `last_modified`, `reply_count` or `timestamp` value as the state don't need to be visited again.
 This deduplication step greatly reduces the amount of HTTP requests necessary to stay up to date, and more importantly,
 it enables the crawler to quickly resume where it left off in the case of a fault.

diff --git a/content/posts/scrape_1.md b/content/posts/scrape_1.md
index e0104f4..502d160 100644
--- a/content/posts/scrape_1.md
+++ b/content/posts/scrape_1.md
@@ -57,7 +57,7 @@ with `l.b.MD5_SECRET_KEY` as the salt.

 {{< figure src="/scrape/dev_tools3.png" title="MD5_SECRET_KEY hidden in plain sight">}}

-The secret code was hidden only a few keystrokes into the source. Now that we have all the puzzle pieces,
+The secret code was hidden only a few keystrokes away in the source. Now that we have all the puzzle pieces,
 let's hack together a simple Python script to automate the download process:

 {{}}
diff --git a/content/posts/tt_1.md b/content/posts/tt_1.md
index ce6ec5a..9ae88d4 100644
--- a/content/posts/tt_1.md
+++ b/content/posts/tt_1.md
@@ -33,7 +33,7 @@ def search_artist(name, mbid):
     conn.commit()
 {{}}

-I need to call `search_artist()` about 350000 times and I don't want to bother setting up multithreading, error handling and
+I need to call `search_artist()` about 350'000 times and I don't want to bother setting up multithreading, error handling and
 keeping the script up to date on an arbitrary server so let's integrate it in the tracker.

 ## Configurating the task_tracker project
@@ -60,6 +60,7 @@ The way **task_tracker_drone** works is by passing the task object and project s
 executable file called `run` in the root of the git repository. It also expects a json object telling it if the
 task was processed successfully, and if there are additionnal actions that needs to be executed:

+**Expected result in stdout**:
 {{}}
 {
     "result": 1,
@@ -80,7 +81,7 @@ The way **task_tracker_drone** works is by passing the task object and project s

 This is what the body of the final worker script looks like:

-The program expects the task recipe and project secret as arguments, and it outputs the result object
+The program receives the task recipe and project secret as arguments, and it outputs the result object
 to stdout.

 {{}}
@@ -119,5 +120,30 @@ print(json.dumps({
 {{}}

+## Allocating worker machines

-{{< figure src="/tt/perms.png" title="Private project require approval">}}
+On the worker machines, you can execute the task runner and it will automatically start
+working on the available projects.
+Private projects require explicit approval to start executing tasks:
+
+{{}}
+git clone https://github.com/simon987/task_tracker_drone
+cd task_tracker_drone
+python -m pip install -r requirements.txt
+
+python ./src/drone.py "https://exemple-api-url.com/api" "worker alias"
+
+# Request access for 1 r={"ok":true}
+# Request access for 2 r={"ok":true}
+# Request access for 3 r={"ok":true}
+# Starting 10 working contexts
+# No tasks, waiting...
+{{}}
+
+{{< figure src="/tt/perms.png" title="Private projects require approval">}}
+
+As soon as you give permission to the worker, it will automatically start executing tasks.
+When a task fails, it will be put back in the task queue up to `task["max_retries"]` times.
+The logs can be found on the web interface:
+
+{{< figure src="/tt/logs.png" title="Logs page">}}

diff --git a/jenkins/Jenkinsfile b/jenkins/Jenkinsfile
index c105a9c..7c65116 100644
--- a/jenkins/Jenkinsfile
+++ b/jenkins/Jenkinsfile
@@ -8,6 +8,7 @@ remote.allowAnyHosts = true
 remote.retryCount = 3
 remote.retryWaitSec = 3
 logLevel = 'FINER'
+remote.port = 2299

 pipeline {
     agent none
diff --git a/static/tt/logs.png b/static/tt/logs.png
new file mode 100644
index 0000000..549a90b
Binary files /dev/null and b/static/tt/logs.png differ
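
Note on the worker contract described in the `content/posts/tt_1.md` changes above: the drone executes a file called `run` and reads a JSON result object from its stdout. Below is a minimal sketch of what such a script could look like. It is only an illustration: the argument layout (task object as a JSON string in the first argument, project secret in the second), the `do_work()` helper, the `recipe` field access, and the meaning of the result code are assumptions, not the actual task_tracker_drone API.

```python
#!/usr/bin/env python3
# Hypothetical minimal `run` script for the worker contract described in the
# tt_1.md changes above. Assumptions (not taken from the patch): the task
# object arrives as a JSON string in argv[1], the project secret in argv[2],
# and only the "result" field is required in the object printed to stdout.
import json
import sys


def do_work(recipe):
    # Placeholder for the actual work, e.g. calling search_artist() with
    # values taken from the task recipe.
    pass


if __name__ == "__main__":
    task = json.loads(sys.argv[1])   # task object (recipe, max_retries, ...)
    project_secret = sys.argv[2]     # project secret passed by the drone

    try:
        do_work(task.get("recipe"))
        result = 1                   # mirrors the "result": 1 shown in the expected-stdout example
    except Exception:
        result = 0                   # assumed failure code; a failed task is re-queued up to max_retries times

    # The drone parses this JSON object from stdout to decide whether the task succeeded.
    print(json.dumps({"result": result}))
```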