mirror of https://github.com/simon987/dataarchivist.net.git synced 2025-04-09 21:46:42 +00:00

update about page

This commit is contained in:
simon 2019-11-03 21:58:59 -05:00
parent fe50947f93
commit 5270edcc89
6 changed files with 32 additions and 6 deletions

@@ -2,7 +2,6 @@
title: "About"
date: 2019-09-13T09:30:47-04:00
draft: false
author: "simon987"
---
Source code of this website can be found [here](https://github.com/simon987/dataarchivist.net).

@@ -34,7 +34,7 @@ for item, board in scanner.all_posts():
## Deduplication
To avoid publishing the same item twice, the application keeps track of what items were visited in its **state**.
-Items that have the same `last_modified`, `reply_count` or `timestamp` value as the state doesn't need to be visited again.
+Items that have the same `last_modified`, `reply_count` or `timestamp` value as the state don't need to be visited again.
This deduplication step greatly reduces the number of HTTP requests necessary to stay up to date, and more importantly,
it enables the crawler to quickly resume where it left off in the case of a fault.
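In pseudocode terms, the dedup check described above boils down to something like this sketch (field names are taken from the paragraph; the crawler's actual state storage is not shown in this excerpt):

```python
# Hypothetical sketch of the deduplication check: an item is fetched again
# only when its metadata differs from what the state last recorded.
def needs_visit(item, state):
    saved = state.get(item["id"])
    if saved is None:
        return True  # never seen before
    # identical last_modified / reply_count / timestamp -> skip the request
    return any(item[k] != saved.get(k)
               for k in ("last_modified", "reply_count", "timestamp"))
```

On a fault, reloading the state and re-running this check on each item is what lets the crawler resume where it left off.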

@@ -57,7 +57,7 @@ with `l.b.MD5_SECRET_KEY` as the salt.
{{< figure src="/scrape/dev_tools3.png" title="MD5_SECRET_KEY hidden in plain sight">}}
-The secret code was hidden only a few keystrokes into the source. Now that we have all the puzzle pieces,
+The secret code was hidden only a few keystrokes away into the source. Now that we have all the puzzle pieces,
let's hack together a simple Python script to automate the download process:
{{<highlight python "linenos=table,linenostart=18">}}

@@ -33,7 +33,7 @@ def search_artist(name, mbid):
conn.commit()
{{</highlight>}}
-I need to call `search_artist()` about 350000 times and I don't want to bother setting up multithreading, error handling and
+I need to call `search_artist()` about 350'000 times and I don't want to bother setting up multithreading, error handling and
keeping the script up to date on an arbitrary server so let's integrate it in the tracker.
## Configuring the task_tracker project
@@ -60,6 +60,7 @@ The way **task_tracker_drone** works is by passing the task object and project s
executable file called `run` in the root of the git repository. It also expects a json
object telling it if the task was processed successfully, and if there are additional actions that need to be executed:
**Expected result in stdout**:
{{<highlight json >}}
{
"result": 1,
@@ -80,7 +81,7 @@ The way **task_tracker_drone** works is by passing the task object and project s
This is what the body of the final worker script looks like:
-The program expects the task recipe and project secret as arguments, and it outputs the result object
+The program receives the task recipe and project secret as arguments, and it outputs the result object
to stdout.
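In skeleton form, that contract might look like the following minimal sketch (argument positions and field values are assumptions based on the description above, not the project's actual code):

```python
# Hypothetical sketch of a task_tracker_drone "run" script: the task object
# and project secret arrive as program arguments, the result object leaves
# on stdout.
import json
import sys

def process(task, secret):
    # ... the actual work for this task would happen here ...
    return {"result": 1}  # same shape as the expected result shown earlier

def main(argv):
    task = json.loads(argv[1])  # task object (recipe, priority, ...)
    secret = argv[2]            # project secret (exact format assumed)
    print(json.dumps(process(task, secret)))

if __name__ == "__main__" and len(sys.argv) > 2:
    main(sys.argv)
```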
{{<highlight python >}}
@@ -119,5 +120,30 @@ print(json.dumps({
{{</highlight>}}
## Allocating worker machines
-{{< figure src="/tt/perms.png" title="Private project require approval">}}
On the worker machines, you can execute the task runner and it will automatically start
working on the available projects. Private projects require explicit approval to start executing tasks:
{{<highlight bash >}}
git clone https://github.com/simon987/task_tracker_drone
cd task_tracker_drone
python -m pip install -r requirements.txt
python ./src/drone.py "https://exemple-api-url.com/api" "worker alias"
# Request access for 1 r={"ok":true}
# Request access for 2 r={"ok":true}
# Request access for 3 r={"ok":true}
# Starting 10 working contexts
# No tasks, waiting...
{{</highlight>}}
+{{< figure src="/tt/perms.png" title="Private projects require approval">}}
As soon as you give permission to the worker, it will automatically start executing tasks.
When a task fails, it will be put back in the task queue up to `task["max_retries"]` times.
The logs can be found on the web interface:
{{< figure src="/tt/logs.png" title="Logs page">}}
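The requeue behaviour can be pictured with a small sketch (a hypothetical illustration of the retry loop, not task_tracker's actual scheduler):

```python
from collections import deque

# Hypothetical retry loop: a failed task goes back in the queue until it
# has failed task["max_retries"] times, then it is dropped.
def run_queue(queue, execute):
    while queue:
        task = queue.popleft()
        if execute(task):
            continue  # success, nothing to requeue
        task["retries"] = task.get("retries", 0) + 1
        if task["retries"] < task["max_retries"]:
            queue.append(task)  # put it back for another attempt
```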

jenkins/Jenkinsfile vendored

@@ -8,6 +8,7 @@ remote.allowAnyHosts = true
remote.retryCount = 3
remote.retryWaitSec = 3
logLevel = 'FINER'
remote.port = 2299
pipeline {
agent none

static/tt/logs.png Normal file

Binary file not shown.

Size: 126 KiB