mirror of
https://github.com/simon987/dataarchivist.net.git
synced 2025-04-09 21:46:42 +00:00
update about page
parent fe50947f93
commit 5270edcc89
@@ -2,7 +2,6 @@
title: "About"
date: 2019-09-13T09:30:47-04:00
draft: false
author: "simon987"
---

Source code of this website can be found [here](https://github.com/simon987/dataarchivist.net).
@@ -34,7 +34,7 @@ for item, board in scanner.all_posts():
## Deduplication

To avoid publishing the same item twice, the application keeps track of what items were visited in its **state**.
-Items that have the same `last_modified`, `reply_count` or `timestamp` value as the state doesn't need to be visited again.
+Items that have the same `last_modified`, `reply_count` or `timestamp` value as the state don't need to be visited again.

This deduplication step greatly reduces the amount of HTTP requests necessary to stay up to date, and more importantly,
it enables the crawler to quickly resume where it left off in the case of a fault.
@@ -57,7 +57,7 @@ with `l.b.MD5_SECRET_KEY` as the salt.

{{< figure src="/scrape/dev_tools3.png" title="MD5_SECRET_KEY hidden in plain sight">}}

-The secret code was hidden only a few keystrokes into the source. Now that we have all the puzzle pieces,
+The secret code was hidden only a few keystrokes away into the source. Now that we have all the puzzle pieces,
let's hack together a simple Python script to automate the download process:

{{<highlight python "linenos=table,linenostart=18">}}
@@ -33,7 +33,7 @@ def search_artist(name, mbid):
conn.commit()
{{</highlight>}}

-I need to call `search_artist()` about 350000 times and I don't want to bother setting up multithreading, error handling and
+I need to call `search_artist()` about 350'000 times and I don't want to bother setting up multithreading, error handling and
keeping the script up to date on an arbitrary server so let's integrate it in the tracker.

## Configurating the task_tracker project
@@ -60,6 +60,7 @@ The way **task_tracker_drone** works is by passing the task object and project s
executable file called `run` in the root of the git repository. It also expects a json
object telling it if the task was processed successfully, and if there are additionnal actions that needs to be executed:

**Expected result in stdout**:
{{<highlight json >}}
{
"result": 1,
@@ -80,7 +81,7 @@ The way **task_tracker_drone** works is by passing the task object and project s

This is what the body of the final worker script looks like:

-The program expects the task recipe and project secret as arguments, and it outputs the result object
+The program receives the task recipe and project secret as arguments, and it outputs the result object
to stdout.

{{<highlight python >}}
@@ -119,5 +120,30 @@ print(json.dumps({
{{</highlight>}}


## Allocating worker machines

{{< figure src="/tt/perms.png" title="Private project require approval">}}
On the worker machines, you can execute the task runner and it will automatically start
working on the available projects. Private projects require explicit explicit approval to start executing tasks:

{{<highlight bash >}}
git clone https://github.com/simon987/task_tracker_drone
cd task_tracker_drone
python -m pip install -r requirements.txt

python ./src/drone.py "https://exemple-api-url.com/api" "worker alias"

# Request access for 1 r={"ok":true}
# Request access for 2 r={"ok":true}
# Request access for 3 r={"ok":true}
# Starting 10 working contexts
# No tasks, waiting...
{{</highlight>}}


{{< figure src="/tt/perms.png" title="Private projects require approval">}}

As soon as you give permission to the worker, it will automatically start executing tasks.
When a task fails, it will be put back in the task queue up to `task["max_retries"]` times.
The logs can be found on the web interface:

{{< figure src="/tt/logs.png" title="Logs page">}}
1
jenkins/Jenkinsfile
vendored
@@ -8,6 +8,7 @@ remote.allowAnyHosts = true
remote.retryCount = 3
remote.retryWaitSec = 3
logLevel = 'FINER'
remote.port = 2299

pipeline {
agent none
BIN
static/tt/logs.png
Normal file
Binary file not shown.
Size: 126 KiB