mirror of https://github.com/simon987/dataarchivist.net.git (synced 2025-04-18 00:46:42 +00:00)
update about page
parent fe50947f93
commit 5270edcc89
@ -2,7 +2,6 @@
|
|||||||
title: "About"
|
title: "About"
|
||||||
date: 2019-09-13T09:30:47-04:00
|
date: 2019-09-13T09:30:47-04:00
|
||||||
draft: false
|
draft: false
|
||||||
author: "simon987"
|
|
||||||
---
|
---
|
||||||
|
|
||||||
Source code of this website can be found [here](https://github.com/simon987/dataarchivist.net).
|
Source code of this website can be found [here](https://github.com/simon987/dataarchivist.net).
|
||||||
|
@ -34,7 +34,7 @@ for item, board in scanner.all_posts():
|
|||||||
## Deduplication
|
## Deduplication
|
||||||
|
|
||||||
To avoid publishing the same item twice, the application keeps track of what items were visited in its **state**.
|
To avoid publishing the same item twice, the application keeps track of what items were visited in its **state**.
|
||||||
Items that have the same `last_modified`, `reply_count` or `timestamp` value as the state doesn't need to be visited again.
|
Items that have the same `last_modified`, `reply_count` or `timestamp` value as the state don't need to be visited again.
|
||||||
|
|
||||||
This deduplication step greatly reduces the amount of HTTP requests necessary to stay up to date, and more importantly,
|
This deduplication step greatly reduces the amount of HTTP requests necessary to stay up to date, and more importantly,
|
||||||
it enables the crawler to quickly resume where it left off in the case of a fault.
|
it enables the crawler to quickly resume where it left off in the case of a fault.
|
||||||
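Reviewer note: the deduplication described in this hunk can be sketched as below. This is a minimal illustration, not the crawler's actual code. The `State` class, its method names, and the in-memory dictionary are all assumptions; the only detail taken from the text is that an item whose marker (`last_modified`, `reply_count` or `timestamp`) matches the stored state is skipped.

```python
from typing import Dict, Tuple


class State:
    """Hypothetical in-memory state: last-seen marker for each visited item."""

    def __init__(self) -> None:
        self._seen: Dict[Tuple[str, int], int] = {}

    def has_changed(self, board: str, item_id: int, marker: int) -> bool:
        """Return True if the item is new or its marker differs from the state."""
        return self._seen.get((board, item_id)) != marker

    def mark_visited(self, board: str, item_id: int, marker: int) -> None:
        """Record the marker once the item has been published."""
        self._seen[(board, item_id)] = marker


state = State()
# Hypothetical items: (board, item id, last_modified value)
items = [("g", 1, 100), ("g", 1, 100), ("g", 1, 101)]
published = []
for board, item_id, last_modified in items:
    if state.has_changed(board, item_id, last_modified):
        published.append((board, item_id, last_modified))
        state.mark_visited(board, item_id, last_modified)
```

Marking an item only after it has been published means that a crash between the check and the publish step causes a re-visit rather than a lost item. A real crawler would persist the state to disk or a database so that it survives a restart.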
|
@ -57,7 +57,7 @@ with `l.b.MD5_SECRET_KEY` as the salt.
|
|||||||
|
|
||||||
{{< figure src="/scrape/dev_tools3.png" title="MD5_SECRET_KEY hidden in plain sight">}}
|
{{< figure src="/scrape/dev_tools3.png" title="MD5_SECRET_KEY hidden in plain sight">}}
|
||||||
|
|
||||||
The secret code was hidden only a few keystrokes into the source. Now that we have all the puzzle pieces,
|
The secret code was hidden only a few keystrokes away into the source. Now that we have all the puzzle pieces,
|
||||||
let's hack together a simple Python script to automate the download process:
|
let's hack together a simple Python script to automate the download process:
|
||||||
|
|
||||||
{{<highlight python "linenos=table,linenostart=18">}}
|
{{<highlight python "linenos=table,linenostart=18">}}
|
||||||
|
@ -33,7 +33,7 @@ def search_artist(name, mbid):
|
|||||||
conn.commit()
|
conn.commit()
|
||||||
{{</highlight>}}
|
{{</highlight>}}
|
||||||
|
|
||||||
I need to call `search_artist()` about 350000 times and I don't want to bother setting up multithreading, error handling and
|
I need to call `search_artist()` about 350'000 times and I don't want to bother setting up multithreading, error handling and
|
||||||
keeping the script up to date on an arbitrary server so let's integrate it in the tracker.
|
keeping the script up to date on an arbitrary server so let's integrate it in the tracker.
|
||||||
|
|
||||||
## Configuring the task_tracker project
|
## Configuring the task_tracker project
|
||||||
@ -60,6 +60,7 @@ The way **task_tracker_drone** works is by passing the task object and project s
|
|||||||
executable file called `run` in the root of the git repository. It also expects a json
|
executable file called `run` in the root of the git repository. It also expects a json
|
||||||
object telling it if the task was processed successfully, and if there are additional actions that need to be executed:
|
object telling it if the task was processed successfully, and if there are additional actions that need to be executed:
|
||||||
|
|
||||||
|
**Expected result in stdout**:
|
||||||
{{<highlight json >}}
|
{{<highlight json >}}
|
||||||
{
|
{
|
||||||
"result": 1,
|
"result": 1,
|
||||||
@ -80,7 +81,7 @@ The way **task_tracker_drone** works is by passing the task object and project s
|
|||||||
|
|
||||||
This is what the body of the final worker script looks like:
|
This is what the body of the final worker script looks like:
|
||||||
|
|
||||||
The program expects the task recipe and project secret as arguments, and it outputs the result object
|
The program receives the task recipe and project secret as arguments, and it outputs the result object
|
||||||
to stdout.
|
to stdout.
|
||||||
|
|
||||||
{{<highlight python >}}
|
{{<highlight python >}}
|
||||||
@ -119,5 +120,30 @@ print(json.dumps({
|
|||||||
{{</highlight>}}
|
{{</highlight>}}
|
||||||
|
|
||||||
|
|
||||||
|
## Allocating worker machines
|
||||||
|
|
||||||
{{< figure src="/tt/perms.png" title="Private project require approval">}}
|
On the worker machines, you can execute the task runner and it will automatically start
|
||||||
|
working on the available projects. Private projects require explicit approval to start executing tasks:
|
||||||
|
|
||||||
|
{{<highlight bash >}}
|
||||||
|
git clone https://github.com/simon987/task_tracker_drone
|
||||||
|
cd task_tracker_drone
|
||||||
|
python -m pip install -r requirements.txt
|
||||||
|
|
||||||
|
python ./src/drone.py "https://exemple-api-url.com/api" "worker alias"
|
||||||
|
|
||||||
|
# Request access for 1 r={"ok":true}
|
||||||
|
# Request access for 2 r={"ok":true}
|
||||||
|
# Request access for 3 r={"ok":true}
|
||||||
|
# Starting 10 working contexts
|
||||||
|
# No tasks, waiting...
|
||||||
|
{{</highlight>}}
|
||||||
|
|
||||||
|
|
||||||
|
{{< figure src="/tt/perms.png" title="Private projects require approval">}}
|
||||||
|
|
||||||
|
As soon as you give permission to the worker, it will automatically start executing tasks.
|
||||||
|
When a task fails, it will be put back in the task queue up to `task["max_retries"]` times.
|
||||||
|
The logs can be found on the web interface:
|
||||||
|
|
||||||
|
{{< figure src="/tt/logs.png" title="Logs page">}}
|
||||||
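Reviewer note: the retry rule added in this hunk (a failed task goes back into the queue up to `task["max_retries"]` times) could look roughly like the sketch below. It is an assumption-laden illustration and not the task_tracker implementation: `run_with_retries` and the `execute` callback are hypothetical names, and the queue is a plain in-process `deque`.

```python
from collections import deque
from typing import Callable, Dict, List


def run_with_retries(tasks: List[Dict], execute: Callable[[Dict], bool]) -> List[Dict]:
    """Run every task; requeue failures up to task["max_retries"] times.

    Returns the tasks that still failed after exhausting their retries.
    """
    queue = deque((task, 0) for task in tasks)
    failed = []
    while queue:
        task, retries = queue.popleft()
        if execute(task):
            continue  # success, nothing more to do for this task
        if retries < task["max_retries"]:
            queue.append((task, retries + 1))  # put it back in the queue
        else:
            failed.append(task)
    return failed


attempts = []


def always_fail(task: Dict) -> bool:
    """Hypothetical executor that records the attempt and always fails."""
    attempts.append(task["id"])
    return False


failed_tasks = run_with_retries([{"id": 1, "max_retries": 2}], always_fail)
```

With `max_retries` set to 2, the task is attempted three times in total (the first run plus two retries) before it is reported as failed.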
|
1
jenkins/Jenkinsfile
vendored
@ -8,6 +8,7 @@ remote.allowAnyHosts = true
|
|||||||
remote.retryCount = 3
|
remote.retryCount = 3
|
||||||
remote.retryWaitSec = 3
|
remote.retryWaitSec = 3
|
||||||
logLevel = 'FINER'
|
logLevel = 'FINER'
|
||||||
|
remote.port = 2299
|
||||||
|
|
||||||
pipeline {
|
pipeline {
|
||||||
agent none
|
agent none
|
||||||
|
BIN
static/tt/logs.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 126 KiB |