mirror of https://github.com/simon987/dataarchivist.net.git (synced 2025-04-04 08:42:58 +00:00)
chan_feed post
parent 1163d858fa
commit fe50947f93
.gitignore (vendored, 1 line added)
@@ -1 +1,2 @@
public/
*~
content/about/index.md (new file, 8 lines)
@@ -0,0 +1,8 @@
---
title: "About"
date: 2019-09-13T09:30:47-04:00
draft: false
author: "simon987"
---

Source code of this website can be found [here](https://github.com/simon987/dataarchivist.net).
content/posts/cf_1.md (new file, 160 lines)
@@ -0,0 +1,160 @@
---
title: "Large-scale image board archival"
date: 2019-09-13T09:30:47-04:00
tags: ["scraping"]
draft: false
author: "simon987"
---

# *chan Crawler Overview

Image boards are volatile by design and require special considerations when scraping
(especially when dealing with dozens of them at the same time!).
This is an overview of my implementation of an infrastructure that reliably collects and processes
~150 GiB/day of data from over 27 image boards in real time.

{{< figure src="/cf/grafana.png" title="">}}

## Core

The core of the crawler is very straightforward: most of the
work depends entirely on the website, as some boards have simple APIs with JSON endpoints
and others require more complex HTML parsing.

{{<highlight python >}}
scanner = ChanScanner(helper=CHANS["4chan"])
publish_queue = queue.Queue()

for item, board in scanner.all_posts():
    # Publishing & post-processing is done asynchronously on separate threads
    publish_queue.put((item, board))
{{</highlight>}}
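
To give an idea of what that means in practice, here is a rough, illustrative sketch of two board helpers. The class names, method names and selectors below are made up for this example and are not the actual chan_feed code: one board exposes a JSON endpoint, the other has to be scraped from HTML.

{{<highlight python >}}
import requests
from bs4 import BeautifulSoup


class JsonBoardHelper:
    """Board that exposes a JSON endpoint: just deserialize it."""

    def __init__(self, base_url):
        self.base_url = base_url

    def threads(self, board):
        return requests.get(f"{self.base_url}/{board}/threads.json").json()


class HtmlBoardHelper:
    """Board without an API: parse the thread links out of the HTML."""

    def __init__(self, base_url):
        self.base_url = base_url

    def threads(self, board):
        page = requests.get(f"{self.base_url}/{board}/").text
        soup = BeautifulSoup(page, "html.parser")
        return [a["href"] for a in soup.find_all("a", class_="thread-link")]
{{</highlight>}}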

## Deduplication

To avoid publishing the same item twice, the application keeps track of which items were visited in its **state**.
Items that have the same `last_modified`, `reply_count` or `timestamp` value as the state don't need to be visited again.

This deduplication step greatly reduces the number of HTTP requests necessary to stay up to date, and more importantly,
it enables the crawler to quickly resume where it left off in the case of a fault.

{{<highlight python >}}
# The state is saved synchronously to a SQLite database
state = ChanState()


def once(func):
    """Ensures that a function is called at most once per item"""
    def wrapper(item):
        if not state.has_visited(item):
            func(item)
            state.mark_visited(item)

    return wrapper


@once
def publish(item):
    # Publish item to RabbitMQ...
    ...
{{</highlight>}}
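
The `ChanState` class is not reproduced here; as an illustrative sketch (the table layout and the item attributes are made up for this example), a synchronous SQLite-backed state could look like this:

{{<highlight python >}}
import sqlite3


class ChanState:
    """Visited-item state, persisted synchronously to SQLite."""

    def __init__(self, path="chan.db"):
        self._db = sqlite3.connect(path)
        self._db.execute("CREATE TABLE IF NOT EXISTS visited "
                         "(item_id TEXT PRIMARY KEY, fingerprint TEXT)")
        self._db.commit()

    def has_visited(self, item):
        # "Visited" means we already saw this item with the same fingerprint
        # (e.g. its last_modified / reply_count / timestamp value)
        row = self._db.execute("SELECT fingerprint FROM visited WHERE item_id=?",
                               (item.id,)).fetchone()
        return row is not None and row[0] == item.fingerprint

    def mark_visited(self, item):
        self._db.execute("INSERT OR REPLACE INTO visited VALUES (?, ?)",
                         (item.id, item.fingerprint))
        self._db.commit()
{{</highlight>}}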

## Rate-limiting

A similar approach is used to rate-limit HTTP requests. The `rate_limit` decorator is
applied to the `self._get` method, which enables me to set a different `reqs_per_second` value
for each website. Inactive boards only require a request every 10 minutes, while larger ones
require at least 1 request per second to stay up to date.

{{<highlight python >}}
class Web:
    def __init__(self, reqs_per_second):
        self.session = requests.Session()

        @rate_limit(reqs_per_second)
        def _get(url, **kwargs):
            return self.session.get(url, **kwargs)

        self._get = _get
{{</highlight>}}
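
The `rate_limit` decorator itself is not listed here; a minimal sleep-based sketch of the idea (illustrative only, the real implementation may differ) would be:

{{<highlight python >}}
import time
from functools import wraps


def rate_limit(reqs_per_second):
    """Delay calls so the wrapped function runs at most reqs_per_second times per second."""
    min_interval = 1.0 / reqs_per_second

    def decorate(func):
        last_call = [0.0]

        @wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_call[0]
            if elapsed < min_interval:
                time.sleep(min_interval - elapsed)
            last_call[0] = time.time()
            return func(*args, **kwargs)

        return wrapper

    return decorate
{{</highlight>}}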

## Post-Processing

Before storing the collected data, there is a post-processing step where I parse the post body
to find images and URLs pointing to other domains. This information might be useful in the
future, so we might as well do it while the data is in memory.

All images are downloaded and their hashes are calculated for easy image comparison.
Checksums can be used for exact matching and image hashes (`ahash`, `dhash`, `phash` and `whash`)
can be used for *fuzzy* image matching (See [Looks Like It - Algorithms for comparing pictures](https://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html)).

This kind of information can be used by data scientists to track the spread of an image on the internet,
even if the image was slightly modified or resized along the way.
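
As an illustration of how such hashes can be computed in Python, here is a minimal sketch using the Pillow and `imagehash` packages (the file names are placeholders, and this is an example rather than the exact production code):

{{<highlight python >}}
import imagehash
from PIL import Image

img = Image.open("156675779970.jpg")

hashes = {
    "ahash": imagehash.average_hash(img),
    "dhash": imagehash.dhash(img),
    "phash": imagehash.phash(img),
    "whash": imagehash.whash(img),
}

# Subtracting two hashes gives the Hamming distance: a small value means the
# images are visually similar, even after resizing or re-encoding.
other = imagehash.phash(Image.open("resized_copy.jpg"))
print(hashes["phash"] - other)
{{</highlight>}}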

{{<highlight json >}}
{
  "_v": 1.5,
  "_id": 100002938,
  "_chan": 25,
  "_board": "a",
  "_urls": [
    "https://chanon.ro/a/src/156675779970.jpg"
  ],
  "_img": [
    {
      "url": 0,
      "size": 109261,
      "width": 601,
      "height": 1024,
      "crc32": "e4b7245e",
      "md5": "a46897b44d42d955fac6988f548b1b2f",
      "sha1": "5d6151241dfb5c65cb3f736f22b4cda978cc1cd0",
      "ahash": "A/w/A/A/A/G/e/A/E/A/A/A/",
      "dhash": "cY+YqWOaGVDaOdyWGWOUKSOS",
      "phash": "jVVSItap1dlSX0zZUmE2ZqZn",
      "whash": "Dw8PDw8PDw8="
    }
  ],
  "id": 2938,
  "html": "<table>...</table>",
  "time": 1566782999,
  "type": "post"
}
{{</highlight >}}

## Archival

{{< figure src="/cf/pg.png" title="">}}

The storage itself is handled by the **feed_archiver** process, which is
completely independent of the \*chan scraper (in fact, the exact same application is
used to store [reddit_feed](https://github.com/simon987/reddit_feed) data).

The application will automatically create new tables as needed, and will store the items to
PostgreSQL.
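
feed_archiver's insert logic isn't reproduced here, but the gist of "create tables as needed, store the raw item as JSONB" can be sketched with psycopg2 like this (the database and table names are only examples, borrowed from the query further below):

{{<highlight python >}}
import psycopg2
from psycopg2 import sql
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=feed_archive")


def archive(item, table="chan_7chan_post"):
    """Create the destination table on first use, then upsert the raw item as JSONB."""
    with conn, conn.cursor() as cur:
        cur.execute(sql.SQL(
            "CREATE TABLE IF NOT EXISTS {} (id BIGINT PRIMARY KEY, data JSONB)"
        ).format(sql.Identifier(table)))
        cur.execute(sql.SQL(
            "INSERT INTO {} (id, data) VALUES (%s, %s) "
            "ON CONFLICT (id) DO UPDATE SET data = EXCLUDED.data"
        ).format(sql.Identifier(table)), (item["_id"], Json(item)))
{{</highlight>}}
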
{{< figure src="/cf/diagram.png" title="">}}

Data is saved in a `JSONB` column for easy access.
Retrieving all posts made this year is possible with plain SQL:

{{<highlight sql >}}
SELECT *
FROM chan_7chan_post
WHERE (data->>'time')::INT >= '2019-01-01'::abstime::INT
{{</highlight >}}

# Real-time visualization

Because we're working with a message queue, it's trivial to attach other components to
the pipeline. As an experiment, I made a [WebSocket adapter](https://github.com/simon987/ws_feed_adapter)
and an [application](https://github.com/simon987/feed_viz) that displays the images as they are ingested in
your web browser.
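
To illustrate how easily another consumer can be attached to the queue (the queue name and wiring here are made up for this example; the actual adapter may be set up differently), a minimal RabbitMQ listener with pika looks like this:

{{<highlight python >}}
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="chan_feed_viz")


def on_message(ch, method, properties, body):
    # Forward the raw item to whatever sink you like (WebSocket, file, ...)
    print(body)


channel.basic_consume(queue="chan_feed_viz", on_message_callback=on_message, auto_ack=True)
channel.start_consuming()
{{</highlight>}}
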
{{< figure src="/cf/_chan.jpg" title="">}}

The live feed can be accessed [here](https://feed.the-eye.eu).
content/posts/virus.md (new file, 12 lines)
@@ -0,0 +1,12 @@
---
title: "Playing with viruses and Windows VMs"
date: 2019-07-05T22:13:13-04:00
draft: true
---

I had a bit of downtime so I decided to have a little bit of fun with viruses and unsecured Windows virtual machines.
{{< figure src="/vm/diagram.png" title="">}}


I started by creating a LAN of twelve machines,
diagrams/chan_feed.dia (new binary file, not shown)
diagrams/vm_stuff.dia (new binary file, not shown)
static/cf/_chan.jpg (new binary file, 1.1 MiB, not shown)
static/cf/diagram.png (new binary file, 28 KiB, not shown)
static/cf/grafana.png (new binary file, 366 KiB, not shown)
static/cf/pg.png (new binary file, 57 KiB, not shown)
static/vm/diagram.png (new binary file, 85 KiB, not shown)