chan_feed post

2025-04-24 11:15:50 +00:00 · 2019-09-13 17:56:31 -04:00 · 2019-09-13 17:56:31 -04:00 · fe50947f93
commit fe50947f93
parent 1163d858fa
11 changed files with 181 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -1 +1,2 @@
 public/
+*~
--- a/content/about/index.md
+++ b/content/about/index.md
@@ -0,0 +1,8 @@
+---
+title: "About"
+date: 2019-09-13T09:30:47-04:00
+draft: false
+author: "simon987"
+---
+
+The source code of this website can be found [here](https://github.com/simon987/dataarchivist.net).
--- a/content/posts/cf_1.md
+++ b/content/posts/cf_1.md
@@ -0,0 +1,160 @@
+---
+title: "Large-scale image board archival"
+date: 2019-09-13T09:30:47-04:00
+tags: ["scraping"]
+draft: false
+author: "simon987"
+---
+
+# *chan Crawler Overview
+Image boards are volatile by design and require special considerations when scraping
+(especially when dealing with dozens of them at the same time!).
+This is an overview of my implementation of an infrastructure that reliably collects and processes
+~150 GiB/day of data from over 27 image boards in real time.
+
+
+
+{{< figure src="/cf/grafana.png" title="">}}
+
+## Core
+
+The core of the crawler is very straightforward: most of the
+work is site-specific, as some boards have simple APIs with JSON endpoints
+and others require more complex HTML parsing.
+
+{{<highlight python >}}
+scanner = ChanScanner(helper=CHANS["4chan"])
+publish_queue = queue.Queue()  # requires `import queue`
+
+for post, board in scanner.all_posts():
+    # Publishing & post-processing is done asynchronously on separate threads
+    publish_queue.put((post, board))
+{{</highlight>}}
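As a rough illustration of the asynchronous side, here is a minimal sketch of a worker thread that drains such a queue. The names `start_publisher` and `handle` are hypothetical and not taken from the crawler; the stand-in `handle` callback only records what it receives instead of publishing anything.

```python
import queue
import threading


def start_publisher(publish_queue, handle, sentinel=None):
    """Start a daemon thread that passes each queued (post, board) pair to `handle`."""
    def worker():
        while True:
            entry = publish_queue.get()
            try:
                if entry is sentinel:
                    return  # stop request received
                post, board = entry
                handle(post, board)
            finally:
                # Mark the entry as processed so that join() can return
                publish_queue.task_done()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


# Usage with stand-in data instead of a real scanner
published = []
q = queue.Queue()
start_publisher(q, lambda post, board: published.append((board, post["id"])))
for post, board in [({"id": 1}, "a"), ({"id": 2}, "g")]:
    q.put((post, board))
q.put(None)  # ask the worker to stop
q.join()     # block until every entry has been handled
```

With a single worker, items are handled in the order they were queued; several workers would trade that ordering for throughput.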
+
+## Deduplication
+
+To avoid publishing the same item twice, the application keeps track of which items were visited in its **state**.
+Items that have the same `last_modified`, `reply_count` or `timestamp` value as the state don't need to be visited again.
+
+This deduplication step greatly reduces the number of HTTP requests necessary to stay up to date, and more importantly,
+it enables the crawler to quickly resume where it left off in the case of a fault.
+
+{{<highlight python >}}
+import functools
+
+# The state is saved synchronously to a SQLite database
+state = ChanState()
+
+def once(func):
+    """Ensures that a function is called at most once per item"""
+    @functools.wraps(func)
+    def wrapper(item):
+        if not state.has_visited(item):
+            func(item)
+            state.mark_visited(item)
+
+    return wrapper
+
+@once
+def publish(item):
+    ...  # Publish item to RabbitMQ
+{{</highlight>}}
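The state object itself is not shown above. The following is only a sketch of what a SQLite-backed state could look like, assuming each item is a dict with an `id` and a `last_modified` value; the table layout and the class body are illustrative guesses, not the crawler's actual schema.

```python
import sqlite3


class ChanState:
    """Hypothetical state store: remembers the last-seen marker of each item."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS visited (id INTEGER PRIMARY KEY, marker INTEGER)"
        )

    def has_visited(self, item):
        row = self.conn.execute(
            "SELECT marker FROM visited WHERE id = ?", (item["id"],)
        ).fetchone()
        # An unchanged marker means nothing new happened since the last visit
        return row is not None and row[0] == item["last_modified"]

    def mark_visited(self, item):
        with self.conn:  # commits on success, so the state survives a restart
            self.conn.execute(
                "INSERT OR REPLACE INTO visited (id, marker) VALUES (?, ?)",
                (item["id"], item["last_modified"]),
            )


state = ChanState()
thread = {"id": 42, "last_modified": 1566782999}
first = state.has_visited(thread)   # not seen yet
state.mark_visited(thread)
second = state.has_visited(thread)  # same marker: can be skipped
thread["last_modified"] += 60
third = state.has_visited(thread)   # marker changed: visit again
```

A real crawler covering many boards would also need the site and board in the key, since post IDs are only unique within a board.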
+
+
+## Rate-limiting
+
+A similar approach is used to rate-limit HTTP requests. The `rate_limit` decorator is
+applied to the `self._get` method, which enables me to set a different `reqs_per_second` value
+for each website. Inactive boards only require a request every 10 minutes, while larger ones
+require at least 1 request per second to stay up to date.
+
+{{<highlight python >}}
+class Web:
+    def __init__(self, reqs_per_second):
+        self.session = requests.Session()
+
+        @rate_limit(reqs_per_second)
+        def _get(url, **kwargs):
+            return self.session.get(url, **kwargs)
+
+        self._get = _get
+{{</highlight>}}
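The `rate_limit` decorator is not shown in this post. One possible way to write such a decorator, sketched here with only the standard library, is to remember the time of the previous call and sleep until the minimum interval has passed. The internals below are an assumption, and `fetch` is a stand-in for the real HTTP call.

```python
import functools
import threading
import time


def rate_limit(per_second):
    """Decorator factory: space calls at least 1/per_second seconds apart."""
    min_interval = 1.0 / per_second
    lock = threading.Lock()
    last_call = [float("-inf")]  # time of the previous call (none yet)

    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with lock:  # serialize callers so the spacing holds across threads
                wait = min_interval - (time.monotonic() - last_call[0])
                if wait > 0:
                    time.sleep(wait)
                last_call[0] = time.monotonic()
            return func(*args, **kwargs)
        return wrapper
    return decorate


@rate_limit(20)
def fetch(url):
    return url  # stand-in for session.get(url)


start = time.monotonic()
results = [fetch(f"https://example.org/{i}") for i in range(3)]
elapsed = time.monotonic() - start  # two enforced gaps of about 0.05 s each
```

With this shape, a board that only needs one request every 10 minutes would use `rate_limit(1 / 600)`.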
+
+## Post-Processing
+
+Before storing the collected data, there is a post-processing step where I parse the post body
+to find images and URLs pointing to other domains. This information might be useful in the
+future, so we might as well do it while the data is in memory.
+
+All images are downloaded and their hashes are calculated for easy image comparison.
+Checksums can be used for exact matching, and perceptual hashes (`ahash`, `dhash`, `phash` and `whash`)
+can be used for *fuzzy* image matching (see [Looks Like It - Algorithms for comparing pictures](https://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html)).
+
+This kind of information can be used by data scientists to track the spread of an image on the internet, 
+even if the image was slightly modified or resized along the way.
+
+{{<highlight json >}}
+{
+  "_v": 1.5,
+  "_id": 100002938,
+  "_chan": 25,
+  "_board": "a",
+  "_urls": [
+    "https://chanon.ro/a/src/156675779970.jpg"
+  ],
+  "_img": [
+    {
+      "url": 0,
+      "size": 109261,
+      "width": 601,
+      "height": 1024,
+      "crc32": "e4b7245e",
+      "md5": "a46897b44d42d955fac6988f548b1b2f",
+      "sha1": "5d6151241dfb5c65cb3f736f22b4cda978cc1cd0",
+      "ahash": "A/w/A/A/A/G/e/A/E/A/A/A/",
+      "dhash": "cY+YqWOaGVDaOdyWGWOUKSOS",
+      "phash": "jVVSItap1dlSX0zZUmE2ZqZn",
+      "whash": "Dw8PDw8PDw8="
+    }
+  ],
+  "id": 2938,
+  "html": "<table>...</table>",
+  "time": 1566782999,
+  "type": "post"
+}
+{{</highlight >}}
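Fuzzy matching on these fields usually comes down to counting the bits that differ between two hashes (the Hamming distance). The sketch below assumes the hash strings are base64-encoded bit strings, which is a guess based on how they look in the record above; the function names and the 10% threshold are arbitrary choices for illustration.

```python
import base64


def hamming_distance(hash_a, hash_b):
    """Count differing bits between two base64-encoded hashes of equal length."""
    a = base64.b64decode(hash_a)
    b = base64.b64decode(hash_b)
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def looks_similar(hash_a, hash_b, max_ratio=0.1):
    """Treat two images as near-duplicates if at most max_ratio of the bits differ."""
    bits = len(base64.b64decode(hash_a)) * 8
    return hamming_distance(hash_a, hash_b) <= max_ratio * bits


same = hamming_distance("Dw8PDw8PDw8=", "Dw8PDw8PDw8=")
one_bit_off = hamming_distance("Dw8PDw8PDw8=", "Dw8PDw8PDw4=")
similar = looks_similar("Dw8PDw8PDw8=", "Dw8PDw8PDw4=")
```

Only hashes of the same kind are comparable: a `dhash` should be compared with another `dhash`, never with a `phash`.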
+
+## Archival
+
+{{< figure src="/cf/pg.png" title="">}}
+
+The storage itself is handled by the **feed_archiver** process, which is
+completely independent of the \*chan scraper (in fact, the exact same application is
+used to store [reddit_feed](https://github.com/simon987/reddit_feed) data). 
+
+The application automatically creates new tables as needed and stores the items in
+PostgreSQL.
+
+{{< figure src="/cf/diagram.png" title="">}}
+
+Data is saved in a `JSONB` column for easy access. 
+Retrieving all posts made this year is possible with plain SQL:
+
+{{<highlight sql >}}
+SELECT *
+FROM chan_7chan_post
+WHERE (data->>'time')::INT >= EXTRACT(EPOCH FROM TIMESTAMPTZ '2019-01-01')
+{{</highlight >}}
+
+
+# Real-time visualization
+
+Because we're working with a message queue, it's trivial to attach other components to
+the pipeline. As an experiment, I made a [WebSocket adapter](https://github.com/simon987/ws_feed_adapter)
+and an [application](https://github.com/simon987/feed_viz) that displays the images in
+your web browser as they are ingested.
+
+{{< figure src="/cf/_chan.jpg" title="">}}
+
+The live feed can be accessed [here](https://feed.the-eye.eu).
+
--- a/content/posts/virus.md
+++ b/content/posts/virus.md
@@ -0,0 +1,12 @@
+---
+title: "Playing with viruses and Windows VMs"
+date: 2019-07-05T22:13:13-04:00
+draft: true
+---
+
+I had a bit of downtime so I decided to have a little bit of fun with viruses and unsecured Windows virtual machines.
+{{< figure src="/vm/diagram.png" title="">}}
+
+
+I started by creating a LAN of twelve machines,
+
--- a/diagrams/chan_feed.dia
+++ b/diagrams/chan_feed.dia
--- a/diagrams/vm_stuff.dia
+++ b/diagrams/vm_stuff.dia
--- a/static/cf/_chan.jpg
+++ b/static/cf/_chan.jpg
--- a/static/cf/diagram.png
+++ b/static/cf/diagram.png
--- a/static/cf/grafana.png
+++ b/static/cf/grafana.png
--- a/static/cf/pg.png
+++ b/static/cf/pg.png
--- a/static/vm/diagram.png
+++ b/static/vm/diagram.png