chan_feed post
This commit is contained in:
parent
1163d858fa
commit
fe50947f93
1
.gitignore
vendored
@@ -1 +1,2 @@
public/
*~
8
content/about/index.md
Normal file
@@ -0,0 +1,8 @@
---
title: "About"
date: 2019-09-13T09:30:47-04:00
draft: false
author: "simon987"
---

The source code of this website can be found [here](https://github.com/simon987/dataarchivist.net).
160
content/posts/cf_1.md
Normal file
@@ -0,0 +1,160 @@
---
title: "Large-scale image board archival"
date: 2019-09-13T09:30:47-04:00
tags: ["scraping"]
draft: false
author: "simon987"
---

# *chan Crawler Overview

Image boards are volatile by design and require special considerations when scraping
(especially when dealing with dozens of them at the same time!).
This is an overview of my implementation of an infrastructure that reliably collects and processes
~150 GiB/day of data from over 27 image boards in real time.

{{< figure src="/cf/grafana.png" title="">}}

## Core

The core of the crawler is very straightforward: most of the
work depends entirely on the website, as some boards have simple APIs with JSON endpoints
and others require more complex HTML parsing.

{{<highlight python >}}
scanner = ChanScanner(helper=CHANS["4chan"])
publish_queue = queue.Queue()

for item, board in scanner.all_posts():
    # Publishing & post-processing is done asynchronously on separate threads
    publish_queue.put((item, board))
{{</highlight>}}

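For illustration, a helper for a board with a JSON API could look something like the sketch below; the class name, method names and endpoint layout are assumptions for the example, not the crawler's actual interface. A board without an API would expose the same methods but implement them with an HTML parser instead.

{{<highlight python >}}
# Hypothetical JSON-API board helper (names and endpoint layout are assumptions)
import requests


class JsonBoardHelper:
    base_url = "https://a.4cdn.org"

    def threads(self, board):
        """Yield every thread listed in the board's JSON catalog."""
        r = requests.get("%s/%s/threads.json" % (self.base_url, board))
        r.raise_for_status()
        for page in r.json():
            for thread in page["threads"]:
                yield thread

    def posts(self, board, thread_id):
        """Return all posts of a single thread."""
        r = requests.get("%s/%s/thread/%d.json" % (self.base_url, board, thread_id))
        r.raise_for_status()
        return r.json()["posts"]
{{</highlight>}}
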
## Deduplication

To avoid publishing the same item twice, the application keeps track of what items were visited in its **state**.
Items that have the same `last_modified`, `reply_count` or `timestamp` value as the state don't need to be visited again.

This deduplication step greatly reduces the number of HTTP requests necessary to stay up to date, and more importantly,
it enables the crawler to quickly resume where it left off in the case of a fault.

{{<highlight python >}}
# The state is saved synchronously to a SQLite database
state = ChanState()


def once(func):
    """Ensures that a function is called at most once per item"""
    def wrapper(item):
        if not state.has_visited(item):
            func(item)
            state.mark_visited(item)

    return wrapper


@once
def publish(item):
    # Publish item to RabbitMQ...
    ...
{{</highlight>}}

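`ChanState` itself is not reproduced in this post. As a rough idea, a synchronous SQLite-backed state could look like the sketch below; the schema and the key used to identify items are assumptions, only the `has_visited`/`mark_visited` interface comes from the snippet above.

{{<highlight python >}}
# Minimal sketch of a SQLite-backed state (schema and item key are assumptions)
import sqlite3


class ChanState:
    def __init__(self, path="chan_state.db"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS visited (item_id TEXT PRIMARY KEY, ts INTEGER)"
        )
        self._conn.commit()

    def has_visited(self, item):
        cur = self._conn.execute("SELECT 1 FROM visited WHERE item_id=?",
                                 (str(item["_id"]),))
        return cur.fetchone() is not None

    def mark_visited(self, item):
        self._conn.execute("INSERT OR REPLACE INTO visited (item_id, ts) VALUES (?, ?)",
                           (str(item["_id"]), item.get("time", 0)))
        self._conn.commit()
{{</highlight>}}
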
## Rate-limiting

A similar approach is used to rate-limit HTTP requests. The `rate_limit` decorator is
applied to the `self._get` method, which enables me to set a different `reqs_per_second` value
for each website. Inactive boards only require a request every 10 minutes, while larger ones
require at least 1 request per second to stay up to date.

{{<highlight python >}}
import requests


class Web:
    def __init__(self, reqs_per_second):
        self.session = requests.Session()

        @rate_limit(reqs_per_second)
        def _get(url, **kwargs):
            return self.session.get(url, **kwargs)

        self._get = _get
{{</highlight>}}

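The `rate_limit` decorator is not reproduced here either; one possible (assumed) implementation is a simple sleep-based limiter:

{{<highlight python >}}
# Possible sleep-based implementation of rate_limit (an assumption, not the actual code)
import time
from functools import wraps


def rate_limit(reqs_per_second):
    min_interval = 1.0 / reqs_per_second

    def decorate(func):
        last_call = [0.0]

        @wraps(func)
        def wrapper(*args, **kwargs):
            # Sleep just long enough to keep the average rate under the limit
            elapsed = time.monotonic() - last_call[0]
            if elapsed < min_interval:
                time.sleep(min_interval - elapsed)
            last_call[0] = time.monotonic()
            return func(*args, **kwargs)

        return wrapper

    return decorate
{{</highlight>}}

With `reqs_per_second=1` a large board is polled once per second, while a value of `1/600` translates to one request every 10 minutes for inactive boards.
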
## Post-Processing

Before storing the collected data, there is a post-processing step where I parse the post body
to find images and URLs pointing to other domains. This information might be useful in the
future, so we might as well do it while the data is in memory.

All images are downloaded and their hashes are calculated for easy image comparison.
Checksums can be used for exact matching and image hashes (`ahash`, `dhash`, `phash` and `whash`)
can be used for *fuzzy* image matching (See [Looks Like It - Algorithms for comparing pictures](https://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html)).

This kind of information can be used by data scientists to track the spread of an image on the internet,
even if the image was slightly modified or resized along the way.

{{<highlight json >}}
{
    "_v": 1.5,
    "_id": 100002938,
    "_chan": 25,
    "_board": "a",
    "_urls": [
        "https://chanon.ro/a/src/156675779970.jpg"
    ],
    "_img": [
        {
            "url": 0,
            "size": 109261,
            "width": 601,
            "height": 1024,
            "crc32": "e4b7245e",
            "md5": "a46897b44d42d955fac6988f548b1b2f",
            "sha1": "5d6151241dfb5c65cb3f736f22b4cda978cc1cd0",
            "ahash": "A/w/A/A/A/G/e/A/E/A/A/A/",
            "dhash": "cY+YqWOaGVDaOdyWGWOUKSOS",
            "phash": "jVVSItap1dlSX0zZUmE2ZqZn",
            "whash": "Dw8PDw8PDw8="
        }
    ],
    "id": 2938,
    "html": "<table>...</table>",
    "time": 1566782999,
    "type": "post"
}
{{</highlight >}}

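The checksum and hash fields of the record above can be produced with the standard `hashlib`/`zlib` modules and the `imagehash` library. The sketch below only illustrates the general idea: the function name is made up, and the real records appear to store the perceptual hashes base64-encoded rather than as hex strings.

{{<highlight python >}}
# Sketch: compute checksums and perceptual hashes for one downloaded image
import hashlib
import zlib
from io import BytesIO

import imagehash
from PIL import Image


def hash_image(data: bytes) -> dict:
    im = Image.open(BytesIO(data))
    return {
        "size": len(data),
        "width": im.width,
        "height": im.height,
        "crc32": format(zlib.crc32(data) & 0xFFFFFFFF, "08x"),
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "ahash": str(imagehash.average_hash(im)),
        "dhash": str(imagehash.dhash(im)),
        "phash": str(imagehash.phash(im)),
        "whash": str(imagehash.whash(im)),
    }
{{</highlight>}}
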
## Archival

{{< figure src="/cf/pg.png" title="">}}

The storage itself is handled by the **feed_archiver** process, which is
completely independent of the \*chan scraper (in fact, the exact same application is
used to store [reddit_feed](https://github.com/simon987/reddit_feed) data).

The application will automatically create new tables as needed, and will store the items in
PostgreSQL.

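**feed_archiver** is a separate project, but conceptually the archival step boils down to something like this sketch; the table layout and conflict handling are assumptions, loosely modeled on the query further down.

{{<highlight python >}}
# Sketch: store one item in a per-chan, per-type JSONB table (schema is an assumption)
import psycopg2
from psycopg2 import sql
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=feed_archiver")


def archive(item, chan_name):
    # e.g. chan_7chan_post, following the naming seen in the query below
    table = "chan_%s_%s" % (chan_name, item["type"])
    with conn, conn.cursor() as cur:
        cur.execute(sql.SQL(
            "CREATE TABLE IF NOT EXISTS {} (id BIGINT PRIMARY KEY, data JSONB)"
        ).format(sql.Identifier(table)))
        cur.execute(sql.SQL(
            "INSERT INTO {} (id, data) VALUES (%s, %s) ON CONFLICT (id) DO NOTHING"
        ).format(sql.Identifier(table)), (item["_id"], Json(item)))
{{</highlight>}}
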

{{< figure src="/cf/diagram.png" title="">}}

Data is saved in a `JSONB` column for easy access.
Retrieving all posts made this year is possible with plain SQL:

{{<highlight sql >}}
SELECT *
FROM chan_7chan_post
WHERE (data->>'time')::INT >= '2019-01-01'::abstime::INT
{{</highlight >}}

# Real-time visualization

Because we're working with a message queue, it's trivial to attach other components to
the pipeline. As an experiment, I made a [WebSocket adapter](https://github.com/simon987/ws_feed_adapter)
and an [application](https://github.com/simon987/feed_viz) that displays the images in
your web browser as they are ingested.

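Attaching a new consumer only takes a few lines with `pika`; the exchange and routing key names below are placeholders, not the ones the real adapter uses.

{{<highlight python >}}
# Sketch: bind a throwaway queue to the feed's exchange and consume messages
# (exchange name and routing key are placeholders)
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.exchange_declare(exchange="chan", exchange_type="topic")
result = channel.queue_declare(queue="", exclusive=True)
channel.queue_bind(exchange="chan", queue=result.method.queue, routing_key="*.post")


def on_message(ch, method, properties, body):
    # Each message body is a JSON document; a real adapter would forward it to WebSocket clients
    print(body)


channel.basic_consume(queue=result.method.queue, on_message_callback=on_message, auto_ack=True)
channel.start_consuming()
{{</highlight>}}
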

{{< figure src="/cf/_chan.jpg" title="">}}

The live feed can be accessed [here](https://feed.the-eye.eu).

12
content/posts/virus.md
Normal file
@@ -0,0 +1,12 @@
---
title: "Playing with viruses and Windows VMs"
date: 2019-07-05T22:13:13-04:00
draft: true
---

I had some downtime, so I decided to have a bit of fun with viruses and unsecured Windows virtual machines.
{{< figure src="/vm/diagram.png" title="">}}


I started by creating a LAN of twelve machines,
BIN
diagrams/chan_feed.dia
Normal file
Binary file not shown.
BIN
diagrams/vm_stuff.dia
Normal file
Binary file not shown.
BIN
static/cf/_chan.jpg
Normal file
Binary file not shown.
Size: 1.1 MiB
BIN
static/cf/diagram.png
Normal file
Binary file not shown.
Size: 28 KiB
BIN
static/cf/grafana.png
Normal file
Binary file not shown.
Size: 366 KiB
BIN
static/cf/pg.png
Normal file
Binary file not shown.
Size: 57 KiB
BIN
static/vm/diagram.png
Normal file
Binary file not shown.
Size: 85 KiB