2019-11-06 09:42:47 -05:00

title: Indexing your files with sist2
date: 2019-11-04T19:31:45-05:00
draft: false
tags: ["data curation", "misc"]
author: simon987

Overview

sist2 (simple incremental search tool) is a more powerful and more lightweight version of its Python predecessor. It is currently being used to allow full-text search of terabytes of online documents such as scientific papers and comic books at the-eye.eu.

It can parse many common file types (see README.md for the up-to-date list) and will extract text from their metadata and contents.

The indexing process is typically done in three steps: scan, index then web. For example:

{{<highlight bash>}}
sist2 scan ./my_documents/ -o idx/
{{</highlight>}}

After this step, the raw index (./idx/) has been created and direct access to the files is no longer necessary. This means that you can pass around the raw index folder or use sist2 to index files stored on cold storage.

The index step will convert the raw index into JSON documents and push them to Elasticsearch. sist2 is compatible with Elasticsearch versions 6.x and 7.x.

{{<highlight bash>}}
# Start a debug Elasticsearch instance
docker run -d -p 9201:9200 \
    -e "discovery.type=single-node" \
    docker.elastic.co/elasticsearch/elasticsearch:7.4.2

# The --force-reset flag tells sist2 to (re)initialize
# the Elasticsearch mappings & settings
sist2 index idx/ --force-reset --es-url http://localhost:9201
{{</highlight>}}
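To make "push them to Elasticsearch" a bit more concrete, here is a minimal JavaScript sketch of what a request body for Elasticsearch's `_bulk` API looks like. This is not sist2's code (sist2 is written in C), and it assumes the documents are sent through the bulk endpoint; `toBulkBody` and the sample document are hypothetical.

```javascript
// Hypothetical helper: build an Elasticsearch _bulk request body.
// Each document becomes two NDJSON lines: an action line, then its source.
function toBulkBody(indexName, docs) {
    const lines = [];
    for (const doc of docs) {
        const { _id, ...source } = doc;
        lines.push(JSON.stringify({ index: { _index: indexName, _id } }));
        lines.push(JSON.stringify(source));
    }
    // The bulk API requires every line, including the last, to end with "\n".
    return lines.map(line => line + "\n").join("");
}

// Illustrative document, loosely based on the example at the end of this post.
const bulkBody = toBulkBody("sist2", [
    { _id: "e594641d-8291-4f25-8031-2b69db231479", name: "bobross", extension: "webm" },
]);
console.log(bulkBody);
```

A body like this would be sent with `Content-Type: application/x-ndjson` to the `_bulk` endpoint of the instance started above.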

{{<highlight bash>}}
sist2 web idx/ --port 8080
# Starting web server @ http://localhost:8080
{{</highlight>}}

Web interface

The web module can serve the search interface on its own without additional configuration. What's interesting to note is that the files themselves can either be served by a remote HTTP server that acts as an external CDN, or they can be served by sist2 directly from the disk. In the latter case, Partial Content is supported, meaning that Range requests are accepted and media files can be 'seeked' from the browser.
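For readers unfamiliar with Partial Content, the sketch below shows the general idea of turning a `Range` request header into a `206` response. It is a simplified JavaScript illustration of the HTTP mechanism (single byte ranges only), not sist2's implementation; the function names are made up for this example.

```javascript
// Parse a single-range "bytes=..." header for a file of `size` bytes.
// Returns { start, end } (both inclusive), or null if the range is
// malformed, unsupported or unsatisfiable.
function parseRange(header, size) {
    const match = /^bytes=(\d*)-(\d*)$/.exec(header.trim());
    if (!match || size <= 0) return null;
    const [, first, last] = match;
    if (first === "" && last === "") return null;

    let start, end;
    if (first === "") {
        // Suffix range: "bytes=-500" means the last 500 bytes.
        const suffix = Number(last);
        if (suffix === 0) return null;
        start = Math.max(size - suffix, 0);
        end = size - 1;
    } else {
        start = Number(first);
        // An open-ended or oversized range is clamped to the end of the file.
        end = last === "" ? size - 1 : Math.min(Number(last), size - 1);
    }
    if (start > end || start >= size) return null;
    return { start, end };
}

// Headers that accompany a 206 Partial Content response.
function partialContentHeaders(range, size) {
    return {
        "Accept-Ranges": "bytes",
        "Content-Range": `bytes ${range.start}-${range.end}/${size}`,
        "Content-Length": String(range.end - range.start + 1),
    };
}
```

When the browser seeks in a video, it sends a new request with a different `Range` value, and the server only has to read that slice of the file from disk.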

{{< figure height="350px" src="/sist/sist_web.png" title="">}}

The UI itself is not much different from the original Python/Flask version. However, the JavaScript client is a bit thicker: most operations that were originally handled by the Flask server, such as auto-complete and the retrieval of the mime type list, are now done client-side.

This is possible because Elasticsearch queries are proxied as-is through sist2. For example, the mime type selection widget is populated with a function similar to this:

{{<highlight javascript>}}
$.post("es", {
    // Elasticsearch query body
    aggs: {
        mimeTypes: {
            terms: {
                field: "mime",
                size: 10000
            }
        }
    },
    size: 0,
}).then(resp => {
    resp["aggregations"]["mimeTypes"]["buckets"].forEach(bucket => {
        console.log(bucket);
        //...
    });
});
{{</highlight>}}

{{< figure src="/sist/sist_buckets.png" title="">}}
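Each bucket of a terms aggregation is an object with a `key` and a `doc_count`. As a rough sketch of what the client can do with them, the function below groups the buckets by top-level media type. The grouping itself is an assumption about how such a widget could be organized, and the sample buckets are made up for illustration.

```javascript
// Group terms-aggregation buckets ({ key, doc_count }) by top-level
// media type, e.g. "video/webm" goes under "video".
function groupMimeBuckets(buckets) {
    const groups = new Map();
    for (const { key, doc_count } of buckets) {
        const [category] = key.split("/");
        if (!groups.has(category)) {
            groups.set(category, { count: 0, types: [] });
        }
        const group = groups.get(category);
        group.count += doc_count;
        group.types.push({ mime: key, count: doc_count });
    }
    return groups;
}

// Made-up sample data, in the shape Elasticsearch returns buckets.
const sampleBuckets = [
    { key: "application/pdf", doc_count: 120 },
    { key: "video/webm", doc_count: 7 },
    { key: "video/x-matroska", doc_count: 3 },
];
const groupedMimes = groupMimeBuckets(sampleBuckets);
console.log(groupedMimes.get("video"));
```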

Another improvement was to re-skin the whole page to allow users to choose the dark OLED-friendly theme. Pressing on the theme toggle button sets Cookie: sist=dark, which tells sist2 to serve different content depending on the value of the cookie.
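The server-side decision boils down to reading one cookie. Here is a minimal JavaScript sketch of that logic, assuming a plain `Cookie` request header; only the cookie name and value (`sist=dark`) come from sist2, while the function names and stylesheet file names are hypothetical.

```javascript
// Parse a Cookie request header ("a=1; b=2") into a plain object.
function parseCookies(header) {
    const cookies = {};
    for (const pair of (header || "").split(";")) {
        const idx = pair.indexOf("=");
        if (idx === -1) continue; // skip empty or malformed pairs
        cookies[pair.slice(0, idx).trim()] = pair.slice(idx + 1).trim();
    }
    return cookies;
}

// Hypothetical asset names: pick what to serve based on the theme cookie.
function stylesheetFor(cookieHeader) {
    return parseCookies(cookieHeader).sist === "dark" ? "dark.css" : "light.css";
}
```

Anything other than `sist=dark`, including a missing cookie, falls back to the default theme in this sketch.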

{{< figure src="/sist/sist.png" title="Web interface (Dark theme) displaying Occult Library books">}}

Thumbnail storage

An LMDB (Lightning Memory-Mapped Database) key-value store is used to asynchronously save the thumbnails as they are generated by the indexer. Once the scan step is done, the database file is used by the web module to serve the thumbnails with very little latency.

Since the database is mapped in memory (see mmap(2)), the web process may appear to have high memory usage under load, but almost all of it is backed by the data.mdb file. In fact, if we take a look with pmap, we can see that virtually all of the resident memory is used for LMDB and that none of it is dirty. This means that the operating system will eventually reclaim the memory and that, over time, the memory usage will return to ~20M.

{{<highlight bash>}}
$ pmap -x <pid>

Address            Kbytes     RSS   Dirty Mode  Mapping
00005641d5689000    21300     536       0 r-x-- sist2
00005641d5689000        0       0       0 r-x-- sist2
00005641d6d56000      432       8       0 r-x-- sist2
00005641d6d56000        0       0       0 r-x-- sist2
00005641d6dc2000    32696     768       8 rwx-- sist2
00005641d6dc2000        0       0       0 rwx-- sist2
00005641d8db0000     8452     100       4 rwx--   [ anon ]
...
00007fd1d7419000  3180068  160000       0 rwxs- data.mdb
00007fd2998a4000  2290452     240       0 rwxs- data.mdb
00007fd32586b000 10721328   64868       0 rwxs- data.mdb
00007fd5b4179000  3535892   51616       0 rwxs- data.mdb
00007fd68c180000  4446024  118668       0 rwxs- data.mdb
00007fd79ba54000  1411416   47992       0 rwxs- data.mdb
00007fd7f1fac000   560000    6044       0 rwxs- data.mdb
00007fd81458e000  9069792  217464       0 rwxs- data.mdb
...
00007fda42736000     2048       0       0 ----- libc-2.24.so

total kB         36085472  683468   10472
{{</highlight>}}
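To check the "virtually all of the resident memory" claim without adding up the columns by hand, a small script can sum RSS and Dirty per mapping. The JavaScript sketch below parses `pmap -x` output in the column layout shown above; `summarizePmap` is a name made up for this example, and the sample only contains rows visible in the listing (the elided rows are not included).

```javascript
// Sum the RSS and Dirty columns of `pmap -x` output, grouped by mapping.
function summarizePmap(output) {
    const totals = {};
    for (const line of output.split("\n")) {
        // Columns: Address Kbytes RSS Dirty Mode Mapping
        const m = /^([0-9a-f]{16})\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(.+)$/.exec(line.trim());
        if (!m) continue; // skips the header, "..." and the "total kB" line
        const name = m[6].trim();
        const entry = totals[name] || (totals[name] = { rss: 0, dirty: 0 });
        entry.rss += Number(m[3]);
        entry.dirty += Number(m[4]);
    }
    return totals;
}

// Rows copied from the listing above.
const pmapSample = `
00005641d5689000   21300     536       0 r-x-- sist2
00005641d6dc2000   32696     768       8 rwx-- sist2
00007fd1d7419000 3180068  160000       0 rwxs- data.mdb
00007fd2998a4000 2290452     240       0 rwxs- data.mdb
00007fd32586b000 10721328  64868       0 rwxs- data.mdb
00007fd5b4179000 3535892   51616       0 rwxs- data.mdb
00007fd68c180000 4446024  118668       0 rwxs- data.mdb
00007fd79ba54000 1411416   47992       0 rwxs- data.mdb
00007fd7f1fac000  560000    6044       0 rwxs- data.mdb
00007fd81458e000 9069792  217464       0 rwxs- data.mdb
total kB         36085472  683468   10472
`;
const pmapTotals = summarizePmap(pmapSample);
console.log(pmapTotals["data.mdb"]);
```

For the rows shown, the data.mdb mappings add up to 666892 kB of RSS with 0 kB dirty, which is a little under 98% of the 683468 kB total.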

Media Files

All audio, video and image files are handled by ffmpeg's libav* libraries, which is extremely helpful since we can handle all audio/*, video/* and image/* file types the same way (images are treated as videos that have only one frame). For instance, there is no difference in the code between thumbnails that are generated from the embedded cover art of a .mp3 file versus thumbnails generated from a video stream of a .mkv container. We also don't have to worry about odd encodings because ffmpeg is bundled with hundreds of decoders.

Font Files

Font files were especially painful to work with, since I had to implement the code to generate the thumbnails mostly from scratch. Each letter is individually drawn into a bitmap, which is then converted to uncompressed BMP format and saved directly to disk. Thankfully, most font faces are relatively standard, in that they are meant to be displayed from left to right, and glyphs for the basic Latin alphabet are available.

{{< figure src="/sist/font.png" title="">}}

For the rest, I would mostly have to handle each corner case one by one. At the time of writing, I have given up on trying to render atypical font faces.
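The "uncompressed BMP" part is simple enough to show in full. Below is a minimal JavaScript (Node.js) sketch that wraps an 8-bit grayscale bitmap in a 24-bit BMP container. It illustrates the file format only and is not sist2's C code; the grayscale input and the `encodeBmp` name are assumptions made for this example.

```javascript
// Encode an 8-bit grayscale bitmap (row-major, top row first) as an
// uncompressed 24-bit BMP file.
function encodeBmp(width, height, gray) {
    const rowSize = Math.ceil((width * 3) / 4) * 4; // rows are padded to 4 bytes
    const pixelOffset = 14 + 40;                    // file header + BITMAPINFOHEADER
    const fileSize = pixelOffset + rowSize * height;
    const buf = Buffer.alloc(fileSize);             // zero-filled, so padding is 0

    // BITMAPFILEHEADER
    buf.write("BM", 0, "ascii");
    buf.writeUInt32LE(fileSize, 2);
    buf.writeUInt32LE(pixelOffset, 10);

    // BITMAPINFOHEADER
    buf.writeUInt32LE(40, 14);               // size of this header
    buf.writeInt32LE(width, 18);
    buf.writeInt32LE(height, 22);            // positive height: rows are bottom-up
    buf.writeUInt16LE(1, 26);                // color planes
    buf.writeUInt16LE(24, 28);               // bits per pixel
    buf.writeUInt32LE(0, 30);                // BI_RGB: no compression
    buf.writeUInt32LE(rowSize * height, 34); // size of the pixel data

    for (let y = 0; y < height; y++) {
        // The first input row is stored last in the file.
        const rowStart = pixelOffset + (height - 1 - y) * rowSize;
        for (let x = 0; x < width; x++) {
            const v = gray[y * width + x];
            buf.fill(v, rowStart + x * 3, rowStart + x * 3 + 3); // B, G, R
        }
    }
    return buf;
}
```

The returned buffer can be written to disk as-is, for example with `fs.writeFileSync("glyph.bmp", encodeBmp(w, h, pixels))`.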

Raw Index Binary Format

For simplicity's sake, the document metadata structure is dumped directly from memory to file without much additional processing. While it's not as space-efficient as it could be, it is still much smaller than the equivalent JSON (by a factor of roughly 3.5).

idx/_index_<pid>
{{<highlight text>}}
000 e5 94 64 1d 82 91 4f 25 80 31 2b 69 db 23 14 79  ..d...O%.1+i.#.y
010 dd 00 84 00 31 08 00 00 10 fa 27 00 00 00 00 00  ....1.....'.....
020 8a 01 06 00 cc ea a7 5c 00 00 08 00 00 00 00 00  .......\........
030 62 6f 62 72 6f 73 73 2e 77 65 62 6d 00 f6 8b 00  bobross.webm....
040 00 00 f2 00 05 00 00 f3 d0 02 00 00 0a
{{</highlight>}}

This, of course, makes little difference, since neither format is needed once the documents have been indexed in Elasticsearch.

(Elasticsearch JSON document)
{{<highlight json>}}
{
    "_id": "e594641d-8291-4f25-8031-2b69db231479",
    "_index": "sist2",
    "_type": "_doc",
    "_source": {
        "index": "bb3d8cc5-2e5c-4f1c-ac04-1b1f6d9b070a",
        "mime": "video/webm",
        "size": 2619920,
        "mtime": 1554508492,
        "extension": "webm",
        "name": "bobross",
        "path": "",
        "videoc": "vp8",
        "width": 1280,
        "height": 720
    }
}
{{</highlight>}}
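Several JSON fields can be read straight out of the hex dump above. The JavaScript (Node.js) sketch below does that for this one record. The offsets and the one-byte tag followed by a 32-bit value layout are inferred only from comparing this dump with the JSON document, so treat them as assumptions rather than a description of the actual sist2 structures; `decodeRecord` is a made-up name.

```javascript
// Bytes copied from the hex dump of idx/_index_<pid> above.
const dump = `
e5 94 64 1d 82 91 4f 25 80 31 2b 69 db 23 14 79
dd 00 84 00 31 08 00 00 10 fa 27 00 00 00 00 00
8a 01 06 00 cc ea a7 5c 00 00 08 00 00 00 00 00
62 6f 62 72 6f 73 73 2e 77 65 62 6d 00 f6 8b 00
00 00 f2 00 05 00 00 f3 d0 02 00 00 0a
`;
const record = Buffer.from(dump.replace(/\s+/g, ""), "hex");

// Offsets below are guesses that happen to match this record.
function decodeRecord(buf) {
    // First 16 bytes: the UUID used as the Elasticsearch _id.
    const hex = buf.toString("hex", 0, 16);
    const id = [hex.slice(0, 8), hex.slice(8, 12), hex.slice(12, 16),
                hex.slice(16, 20), hex.slice(20)].join("-");

    const size = Number(buf.readBigUInt64LE(0x18)); // little-endian file size
    const mtime = buf.readUInt32LE(0x24);           // little-endian Unix time

    // NUL-terminated file path starting at 0x30.
    const pathEnd = buf.indexOf(0, 0x30);
    const path = buf.toString("utf8", 0x30, pathEnd);

    // Then (tag byte, 32-bit value) pairs until a 0x0a terminator.
    // This only holds for the numeric tags seen in this record.
    const tags = {};
    let pos = pathEnd + 1;
    while (pos + 5 <= buf.length && buf[pos] !== 0x0a) {
        tags[buf[pos].toString(16)] = buf.readUInt32LE(pos + 1);
        pos += 5;
    }
    return { id, size, mtime, path, tags };
}

console.log(decodeRecord(record));
```

With these assumptions, the decoded `id`, `size` and `mtime` match the JSON document, the path is `bobross.webm`, and the tags `f2` and `f3` carry the width (1280) and height (720).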