---
title: "Indexing your files with sist2"
date: 2019-11-04T19:31:45-05:00
draft: false
tags: ["data curation", "misc"]
author: simon987
---


# Overview

[sist2](https://github.com/simon987/sist2) (simple incremental search tool) is a more powerful and more lightweight version of
its [Python predecessor](https://github.com/simon987/Simple-Incremental-Search-Tool).
It is currently being used to allow full-text search of terabytes of online documents such as scientific papers and comic books
at [the-eye.eu](https://searchin.the-eye.eu/).

It can parse many common file types (see [README.md](https://github.com/simon987/sist2/blob/master/README.md#format-support) for
the up-to-date list) and extracts text from both their metadata and their contents.

The indexing process is typically done in three steps: `scan`, `index`, then `web`.
The `scan` step comes first, for example:

{{<highlight bash>}}
sist2 scan ./my_documents/ -o idx/
{{</highlight>}}

After this step, the raw index (**./idx/**) has been created and direct access to the files is no longer necessary.
This means that you can pass around the raw index folder or use sist2 to index files stored on cold storage.

The `index` step converts the raw index into JSON documents and pushes them to Elasticsearch. sist2
is compatible with Elasticsearch versions 6.X and 7.X.

{{<highlight bash>}}
# Start a debug elasticsearch instance
docker run -d -p 9201:9200 \
	-e "discovery.type=single-node" \
	docker.elastic.co/elasticsearch/elasticsearch:7.4.2

# The --force-reset flag tells sist2 to (re)initialize
#  the Elasticsearch mappings & settings
sist2 index idx/ --force-reset --es-url http://localhost:9201
{{</highlight>}}

Finally, the `web` step starts the web server that serves the search interface:

{{<highlight bash>}}
sist2 web idx/ --port 8080
# Starting web server @ http://localhost:8080
{{</highlight>}}

## Web interface

The web module can serve the search interface on its own without additional configuration.
The files themselves can either be served by a remote HTTP server that
acts as an external CDN, or be served by sist2 directly from the disk. In the latter case, Partial
Content is supported, meaning that `Range` requests are accepted and media files can be *'seeked'* from
the browser.
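
To make the `Range` handling more concrete, here is a minimal Python sketch of how a single-range header can be turned into byte offsets. It is not sist2's implementation (which is written in C); the `parse_range` helper is hypothetical and only covers the single-range forms `bytes=N-M`, `bytes=N-` and `bytes=-N`.

```python
def parse_range(header, file_size):
    """Return inclusive (start, end) byte offsets for a single-range header,
    or None if the header is malformed or cannot be satisfied."""
    unit, _, spec = header.partition("=")
    if unit.strip() != "bytes" or "," in spec:
        return None  # only single byte ranges are handled in this sketch
    first, sep, last = spec.strip().partition("-")
    if not sep:
        return None
    try:
        if first == "":
            # Suffix form "bytes=-N": the last N bytes of the file
            length = int(last)
            if length <= 0:
                return None
            start, end = max(file_size - length, 0), file_size - 1
        else:
            start = int(first)
            # Open-ended form "bytes=N-": from N to the end of the file
            end = int(last) if last else file_size - 1
    except ValueError:
        return None
    # Clamp the end to the last byte of the file
    end = min(end, file_size - 1)
    if start < 0 or start > end:
        return None
    return start, end


print(parse_range("bytes=500-", 1000))
```

A server that accepts the range would then answer with status `206 Partial Content` and a `Content-Range: bytes start-end/size` header, which is what lets the browser jump to an arbitrary position in a media file.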

{{< figure height="350px" src="/sist/sist_web.png" title="">}}

The UI itself is not much different from the original Python/Flask version. However, the JavaScript
client is a bit *thicker*: most operations that were originally handled by the Flask server,
such as auto-complete and the retrieval of the mime type list, are now done client-side.

This is possible because Elasticsearch queries are proxied as is through sist2. For example, the mime type
selection widget is populated with a function similar to this:

{{<highlight javascript>}}
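// Send the Elasticsearch query body as JSON; sist2 proxies it as is
$.ajax({
    url: "es",
    type: "POST",
    contentType: "application/json",
    data: JSON.stringify({
        aggs: {
            mimeTypes: {
                terms: {
                    field: "mime",
                    size: 10000
                }
            }
        },
        size: 0
    })
}).then(resp => {
    // One bucket per distinct mime type
    resp["aggregations"]["mimeTypes"]["buckets"].forEach(bucket => {
        console.log(bucket);
        //...
    });
});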
{{</highlight>}}

{{< figure src="/sist/sist_buckets.png" title="">}}

Another improvement was to re-skin the whole page so that users can choose a dark, *OLED*-friendly theme. Pressing
the theme toggle button sets `Cookie: sist=dark`, and sist2 serves different content depending on the value
of that cookie.
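
As an illustration of the cookie check, here is a small Python sketch using the standard `http.cookies` module. The `pick_theme` helper and the `"light"` name for the default theme are assumptions made for the example; only the `sist=dark` cookie comes from sist2 itself.

```python
from http.cookies import SimpleCookie


def pick_theme(cookie_header):
    """Return "dark" when the request carries sist=dark, else "light"."""
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")
    morsel = cookie.get("sist")
    if morsel is not None and morsel.value == "dark":
        return "dark"
    # "light" is a hypothetical name for the default theme
    return "light"


print(pick_theme("sist=dark"))
```

Keeping the choice in a cookie means the server can pick the right stylesheet before any JavaScript runs, so the page does not flash the wrong theme while loading.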

{{< figure src="/sist/sist.png" title="Web interface (Dark theme) displaying Occult Library books">}}


## Thumbnail storage

An [LMDB](https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database) (Lightning Memory-Mapped Database)
key-value store is used to asynchronously save the thumbnails as they are generated by the indexer.
Once the `scan` step is done, the database file is used by the `web` module to serve the thumbnails
with very little latency.

Since the database is mapped in memory (see [mmap(2)](https://en.wikipedia.org/wiki/Mmap)),
the `web` process may appear to have a high memory usage under load,
but almost all of it is accounted for by the mapped **data.mdb** file. In fact, if we take a look with `pmap`,
we can see that virtually all of the resident memory is used for LMDB and that none of it is *dirty*.
Clean pages can be dropped without being written back to disk, so the operating system will eventually
reclaim that memory and, over time, the memory usage will return to ~20M.

{{<highlight _>}}
$ pmap -x <PID>

Address           Kbytes     RSS   Dirty Mode  Mapping
00005641d5689000   21300     536       0 r-x-- sist2
00005641d5689000       0       0       0 r-x-- sist2
00005641d6d56000     432       8       0 r-x-- sist2
00005641d6d56000       0       0       0 r-x-- sist2
00005641d6dc2000   32696     768       8 rwx-- sist2
00005641d6dc2000       0       0       0 rwx-- sist2
00005641d8db0000    8452     100       4 rwx--   [ anon ]
...
00007fd1d7419000 3180068  160000       0 rwxs- data.mdb
00007fd2998a4000 2290452     240       0 rwxs- data.mdb
00007fd32586b000 10721328   64868       0 rwxs- data.mdb
00007fd5b4179000 3535892   51616       0 rwxs- data.mdb
00007fd68c180000 4446024  118668       0 rwxs- data.mdb
00007fd79ba54000 1411416   47992       0 rwxs- data.mdb
00007fd7f1fac000  560000    6044       0 rwxs- data.mdb
00007fd81458e000 9069792  217464       0 rwxs- data.mdb
...
00007fda42736000    2048       0       0 ----- libc-2.24.so
---------------- ------- ------- -------
total kB         36085472  683468   10472
{{</highlight>}}
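
The same mechanism can be tried in a few lines of Python with the standard `mmap` module. This is only a sketch of a read-only file mapping, not of the LMDB API: the temporary file stands in for **data.mdb**, and the `read_slice` helper and the fake thumbnail contents are made up for the example.

```python
import mmap
import os
import tempfile


def read_slice(path, offset, length):
    """Read `length` bytes at `offset` through a read-only memory mapping."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ means pages are never dirtied
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


# Stand-in for data.mdb: a small file holding two fake "thumbnails"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"THUMB-A" + b"THUMB-B")
    path = tmp.name

second = read_slice(path, 7, 7)
os.remove(path)
print(second)
```

Only the pages that are actually touched are loaded from disk, and because they are never modified the kernel is free to evict them at any time.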


## Media Files

All audio and video files are handled by ffmpeg's libav\* libraries, which is extremely helpful since
we can handle all `audio/*`, `video/*` and `image/*` file types the same way (images are treated as videos that have only one frame).
For instance, there is no difference in the code between thumbnails generated from the embedded cover art of a `.mp3`
file and thumbnails generated from a video stream of a `.mkv` container. We also don't have to worry about
odd encodings because ffmpeg is bundled with hundreds of decoders.

## Font Files

Font files were especially painful to work with, since I had to implement the code
to generate the thumbnails mostly from scratch. Each letter is individually drawn into
a bitmap, which is then converted to the uncompressed *BMP* format and saved directly to disk.
Thankfully, *most* font faces are relatively standard, in that they are meant
to be displayed from left to right and
glyphs for the basic Latin alphabet are available.

{{< figure src="/sist/font.png" title="">}}

The remaining font faces would mostly have to be handled one corner case at a time. At the
time of writing, I have given up on trying to render atypical font faces.
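
For readers curious about the bitmap-to-BMP step mentioned earlier in this section, here is a rough Python sketch of an uncompressed 24-bit BMP encoder. It is not the sist2 code: the `encode_bmp` helper is hypothetical, and it assumes the rendered glyphs are available as rows of 0-255 grayscale values.

```python
import struct


def encode_bmp(pixels):
    """Encode rows of 0-255 grayscale values as an uncompressed 24-bit BMP."""
    height = len(pixels)
    width = len(pixels[0])
    # Each pixel row is padded to a multiple of 4 bytes
    row_size = (width * 3 + 3) & ~3
    padding = b"\x00" * (row_size - width * 3)
    image_size = row_size * height

    # BITMAPFILEHEADER (14 bytes): magic, file size, reserved, pixel data offset
    file_header = struct.pack("<2sIHHI", b"BM", 54 + image_size, 0, 0, 54)
    # BITMAPINFOHEADER (40 bytes): 1 plane, 24 bits per pixel, no compression
    info_header = struct.pack(
        "<IiiHHIIiiII",
        40, width, height, 1, 24, 0, image_size, 2835, 2835, 0, 0,
    )

    body = bytearray()
    # Rows are stored bottom-up and each pixel is written as B, G, R
    for row in reversed(pixels):
        for value in row:
            body += bytes((value, value, value))
        body += padding
    return file_header + info_header + bytes(body)


bmp = encode_bmp([[0, 255, 0], [255, 0, 255]])
print(len(bmp))
```

The format needs no compression library at all, which is what makes it convenient when the encoder has to be written by hand.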

## Raw Index Binary Format

For simplicity's sake, the document metadata structure is dumped directly from memory to
file without much additional processing. While it's not as space-efficient as it could be,
it's still much smaller (by a factor of about 3.5) than the equivalent JSON.

**idx/_index\_\<pid\>**
{{<highlight hexdump>}}
000  e5 94 64 1d 82 91 4f 25  80 31 2b 69 db 23 14 79  ..d...O%.1+i.#.y
010  dd 00 84 00 31 08 00 00  10 fa 27 00 00 00 00 00  ....1.....'.....
020  8a 01 06 00 cc ea a7 5c  00 00 08 00 00 00 00 00  .......\........
030  62 6f 62 72 6f 73 73 2e  77 65 62 6d 00 f6 8b 00  bobross.webm....
040  00 00 f2 00 05 00 00 f3  d0 02 00 00 0a
{{</highlight>}}

This, of course, makes little difference since neither format is needed
once the documents have been indexed into Elasticsearch.

**(Elasticsearch JSON document)**
{{<highlight json>}}
{
  "_id": "e594641d-8291-4f25-8031-2b69db231479",
  "_index": "sist2",
  "_type": "_doc",
  "_source": {
    "index": "bb3d8cc5-2e5c-4f1c-ac04-1b1f6d9b070a",
    "mime": "video/webm",
    "size": 2619920,
    "mtime": 1554508492,
    "extension": "webm",
    "name": "bobross",
    "path": "",
    "videoc": "vp8",
    "width": 1280,
    "height": 720
  }
}
{{</highlight>}}
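
The hexdump and the JSON document above describe the same file, which can be checked with a short Python sketch. The offsets and field widths below are not taken from the sist2 source; they are inferred by matching the bytes of the dump against the values in the JSON document, so treat the layout as an assumption.

```python
import struct
import uuid

# Bytes copied from the hexdump above
raw = bytes.fromhex(
    "e594641d82914f2580312b69db231479"
    "dd0084003108000010fa270000000000"
    "8a010600cceaa75c0000080000000000"
    "626f62726f73732e7765626d00f68b00"
    "0000f200050000f3d00200000a"
)

# Inferred layout: 16-byte UUID, then little-endian integers
doc_id = uuid.UUID(bytes=raw[0x00:0x10])
(size,) = struct.unpack_from("<Q", raw, 0x18)
(mtime,) = struct.unpack_from("<I", raw, 0x24)

# The file name is a NUL-terminated string starting at 0x30
name_end = raw.index(b"\x00", 0x30)
name = raw[0x30:name_end].decode()

# The trailing bytes look like (1-byte tag, 4-byte value) pairs
tags = {}
pos = name_end + 1
while pos + 5 <= len(raw):
    tag, value = struct.unpack_from("<BI", raw, pos)
    tags[tag] = value
    pos += 5

print(doc_id, size, mtime, name, tags)
```

Under these assumptions the first 16 bytes give the `_id`, the integers at `0x18` and `0x24` match `size` and `mtime`, and the values tagged `0xf2` and `0xf3` match `width` and `height`.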