---
title: "Indexing your files with sist2"
date: 2019-11-04T19:31:45-05:00
draft: false
tags: ["data curation", "misc"]
author: simon987
---

# Overview

[sist2](https://github.com/simon987/sist2) (simple incremental search tool) is a more powerful and more lightweight version of
its [Python predecessor](https://github.com/simon987/Simple-Incremental-Search-Tool).
It is currently being used to allow full-text search of terabytes of online documents such as scientific papers and comic books
at [the-eye.eu](https://searchin.the-eye.eu/).

It can parse many common file types (see [README.md](https://github.com/simon987/sist2/blob/master/README.md#format-support) for
the up-to-date list) and will extract text from their metadata and contents.

The indexing process is typically done in three steps: `scan`, `index`, then `web`.
For example:

{{<highlight bash>}}
sist2 scan ./my_documents/ -o idx/
{{</highlight>}}

After this step, the raw index (**./idx/**) has been created and direct access to the files is no longer necessary.
This means that you can pass around the raw index folder or use sist2 to index files stored on cold storage.

The `index` step will convert the raw index into JSON documents and push them to Elasticsearch. sist2
is compatible with Elasticsearch versions 6.X and 7.X.

{{<highlight bash>}}
# Start a debug elasticsearch instance
docker run -d -p 9201:9200 \
    -e "discovery.type=single-node" \
    docker.elastic.co/elasticsearch/elasticsearch:7.4.2

# The --force-reset flag tells sist2 to (re)initialize
# the Elasticsearch mappings & settings
sist2 index idx/ --force-reset --es-url http://localhost:9201
{{</highlight>}}

Finally, the `web` step starts the web server that serves the search interface:

{{<highlight bash>}}
sist2 web idx/ --port 8080
# Starting web server @ http://localhost:8080
{{</highlight>}}

## Web interface

The web module can serve the search interface on its own without additional configuration.
What's interesting to note is that the files themselves can either be served by a remote HTTP server that
acts as an external CDN, or be served by sist2 directly from the disk. In the latter case, Partial
Content is supported, meaning that `Range` requests are accepted and media files can be *'seeked'* from
the browser.

{{< figure height="350px" src="/sist/sist_web.png" title="">}}

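To make the Partial Content behaviour more concrete, here is a minimal sketch of how a server could resolve a single-range `Range` header into a byte window and the matching `Content-Range` value of a `206` response. This is not sist2's code: `parseRange` and `contentRange` are hypothetical helpers, and only the `bytes=start-end`, `bytes=start-` and `bytes=-suffix` forms are handled.

```javascript
// Hypothetical helper (not part of sist2): resolve a single-range "Range"
// header against a file size. Returns inclusive byte offsets {start, end},
// or null when the header is unsupported or the range cannot be satisfied.
function parseRange(header, size) {
    const match = /^bytes=(\d*)-(\d*)$/.exec(header.trim());
    if (!match || (match[1] === "" && match[2] === "")) {
        return null;
    }
    let start, end;
    if (match[1] === "") {
        // "bytes=-N": the last N bytes of the file
        const suffix = parseInt(match[2], 10);
        if (suffix === 0) return null;
        start = Math.max(size - suffix, 0);
        end = size - 1;
    } else {
        start = parseInt(match[1], 10);
        // "bytes=N-": from offset N to the end of the file
        end = match[2] === "" ? size - 1 : Math.min(parseInt(match[2], 10), size - 1);
    }
    if (start > end || start >= size) return null;
    return { start, end };
}

// Value of the Content-Range header sent along with a 206 Partial Content response
function contentRange(range, size) {
    return `bytes ${range.start}-${range.end}/${size}`;
}

// A browser seeking into the middle of a 2619920-byte video
const fileSize = 2619920;
const seekRange = parseRange("bytes=1000000-", fileSize);
console.log(contentRange(seekRange, fileSize)); // bytes 1000000-2619919/2619920
```

When the helper returns `null`, a server would typically answer with `416 Range Not Satisfiable` or fall back to sending the whole file.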
The UI itself is not that much different from the original Python/Flask version. However, the JavaScript
client is a bit *thicker*, meaning that most operations that were originally handled by the Flask server,
such as auto-complete and the retrieval of the mime type list, are done client-side.

This is possible because Elasticsearch queries are proxied as-is through sist2. For example, the mime type
selection widget is populated with a function similar to this:

{{<highlight javascript>}}
$.ajax({
    url: "es",
    method: "POST",
    // Elasticsearch expects a JSON body, so serialize it explicitly
    contentType: "application/json",
    data: JSON.stringify({
        // Elasticsearch query body
        aggs: {
            mimeTypes: {
                terms: {
                    field: "mime",
                    size: 10000
                }
            }
        },
        // Only the aggregation buckets are needed, not the search hits
        size: 0,
    })
}).then(resp => {
    // One bucket per distinct mime type
    resp["aggregations"]["mimeTypes"]["buckets"].forEach(bucket => {
        console.log(bucket);
        //...
    });
});
{{</highlight>}}

{{< figure src="/sist/sist_buckets.png" title="">}}

Another improvement was to re-skin the whole page to allow users to choose the dark *OLED*-friendly theme. Pressing
the theme toggle button sets the `sist=dark` cookie (sent as `Cookie: sist=dark`), which tells sist2 to serve different content depending on the value
of the cookie.

{{< figure src="/sist/sist.png" title="Web interface (Dark theme) displaying Occult Library books">}}

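The cookie check itself is simple enough to sketch. The snippet below only illustrates the idea and is not sist2's code: `themeFromCookie` is a hypothetical helper, and the `"light"` name for the default theme is an assumption (only the cookie name `sist` and the value `dark` come from the text above).

```javascript
// Hypothetical helper (not part of sist2): pick a theme from the raw
// "Cookie" request header. The "light" fallback name is an assumption.
function themeFromCookie(cookieHeader) {
    const cookies = (cookieHeader || "").split(";");
    for (const cookie of cookies) {
        // Split on the first "=" only, cookie values may contain "="
        const [name, ...rest] = cookie.trim().split("=");
        if (name === "sist" && rest.join("=") === "dark") {
            return "dark";
        }
    }
    return "light";
}

console.log(themeFromCookie("lang=en; sist=dark")); // dark
console.log(themeFromCookie(""));                   // light
```

Keeping the decision on the server means the page is rendered with the right theme from the first byte, instead of being restyled after it loads.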
## Thumbnail storage

An [LMDB](https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database) (Lightning Memory-Mapped Database)
key-value store is used to asynchronously save the thumbnails as they are generated by the indexer.
Once the `scan` step is done, the database file is used by the `web` module to serve the thumbnails
with very little latency.

Since the database is mapped in memory (see [mmap(2)](https://en.wikipedia.org/wiki/Mmap)),
the `web` process may appear to have a high memory usage under load,
but almost all of it is attributable to the mapped **data.mdb** file. In fact, if we take a look with `pmap`,
we can see that virtually all of the resident memory is used for LMDB and that none of it is *dirty*.
Since clean pages can be dropped without being written back to disk, the operating system will eventually reclaim the memory and, over time, the memory usage will
return to ~20M.

{{<highlight _>}}
$ pmap -x <PID>

Address            Kbytes     RSS   Dirty Mode  Mapping
00005641d5689000    21300     536       0 r-x-- sist2
00005641d5689000        0       0       0 r-x-- sist2
00005641d6d56000      432       8       0 r-x-- sist2
00005641d6d56000        0       0       0 r-x-- sist2
00005641d6dc2000    32696     768       8 rwx-- sist2
00005641d6dc2000        0       0       0 rwx-- sist2
00005641d8db0000     8452     100       4 rwx-- [ anon ]
...
00007fd1d7419000  3180068  160000       0 rwxs- data.mdb
00007fd2998a4000  2290452     240       0 rwxs- data.mdb
00007fd32586b000 10721328   64868       0 rwxs- data.mdb
00007fd5b4179000  3535892   51616       0 rwxs- data.mdb
00007fd68c180000  4446024  118668       0 rwxs- data.mdb
00007fd79ba54000  1411416   47992       0 rwxs- data.mdb
00007fd7f1fac000   560000    6044       0 rwxs- data.mdb
00007fd81458e000  9069792  217464       0 rwxs- data.mdb
...
00007fda42736000     2048       0       0 ----- libc-2.24.so
---------------- -------- ------- -------
total kB         36085472  683468   10472
{{</highlight>}}

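As a quick sanity check of that claim, the listing can be summed per mapping. The sketch below uses a hypothetical helper (`sumByMapping` is not part of sist2 or `pmap`) on the rows visible in the listing above; the rows elided with `...` are not included, so the figures are lower bounds.

```javascript
// Rows copied from the pmap listing above (the elided "..." rows are missing)
const pmapRows = `
00005641d5689000 21300 536 0 r-x-- sist2
00005641d6d56000 432 8 0 r-x-- sist2
00005641d6dc2000 32696 768 8 rwx-- sist2
00005641d8db0000 8452 100 4 rwx-- [ anon ]
00007fd1d7419000 3180068 160000 0 rwxs- data.mdb
00007fd2998a4000 2290452 240 0 rwxs- data.mdb
00007fd32586b000 10721328 64868 0 rwxs- data.mdb
00007fd5b4179000 3535892 51616 0 rwxs- data.mdb
00007fd68c180000 4446024 118668 0 rwxs- data.mdb
00007fd79ba54000 1411416 47992 0 rwxs- data.mdb
00007fd7f1fac000 560000 6044 0 rwxs- data.mdb
00007fd81458e000 9069792 217464 0 rwxs- data.mdb
`;

// Hypothetical helper: sum the RSS and Dirty columns (in kB) per mapping name
function sumByMapping(text) {
    const totals = {};
    for (const line of text.trim().split("\n")) {
        // Columns: address, Kbytes, RSS, Dirty, mode, mapping (may contain spaces)
        const [, , rss, dirty, , ...mapping] = line.trim().split(/\s+/);
        const name = mapping.join(" ");
        totals[name] = totals[name] || { rss: 0, dirty: 0 };
        totals[name].rss += Number(rss);
        totals[name].dirty += Number(dirty);
    }
    return totals;
}

const pmapTotals = sumByMapping(pmapRows);
const totalRss = 683468; // RSS from the "total kB" line of the listing
const lmdbShare = (100 * pmapTotals["data.mdb"].rss / totalRss).toFixed(1);
console.log(pmapTotals["data.mdb"]); // { rss: 666892, dirty: 0 }
console.log(lmdbShare + "%");        // 97.6%
```

Even with the elided rows missing, the visible `data.mdb` mappings account for about 97.6% of the resident memory, and none of those pages are dirty.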
## Media Files

All audio and video files are handled by FFmpeg's libav\* libraries, which is extremely helpful since
we can handle all `audio/*`, `video/*` and `image/*` file types the same way (images are treated as videos that have only one frame).
For instance, there is no difference in the code between thumbnails that are generated from the embedded cover art of an `.mp3`
file and thumbnails generated from a video stream of a `.mkv` container. We also don't have to worry about
odd encodings because FFmpeg is bundled with hundreds of decoders.

## Font Files

Font files were especially painful to work with, since I had to implement the code
to generate the thumbnails mostly from scratch. Each letter is individually drawn into
a bitmap, which is then converted to uncompressed *BMP* format and saved directly to disk.
Thankfully, *most* font faces are relatively standard, in that they are meant
to be displayed from left to right and
glyphs for the basic Latin alphabet are available.

{{< figure src="/sist/font.png" title="">}}

For the rest, I would mostly have to handle each corner case one by one. At the
time of writing, I have given up on trying to render atypical font faces.

## Raw Index Binary Format

For simplicity's sake, the document metadata structure is dumped directly from memory to
file without much additional processing. While it's not as space-efficient as it could be,
it's still much smaller (about 3.5 times) than the equivalent in JSON.

**idx/_index\_\<pid\>**
{{<highlight hexdump>}}
000 e5 94 64 1d 82 91 4f 25 80 31 2b 69 db 23 14 79 ..d...O%.1+i.#.y
010 dd 00 84 00 31 08 00 00 10 fa 27 00 00 00 00 00 ....1.....'.....
020 8a 01 06 00 cc ea a7 5c 00 00 08 00 00 00 00 00 .......\........
030 62 6f 62 72 6f 73 73 2e 77 65 62 6d 00 f6 8b 00 bobross.webm....
040 00 00 f2 00 05 00 00 f3 d0 02 00 00 0a
{{</highlight>}}

This, of course, makes little difference since neither format is needed
once the documents have been indexed in Elasticsearch.

**(Elasticsearch JSON document)**
{{<highlight json>}}
{
    "_id": "e594641d-8291-4f25-8031-2b69db231479",
    "_index": "sist2",
    "_type": "_doc",
    "_source": {
        "index": "bb3d8cc5-2e5c-4f1c-ac04-1b1f6d9b070a",
        "mime": "video/webm",
        "size": 2619920,
        "mtime": 1554508492,
        "extension": "webm",
        "name": "bobross",
        "path": "",
        "videoc": "vp8",
        "width": 1280,
        "height": 720
    }
}
{{</highlight>}}
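
The correspondence between the two representations can be checked by hand. The sketch below decodes a few fields from the hexdump above. The offsets (UUID at `0x00`, size at `0x18`, mtime at `0x24`, NUL-terminated file name at `0x30`) and the little-endian byte order are inferred from this single dump, not taken from sist2's source, so treat the layout as an assumption; `decodeRecord` is a hypothetical helper.

```javascript
// Bytes copied from the hexdump above
const hexDump = `
e5 94 64 1d 82 91 4f 25 80 31 2b 69 db 23 14 79
dd 00 84 00 31 08 00 00 10 fa 27 00 00 00 00 00
8a 01 06 00 cc ea a7 5c 00 00 08 00 00 00 00 00
62 6f 62 72 6f 73 73 2e 77 65 62 6d 00 f6 8b 00
00 00 f2 00 05 00 00 f3 d0 02 00 00 0a
`;

// Hypothetical helper: the field offsets are assumptions inferred from the dump
function decodeRecord(hex) {
    const bytes = Uint8Array.from(hex.trim().split(/\s+/), b => parseInt(b, 16));
    const view = new DataView(bytes.buffer);

    // 0x00: 16-byte UUID, printed in the usual 8-4-4-4-12 grouping
    const h = Array.from(bytes.slice(0, 16), b => b.toString(16).padStart(2, "0")).join("");
    const id = [h.slice(0, 8), h.slice(8, 12), h.slice(12, 16), h.slice(16, 20), h.slice(20)].join("-");

    // 0x18: file size in bytes, 64-bit little-endian
    const size = Number(view.getBigUint64(0x18, true));

    // 0x24: modification time, 32-bit little-endian Unix timestamp
    const mtime = view.getUint32(0x24, true);

    // 0x30: NUL-terminated file name
    const nameEnd = bytes.indexOf(0, 0x30);
    const filename = String.fromCharCode(...bytes.slice(0x30, nameEnd));

    return { id, size, mtime, filename };
}

console.log(decodeRecord(hexDump));
// {
//   id: 'e594641d-8291-4f25-8031-2b69db231479',
//   size: 2619920,
//   mtime: 1554508492,
//   filename: 'bobross.webm'
// }
```

The decoded values line up with the `_id`, `size` and `mtime` fields of the JSON document, and the file name splits into its `name` and `extension`.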