diff --git a/content/posts/sist2.md b/content/posts/sist2.md new file mode 100644 index 0000000..cefeb49 --- /dev/null +++ b/content/posts/sist2.md @@ -0,0 +1,196 @@ +--- +title: "Indexing your files with sist2" +date: 2019-11-04T19:31:45-05:00 +draft: false +author: simon987 +--- + + +# Overview + +[sist2](https://github.com/simon987/sist2) (simple incremental search tool) is a more powerful and more lightweight version of +its [Python predecessor](https://github.com/simon987/Simple-Incremental-Search-Tool). +It is currently being used to allow full-text search of terabytes of online documents such as scientific papers and comic books +at [the-eye.eu](https://searchin.the-eye.eu/). + +It can parse many common file types (See [README.md](https://github.com/simon987/sist2/blob/master/README.md#format-support) for +the updated list) and will extract text from their metadata and contents. + +The indexing process is typically done in three steps: `scan`, `index` then `web`. +For example: + +{{}} +sist2 scan ./my_documents/ -o idx/ +{{}} + +After this step, the raw index (**./idx/**) has been created and direct access to the files is no longer necessary. +This means that you can pass around the raw index folder or use sist2 to index files stored on cold storage. + +The `index` step will convert the raw index into JSON documents and push them to Elasticsearch. Sist2 +is compatible with versions 6.X and 7.X. + +{{}} +# Start a debug elasticsearch instance +docker run -d -p 9201:9200 \ + -e "discovery.type=single-node" \ + docker.elastic.co/elasticsearch/elasticsearch:7.4.2 + +# The --force-reset flag tells sist2 to (re)initialize +# the Elasticsearch mappings & settings +sist2 index idx/ --force-reset --es-url http://localhost:9201 +{{}} + +{{}} +sist2 web idx/ --port 8080 +# Starting web server @ http://localhost:8080 +{{}} + +## Web interface + +The web module can serve the search interface on its own without additional configuration. 
+What's interesting to note is that the files themselves can either be served by a remote HTTP server that
+acts as an external CDN, or they can be served by sist2 directly from the disk. In the latter case, Partial
+Content is supported, meaning that `Range` requests are accepted and media files can be *seeked* from
+the browser.
+
+{{< figure src="/sist/sist_web.png" title="">}}
+
+The UI itself is not much different from the original Python/Flask version. However, the JavaScript
+client is a bit *thicker*, meaning that most operations that were originally handled by the Flask server,
+such as auto-complete and the retrieval of the MIME type list, are now done client-side.
+
+This is possible because Elasticsearch queries are proxied as-is through sist2. For example, the MIME type
+selection widget is populated with a function similar to this:
+
+{{}}
+$.post("es", {
+    // Elasticsearch query body
+    aggs: {
+        mimeTypes: {
+            terms: {
+                field: "mime",
+                size: 10000
+            }
+        }
+    },
+    size: 0,
+}).then(resp => {
+    resp["aggregations"]["mimeTypes"]["buckets"].forEach(bucket => {
+        console.log(bucket);
+        //...
+    });
+});
+{{}}
+
+{{< figure src="/sist/sist_buckets.png" title="">}}
+
+Another improvement was to re-skin the whole page to allow users to choose a dark, *OLED*-friendly theme. Pressing
+the theme toggle button sets `Cookie: sist=dark`, and sist2 serves different content depending on the value
+of that cookie.
+
+{{< figure src="/sist/sist.png" title="Web interface (Dark theme) displaying Occult Library books">}}
+
+
+## Thumbnail storage
+
+An [LMDB](https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database) (Lightning Memory-Mapped Database)
+key-value store is used to asynchronously save the thumbnails as they are generated by the indexer.
+Once the `scan` step is done, the database file is used by the `web` module to serve the thumbnails
+with very little latency.
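
LMDB serves reads through a memory-mapped file, which is a large part of why lookups are cheap once the pages are cached. The sketch below is not LMDB and does not use its API. It only illustrates the underlying mechanism with Python's standard `mmap` module, assuming a hypothetical layout where thumbnails are stored back to back in one file and a plain dict (`offsets`) stands in for LMDB's key lookup.

```python
import mmap
import os
import tempfile

# Hypothetical thumbnail payloads keyed by document id (stand-ins for real images).
thumbnails = {"doc-a": b"\x89PNG-a", "doc-b": b"\xff\xd8JPEG-b"}

# Write the blobs back to back and remember where each one starts.
# LMDB keeps this bookkeeping in its B+tree; here it is a plain dict.
offsets = {}
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    for key, blob in thumbnails.items():
        offsets[key] = (f.tell(), len(blob))
        f.write(blob)


def read_thumbnail(mapping, key):
    """Return the bytes stored for `key` by slicing the mapped file."""
    start, length = offsets[key]
    return mapping[start:start + length]


with open(path, "rb") as f:
    # Read-only mapping: the kernel pages data in on demand and may evict
    # those pages later, since they are never modified.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapping:
        fetched = {key: read_thumbnail(mapping, key) for key in thumbnails}

print(fetched["doc-b"])
```

No explicit read buffer is managed here: a lookup is a slice of the mapping, and caching is left to the operating system.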
+
+Since the database is mapped in memory (see [mmap(2)](https://en.wikipedia.org/wiki/Mmap)),
+the `web` process may appear to have a high memory usage under load,
+but almost all of it is attributed to the mapped **data.mdb** file. In fact, if we take a look with `pmap`,
+we can see that virtually all of the resident memory is used for LMDB and that none of it is *dirty*.
+Clean pages can be dropped without being written back, so the operating system will eventually reclaim
+the memory and, over time, the memory usage will return to ~20M.
+
+{{}}
+$ pmap -x
+
+Address            Kbytes     RSS   Dirty Mode  Mapping
+00005641d5689000    21300     536       0 r-x-- sist2
+00005641d5689000        0       0       0 r-x-- sist2
+00005641d6d56000      432       8       0 r-x-- sist2
+00005641d6d56000        0       0       0 r-x-- sist2
+00005641d6dc2000    32696     768       8 rwx-- sist2
+00005641d6dc2000        0       0       0 rwx-- sist2
+00005641d8db0000     8452     100       4 rwx--   [ anon ]
+...
+00007fd1d7419000  3180068  160000       0 rwxs- data.mdb
+00007fd2998a4000  2290452     240       0 rwxs- data.mdb
+00007fd32586b000 10721328   64868       0 rwxs- data.mdb
+00007fd5b4179000  3535892   51616       0 rwxs- data.mdb
+00007fd68c180000  4446024  118668       0 rwxs- data.mdb
+00007fd79ba54000  1411416   47992       0 rwxs- data.mdb
+00007fd7f1fac000   560000    6044       0 rwxs- data.mdb
+00007fd81458e000  9069792  217464       0 rwxs- data.mdb
+...
+00007fda42736000     2048       0       0 ----- libc-2.24.so
+---------------- -------- ------- -------
+total kB         36085472  683468   10472
+{{}}
+
+
+
+## Media Files
+
+All audio and video files are handled by ffmpeg's libav\* libraries, which is extremely helpful since
+we can handle all `audio/*`, `video/*` and `image/*` file types the same way (images are videos that
+have only one frame).
+For instance, there is no difference in the code between thumbnails that are generated from the embedded cover art of a `.mp3`
+file versus thumbnails generated from a video stream of a `.mkv` container. We also don't have to worry about
+odd encodings because ffmpeg is bundled with hundreds of decoders.
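
The routing described above can be pictured as a simple check on the MIME type prefix. This is only a sketch: `MEDIA_PREFIXES` and `uses_media_parser` are hypothetical names chosen for illustration and are not taken from sist2's source.

```python
# Hypothetical names; this only illustrates the idea of one shared code path
# for every type that the libav* libraries can open.
MEDIA_PREFIXES = ("audio/", "video/", "image/")


def uses_media_parser(mime: str) -> bool:
    """True when a MIME type would be routed to the shared libav-based path."""
    # str.startswith accepts a tuple and matches any of its entries.
    return mime.startswith(MEDIA_PREFIXES)


for mime in ("audio/mpeg", "video/x-matroska", "image/png", "application/pdf"):
    print(mime, uses_media_parser(mime))
```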
+
+## Font Files
+
+Font files were especially painful to work with, since I had to implement the code
+to generate the thumbnails mostly from scratch. Each letter is individually drawn into
+a bitmap, which is then converted to uncompressed *BMP* format and saved directly to disk.
+Thankfully, *most* font faces are relatively standard, in that they are meant
+to be displayed from left to right and
+provide glyphs for the basic Latin alphabet.
+
+{{< figure src="/sist/font.png" title="">}}
+
+For the rest, I would mostly have to handle each corner case one by one. At the
+time of writing, I have given up on trying to render atypical font faces.
+
+## Raw Index Binary Format
+
+For simplicity's sake, the document metadata structure is dumped directly from memory to
+file without much additional processing. While it's not as space-efficient as it could be,
+it is still much smaller than the equivalent JSON (by a factor of roughly 3.5).
+
+**idx/_index\_\**
+{{}}
+000 e5 94 64 1d 82 91 4f 25 80 31 2b 69 db 23 14 79 ..d...O%.1+i.#.y
+010 dd 00 84 00 31 08 00 00 10 fa 27 00 00 00 00 00 ....1.....'.....
+020 8a 01 06 00 cc ea a7 5c 00 00 08 00 00 00 00 00 .......\........
+030 62 6f 62 72 6f 73 73 2e 77 65 62 6d 00 f6 8b 00 bobross.webm....
+040 00 00 f2 00 05 00 00 f3 d0 02 00 00 0a
+{{}}
+
+This, of course, makes little difference since neither format is needed
+once the documents have been indexed in Elasticsearch.
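
The hex dump above can be cross-checked against the Elasticsearch document for the same file, shown further down in this section. The sketch below decodes a few fields with Python's `struct` and `uuid` modules. The offsets and field widths are assumptions, inferred only by lining up the dump with the JSON values. They are not taken from sist2's source and may differ between versions.

```python
import struct
import uuid

# Bytes of the dump above, copied row by row.
raw = bytes.fromhex(
    "e5 94 64 1d 82 91 4f 25 80 31 2b 69 db 23 14 79 "
    "dd 00 84 00 31 08 00 00 10 fa 27 00 00 00 00 00 "
    "8a 01 06 00 cc ea a7 5c 00 00 08 00 00 00 00 00 "
    "62 6f 62 72 6f 73 73 2e 77 65 62 6d 00 f6 8b 00 "
    "00 00 f2 00 05 00 00 f3 d0 02 00 00 0a"
)

# Assumed layout (inferred, not authoritative):
#   0x00  16 bytes  document UUID
#   0x18   8 bytes  file size, little-endian unsigned
#   0x24   4 bytes  modification time, little-endian unsigned
#   0x30  ...       NUL-terminated file name
doc_id = uuid.UUID(bytes=raw[0x00:0x10])
size = struct.unpack_from("<Q", raw, 0x18)[0]
mtime = struct.unpack_from("<I", raw, 0x24)[0]
name_end = raw.index(b"\x00", 0x30)
name = raw[0x30:name_end].decode("ascii")

# The trailing bytes look like one-byte tags followed by 32-bit values;
# 0xf2 and 0xf3 appear to carry the width and height (again an inference).
width = struct.unpack_from("<I", raw, 0x43)[0]
height = struct.unpack_from("<I", raw, 0x48)[0]

print(doc_id, size, mtime, name, width, height)
```

Each decoded value agrees with the corresponding `_id`, `size`, `mtime`, `width` and `height` field of the JSON document, and the stored name is the full `bobross.webm` rather than the split `name` and `extension` fields.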
+ +**(Elasticsearch JSON document)** +{{}} +{ + "_id": "e594641d-8291-4f25-8031-2b69db231479", + "_index": "sist2", + "_type": "_doc", + "_source": { + "index": "bb3d8cc5-2e5c-4f1c-ac04-1b1f6d9b070a", + "mime": "video/webm", + "size": 2619920, + "mtime": 1554508492, + "extension": "webm", + "name": "bobross", + "path": "", + "videoc": "vp8", + "width": 1280, + "height": 720 + } +} +{{}} diff --git a/content/posts/zpaq.md b/content/posts/zpaq.md new file mode 100644 index 0000000..56a9de2 --- /dev/null +++ b/content/posts/zpaq.md @@ -0,0 +1,51 @@ +--- +title: "Android phone backups with zpaq" +date: 2019-11-05T13:16:27-05:00 +draft: true +author: simon987 +--- + +{{< figure src="/zpaq/10gb.png" title="Benchmark for 10GB">}} + +{{}} +pkg install g++ make git +git clone "https://github.com/zpaq/zpaq" +cd zpaq/ +# zpaq must be compiled with -DNOJIT for non-x86 processors +g++ -Ofast -DNOJIT -Dunix zpaq.cpp libzpaq.cpp -pthread -o zpaq +{{}} + +{{}} +## Initial backup can take a while to complete, +$ zpaq add "arc???" ./files/ -index local-index.zpaq +0.000000 + (955.283380 -> 687.840444 -> 622.268166) = 622.268166 MB +45.737 seconds (all OK) + +## but subsequent ones are almost instantaneous if no files were changed +$ zpaq add "arc???" ./files/ -index local-index.zpaq +0.000000 + (0.000000 -> 0.000000 -> 0.000104) = 0.000104 MB +0.408 seconds (all OK) + + +## +$ ls -lh +total 594M +-rw------- 1 u0_a94 u0_a94 594M Nov 5 14:18 arc001.zpaq +-rw------- 1 u0_a94 u0_a94 104 Nov 5 14:18 arc002.zpaq +-rwx------ 1 u0_a94 u0_a94 362 Nov 5 14:17 backup.sh +-rw------- 1 u0_a94 u0_a94 411K Nov 5 14:18 local-index.zpaq +{{}} + +{{}} +#!/usr/bin/env bash + +zpaq add "arc???" \ + ~/storage/shared/DCIM \ + ~/storage/shared/Documents \ + ~/storage/shared/Download \ + #... 
+ -index local-index.zpaq -m2 + +rclone move arc*.zpaq my-remote:/backups +{{}} + diff --git a/diagrams/sist2_web.dia b/diagrams/sist2_web.dia new file mode 100644 index 0000000..0ac78ce Binary files /dev/null and b/diagrams/sist2_web.dia differ diff --git a/layouts/partials/css/tables-min.css b/layouts/partials/css/tables-min.css new file mode 100644 index 0000000..e141d94 --- /dev/null +++ b/layouts/partials/css/tables-min.css @@ -0,0 +1,78 @@ +/*! +Pure v1.0.0 +Copyright 2013 Yahoo! +Licensed under the BSD License. +https://github.com/yahoo/pure/blob/master/LICENSE.md +*/ +.pure-table { + /* Remove spacing between table cells (from Normalize.css) */ + border-collapse: collapse; + border-spacing: 0; + empty-cells: show; + border: 1px solid #cbcbcb; + width: 100%; +} + +.pure-table caption { + color: #000; + font: italic 85%/1 arial, sans-serif; + padding: 1em 0; + text-align: center; +} + +.pure-table td, +.pure-table th { + border-left: 1px solid #cbcbcb;/* inner column border */ + border-width: 0 0 0 1px; + font-size: inherit; + margin: 0; + overflow: visible; /*to make ths where the title is really long work*/ + padding: 0.5em 1em; /* cell padding */ + line-height: 1.1; +} + +.pure-table thead { + background-color: #E0E0E0; + color: #000; + text-align: left; + vertical-align: bottom; +} + +/* +striping: + even - #fff (white) + odd - #f2f2f2 (light gray) +*/ +.pure-table td { + background-color: transparent; +} +.pure-table-odd td { + background-color: #f2f2f2; +} + +/* nth-child selector for modern browsers */ +.pure-table-striped tr:nth-child(2n-1) td { + background-color: #212121; +} + +/* BORDERED TABLES */ +.pure-table-bordered td { + border-bottom: 1px solid #cbcbcb; +} +.pure-table-bordered tbody > tr:last-child > td { + border-bottom-width: 0; +} + + +/* HORIZONTAL BORDERED TABLES */ + +.pure-table-horizontal td, +.pure-table-horizontal th { + border-width: 0 0 1px 0; + border-bottom: 1px solid #cbcbcb; +} +.pure-table-horizontal tbody > tr:last-child > 
td { + border-bottom-width: 0; +} + + diff --git a/static/sist/font.png b/static/sist/font.png new file mode 100644 index 0000000..f51e0f3 Binary files /dev/null and b/static/sist/font.png differ diff --git a/static/sist/sist.png b/static/sist/sist.png new file mode 100644 index 0000000..b7867dc Binary files /dev/null and b/static/sist/sist.png differ diff --git a/static/sist/sist_buckets.png b/static/sist/sist_buckets.png new file mode 100644 index 0000000..57ffb93 Binary files /dev/null and b/static/sist/sist_buckets.png differ diff --git a/static/sist/sist_web.png b/static/sist/sist_web.png new file mode 100644 index 0000000..25be191 Binary files /dev/null and b/static/sist/sist_web.png differ diff --git a/static/zpaq/10gb.png b/static/zpaq/10gb.png new file mode 100644 index 0000000..28bdd61 Binary files /dev/null and b/static/zpaq/10gb.png differ