simon 2019-11-06 09:26:24 -05:00
parent 5270edcc89
commit 455c9a7144
9 changed files with 325 additions and 0 deletions

196
content/posts/sist2.md Normal file

@@ -0,0 +1,196 @@
---
title: "Indexing your files with sist2"
date: 2019-11-04T19:31:45-05:00
draft: false
author: simon987
---
# Overview
[sist2](https://github.com/simon987/sist2) (simple incremental search tool) is a more powerful and more lightweight version of
its [Python predecessor](https://github.com/simon987/Simple-Incremental-Search-Tool).
It is currently being used to allow full-text search of terabytes of online documents such as scientific papers and comic books
at [the-eye.eu](https://searchin.the-eye.eu/).
It can parse many common file types (see [README.md](https://github.com/simon987/sist2/blob/master/README.md#format-support) for
the up-to-date list) and will extract text from their metadata and contents.
The indexing process is typically done in three steps: `scan`, `index` then `web`.
For example:
{{<highlight bash>}}
sist2 scan ./my_documents/ -o idx/
{{</highlight>}}
After this step, the raw index (**./idx/**) has been created and direct access to the files is no longer necessary.
This means that you can pass around the raw index folder or use sist2 to index files stored on cold storage.
The `index` step converts the raw index into JSON documents and pushes them to Elasticsearch. sist2
is compatible with Elasticsearch versions 6.x and 7.x.
{{<highlight bash>}}
# Start a debug elasticsearch instance
docker run -d -p 9201:9200 \
-e "discovery.type=single-node" \
docker.elastic.co/elasticsearch/elasticsearch:7.4.2
# The --force-reset flag tells sist2 to (re)initialize
# the Elasticsearch mappings & settings
sist2 index idx/ --force-reset --es-url http://localhost:9201
{{</highlight>}}
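As a rough illustration of what "pushing JSON documents" can look like, here is a minimal sketch that builds the newline-delimited body accepted by Elasticsearch's `_bulk` endpoint. Whether sist2 actually uses the bulk API is an assumption on my part, `toBulkBody` is a hypothetical helper, and the action lines follow the 7.x form (no `_type`).

```javascript
// Hypothetical helper, not sist2 code: turn documents into an NDJSON body
// for POST /_bulk (assumes the Elasticsearch 7.x action format).
function toBulkBody(docs, indexName) {
  const lines = [];
  for (const { _id, ...source } of docs) {
    // Action line: which index and which id the next line belongs to
    lines.push(JSON.stringify({ index: { _index: indexName, _id } }));
    // Source line: the document itself
    lines.push(JSON.stringify(source));
  }
  // The bulk body must be terminated by a newline
  return lines.join("\n") + "\n";
}

const body = toBulkBody(
  [{ _id: "e594641d-8291-4f25-8031-2b69db231479", name: "bobross", extension: "webm" }],
  "sist2"
);
console.log(body);
```

The resulting string would be sent with `Content-Type: application/x-ndjson`.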
Finally, the `web` step starts the web server that serves the search interface:
{{<highlight bash>}}
sist2 web idx/ --port 8080
# Starting web server @ http://localhost:8080
{{</highlight>}}
## Web interface
The web module can serve the search interface on its own without additional configuration.
The files themselves can either be served by a remote HTTP server that
acts as an external CDN, or they can be served by sist2 directly from the disk. In the latter case, Partial
Content is supported, meaning that `Range` requests are accepted and the browser can *seek* within
media files.
{{< figure src="/sist/sist_web.png" title="">}}
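To make the `Range` handling mentioned above more concrete, here is a minimal sketch of how a server can turn a single-range header into the byte span of a `206 Partial Content` response. This is not sist2's implementation; `parseRange` and `partialContentHeaders` are hypothetical names, and multi-range requests are deliberately not handled.

```javascript
// Parse a single-range "bytes=..." header for a file of `size` bytes.
// Returns {start, end} (both inclusive), or null when the header cannot be
// honoured and the caller should fall back to another response.
function parseRange(header, size) {
  const m = /^bytes=(\d*)-(\d*)$/.exec(header.trim());
  if (!m || (m[1] === "" && m[2] === "")) return null;

  let start, end;
  if (m[1] === "") {
    // Suffix form "bytes=-N": the last N bytes of the file
    const n = Number(m[2]);
    if (n === 0) return null;
    start = Math.max(size - n, 0);
    end = size - 1;
  } else {
    start = Number(m[1]);
    // An open-ended or oversized range is clamped to the end of the file
    end = m[2] === "" ? size - 1 : Math.min(Number(m[2]), size - 1);
  }
  if (start >= size || start > end) return null;
  return { start, end };
}

// Headers that accompany the 206 response for a parsed range
function partialContentHeaders(range, size) {
  return {
    "Accept-Ranges": "bytes",
    "Content-Range": `bytes ${range.start}-${range.end}/${size}`,
    "Content-Length": String(range.end - range.start + 1),
  };
}

console.log(partialContentHeaders(parseRange("bytes=500-", 1000), 1000));
```

When a media player seeks, the browser simply issues a new request with a different `Range` value, so the server only has to read the requested span from disk.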
The UI itself is not very different from the original Python/Flask version. However, the JavaScript
client is a bit *thicker*: most operations that were originally handled by the Flask server,
such as auto-complete and the retrieval of the mime type list, are now done client-side.
This is possible because Elasticsearch queries are proxied as is through sist2. For example, the mime type
selection widget is populated with a function similar to this:
{{<highlight javascript>}}
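$.ajax({
    url: "es",
    method: "POST",
    // Elasticsearch expects a JSON body, not form-encoded data
    contentType: "application/json",
    data: JSON.stringify({
        // Elasticsearch query body
        aggs: {
            mimeTypes: {
                terms: {
                    field: "mime",
                    size: 10000
                }
            }
        },
        size: 0,
    }),
}).then(resp => {
    resp["aggregations"]["mimeTypes"]["buckets"].forEach(bucket => {
        console.log(bucket);
        //...
    });
});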
{{</highlight>}}
{{< figure src="/sist/sist_buckets.png" title="">}}
Another improvement was to re-skin the whole page to allow users to choose a dark, *OLED*-friendly theme. Clicking
the theme toggle button sets `Cookie: sist=dark`, and sist2 serves different content depending on the value
of that cookie.
{{< figure src="/sist/sist.png" title="Web interface (Dark theme) displaying Occult Library books">}}
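The cookie-based theme selection described above can be sketched server-side as follows. Only the `sist=dark` cookie comes from the actual implementation; the function name and the `"light"` label for the default theme are assumptions made for the example.

```javascript
// Hypothetical helper: pick a theme from the raw Cookie request header.
function themeFromCookie(cookieHeader) {
  for (const pair of (cookieHeader || "").split(";")) {
    const [name, ...rest] = pair.trim().split("=");
    if (name === "sist") {
      // Any value other than "dark" falls back to the default theme
      return rest.join("=") === "dark" ? "dark" : "light";
    }
  }
  return "light"; // no cookie set: default theme
}

console.log(themeFromCookie("lang=en; sist=dark"));
```

Deciding on the server means the page is delivered with the right stylesheet from the first byte, without a flash of the default theme.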
## Thumbnail storage
An [LMDB](https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database) (Lightning Memory-Mapped Database)
key-value store is used to asynchronously save the thumbnails as they are generated by the indexer.
Once the `scan` step is done, the database file is used by the `web` module to serve the thumbnails
with very little latency.
Since the database is mapped in memory (see [mmap(2)](https://en.wikipedia.org/wiki/Mmap)),
the `web` process may appear to have a high memory usage under load,
but almost all of it belongs to the mapping of the **data.mdb** file. In fact, if we take a look with `pmap`,
we can see that virtually all of the resident memory is used for LMDB and that none of it is *dirty*.
This means that the operating system will eventually reclaim the memory and that, over time, the memory usage will
return to ~20M.
{{<highlight _>}}
$ pmap -x <PID>
Address Kbytes RSS Dirty Mode Mapping
00005641d5689000 21300 536 0 r-x-- sist2
00005641d5689000 0 0 0 r-x-- sist2
00005641d6d56000 432 8 0 r-x-- sist2
00005641d6d56000 0 0 0 r-x-- sist2
00005641d6dc2000 32696 768 8 rwx-- sist2
00005641d6dc2000 0 0 0 rwx-- sist2
00005641d8db0000 8452 100 4 rwx-- [ anon ]
...
00007fd1d7419000 3180068 160000 0 rwxs- data.mdb
00007fd2998a4000 2290452 240 0 rwxs- data.mdb
00007fd32586b000 10721328 64868 0 rwxs- data.mdb
00007fd5b4179000 3535892 51616 0 rwxs- data.mdb
00007fd68c180000 4446024 118668 0 rwxs- data.mdb
00007fd79ba54000 1411416 47992 0 rwxs- data.mdb
00007fd7f1fac000 560000 6044 0 rwxs- data.mdb
00007fd81458e000 9069792 217464 0 rwxs- data.mdb
...
00007fda42736000 2048 0 0 ----- libc-2.24.so
---------------- ------- ------- -------
total kB 36085472 683468 10472
{{</highlight>}}
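To check this kind of claim without reading the table by eye, the `pmap -x` output can be summed per mapping. The following is a small sketch, assuming the six-column layout shown above; `rssByMapping` is a hypothetical helper and the sample only reuses a few lines of the listing.

```javascript
// Sum the RSS and Dirty columns (in kB) of `pmap -x` output per mapping name.
function rssByMapping(pmapOutput) {
  const totals = {};
  for (const line of pmapOutput.split("\n")) {
    const fields = line.trim().split(/\s+/);
    // Data rows start with a 16-digit hex address; skip headers and totals
    if (fields.length < 6 || !/^[0-9a-f]{16}$/.test(fields[0])) continue;
    const name = fields.slice(5).join(" "); // names such as "[ anon ]" contain spaces
    if (!totals[name]) totals[name] = { rss: 0, dirty: 0 };
    totals[name].rss += Number(fields[2]);
    totals[name].dirty += Number(fields[3]);
  }
  return totals;
}

const sample = [
  "00005641d6dc2000 32696 768 8 rwx-- sist2",
  "00005641d8db0000 8452 100 4 rwx-- [ anon ]",
  "00007fd1d7419000 3180068 160000 0 rwxs- data.mdb",
  "00007fd2998a4000 2290452 240 0 rwxs- data.mdb",
].join("\n");
console.log(rssByMapping(sample));
```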
## Media Files
All audio and video files are handled by ffmpeg's libav\* libraries, which is extremely helpful since
we can handle all `audio/*`, `video/*` and `image/*` file types the same way (images are treated as videos
that have only one frame).
For instance, there is no difference in the code between thumbnails that are generated from the embedded cover art of an `.mp3`
file versus thumbnails generated from a video stream of a `.mkv` container. We also don't have to worry about
odd encodings because ffmpeg is bundled with hundreds of decoders.
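The "one code path for every media type" idea can be illustrated with the `ffmpeg` command-line tool, used here only as a stand-in: sist2 links against the libav\* libraries directly, so this is not how it is implemented. `thumbnailArgs` and the 300-pixel default width are assumptions for the example.

```javascript
// Hypothetical helper: build ffmpeg CLI arguments that grab one video frame
// (or embedded cover art, which ffmpeg exposes as a video stream) and scale it.
function thumbnailArgs(input, output, width = 300) {
  return [
    "-y",                        // overwrite the output file if it exists
    "-i", input,
    "-an",                       // ignore audio streams
    "-frames:v", "1",            // stop after a single video frame
    "-vf", `scale=${width}:-1`,  // fixed width, keep the aspect ratio
    output,
  ];
}

// The same arguments are used regardless of the container:
console.log(thumbnailArgs("song.mp3", "song.jpg").join(" "));
console.log(thumbnailArgs("bobross.webm", "bobross.jpg").join(" "));

// To actually run it (requires ffmpeg to be installed):
// require("child_process").execFileSync("ffmpeg", thumbnailArgs("song.mp3", "song.jpg"));
```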
## Font Files
Font files were especially painful to work with, since I had to implement the code
to generate the thumbnails mostly from scratch. Each letter is individually drawn into
a bitmap, which is then converted to the uncompressed *BMP* format and saved directly to disk.
Thankfully, *most* font faces are relatively standard, in that they are meant
to be displayed from left to right, and
glyphs for the basic Latin alphabet are available.
{{< figure src="/sist/font.png" title="">}}
For the rest, I would mostly have to handle each corner case one by one. At the
time of writing, I have given up on trying to render atypical font faces.
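The bitmap-to-BMP step mentioned above is simple enough to sketch in a few lines. This is not the sist2 code (which is written in C); it is a minimal encoder, assuming an 8-bit grayscale input bitmap, that writes a standard 24-bit uncompressed BMP with a 14-byte file header and a 40-byte info header.

```javascript
// Encode an 8-bit grayscale bitmap (row-major, top row first) as an
// uncompressed 24-bit BMP. Returns a Buffer ready to be written to disk.
function encodeBmp(width, height, gray) {
  const rowSize = Math.ceil((width * 3) / 4) * 4; // rows are padded to 4 bytes
  const pixelBytes = rowSize * height;
  const out = Buffer.alloc(54 + pixelBytes);

  // File header (14 bytes)
  out.write("BM", 0, "ascii");
  out.writeUInt32LE(out.length, 2);   // total file size
  out.writeUInt32LE(54, 10);          // offset of the pixel data

  // Info header (40 bytes); fields left at zero are optional
  out.writeUInt32LE(40, 14);          // info header size
  out.writeInt32LE(width, 18);
  out.writeInt32LE(height, 22);       // positive height: rows stored bottom-up
  out.writeUInt16LE(1, 26);           // colour planes
  out.writeUInt16LE(24, 28);          // bits per pixel
  out.writeUInt32LE(0, 30);           // 0 = BI_RGB, i.e. no compression
  out.writeUInt32LE(pixelBytes, 34);  // size of the pixel data

  for (let y = 0; y < height; y++) {
    const row = 54 + (height - 1 - y) * rowSize;
    for (let x = 0; x < width; x++) {
      const v = gray[y * width + x];
      out.fill(v, row + x * 3, row + x * 3 + 3); // same value for B, G and R
    }
  }
  return out;
}

const bmp = encodeBmp(3, 2, [0, 255, 0, 255, 0, 255]);
console.log(bmp.length);
// require("fs").writeFileSync("glyphs.bmp", bmp);
```

Skipping compression keeps the writer trivial, at the cost of larger files.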
## Raw Index Binary Format
For simplicity's sake, the document metadata structure is dumped directly from memory to
file without much additional processing. While it's not as space-efficient as it could be,
it's still much smaller than the equivalent JSON (by a factor of roughly 3.5).
**idx/_index\_\<pid\>**
{{<highlight hexdump>}}
000 e5 94 64 1d 82 91 4f 25 80 31 2b 69 db 23 14 79 ..d...O%.1+i.#.y
010 dd 00 84 00 31 08 00 00 10 fa 27 00 00 00 00 00 ....1.....'.....
020 8a 01 06 00 cc ea a7 5c 00 00 08 00 00 00 00 00 .......\........
030 62 6f 62 72 6f 73 73 2e 77 65 62 6d 00 f6 8b 00 bobross.webm....
040 00 00 f2 00 05 00 00 f3 d0 02 00 00 0a
{{</highlight>}}
This, of course, makes little difference, since neither format is needed
once the documents have been indexed in Elasticsearch.
**(Elasticsearch JSON document)**
{{<highlight json>}}
{
"_id": "e594641d-8291-4f25-8031-2b69db231479",
"_index": "sist2",
"_type": "_doc",
"_source": {
"index": "bb3d8cc5-2e5c-4f1c-ac04-1b1f6d9b070a",
"mime": "video/webm",
"size": 2619920,
"mtime": 1554508492,
"extension": "webm",
"name": "bobross",
"path": "",
"videoc": "vp8",
"width": 1280,
"height": 720
}
}
{{</highlight>}}
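Comparing the hexdump with the JSON document, a few fields can be located by eye. The sketch below decodes them from the bytes shown above. The offsets and integer widths are inferred from this single record, not taken from the sist2 source, so treat the layout as an assumption; the remaining bytes are left undecoded.

```javascript
// Bytes copied from the hexdump above
const hex = `
e5 94 64 1d 82 91 4f 25 80 31 2b 69 db 23 14 79
dd 00 84 00 31 08 00 00 10 fa 27 00 00 00 00 00
8a 01 06 00 cc ea a7 5c 00 00 08 00 00 00 00 00
62 6f 62 72 6f 73 73 2e 77 65 62 6d 00 f6 8b 00
00 00 f2 00 05 00 00 f3 d0 02 00 00 0a`;
const record = Buffer.from(hex.replace(/\s+/g, ""), "hex");

// Offsets below are guesses based on matching values in the JSON document.
function decodeRecord(buf) {
  // 0x00: 16 raw bytes that match the document's _id once formatted as a UUID
  const u = buf.subarray(0, 16).toString("hex");
  const id = [u.slice(0, 8), u.slice(8, 12), u.slice(12, 16), u.slice(16, 20), u.slice(20)].join("-");
  // 0x18: file size, read here as a little-endian 64-bit integer
  const size = Number(buf.readBigUInt64LE(0x18));
  // 0x24: modification time, little-endian 32-bit Unix timestamp
  const mtime = buf.readUInt32LE(0x24);
  // 0x30: NUL-terminated file name
  const name = buf.toString("utf8", 0x30, buf.indexOf(0, 0x30));
  return { id, size, mtime, name };
}

console.log(decodeRecord(record));
```

The decoded `id`, `size` and `mtime` match the `_id`, `size` and `mtime` values of the JSON document, and the name is stored with its extension (`bobross.webm`).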

51
content/posts/zpaq.md Normal file

@@ -0,0 +1,51 @@
---
title: "Android phone backups with zpaq"
date: 2019-11-05T13:16:27-05:00
draft: true
author: simon987
---
{{< figure src="/zpaq/10gb.png" title="Benchmark for 10GB">}}
{{<highlight bash>}}
pkg install g++ make git
git clone "https://github.com/zpaq/zpaq"
cd zpaq/
# zpaq must be compiled with -DNOJIT for non-x86 processors
g++ -Ofast -DNOJIT -Dunix zpaq.cpp libzpaq.cpp -pthread -o zpaq
{{</highlight>}}
{{<highlight _>}}
## Initial backup can take a while to complete,
$ zpaq add "arc???" ./files/ -index local-index.zpaq
0.000000 + (955.283380 -> 687.840444 -> 622.268166) = 622.268166 MB
45.737 seconds (all OK)
## but subsequent ones are almost instantaneous if no files were changed
$ zpaq add "arc???" ./files/ -index local-index.zpaq
0.000000 + (0.000000 -> 0.000000 -> 0.000104) = 0.000104 MB
0.408 seconds (all OK)
##
$ ls -lh
total 594M
-rw------- 1 u0_a94 u0_a94 594M Nov 5 14:18 arc001.zpaq
-rw------- 1 u0_a94 u0_a94 104 Nov 5 14:18 arc002.zpaq
-rwx------ 1 u0_a94 u0_a94 362 Nov 5 14:17 backup.sh
-rw------- 1 u0_a94 u0_a94 411K Nov 5 14:18 local-index.zpaq
{{</highlight>}}
{{<highlight bash "linenos=table">}}
#!/usr/bin/env bash
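# Paths to back up. An array is used because a comment placed between
# backslash-continued lines would cut the command short.
paths=(
    ~/storage/shared/DCIM
    ~/storage/shared/Documents
    ~/storage/shared/Download
    #...
)
zpaq add "arc???" "${paths[@]}" -index local-index.zpaq -m2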
rclone move arc*.zpaq my-remote:/backups
{{</highlight>}}

BIN
diagrams/sist2_web.dia Normal file

Binary file not shown.

78
layouts/partials/css/tables-min.css vendored Normal file

@@ -0,0 +1,78 @@
/*!
Pure v1.0.0
Copyright 2013 Yahoo!
Licensed under the BSD License.
https://github.com/yahoo/pure/blob/master/LICENSE.md
*/
.pure-table {
/* Remove spacing between table cells (from Normalize.css) */
border-collapse: collapse;
border-spacing: 0;
empty-cells: show;
border: 1px solid #cbcbcb;
width: 100%;
}
.pure-table caption {
color: #000;
font: italic 85%/1 arial, sans-serif;
padding: 1em 0;
text-align: center;
}
.pure-table td,
.pure-table th {
border-left: 1px solid #cbcbcb;/* inner column border */
border-width: 0 0 0 1px;
font-size: inherit;
margin: 0;
overflow: visible; /*to make ths where the title is really long work*/
padding: 0.5em 1em; /* cell padding */
line-height: 1.1;
}
.pure-table thead {
background-color: #E0E0E0;
color: #000;
text-align: left;
vertical-align: bottom;
}
/*
striping:
even - #fff (white)
odd - #f2f2f2 (light gray)
*/
.pure-table td {
background-color: transparent;
}
.pure-table-odd td {
background-color: #f2f2f2;
}
/* nth-child selector for modern browsers */
.pure-table-striped tr:nth-child(2n-1) td {
background-color: #212121;
}
/* BORDERED TABLES */
.pure-table-bordered td {
border-bottom: 1px solid #cbcbcb;
}
.pure-table-bordered tbody > tr:last-child > td {
border-bottom-width: 0;
}
/* HORIZONTAL BORDERED TABLES */
.pure-table-horizontal td,
.pure-table-horizontal th {
border-width: 0 0 1px 0;
border-bottom: 1px solid #cbcbcb;
}
.pure-table-horizontal tbody > tr:last-child > td {
border-bottom-width: 0;
}

BIN
static/sist/font.png Normal file

Binary file not shown.

Size: 17 KiB

BIN
static/sist/sist.png Normal file

Binary file not shown.

Size: 671 KiB

Binary file not shown.

Size: 26 KiB

BIN
static/sist/sist_web.png Normal file

Binary file not shown.

Size: 169 KiB

BIN
static/zpaq/10gb.png Normal file

Binary file not shown.

Size: 54 KiB