mirror of
https://github.com/simon987/dataarchivist.net.git
synced 2025-04-04 08:42:58 +00:00
sist2
parent
5270edcc89
commit
455c9a7144
196
content/posts/sist2.md
Normal file
@ -0,0 +1,196 @@
---
title: "Indexing your files with sist2"
date: 2019-11-04T19:31:45-05:00
draft: false
author: simon987
---


# Overview

[sist2](https://github.com/simon987/sist2) (simple incremental search tool) is a more powerful and more lightweight version of
its [Python predecessor](https://github.com/simon987/Simple-Incremental-Search-Tool).
It is currently used to provide full-text search over terabytes of online documents, such as scientific papers and comic books,
at [the-eye.eu](https://searchin.the-eye.eu/).

It can parse many common file types (see [README.md](https://github.com/simon987/sist2/blob/master/README.md#format-support) for
the up-to-date list) and extracts text from their metadata and contents.

Indexing is typically done in three steps: `scan`, `index`, then `web`.
For example, the `scan` step:

{{<highlight bash>}}
sist2 scan ./my_documents/ -o idx/
{{</highlight>}}

After this step, the raw index (**./idx/**) has been created and direct access to the files is no longer necessary.
This means that you can pass around the raw index folder, or use sist2 to index files kept on cold storage.

The `index` step converts the raw index into JSON documents and pushes them to Elasticsearch. sist2
is compatible with Elasticsearch versions 6.x and 7.x.

{{<highlight bash>}}
# Start a debug Elasticsearch instance
docker run -d -p 9201:9200 \
    -e "discovery.type=single-node" \
    docker.elastic.co/elasticsearch/elasticsearch:7.4.2

# The --force-reset flag tells sist2 to (re)initialize
# the Elasticsearch mappings & settings
sist2 index idx/ --force-reset --es-url http://localhost:9201
{{</highlight>}}
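To give a rough idea of what "pushing JSON documents to Elasticsearch" can look like on the wire, here is a minimal sketch that builds a request body for Elasticsearch's `_bulk` endpoint. It assumes documents are sent in bulk, which may not match what sist2 does internally. `toBulkBody` is a hypothetical helper, and the sample document only borrows a few fields for illustration.

```javascript
// Hypothetical helper (not part of sist2): serialize documents into a body
// for the Elasticsearch _bulk endpoint. Each document becomes two lines:
// an action line, followed by the document source.
function toBulkBody(indexName, docs) {
    const lines = [];
    for (const doc of docs) {
        const { _id, ...source } = doc;
        lines.push(JSON.stringify({ index: { _index: indexName, _id } }));
        lines.push(JSON.stringify(source));
    }
    // The bulk API requires the body to end with a newline
    return lines.join("\n") + "\n";
}

const body = toBulkBody("sist2", [
    { _id: "e594641d-8291-4f25-8031-2b69db231479", mime: "video/webm", size: 2619920 },
]);
console.log(body);
```

A body like this would be sent with `Content-Type: application/x-ndjson`.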

Finally, the `web` step starts the search interface:

{{<highlight bash>}}
sist2 web idx/ --port 8080
# Starting web server @ http://localhost:8080
{{</highlight>}}

## Web interface

The web module can serve the search interface on its own, without additional configuration.
The files themselves can either be served by a remote HTTP server that
acts as an external CDN, or by sist2 directly from disk. In the latter case, Partial
Content is supported, meaning that `Range` requests are accepted and media files can be *seeked* from
the browser.

{{< figure src="/sist/sist_web.png" title="">}}
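To illustrate what accepting `Range` requests involves, here is a minimal sketch of how a single-range header could be resolved against a file size. This is not sist2's implementation. `parseRange` is a hypothetical helper that only handles the single-range `bytes=` forms and ignores multi-range requests.

```javascript
// Hypothetical helper (not sist2's code): resolve a single-range `Range`
// header against a file size. Returns inclusive byte offsets, or null if
// the header is unsupported or cannot be satisfied.
function parseRange(header, fileSize) {
    const match = /^bytes=(\d*)-(\d*)$/.exec(header);
    if (!match || (match[1] === "" && match[2] === "")) {
        return null;
    }
    let start, end;
    if (match[1] === "") {
        // Suffix form: "bytes=-500" means the last 500 bytes
        const suffixLength = parseInt(match[2], 10);
        if (suffixLength === 0) {
            return null;
        }
        start = Math.max(fileSize - suffixLength, 0);
        end = fileSize - 1;
    } else {
        start = parseInt(match[1], 10);
        // Open-ended form: "bytes=1000-" means "until the end of the file"
        end = match[2] === "" ? fileSize - 1 : Math.min(parseInt(match[2], 10), fileSize - 1);
    }
    if (start > end || start >= fileSize) {
        return null;
    }
    return { start, end };
}

// Seeking into the middle of a 2619920-byte video
const fileSize = 2619920;
const range = parseRange("bytes=1048576-", fileSize);
// A 206 Partial Content response would carry this header
console.log(`Content-Range: bytes ${range.start}-${range.end}/${fileSize}`);
```

When the browser seeks in a video, it sends a new request with a different `Range` header, and only the requested slice of the file has to be read from disk.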

The UI itself is not very different from the original Python/Flask version. However, the JavaScript
client is a bit *thicker*: most operations that were originally handled by the Flask server,
such as auto-complete and the retrieval of the mime type list, are now done client-side.

This is possible because Elasticsearch queries are proxied as-is through sist2. For example, the mime type
selection widget is populated with a function similar to this:

{{<highlight javascript>}}
$.ajax({
    url: "es",
    method: "POST",
    // Elasticsearch expects a JSON body, so serialize it explicitly
    contentType: "application/json",
    data: JSON.stringify({
        // Elasticsearch query body
        aggs: {
            mimeTypes: {
                terms: {
                    field: "mime",
                    size: 10000
                }
            }
        },
        size: 0,
    }),
}).then(resp => {
    resp["aggregations"]["mimeTypes"]["buckets"].forEach(bucket => {
        console.log(bucket);
        //...
    });
});
{{</highlight>}}
{{< figure src="/sist/sist_buckets.png" title="">}}
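As an illustration of what the client can do with those buckets, the sketch below groups them by top-level mime type, which is one way a selection widget could be structured. `groupMimeBuckets` is a hypothetical helper, the counts are made up, and the actual sist2 client may organize the data differently.

```javascript
// Hypothetical helper (not sist2's code): group the buckets of a `terms`
// aggregation by top-level mime type.
function groupMimeBuckets(buckets) {
    const groups = {};
    for (const bucket of buckets) {
        // "video/webm" -> category "video", subtype "webm"
        const [category, subtype] = bucket.key.split("/");
        if (!groups[category]) {
            groups[category] = { total: 0, children: [] };
        }
        groups[category].total += bucket.doc_count;
        groups[category].children.push({ subtype, count: bucket.doc_count });
    }
    return groups;
}

// Sample buckets in the shape returned by a `terms` aggregation (made-up counts)
const groups = groupMimeBuckets([
    { key: "video/webm", doc_count: 12 },
    { key: "video/mp4", doc_count: 30 },
    { key: "application/pdf", doc_count: 7 },
]);
console.log(groups);
```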

Another improvement was to re-skin the whole page so that users can choose a dark, *OLED*-friendly theme. Pressing
the theme toggle button sets `Cookie: sist=dark`, and sist2 serves different content depending on the value
of that cookie.

{{< figure src="/sist/sist.png" title="Web interface (Dark theme) displaying Occult Library books">}}
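The cookie-based theme selection described above can be sketched as follows. This is only an illustration of the idea, not how sist2 implements it. `themeFromCookieHeader` is a hypothetical helper, and the `"light"` name for the default theme is an assumption.

```javascript
// Hypothetical helper (not sist2's code): pick a theme from the raw
// `Cookie` request header. "light" as the default theme name is an assumption.
function themeFromCookieHeader(cookieHeader) {
    for (const pair of (cookieHeader || "").split(";")) {
        const [name, ...rest] = pair.trim().split("=");
        if (name === "sist") {
            return rest.join("=") === "dark" ? "dark" : "light";
        }
    }
    // No `sist` cookie: fall back to the default theme
    return "light";
}

console.log(themeFromCookieHeader("lang=en; sist=dark"));
console.log(themeFromCookieHeader(""));
```

Keeping the choice in a cookie means the server can decide which stylesheet to send before any JavaScript runs.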


## Thumbnail storage

An [LMDB](https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database) (Lightning Memory-Mapped Database)
key-value store is used to asynchronously save the thumbnails as they are generated by the indexer.
Once the `scan` step is done, the database file is used by the `web` module to serve the thumbnails
with very little latency.

Since the database is mapped in memory (see [mmap(2)](https://en.wikipedia.org/wiki/Mmap)),
the `web` process may appear to have high memory usage under load,
but almost all of it is accounted for by the memory-mapped **data.mdb** file. In fact, if we take a look with `pmap`,
we can see that virtually all of the resident memory is used for LMDB and that none of it is *dirty*.
This means that the operating system can reclaim those pages whenever it needs to and that, over time, the memory usage will
return to ~20M.

{{<highlight _>}}
$ pmap -x <PID>

Address            Kbytes     RSS   Dirty Mode  Mapping
00005641d5689000    21300     536       0 r-x-- sist2
00005641d5689000        0       0       0 r-x-- sist2
00005641d6d56000      432       8       0 r-x-- sist2
00005641d6d56000        0       0       0 r-x-- sist2
00005641d6dc2000    32696     768       8 rwx-- sist2
00005641d6dc2000        0       0       0 rwx-- sist2
00005641d8db0000     8452     100       4 rwx-- [ anon ]
...
00007fd1d7419000  3180068  160000       0 rwxs- data.mdb
00007fd2998a4000  2290452     240       0 rwxs- data.mdb
00007fd32586b000 10721328   64868       0 rwxs- data.mdb
00007fd5b4179000  3535892   51616       0 rwxs- data.mdb
00007fd68c180000  4446024  118668       0 rwxs- data.mdb
00007fd79ba54000  1411416   47992       0 rwxs- data.mdb
00007fd7f1fac000   560000    6044       0 rwxs- data.mdb
00007fd81458e000  9069792  217464       0 rwxs- data.mdb
...
00007fda42736000     2048       0       0 ----- libc-2.24.so
---------------- -------- ------- -------
total kB         36085472  683468   10472
{{</highlight>}}
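To check how much of the resident memory belongs to the LMDB file without reading the table by hand, the RSS column can be summed per mapping. The sketch below is one way to do it. `rssForMapping` is a hypothetical helper, and it assumes the six-column layout of `pmap -x` shown above.

```javascript
// Hypothetical helper: sum the RSS column (in KiB) of `pmap -x` output
// for all mappings with a given name.
function rssForMapping(pmapOutput, mappingName) {
    let total = 0;
    for (const line of pmapOutput.split("\n")) {
        // Columns: Address Kbytes RSS Dirty Mode Mapping
        const columns = line.trim().split(/\s+/);
        if (columns.length >= 6 && columns.slice(5).join(" ") === mappingName) {
            total += parseInt(columns[2], 10);
        }
    }
    return total;
}

// Three of the lines from the output above
const sample = `
00005641d5689000    21300     536       0 r-x-- sist2
00007fd1d7419000  3180068  160000       0 rwxs- data.mdb
00007fd81458e000  9069792  217464       0 rwxs- data.mdb
`;
console.log(rssForMapping(sample, "data.mdb")); // 377464
```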


## Media Files

All audio and video files are handled by ffmpeg's libav\* libraries, which is extremely helpful since
we can handle all `audio/*`, `video/*` and `image/*` file types the same way (images are treated as videos
that have only one frame).
For instance, there is no difference in the code between thumbnails that are generated from the embedded cover art of a `.mp3`
file and thumbnails generated from a video stream of a `.mkv` container. We also don't have to worry about
odd encodings, because ffmpeg is bundled with hundreds of decoders.

## Font Files

Font files were especially painful to work with, since I had to implement the code
to generate the thumbnails mostly from scratch. Each letter is individually drawn into
a bitmap, which is then converted to the uncompressed *BMP* format and saved directly to disk.
Thankfully, *most* font faces are relatively standard, in that they are meant
to be displayed from left to right and
provide glyphs for the basic Latin alphabet.

{{< figure src="/sist/font.png" title="">}}

For the rest, I would mostly have to handle each corner case one by one. At the
time of writing this, I gave up on trying to render atypical font faces.
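To show what converting a bitmap to uncompressed BMP involves, here is a minimal sketch of the file layout. This is not the sist2 code. `encodeBmp` is a hypothetical helper, and the 8-bit grayscale input and 24-bit output are assumptions made to keep the example short.

```javascript
// Hypothetical helper (not sist2's code): encode an 8-bit grayscale bitmap
// as an uncompressed 24-bit BMP. `pixels` is a Uint8Array of width * height
// values, with rows stored top to bottom.
function encodeBmp(pixels, width, height) {
    const rowSize = Math.ceil((width * 3) / 4) * 4; // rows are padded to 4 bytes
    const headerSize = 14 + 40;                     // file header + BITMAPINFOHEADER
    const fileSize = headerSize + rowSize * height;
    const bytes = new Uint8Array(fileSize);
    const view = new DataView(bytes.buffer);

    // File header
    bytes[0] = 0x42; // 'B'
    bytes[1] = 0x4d; // 'M'
    view.setUint32(2, fileSize, true);
    view.setUint32(10, headerSize, true); // offset of the pixel data

    // BITMAPINFOHEADER
    view.setUint32(14, 40, true);      // size of this header
    view.setInt32(18, width, true);
    view.setInt32(22, height, true);   // positive height: rows are stored bottom-up
    view.setUint16(26, 1, true);       // colour planes
    view.setUint16(28, 24, true);      // bits per pixel
    view.setUint32(30, 0, true);       // BI_RGB, i.e. no compression
    view.setUint32(34, rowSize * height, true);

    for (let y = 0; y < height; y++) {
        const rowStart = headerSize + (height - 1 - y) * rowSize;
        for (let x = 0; x < width; x++) {
            const value = pixels[y * width + x];
            // Same value for the blue, green and red channels
            bytes.fill(value, rowStart + x * 3, rowStart + x * 3 + 3);
        }
    }
    return bytes;
}

// 2x2 checkerboard: 54 header bytes + 2 rows padded to 8 bytes each
const bmp = encodeBmp(new Uint8Array([0, 255, 255, 0]), 2, 2);
console.log(bmp.length); // 70
```

The format needs no compression library, so the only work is writing two small headers and copying rows in reverse order.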

## Raw Index Binary Format

For simplicity's sake, the document metadata structure is dumped directly from memory to
file without much additional processing. While it's not as space-efficient as it could be,
it is still much smaller than the equivalent JSON, which is roughly 3.5 times larger.

**idx/_index\_\<pid\>**
{{<highlight hexdump>}}
000 e5 94 64 1d 82 91 4f 25 80 31 2b 69 db 23 14 79 ..d...O%.1+i.#.y
010 dd 00 84 00 31 08 00 00 10 fa 27 00 00 00 00 00 ....1.....'.....
020 8a 01 06 00 cc ea a7 5c 00 00 08 00 00 00 00 00 .......\........
030 62 6f 62 72 6f 73 73 2e 77 65 62 6d 00 f6 8b 00 bobross.webm....
040 00 00 f2 00 05 00 00 f3 d0 02 00 00 0a
{{</highlight>}}

This, of course, makes little difference, since neither format is needed
once the documents have been indexed into Elasticsearch.

**(Elasticsearch JSON document)**
{{<highlight json>}}
{
    "_id": "e594641d-8291-4f25-8031-2b69db231479",
    "_index": "sist2",
    "_type": "_doc",
    "_source": {
        "index": "bb3d8cc5-2e5c-4f1c-ac04-1b1f6d9b070a",
        "mime": "video/webm",
        "size": 2619920,
        "mtime": 1554508492,
        "extension": "webm",
        "name": "bobross",
        "path": "",
        "videoc": "vp8",
        "width": 1280,
        "height": 720
    }
}
{{</highlight>}}
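To make the relationship between the two representations more concrete, the sketch below decodes a few fields from the bytes of the hexdump. The offsets are not taken from the sist2 source. They were inferred by matching the bytes of this one entry against the values in the JSON document (16-byte id, little-endian size and mtime, null-terminated file path), so treat them as assumptions. `decodeEntry` is a hypothetical helper.

```javascript
// The 77 bytes from the hexdump above
const entry = Uint8Array.from(
    ("e5 94 64 1d 82 91 4f 25 80 31 2b 69 db 23 14 79 " +
     "dd 00 84 00 31 08 00 00 10 fa 27 00 00 00 00 00 " +
     "8a 01 06 00 cc ea a7 5c 00 00 08 00 00 00 00 00 " +
     "62 6f 62 72 6f 73 73 2e 77 65 62 6d 00 f6 8b 00 " +
     "00 00 f2 00 05 00 00 f3 d0 02 00 00 0a").split(" "),
    (byte) => parseInt(byte, 16)
);

// Hypothetical decoder: offsets are inferred from this single example
function decodeEntry(bytes) {
    const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);

    // Bytes 0x00-0x0f: document id, formatted as a UUID
    const hex = Array.from(bytes.subarray(0, 16), (b) => b.toString(16).padStart(2, "0")).join("");
    const id = [hex.slice(0, 8), hex.slice(8, 12), hex.slice(12, 16), hex.slice(16, 20), hex.slice(20)].join("-");

    // Bytes from 0x30: null-terminated file path
    const pathEnd = bytes.indexOf(0, 0x30);
    const path = new TextDecoder().decode(bytes.subarray(0x30, pathEnd));

    return {
        id,
        size: Number(view.getBigUint64(0x18, true)), // 64-bit little-endian
        mtime: view.getUint32(0x24, true),           // 32-bit little-endian
        path,
    };
}

console.log(decodeEntry(entry));
```

The decoded values line up with the `_id`, `size` and `mtime` fields of the JSON document, and the path splits into its `name` and `extension`.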
51
content/posts/zpaq.md
Normal file
@ -0,0 +1,51 @
---
title: "Android phone backups with zpaq"
date: 2019-11-05T13:16:27-05:00
draft: true
author: simon987
---

{{< figure src="/zpaq/10gb.png" title="Benchmark for 10GB">}}

{{<highlight bash>}}
pkg install g++ make git
git clone "https://github.com/zpaq/zpaq"
cd zpaq/
# zpaq must be compiled with -DNOJIT for non-x86 processors
g++ -Ofast -DNOJIT -Dunix zpaq.cpp libzpaq.cpp -pthread -o zpaq
{{</highlight>}}
{{<highlight _>}}
## Initial backup can take a while to complete,
$ zpaq add "arc???" ./files/ -index local-index.zpaq
0.000000 + (955.283380 -> 687.840444 -> 622.268166) = 622.268166 MB
45.737 seconds (all OK)

## but subsequent ones are almost instantaneous if no files were changed
$ zpaq add "arc???" ./files/ -index local-index.zpaq
0.000000 + (0.000000 -> 0.000000 -> 0.000104) = 0.000104 MB
0.408 seconds (all OK)


##
$ ls -lh
total 594M
-rw------- 1 u0_a94 u0_a94 594M Nov 5 14:18 arc001.zpaq
-rw------- 1 u0_a94 u0_a94 104 Nov 5 14:18 arc002.zpaq
-rwx------ 1 u0_a94 u0_a94 362 Nov 5 14:17 backup.sh
-rw------- 1 u0_a94 u0_a94 411K Nov 5 14:18 local-index.zpaq
{{</highlight>}}
{{<highlight bash "linenos=table">}}
#!/usr/bin/env bash

# Note: a comment line cannot sit inside a backslash-continued command,
# so the options come last and the elided folders are marked below.
zpaq add "arc???" \
    ~/storage/shared/DCIM \
    ~/storage/shared/Documents \
    ~/storage/shared/Download \
    -index local-index.zpaq -m2
#... (list any other folders before the -index line)

rclone move arc*.zpaq my-remote:/backups
{{</highlight>}}
BIN
diagrams/sist2_web.dia
Normal file
Binary file not shown.
78
layouts/partials/css/tables-min.css
vendored
Normal file
@ -0,0 +1,78 @
/*!
Pure v1.0.0
Copyright 2013 Yahoo!
Licensed under the BSD License.
https://github.com/yahoo/pure/blob/master/LICENSE.md
*/
.pure-table {
    /* Remove spacing between table cells (from Normalize.css) */
    border-collapse: collapse;
    border-spacing: 0;
    empty-cells: show;
    border: 1px solid #cbcbcb;
    width: 100%;
}

.pure-table caption {
    color: #000;
    font: italic 85%/1 arial, sans-serif;
    padding: 1em 0;
    text-align: center;
}

.pure-table td,
.pure-table th {
    border-left: 1px solid #cbcbcb;/* inner column border */
    border-width: 0 0 0 1px;
    font-size: inherit;
    margin: 0;
    overflow: visible; /*to make ths where the title is really long work*/
    padding: 0.5em 1em; /* cell padding */
    line-height: 1.1;
}

.pure-table thead {
    background-color: #E0E0E0;
    color: #000;
    text-align: left;
    vertical-align: bottom;
}

/*
striping:
even - #fff (white)
odd - #f2f2f2 (light gray)
*/
.pure-table td {
    background-color: transparent;
}
.pure-table-odd td {
    background-color: #f2f2f2;
}

/* nth-child selector for modern browsers */
.pure-table-striped tr:nth-child(2n-1) td {
    background-color: #212121;
}

/* BORDERED TABLES */
.pure-table-bordered td {
    border-bottom: 1px solid #cbcbcb;
}
.pure-table-bordered tbody > tr:last-child > td {
    border-bottom-width: 0;
}


/* HORIZONTAL BORDERED TABLES */

.pure-table-horizontal td,
.pure-table-horizontal th {
    border-width: 0 0 1px 0;
    border-bottom: 1px solid #cbcbcb;
}
.pure-table-horizontal tbody > tr:last-child > td {
    border-bottom-width: 0;
}

BIN
static/sist/font.png
Normal file
Binary file not shown.
Size: 17 KiB
BIN
static/sist/sist.png
Normal file
Binary file not shown.
Size: 671 KiB
BIN
static/sist/sist_buckets.png
Normal file
Binary file not shown.
Size: 26 KiB
BIN
static/sist/sist_web.png
Normal file
Binary file not shown.
Size: 169 KiB
BIN
static/zpaq/10gb.png
Normal file
Binary file not shown.
Size: 54 KiB