diff --git a/README.md b/README.md
index 8261cc8..d74ab9c 100644
--- a/README.md
+++ b/README.md
@@ -8,9 +8,12 @@ sist2 (Simple incremental search tool)
 
 *Warning: sist2 is in early development*
 
+![sist2.png](sist2.png)
+
 ## Features
 
 * Fast, low memory usage, multi-threaded
+* Mobile-friendly Web interface
 * Portable (all its features are packaged in a single executable)
 * Extracts text from common file types \*
 * Generates thumbnails \*
@@ -26,11 +29,27 @@ sist2 (Simple incremental search tool)
 
 ## Getting Started
 
-1. Have an [Elasticsearch](https://www.elastic.co/downloads/elasticsearch) instance running
-1. 
+1. Have an Elasticsearch (>= 6.X.X) instance running
+    1. Download [from official website](https://www.elastic.co/downloads/elasticsearch)
+    1. *(or)* Run using docker:
+        ```bash
+        docker run -d --name es1 --net sist2_net -p 9200:9200 \
+            -e "discovery.type=single-node" elasticsearch:7.5.2
+        ```
+    1. *(or)* Run using docker-compose:
+        ```yaml
+        elasticsearch:
+          image: docker.elastic.co/elasticsearch/elasticsearch:7.5.2
+          environment:
+            - discovery.type=single-node
+            - "ES_JAVA_OPTS=-Xms1G -Xmx2G"
+        ```
+1. Download sist2 executable
     1. Download the [latest sist2 release](https://github.com/simon987/sist2/releases) *
     1. *(or)* Download a [development snapshot](https://files.simon987.net/artifacts/Sist2/Build/) *(Not recommended!)*
     1. *(or)* `docker pull simon987/sist2:latest`
+
+1. See [Usage guide](USAGE.md)
 
 
 \* *Windows users*: **sist2** runs under [WSL](https://en.wikipedia.org/wiki/Windows_Subsystem_for_Linux)
@@ -39,53 +58,11 @@ sist2 (Simple incremental search tool)
 
 ## Example usage
 
+See [Usage guide](USAGE.md) for more details
 
-![demo](demo.gif)
-
-See help page `sist2 --help` for more details.
-
-**Scan a directory**
-```bash
-sist2 scan ~/Documents -o ./orig_idx/
-sist2 scan --threads 4 --content-size 16384 /mnt/Pictures
-sist2 scan --incremental ./orig_idx/ -o ./updated_idx/ ~/Documents
-```
-
-**Push index to Elasticsearch or file**
-```bash
-sist2 index --force-reset ./my_idx
-sist2 index --print ./my_idx > raw_documents.ndjson
-```
-
-**Start web interface**
-```bash
-sist2 web --bind 0.0.0.0 --port 4321 ./my_idx1 ./my_idx2 ./my_idx3
-```
-
-### Use sist2 with docker
-
-**scan**
-```bash
-docker run -it \
-    -v /path/to/files/:/files \
-    -v $PWD/out/:/out \
-    simon987/sist2 scan -t 4 /files -o /out/my_idx1
-```
-**index**
-```bash
-docker run -it --network host\
-    -v $PWD/out/:/out \
-    simon987/sist2 index /out/my_idx1
-```
-
-**web**
-```bash
-docker run --rm --network host -d --name sist2\
-    -v $PWD/out/my_idx:/idx \
-    -v $PWD/my/files:/files
-    simon987/sist2 web --bind 0.0.0.0 /idx
-docker stop sist2
-```
+1. Scan a directory: `sist2 scan ~/Documents -o ./docs_idx`
+1. Push index to Elasticsearch: `sist2 index ./docs_idx`
+1. Start web interface: `sist2 web ./docs_idx`
 
 ## Format support
 
@@ -145,8 +122,9 @@ binaries.
     ```bash
     apt install git cmake pkg-config libglib2.0-dev \
         libssl-dev uuid-dev python3 libmagic-dev libfreetype6-dev \
-        libcurl-dev libbz2-dev yasm libharfbuzz-dev ragel \
-        libarchive-dev libtiff5 libpng16-16 libpango1.0-dev
+        libcurl4-openssl-dev libbz2-dev yasm libharfbuzz-dev ragel \
+        libarchive-dev libtiff5 libpng16-16 libpango1.0-dev \
+        libxml2-dev
     ```
 
 2. 
Build
diff --git a/USAGE.md b/USAGE.md
new file mode 100644
index 0000000..b245fb4
--- /dev/null
+++ b/USAGE.md
@@ -0,0 +1,275 @@
+# Usage
+
+*More examples (specifically with docker/compose) are in progress*
+
+* [scan](#scan)
+    * [options](#scan-options)
+    * [examples](#scan-examples)
+    * [index format](#index-format)
+* [index](#index)
+    * [options](#index-options)
+    * [examples](#index-examples)
+* [web](#web)
+    * [options](#web-options)
+    * [examples](#web-examples)
+    * [rewrite_url](#rewrite_url)
+    * [link to specific indices](#link-to-specific-indices)
+
+```
+Usage: sist2 scan [OPTION]... PATH
+   or: sist2 index [OPTION]... INDEX
+   or: sist2 web [OPTION]... INDEX...
+Lightning-fast file system indexer and search tool.
+
+    -h, --help            show this help message and exit
+    -v, --version         Show version and exit
+    --verbose             Turn on logging
+    --very-verbose        Turn on debug messages
+
+Scan options
+    -t, --threads=        Number of threads. DEFAULT=1
+    -q, --quality=        Thumbnail quality, on a scale of 1.0 to 31.0, 1.0 being the best. DEFAULT=5
+    --size=               Thumbnail size, in pixels. Use negative value to disable. DEFAULT=500
+    --content-size=       Number of bytes to be extracted from text documents. Use negative value to disable. DEFAULT=32768
+    --incremental=        Reuse an existing index and only scan modified files.
+    -o, --output=         Output directory. DEFAULT=index.sist2/
+    --rewrite-url=        Serve files from this url instead of from disk.
+    --name=               Index display name. DEFAULT: (name of the directory)
+    --depth=              Scan up to DEPTH subdirectories deep. Use 0 to only scan files in PATH. DEFAULT: -1
+    --archive=            Archive file mode (skip|list|shallow|recurse). skip: Don't parse, list: only get file names as text, shallow: Don't parse archives inside archives. 
DEFAULT: recurse
+    --ocr=                Tesseract language (use tesseract --list-langs to see which are installed on your machine)
+    -e, --exclude=        Files that match this regex will not be scanned
+    --fast                Only index file names & mime type
+
+Index options
+    --es-url=             Elasticsearch url with port. DEFAULT=http://localhost:9200
+    -p, --print           Just print JSON documents to stdout.
+    --script-file=        Path to user script.
+    --batch-size=         Index batch size. DEFAULT: 100
+    -f, --force-reset     Reset Elasticsearch mappings and settings. (You must use this option the first time you use the index command)
+
+Web options
+    --es-url=             Elasticsearch url. DEFAULT=http://localhost:9200
+    --bind=               Listen on this address. DEFAULT=localhost
+    --port=               Listen on this port. DEFAULT=4090
+    --auth=               Basic auth in user:password format
+Made by simon987 . Released under GPL-3.0
+
+```
+
+## Scan
+
+### Scan options
+
+* `-t, --threads`
+    Number of threads for file parsing. **Do not set a number higher than `$(nproc)`!**
+* `-q, --quality`
+    Thumbnail quality, on a scale of 1.0 to 31.0, 1.0 being the best. *Does not affect the quality of PDF thumbnails*
+* `--size`
+    Thumbnail size in pixels.
+* `--content-size`
+    Number of bytes of text to be extracted from the content of files (plain text and PDFs).
+    Repeated whitespace and special characters do not count toward this limit.
+* `--incremental`
+    Specify an existing index. Information about files in this index that were not modified (based on *mtime* attribute)
+    will be copied to the new index and will not be parsed again.
+* `-o, --output` Output directory.
+* `--rewrite-url` Set the `rewrite_url` option for the web module (See [rewrite_url](#rewrite_url))
+* `--name` Set the `name` option for the web module
+* `--depth` Maximum scan depth. Set to 0 to only scan files directly in the root directory; set to -1 for infinite depth
+* `--archive` Archive file mode.
+    * skip: Don't parse
+    * list: Only get file names as text
+    * shallow: Don't parse archives inside archives. 
+    * recurse: Scan archives recursively (default)
+* `--ocr` See [OCR](README.md#OCR)
+* `-e, --exclude` Regex pattern to exclude files. A file is excluded if the pattern matches any
+    part of the full absolute path.
+
+    Examples:
+    * `-e ".*\.ttf"`: Ignore ttf files
+    * `-e ".*\.(ttf|rar)"`: Ignore ttf and rar files
+    * `-e "^/mnt/backups/"`: Ignore all files in the `/mnt/backups/` directory
+    * `-e "^/mnt/Data[12]/"`: Ignore all files in the `/mnt/Data1/` and `/mnt/Data2/` directories
+    * `-e "(^/usr/)|(^/var/)|(^/media/DRIVE-A/tmp/)|(^/media/DRIVE-B/Trash/)"` Exclude the
+    `/usr`, `/var`, `/media/DRIVE-A/tmp`, `/media/DRIVE-B/Trash` directories
+* `--fast` Only index file names and mime type
+
+### Scan examples
+
+Simple scan
+```bash
+sist2 scan ~/Documents
+
+sist2 scan \
+    --threads 4 --content-size 16000000 --quality 1.0 --archive shallow \
+    --name "My Documents" --rewrite-url "http://nas.domain.local/My Documents/" \
+    ~/Documents -o ./documents.idx/
+```
+
+Incremental scan
+```
+sist2 scan --incremental ./orig_idx/ -o ./updated_idx/ ~/Documents
+```
+
+### Index format
+
+A typical `binary` type index structure looks like this:
+```
+documents.idx/
+├── descriptor.json
+├── _index_139965416830720
+├── _index_139965425223424
+├── _index_139965433616128
+├── _index_139965442008832
+└── thumbs
+    ├── data.mdb
+    └── lock.mdb
+```
+
+The `_index_*` files contain the raw binary index data and are not meant to be
+read by other applications. The format is generally compatible across different
+sist2 versions.
+
+The `thumbs/` folder is an [LMDB](https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database)
+database containing the thumbnails.
+
+The `descriptor.json` file contains general information about the index. The
+following fields are safe to modify manually: `root`, `name`, [rewrite_url](#rewrite_url) and `timestamp`.
+
+
+*Advanced usage*
+
+Instead of using the `scan` module, you can also import an index generated
+by a third-party application. 
The 'external' index must have the following format:
+
+```
+my_index/
+├── descriptor.json
+├── _index_0
+└── thumbs
+    ├── data.mdb
+    └── lock.mdb
+```
+
+*descriptor.json*:
+```json
+{
+    "uuid": "",
+    "version": "_external_v1",
+    "root": "(optional)",
+    "name": "",
+    "rewrite_url": "(optional)",
+    "type": "json",
+    "timestamp": 1578971024
+}
+```
+
+*_index_0*: NDJSON format (one JSON object per line)
+
+```json
+{
+    "_id": "unique uuid for the file",
+    "index": "index uuid4 (same one as descriptor.json!)",
+    "mime": "application/x-cbz",
+    "size": 14341204,
+    "mtime": 1578882996,
+    "extension": "cbz",
+    "name": "my_book",
+    "path": "path/to/books",
+    "content": "text contents of the book",
+    "title": "Title of the book",
+    "tag": ["genre.fiction", "author.someguy", "etc..."],
+    "_keyword": [
+        {"k": "ISBN", "v": "ABCD34789231"}
+    ],
+    "_text": [
+        {"k": "other", "v": "This will be indexed as text"}
+    ]
+}
+```
+
+You can find the full list of supported fields [here](src/io/serialize.c#L90).
+
+The `_keyword.*` items will be indexed and searchable as **keyword** fields (only full matches allowed).
+The `_text.*` items will be indexed and searchable as **text** fields (fuzzy searching allowed).
+
+
+*thumbs/*:
+
+LMDB key-value store. Keys are **binary** 128-bit UUID4s (`_id` field)
+and values are raw image bytes.
+
+Importing an external `binary` type index is technically possible, but
+it is currently unsupported and has no guarantees of backward or forward compatibility.
+
+
+## Index
+### Index options
+ * `--es-url`
+    Elasticsearch url and port. If you are using docker, make sure that both containers are on the
+    same network.
+ * `-p, --print`
+    Print index in JSON format to stdout.
+ * `--script-file`
+    Path to user script. See [Scripting](scripting/README.md).
+ * `--batch-size=`
+    Index batch size. 
Indexing is generally faster with larger batches, but payloads that
+    are too large will fail and additional overhead for retrying with smaller sizes may slow
+    down the process.
+ * `-f, --force-reset`
+    Reset Elasticsearch mappings and settings.
+    **(You must use this option the first time you use the index command)**.
+
+### Index examples
+
+**Push to Elasticsearch**
+```bash
+sist2 index --force-reset --batch-size 1000 --es-url http://localhost:9200 ./my_index/
+sist2 index ./my_index/
+```
+
+**Save index in JSON format**
+```bash
+sist2 index --print ./my_index/ > my_index.ndjson
+```
+
+**Inspect contents of an index**
+```bash
+sist2 index --print ./my_index/ | jq | less
+```
+
+## Web
+
+### Web options
+ * `--es-url=` Elasticsearch url.
+ * `--bind=` Listen on this address.
+ * `--port=` Listen on this port.
+ * `--auth=` Basic auth in user:password format
+
+### Web examples
+
+**Single index**
+```bash
+sist2 web --auth admin:hunter2 --bind 0.0.0.0 --port 8888 my_index
+```
+
+**Multiple indices**
+```bash
+# Indices will be displayed in this order in the web interface
+sist2 web index1 index2 index3 index4
+```
+
+### rewrite_url
+
+When the `rewrite_url` field is not empty, the web module ignores the `root`
+field and returns an HTTP redirect to `/`
+instead of serving the file from disk.
+Both the `root` and `rewrite_url` fields are safe to manually modify from the
+`descriptor.json` file.
+
+### Link to specific indices
+
+To link to specific indices, you can add a list of comma-separated index names to
+the URL: `?i=,`. By default, indices with `"(nsfw)"` in their name are
+not displayed.
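The external index layout described under *Index format* in USAGE.md can be sketched in a few lines of Python. This is a minimal sketch, not part of sist2: the function name `write_external_index` is hypothetical, the optional `root` and `rewrite_url` descriptor fields are left out, and the `thumbs/` LMDB store is not created (the text does not say whether it may be omitted, so treat that as an assumption).

```python
import json
import time
import uuid
from pathlib import Path


def write_external_index(out_dir, name, documents):
    """Write descriptor.json and _index_0 (NDJSON) for an '_external_v1' index.

    Hypothetical helper: `documents` is an iterable of dicts using the
    field names shown in USAGE.md (mime, size, mtime, name, path, ...).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    index_id = str(uuid.uuid4())
    descriptor = {
        "uuid": index_id,
        "version": "_external_v1",
        "name": name,
        "type": "json",
        "timestamp": int(time.time()),
    }
    (out / "descriptor.json").write_text(json.dumps(descriptor, indent=4))

    # One JSON object per line. Each document gets a fresh "_id" unless the
    # caller supplies one, and always references the descriptor's uuid.
    with open(out / "_index_0", "w", encoding="utf-8") as f:
        for doc in documents:
            record = {"_id": str(uuid.uuid4()), **doc, "index": index_id}
            f.write(json.dumps(record) + "\n")
    return index_id
```

If the layout is accepted as-is, the resulting directory would then be passed to `sist2 index` like any other index.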
diff --git a/demo.gif b/demo.gif
deleted file mode 100644
index d7832ab..0000000
Binary files a/demo.gif and /dev/null differ
diff --git a/scripting/README.md b/scripting/README.md
index ad3c2f9..f0b43f7 100644
--- a/scripting/README.md
+++ b/scripting/README.md
@@ -54,6 +54,11 @@ script.painless.regex.enabled: true
 ```
 Or, if you're using docker add `-e "script.painless.regex.enabled=true"`
 
+**Tag color**
+
+You can specify the color for an individual tag by appending a
+hexadecimal color code (`#RRGGBBAA`) to the tag name.
+
 ### Examples
 
 If `(20XX)` is in the file name, add the `year.` tag:
@@ -115,3 +120,33 @@ if (ctx._source.path != "") {
     tags.add("studio." + names[names.length-1]);
 }
 ```
+
+Parse `EXIF:F Number` tag
+```Java
+if (ctx._source?.exif_fnumber != null) {
+    String[] values = ctx._source.exif_fnumber.splitOnToken(' ');
+    String aperture = String.valueOf(Float.parseFloat(values[0]) / Float.parseFloat(values[1]));
+    if (aperture == "NaN") {
+        aperture = "0,0";
+    }
+    tags.add("Aperture.f/" + aperture.replace(".", ","));
+}
+```
+
+Display year and months from `EXIF:DateTime` tag
+```Java
+if (ctx._source?.exif_datetime != null) {
+    SimpleDateFormat parser = new SimpleDateFormat("yyyy:MM:dd HH:mm:ss");
+    Date date = parser.parse(ctx._source.exif_datetime);
+
+    SimpleDateFormat yp = new SimpleDateFormat("yyyy");
+    SimpleDateFormat mp = new SimpleDateFormat("MMMMMMMMM");
+
+    String year = yp.format(date);
+    String month = mp.format(date);
+
+    tags.add("Month." + month);
+    tags.add("Year." + year);
+}
+
+```
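To try the `EXIF:F Number` tag logic outside Elasticsearch, the Painless snippet can be mirrored in plain Python. This is a rough sketch under stated assumptions: `aperture_tag` is a hypothetical helper, the input is assumed to be a `"numerator denominator"` string (as the `splitOnToken(' ')` call suggests), and the color code is assumed to be concatenated directly onto the tag name. Number formatting may also differ slightly from Java's `String.valueOf(float)` for non-terminating fractions.

```python
import re

# Assumed shape of a tag color suffix: '#' followed by 8 hex digits (RRGGBBAA).
COLOR_RE = re.compile(r"#[0-9a-fA-F]{8}")


def aperture_tag(exif_fnumber, color=None):
    """Build an 'Aperture.f/<value>' tag from a 'numerator denominator' string."""
    numerator, denominator = (float(v) for v in exif_fnumber.split(" "))
    # Java float division yields NaN for 0/0 (and Infinity for x/0), while
    # Python raises ZeroDivisionError. This sketch applies the "0,0" fallback
    # for any zero denominator, which is slightly broader than the script.
    if denominator == 0:
        value = "0,0"
    else:
        value = str(numerator / denominator).replace(".", ",")
    tag = "Aperture.f/" + value
    if color is not None:
        if not COLOR_RE.fullmatch(color):
            raise ValueError("color must look like #RRGGBBAA")
        # Assumption: the color code is appended with no separator.
        tag += color
    return tag
```

For example, `aperture_tag("28 10")` yields `Aperture.f/2,8`, and passing `color="#FF0000FF"` appends that code to the tag name.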