User scripts, bug fixes, docker image

bugfix with invalid/corrupted index path
2025-12-12 15:08:53 +00:00 · 2019-11-12 20:58:43 -05:00 · 2019-11-11 20:49:38 -05:00 · 2019-11-09 23:26:49 -05:00 · 2019-11-09 17:15:20 -05:00 · 2019-11-09 15:18:44 -05:00
35 changed files with 610 additions and 118 deletions
--- a/.gitmodules
+++ b/.gitmodules
@@ -25,3 +25,9 @@
 [submodule "lib/harfbuzz"]
 	path = lib/harfbuzz
 	url = https://github.com/harfbuzz/harfbuzz
+[submodule "lib/libmagic"]
+	path = lib/libmagic
+	url = https://github.com/threatstack/libmagic
+[submodule "lib/bzip2-1.0.6"]
+	path = lib/bzip2-1.0.6
+	url = https://github.com/enthought/bzip2-1.0.6
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -122,10 +122,10 @@ if (WITH_SIST2)

    target_compile_options(sist2
            PRIVATE
-            -Ofast
+#            -Ofast
 #                    -march=native
-            -fno-stack-protector
-            -fomit-frame-pointer
+#            -fno-stack-protector
+#            -fomit-frame-pointer
            )

    TARGET_LINK_LIBRARIES(
--- a/Docker/Dockerfile
+++ b/Docker/Dockerfile
@@ -0,0 +1,9 @@
+FROM ubuntu:19.10
+MAINTAINER simon987 <me@simon987.net>
+
+RUN apt update
+RUN apt install -y libglib2.0-0 libcurl4 libmagic1 libharfbuzz-bin libopenjp2-7
+
+ADD sist2 /root/sist2
+
+ENTRYPOINT ["/root/sist2"]
--- a/Docker/build.sh
+++ b/Docker/build.sh
@@ -0,0 +1,8 @@
+cp ../sist2 .
+
+version=$(./sist2 --version)
+
+echo "Version ${version}"
+docker build . -t simon987/sist2:${version} -t simon987/sist2:latest
+docker push simon987/sist2:${version}
+docker push simon987/sist2:latest
--- a/README.md
+++ b/README.md
@@ -14,6 +14,7 @@ sist2 (Simple incremental search tool)
 * Extracts text from common file types\*
 * Generates thumbnails\*
 * Incremental scanning
+* Automatic tagging from file attributes via [user scripts](scripting/README.md)


 \* See [format support](#format-support)
@@ -21,11 +22,13 @@ sist2 (Simple incremental search tool)
 ## Getting Started

 1. Have an [Elasticsearch](https://www.elastic.co/downloads/elasticsearch) instance running
-1. Download the [latest sist2 release](https://github.com/simon987/sist2/releases)
+1. Get sist2, using either option:
+    1. Download the [latest sist2 release](https://github.com/simon987/sist2/releases) *
+    1. *(or)* `docker pull simon987/sist2:latest`
+   

-*Windows users*: `sist2` runs under [WSL](https://en.wikipedia.org/wiki/Windows_Subsystem_for_Linux)
-
-*Mac users*: See [#1](https://github.com/simon987/sist2/issues/1)
+\* *Windows users*: **sist2** runs under [WSL](https://en.wikipedia.org/wiki/Windows_Subsystem_for_Linux)    
+\* *Mac users*: See [#1](https://github.com/simon987/sist2/issues/1)


 ## Example usage
@@ -52,14 +55,40 @@ sist2 index --print ./my_idx > raw_documents.ndjson
 sist2 web --bind 0.0.0.0 --port 4321 ./my_idx1 ./my_idx2 ./my_idx3
 ```

+### Use sist2 with Docker
+
+**scan**
+```bash
+docker run -it \
+    -v /path/to/files/:/files \
+    -v $PWD/out/:/out \
+    simon987/sist2 scan -t 4 /files -o /out/my_idx1
+```
+**index**
+```bash
+docker run -it --network host \
+    -v $PWD/out/:/out \
+    simon987/sist2 index /out/my_idx1
+```
+
+**web**
+```bash
+docker run --rm --network host -d --name sist2 \
+    -v $PWD/out/my_idx:/idx \
+    -v $PWD/my/files:/files \
+    simon987/sist2 web --bind 0.0.0.0 /idx
+docker stop sist2
+```
+
+
 ## Format support

 File type | Library | Content | Thumbnail | Metadata
 :---|:---|:---|:---|:---
-pdf,xps,cbz,cbr,fb2,epub | MuPDF | yes | yes, `png` | title |
+pdf,xps,cbz,fb2,epub | MuPDF | yes | yes, `png` | title |
 `audio/*` | ffmpeg | - | yes, `jpeg` | ID3 tags |
-`video/*` | ffmpeg | - | yes, `jpeg` | title, comment |
-`image/*` | ffmpeg | - | yes, `jpeg` | *planned* |
+`video/*` | ffmpeg | - | yes, `jpeg` | title, comment, artist |
+`image/*` | ffmpeg | - | yes, `jpeg` | `EXIF:Artist`, `EXIF:ImageDescription` |
 ttf,ttc,cff,woff,fnt,otf | Freetype2 | - | yes, `bmp` | Name & style |
 `text/plain` | *(none)* | yes | no | - |
 docx, xlsx, pptx |  | *planned* | no | *planned* |
@@ -79,7 +108,7 @@ binaries.
    apt install git cmake pkg-config libglib2.0-dev\
        libssl-dev uuid-dev libavformat-dev libswscale-dev \
        python3 libmagic-dev libfreetype6-dev libcurl-dev \
-        libbz2-dev yasm libharfbuzz-dev
+        libbz2-dev yasm libharfbuzz-dev ragel
   ```
    *(FreeBSD)*
    ```bash
--- a/lib/bzip2-1.0.6
+++ b/lib/bzip2-1.0.6
--- a/lib/ffmpeg
+++ b/lib/ffmpeg
--- a/lib/harfbuzz
+++ b/lib/harfbuzz
--- a/lib/libmagic
+++ b/lib/libmagic
--- a/lib/mupdf
+++ b/lib/mupdf
--- a/schema/mappings.json
+++ b/schema/mappings.json
@@ -80,6 +80,9 @@
          "analyzer": "my_nGram"
        }
      }
+    },
+    "tag": {
+      "type": "keyword"
    }
  }
 }
--- a/scripting/README.md
+++ b/scripting/README.md
@@ -0,0 +1,117 @@
+## User scripts
+
+*This document is under construction; a more in-depth guide is coming soon.*
+
+During the `index` step, you can use the `--script-file <script>` option to
+modify documents or add user tags. This option is mainly used to
+implement automatic tagging based on file attributes.
+
+The scripting language used 
+([Painless Scripting Language](https://www.elastic.co/guide/en/elasticsearch/painless/7.4/index.html)) 
+is very similar to Java, but you should be able to create user scripts
+without any programming experience if you're somewhat familiar with
+regex.
+
+This is the base structure of the documents we're working with:
+```json
+{
+  "_id": "e171405c-fdb5-4feb-bb32-82637bc32084",
+  "_index": "sist2",
+  "_type": "_doc",
+  "_source": {
+    "index": "206b3050-e821-421a-891d-12fcf6c2db0d",
+    "mime": "application/json",
+    "size": 1799,
+    "mtime": 1545443685,
+    "extension": "md",
+    "name": "README",
+    "path": "sist2/scripting",
+    "content": "..."
+  }
+}
+```
+
+**Example script**
+
+This script checks whether the `genre` attribute exists and, if it does,
+adds the `genre.<genre>` tag.
+```Java
+ArrayList tags = ctx._source.tag = new ArrayList();
+
+if (ctx._source?.genre != null) {
+    tags.add("genre." + ctx._source.genre.toLowerCase());
+}
+```
+
+You can use `.` to create a hierarchical tag tree:
+
+![scripting/genre_example](genre_example.png)
+
+
+To use regular expressions, add this line to `/etc/elasticsearch/elasticsearch.yml`:
+```yaml
+script.painless.regex.enabled: true
+```
+Or, if you're using Docker, add `-e "script.painless.regex.enabled=true"`
+
+### Examples
+
+If `(20XX)` is in the file name, add the `year.<year>` tag:
+```Java
+ArrayList tags = ctx._source.tag = new ArrayList();
+
+Matcher m = /[\(\.+](20[0-9]{2})[\)\.+]/.matcher(ctx._source.name);
+if (m.find()) {
+    tags.add("year." + m.group(1));
+}
+```
+
+Use the default *Calibre* folder structure to infer the author:
+```Java
+ArrayList tags = ctx._source.tag = new ArrayList();
+
+// We expect the book path to look like this:
+//  /path/to/Calibre Library/Author/Title/Title - Author.pdf
+
+if (ctx._source.name.contains("-") && ctx._source.extension == "pdf") {
+    String[] names = ctx._source.name.splitOnToken('-');
+    tags.add("author." + names[1].strip());
+}
+```
+
+If the file name matches the pattern `AAAA-000 fName1 lName1, <fName2 lName2>...`, add the `actress.<actress>` and 
+`studio.<studio>` tags:
+```Java
+ArrayList tags = ctx._source.tag = new ArrayList();
+
+Matcher m = /([A-Z]{4})-[0-9]{3} (.*)/.matcher(ctx._source.name);
+if (m.find()) {
+    tags.add("studio." + m.group(1));
+
+    // Take the matched group (.*), and add a tag for
+    //  each name, separated by comma
+    for (String name : m.group(2).splitOnToken(',')) {
+        tags.add("actress." + name);
+    }
+}
+```
+
+Add the name of the last folder (`/path/to/<studio>/file.mp4`) as a `studio.<studio>` tag:
+```Java
+ArrayList tags = ctx._source.tag = new ArrayList();
+
+if (ctx._source.path != "") {
+    String[] names = ctx._source.path.splitOnToken('/');
+    tags.add("studio." + names[names.length-1]);
+}
+```
+
+Add the name of the last folder (`/path/to/<studio>/file.mp4`) as a `studio.<studio>` tag:
+```Java
+ArrayList tags = ctx._source.tag = new ArrayList();
+
+if (ctx._source.path != "") {
+    String[] names = ctx._source.path.splitOnToken('/');
+    tags.add("studio." + names[names.length-1]);
+}
+```
--- a/scripting/genre_example.png
+++ b/scripting/genre_example.png
--- a/scripts/get_static_libs.sh
+++ b/scripts/get_static_libs.sh
@@ -54,14 +54,12 @@ cd ../..
 mv onion/build/src/onion/libonion_static.a .

 #bzip2
-git clone https://github.com/enthought/bzip2-1.0.6
 cd bzip2-1.0.6
 make -j 4
 cd ..
 mv bzip2-1.0.6/libbz2.a .

 # magic
-git clone https://github.com/threatstack/libmagic
 cd libmagic
 ./autogen.sh
 ./configure --enable-static --disable-shared
--- a/scripts/get_static_libs_freebsd.sh
+++ b/scripts/get_static_libs_freebsd.sh
@@ -42,14 +42,12 @@ mv ffmpeg/libswresample/libswresample.a .
 mv ffmpeg/libswscale/libswscale.a .

 #bzip2
-git clone https://github.com/enthought/bzip2-1.0.6
 cd bzip2-1.0.6
 make -j 4
 cd ..
 mv bzip2-1.0.6/libbz2.a .

 # magic
-git clone https://github.com/threatstack/libmagic
 cd libmagic
 ./autogen.sh
 ./configure --enable-static --disable-shared
--- a/src/cli.c
+++ b/src/cli.c
@@ -2,8 +2,8 @@

 #define DEFAULT_OUTPUT "index.sist2/"
 #define DEFAULT_CONTENT_SIZE 4096
-#define DEFAULT_QUALITY 15
-#define DEFAULT_SIZE 200
+#define DEFAULT_QUALITY 5
+#define DEFAULT_SIZE 500
 #define DEFAULT_REWRITE_URL ""

 #define DEFAULT_ES_URL "http://localhost:9200"
@@ -25,7 +25,7 @@ int scan_args_validate(scan_args_t *args, int argc, const char **argv) {

    char *abs_path = abspath(argv[1]);
    if (abs_path == NULL) {
-        fprintf(stderr, "File not found: %s", argv[1]);
+        fprintf(stderr, "File not found: %s\n", argv[1]);
        return 1;
    } else {
        args->path = abs_path;
@@ -34,7 +34,7 @@ int scan_args_validate(scan_args_t *args, int argc, const char **argv) {
    if (args->incremental != NULL) {
        abs_path = abspath(args->incremental);
        if (abs_path == NULL) {
-            fprintf(stderr, "File not found: %s", args->incremental);
+            fprintf(stderr, "File not found: %s\n", args->incremental);
            return 1;
        }
    }
@@ -100,7 +100,7 @@ int index_args_validate(index_args_t *args, int argc, const char **argv) {

    char *index_path = abspath(argv[1]);
    if (index_path == NULL) {
-        fprintf(stderr, "File not found: %s", argv[1]);
+        fprintf(stderr, "File not found: %s\n", argv[1]);
        return 1;
    } else {
        args->index_path = argv[1];
@@ -109,6 +109,27 @@ int index_args_validate(index_args_t *args, int argc, const char **argv) {
    if (args->es_url == NULL) {
        args->es_url = DEFAULT_ES_URL;
    }
+
+    if (args->script_path != NULL) {
+        struct stat info;
+        int res = stat(args->script_path, &info);
+
+        if (res == -1) {
+            fprintf(stderr, "Error opening script file '%s': %s\n", args->script_path, strerror(errno));
+            return 1;
+        }
+
+        int fd = open(args->script_path, O_RDONLY);
+        if (fd == -1) {
+            fprintf(stderr, "Error opening script file '%s': %s\n", args->script_path, strerror(errno));
+            return 1;
+        }
+
+        args->script = malloc(info.st_size + 1);
+        read(fd, args->script, info.st_size);
+        *(args->script + info.st_size) = '\0';
+        close(fd);
+    }
    return 0;
 }

@@ -137,7 +158,7 @@ int web_args_validate(web_args_t *args, int argc, const char **argv) {
    for (int i = 0; i < args->index_count; i++) {
        char *abs_path = abspath(args->indices[i]);
        if (abs_path == NULL) {
-            fprintf(stderr, "File not found: %s", abs_path);
+            fprintf(stderr, "File not found: %s\n", abs_path);
            return 1;
        }
    }
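Review note on the script-loading hunk in `index_args_validate`: the added code calls `read()` once and does not check its return value or the result of `malloc()`. The following is only a sketch of a more defensive variant, not the commit's code. The helper name `read_text_file` is hypothetical and does not exist in sist2.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of sist2): read a whole file into a
 * NUL-terminated heap buffer. Returns NULL on failure with errno set by the
 * failing call. The caller owns the returned buffer and must free() it. */
char *read_text_file(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) {
        return NULL;
    }

    /* fstat() on the open descriptor avoids a race between stat() and open(). */
    struct stat info;
    if (fstat(fd, &info) == -1) {
        int saved = errno;
        close(fd);
        errno = saved;
        return NULL;
    }

    size_t size = (size_t) info.st_size;
    char *buf = malloc(size + 1);
    if (buf == NULL) {
        close(fd);
        errno = ENOMEM;
        return NULL;
    }

    /* read() may return fewer bytes than requested, so loop until done. */
    size_t total = 0;
    while (total < size) {
        ssize_t n = read(fd, buf + total, size - total);
        if (n == -1) {
            if (errno == EINTR) {
                continue; /* interrupted by a signal, retry */
            }
            int saved = errno;
            free(buf);
            close(fd);
            errno = saved;
            return NULL;
        }
        if (n == 0) {
            break; /* file is shorter than stat reported */
        }
        total += (size_t) n;
    }

    buf[total] = '\0';
    close(fd);
    return buf;
}
```

With such a helper, the validation step could reduce to a single call followed by one `strerror(errno)` message on `NULL`.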
--- a/src/cli.h
+++ b/src/cli.h
@@ -22,6 +22,8 @@ int scan_args_validate(scan_args_t *args, int argc, const char **argv);
 typedef struct index_args {
    char *es_url;
    const char *index_path;
+    const char *script_path;
+    char *script;
    int print;
    int force_reset;
 } index_args_t;
--- a/src/index/elastic.c
+++ b/src/index/elastic.c
@@ -6,7 +6,6 @@
 #include <stdio.h>
 #include <string.h>
 #include <cJSON/cJSON.h>
-#include <src/ctx.h>

 #include "static_generated.c"

@@ -54,6 +53,40 @@ void index_json(cJSON *document, const char uuid_str[UUID_STR_LEN]) {
    elastic_index_line(bulk_line);
 }

+void execute_update_script(const char *script, const char index_id[UUID_STR_LEN]) {
+
+    cJSON *body = cJSON_CreateObject();
+    cJSON *script_obj = cJSON_AddObjectToObject(body, "script");
+    cJSON_AddStringToObject(script_obj, "lang", "painless");
+    cJSON_AddStringToObject(script_obj, "source", script);
+
+    cJSON *query = cJSON_AddObjectToObject(body, "query");
+    cJSON *term_obj = cJSON_AddObjectToObject(query, "term");
+    cJSON_AddStringToObject(term_obj, "index", index_id);
+
+    char * str = cJSON_Print(body);
+
+    char bulk_url[4096];
+    snprintf(bulk_url, 4096, "%s/sist2/_update_by_query?pretty", Indexer->es_url);
+    response_t *r = web_post(bulk_url, str, "Content-Type: application/json");
+    printf("Executed user script <%d>\n", r->status_code);
+    cJSON *resp = cJSON_Parse(r->body);
+
+    cJSON_free(str);
+    cJSON_Delete(body);
+    free_response(r);
+
+    cJSON *error = cJSON_GetObjectItem(resp, "error");
+    if (error != NULL) {
+        char *error_str = cJSON_Print(error);
+
+        fprintf(stderr, "User script error: \n%s\n", error_str);
+        cJSON_free(error_str);
+    }
+
+    cJSON_Delete(resp);
+}
+
 void elastic_flush() {

    if (Indexer == NULL) {
@@ -115,6 +148,7 @@ void elastic_flush() {
    cJSON_Delete(ret_json);

    free_response(r);
+    free(buf);
 }

 void elastic_index_line(es_bulk_line_t *line) {
@@ -140,8 +174,7 @@ void elastic_index_line(es_bulk_line_t *line) {

 es_indexer_t *create_indexer(const char *url) {

-    size_t url_len = strlen(url);
-    char *es_url = malloc(url_len);
+    char *es_url = malloc(strlen(url) + 1);
    strcpy(es_url, url);

    es_indexer_t *indexer = malloc(sizeof(es_indexer_t));
@@ -154,7 +187,7 @@ es_indexer_t *create_indexer(const char *url) {
    return indexer;
 }

-void destroy_indexer() {
+void destroy_indexer(char * script, char index_id[UUID_STR_LEN]) {

    char url[4096];

@@ -163,6 +196,15 @@ void destroy_indexer() {
    printf("Refresh index <%d>\n", r->status_code);
    free_response(r);

+    if (script != NULL) {
+        execute_update_script(script, index_id);
+    }
+
+    snprintf(url, sizeof(url), "%s/sist2/_refresh", IndexCtx.es_url);
+    r = web_post(url, "", NULL);
+    printf("Refresh index <%d>\n", r->status_code);
+    free_response(r);
+
    snprintf(url, sizeof(url), "%s/sist2/_forcemerge", IndexCtx.es_url);
    r = web_post(url, "", NULL);
    printf("Merge index <%d>\n", r->status_code);
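Review note on `execute_update_script`: the function posts a body with a `script` object (`lang`, `source`) and a `term` query on `index` to `_update_by_query`. To make that shape concrete, here is a dependency-free sketch that builds the same JSON with `snprintf` instead of cJSON. The names `json_escape` and `build_update_body` and the fixed buffer sizes are assumptions made for illustration; the commit itself uses cJSON, which handles escaping on its own.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write `src` into `dst` escaped for use inside a JSON
 * string literal (without the surrounding quotes). Returns the number of
 * bytes written, or -1 if `dst` is too small. */
static int json_escape(char *dst, size_t cap, const char *src) {
    if (cap == 0) {
        return -1;
    }
    size_t o = 0;
    for (const unsigned char *p = (const unsigned char *) src; *p; p++) {
        char tmp[8];
        int n;
        switch (*p) {
            case '"':  n = snprintf(tmp, sizeof(tmp), "\\\""); break;
            case '\\': n = snprintf(tmp, sizeof(tmp), "\\\\"); break;
            case '\n': n = snprintf(tmp, sizeof(tmp), "\\n"); break;
            case '\r': n = snprintf(tmp, sizeof(tmp), "\\r"); break;
            case '\t': n = snprintf(tmp, sizeof(tmp), "\\t"); break;
            default:
                if (*p < 0x20) {
                    /* Other control characters use the \u00XX form. */
                    n = snprintf(tmp, sizeof(tmp), "\\u%04x", *p);
                } else {
                    tmp[0] = (char) *p;
                    tmp[1] = '\0';
                    n = 1;
                }
        }
        if (o + (size_t) n >= cap) {
            return -1;
        }
        memcpy(dst + o, tmp, (size_t) n);
        o += (size_t) n;
    }
    dst[o] = '\0';
    return (int) o;
}

/* Hypothetical helper: build the _update_by_query request body. Returns the
 * body length, or -1 if any buffer is too small. */
int build_update_body(char *out, size_t cap, const char *script, const char *index_id) {
    char esc_script[4096];
    char esc_id[128];
    if (json_escape(esc_script, sizeof(esc_script), script) < 0 ||
        json_escape(esc_id, sizeof(esc_id), index_id) < 0) {
        return -1;
    }
    int n = snprintf(out, cap,
                     "{\"script\":{\"lang\":\"painless\",\"source\":\"%s\"},"
                     "\"query\":{\"term\":{\"index\":\"%s\"}}}",
                     esc_script, esc_id);
    return (n < 0 || (size_t) n >= cap) ? -1 : n;
}
```

The `term` query restricts the update to documents whose `index` field equals the given index id, so a user script only touches documents from the index being processed.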
--- a/src/index/elastic.h
+++ b/src/index/elastic.h
@@ -24,7 +24,7 @@ void index_json(cJSON *document, const char uuid_str[UUID_STR_LEN]);

 es_indexer_t *create_indexer(const char* es_url);

-void destroy_indexer();
+void destroy_indexer(char *script, char index_id[UUID_STR_LEN]);

 void elastic_init(int force_reset);

--- a/src/index/static_generated.c
+++ b/src/index/static_generated.c
--- a/src/io/serialize.c
+++ b/src/io/serialize.c
@@ -54,6 +54,12 @@ index_descriptor_t read_index_descriptor(char *path) {
    struct stat info;
    stat(path, &info);
    int fd = open(path, O_RDONLY);
+
+    if (fd == -1) {
+        fprintf(stderr, "Invalid/corrupt index (Could not find descriptor)\n");
+        exit(1);
+    }
+
    char *buf = malloc(info.st_size + 1);
    read(fd, buf, info.st_size);
    *(buf + info.st_size) = '\0';
@@ -258,8 +264,9 @@ void read_index(const char *path, const char index_id[UUID_STR_LEN], index_func
        }

        func(document, uuid_str);
-        cJSON_free(document);
+        cJSON_Delete(document);
    }
+    dyn_buffer_destroy(&buf);
    fclose(file);
 }

--- a/src/io/store.c
+++ b/src/io/store.c
@@ -15,7 +15,7 @@ store_t *store_create(char *path) {
    );

    if (open_ret != 0) {
-        fprintf(stderr, "Error while opening store: %s", mdb_strerror(open_ret));
+        fprintf(stderr, "Error while opening store: %s (%s)\n", mdb_strerror(open_ret), path);
        exit(1);
    }

--- a/src/main.c
+++ b/src/main.c
@@ -10,7 +10,7 @@
 #define EPILOG "Made by simon987 <me@simon987.net>. Released under GPL-3.0"


-static const char *const Version = "1.1.1";
+static const char *const Version = "1.1.5";
 static const char *const usage[] = {
        "sist2 scan [OPTION]... PATH",
        "sist2 index [OPTION]... INDEX",
@@ -163,10 +163,11 @@ void sist2_index(index_args_t *args) {
            read_index(file_path, desc.uuid, f);
        }
    }
+    closedir(dir);

    if (!args->print) {
        elastic_flush();
-        destroy_indexer();
+        destroy_indexer(args->script, desc.uuid);
    }
 }

@@ -208,16 +209,20 @@ int main(int argc, const char *argv[]) {
    web_args_t *web_args = web_args_create();
    #endif

+    int arg_version = 0;
+
    char * common_es_url = NULL;

    struct argparse_option options[] = {
            OPT_HELP(),

+            OPT_BOOLEAN('v', "version", &arg_version, "Show version and exit"),
+
            OPT_GROUP("Scan options"),
            OPT_INTEGER('t', "threads", &scan_args->threads, "Number of threads. DEFAULT=1"),
            OPT_FLOAT('q', "quality", &scan_args->quality,
-                      "Thumbnail quality, on a scale of 1.0 to 31.0, 1.0 being the best. DEFAULT=15"),
-            OPT_INTEGER(0, "size", &scan_args->size, "Thumbnail size, in pixels. DEFAULT=200"),
+                      "Thumbnail quality, on a scale of 1.0 to 31.0, 1.0 being the best. DEFAULT=5"),
+            OPT_INTEGER(0, "size", &scan_args->size, "Thumbnail size, in pixels. DEFAULT=500"),
            OPT_INTEGER(0, "content-size", &scan_args->content_size,
                        "Number of bytes to be extracted from text documents. DEFAULT=4096"),
            OPT_STRING(0, "incremental", &scan_args->incremental,
@@ -230,6 +235,7 @@ int main(int argc, const char *argv[]) {
            OPT_GROUP("Index options"),
            OPT_STRING(0, "es-url", &common_es_url, "Elasticsearch url. DEFAULT=http://localhost:9200"),
            OPT_BOOLEAN('p', "print", &index_args->print, "Just print JSON documents to stdout."),
+            OPT_STRING(0, "script-file", &index_args->script_path, "Path to user script."),
            OPT_BOOLEAN('f', "force-reset", &index_args->force_reset, "Reset Elasticsearch mappings and settings. "
                                                              "(You must use this option the first time you use the index command)"),

@@ -247,6 +253,11 @@ int main(int argc, const char *argv[]) {
    argparse_describe(&argparse, DESCRIPTION, EPILOG);
    argc = argparse_parse(&argparse, argc, argv);

+    if (arg_version) {
+        printf(Version);
+        exit(0);
+    }
+
    #ifndef SIST_SCAN_ONLY
    web_args->es_url = common_es_url;
    index_args->es_url = common_es_url;
--- a/src/parsing/font.c
+++ b/src/parsing/font.c
@@ -142,6 +142,9 @@ void parse_font(const char *buf, size_t buf_len, document_t *doc) {
    if (library == NULL) {
        FT_Init_FreeType(&library);
    }
+    if (buf == NULL) {
+        return;
+    }

    FT_Face face;
    FT_Error err = FT_New_Memory_Face(library, (unsigned char *) buf, buf_len, 0, &face);
--- a/src/parsing/media.c
+++ b/src/parsing/media.c
@@ -116,9 +116,9 @@ AVFrame *read_frame(AVFormatContext *pFormatCtx, AVCodecContext *decoder, int st
    return frame;
 }

-#define APPEND_TAG_META(doc, tag, keyname) \
+#define APPEND_TAG_META(doc, tag_, keyname) \
    text_buffer_t tex = text_buffer_create(-1); \
-    text_buffer_append_string0(&tex, tag->value); \
+    text_buffer_append_string0(&tex, tag_->value); \
    meta_line_t *meta_tag = malloc(sizeof(meta_line_t) + tex.dyn_buffer.cur); \
    meta_tag->key = keyname; \
    strcpy(meta_tag->strval, tex.dyn_buffer.buf); \
@@ -151,30 +151,39 @@ void append_audio_meta(AVFormatContext *pFormatCtx, document_t *doc) {
 }

 __always_inline
-void append_video_meta(AVFormatContext *pFormatCtx, document_t *doc, int include_audio_tags) {
+void append_video_meta(AVFormatContext *pFormatCtx, AVFrame *frame, document_t *doc, int include_audio_tags, int is_video) {

-    meta_line_t *meta_duration = malloc(sizeof(meta_line_t));
-    meta_duration->key = MetaMediaDuration;
-    meta_duration->longval = pFormatCtx->duration / AV_TIME_BASE;
-    APPEND_META(doc, meta_duration)
+    if (is_video) {
+        meta_line_t *meta_duration = malloc(sizeof(meta_line_t));
+        meta_duration->key = MetaMediaDuration;
+        meta_duration->longval = pFormatCtx->duration / AV_TIME_BASE;
+        APPEND_META(doc, meta_duration)

-    meta_line_t *meta_bitrate = malloc(sizeof(meta_line_t));
-    meta_bitrate->key = MetaMediaBitrate;
-    meta_bitrate->longval = pFormatCtx->bit_rate;
-    APPEND_META(doc, meta_bitrate)
+        meta_line_t *meta_bitrate = malloc(sizeof(meta_line_t));
+        meta_bitrate->key = MetaMediaBitrate;
+        meta_bitrate->longval = pFormatCtx->bit_rate;
+        APPEND_META(doc, meta_bitrate)
+    }

    AVDictionaryEntry *tag = NULL;
-    while ((tag = av_dict_get(pFormatCtx->metadata, "", tag, AV_DICT_IGNORE_SUFFIX))) {
-        char key[32];
-        strncpy(key, tag->key, sizeof(key));
-
-        char *ptr = key;
-        for (; *ptr; ++ptr) *ptr = (char) tolower(*ptr);
-
-        if (strcmp(key, "title") == 0 && include_audio_tags) {
-            APPEND_TAG_META(doc, tag, MetaTitle)
-        } else if (strcmp(key, "comment") == 0) {
-            APPEND_TAG_META(doc, tag, MetaContent)
+    if (is_video) {
+        while ((tag = av_dict_get(pFormatCtx->metadata, "", tag, AV_DICT_IGNORE_SUFFIX))) {
+            if (include_audio_tags && strcmp(tag->key, "title") == 0) {
+                APPEND_TAG_META(doc, tag, MetaTitle)
+            } else if (strcmp(tag->key, "comment") == 0) {
+                APPEND_TAG_META(doc, tag, MetaContent)
+            } else if (include_audio_tags && strcmp(tag->key, "artist") == 0) {
+                APPEND_TAG_META(doc, tag, MetaArtist)
+            }
+        }
+    } else {
+        // EXIF metadata
+        while ((tag = av_dict_get(frame->metadata, "", tag, AV_DICT_IGNORE_SUFFIX))) {
+            if (include_audio_tags && strcmp(tag->key, "Artist") == 0) {
+                APPEND_TAG_META(doc, tag, MetaArtist)
+            } else if (strcmp(tag->key, "ImageDescription") == 0) {
+                APPEND_TAG_META(doc, tag, MetaContent)
+            }
        }
    }
 }
@@ -236,11 +245,6 @@ void parse_media(const char *filepath, document_t *doc) {
    if (video_stream != -1) {
        AVStream *stream = pFormatCtx->streams[video_stream];

-        if (stream->nb_frames > 1) {
-            //This is a video (not a still image)
-            append_video_meta(pFormatCtx, doc, audio_stream == -1);
-        }
-
        if (stream->codecpar->width <= MIN_SIZE || stream->codecpar->height <= MIN_SIZE) {
            avformat_close_input(&pFormatCtx);
            avformat_free_context(pFormatCtx);
@@ -273,6 +277,8 @@ void parse_media(const char *filepath, document_t *doc) {
            return;
        }

+        append_video_meta(pFormatCtx, frame, doc, audio_stream == -1, stream->nb_frames > 1);
+
        // Scale frame
        AVFrame *scaled_frame = scale_frame(decoder, frame, ScanCtx.tn_size);

--- a/src/parsing/parse.c
+++ b/src/parsing/parse.c
@@ -16,7 +16,6 @@ void *read_all(parse_job_t *job, const char *buf, int bytes_read, int *fd) {
            if (*fd == -1) {
                perror("open");
                printf("%s\n", job->filepath);
-                free(job);
                return NULL;
            }
        }
@@ -25,6 +24,7 @@ void *read_all(parse_job_t *job, const char *buf, int bytes_read, int *fd) {
        int ret = read(*fd, full_buf + bytes_read, job->info.st_size - bytes_read);
        if (ret == -1) {
            perror("read");
+            return NULL;
        }
    }

@@ -108,7 +108,7 @@ void parse(void *arg) {
        void *pdf_buf = read_all(job, (char *) buf, bytes_read, &fd);
        parse_pdf(pdf_buf, doc.size, &doc);

-        if (pdf_buf != buf) {
+        if (pdf_buf != buf && pdf_buf != NULL) {
            free(pdf_buf);
        }

@@ -119,7 +119,7 @@ void parse(void *arg) {
        void *font_buf = read_all(job, (char *) buf, bytes_read, &fd);
        parse_font(font_buf, doc.size, &doc);

-        if (font_buf != buf) {
+        if (font_buf != buf && font_buf != NULL) {
            free(font_buf);
        }
    }
--- a/src/parsing/pdf.c
+++ b/src/parsing/pdf.c
@@ -114,6 +114,10 @@ int read_stext_block(fz_stext_block *block, text_buffer_t *tex) {

 void parse_pdf(void *buf, size_t buf_len, document_t *doc) {

+    if (buf == NULL) {
+        return;
+    }
+
    static int mu_is_initialized = 0;
    if (!mu_is_initialized) {
        pthread_mutex_init(&ScanCtx.mupdf_mu, NULL);
--- a/src/util.c
+++ b/src/util.c
@@ -90,7 +90,7 @@ void text_buffer_terminate_string(text_buffer_t *buf) {
 }

 __always_inline
-int utf8_validchr(const char* s) {
+int utf8_validchr(const char *s) {
    if (0x00 == (0x80 & *s)) {
        return TRUE;
    } else if (0xf0 == (0xf8 & *s)) {
@@ -130,7 +130,7 @@ int utf8_validchr(const char* s) {
        if (0 == (0x1e & s[0])) {
            return FALSE;
        }
-    } else  {
+    } else {
        return FALSE;
    }

@@ -140,12 +140,22 @@ int utf8_validchr(const char* s) {
 int text_buffer_append_string(text_buffer_t *buf, char *str, size_t len) {

    utf8_int32_t c;
-    for (void *v = utf8codepoint(str, &c); c != '\0' && ((char*)v - str + 4) < len; v = utf8codepoint(v, &c)) {
+    if (str == NULL || len < 1 ||
+        (0xf0 == (0xf8 & str[0]) && len < 4) ||
+        (0xe0 == (0xf0 & str[0]) && len < 3) ||
+        (0xc0 == (0xe0 & str[0]) && len == 1) ||
+        *(str) == 0) {
+        text_buffer_terminate_string(buf);
+        return 0;
+    }
+
+    for (void *v = utf8codepoint(str, &c); c != '\0' && ((char *) v - str + 4) < len; v = utf8codepoint(v, &c)) {
        if (utf8_validchr(v)) {
            text_buffer_append_char(buf, c);
        }
    }
    text_buffer_terminate_string(buf);
+    return 0;
 }

 int text_buffer_append_string0(text_buffer_t *buf, char *str) {
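Review note on the guard added to `text_buffer_append_string`: the new condition rejects a buffer whose first UTF-8 lead byte announces more bytes than `len` provides. The same idea can be sketched as a pair of small functions. The names `utf8_seq_len` and `utf8_first_char_fits` are hypothetical, and unlike the guard in the diff this sketch treats a leading NUL byte as an ordinary one-byte character and rejects invalid lead bytes outright.

```c
#include <stddef.h>

/* Expected length in bytes of a UTF-8 sequence, judged from its lead byte.
 * Returns 0 for a continuation byte or an invalid lead byte. */
size_t utf8_seq_len(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1; /* 0xxxxxxx */
    if ((lead & 0xe0) == 0xc0) return 2; /* 110xxxxx */
    if ((lead & 0xf0) == 0xe0) return 3; /* 1110xxxx */
    if ((lead & 0xf8) == 0xf0) return 4; /* 11110xxx */
    return 0;
}

/* Returns 1 if the first character of `str` fits entirely within `len`
 * bytes, 0 otherwise (including NULL input or an empty buffer). */
int utf8_first_char_fits(const char *str, size_t len) {
    if (str == NULL || len < 1) {
        return 0;
    }
    size_t need = utf8_seq_len((unsigned char) str[0]);
    return need != 0 && need <= len;
}
```

Checking the lead byte before decoding keeps the decoder from reading past the end of a truncated buffer, which is the failure the added guard appears to target.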
--- a/src/web/static_generated.c
+++ b/src/web/static_generated.c
--- a/web/css/dark.css
+++ b/web/css/dark.css
@@ -1,3 +1,7 @@
+*:focus {
+    outline: 0;
+}
+
 a {
    color: #00BCD4;
 }
@@ -95,6 +99,11 @@ body {
    margin-right: 3px;
 }

+.badge-user {
+    color: #212529;
+    background-color: #e0e0e0;
+}
+
 .fit {
    display: block;
    min-width: 64px;
@@ -164,6 +173,7 @@ mark {
    margin-top: 1em;
    margin-bottom: 1em;
 }
+
 .custom-select {
    overflow: auto;
    background-color: #37474F;
@@ -239,4 +249,38 @@ option {

 .btn {
    color: #eee;
-}
+}
+
+.nav-tabs .nav-link {
+    color: #e0e0e0;
+}
+
+.nav-tabs .nav-item.show .nav-link, .nav-tabs .nav-link.active {
+    background-color: #212121;
+    border-color: #616161 #616161 #212121;
+    color: #e0e0e0;
+}
+
+.nav-tabs .nav-link:focus, .nav-tabs .nav-link:focus {
+    border-color: #616161 #616161 #212121;
+    color: #e0e0e0;
+}
+
+.nav-tabs .nav-link:focus, .nav-tabs .nav-link:hover {
+    border-color: #e0e0e0 #e0e0e0 #212121;
+    color: #e0e0e0;
+}
+
+.nav-tabs {
+    border-bottom: #616161;
+}
+
+.nav {
+    margin-top: 0.5rem;
+}
+
+@media (min-width: 800px) {
+    .nav {
+        min-width: 800px;
+    }
+}
--- a/web/css/light.css
+++ b/web/css/light.css
@@ -1,3 +1,7 @@
+*:focus {
+    outline: 0;
+}
+
 body {overflow-y:scroll;}

 .progress {
@@ -47,6 +51,11 @@ body {overflow-y:scroll;}
    background-color: #FFC107;
 }

+.badge-user {
+    color: #212529;
+    background-color: #e0e0e0;
+}
+
 .badge-text {
    color: #FFFFFF;
    background-color: #FAAB3C;
@@ -168,4 +177,14 @@ mark {
    padding: .1rem .3rem;
    font-size: .875rem;
    border-radius: .2rem;
-}
+}
+
+.nav {
+    margin-top: 0.5rem;
+}
+
+@media (min-width: 800px) {
+    .nav {
+        min-width: 800px;
+    }
+}
--- a/web/js/dom.js
+++ b/web/js/dom.js
@@ -75,6 +75,18 @@ function shouldPlayVideo(hit) {
    return videoc !== "hevc" && videoc !== "mpeg2video" && videoc !== "wmv3";
 }

+function makePlaceholder(w, h) {
+    const calc = w > h
+        ? (175 / w / h) >= 272
+            ? (175 * w / h)
+            : 175
+        : 175;
+
+    const el = document.createElement("div");
+    el.setAttribute("style", `height: ${calc}px`);
+    return el;
+}
+
 /**
 *
 * @param hit
@@ -119,14 +131,22 @@ function createDocCard(hit) {
            thumbnail = document.createElement("video");
            addVidSrc("f/" + hit["_id"], hit["_source"]["mime"], thumbnail);

+            const placeholder = makePlaceholder(hit["_source"]["width"], hit["_source"]["height"]);
+            imgWrapper.appendChild(placeholder);
+
            thumbnail.setAttribute("class", "fit");
-            thumbnail.setAttribute("loop", "");
            thumbnail.setAttribute("controls", "");
            thumbnail.setAttribute("preload", "none");
            thumbnail.setAttribute("poster", `t/${hit["_source"]["index"]}/${hit["_id"]}`);
            thumbnail.addEventListener("dblclick", function () {
                thumbnail.webkitRequestFullScreen();
            });
+            const poster = new Image();
+            poster.src = thumbnail.getAttribute('poster');
+            poster.addEventListener("load", function () {
+                placeholder.remove();
+                imgWrapper.appendChild(thumbnail);
+            });
        } else if ((hit["_source"].hasOwnProperty("width") && hit["_source"]["width"] > 20 && hit["_source"]["height"] > 20)
            || hit["_source"]["mime"] === "application/pdf"
            || hit["_source"]["mime"] === "application/epub+zip"
@@ -136,9 +156,17 @@ function createDocCard(hit) {
            thumbnail = document.createElement("img");
            thumbnail.setAttribute("class", "card-img-top fit");
            thumbnail.setAttribute("src", `t/${hit["_source"]["index"]}/${hit["_id"]}`);
+
+            const placeholder = makePlaceholder(hit["_source"]["width"], hit["_source"]["height"]);
+            imgWrapper.appendChild(placeholder);
+
            thumbnail.addEventListener("error", () => {
                imgWrapper.remove();
            });
+            thumbnail.addEventListener("load", () => {
+                placeholder.remove();
+                imgWrapper.appendChild(thumbnail);
+            });
        }

        //Thumbnail overlay
@@ -167,7 +195,7 @@ function createDocCard(hit) {
                if (hit["_source"].hasOwnProperty("duration")) {
                    thumbnailOverlay = document.createElement("div");
                    thumbnailOverlay.setAttribute("class", "card-img-overlay");
-                    let durationBadge = document.createElement("span");
+                    const durationBadge = document.createElement("span");
                    durationBadge.setAttribute("class", "badge badge-resolution");
                    durationBadge.appendChild(document.createTextNode(humanTime(hit["_source"]["duration"])));
                    thumbnailOverlay.appendChild(durationBadge);
@@ -179,7 +207,7 @@ function createDocCard(hit) {
            case "video":
            case "image":
                if (hit["_source"].hasOwnProperty("videoc")) {
-                    let formatTag = document.createElement("span");
+                    const formatTag = document.createElement("span");
                    formatTag.setAttribute("class", "badge badge-pill badge-video");
                    formatTag.appendChild(document.createTextNode(hit["_source"]["videoc"].replace(" ", "")));
                    tags.push(formatTag);
@@ -199,14 +227,13 @@ function createDocCard(hit) {
        //Content
        let contentHl = getContentHighlight(hit);
        if (contentHl !== undefined) {
-            let contentDiv = document.createElement("div");
+            const contentDiv = document.createElement("div");
            contentDiv.setAttribute("class", "content-div");
            contentDiv.insertAdjacentHTML('afterbegin', contentHl);
            docCard.appendChild(contentDiv);
        }

        if (thumbnail !== null) {
-            imgWrapper.appendChild(thumbnail);
            docCard.appendChild(imgWrapper);
        }

@@ -227,6 +254,26 @@ function createDocCard(hit) {
            imgWrapper.appendChild(thumbnailOverlay);
        }

+        // User tags
+        if (hit["_source"].hasOwnProperty("tag")) {
+            hit["_source"]["tag"].forEach(tag => {
+                const userTag = document.createElement("span");
+                userTag.setAttribute("class", "badge badge-pill badge-user");
+
+                const tokens = tag.split("#");
+
+                if (tokens.length > 1) {
+                    const bg = "#" + tokens[1];
+                    const fg = lum(bg) > 40 ? "#000" : "#fff";
+                    userTag.setAttribute("style", `background-color: ${bg}; color: ${fg}`);
+                }
+
+                const name = tokens[0].split(".")[tokens[0].split(".").length - 1];
+                userTag.appendChild(document.createTextNode(name));
+                tags.push(userTag);
+            })
+        }
+
        for (let i = 0; i < tags.length; i++) {
            tagContainer.appendChild(tags[i]);
        }
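Review note: the user-tag hunk above reads each tag as a dot-separated path with an optional `#rrggbb` color suffix, and shows only the last path segment on the badge. A minimal sketch of that parsing step follows. `parseUserTag` is a hypothetical helper, not part of this commit, and the sample tag strings are made up.

```javascript
// Hypothetical helper (not in the commit): splits a raw tag string the same
// way the badge code in createDocCard does.
function parseUserTag(tag) {
    // "genre.rock#ff0000" -> path "genre.rock", color "ff0000"
    const [path, color] = tag.split("#");
    const segments = path.split(".");
    return {
        name: segments[segments.length - 1],          // leaf segment shown on the badge
        bg: color !== undefined ? "#" + color : null  // null: keep the default badge style
    };
}

console.log(parseUserTag("genre.rock#ff0000")); // { name: 'rock', bg: '#ff0000' }
console.log(parseUserTag("todo"));              // { name: 'todo', bg: null }
```

Returning `null` for the background stands in for the `tokens.length > 1` check in the diff, which leaves the badge unstyled when no color is given.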
--- a/web/js/search.js
+++ b/web/js/search.js
@@ -1,6 +1,8 @@
 const SIZE = 40;
 let mimeMap = [];
-let tree;
+let tagMap = [];
+let mimeTree;
+let tagTree;

 let searchBar = document.getElementById("searchBar");
 let pathBar = document.getElementById("pathBar");
@@ -32,7 +34,7 @@ window.onload = () => {
    })
 };

-function toggleSearchBar() {
+function toggleFuzzy() {
    searchDebounced();
 }

@@ -49,6 +51,23 @@ $.jsonPost("i").then(resp => {
    });
 });

+function handleTreeClick (tree) {
+    return (event, node, handler) => {
+        event.preventTreeDefault();
+
+        if (node.id === "any") {
+            if (!node.itree.state.checked) {
+                tree.deselect();
+            }
+        } else {
+            tree.node("any").deselect();
+        }
+
+        handler();
+        searchDebounced();
+    }
+}
+
 $.jsonPost("es", {
    aggs: {
        mimeTypes: {
@@ -85,34 +104,86 @@ $.jsonPost("es", {
    });
    mimeMap.push({"text": "All", "id": "any"});

-    tree = new InspireTree({
+    mimeTree = new InspireTree({
        selection: {
            mode: 'checkbox'
        },
        data: mimeMap
    });
-    new InspireTreeDOM(tree, {
-        target: '.tree'
+    new InspireTreeDOM(mimeTree, {
+        target: '#mimeTree'
    });
-    tree.on("node.click", function (event, node, handler) {
-        event.preventTreeDefault();
+    mimeTree.on("node.click", handleTreeClick(mimeTree));
+    mimeTree.select();
+    mimeTree.node("any").deselect();
+});

-        if (node.id === "any") {
-            if (!node.itree.state.checked) {
-                tree.deselect();
+function leafTag(tag) {
+    const tokens = tag.split(".");
+    return tokens[tokens.length-1]
+}
+
+// Tags tree
+$.jsonPost("es", {
+    aggs: {
+        tags: {
+            terms: {
+                field: "tag",
+                size: 10000
            }
-        } else {
-            tree.node("any").deselect();
        }
-
-        handler();
-        searchDebounced();
+    },
+    size: 0,
+}).then(resp => {
+    resp["aggregations"]["tags"]["buckets"]
+        .sort((a, b) => a["key"].localeCompare(b["key"]))
+        .forEach(bucket => {
+        addTag(tagMap, bucket["key"], bucket["key"], bucket["doc_count"])
    });
-    tree.select();
-    tree.node("any").deselect();
+
+    tagMap.push({"text": "All", "id": "any"});
+    tagTree = new InspireTree({
+        selection: {
+            mode: 'checkbox'
+        },
+        data: tagMap
+    });
+    new InspireTreeDOM(tagTree, {
+        target: '#tagTree'
+    });
+    tagTree.on("node.click", handleTreeClick(tagTree));
+    tagTree.node("any").select();
    searchBusy = false;
 });

+function addTag(map, tag, id, count) {
+    let tags = tag.split("#")[0].split(".");
+
+    let child = {
+        id: id,
+        text: tags.length !== 1 ? tags[0] : `${tags[0]} (${count})`,
+        children: []
+    };
+
+    let found = false;
+    map.forEach(node => {
+        if (node.text === child.text) {
+            found = true;
+            if (tags.length !== 1) {
+                addTag(node.children, tags.slice(1).join("."), id, count);
+            }
+        }
+    });
+    if (!found) {
+        if (tags.length !== 1) {
+            addTag(child.children, tags.slice(1).join("."), id, count);
+            map.push(child);
+        } else {
+            map.push(child);
+        }
+    }
+}
+
 new autoComplete({
    selector: '#pathBar',
    minChars: 1,
@@ -181,8 +252,8 @@ function doScroll() {
        })
 }

-function getSelectedMimeTypes() {
-    let mimeTypes = [];
+function getSelectedNodes(tree) {
+    let selectedNodes = [];

    let selected = tree.selected();

@@ -194,11 +265,11 @@ function getSelectedMimeTypes() {

        //Only get children
        if (selected[i].text.indexOf("(") !== -1) {
-            mimeTypes.push(selected[i].id);
+            selectedNodes.push(selected[i].id);
        }
    }

-    return mimeTypes
+    return selectedNodes
 }

 function search() {
@@ -218,21 +289,37 @@ function search() {

    let query = searchBar.value;
    let empty = query === "";
-    let condition = $("#barToggle").prop("checked") && !empty ? "must" : "should";
+    let condition = empty ? "should" : "must";
    let filters = [
        {range: {size: {gte: size_min, lte: size_max}}},
        {terms: {index: selectedIndices}}
    ];
+    let fields = [
+        "name^8",
+        "content^3",
+        "album^8", "artist^8", "title^8", "genre^2", "album_artist^8",
+        "font_name^6"
+    ];
+
+    if ($("#fuzzyToggle").prop("checked")) {
+        fields.push("content.nGram");
+        fields.push("name.nGram^3");
+    }

    let path = pathBar.value.replace(/\/$/, "").toLowerCase(); //remove trailing slashes
    if (path !== "") {
        filters.push([{term: {path: path}}])
    }
-    let mimeTypes = getSelectedMimeTypes();
+    let mimeTypes = getSelectedNodes(mimeTree);
    if (!mimeTypes.includes("any")) {
        filters.push([{terms: {"mime": mimeTypes}}]);
    }

+    let tags = getSelectedNodes(tagTree);
+    if (!tags.includes("any")) {
+        filters.push([{terms: {"tag": tags}}]);
+    }
+
    $.jsonPost("es?scroll=1", {
        "_source": {
            excludes: ["content"]
@@ -243,12 +330,7 @@ function search() {
                    multi_match: {
                        query: query,
                        type: "most_fields",
-                        fields: [
-                            "name^8", "name.nGram^3", "content^3",
-                            "content.nGram",
-                            "album^8", "artist^8", "title^8", "genre^2", "album_artist^8",
-                            "font_name^6"
-                        ],
+                        fields: fields,
                        operator: "and"
                    }
                },
@@ -265,7 +347,7 @@ function search() {
                content: {},
                name: {},
                "name.nGram": {},
-                // font_name: {},
+                font_name: {},
            }
        },
        aggs: {
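Review note: `addTag` in the hunk above turns flat aggregation keys into the nested node list that InspireTree consumes. The sketch below repeats its logic in standalone form (with the two identical `map.push(child)` branches merged) and runs it on made-up buckets, so the resulting tree can be inspected outside the browser. The bucket values and the `outline` helper are assumptions for illustration only.

```javascript
// Same logic as addTag in the diff, runnable on its own.
function addTag(map, tag, id, count) {
    const tags = tag.split("#")[0].split(".");   // drop the color, split the path
    const child = {
        id: id,
        // Only leaves carry "(count)"; getSelectedNodes relies on that "(".
        text: tags.length !== 1 ? tags[0] : `${tags[0]} (${count})`,
        children: []
    };

    let found = false;
    map.forEach(node => {
        if (node.text === child.text) {
            found = true;
            if (tags.length !== 1) {
                addTag(node.children, tags.slice(1).join("."), id, count);
            }
        }
    });
    if (!found) {
        if (tags.length !== 1) {
            addTag(child.children, tags.slice(1).join("."), id, count);
        }
        map.push(child);
    }
}

// Hypothetical helper: renders the tree as indented lines.
function outline(nodes, depth = 0) {
    return nodes.flatMap(n => [" ".repeat(depth * 2) + n.text, ...outline(n.children, depth + 1)]);
}

// Made-up aggregation buckets, already sorted by key.
const buckets = [
    {key: "genre.jazz#00ff00", doc_count: 3},
    {key: "genre.rock#ff0000", doc_count: 5},
    {key: "todo", doc_count: 2}
];

const tagMap = [];
buckets.forEach(b => addTag(tagMap, b.key, b.key, b.doc_count));

console.log(outline(tagMap).join("\n"));
// genre
//   jazz (3)
//   rock (5)
// todo (2)
```

Intermediate nodes such as `genre` inherit the id of the first leaf inserted under them, which appears to be why `getSelectedNodes` keeps only nodes whose text contains `(`.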
--- a/web/js/util.js
+++ b/web/js/util.js
@@ -43,9 +43,9 @@ function humanTime(sec_num) {

 function debounce(func, wait) {
    let timeout;
-    return function() {
+    return function () {
        let context = this, args = arguments;
-        let later = function() {
+        let later = function () {
            timeout = null;
            func.apply(context, args);
        };
@@ -54,3 +54,13 @@ function debounce(func, wait) {
        func.apply(context, args);
    };
 }
+
+function lum(c) {
+    c = c.substring(1);
+    let rgb = parseInt(c, 16);
+    let r = (rgb >> 16) & 0xff;
+    let g = (rgb >> 8) & 0xff;
+    let b = (rgb >> 0) & 0xff;
+
+    return 0.2126 * r + 0.7152 * g + 0.0722 * b;
+}
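Review note: `lum` unpacks a hex color and weights the channels with the Rec. 709 coefficients, and the badge code compares the result against 40 to choose black or white text. The sketch below repeats the function and wraps the threshold in `textColorFor`, a hypothetical helper that is not part of this commit. Because `lum` drops the first character of its argument, it expects the leading `#` to be present.

```javascript
// Same arithmetic as lum() in the diff.
function lum(c) {
    c = c.substring(1);              // strip the leading "#"
    const rgb = parseInt(c, 16);     // 0xRRGGBB
    const r = (rgb >> 16) & 0xff;
    const g = (rgb >> 8) & 0xff;
    const b = rgb & 0xff;

    return 0.2126 * r + 0.7152 * g + 0.0722 * b; // range 0..255
}

// Hypothetical wrapper: dark text on bright backgrounds, light text otherwise,
// using the threshold of 40 from the badge code.
function textColorFor(bg) {
    return lum(bg) > 40 ? "#000" : "#fff";
}

console.log(textColorFor("#ffff00")); // #000 (yellow is bright)
console.log(textColorFor("#000080")); // #fff (navy is dark)
```

Passing a bare `ffff00` would lose its first hex digit inside `lum`, so callers should include the `#`.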
--- a/web/search.html
+++ b/web/search.html
@@ -24,9 +24,9 @@
            <div class="input-group">
                <div class="input-group-prepend">
                    <div class="input-group-text">
-                        <span onclick="document.getElementById('barToggle').click()">Must match&nbsp</span>
-                        <input title="Toggle between 'Should' and 'Must' match mode" type="checkbox" id="barToggle"
-                               onclick="toggleSearchBar()" checked>
+                        <span title="Toggle fuzzy searching" onclick="document.getElementById('fuzzyToggle').click()">Fuzzy&nbsp;</span>
+                        <input title="Toggle fuzzy searching" type="checkbox" id="fuzzyToggle"
+                               onclick="toggleFuzzy()" checked>
                    </div>
                </div>
                <input id="searchBar" type="search" class="form-control" placeholder="Search">
@@ -42,10 +42,24 @@
                </div>

                <div class="col">
-                    <label>Mime types</label>
-
-                    <div class="tree"></div>
+                    <ul class="nav nav-tabs" role="tablist">
+                        <li class="nav-item">
+                            <a class="nav-link active" data-toggle="tab" href="#mime" role="tab" aria-controls="home" aria-selected="true">Mime Types</a>
+                        </li>
+                        <li class="nav-item">
+                            <a class="nav-link" data-toggle="tab" href="#tag" role="tab" aria-controls="profile" aria-selected="false" title="User-defined tags">Tags</a>
+                        </li>
+                    </ul>
+                    <div class="tab-content" id="myTabContent">
+                        <div class="tab-pane fade show active" id="mime" role="tabpanel" aria-labelledby="home-tab">
+                            <div id="mimeTree" class="tree"></div>
+                        </div>
+                        <div class="tab-pane fade" id="tag" role="tabpanel" aria-labelledby="profile-tab">
+                            <div id="tagTree" class="tree"></div>
+                        </div>
+                    </div>
                </div>
+
            </div>
        </div>
    </div>
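Review note: the `fuzzyToggle` checkbox in `web/search.html` replaces the old "Must match" toggle, and `search()` in `web/js/search.js` now uses it to decide whether the nGram sub-fields join the `multi_match` field list. A minimal sketch of that selection as a pure function follows. `buildFields` is a hypothetical name, and its `fuzzy` parameter stands in for `$("#fuzzyToggle").prop("checked")`.

```javascript
// Hypothetical pure version of the field selection done inside search().
function buildFields(fuzzy) {
    const fields = [
        "name^8",
        "content^3",
        "album^8", "artist^8", "title^8", "genre^2", "album_artist^8",
        "font_name^6"
    ];

    // nGram sub-fields are only queried when fuzzy searching is enabled.
    if (fuzzy) {
        fields.push("content.nGram");
        fields.push("name.nGram^3");
    }
    return fields;
}

console.log(buildFields(false).length);   // 8
console.log(buildFields(true).slice(-2)); // [ 'content.nGram', 'name.nGram^3' ]
```

Keeping the list in one place means the exact-match fields stay identical in both modes, and only the two nGram entries differ.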
Author	SHA1	Message	Date
simon	ebfd7e03ce	User scripts, bug fixes, docker image	2019-11-12 20:58:43 -05:00
simon	6931d320a2	bugfix with invalid/corrupted index path	2019-11-11 20:49:38 -05:00
simon	fc22e52eae	Image placeholder	2019-11-09 23:26:49 -05:00
simon	ba81748a74	Update build	2019-11-09 17:15:20 -05:00
simon	e72fa1587b	EXIF metadata for images	2019-11-09 15:18:44 -05:00
simon	ea4fb7fa0d	Bug fixes	2019-11-09 12:00:07 -05:00
simon	b0a868bb73	remove 'must match'	2019-11-08 21:46:54 -05:00
simon	d761a3b595	update readme	2019-11-08 19:42:36 -05:00
simon	2d7a8a2fdc	fuzzy toggle	2019-11-08 16:15:10 -05:00