diff --git a/README.md b/README.md
index 5bacbd0..7f5daca 100644
--- a/README.md
+++ b/README.md
@@ -157,6 +157,7 @@ indices, but it uses much less memory and is easier to set up.
 | Manual tagging | ✓ | ✓ |
 | User scripts | ✓ | ✓ |
 | Media Type breakdown for search results | | ✓ |
+| Embeddings search | ✓ *O(n)* | ✓ *O(logn)* |
 
 ### NER
 
diff --git a/docs/USAGE.md b/docs/USAGE.md
index 2bc04dd..3277bf3 100644
--- a/docs/USAGE.md
+++ b/docs/USAGE.md
@@ -175,6 +175,32 @@ Using a version >=7.14.0 is recommended to enable the following features:
 When using a legacy version of ES, a notice will be displayed next to the sist2 version in the web UI.
 If you don't care about the features above, you can ignore it or disable it in the configuration page.
 
+# Embeddings search
+
+Since v3.2.0, User scripts can be used to generate _embeddings_ (vector of float32 numbers) which are stored in the .sist2 index file
+(see [scripting](scripting.md)). Embeddings can be used for:
+
+* Nearest-neighbor queries (e.g. "return the documents most similar to this one")
+* Semantic searches (e.g. "return the documents that are most closely related to the given topic")
+
+In theory, embeddings can be created for any type of documents (image, text, audio etc.).
+
+For example, the [clip](https://github.com/simon987/sist2-script-clip) User Script, generates 512-d embeddings of images
+(videos are also supported using the thumbnails generated by sist2). When the user enters a query in the "Embeddings Search"
+textbox, the query's embedding is generated in their browser, leveraging the ONNX web runtime.
+
+<details>
+    <summary>Screenshots</summary>
+
+![embeddings-1](embeddings-1.png)
+![embeddings-2](embeddings-2.png)
+
+1. Embeddings search bar. You can select the model using the dropdown on the left.
+2. This icon appears for indices with embeddings search enabled.
+3. Documents with this icon have embeddings. Click on the icon to perform KNN search.
+</details>
+
+
 # Tagging
 
 ### Manual tagging
@@ -199,43 +225,4 @@ See [Automatic tagging](#automatic-tagging) for information about tag
 
 ### Automatic tagging
 
-See [scripting](scripting.md) documentation.
-
-# Sidecar files
-
-When scanning, sist2 will read metadata from `.s2meta` JSON files and overwrite the
-original document's indexed metadata (does not modify the actual file). Sidecar metadata files will also work inside archives.
-Sidecar files themselves are not saved in the index.
-
-This feature is useful to leverage third-party applications such as speech-to-text or
-OCR to add additional metadata to a file.
-
-**Example**
-
-```
-~/Documents/
-├── Video.mp4
-└── Video.mp4.s2meta
-```
-
-The sidecar file must have exactly the same file path and the `.s2meta` suffix.
-
-`Video.mp4.s2meta`:
-```json
-{
-  "content": "This sidecar file will overwrite some metadata fields of Video.mp4",
-  "author": "Some author",
-  "duration": 12345,
-  "bitrate": 67890,
-  "some_arbitrary_field": [1,2,3]
-}
-```
-
-```
-sist2 scan ~/Documents -o ./docs.sist2
-sist2 index ./docs.sist2
-```
-
-*NOTE*: It is technically possible to overwrite the `tag` value using sidecar files, however,
-it is not currently possible to restore both manual tags and sidecar tags without user scripts
-while reindexing.
+See [scripting](scripting.md) documentation.
\ No newline at end of file
diff --git a/docs/embeddings-1.png b/docs/embeddings-1.png
new file mode 100644
index 0000000..44609c0
Binary files /dev/null and b/docs/embeddings-1.png differ
diff --git a/docs/embeddings-2.png b/docs/embeddings-2.png
new file mode 100644
index 0000000..45c0041
Binary files /dev/null and b/docs/embeddings-2.png differ