mirror of
https://github.com/simon987/sist2.git
synced 2025-12-19 10:19:03 +00:00
Compare commits
8 Commits
e1e22fd79a...3.3.6
| SHA1 |
|---|
| 49a21a5a25 |
| 560aa82ce7 |
| b8c905bd64 |
| 8299237ea0 |
| 31646a2747 |
| d9d77de47f |
| 5f0957d029 |
| 1cc48f7f33 |
@@ -4,6 +4,8 @@
|
|||||||
|
|
||||||
**Demo**: [sist2.simon987.net](https://sist2.simon987.net/)
|
**Demo**: [sist2.simon987.net](https://sist2.simon987.net/)
|
||||||
|
|
||||||
|
**Community URL:** [Discord](https://discord.gg/2PEjDy3Rfs)
|
||||||
|
|
||||||
# sist2
|
# sist2
|
||||||
|
|
||||||
sist2 (Simple incremental search tool)
|
sist2 (Simple incremental search tool)
|
||||||
@@ -46,7 +48,7 @@ services:
|
|||||||
- "discovery.type=single-node"
|
- "discovery.type=single-node"
|
||||||
- "ES_JAVA_OPTS=-Xms2g -Xmx2g"
|
- "ES_JAVA_OPTS=-Xms2g -Xmx2g"
|
||||||
sist2-admin:
|
sist2-admin:
|
||||||
image: simon987/sist2:3.1.4-x64-linux
|
image: simon987/sist2:3.3.4-x64-linux
|
||||||
restart: unless-stopped
|
restart: unless-stopped
|
||||||
volumes:
|
volumes:
|
||||||
- ./sist2-admin-data/:/sist2-admin/
|
- ./sist2-admin-data/:/sist2-admin/
|
||||||
@@ -157,6 +159,7 @@ indices, but it uses much less memory and is easier to set up.
|
|||||||
| Manual tagging | ✓ | ✓ |
|
| Manual tagging | ✓ | ✓ |
|
||||||
| User scripts | ✓ | ✓ |
|
| User scripts | ✓ | ✓ |
|
||||||
| Media Type breakdown for search results | | ✓ |
|
| Media Type breakdown for search results | | ✓ |
|
||||||
|
| Embeddings search | ✓ *O(n)* | ✓ *O(logn)* |
|
||||||
|
|
||||||
### NER
|
### NER
|
||||||
|
|
||||||
|
|||||||
@@ -175,6 +175,32 @@ Using a version >=7.14.0 is recommended to enable the following features:
|
|||||||
When using a legacy version of ES, a notice will be displayed next to the sist2 version in the web UI.
|
When using a legacy version of ES, a notice will be displayed next to the sist2 version in the web UI.
|
||||||
If you don't care about the features above, you can ignore it or disable it in the configuration page.
|
If you don't care about the features above, you can ignore it or disable it in the configuration page.
|
||||||
|
|
||||||
|
# Embeddings search
|
||||||
|
|
||||||
|
Since v3.2.0, User scripts can be used to generate _embeddings_ (vectors of float32 numbers), which are stored in the .sist2 index file
|
||||||
|
(see [scripting](scripting.md)). Embeddings can be used for:
|
||||||
|
|
||||||
|
* Nearest-neighbor queries (e.g. "return the documents most similar to this one")
|
||||||
|
* Semantic searches (e.g. "return the documents that are most closely related to the given topic")
|
||||||
|
|
||||||
|
In theory, embeddings can be created for any type of document (image, text, audio, etc.).
|
||||||
|
|
||||||
|
For example, the [clip](https://github.com/simon987/sist2-script-clip) User Script generates 512-d embeddings of images
|
||||||
|
(videos are also supported using the thumbnails generated by sist2). When the user enters a query in the "Embeddings Search"
|
||||||
|
textbox, the query's embedding is generated in their browser, leveraging the ONNX web runtime.
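To make the nearest-neighbor idea concrete, here is a minimal sketch of a brute-force search over stored embeddings. It assumes cosine similarity as the metric; the text above does not say which metric sist2 uses. The function names and the shape of the `docs` array are hypothetical and not part of sist2. A linear scan like this one costs O(n) in the number of documents.

```javascript
// Hypothetical sketch: brute-force nearest-neighbor search over embeddings.
// None of these names come from sist2; cosine similarity is an assumption.

// Cosine similarity between two vectors of equal dimension.
function cosineSimilarity(a, b) {
    if (a.length !== b.length) {
        throw new Error("Embeddings must have the same dimension");
    }
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    // A zero vector has no direction; treat it as unrelated to everything.
    if (normA === 0 || normB === 0) {
        return 0;
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every document against the query and keep the k best matches.
function nearestNeighbors(query, docs, k) {
    return docs
        .map(doc => ({id: doc.id, score: cosineSimilarity(query, doc.embedding)}))
        .sort((x, y) => y.score - x.score)
        .slice(0, k);
}

// Toy 3-d embeddings for illustration (real CLIP embeddings are 512-d).
const docs = [
    {id: "cat.jpg", embedding: new Float32Array([0.9, 0.1, 0.0])},
    {id: "dog.jpg", embedding: new Float32Array([0.8, 0.3, 0.1])},
    {id: "invoice.pdf", embedding: new Float32Array([0.0, 0.2, 0.9])},
];
const results = nearestNeighbors(new Float32Array([1, 0, 0]), docs, 2);
console.log(results.map(r => r.id));
```

A semantic search follows the same pattern: the query text is first turned into an embedding, then ranked against the document embeddings in the same way.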
|
||||||
|
|
||||||
|
<details>
|
||||||
|
<summary>Screenshots</summary>
|
||||||
|
|
||||||
|

|
||||||
|

|
||||||
|
|
||||||
|
1. Embeddings search bar. You can select the model using the dropdown on the left.
|
||||||
|
2. This icon appears for indices with embeddings search enabled.
|
||||||
|
3. Documents with this icon have embeddings. Click on the icon to perform KNN search.
|
||||||
|
</details>
|
||||||
|
|
||||||
|
|
||||||
# Tagging
|
# Tagging
|
||||||
|
|
||||||
### Manual tagging
|
### Manual tagging
|
||||||
@@ -199,43 +225,4 @@ See [Automatic tagging](#automatic-tagging) for information about tag
|
|||||||
|
|
||||||
### Automatic tagging
|
### Automatic tagging
|
||||||
|
|
||||||
See [scripting](scripting.md) documentation.
|
See [scripting](scripting.md) documentation.
|
||||||
|
|
||||||
# Sidecar files
|
|
||||||
|
|
||||||
When scanning, sist2 will read metadata from `.s2meta` JSON files and overwrite the
|
|
||||||
original document's indexed metadata (does not modify the actual file). Sidecar metadata files will also work inside archives.
|
|
||||||
Sidecar files themselves are not saved in the index.
|
|
||||||
|
|
||||||
This feature is useful to leverage third-party applications such as speech-to-text or
|
|
||||||
OCR to add additional metadata to a file.
|
|
||||||
|
|
||||||
**Example**
|
|
||||||
|
|
||||||
```
|
|
||||||
~/Documents/
|
|
||||||
├── Video.mp4
|
|
||||||
└── Video.mp4.s2meta
|
|
||||||
```
|
|
||||||
|
|
||||||
The sidecar file must have exactly the same file path and the `.s2meta` suffix.
|
|
||||||
|
|
||||||
`Video.mp4.s2meta`:
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"content": "This sidecar file will overwrite some metadata fields of Video.mp4",
|
|
||||||
"author": "Some author",
|
|
||||||
"duration": 12345,
|
|
||||||
"bitrate": 67890,
|
|
||||||
"some_arbitrary_field": [1,2,3]
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
```
|
|
||||||
sist2 scan ~/Documents -o ./docs.sist2
|
|
||||||
sist2 index ./docs.sist2
|
|
||||||
```
|
|
||||||
|
|
||||||
*NOTE*: It is technically possible to overwrite the `tag` value using sidecar files, however,
|
|
||||||
it is not currently possible to restore both manual tags and sidecar tags without user scripts
|
|
||||||
while reindexing.
|
|
||||||
docs/embeddings-1.png (new binary file, not shown; size: 90 KiB)
docs/embeddings-2.png (new binary file, not shown; size: 996 KiB)
@@ -33,18 +33,6 @@ class Sist2Api {
|
|||||||
|
|
||||||
getSist2Info() {
|
getSist2Info() {
|
||||||
return axios.get(`${this.baseUrl}i`).then(resp => {
|
return axios.get(`${this.baseUrl}i`).then(resp => {
|
||||||
const indices = resp.data.indices;
|
|
||||||
|
|
||||||
resp.data.indices = indices.map(idx => {
|
|
||||||
return {
|
|
||||||
id: idx.id,
|
|
||||||
name: idx.name,
|
|
||||||
timestamp: idx.timestamp,
|
|
||||||
version: idx.version,
|
|
||||||
models: idx.models,
|
|
||||||
};
|
|
||||||
});
|
|
||||||
|
|
||||||
this.sist2Info = resp.data;
|
this.sist2Info = resp.data;
|
||||||
|
|
||||||
return resp.data;
|
return resp.data;
|
||||||
@@ -155,6 +143,12 @@ class Sist2Api {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
_getIndexRoot(indexId) {
|
||||||
|
console.log(indexId)
|
||||||
|
console.log(this.sist2Info.indices.find(idx => idx.id === indexId))
|
||||||
|
return this.sist2Info.indices.find(idx => idx.id === indexId).root;
|
||||||
|
}
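The helper added above logs to the console on every call and throws if no index matches `indexId`. Elsewhere in this diff, the server only emits `root` when `SIST_DEBUG_INFO` is defined, so the property can also be missing. A defensive variant could look like the sketch below. The standalone function `getIndexRoot` and the `null` fallback are assumptions for illustration, not the project's code.

```javascript
// Hypothetical defensive variant of _getIndexRoot, written as a standalone
// function so it can run outside the Sist2Api class.
function getIndexRoot(sist2Info, indexId) {
    // Tolerate a missing or not-yet-loaded info object.
    const indices = (sist2Info && sist2Info.indices) || [];
    const index = indices.find(idx => idx.id === indexId);
    // `root` may be absent when the server was built without SIST_DEBUG_INFO.
    return index && index.root !== undefined ? index.root : null;
}

const info = {
    indices: [
        {id: "abc", root: "/data"},
        {id: "def"}, // no root exposed for this index
    ],
};
console.log(getIndexRoot(info, "abc"));
console.log(getIndexRoot(info, "def"));
console.log(getIndexRoot(info, "missing"));
```

Returning `null` instead of throwing lets callers such as `esQuery` keep processing the remaining hits when one index has no root.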
|
||||||
|
|
||||||
esQuery(query) {
|
esQuery(query) {
|
||||||
return axios.post(`${this.baseUrl}es`, query).then(resp => {
|
return axios.post(`${this.baseUrl}es`, query).then(resp => {
|
||||||
const res = resp.data;
|
const res = resp.data;
|
||||||
@@ -163,6 +157,7 @@ class Sist2Api {
|
|||||||
res.hits.hits.forEach((hit) => {
|
res.hits.hits.forEach((hit) => {
|
||||||
hit["_source"]["name"] = strUnescape(hit["_source"]["name"]);
|
hit["_source"]["name"] = strUnescape(hit["_source"]["name"]);
|
||||||
hit["_source"]["path"] = strUnescape(hit["_source"]["path"]);
|
hit["_source"]["path"] = strUnescape(hit["_source"]["path"]);
|
||||||
|
hit["_source"]["indexRoot"] = this._getIndexRoot(hit["_source"]["index"]);
|
||||||
|
|
||||||
this.setHitProps(hit);
|
this.setHitProps(hit);
|
||||||
this.setHitTags(hit);
|
this.setHitTags(hit);
|
||||||
|
|||||||
@@ -90,6 +90,7 @@ subreq_ctx_t *web_post_async(const char *url, char *data, int insecure) {
|
|||||||
curl_easy_setopt(curl, CURLOPT_USERAGENT, "sist2");
|
curl_easy_setopt(curl, CURLOPT_USERAGENT, "sist2");
|
||||||
if (insecure) {
|
if (insecure) {
|
||||||
curl_easy_setopt(curl, CURLOPT_SSL_VERIFYPEER, 0);
|
curl_easy_setopt(curl, CURLOPT_SSL_VERIFYPEER, 0);
|
||||||
|
curl_easy_setopt(curl, CURLOPT_SSL_VERIFYHOST, 0);
|
||||||
}
|
}
|
||||||
|
|
||||||
curl_easy_setopt(curl, CURLOPT_ERRORBUFFER, req->curl_err_buffer);
|
curl_easy_setopt(curl, CURLOPT_ERRORBUFFER, req->curl_err_buffer);
|
||||||
@@ -123,6 +124,7 @@ response_t *web_get(const char *url, int timeout, int insecure) {
|
|||||||
curl_easy_setopt(curl, CURLOPT_TIMEOUT, timeout);
|
curl_easy_setopt(curl, CURLOPT_TIMEOUT, timeout);
|
||||||
if (insecure) {
|
if (insecure) {
|
||||||
curl_easy_setopt(curl, CURLOPT_SSL_VERIFYPEER, 0);
|
curl_easy_setopt(curl, CURLOPT_SSL_VERIFYPEER, 0);
|
||||||
|
curl_easy_setopt(curl, CURLOPT_SSL_VERIFYHOST, 0);
|
||||||
}
|
}
|
||||||
|
|
||||||
struct curl_slist *headers = NULL;
|
struct curl_slist *headers = NULL;
|
||||||
@@ -162,6 +164,7 @@ response_t *web_post(const char *url, const char *data, int insecure) {
|
|||||||
curl_easy_setopt(curl, CURLOPT_USERAGENT, "sist2");
|
curl_easy_setopt(curl, CURLOPT_USERAGENT, "sist2");
|
||||||
if (insecure) {
|
if (insecure) {
|
||||||
curl_easy_setopt(curl, CURLOPT_SSL_VERIFYPEER, 0);
|
curl_easy_setopt(curl, CURLOPT_SSL_VERIFYPEER, 0);
|
||||||
|
curl_easy_setopt(curl, CURLOPT_SSL_VERIFYHOST, 0);
|
||||||
}
|
}
|
||||||
|
|
||||||
char err_buffer[CURL_ERROR_SIZE + 1] = {};
|
char err_buffer[CURL_ERROR_SIZE + 1] = {};
|
||||||
@@ -207,6 +210,7 @@ response_t *web_put(const char *url, const char *data, int insecure) {
|
|||||||
curl_easy_setopt(curl, CURLOPT_IPRESOLVE, CURLOPT_DNS_LOCAL_IP4);
|
curl_easy_setopt(curl, CURLOPT_IPRESOLVE, CURLOPT_DNS_LOCAL_IP4);
|
||||||
if (insecure) {
|
if (insecure) {
|
||||||
curl_easy_setopt(curl, CURLOPT_SSL_VERIFYPEER, 0);
|
curl_easy_setopt(curl, CURLOPT_SSL_VERIFYPEER, 0);
|
||||||
|
curl_easy_setopt(curl, CURLOPT_SSL_VERIFYHOST, 0);
|
||||||
}
|
}
|
||||||
|
|
||||||
struct curl_slist *headers = NULL;
|
struct curl_slist *headers = NULL;
|
||||||
@@ -241,6 +245,7 @@ response_t *web_delete(const char *url, int insecure) {
|
|||||||
curl_easy_setopt(curl, CURLOPT_USERAGENT, "sist2");
|
curl_easy_setopt(curl, CURLOPT_USERAGENT, "sist2");
|
||||||
if (insecure) {
|
if (insecure) {
|
||||||
curl_easy_setopt(curl, CURLOPT_SSL_VERIFYPEER, 0);
|
curl_easy_setopt(curl, CURLOPT_SSL_VERIFYPEER, 0);
|
||||||
|
curl_easy_setopt(curl, CURLOPT_SSL_VERIFYHOST, 0);
|
||||||
}
|
}
|
||||||
|
|
||||||
curl_easy_setopt(curl, CURLOPT_POSTFIELDS, "");
|
curl_easy_setopt(curl, CURLOPT_POSTFIELDS, "");
|
||||||
|
|||||||
@@ -51,11 +51,11 @@
|
|||||||
#include <ctype.h>
|
#include <ctype.h>
|
||||||
#include "git_hash.h"
|
#include "git_hash.h"
|
||||||
|
|
||||||
#define VERSION "3.3.2"
|
#define VERSION "3.3.6"
|
||||||
static const char *const Version = VERSION;
|
static const char *const Version = VERSION;
|
||||||
static const int VersionMajor = 3;
|
static const int VersionMajor = 3;
|
||||||
static const int VersionMinor = 3;
|
static const int VersionMinor = 3;
|
||||||
static const int VersionPatch = 2;
|
static const int VersionPatch = 6;
|
||||||
|
|
||||||
#ifndef SIST_PLATFORM
|
#ifndef SIST_PLATFORM
|
||||||
#define SIST_PLATFORM unknown
|
#define SIST_PLATFORM unknown
|
||||||
|
|||||||
@@ -88,7 +88,7 @@ void stats_files(struct mg_connection *nc, struct mg_http_message *hm) {
|
|||||||
|
|
||||||
memcpy(index_id_str, hm->uri.ptr + 3, 8);
|
memcpy(index_id_str, hm->uri.ptr + 3, 8);
|
||||||
*(index_id_str + 8) = '\0';
|
*(index_id_str + 8) = '\0';
|
||||||
int index_id = (int)strtol(index_id_str, NULL, 16);
|
int index_id = (int) strtol(index_id_str, NULL, 16);
|
||||||
|
|
||||||
memcpy(arg_stat_type, hm->uri.ptr + 3 + 9, 4);
|
memcpy(arg_stat_type, hm->uri.ptr + 3 + 9, 4);
|
||||||
*(arg_stat_type + sizeof(arg_stat_type) - 1) = '\0';
|
*(arg_stat_type + sizeof(arg_stat_type) - 1) = '\0';
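The fixed offsets above imply a URI layout of a 3-character prefix, an 8-digit hexadecimal index id, a separator, and a 4-character stat type. The sketch below mirrors that parsing in JavaScript. The `/s/` prefix, the function name `parseStatsUri`, and the added validation are assumptions for illustration; the C code shown here does not validate its input this way.

```javascript
// Hypothetical sketch of the offset-based parsing used by stats_files.
// Assumed layout: 3-char prefix + 8 hex digits + 1 separator + 4-char stat type.
function parseStatsUri(uri) {
    if (uri.length < 16) {
        return null; // too short to contain both fields
    }
    const idHex = uri.slice(3, 11);      // same bytes as hm->uri.ptr + 3, length 8
    const statType = uri.slice(12, 16);  // same bytes as hm->uri.ptr + 3 + 9, length 4
    if (!/^[0-9a-fA-F]{8}$/.test(idHex)) {
        return null; // not a valid 8-digit hex id
    }
    // `| 0` wraps the value to a signed 32-bit integer, which is what the
    // (int) cast of the strtol result does on common platforms.
    return {indexId: parseInt(idHex, 16) | 0, statType};
}

console.log(parseStatsUri("/s/0000002a/type"));
console.log(parseStatsUri("/s/ffffffff/type"));
console.log(parseStatsUri("/s/nothex!!/type"));
```

The second call shows that ids above `0x7fffffff` come out negative after the 32-bit wrap.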
|
||||||
@@ -368,6 +368,10 @@ void index_info(struct mg_connection *nc) {
|
|||||||
cJSON_AddNumberToObject(idx_json, "timestamp", (double) idx->desc.timestamp);
|
cJSON_AddNumberToObject(idx_json, "timestamp", (double) idx->desc.timestamp);
|
||||||
cJSON_AddItemToArray(arr, idx_json);
|
cJSON_AddItemToArray(arr, idx_json);
|
||||||
|
|
||||||
|
#ifdef SIST_DEBUG_INFO
|
||||||
|
cJSON_AddStringToObject(idx_json, "root", idx->desc.root);
|
||||||
|
#endif
|
||||||
|
|
||||||
cJSON *models = database_get_models(idx->db);
|
cJSON *models = database_get_models(idx->db);
|
||||||
cJSON_AddItemToObject(idx_json, "models", models);
|
cJSON_AddItemToObject(idx_json, "models", models);
|
||||||
}
|
}
|
||||||
@@ -480,7 +484,7 @@ tag_req_t *parse_tag_request(cJSON *json) {
|
|||||||
return req;
|
return req;
|
||||||
}
|
}
|
||||||
|
|
||||||
subreq_ctx_t *elastic_delete_tag(const char* sid, const tag_req_t *req) {
|
subreq_ctx_t *elastic_delete_tag(const char *sid, const tag_req_t *req) {
|
||||||
char *buf = malloc(sizeof(char) * 8192);
|
char *buf = malloc(sizeof(char) * 8192);
|
||||||
snprintf(buf, 8192,
|
snprintf(buf, 8192,
|
||||||
"{"
|
"{"
|
||||||
@@ -500,7 +504,7 @@ subreq_ctx_t *elastic_delete_tag(const char* sid, const tag_req_t *req) {
|
|||||||
return web_post_async(url, buf, WebCtx.es_insecure_ssl);
|
return web_post_async(url, buf, WebCtx.es_insecure_ssl);
|
||||||
}
|
}
|
||||||
|
|
||||||
subreq_ctx_t *elastic_write_tag(const char* sid, const tag_req_t *req) {
|
subreq_ctx_t *elastic_write_tag(const char *sid, const tag_req_t *req) {
|
||||||
char *buf = malloc(sizeof(char) * 8192);
|
char *buf = malloc(sizeof(char) * 8192);
|
||||||
snprintf(buf, 8192,
|
snprintf(buf, 8192,
|
||||||
"{"
|
"{"
|
||||||
|
|||||||