This commit is contained in:
simon 2020-02-21 22:26:20 -05:00
parent c67f653d3d
commit b49d75d5a9
10 changed files with 133 additions and 4 deletions

content/posts/lg1.md Normal file

@ -0,0 +1,99 @@
---
title: "Running a LibGen cache server"
date: 2020-02-20T15:39:04-05:00
author: simon987
tags: [torrents]
---
# nginx proxy cache
Not too long ago, there was [an initiative](https://www.vice.com/en_us/article/pa7jxb/archivists-are-trying-to-make-sure-a-pirate-bay-of-science-never-goes-down) to secure the books and scientific papers of the *Library Genesis* project.
It attracted many new seeders and project contributors; however, I noticed that the daily database dumps were becoming
slower and slower to download because of the increased traffic.
{{< figure src="/lg/curl1.png" title="43 kB/s from the origin server">}}
I decided to contribute some of my bandwidth to the project by creating a mirror of the database dump files.
The initial idea was to write a bash script that would periodically download new dumps, clean up the old ones, and
somehow handle connection problems and duplicate files. I quickly realized that this solution could become a hassle to
maintain, so I opted for a simpler alternative.
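For reference, the rejected approach would have looked roughly like this (a hypothetical sketch; the local paths and the one-week retention are made up):
{{<highlight bash >}}
# Hypothetical cron job: mirror new dumps, then expire anything older than a week.
# Retries, partial downloads and symlinked duplicates would all need extra handling.
wget --mirror --no-parent --accept '*.rar' --directory-prefix /files/dbdumps/ \
    http://gen.lib.rus.ec/dbdumps/
find /files/dbdumps/ -name '*.rar' -mtime +7 -delete
{{</highlight>}}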
## Basic nginx setup
The following configuration is all that is needed to get a cache server up and running
(`proxy_cache_path` belongs in the `http` block; the `location` block goes inside a `server` block):
{{<highlight nginx >}}
proxy_cache_path /files/.cache/
                 levels=1:2
                 keys_zone=libgen_cache:1m
                 max_size=90g inactive=72h use_temp_path=off;

location / {
    proxy_cache libgen_cache;
    proxy_ignore_headers X-Accel-Expires Expires Cache-Control;
    proxy_cache_valid any 168h;
    proxy_cache_revalidate on;
    add_header X-Cache-Status $upstream_cache_status;
    proxy_pass http://gen.lib.rus.ec/dbdumps/;
}
{{</highlight>}}
The `proxy_cache_path` directive sets up a cache directory capped at 90 GB. Entries that are not
accessed for more than 72 hours are periodically purged, and the 1 MB `keys_zone` keeps the cache keys
in shared memory (enough for roughly 8000 entries).
See [ngx\_http\_proxy\_module](https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache_path)
for more details about all its options.
In the `location` block, we tell nginx to ignore the origin server's caching headers and to consider all cached
items valid for (an arbitrary value of) one week. After 168 hours, a file is considered *stale*,
but it will still be served from the cache as long as it wasn't modified on the origin server
(nginx revalidates stale entries using the `If-Modified-Since` header).
The `X-Cache-Status` response header, set from the `$upstream_cache_status` variable, tells the client whether
they're downloading from the origin server or from the cache.
{{< figure src="/lg/cachehit.png" title="X-Cache-Status header">}}
## Download speed improvements
The initial download of `libgen.rar` took 3h12m (~300 kB/s). When I re-downloaded the file immediately after,
I was able to saturate my home connection and finish the download in 3 minutes, about 60 times faster!
You can find the cache server at [lgmirror.simon987.net](https://lgmirror.simon987.net/).
{{<highlight "" >}}
Connecting to lgmirror.simon987.net (lgmirror.simon987.net)|104.31.86.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3431679578 (3.2G) [application/x-rar-compressed]
Saving to: libgen_2020-02-17.rar
2020-02-20 18:16:22 (13.9 MB/s) - libgen_2020-02-17.rar saved [3431679578/3431679578]
{{</highlight>}}
### Limitations and workarounds
I noticed that the file listing has shortcuts pointing to the latest database dump.
Unfortunately, nginx caches entries by URL (the default `proxy_cache_key` includes the request URI), so both
files would need to be pulled from the origin server, even though they are identical.
{{< figure src="/lg/dbdump.png" title="libgen.rar is symlinked to today's dump">}}
Since I'm not aware of a built-in way to create an HTTP redirect based on the current date,
the workaround for now is to force users to download the `*_yyyy-mm-dd.rar` files instead:
{{<highlight nginx >}}
location /libgen.rar {
    add_header Content-Type "text/plain;charset=UTF-8";
    # The second sentence repeats the message in Russian.
    return 200 'Please download libgen_yyyy-mm-dd.rar instead.\nПожалуйста, скачайте libgen_yyyy-mm-dd.rar.\n';
}
{{</highlight>}}
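(In principle, nginx exposes the current time as `$time_iso8601`, so a redirect along these lines might work;
this is an untested sketch, and it assumes a dump named after the current date always exists on the origin server.)
{{<highlight nginx >}}
# Hypothetical alternative: extract yyyy-mm-dd from $time_iso8601
# and redirect libgen.rar to today's dated dump.
location = /libgen.rar {
    if ($time_iso8601 ~ "^(\d{4})-(\d{2})-(\d{2})") {
        return 302 /libgen_$1-$2-$3.rar;
    }
}
{{</highlight>}}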
## Full configuration
<noscript>
<a href="https://gist.github.com/simon987/40f8d81878c45e43a6b91db327d8f4c0#file-libgen_mirror-conf">libgen_mirror.conf</a>
</noscript>
<script src="https://gist.github.com/simon987/40f8d81878c45e43a6b91db327d8f4c0.js"></script>

content/posts/mg.md Normal file

@ -0,0 +1,20 @@
---
title: "Mg"
date: 2019-12-14T20:14:25-05:00
draft: true
---
<div style="height: 400px">
    <img id="graph" src="file:///home/drone/test.svg" style="width: 100%;height: 100%">
</div>
<script src="/mg/panzoom.min.js"></script>
<script>
    // Make the graph pannable, and zoomable with the mouse wheel.
    const graph = document.getElementById("graph");
    const panzoom = Panzoom(graph, {startScale: 1});
    graph.addEventListener('wheel', (e) => panzoom.zoomWithWheel(e));
</script>

@ -2,7 +2,7 @@
 title: "Indexing your files with sist2"
 date: 2019-11-04T19:31:45-05:00
 draft: false
-tags: ["data curation", "misc"]
+tags: ["data curation"]
 author: simon987
 ---

@ -7,9 +7,9 @@ author: simon987
 ---
 I built a tool to simplify long-running scraping tasks. **task_tracker** is a simple job queue
-with a web frontend. This is a simple demo of a common use-case.
+with a web frontend. This is a quick demo of a common use-case.
-Let's start with a simple script I use to aggregate data from Spotify's API:
+Let's start with a short script I use to aggregate data from Spotify's API:
 {{<highlight python >}}
 import spotipy
@ -28,11 +28,13 @@ def search_artist(name, mbid):
     res = spotify.search(name, type="artist", limit=20)
     sys.stdout = sys.__stdout__
-    with sqlite3.connect(dbfile) as conn:
-        conn.execute("INSERT INTO artist (mbid, query, data) VALUES (?,?,?)", (mbid, name, json.dumps(res)))
-        conn.commit()
+    with psycopg2.connect(CONNSTR) as conn:
+        with conn.cursor() as cur:
+            cur.execute("INSERT INTO artist (mbid, query, data) VALUES (%s,%s,%s)", (mbid, name, json.dumps(res)))
 {{</highlight>}}
+The `CONNSTR` variable is given in the project's secret (see `DB` below).
 I need to call `search_artist()` about 350'000 times, and I don't want to bother setting up multithreading,
 error handling and keeping the script up to date on an arbitrary server, so let's integrate it into the tracker.
@ -95,6 +97,7 @@ try:
     CLIENT_ID = secret["CLIENT_ID"]
     CLIENT_SECRET = secret["CLIENT_SECRET"]
+    DB = secret["DB"]
     client_credentials_manager = SpotifyClientCredentials(client_id=CLIENT_ID, client_secret=CLIENT_SECRET)
     spotify = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

@ -2,6 +2,7 @@
 title: "Android phone backups with zpaq"
 date: 2019-11-05T13:16:27-05:00
 draft: true
+tags: ["backup"]
 author: simon987
 ---
BIN static/lg/cachehit.png Normal file (174 KiB)

BIN static/lg/curl1.png Normal file (55 KiB)

BIN static/lg/dbdump.png Normal file (101 KiB)

static/mg/panzoom.min.js vendored Normal file (diff suppressed because one or more lines are too long)
BIN binary file changed (19 KiB → 54 KiB)