diff --git a/content/posts/lg1.md b/content/posts/lg1.md
new file mode 100644
index 0000000..8f5d18d
--- /dev/null
+++ b/content/posts/lg1.md
@@ -0,0 +1,142 @@
+---
+title: "Running a LibGen cache server"
+date: 2020-02-20T15:39:04-05:00
+author: simon987
+tags: [torrents]
+---
+
+# nginx proxy cache
+Not too long ago, there was [an initiative](https://www.vice.com/en_us/article/pa7jxb/archivists-are-trying-to-make-sure-a-pirate-bay-of-science-never-goes-down) to secure the books and scientific papers of the *Library Genesis* project.
+It attracted many new seeders and project contributors; however, I noticed that the daily database dumps were becoming
+slower and slower to download because of the increased traffic.
+
+{{< figure src="/lg/curl1.png" title="43kBit/s from origin server">}}
+
+I decided to contribute some of my bandwidth to the project by creating a mirror of the database dump files.
+The initial idea was to write a bash script that would periodically download all new dumps, clean up the old ones and
+somehow handle connection problems and duplicate files. I quickly realized that this solution could become a hassle to
+maintain, so I opted for a simpler alternative.
+
+## Basic nginx setup
+
+The following configuration is all that is needed to get a cache server up and running:
+
+{{< highlight nginx >}}
+proxy_cache_path /files/.cache/
+    levels=1:2
+    keys_zone=libgen_cache:1m
+    max_size=90g inactive=72h use_temp_path=off;
+
+location / {
+    proxy_cache libgen_cache;
+
+    proxy_ignore_headers X-Accel-Expires Expires Cache-Control;
+    proxy_cache_valid any 168h;
+    proxy_cache_revalidate on;
+
+    add_header X-Cache-Status $upstream_cache_status;
+    proxy_pass http://gen.lib.rus.ec/dbdumps/;
+}
+{{< /highlight >}}
+
+The `proxy_cache_path` directive initializes a 90 GB cache folder. Entries that are not
+accessed for more than 72 hours are periodically purged.
+See [ngx\_http\_proxy\_module](https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache_path)
+for more details about all its options.
+
+In the `location` block, we tell nginx to ignore the origin server's caching headers and to consider all cached
+items valid for (an arbitrary value of) one week. After 168 hours, a file is considered *stale*,
+but it will still be served from the cache if it wasn't modified on the origin server (nginx revalidates
+stale entries with the `If-Modified-Since` header).
+
+The `X-Cache-Status` header (populated from `$upstream_cache_status`) tells the client whether they're
+downloading from the origin server or from the cache.
+
+{{< figure src="/lg/cachehit.png" title="X-Cache-Status header">}}
+
+## Download speed improvements
+
+The initial download of `libgen.rar` took 3h12m (~300 kB/s). When I re-downloaded the file immediately after,
+I was able to saturate my home connection and finish the download in 3 minutes, about 60 times faster!
+
+You can find the cache server at [lgmirror.simon987.net](https://lgmirror.simon987.net/).
+
+{{< highlight text >}}
+Connecting to lgmirror.simon987.net (lgmirror.simon987.net)|104.31.86.142|:443... connected.
+HTTP request sent, awaiting response... 200 OK
+Length: 3431679578 (3.2G) [application/x-rar-compressed]
+Saving to: ‘libgen_2020-02-17.rar’
+2020-02-20 18:16:22 (13.9 MB/s) - ‘libgen_2020-02-17.rar’ saved [3431679578/3431679578]
+{{< /highlight >}}
+
+### Limitations and workarounds
+
+I noticed that the file listing has shortcuts (symlinks) pointing to the latest database dump.
+Unfortunately, due to the way nginx's `proxy_cache` module works, both files would need
+to be pulled from the origin server and cached separately, even if they are identical.
+
+{{< figure src="/lg/dbdump.png" title="libgen.rar is symlinked to today's dump">}}
+
+Since I'm not aware of a reliable way to create an HTTP redirect based on the current date,
+the workaround for now is to force users to use the `*_yyyy-mm-dd.rar` files:
+
+{{< highlight nginx >}}
+location /libgen.rar {
+    add_header Content-Type "text/plain;charset=UTF-8";
+    return 200 'Please download libgen_yyyy-mm-dd.rar instead.\nПожалуйста, скачайте libgen_yyyy-mm-dd.rar.\n';
+}
+{{< /highlight >}}
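+
+In principle, a date-based redirect *might* be possible with a `map` on nginx's built-in
+`$time_iso8601` variable. The sketch below is untested, and it would break whenever the
+current day's dump hasn't been uploaded yet, which is why I serve the plain-text hint instead:
+
+{{< highlight nginx >}}
+# http{} context: capture "yyyy-mm-dd" from e.g. "2020-02-20T15:39:04-05:00"
+map $time_iso8601 $today {
+    default                        "";
+    "~^(?<date>\d{4}-\d{2}-\d{2})" $date;
+}
+
+# server{} context: redirect the shortcut to the dated file
+location = /libgen.rar {
+    return 302 /libgen_$today.rar;
+}
+{{< /highlight >}}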
+
+## Full configuration
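+
+For reference, here is everything assembled into one file. Consider it a sketch rather than a
+drop-in config: the `listen`/`server_name`/TLS lines are placeholders for my actual setup, and
+`proxy_cache_path` must sit at the `http{}` level (e.g. in a file included from nginx.conf).
+
+{{< highlight nginx >}}
+proxy_cache_path /files/.cache/
+    levels=1:2
+    keys_zone=libgen_cache:1m
+    max_size=90g inactive=72h use_temp_path=off;
+
+server {
+    listen 443 ssl;
+    server_name lgmirror.simon987.net;
+    # ssl_certificate / ssl_certificate_key omitted
+
+    # Serve a hint instead of caching the symlinked file twice
+    location /libgen.rar {
+        add_header Content-Type "text/plain;charset=UTF-8";
+        return 200 'Please download libgen_yyyy-mm-dd.rar instead.\nПожалуйста, скачайте libgen_yyyy-mm-dd.rar.\n';
+    }
+
+    # Everything else is proxied to the origin server and cached for a week
+    location / {
+        proxy_cache libgen_cache;
+
+        proxy_ignore_headers X-Accel-Expires Expires Cache-Control;
+        proxy_cache_valid any 168h;
+        proxy_cache_revalidate on;
+
+        add_header X-Cache-Status $upstream_cache_status;
+        proxy_pass http://gen.lib.rus.ec/dbdumps/;
+    }
+}
+{{< /highlight >}}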
diff --git a/content/posts/mg.md b/content/posts/mg.md
new file mode 100644
index 0000000..d66413f
--- /dev/null
+++ b/content/posts/mg.md
@@ -0,0 +1,20 @@
+---
+title: "Mg"
+date: 2019-12-14T20:14:25-05:00
+draft: true
+---
+
diff --git a/content/posts/sist2.md b/content/posts/sist2.md
index 44fc564..a809e32 100644
--- a/content/posts/sist2.md
+++ b/content/posts/sist2.md
@@ -2,7 +2,7 @@
 title: "Indexing your files with sist2"
 date: 2019-11-04T19:31:45-05:00
 draft: false
-tags: ["data curation", "misc"]
+tags: ["data curation"]
 author: simon987
 ---
 
diff --git a/content/posts/tt_1.md b/content/posts/tt_1.md
index 9ae88d4..0d64f37 100644
--- a/content/posts/tt_1.md
+++ b/content/posts/tt_1.md
@@ -7,9 +7,10 @@ author: simon987
 ---
 
 I built a tool to simplify long-running scraping tasks. **task_tracker** is a simple job queue
-with a web frontend. This is a simple demo of a common use-case.
+with a web frontend. This is a quick demo of a common use-case.
 
-Let's start with a simple script I use to aggregate data from Spotify's API:
+Let's start with a short script I use to aggregate data from Spotify's API:
 
 {{< highlight python >}}
 import spotipy
+import psycopg2
@@ -28,11 +28,15 @@ def search_artist(name, mbid):
 
     res = spotify.search(name, type="artist", limit=20)
     sys.stdout = sys.__stdout__
 
-    with sqlite3.connect(dbfile) as conn:
-        conn.execute("INSERT INTO artist (mbid, query, data) VALUES (?,?,?)",
-                     (mbid, name, json.dumps(res)))
-        conn.commit()
+    with psycopg2.connect(CONNSTR) as conn:
+        # psycopg2 needs a cursor and uses %s placeholders (sqlite3 used ?)
+        with conn.cursor() as cur:
+            cur.execute("INSERT INTO artist (mbid, query, data) VALUES (%s,%s,%s)",
+                        (mbid, name, json.dumps(res)))
+        conn.commit()
 {{< /highlight >}}
 
+The `CONNSTR` variable is the PostgreSQL connection string, read from the tracker's project secret (see the `DB` entry below).
+
 I need to call `search_artist()` about 350'000 times and I don't want to bother setting up
 multithreading, error handling and keeping the script up to date on an arbitrary server, so let's
 integrate it in the tracker.
@@ -95,6 +97,7 @@
 try:
     CLIENT_ID = secret["CLIENT_ID"]
     CLIENT_SECRET = secret["CLIENT_SECRET"]
+    DB = secret["DB"]
 
     client_credentials_manager = SpotifyClientCredentials(client_id=CLIENT_ID, client_secret=CLIENT_SECRET)
     spotify = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
diff --git a/content/posts/zpaq.md b/content/posts/zpaq.md
index 9278583..04dbc4e 100644
--- a/content/posts/zpaq.md
+++ b/content/posts/zpaq.md
@@ -2,6 +2,7 @@
 title: "Android phone backups with zpaq"
 date: 2019-11-05T13:16:27-05:00
 draft: true
+tags: ["backup"]
 author: simon987
 ---
 
diff --git a/static/lg/cachehit.png b/static/lg/cachehit.png
new file mode 100644
index 0000000..c19c16c
Binary files /dev/null and b/static/lg/cachehit.png differ
diff --git a/static/lg/curl1.png b/static/lg/curl1.png
new file mode 100644
index 0000000..8a930dd
Binary files /dev/null and b/static/lg/curl1.png differ
diff --git a/static/lg/dbdump.png b/static/lg/dbdump.png
new file mode 100644
index 0000000..c945a09
Binary files /dev/null and b/static/lg/dbdump.png differ
diff --git a/static/mg/panzoom.min.js b/static/mg/panzoom.min.js
new file mode 100644
index 0000000..875d5b5
--- /dev/null
+++ b/static/mg/panzoom.min.js
@@ -0,0 +1,6 @@
+/**
+ * Panzoom for panning and zooming elements using CSS transforms
+ * Copyright Timmy Willison and other contributors
+ * https://github.com/timmywil/panzoom/blob/master/MIT-License.txt
+ */
+!function(e,t){"object"==typeof exports&&"undefined"!=typeof module?module.exports=t():"function"==typeof define&&define.amd?define(t):(e=e||self).Panzoom=t()}(this,function(){"use strict";var r,T=function(){return(T=Object.assign||function(e){for(var t,n=1,o=arguments.length;n