commit b49d75d5a9 (parent c67f653d3d): lg1

content/posts/lg1.md (new file, +99 lines)

---
title: "Running a LibGen cache server"
date: 2020-02-20T15:39:04-05:00
author: simon987
tags: [torrents]
---

# nginx proxy cache

Not too long ago, there was [an initiative](https://www.vice.com/en_us/article/pa7jxb/archivists-are-trying-to-make-sure-a-pirate-bay-of-science-never-goes-down) to secure the books and scientific papers of the *Library Genesis* project.
It attracted a lot of new seeders and project contributors. However, I noticed that the daily database dumps were becoming
slower and slower to download because of the increased traffic.

{{< figure src="/lg/curl1.png" title="43 kB/s from origin server">}}

I decided to contribute some of my bandwidth to the project by creating a mirror of the database dump files.
The initial idea was to write a bash script that would periodically download all new dumps, clean up the old ones, and
somehow handle connection problems and duplicate files. I quickly realized that this solution could become a hassle to
maintain, so I opted for a simpler alternative.

## Basic nginx setup

The following configuration is all that is needed to get a cache server up and running:

{{<highlight nginx >}}
# 90 GB on-disk cache; entries not accessed for 72h are purged
proxy_cache_path /files/.cache/
                 levels=1:2
                 keys_zone=libgen_cache:1m
                 max_size=90g inactive=72h use_temp_path=off;

location / {
    proxy_cache libgen_cache;

    # Ignore the origin's cache headers; consider everything valid for one week
    proxy_ignore_headers X-Accel-Expires Expires Cache-Control;
    proxy_cache_valid any 168h;
    # Revalidate stale entries with conditional requests instead of re-downloading
    proxy_cache_revalidate on;

    # Let clients see whether they got a cache HIT or MISS
    add_header X-Cache-Status $upstream_cache_status;
    proxy_pass http://gen.lib.rus.ec/dbdumps/;
}
{{</highlight>}}

The `proxy_cache_path` directive initializes a 90 GB cache folder. Entries that are not
accessed for more than 72 hours are periodically purged.
See [ngx\_http\_proxy\_module](https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache_path)
for more details about all its options.

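As an aside, `levels=1:2` spreads cache entries over two levels of subdirectories named after
the MD5 of the cache key, so the cache folder ends up looking something like this (hypothetical
entry shown):

{{<highlight "" >}}
/files/.cache/c/29/b7f54b2df7773722d382f4809d65029c
{{</highlight>}}
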
In the `location` block, we tell nginx to ignore the origin server's caching headers and to consider all cached
items valid for (an arbitrary value of) one week. After 168 hours, a file is considered *stale*,
but it will still be served from the cache if it wasn't modified on the origin server (nginx checks with an
`If-Modified-Since` conditional request).

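Since the whole point of the mirror is to shield a struggling origin, it could also make sense to
keep serving *stale* entries when the origin is unreachable. This is not part of my config, just an
untested sketch of the relevant directives:

{{<highlight nginx >}}
location / {
    proxy_cache libgen_cache;

    # Serve the stale copy when the origin errors out or times out,
    # and refresh expired entries in a background subrequest
    proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
    proxy_cache_background_update on;

    proxy_pass http://gen.lib.rus.ec/dbdumps/;
}
{{</highlight>}}
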
The `$upstream_cache_status` variable tells the client whether they're downloading from
the origin server or from the cache.

{{< figure src="/lg/cachehit.png" title="X-Cache-Status header">}}
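
For example, checking the header from the command line (hypothetical session; the dated file name
is only an example):

{{<highlight "" >}}
$ curl -sI https://lgmirror.simon987.net/libgen_2020-02-17.rar | grep X-Cache-Status
X-Cache-Status: HIT
{{</highlight>}}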

## Download speed improvements

The initial download of `libgen.rar` took 3h12m (~310 kB/s). When I re-downloaded the file immediately afterwards,
I was able to saturate my home connection and finish the download in 3 minutes, about 60 times faster!

You can find the cache server at [lgmirror.simon987.net](https://lgmirror.simon987.net/).

{{<highlight "" >}}
|
||||
Connecting to lgmirror.simon987.net (lgmirror.simon987.net)|104.31.86.142|:443... connected.
|
||||
HTTP request sent, awaiting response... 200 OK
|
||||
Length: 3431679578 (3.2G) [application/x-rar-compressed]
|
||||
Saving to: ‘libgen_2020-02-17.rar’
|
||||
2020-02-20 18:16:22 (13.9 MB/s) - ‘libgen_2020-02-17.rar’ saved [3431679578/3431679578]
|
||||
{{</highlight>}}

### Limitations and workarounds

I noticed that the file listing has shortcuts pointing to the latest database dump.
Unfortunately, due to the way nginx's `proxy_cache` module works (entries are keyed on the request URI),
both files would need to be pulled from the origin server separately, even though they are identical.

{{< figure src="/lg/dbdump.png" title="libgen.rar is symlinked to today's dump">}}

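For reference, this is nginx's default cache key, which is what makes the symlink and its target
two distinct cache entries:

{{<highlight nginx >}}
# Default value: /libgen.rar and /libgen_2020-02-17.rar produce different keys
proxy_cache_key $scheme$proxy_host$request_uri;
{{</highlight>}}
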
Since I'm not aware of a way to create an HTTP redirect based on the current date,
a workaround for now is to force users to use the `*_yyyy-mm-dd.rar` files:

{{<highlight nginx >}}
location /libgen.rar {
    add_header Content-Type "text/plain;charset=UTF-8";
    # Same message in English and in Russian
    return 200 'Please download libgen_yyyy-mm-dd.rar instead.\nПожалуйста, скачайте libgen_yyyy-mm-dd.rar.\n';
}
{{</highlight>}}
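
(An untested thought: nginx exposes the request time as `$time_iso8601`, so a `map` could in
principle extract today's date and build the redirect. Since there's no guarantee the origin has
already published a dump for the current date, the plain-text answer above stays for now.)

{{<highlight nginx >}}
# Untested sketch. At http{} level: capture yyyy-mm-dd from the request time
map $time_iso8601 $today {
    "~^(?<ymd>\d{4}-\d{2}-\d{2})" $ymd;
}

# In the server{} block: redirect to the dated file name
location = /libgen.rar {
    return 302 /libgen_$today.rar;
}
{{</highlight>}}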

## Full configuration

<noscript>
<a href="https://gist.github.com/simon987/40f8d81878c45e43a6b91db327d8f4c0#file-libgen_mirror-conf">libgen_mirror.conf</a>
</noscript>

<script src="https://gist.github.com/simon987/40f8d81878c45e43a6b91db327d8f4c0.js"></script>

content/posts/mg.md (new file, +20 lines)

---
title: "Mg"
date: 2019-12-14T20:14:25-05:00
draft: true
---

<div style="height: 400px">
<img id="graph" src="file:///home/drone/test.svg" style="width: 100%;height: 100%">
</div>

<script src="/mg/panzoom.min.js"></script>
<script>
// Make the graph pannable and zoomable with panzoom
const graph = document.getElementById("graph");
const panzoom = Panzoom(graph, {startScale: 1});
// Forward mouse wheel events to panzoom for zooming
graph.addEventListener('wheel', (e) => {
    panzoom.zoomWithWheel(e);
});
</script>

@@ -2,7 +2,7 @@
 title: "Indexing your files with sist2"
 date: 2019-11-04T19:31:45-05:00
 draft: false
-tags: ["data curation", "misc"]
+tags: ["data curation"]
 author: simon987
 ---

@@ -7,9 +7,9 @@ author: simon987
 ---

 I built a tool to simplify long-running scraping tasks processes. **task_tracker** is a simple job queue
-with a web frontend. This is a simple demo of a common use-case.
+with a web frontend. This is a quick demo of a common use-case.

-Let's start with a simple script I use to aggregate data from Spotify's API:
+Let's start with a short script I use to aggregate data from Spotify's API:

 {{<highlight python >}}
 import spotipy
@@ -28,11 +28,13 @@ def search_artist(name, mbid):
     res = spotify.search(name, type="artist", limit=20)
     sys.stdout = sys.__stdout__

-    with sqlite3.connect(dbfile) as conn:
+    with psycopg2.connect(CONNSTR) as conn:
         conn.execute("INSERT INTO artist (mbid, query, data) VALUES (?,?,?)", (mbid, name, json.dumps(res)))
         conn.commit()
 {{</highlight>}}

+The `CONNSTR` variable is given
+
 I need to call `search_artist()` about 350'000 times and I don't want to bother setting up multithreading, error handling and
 keeping the script up to date on an arbitrary server so let's integrate it in the tracker.

@@ -95,6 +97,7 @@ try:

     CLIENT_ID = secret["CLIENT_ID"]
     CLIENT_SECRET = secret["CLIENT_SECRET"]
+    DB = secret["DB"]
     client_credentials_manager = SpotifyClientCredentials(client_id=CLIENT_ID, client_secret=CLIENT_SECRET)
     spotify = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

@@ -2,6 +2,7 @@
 title: "Android phone backups with zpaq"
 date: 2019-11-05T13:16:27-05:00
 draft: true
+tags: ["backup"]
 author: simon987
 ---

static/lg/cachehit.png (new binary file, 174 KiB)
static/lg/curl1.png (new binary file, 55 KiB)
static/lg/dbdump.png (new binary file, 101 KiB)
static/mg/panzoom.min.js (new vendored file; diff suppressed because one or more lines are too long)