mirror of
				https://github.com/simon987/dataarchivist.net.git
				synced 2025-10-25 10:56:51 +00:00 
			
		
		
		
	lg1
This commit is contained in:
		
							parent
							
								
									c67f653d3d
								
							
						
					
					
						commit
						b49d75d5a9
					
				
							
								
								
									
										99
									
								
								content/posts/lg1.md
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
										99
									
								
								content/posts/lg1.md
									
									
									
									
									
										Normal file
									
								
							| @ -0,0 +1,99 @@ | |||||||
|  | --- | ||||||
|  | title: "Running a LibGen cache server" | ||||||
|  | date: 2020-02-20T15:39:04-05:00 | ||||||
|  | author: simon987 | ||||||
|  | tags: [torrents] | ||||||
|  | --- | ||||||
|  | 
 | ||||||
|  | # nginx proxy cache | ||||||
|  | Not too long ago, there was [an initiative](https://www.vice.com/en_us/article/pa7jxb/archivists-are-trying-to-make-sure-a-pirate-bay-of-science-never-goes-down) to secure the books and scientific papers of the *Library Genesis* project.  | ||||||
|  | It attracted a lot of new seeders and project contributors, however, I noticed that the daily database dumps were becoming | ||||||
|  | slower and slower to download because of the increased traffic. | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | {{< figure src="/lg/curl1.png" title="43kBit/s from origin server">}} | ||||||
|  | 
 | ||||||
|  | I decided to try to contribute some of my bandwidth to the project by creating a mirror of the database dump files. | ||||||
|  | The initial idea was to write a bash script that would periodically download all new dumps, clean up the old ones and  | ||||||
|  | somehow handle connection problems and duplicate files. I quickly realized that this solution could become a hassle to | ||||||
|  | maintain so I opted for a simpler alternative. | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | ## Basic nginx setup | ||||||
|  | 
 | ||||||
|  | The following configuration is all that is needed to get a cache server up and running. | ||||||
|  | 
 | ||||||
|  | {{<highlight nginx >}} | ||||||
|  | proxy_cache_path /files/.cache/  | ||||||
|  | 	levels=1:2  | ||||||
|  | 	keys_zone=libgen_cache:1m | ||||||
|  | 	max_size=90g inactive=72h use_temp_path=off; | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | location / { | ||||||
|  | 	proxy_cache libgen_cache; | ||||||
|  | 	 | ||||||
|  | 	proxy_ignore_headers X-Accel-Expires Expires Cache-Control; | ||||||
|  | 	proxy_cache_valid any 168h; | ||||||
|  | 	proxy_cache_revalidate on; | ||||||
|  | 	 | ||||||
|  | 	add_header X-Cache-Status $upstream_cache_status; | ||||||
|  | 	proxy_pass http://gen.lib.rus.ec/dbdumps/; | ||||||
|  | } | ||||||
|  | {{</highlight>}} | ||||||
|  | 
 | ||||||
|  | The `proxy_cache_path` statement initializes a 90 GB cache folder. Entries that are not | ||||||
|  | accessed for more than 72h are periodically purged.  | ||||||
|  | See [ngx\_http\_proxy\_module](https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache_path)  | ||||||
|  | for more details about all its options. | ||||||
|  | 
 | ||||||
|  | In the `location` block, we tell nginx to ignore the client's headers and to consider all cached | ||||||
|  | items valid for (an arbitrary value of) one week. After 168 hours, a file is considered *stale*, | ||||||
|  | but it will still be served from cache if it wasn't modified on the origin server (using the `If-Modified-Since` header). | ||||||
|  | 
 | ||||||
|  | The `$upstream_cache_status` variable tells the client if they're downloading from | ||||||
|  | the origin server or from the cache. | ||||||
|  | 
 | ||||||
|  | {{< figure src="/lg/cachehit.png" title="X-Cache-Status header">}} | ||||||
|  | 
 | ||||||
|  | ## Download speed improvements | ||||||
|  | 
 | ||||||
|  | The initial download for `libgen.rar` took 3h12m (~310 kBit/s). When I re-downloaded the file immediately after, | ||||||
|  | I was able to saturate my home connection and finish the download in 3 minutes, about 60 times faster! | ||||||
|  | 
 | ||||||
|  | You can find the cache server at [lgmirror.simon987.net](https://lgmirror.simon987.net/). | ||||||
|  | 
 | ||||||
|  | {{<highlight "" >}} | ||||||
|  | Connecting to lgmirror.simon987.net (lgmirror.simon987.net)|104.31.86.142|:443... connected. | ||||||
|  | HTTP request sent, awaiting response... 200 OK | ||||||
|  | Length: 3431679578 (3.2G) [application/x-rar-compressed] | ||||||
|  | Saving to: ‘libgen_2020-02-17.rar’ | ||||||
|  | 2020-02-20 18:16:22 (13.9 MB/s) - ‘libgen_2020-02-17.rar’ saved [3431679578/3431679578] | ||||||
|  | {{</highlight>}} | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | ### Limitations and workarounds | ||||||
|  | 
 | ||||||
|  | I noticed that the file listing has shortcuts pointing to the latest database dump. | ||||||
|  |  Unfortunately, due to the way nginx's `proxy_cache` module works, both files would need | ||||||
|  | to be pulled from the origin server, even if they are identical. | ||||||
|  | 
 | ||||||
|  | {{< figure src="/lg/dbdump.png" title="libgen.rar is symlinked to today's dump">}} | ||||||
|  | 
 | ||||||
|  | Since I'm not aware of a way to create a HTTP redirect based on the current date, | ||||||
|  | a workaround for now is to force users to use the `*_yyyy-mm-dd.rar` files. | ||||||
|  | 
 | ||||||
|  | {{<highlight nginx >}} | ||||||
|  | location /libgen.rar { | ||||||
|  | 	add_header Content-Type "text/plain;charset=UTF-8"; | ||||||
|  | 	return 200 'Please download libgen-mm-dd.rar instead.\nПожалуйста, скачайте libgen_yyyy-mm-dd.rar.\n'; | ||||||
|  | } | ||||||
|  | {{</highlight>}} | ||||||
|  | 
 | ||||||
|  | ## Full configuration | ||||||
|  | 
 | ||||||
|  | <noscript> | ||||||
|  | 	<a href="https://gist.github.com/simon987/40f8d81878c45e43a6b91db327d8f4c0#file-libgen_mirror-conf">libgen_mirror.conf</a> | ||||||
|  | </noscript> | ||||||
|  | 
 | ||||||
|  | <script src="https://gist.github.com/simon987/40f8d81878c45e43a6b91db327d8f4c0.js"></script> | ||||||
							
								
								
									
										20
									
								
								content/posts/mg.md
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
										20
									
								
								content/posts/mg.md
									
									
									
									
									
										Normal file
									
								
							| @ -0,0 +1,20 @@ | |||||||
|  | --- | ||||||
|  | title: "Mg" | ||||||
|  | date: 2019-12-14T20:14:25-05:00 | ||||||
|  | draft: true | ||||||
|  | --- | ||||||
|  | 
 | ||||||
|  | <div style="height: 400px"> | ||||||
|  | <img src="file:///home/drone/test.svg" style="width: 100%;height: 100%"> | ||||||
|  | </div> | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | <script src="/mg/panzoom.min.js"></script> | ||||||
|  | <script> | ||||||
|  | const graph = document.getElementById("graph");  | ||||||
|  | const panzoom = Panzoom(graph, {startScale: 1}); | ||||||
|  | graph.addEventListener('wheel', (e) => { | ||||||
|  |     console.log(e); | ||||||
|  |     console.log(panzoom.zoomWithWheel(e)); | ||||||
|  | }); | ||||||
|  | </script> | ||||||
| @ -2,7 +2,7 @@ | |||||||
| title: "Indexing your files with sist2" | title: "Indexing your files with sist2" | ||||||
| date: 2019-11-04T19:31:45-05:00 | date: 2019-11-04T19:31:45-05:00 | ||||||
| draft: false | draft: false | ||||||
| tags: ["data curation", "misc"] | tags: ["data curation"] | ||||||
| author: simon987 | author: simon987 | ||||||
| --- | --- | ||||||
| 
 | 
 | ||||||
|  | |||||||
| @ -7,9 +7,9 @@ author: simon987 | |||||||
| --- | --- | ||||||
| 
 | 
 | ||||||
| I built a tool to simplify long-running scraping tasks processes. **task_tracker** is a simple job queue | I built a tool to simplify long-running scraping tasks processes. **task_tracker** is a simple job queue | ||||||
| with a web frontend. This is a simple demo of a common use-case. | with a web frontend. This is a quick demo of a common use-case. | ||||||
| 
 | 
 | ||||||
| Let's start with a simple script I use to aggregate data from Spotify's API: | Let's start with a short script I use to aggregate data from Spotify's API: | ||||||
| 
 | 
 | ||||||
| {{<highlight python >}} | {{<highlight python >}} | ||||||
| import spotipy | import spotipy | ||||||
| @ -28,11 +28,13 @@ def search_artist(name, mbid): | |||||||
|         res = spotify.search(name, type="artist", limit=20) |         res = spotify.search(name, type="artist", limit=20) | ||||||
| 	sys.stdout = sys.__stdout__ | 	sys.stdout = sys.__stdout__ | ||||||
| 
 | 
 | ||||||
|     with sqlite3.connect(dbfile) as conn: |     with psycopg2.connect(CONNSTR) as conn: | ||||||
|         conn.execute("INSERT INTO artist (mbid, query, data) VALUES (?,?,?)", (mbid, name, json.dumps(res))) |         conn.execute("INSERT INTO artist (mbid, query, data) VALUES (?,?,?)", (mbid, name, json.dumps(res))) | ||||||
|         conn.commit() |         conn.commit() | ||||||
| {{</highlight>}} | {{</highlight>}} | ||||||
| 
 | 
 | ||||||
|  | The `CONNSTR` variable is given  | ||||||
|  | 
 | ||||||
| I need to call `search_artist()` about 350'000 times and I don't want to bother setting up multithreading, error handling and | I need to call `search_artist()` about 350'000 times and I don't want to bother setting up multithreading, error handling and | ||||||
| keeping the script up to date on an arbitrary server so let's integrate it in the tracker. | keeping the script up to date on an arbitrary server so let's integrate it in the tracker. | ||||||
| 
 | 
 | ||||||
| @ -95,6 +97,7 @@ try: | |||||||
| 
 | 
 | ||||||
|     CLIENT_ID = secret["CLIENT_ID"] |     CLIENT_ID = secret["CLIENT_ID"] | ||||||
|     CLIENT_SECRET = secret["CLIENT_SECRET"] |     CLIENT_SECRET = secret["CLIENT_SECRET"] | ||||||
|  |     DB = secret["DB"] | ||||||
|     client_credentials_manager = SpotifyClientCredentials(client_id=CLIENT_ID, client_secret=CLIENT_SECRET) |     client_credentials_manager = SpotifyClientCredentials(client_id=CLIENT_ID, client_secret=CLIENT_SECRET) | ||||||
|     spotify = spotipy.Spotify(client_credentials_manager=client_credentials_manager) |     spotify = spotipy.Spotify(client_credentials_manager=client_credentials_manager) | ||||||
| 
 | 
 | ||||||
|  | |||||||
| @ -2,6 +2,7 @@ | |||||||
| title: "Android phone backups with zpaq" | title: "Android phone backups with zpaq" | ||||||
| date: 2019-11-05T13:16:27-05:00 | date: 2019-11-05T13:16:27-05:00 | ||||||
| draft: true | draft: true | ||||||
|  | tags: ["backup"] | ||||||
| author: simon987 | author: simon987 | ||||||
| --- | --- | ||||||
| 
 | 
 | ||||||
|  | |||||||
							
								
								
									
										
											BIN
										
									
								
								static/lg/cachehit.png
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
										
											BIN
										
									
								
								static/lg/cachehit.png
									
									
									
									
									
										Normal file
									
								
							
										
											Binary file not shown.
										
									
								
							| After Width: | Height: | Size: 174 KiB | 
							
								
								
									
										
											BIN
										
									
								
								static/lg/curl1.png
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
										
											BIN
										
									
								
								static/lg/curl1.png
									
									
									
									
									
										Normal file
									
								
							
										
											Binary file not shown.
										
									
								
							| After Width: | Height: | Size: 55 KiB | 
							
								
								
									
										
											BIN
										
									
								
								static/lg/dbdump.png
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
										
											BIN
										
									
								
								static/lg/dbdump.png
									
									
									
									
									
										Normal file
									
								
							
										
											Binary file not shown.
										
									
								
							| After Width: | Height: | Size: 101 KiB | 
							
								
								
									
										6
									
								
								static/mg/panzoom.min.js
									
									
									
									
										vendored
									
									
										Normal file
									
								
							
							
						
						
									
										6
									
								
								static/mg/panzoom.min.js
									
									
									
									
										vendored
									
									
										Normal file
									
								
							
										
											
												File diff suppressed because one or more lines are too long
											
										
									
								
							
										
											Binary file not shown.
										
									
								
							| Before Width: | Height: | Size: 19 KiB After Width: | Height: | Size: 54 KiB | 
		Loading…
	
	
			
			x
			
			
		
	
		Reference in New Issue
	
	Block a user