Mirror of https://github.com/terorie/od-database-crawler.git (synced 2025-10-25 03:16:52 +00:00)
			
		
		
		
OD-Database Crawler 🕷
- Crawler for OD-Database
- In production at https://od-db.the-eye.eu/
- Over 880 TB actively crawled
- Crawls HTTP open directories (standard Web Server Listings)
- Gets name, path, size and modification time of all files
- Lightweight and fast
Usage
Deploys
- With Config File (if config.yml found in working dir)
  - Download default config
  - Set server.url and server.token
  - Start with ./od-database-crawler server --config <file>
- With Flags or env
  - Override config file if it exists
  - --help for list of flags
  - Every flag is available as an environment variable: --output.crawl_stats ➡️ OD_OUTPUT_CRAWL_STATS
  - Start with ./od-database-crawler server <flags>
- With Docker
    docker run \
      -e OD_SERVER_URL=xxx \
      -e OD_SERVER_TOKEN=xxx \
      terorie/od-database-crawler
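
The mapping from flag names to environment variable names can be sketched in Go. This is a minimal illustration inferred from the names listed in this README (prefix OD_, upper case, "." and "-" replaced by "_"); the helper envName is hypothetical and is not taken from the crawler's source.

```go
package main

import (
	"fmt"
	"strings"
)

// envName converts a dotted flag name into the environment variable name
// suggested by this README: prefix "OD_", upper case, "." and "-" become "_".
// Hypothetical helper for illustration; not the crawler's own code.
func envName(flag string) string {
	flag = strings.TrimPrefix(flag, "--")
	replacer := strings.NewReplacer(".", "_", "-", "_")
	return "OD_" + strings.ToUpper(replacer.Replace(flag))
}

func main() {
	// Flag names taken from the flag reference table.
	for _, f := range []string{"--server.url", "server.token", "crawl.user-agent"} {
		fmt.Printf("%s -> %s\n", f, envName(f))
	}
}
```

With this rule, crawl.user-agent becomes OD_CRAWL_USER_AGENT, which matches the name given in the flag reference.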
Flag reference
Here are the most important config flags. For finer control, take a look at /config.yml.
| Flag/Environment | Description | Example | 
|---|---|---|
| server.url / OD_SERVER_URL | OD-DB Server URL | https://od-db.mine.the-eye.eu/api |
| server.token / OD_SERVER_TOKEN | OD-DB Server Access Token | Ask Hexa TM |
| server.recheck / OD_SERVER_RECHECK | Job Fetching Interval | 3s |
| output.crawl_stats / OD_OUTPUT_CRAWL_STATS | Crawl Stats Logging Interval (0 = disabled) | 500ms |
| output.resource_stats / OD_OUTPUT_RESOURCE_STATS | Resource Stats Logging Interval (0 = disabled) | 8s |
| output.log / OD_OUTPUT_LOG | Log File (none = disabled) | crawler.log |
| crawl.tasks / OD_CRAWL_TASKS | Max number of sites to crawl concurrently | 500 |
| crawl.connections / OD_CRAWL_CONNECTIONS | HTTP connections per site | 1 |
| crawl.retries / OD_CRAWL_RETRIES | How often to retry after a temporary failure (e.g. HTTP 429 or timeouts) | 5 |
| crawl.dial_timeout / OD_CRAWL_DIAL_TIMEOUT | TCP Connect timeout | 5s |
| crawl.timeout / OD_CRAWL_TIMEOUT | HTTP request timeout | 20s |
| crawl.user-agent / OD_CRAWL_USER_AGENT | HTTP Crawler User-Agent | googlebot/1.2.3 |
| crawl.job_buffer / OD_CRAWL_JOB_BUFFER | Number of URLs to keep in memory/cache, per job. The rest is offloaded to disk. Decrease this value if the crawler uses too much RAM. (0 = Disable Cache, -1 = Only use Cache) | 5000 |