Mirror of https://github.com/terorie/od-database-crawler.git (synced 2025-04-03 14:33:00 +00:00)
# OD-Database Crawler 🕷
- Crawler for OD-Database
- In production at https://od-db.the-eye.eu/
- Over 880 TB actively crawled
- Crawls HTTP open directories (standard Web Server Listings)
- Gets name, path, size and modification time of all files
- Lightweight and fast
## Usage

### Deploys
- With Config File (if `config.yml` is found in the working dir)
  - Download the default config
  - Set `server.url` and `server.token`
  - Start with `./od-database-crawler server --config <file>`
- With Flags or env
  - Overrides the config file if it exists
  - `--help` for a list of flags
  - Every flag is available as an environment variable: `--server.crawl_stats` ➡️ `OD_SERVER_CRAWL_STATS`
  - Start with `./od-database-crawler server <flags>`
- With Docker

  ```shell
  docker run \
      -e OD_SERVER_URL=xxx \
      -e OD_SERVER_TOKEN=xxx \
      terorie/od-database-crawler
  ```
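The flag-to-environment-variable mapping used above can be sketched as a small shell helper. This is purely illustrative (the `flag_to_env` function is not part of the crawler): the assumed rule is uppercase the flag name, turn `.` and `-` into `_`, and prefix `OD_`, which matches every pair in the flag reference below.

```shell
# flag_to_env: illustrative helper showing the assumed mapping from a
# config flag name to its environment variable:
#   lowercase -> uppercase, '.' and '-' -> '_', prefixed with OD_
flag_to_env() {
  printf 'OD_%s\n' "$(printf '%s' "$1" | tr 'a-z.-' 'A-Z__')"
}

flag_to_env server.crawl_stats   # prints OD_SERVER_CRAWL_STATS
flag_to_env crawl.user-agent     # prints OD_CRAWL_USER_AGENT
```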
### Flag reference

Here are the most important config flags. For finer-grained control, take a look at `/config.yml`.
| Flag / Environment | Description | Example |
| --- | --- | --- |
| `server.url`<br>`OD_SERVER_URL` | OD-DB server URL | `https://od-db.mine.the-eye.eu/api` |
| `server.token`<br>`OD_SERVER_TOKEN` | OD-DB server access token | Ask Hexa TM |
| `server.recheck`<br>`OD_SERVER_RECHECK` | Job fetching interval | `3s` |
| `output.crawl_stats`<br>`OD_OUTPUT_CRAWL_STATS` | Crawl stats logging interval (0 = disabled) | `500ms` |
| `output.resource_stats`<br>`OD_OUTPUT_RESOURCE_STATS` | Resource stats logging interval (0 = disabled) | `8s` |
| `output.log`<br>`OD_OUTPUT_LOG` | Log file (none = disabled) | `crawler.log` |
| `crawl.tasks`<br>`OD_CRAWL_TASKS` | Max number of sites to crawl concurrently | `500` |
| `crawl.connections`<br>`OD_CRAWL_CONNECTIONS` | HTTP connections per site | `1` |
| `crawl.retries`<br>`OD_CRAWL_RETRIES` | How often to retry after a temporary failure (e.g. HTTP 429 or timeouts) | `5` |
| `crawl.dial_timeout`<br>`OD_CRAWL_DIAL_TIMEOUT` | TCP connect timeout | `5s` |
| `crawl.timeout`<br>`OD_CRAWL_TIMEOUT` | HTTP request timeout | `20s` |
| `crawl.user-agent`<br>`OD_CRAWL_USER_AGENT` | HTTP crawler User-Agent | `googlebot/1.2.3` |
| `crawl.job_buffer`<br>`OD_CRAWL_JOB_BUFFER` | Number of URLs to keep in memory/cache per job; the rest is offloaded to disk. Decrease this value if the crawler uses too much RAM. (0 = disable cache, -1 = only use cache) | `5000` |
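Putting the flags above together, a `config.yml` might look like the sketch below. The nesting is an assumption derived from the dotted flag names (e.g. `server.url` → `server:` / `url:`); the shipped `/config.yml` is the authoritative reference, and the values here are just the example values from the table.

```yaml
# Sketch of a config.yml, assuming keys nest along the dotted flag names.
server:
  url: https://od-db.mine.the-eye.eu/api
  token: xxx            # access token — ask Hexa TM
  recheck: 3s           # job fetching interval
output:
  crawl_stats: 500ms    # 0 = disabled
  resource_stats: 8s    # 0 = disabled
  log: crawler.log      # none = disabled
crawl:
  tasks: 500            # max sites crawled concurrently
  connections: 1        # HTTP connections per site
  retries: 5            # retries after temporary failures
  dial_timeout: 5s      # TCP connect timeout
  timeout: 20s          # HTTP request timeout
  user-agent: googlebot/1.2.3
  job_buffer: 5000      # URLs kept in memory per job; 0 = no cache, -1 = cache only
```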