mirror of https://github.com/terorie/od-database-crawler.git (synced 2025-04-19 10:26:43 +00:00)
Flag explanation in README.md
parent 9e9b606250
commit 88856c1c19
34
README.md
@@ -9,7 +9,9 @@
 
 https://od-db.the-eye.eu/
 
-#### Usage
+## Usage
 
+### Deploys
+
 1. With Config File (if `config.yml` found in working dir)
     - Download [default config](https://github.com/terorie/od-database-crawler/blob/master/config.yml)
@@ -22,3 +24,33 @@ https://od-db.the-eye.eu/
     - Every flag is available as an environment variable:
       `--server.crawl_stats` ➡️ `OD_SERVER_CRAWL_STATS`
     - Start with `./od-database-crawler server <flags>`
+
+3. With Docker
+    ```bash
+    docker run \
+        -e OD_SERVER_URL=xxx \
+        -e OD_SERVER_TOKEN=xxx \
+        terorie/od-database-crawler
+    ```
+
+### Flag reference
+
+Here are the most important config flags. For finer control, take a look at `/config.yml`.
+
+| Flag/Config             | Environment/Docker         | Description                                                  | Example                             |
+| ----------------------- | -------------------------- | ------------------------------------------------------------ | ----------------------------------- |
+| `server.url`            | `OD_SERVER_URL`            | OD-DB Server URL                                             | `https://od-db.mine.the-eye.eu/api` |
+| `server.token`          | `OD_SERVER_TOKEN`          | OD-DB Server Access Token                                    | _Ask Hexa **TM**_                   |
+| `server.recheck`        | `OD_SERVER_RECHECK`        | Job Fetching Interval                                        | `3s`                                |
+| `output.crawl_stats`    | `OD_OUTPUT_CRAWL_STATS`    | Crawl Stats Logging Interval (0 = disabled)                  | `500ms`                             |
+| `output.resource_stats` | `OD_OUTPUT_RESOURCE_STATS` | Resource Stats Logging Interval (0 = disabled)               | `8s`                                |
+| `output.log`            | `OD_OUTPUT_LOG`            | Log File (none = disabled)                                   | `crawler.log`                       |
+| `crawl.tasks`           | `OD_CRAWL_TASKS`           | Max number of sites to crawl concurrently                    | `500`                               |
+| `crawl.connections`     | `OD_CRAWL_CONNECTIONS`     | HTTP connections per site                                    | `1`                                 |
+| `crawl.retries`         | `OD_CRAWL_RETRIES`         | Number of retries after a temporary failure (e.g. `HTTP 429` or timeouts) | `5`                    |
+| `crawl.dial_timeout`    | `OD_CRAWL_DIAL_TIMEOUT`    | TCP Connect timeout                                          | `5s`                                |
+| `crawl.timeout`         | `OD_CRAWL_TIMEOUT`         | HTTP request timeout                                         | `20s`                               |
+| `crawl.user-agent`      | `OD_CRAWL_USER_AGENT`      | HTTP Crawler User-Agent                                      | `googlebot/1.2.3`                   |
+| `crawl.job_buffer`      | `OD_CRAWL_JOB_BUFFER`      | Number of URLs to keep in memory/cache, per job. The rest is offloaded to disk. Decrease this value if the crawler uses too much RAM. (0 = Disable Cache, -1 = Only use Cache) | `5000` |
+
+
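The flag reference pairs each config key with an `OD_`-prefixed environment variable. Judging only from the names listed in the table, the pattern appears to be: uppercase the key and replace `.` and `-` with `_`. The sketch below illustrates that pattern with a hypothetical helper, `flag_to_env`, which is not part of od-database-crawler; it is an assumption drawn from the table, not a description of how the crawler resolves variables internally.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with od-database-crawler): derive the
# environment variable name for a flag, following the naming pattern that
# the flag reference table suggests.
flag_to_env() {
    flag="${1#--}"                      # drop a leading "--" if present
    printf 'OD_%s\n' "$flag" |
        tr '[:lower:]' '[:upper:]' |    # uppercase the whole name
        tr '.-' '__'                    # "." and "-" both become "_"
}

flag_to_env --server.crawl_stats    # prints OD_SERVER_CRAWL_STATS
flag_to_env crawl.user-agent        # prints OD_CRAWL_USER_AGENT
```

A helper like this can be handy when turning a `config.yml` key into a `-e NAME=value` argument for `docker run`, but the table remains the authoritative list of supported names.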