add kibana & update README.md

simon 2020-01-22 16:04:53 -05:00
parent 7f121d2ac0
commit c61f51cb08
2 changed files with 18 additions and 80 deletions

View File

@@ -1,7 +1,5 @@
# OD-Database
[![Build Status](https://ci.simon987.net/buildStatus/icon?job=od-database_qa)](https://ci.simon987.net/job/od-database_qa/)
OD-Database is a web-crawling project that aims to index a very large number of file links and their basic metadata from open directories (misconfigured Apache/Nginx/FTP servers, or more often, mirrors of various public services).
Each crawler instance fetches tasks from the central server and pushes the results back once completed. A single instance can crawl hundreds of websites at the same time (both FTP and HTTP(S)), and the central server is capable of ingesting thousands of new documents per second.
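Roughly, each crawler's loop looks like the sketch below. This is only an illustration of the fetch/push cycle; the endpoint paths and payload fields are hypothetical, not the actual task_tracker API.

```python
import requests

TRACKER = "http://localhost:3010"  # task_tracker instance (TT_API in the sample config)

def crawl(url: str) -> dict:
    # Placeholder: a real crawler would walk the directory listing
    # and collect file links plus their metadata here.
    return {"url": url, "files": []}

def run_once():
    # Hypothetical endpoints and fields, shown only to illustrate the cycle.
    task = requests.get(f"{TRACKER}/task/get", params={"project": 3}).json()
    if task:
        result = crawl(task["url"])
        requests.post(f"{TRACKER}/task/release",
                      json={"task_id": task["id"], "result": result})
```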
@@ -14,82 +12,22 @@ The data is indexed into elasticsearch and made available via the web frontend (
### Contributing
Suggestions/concerns/PRs are welcome.
## Installation
Assuming you have Python 3 and git installed:
## Installation (Docker)
```bash
sudo apt install libssl-dev libcurl4-openssl-dev
git clone https://github.com/simon987/od-database
cd od-database
git submodule update --init --recursive
sudo pip3 install -r requirements.txt
docker-compose up
```
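Once the stack is up, you can sanity-check the containers; the commands below are just one way to verify (the `es` service name comes from docker-compose.yml):

```bash
# List services and their health status
docker-compose ps

# Follow the logs of a specific service, e.g. Elasticsearch
docker-compose logs -f es
```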
Create `/config.py` and fill out the parameters. Sample config:
```python
# Leave the default values to disable CAPTCHAs
CAPTCHA_LOGIN = False
CAPTCHA_SUBMIT = False
CAPTCHA_SEARCH = False
CAPTCHA_EVERY = 10
# Flask secret key for sessions
FLASK_SECRET = ""
RESULTS_PER_PAGE = (25, 50, 100, 250, 500, 1000)
# Allow ftp websites in /submit
SUBMIT_FTP = False
# Allow http(s) websites in /submit
SUBMIT_HTTP = True
## Architecture
# Number of re-crawl tasks to keep in the queue
RECRAWL_POOL_SIZE = 10000
# task_tracker API url
TT_API = "http://localhost:3010"
# task_tracker crawl project id
TT_CRAWL_PROJECT = 3
# task_tracker indexing project id
TT_INDEX_PROJECT = 9
# Number of threads to use for ES indexing
INDEXER_THREADS = 4
# ws_bucket API url
WSB_API = "http://localhost:3020"
# ws_bucket secret
WSB_SECRET = "default_secret"
# ws_bucket data directory
WSB_PATH = "/mnt/data/github.com/simon987/ws_bucket/data"
# od-database PostgreSQL connection string
DB_CONN_STR = "dbname=od-database user=od-database password=xxx"
```
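`FLASK_SECRET` should be a long random string; one way to generate one (any equivalent source of randomness works):

```bash
python3 -c "import secrets; print(secrets.token_hex(32))"
```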
![diag](high_level_diagram.png)
## Running the crawl server
The Python crawler that was originally part of this project has been discontinued;
[the Go implementation](https://github.com/terorie/od-database-crawler) is currently in use.
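A rough sketch for fetching and building it, assuming a standard Go toolchain (see that repository's own README for configuration and usage):

```bash
git clone https://github.com/terorie/od-database-crawler
cd od-database-crawler
go build
```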
## Running the web server (debug)
```bash
cd od-database
python3 app.py
```
## Running the web server with Nginx (production)
* Install dependencies:
```bash
sudo apt install build-essential python-dev redis-server uwsgi-plugin-python3
```
* Configure nginx (on Debian 9: `/etc/nginx/sites-enabled/default`):
```nginx
server {
...
include uwsgi_params;
location / {
uwsgi_pass 127.0.0.1:3031;
}
...
}
```
* Configure Elasticsearch
### Configure Elasticsearch
```
PUT _template/default
{
@@ -102,9 +40,3 @@ PUT _template/default
"routing_partition_size" : 5
}
}
```
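If you prefer not to use the Kibana Dev Tools console, the same template can be applied with curl (assuming Elasticsearch is reachable on localhost:9200 and the JSON body above is saved to `template.json`):

```bash
curl -X PUT "localhost:9200/_template/default" \
  -H 'Content-Type: application/json' \
  -d @template.json
```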
* Start uwsgi:
```bash
uwsgi od-database.ini
```
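The contents of `od-database.ini` are not reproduced here; a minimal sketch of what such a uWSGI config could look like, assuming the Flask application object is named `app` in `app.py` (the socket must match the `uwsgi_pass` address in the Nginx config above):

```ini
[uwsgi]
plugin = python3
; hypothetical: Flask application object "app" defined in app.py
module = app:app
; must match the uwsgi_pass address in the Nginx server block above
socket = 127.0.0.1:3031
processes = 4
master = true
```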

View File

@@ -39,8 +39,6 @@ services:
environment:
- "POSTGRES_USER=od_database"
- "POSTGRES_PASSWORD=changeme"
ports:
- 5021:5432
healthcheck:
test: ["CMD-SHELL", "pg_isready -U od_database"]
interval: 5s
@@ -77,19 +75,27 @@ services:
- 3010:80
depends_on:
tt_db:
condition: service_healthy
condition: service_healthy
es:
image: docker.elastic.co/elasticsearch/elasticsearch:7.4.2
environment:
- discovery.type=single-node
- "ES_JAVA_OPTS=-Xms1G -Xmx10G"
volumes:
- /usr/share/elasticsearch/data
- ./es_data:/usr/share/elasticsearch/data
healthcheck:
test: ["CMD-SHELL", "curl --silent --fail localhost:9200/_cluster/health || exit 1"]
interval: 5s
timeout: 5s
retries: 5
# (Optional)
kibana:
image: docker.elastic.co/kibana/kibana:7.4.2
environment:
- ELASTICSEARCH_HOSTS=http://es:9200
- xpack.monitoring.collection.enabled=true
ports:
- 5021:5601
depends_on:
es:
condition: service_healthy
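With the port mapping above, Kibana should be reachable on the host at port 5021 once Elasticsearch passes its health check; a quick way to confirm it is up:

```bash
# Query Kibana's status API through the published port
curl -s http://localhost:5021/api/status
```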