| df8ab7727b | docker-compose setup (wip) | 2019-11-13 13:03:43 -05:00 |
| simon987 | d69ed65a0c | Rewrite export.py, add diagram | 2019-03-27 22:09:08 -04:00 |
| simon987 | b9f25630b4 | Switch to postgresql, finish minimum viable task_tracker/ws_bucket integration | 2019-03-27 19:34:05 -04:00 |
| simon987 | 4ffe805b8d | Use task_tracker for task tracking | 2019-03-24 20:23:05 -04:00 |
| simon987 | 8ced4859f3 | hotfix attempt 1 | 2019-02-02 09:17:59 -05:00 |
| simon987 | 7f857d641f | Change ES settings, big refactor, removed recaptcha | 2019-01-13 12:48:39 -05:00 |
| Simon | 85c3aa918d | replaced requests by pycurl | 2018-08-23 11:47:09 -04:00 |
| Simon | 49206af566 | Updated requirements | 2018-07-25 11:35:41 -04:00 |
| Simon | 8e937e69c0 | Should fix some FTP errors | 2018-06-24 13:50:55 -04:00 |
| Simon | 348914aba9 | Removing unused module | 2018-06-22 17:34:10 -04:00 |
| Simon | 14d384e366 | Decentralised crawling should work in theory + temporary fix for going further than the maximum 10k results elasticsearch allows by default | 2018-06-21 19:44:27 -04:00 |
| Simon | 344e7274d7 | Simplified url joining and splitting, switched from lxml to html.parser, various memory usage optimizations | 2018-06-17 22:10:46 -04:00 |
| Simon | 81fde6cc30 | Bug fixes with html parsing | 2018-06-14 20:02:06 -04:00 |
| Simon | f3c7b551d2 | Some adjustments to make it work on Stretch server | 2018-06-14 17:09:05 -04:00 |
| Simon | dffd032659 | Indexing after crawling is a bit more efficient | 2018-06-14 16:41:43 -04:00 |
| Simon | 83ca579ec7 | Started working on post-crawl callbacks and basic auth for crawl servers | 2018-06-14 15:05:56 -04:00 |
| Simon | 011b8455a7 | Elasticsearch search engine (search & scroll) | 2018-06-11 23:06:41 -04:00 |
| Simon | 72495275b0 | Elasticsearch search engine (import from json) | 2018-06-11 22:35:49 -04:00 |
| Simon | d849227798 | barebones crawl_server microservice | 2018-06-11 19:00:43 -04:00 |
| Simon | 7f496ce7a8 | Slowly losing my sanity part 1: Removed scrapy dependency and moved to custom solution. Added multi-threaded ftp crawler | 2018-06-11 15:46:55 -04:00 |
| Simon | f2d914060b | Removed unsuitable scrapy spider and implemented custom crawler | 2018-06-10 20:08:59 -04:00 |
| Simon | 0304c98a31 | Added basic ftp spider for scrapy | 2018-06-10 14:12:55 -04:00 |
| Simon | dc0cde61a0 | Basic admin page | 2018-06-08 11:40:54 -04:00 |
| Simon | cfa6a9f02f | Added requirement | 2018-06-03 10:49:11 -04:00 |
| Simon | ad645490f6 | Initial commit | 2018-05-28 20:35:04 -04:00 |
|