25 Commits

Author SHA1 Message Date
df8ab7727b docker-compose setup (wip) 2019-11-13 13:03:43 -05:00
simon987
d69ed65a0c Rewrite export.py, add diagram 2019-03-27 22:09:08 -04:00
simon987
b9f25630b4 Switch to postgresql, finish minimum viable task_tracker/ws_bucket integration 2019-03-27 19:34:05 -04:00
simon987
4ffe805b8d Use task_tracker for task tracking 2019-03-24 20:23:05 -04:00
simon987
8ced4859f3 hotfix attempt 1 2019-02-02 09:17:59 -05:00
simon987
7f857d641f Change ES settings, big refactor, removed recaptcha 2019-01-13 12:48:39 -05:00
Simon
85c3aa918d replaced requests by pycurl 2018-08-23 11:47:09 -04:00
Simon
49206af566 Updated requirements 2018-07-25 11:35:41 -04:00
Simon
8e937e69c0 Should fix some FTP errors 2018-06-24 13:50:55 -04:00
Simon
348914aba9 Removing unused module 2018-06-22 17:34:10 -04:00
Simon
14d384e366 Decentralised crawling should work in theory + temporary fix for going further than the maximum 10k results elasticsearch allows by default 2018-06-21 19:44:27 -04:00
Simon
344e7274d7 Simplified url joining and splitting, switched from lxml to html.parser, various memory usage optimizations 2018-06-17 22:10:46 -04:00
Simon
81fde6cc30 Bug fixes with html parsing 2018-06-14 20:02:06 -04:00
Simon
f3c7b551d2 Some adjustments to make it work on Stretch server 2018-06-14 17:09:05 -04:00
Simon
dffd032659 Indexing after crawling is a bit more efficient 2018-06-14 16:41:43 -04:00
Simon
83ca579ec7 Started working on post-crawl callbacks and basic auth for crawl servers 2018-06-14 15:05:56 -04:00
Simon
011b8455a7 Elasticsearch search engine (search & scroll) 2018-06-11 23:06:41 -04:00
Simon
72495275b0 Elasticsearch search engine (import from json) 2018-06-11 22:35:49 -04:00
Simon
d849227798 barebones crawl_server microservice 2018-06-11 19:00:43 -04:00
Simon
7f496ce7a8 Slowly losing my sanity part 1: Removed scrapy dependency and moved to custom solution. Added multi-threaded ftp crawler 2018-06-11 15:46:55 -04:00
Simon
f2d914060b Removed unsuitable scrapy spider and implemented custom crawler 2018-06-10 20:08:59 -04:00
Simon
0304c98a31 Added basic ftp spider for scrapy 2018-06-10 14:12:55 -04:00
Simon
dc0cde61a0 Basic admin page 2018-06-08 11:40:54 -04:00
Simon
cfa6a9f02f Added requirement 2018-06-03 10:49:11 -04:00
Simon
ad645490f6 Initial commit 2018-05-28 20:35:04 -04:00