25 Commits

Author SHA1 Message Date
Simon
5ff198b88a Fix for negative sizes 2018-07-25 11:37:12 -04:00
Simon
34d1f375a8 Crawler performance improvements 2018-07-25 11:27:50 -04:00
Simon
d43cf3b0ce Increased empty-queue timeout so that workers do not all die before the website is dropped 2018-07-20 14:11:17 -04:00
Simon
1df5d194d2 Very slow websites are skipped. Should fix infinite waiting bug 2018-07-20 13:34:40 -04:00
Simon
d138db8f06 Added filter to check if a website can be scanned from its parent directory 2018-07-10 10:14:23 -04:00
Simon
d7ce1670a8 Logging and bugfix for HTTP crawler 2018-06-25 14:36:16 -04:00
Simon
e11343de23 More FTP crawler bug fixes 2018-06-24 18:05:30 -04:00
Simon
ab35ce96cc FTP crawler bug fixes 2018-06-24 16:44:21 -04:00
Simon
8e937e69c0 Should fix some FTP errors 2018-06-24 13:50:55 -04:00
Simon
1ac510ff53 Slots can be updated without removing & adding 2018-06-24 09:39:44 -04:00
Simon
348914aba9 Removing unused module 2018-06-22 17:34:10 -04:00
Simon
e824b2bf3c Updated README and UI fixes 2018-06-22 13:22:58 -04:00
Simon
7a4432e4d0 More bugfixes for looping directories, some work on task dispatching 2018-06-21 20:50:26 -04:00
Simon
14d384e366 Decentralised crawling should work in theory + temporary fix for going past the maximum of 10k results that Elasticsearch allows by default 2018-06-21 19:44:27 -04:00
Simon
073551df3c Attempt to handle looping directories 2018-06-21 11:54:40 -04:00
Simon
8a73142ff8 Support for encodings other than UTF-8 and removed some debug info 2018-06-18 13:44:19 -04:00
Simon
99d64b658b Disabled thread pool for headers requests in listing 2018-06-18 10:33:33 -04:00
Simon
344e7274d7 Simplified URL joining and splitting, switched from lxml to html.parser, various memory usage optimizations 2018-06-17 22:10:46 -04:00
Simon
07d51a75cc Increased queue.get() timeouts 2018-06-17 10:07:06 -04:00
Simon
9d0a0a8b42 Should fix memory usage problem when crawling (part two) 2018-06-16 14:53:48 -04:00
Simon
adb94cf326 Should fix memory usage problem when crawling 2018-06-14 23:36:54 -04:00
Simon
f3c7b551d2 Some adjustments to make it work on Stretch server 2018-06-14 17:09:05 -04:00
Simon
1bd58468eb Bug fixes for FTP crawler 2018-06-13 15:54:45 -04:00
Simon
af2601ee70 Fixed file duplication problem 2018-06-12 15:55:52 -04:00
Simon
d61fd75890 Tasks can now be queued from the web interface. Tasks are dispatched to the crawl server(s) 2018-06-12 13:44:03 -04:00