268 Commits

Author SHA1 Message Date
Simon
c309aa25c8 Attempt to fix unicode decode errors 2018-06-16 15:20:23 -04:00
Simon
9d0a0a8b42 Should fix memory usage problem when crawling (part two) 2018-06-16 14:53:48 -04:00
Simon
adb94cf326 Should fix memory usage problem when crawling 2018-06-14 23:36:54 -04:00
Simon
9aed18c2d2 Should fix timeout error when indexing 2018-06-14 20:07:50 -04:00
Simon
81fde6cc30 Bug fixes with html parsing 2018-06-14 20:02:06 -04:00
Simon
f3c7b551d2 Some adjustments to make it work on Stretch server 2018-06-14 17:09:05 -04:00
Simon
dffd032659 Indexing after crawling is a bit more efficient 2018-06-14 16:41:43 -04:00
Simon
83ca579ec7 Started working on post-crawl callbacks and basic auth for crawl servers 2018-06-14 15:05:56 -04:00
Simon
1bd58468eb Bug fixes for FTP crawler 2018-06-13 15:54:45 -04:00
Simon
9bde8cb629 uWSGI config and bugfix with file extensions 2018-06-13 14:11:27 -04:00
Simon
e91572a06f Homepage stats now work with elasticsearch 2018-06-12 23:19:57 -04:00
Simon
2fe81e4b06 Crawl server now holds at most max_workers + 1 tasks in pool to minimize waiting time and to avoid loss of too many tasks in case of crash/restart 2018-06-12 22:28:36 -04:00
Simon
24ef493245 Websites being indexed now show up on the homepage 2018-06-12 21:51:02 -04:00
Simon
bccb1d0dfd Website link list works with elasticsearch 2018-06-12 21:26:44 -04:00
Simon
e266a50197 Website stats now works with elasticsearch 2018-06-12 20:17:30 -04:00
Simon
4b60ac62fc Added website url & date in search results & fixed threading problem 2018-06-12 17:48:15 -04:00
Simon
0127b3a51d Basic searching integrated with elasticsearch + highlighting 2018-06-12 16:29:05 -04:00
Simon
af2601ee70 Fixed file duplication problem 2018-06-12 15:55:52 -04:00
Simon
1718bb91ca Files are indexed into ES when task is complete 2018-06-12 15:45:00 -04:00
Simon
6c912ea8c5 Completed tasks are now fetched by the TaskDispatcher 2018-06-12 14:16:05 -04:00
Simon
d61fd75890 Tasks can now be queued from the web interface. Tasks are dispatched to the crawl server(s) 2018-06-12 13:44:03 -04:00
Simon
6d48f1f780 Task crawl result now logged in a database 2018-06-12 11:03:45 -04:00
Simon
011b8455a7 Elasticsearch search engine (search & scroll) 2018-06-11 23:06:41 -04:00
Simon
72495275b0 Elasticsearch search engine (import from json) 2018-06-11 22:35:49 -04:00
Simon
fcfd7d4acc Bug fixes + export to json 2018-06-11 20:02:30 -04:00
Simon
d849227798 barebones crawl_server microservice 2018-06-11 19:00:43 -04:00
Simon
8421cc0885 Refactoring on http crawler 2018-06-11 16:06:56 -04:00
Simon
7f496ce7a8 Slowly losing my sanity part 1: Removed scrapy dependency and moved to custom solution. Added multi-threaded ftp crawler 2018-06-11 15:46:55 -04:00
Simon
b649b82854 Cleanup of custom crawler 2018-06-10 21:32:08 -04:00
Simon
f2d914060b Removed unsuitable scrapy spider and implemented custom crawler 2018-06-10 20:08:59 -04:00
Simon
d8c16d53e6 FTP url validation 2018-06-10 14:32:19 -04:00
Simon
0304c98a31 Added basic ftp spider for scrapy 2018-06-10 14:12:55 -04:00
Simon
f1e8183cdf Bulk insert captcha 2018-06-10 07:21:44 -04:00
Simon
4523a4335c Added bulk insert feature 2018-06-10 07:20:58 -04:00
Simon
1bd8a5fc22 Designed form for bulk insert 2018-06-10 07:03:59 -04:00
Simon
a25976d24a Generate and delete API tokens 2018-06-09 12:41:28 -04:00
Simon
de717d3992 Blacklisted skyarchive.info 2018-06-09 11:12:03 -04:00
Simon
20d0f97ffb Logout button 2018-06-08 11:48:11 -04:00
Simon
dc0cde61a0 Basic admin page 2018-06-08 11:40:54 -04:00
Simon
537228444b Duplicate website w/ reddit post + refactor 2018-06-08 10:40:58 -04:00
Simon
7f1e12cc3c Blacklisted https://oss.jfrog.org 2018-06-07 18:18:58 -04:00
Simon
b79b0ca58c Results per page now configurable 2018-06-07 13:49:08 -04:00
Simon
306b0ed0fe Added option to choose results per page 2018-06-07 13:19:41 -04:00
Simon
ab25d821c6 Export job no longer let the user download incomplete archives 2018-06-07 11:35:56 -04:00
Simon
06d3a09e11 Quick hack for search order options 2018-06-07 11:22:35 -04:00
Simon
221a16697b Changed user agent 2018-06-07 10:44:43 -04:00
Simon Fortier
92a91606ea Create LICENSE 2018-06-07 10:36:26 -04:00
Simon
4f6d7f32ad Option to turn off SSL, moved secret keys to config.py, switched to sqlite WAL mode to avoid locked database problems 2018-06-07 10:33:35 -04:00
Simon Fortier
460357f183 Create README.md 2018-06-07 10:01:33 -04:00
Simon
a2835bbbcf Should fix problem when comment/post with subdir doesn't exist pt. 2 2018-06-06 19:40:08 -04:00