Commit Graph

  • ab35ce96cc FTP crawler bug fixes Simon 2018-06-24 16:44:21 -04:00
  • f603f41754 Updated readme Simon 2018-06-24 14:27:44 -04:00
  • 8e937e69c0 Should fix some FTP errors Simon 2018-06-24 13:50:55 -04:00
  • a6d753c6ee Added redispatch button and fixed typo in load balancing code Simon 2018-06-24 10:07:46 -04:00
  • 1ac510ff53 Slots can be updated without removing & adding Simon 2018-06-24 09:39:44 -04:00
  • 348914aba9 Removing unused module Simon 2018-06-22 17:34:10 -04:00
  • e824b2bf3c Updated readme and UI fixes Simon 2018-06-22 13:22:58 -04:00
  • 9d3fc2d71b typo (again) Simon 2018-06-21 21:26:44 -04:00
  • efd1981e6f typo Simon 2018-06-21 21:00:50 -04:00
  • 7a4432e4d0 More bugfixes for looping directories, some work on task dispatching Simon 2018-06-21 20:50:26 -04:00
  • 14d384e366 Decentralised crawling should work in theory + temporary fix for going further than the maximum 10k results elasticsearch allows by default Simon 2018-06-21 19:44:27 -04:00
  • 098ad2be72 Should fix unknown encoding errors + removed https warnings Simon 2018-06-21 19:23:01 -04:00
  • 80aa8933e6 Added rescan button Simon 2018-06-21 13:02:16 -04:00
  • 073551df3c Attempt to handle looping directories Simon 2018-06-21 11:54:40 -04:00
  • dd93d40a55 Small bugfix for ftp crawler Simon 2018-06-20 21:56:38 -04:00
  • 4ca56d2317 Fixes #2 Simon 2018-06-20 21:52:45 -04:00
  • a7e4e3ae1f Fixed stats bug Simon 2018-06-20 18:07:55 -04:00
  • c5deafbea5 Should fix some odd http listings Simon 2018-06-20 13:34:41 -04:00
  • cf51bb381c Added top websites scatter graph Simon 2018-06-20 12:21:34 -04:00
  • 7400bdc2a9 Added admin blacklist control in dashboard Simon 2018-06-20 11:28:06 -04:00
  • 35837463cd Added admin clear & delete buttons for websites Simon 2018-06-20 10:48:51 -04:00
  • cef9e2c8a1 Added some file types association Simon 2018-06-19 22:41:43 -04:00
  • 5f07e7d340 File types color based on type Simon 2018-06-19 22:34:44 -04:00
  • 5afdfb2b3c fixed navbar icon for mobile Simon 2018-06-19 21:13:36 -04:00
  • c99400994b Modified graph of file types Simon 2018-06-19 20:17:20 -04:00
  • 76ed03a82e Dates and sizes graphs styling Simon 2018-06-19 19:44:04 -04:00
  • 8236b04c2e Dates and sizes graphs Simon 2018-06-19 19:04:12 -04:00
  • e0b5aad654 Preview icon for images Simon 2018-06-19 13:56:00 -04:00
  • d8486104b4 Fix for odd html listings Simon 2018-06-19 12:14:50 -04:00
  • e54609972c Overwrite document on re-index, update website last_modified on task complete, delete website files on index complete Simon 2018-06-19 11:24:28 -04:00
  • 8486555426 Ignore 'parent directory' links Simon 2018-06-19 10:36:09 -04:00
  • 8f311e52ee Typo in csv export Simon 2018-06-19 10:17:15 -04:00
  • 5bdfa9985c Small adjustments for csv export (again) Simon 2018-06-19 10:04:55 -04:00
  • 4f5f0f76be Small adjustments for csv export Simon 2018-06-19 10:01:15 -04:00
  • e5e38a6faf Elasticsearch export to csv Simon 2018-06-19 09:48:44 -04:00
  • 81d52a4551 Changed UI to fit the-eye.eu Simon 2018-06-18 22:37:05 -04:00
  • 677bfa03ea Another fix for encoding problems Simon 2018-06-18 20:30:18 -04:00
  • 788d3749d4 Homepage now compatible with new stats Simon 2018-06-18 20:04:49 -04:00
  • 8768e39f08 Added stats page Simon 2018-06-18 19:56:25 -04:00
  • 7923647ea3 Made the ftp crawler work with the latest changes Simon 2018-06-18 15:46:03 -04:00
  • 83f4b8def9 Enhanced search results page Simon 2018-06-18 15:01:49 -04:00
  • 8a73142ff8 Support for more than just utf-8 and removed some debug info Simon 2018-06-18 13:44:19 -04:00
  • 7c47b0f00c Added delta column in crawl logs Simon 2018-06-18 12:21:00 -04:00
  • b63c7190c3 Improved external link detection Simon 2018-06-18 12:14:05 -04:00
  • 400abc9a3c Added crawl logs page Simon 2018-06-18 11:41:26 -04:00
  • 99d64b658b Disabled thread pool for headers requests in listing Simon 2018-06-18 10:33:33 -04:00
  • b97b8f6784 Temporary fix for decoding errors Simon 2018-06-17 22:17:21 -04:00
  • 344e7274d7 Simplified url joining and splitting, switched from lxml to html.parser, various memory usage optimizations Simon 2018-06-17 22:10:46 -04:00
  • 07d51a75cc Increased queue.get() timeouts Simon 2018-06-17 10:07:06 -04:00
  • e6175c84c9 Re-added timeout that was accidentally deleted Simon 2018-06-16 22:20:15 -04:00
  • 1283cc9599 Should fix memory usage problem when crawling (part three) Simon 2018-06-16 20:32:50 -04:00
  • 86144935e3 Attempt to fix Unicode errors part two Simon 2018-06-16 15:30:44 -04:00
  • c309aa25c8 Attempt to fix unicode decode errors Simon 2018-06-16 15:20:23 -04:00
  • 9d0a0a8b42 Should fix memory usage problem when crawling (part two) Simon 2018-06-16 14:53:48 -04:00
  • adb94cf326 Should fix memory usage problem when crawling Simon 2018-06-14 23:36:54 -04:00
  • 9aed18c2d2 Should fix timeout error when indexing Simon 2018-06-14 20:07:50 -04:00
  • 81fde6cc30 Bug fixes with html parsing Simon 2018-06-14 20:02:06 -04:00
  • f3c7b551d2 Some adjustments to make it work on Stretch server Simon 2018-06-14 17:09:05 -04:00
  • dffd032659 Indexing after crawling is a bit more efficient Simon 2018-06-14 16:41:43 -04:00
  • 83ca579ec7 Started working on post-crawl callbacks and basic auth for crawl servers Simon 2018-06-14 15:05:56 -04:00
  • 1bd58468eb Bug fixes for FTP crawler Simon 2018-06-13 15:54:45 -04:00
  • 9bde8cb629 uWSGI config and bugfix with file extensions Simon 2018-06-13 14:11:27 -04:00
  • e91572a06f Homepage stats now work with elasticsearch Simon 2018-06-12 23:19:57 -04:00
  • 2fe81e4b06 Crawl server now holds at most max_workers + 1 tasks in pool to minimize waiting time and to avoid loss of too many tasks in case of crash/restart Simon 2018-06-12 22:28:36 -04:00
  • 24ef493245 Websites being indexed now show up on the homepage Simon 2018-06-12 21:51:02 -04:00
  • bccb1d0dfd Website link list works with elasticsearch Simon 2018-06-12 21:26:44 -04:00
  • e266a50197 Website stats now works with elasticsearch Simon 2018-06-12 20:17:30 -04:00
  • 4b60ac62fc Added website url & date in search results & fixed threading problem Simon 2018-06-12 17:48:15 -04:00
  • 0127b3a51d Basic searching integrated with elasticsearch + highlighting Simon 2018-06-12 16:29:05 -04:00
  • af2601ee70 Fixed file duplication problem Simon 2018-06-12 15:55:52 -04:00
  • 1718bb91ca Files are indexed into ES when task is complete Simon 2018-06-12 15:45:00 -04:00
  • 6c912ea8c5 Completed tasks are now fetched by the TaskDispatcher Simon 2018-06-12 14:16:05 -04:00
  • d61fd75890 Tasks can now be queued from the web interface. Tasks are dispatched to the crawl server(s) Simon 2018-06-12 13:44:03 -04:00
  • 6d48f1f780 Task crawl result now logged in a database Simon 2018-06-12 11:03:45 -04:00
  • 011b8455a7 Elasticsearch search engine (search & scroll) Simon 2018-06-11 23:06:41 -04:00
  • 72495275b0 Elasticsearch search engine (import from json) Simon 2018-06-11 22:35:49 -04:00
  • fcfd7d4acc Bug fixes + export to json Simon 2018-06-11 20:02:30 -04:00
  • d849227798 barebones crawl_server microservice Simon 2018-06-11 19:00:43 -04:00
  • 8421cc0885 Refactoring on http crawler Simon 2018-06-11 16:06:56 -04:00
  • 7f496ce7a8 Slowly losing my sanity part 1: Removed scrapy dependency and moved to custom solution. Added multi-threaded ftp crawler Simon 2018-06-11 15:46:55 -04:00
  • b649b82854 Cleanup of custom crawler Simon 2018-06-10 21:32:08 -04:00
  • f2d914060b Removed unsuitable scrapy spider and implemented custom crawler Simon 2018-06-10 20:08:59 -04:00
  • d8c16d53e6 FTP url validation Simon 2018-06-10 14:32:19 -04:00
  • 0304c98a31 Added basic ftp spider for scrapy Simon 2018-06-10 14:12:55 -04:00
  • f1e8183cdf Bulk insert captcha Simon 2018-06-10 07:21:44 -04:00
  • 4523a4335c Added bulk insert feature Simon 2018-06-10 07:20:58 -04:00
  • 1bd8a5fc22 Designed form for bulk insert Simon 2018-06-10 07:03:59 -04:00
  • a25976d24a Generate and delete API tokens Simon 2018-06-09 12:41:28 -04:00
  • de717d3992 Blacklisted skyarchive.info Simon 2018-06-09 11:12:03 -04:00
  • 20d0f97ffb Logout button Simon 2018-06-08 11:48:11 -04:00
  • dc0cde61a0 Basic admin page Simon 2018-06-08 11:40:54 -04:00
  • 537228444b Duplicate website w/ reddit post + refactor Simon 2018-06-08 10:40:58 -04:00
  • 7f1e12cc3c Blacklisted https://oss.jfrog.org Simon 2018-06-07 18:18:58 -04:00
  • b79b0ca58c Results per page now configurable Simon 2018-06-07 13:49:08 -04:00
  • 306b0ed0fe Added option to choose results per page Simon 2018-06-07 13:19:41 -04:00
  • ab25d821c6 Export job no longer let the user download incomplete archives Simon 2018-06-07 11:35:56 -04:00
  • 06d3a09e11 Quick hack for search order options Simon 2018-06-07 11:22:35 -04:00
  • 221a16697b Changed user agent Simon 2018-06-07 10:44:43 -04:00
  • 92a91606ea Create LICENSE Simon Fortier 2018-06-07 10:36:26 -04:00
  • 4f6d7f32ad Option to turn off SSL, moved secret keys to config.py, switched to sqlite WAL mode to avoid locked database problems Simon 2018-06-07 10:33:35 -04:00