31 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Richard Patel | 9bc3455ee0 | Fix missing port | 2019-02-09 16:58:25 +01:00 |
| Richard Patel | d69cd4400e | Use fasthttp.PipelineClient | 2019-02-09 16:46:36 +01:00 |
| Richard Patel | 4b8275c7bf | Add parser tests | 2018-12-18 15:31:09 +01:00 |
| Richard Patel | 86ec78cae1 | Add TCP timeout option | 2018-11-20 03:29:10 +01:00 |
| Richard Patel | 03a487f393 | Fix crawl loop | 2018-11-18 18:45:06 +01:00 |
| Richard Patel | a71157b4d8 | Add User-Agent parameter | 2018-11-18 14:24:04 +01:00 |
| Richard Patel | 6793086c22 | Ignore HTTPS errors | 2018-11-18 00:37:30 +01:00 |
| Richard Patel | d596882b40 | Fix ton of bugs | 2018-11-17 04:18:22 +01:00 |
| Richard Patel | 718f9d7fbc | Rename project | 2018-11-17 01:33:15 +01:00 |
| Richard Patel | f1687679ab | Unescape results & don't recrawl 404 | 2018-11-17 01:21:20 +01:00 |
| Simon | 1e78cea7e7 | Saved path should not contain file name | 2018-11-16 13:58:12 -05:00 |
| Richard Patel | 82234f949e | Less tokenizer allocations | 2018-11-16 00:22:40 +01:00 |
| Richard Patel | 084b3a5903 | Optimizing with hexa :P | 2018-11-15 23:51:31 +01:00 |
| Richard Patel | ac0b8d2d0b | Blacklist all paths with a query parameter | 2018-11-15 23:36:41 +01:00 |
| Richard Patel | ffde1a9e5d | Timeout and results saving | 2018-11-15 20:14:31 +01:00 |
| Richard Patel | 4c071171eb | Exclude dups in dir instead of keeping hashes of links | 2018-11-11 23:11:30 +01:00 |
| Richard Patel | 9c8174dd8d | Fix header parsing | 2018-11-11 18:53:17 +01:00 |
| Richard Patel | a8c27b2d21 | Hash links | 2018-11-06 02:01:53 +01:00 |
| Richard Patel | ed5e35f005 | Performance improvements | 2018-11-06 00:34:22 +01:00 |
| Richard Patel | 77cb45dbec | Detect directory symlinks | 2018-10-28 18:37:18 +01:00 |
| Richard Patel | bfd7302be8 | Add urfave/cli app | 2018-10-28 17:59:46 +01:00 |
| Richard Patel | b1c40767e0 | Remember scanned URLs | 2018-10-28 17:07:30 +01:00 |
| Richard Patel | ddfdce9d0f | Refactor a bit | 2018-10-28 13:43:45 +01:00 |
| Richard Patel | 79f540bf29 | Scheduler | 2018-10-28 02:40:12 +02:00 |
| Richard Patel | 3fb4d4bde9 | More logs | 2018-10-27 17:25:32 +02:00 |
| Richard Patel | 76c8c13d49 | Use finite state machine | 2018-10-27 16:55:00 +02:00 |
| Richard Patel | 442a2cf8a7 | Compare finite state machine and Regex | 2018-10-27 16:53:45 +02:00 |
| Richard Patel | 9e090d109d | Header state machine | 2018-10-27 16:29:10 +02:00 |
| Richard Patel | d748be72cd | File HEAD requests | 2018-10-27 16:22:01 +02:00 |
| Richard Patel | 2844d344ec | Working listing | 2018-10-27 15:00:20 +02:00 |
| Richard Patel | f2d2b620fa | Simple queue crawler | 2018-10-27 04:08:32 +02:00 |