29 Commits

Author SHA1 Message Date
Richard Patel
4b8275c7bf
Add parser tests 2018-12-18 15:31:09 +01:00
Richard Patel
86ec78cae1
Add TCP timeout option 2018-11-20 03:29:10 +01:00
Richard Patel
03a487f393
Fix crawl loop 2018-11-18 18:45:06 +01:00
Richard Patel
a71157b4d8
Add User-Agent parameter 2018-11-18 14:24:04 +01:00
Richard Patel
6793086c22
Ignore HTTPS errors 2018-11-18 00:37:30 +01:00
Richard Patel
d596882b40
Fix ton of bugs 2018-11-17 04:18:22 +01:00
Richard Patel
718f9d7fbc
Rename project 2018-11-17 01:33:15 +01:00
Richard Patel
f1687679ab
Unescape results & don't recrawl 404 2018-11-17 01:21:20 +01:00
Simon
1e78cea7e7
Saved path should not contain file name 2018-11-16 13:58:12 -05:00
Richard Patel
82234f949e
Less tokenizer allocations 2018-11-16 00:22:40 +01:00
Richard Patel
084b3a5903
Optimizing with hexa :P 2018-11-15 23:51:31 +01:00
Richard Patel
ac0b8d2d0b
Blacklist all paths with a query parameter 2018-11-15 23:36:41 +01:00
Richard Patel
ffde1a9e5d
Timeout and results saving 2018-11-15 20:14:31 +01:00
Richard Patel
4c071171eb
Exclude dups in dir instead of keeping hashes of links 2018-11-11 23:11:30 +01:00
Richard Patel
9c8174dd8d
Fix header parsing 2018-11-11 18:53:17 +01:00
Richard Patel
a8c27b2d21
Hash links 2018-11-06 02:01:53 +01:00
Richard Patel
ed5e35f005
Performance improvements 2018-11-06 00:34:22 +01:00
Richard Patel
77cb45dbec
Detect directory symlinks 2018-10-28 18:37:18 +01:00
Richard Patel
bfd7302be8
Add urfave/cli app 2018-10-28 17:59:46 +01:00
Richard Patel
b1c40767e0
Remember scanned URLs 2018-10-28 17:07:30 +01:00
Richard Patel
ddfdce9d0f
Refactor a bit 2018-10-28 13:43:45 +01:00
Richard Patel
79f540bf29
Scheduler 2018-10-28 02:40:12 +02:00
Richard Patel
3fb4d4bde9
More logs 2018-10-27 17:25:32 +02:00
Richard Patel
76c8c13d49
Use finite state machine 2018-10-27 16:55:00 +02:00
Richard Patel
442a2cf8a7
Compare finite state machine and Regex 2018-10-27 16:53:45 +02:00
Richard Patel
9e090d109d
Header state machine 2018-10-27 16:29:10 +02:00
Richard Patel
d748be72cd
File HEAD requests 2018-10-27 16:22:01 +02:00
Richard Patel
2844d344ec
Working listing 2018-10-27 15:00:20 +02:00
Richard Patel
f2d2b620fa
Simple queue crawler 2018-10-27 04:08:32 +02:00