Mickaël Serneels
0155c6f2ad
ppf: check content-type (once) before trying to download/extract proxies
...
avoid trying to extract stuff from pdf and such (only accept text/*)
REQUIRES:
sqlite3 websites.sqlite "alter table uris add content_type text"
Don't test known uris:
sqlite3 websites.sqlite "update uris set content_type='text/manual' WHERE error=0"
2019-05-01 17:43:28 +02:00
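The "only accept text/*" rule from the commit above could look roughly like the following. This is a minimal sketch, not the code from the commit: the helper name `is_text_content_type` is hypothetical, and it assumes the Content-Type header value has already been read from the response.

```python
def is_text_content_type(header_value):
    """Return True when a Content-Type header value names a text/* media type.

    Hypothetical helper for illustration; not taken from ppf.py.
    """
    if not header_value:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type.startswith("text/")
```

With this rule, the `text/manual` marker set by the second sqlite3 command also passes, so uris already known to work would not need to be re-checked.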
mickael
61c3ae6130
fix: define retrievals on import
2019-05-01 17:43:28 +02:00
rofl0r
2bacf77c8c
split ppf into two programs, ppf/scraper
2019-01-18 22:53:35 +00:00
rofl0r
8be5ab1567
ppf: move insert function into dbs.py
2019-01-18 21:43:17 +00:00
rofl0r
5fd693a4a2
ppf: remove more unneeded stuff
2019-01-18 19:55:54 +00:00
rofl0r
d926e66092
ppf: remove unneeded stuff
2019-01-18 19:53:55 +00:00
rofl0r
b0f92fcdcd
ppf.py: improve urignore code readability
2019-01-18 19:52:15 +00:00
rofl0r
4a41796b19
factor out http related code from ppf.py
2019-01-18 19:30:42 +00:00
rofl0r
0dad0176f3
ppf: add new field proxies_added to be able to rate sites
...
sqlite3 urls.sqlite "alter table uris add proxies_added INT"
sqlite3 urls.sqlite "update uris set proxies_added=0"
2019-01-18 15:44:09 +00:00
mickael
f489f0c4dd
set retrievals to 0 for new uris
2019-01-13 16:50:54 +00:00
rofl0r
69d366f7eb
ppf: add retrievals field so we know whether an url is new
...
use
sqlite3 urls.sqlite "alter table uris add retrievals INT"
sqlite3 urls.sqlite "update uris set retrievals=0"
2019-01-13 16:40:12 +00:00
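The manual sqlite3 migration steps in the commit above can also be scripted. The sketch below assumes a simplified `uris` table (`url`, `error`) created in memory for demonstration; the real schema is not shown in this log, and `add_column_if_missing` is a hypothetical helper.

```python
import sqlite3

def add_column_if_missing(conn, table, column, decl, default=None):
    """Add `column` to `table` unless it already exists; optionally backfill.

    Table and column names are interpolated directly, so they must come
    from trusted code, never from user input.
    """
    existing = [row[1] for row in conn.execute("PRAGMA table_info(%s)" % table)]
    if column in existing:
        return False
    conn.execute("ALTER TABLE %s ADD COLUMN %s %s" % (table, column, decl))
    if default is not None:
        conn.execute("UPDATE %s SET %s = ?" % (table, column), (default,))
    conn.commit()
    return True

# Demonstration against an in-memory database with an assumed schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE uris (url TEXT, error INT)")
conn.execute("INSERT INTO uris VALUES ('http://example.com/list', 0)")
added = add_column_if_missing(conn, "uris", "retrievals", "INT", 0)
rows = conn.execute("SELECT url, retrievals FROM uris").fetchall()
```

Running the helper a second time returns False and leaves the table unchanged, which makes the migration safe to repeat.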
rofl0r
54e2c2a702
ppf: simplify statement
2019-01-13 16:40:12 +00:00
rofl0r
2f7a730311
ppf: use slice for the 500 rows limitation
2019-01-13 16:40:12 +00:00
mickael
7c7fa8836a
patch: 1y4C
2019-01-13 16:40:12 +00:00
rofl0r
24d2c08c9f
ppf: make it possible to import a file containing proxies directly
...
using --file filename.html
2019-01-11 05:45:13 +00:00
rofl0r
ecf587c8f7
ppf: set newly added sites to 0,0 (err/stale)
...
we use the tuple 0,0 later on to detect whether a site is new or not.
2019-01-11 05:23:05 +00:00
rofl0r
8b10df9c1b
ppf.py: start using stale_count
2019-01-11 05:08:32 +00:00
rofl0r
d2cb7441a8
ppf: add optional debug output
2019-01-11 05:03:40 +00:00
rofl0r
b6dba08cf0
ppf: only extract ips with port >= 10
2019-01-11 03:29:13 +00:00
rofl0r
122847d888
ppf: fix bug referencing removed db field
2019-01-11 02:53:16 +00:00
mickael
4c6a83373f
split databases
2019-01-11 00:25:01 +00:00
rofl0r
087559637e
ppf: improve cleanhtml() and cache compiled re's
...
now it transforms e.g. '<td>118.114.116.36</td>\n<td>1080</td>'
correctly.
(the newline was formerly preventing success)
2019-01-10 19:22:21 +00:00
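To illustrate why the newline mattered, here is a minimal sketch of a tag-stripping step with precompiled patterns. It is an assumption about the approach, not the code from the commit: the pattern names, the choice of ":" as the replacement character, and the `extract_proxies` helper are all hypothetical.

```python
import re

# Compiled once at import time so repeated calls reuse the same pattern objects.
_TAG_RE = re.compile(r"<[^>]*>")
_SPACE_RE = re.compile(r"\s+")
_COLONS_RE = re.compile(r":{2,}")
_PROXY_RE = re.compile(r"([0-9]{1,3}(?:\.[0-9]{1,3}){3}):([0-9]{1,5})")

def cleanhtml(raw_html):
    """Replace tags and whitespace (including newlines) with a single ':'."""
    text = _TAG_RE.sub(":", raw_html)
    text = _SPACE_RE.sub(":", text)
    return _COLONS_RE.sub(":", text)

def extract_proxies(raw_html):
    """Return 'ip:port' strings found in the cleaned text."""
    return ["%s:%s" % pair for pair in _PROXY_RE.findall(cleanhtml(raw_html))]
```

Because whitespace is folded into the separator before runs of colons are collapsed, an address and port in adjacent table cells on separate lines end up joined by exactly one ":".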
mickael
383ae6f431
fix: no uris were tested because the code was commented out
2019-01-10 00:21:57 +00:00
mickael
da4f228479
discard urls that fail the first test
2019-01-09 23:38:59 +00:00
mickael
15dee0cd73
add -intitle:pdf to searx query
2019-01-09 23:30:55 +00:00
mickael
e94644a60e
searx: loop for 10 pages on each searx instance
2019-01-09 22:55:55 +00:00
mickael
8993727f03
changed regex
2019-01-09 20:07:28 +00:00
mickael
33887385f0
is_usable_proxy: group the first 2 lines
2019-01-09 19:23:09 +00:00
mickael
9828db79d4
is_usable_proxy(): don't check twice whether A < 1
2019-01-09 19:11:05 +00:00
mickael
6f0d5c1ffa
modify and rename should_i_... function
...
> remove :port from D
> check if octets are within a correct range
2019-01-09 19:01:55 +00:00
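The two checks listed in the commit above (separate the port from the last octet, then range-check the octets) might be sketched as follows. The function name matches the later commits, but the body is an assumption for illustration: the exact ranges, and the port check added by a different commit, are guesses rather than the repository's code.

```python
import re

# Four dotted octets, a colon, then a port; ASCII digits only.
_ADDR_RE = re.compile(
    r"([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3}):([0-9]{1,5})"
)

def is_usable_proxy(proxy):
    """Return True for a well-formed 'A.B.C.D:port' string (illustrative only)."""
    match = _ADDR_RE.fullmatch(proxy)
    if match is None:
        return False
    a, b, c, d, port = (int(group) for group in match.groups())
    # Assumed ranges: first octet 1..255, remaining octets 0..255.
    if a < 1 or a > 255:
        return False
    if any(octet > 255 for octet in (b, c, d)):
        return False
    # Assumed port range: 1..65535.
    return 1 <= port <= 65535
```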
mickael
a74d6dfce8
do not save invalid IPs
2019-01-09 00:42:28 +00:00
rofl0r
6e4c45175e
ppf: add safeguards against tor outage
2019-01-08 15:48:38 +00:00
rofl0r
1f3179de48
ppf: check for valid ports
2019-01-08 04:30:50 +00:00
rofl0r
9ccf8b7854
ppf: write dates as int
2019-01-08 04:19:09 +00:00
rofl0r
38d89f5bd9
ppf: add option for number of http retries
2019-01-08 03:30:31 +00:00
rofl0r
115c4a56f5
ppf: honor timeout
2019-01-08 03:25:52 +00:00
rofl0r
f16f754b0e
implement combo config parser
...
allows all options to be overridden by command line.
e.g.
[watchd]
threads=10
debug=false
--watchd.threads=50 --debug=true
2019-01-08 02:17:10 +00:00
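A combined ini plus command-line parser of the kind described above could be sketched with the standard library like this. It is not the implementation from the commit: `load_config` is a hypothetical name, only `--section.option=value` arguments are handled, and unqualified options such as `--debug=true` are skipped because the log does not say which section they map to.

```python
import configparser

def load_config(ini_text, argv):
    """Parse ini text, then let --section.option=value arguments override it."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    for arg in argv:
        if not (arg.startswith("--") and "=" in arg):
            continue
        key, value = arg[2:].split("=", 1)
        section, dot, option = key.partition(".")
        if not dot:
            continue  # unqualified options are out of scope for this sketch
        if not parser.has_section(section):
            parser.add_section(section)
        parser.set(section, option, value)
    return parser

# File values first, then a command-line override for one of them.
cfg = load_config("[watchd]\nthreads=10\ndebug=false\n", ["--watchd.threads=50"])
threads = cfg.getint("watchd", "threads")
debug = cfg.getboolean("watchd", "debug")
```

Applying the overrides after reading the file keeps the precedence simple: the command line always wins, and options it does not mention keep their file values.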
rofl0r
e7b8d526c0
ppf: print url if fetching failed
2019-01-08 00:46:41 +00:00
mickael
1b3ce72efc
add and use combining class
2019-01-07 23:19:14 +00:00
mickael
1288dca38f
fixme: change var names
2019-01-07 21:41:41 +00:00
mickael
aeff09d2b3
move math function inside the sql statement
2019-01-07 21:11:08 +00:00
rofl0r
898c8f36ee
ppf: fix cpu hogs
2019-01-07 15:38:51 +00:00
rofl0r
ad7c7fce67
ppf: use timeout and only 1 try for http
2019-01-07 05:37:44 +00:00
mickael
8b15faf84d
ppf: change user-agent; use headers
2019-01-06 23:29:30 +00:00
mickael
3223cc82c4
use http2.py instead of requests
2019-01-06 22:22:42 +00:00
mickael
1a025f102f
only load search/bad terms when "search" arg is enabled
2019-01-06 18:31:42 +00:00
mickael
5e9f8baf56
remove unused imports
2019-01-06 18:27:06 +00:00
mickael
64d9da9156
sleep even when no proxies are added
2019-01-06 02:58:58 +00:00
mickael
63b77043ac
minor changes
...
remove comments, minimal code reorganization
2019-01-06 01:35:18 +00:00
mickael
84a1de26c3
sqlite: do not create tables with "duration" column
2019-01-06 00:50:35 +00:00