71 Commits

Author SHA1 Message Date
Mickaël Serneels
c8d594fb73 add url extraction
URLs are extracted from a webpage when the page contains proxies.

This allows "learning" as many links as possible from a working website.
2019-05-01 22:58:23 +02:00
Mickaël Serneels
0fb706eeae clean code 2019-05-01 17:43:29 +02:00
Mickaël Serneels
9a624819d3 check content type 2019-05-01 17:43:29 +02:00
Mickaël Serneels
0155c6f2ad ppf: check content-type (once) before trying to download/extract proxies
avoid trying to extract stuff from pdf and such (only accept text/*)

REQUIRES:
sqlite3 websites.sqlite "alter table uris add content_type text"

Don't test known uris:
sqlite3 websites.sqlite "update uris set content_type='text/manual' WHERE error=0"
2019-05-01 17:43:28 +02:00
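The content-type gate described in 0155c6f2ad (only accept text/*) could look roughly like the sketch below. The helper name `is_text_content_type` is hypothetical and not taken from ppf; the log does not show how ppf actually reads or stores the header.

```python
def is_text_content_type(header_value):
    """Return True only for text/* media types.

    Hypothetical helper, not ppf's actual code. `header_value` is the raw
    Content-Type header, e.g. "text/html; charset=utf-8", or None if absent.
    """
    if not header_value:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type.startswith("text/")
```

Under this sketch, the placeholder value 'text/manual' written by the second sqlite3 command above would also pass the check, which matches the stated aim of not re-testing known uris.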
mickael
61c3ae6130 fix: define retrievals on import 2019-05-01 17:43:28 +02:00
rofl0r
2bacf77c8c split ppf into two programs, ppf/scraper 2019-01-18 22:53:35 +00:00
rofl0r
8be5ab1567 ppf: move insert function into dbs.py 2019-01-18 21:43:17 +00:00
rofl0r
5fd693a4a2 ppf: remove more unneeded stuff 2019-01-18 19:55:54 +00:00
rofl0r
d926e66092 ppf: remove unneeded stuff 2019-01-18 19:53:55 +00:00
rofl0r
b0f92fcdcd ppf.py: improve urignore code readability 2019-01-18 19:52:15 +00:00
rofl0r
4a41796b19 factor out http related code from ppf.py 2019-01-18 19:30:42 +00:00
rofl0r
0dad0176f3 ppf: add new field proxies_added to be able to rate sites
sqlite3 urls.sqlite "alter table uris add proxies_added INT"
sqlite3 urls.sqlite "update uris set proxies_added=0"
2019-01-18 15:44:09 +00:00
mickael
f489f0c4dd set retrievals to 0 for new uris 2019-01-13 16:50:54 +00:00
rofl0r
69d366f7eb ppf: add retrievals field so we know whether an url is new
use

sqlite3 urls.sqlite "alter table uris add retrievals INT"
sqlite3 urls.sqlite "update uris set retrievals=0"
2019-01-13 16:40:12 +00:00
rofl0r
54e2c2a702 ppf: simplify statement 2019-01-13 16:40:12 +00:00
rofl0r
2f7a730311 ppf: use slice for the 500 rows limitation 2019-01-13 16:40:12 +00:00
mickael
7c7fa8836a patch: 1y4C 2019-01-13 16:40:12 +00:00
rofl0r
24d2c08c9f ppf: make it possible to import a file containing proxies directly
using --file filename.html
2019-01-11 05:45:13 +00:00
rofl0r
ecf587c8f7 ppf: set newly added sites to 0,0 (err/stale)
we use the tuple 0,0 later on to detect whether a site is new or not.
2019-01-11 05:23:05 +00:00
rofl0r
8b10df9c1b ppf.py: start using stale_count 2019-01-11 05:08:32 +00:00
rofl0r
d2cb7441a8 ppf: add optional debug output 2019-01-11 05:03:40 +00:00
rofl0r
b6dba08cf0 ppf: only extract ips with port >= 10 2019-01-11 03:29:13 +00:00
rofl0r
122847d888 ppf: fix bug referencing removed db field 2019-01-11 02:53:16 +00:00
mickael
4c6a83373f split databases 2019-01-11 00:25:01 +00:00
rofl0r
087559637e ppf: improve cleanhtml() and cache compiled re's
now it transforms e.g. '<td>118.114.116.36</td>\n<td>1080</td>'
correctly.
(the newline was formerly preventing success)
2019-01-10 19:22:21 +00:00
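To illustrate the newline problem fixed in 087559637e, here is a minimal sketch of tag stripping with precompiled regular expressions. The names `clean_html`, `extract_proxies` and the patterns themselves are assumptions for illustration; ppf's real cleanhtml() is not shown in this log and may work differently.

```python
import re

# Compiled once at import time, in the spirit of "cache compiled re's".
TAG_RE = re.compile(r"<[^>]*>")
SEP_RE = re.compile(r"[\s:]+")
PROXY_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})\b")


def clean_html(fragment):
    """Replace tags with ':' and collapse whitespace/colon runs into one ':'."""
    text = TAG_RE.sub(":", fragment)
    # Treating newlines as separators is what lets the ip and port cells,
    # which sit on different lines, end up adjacent as "ip:port".
    return SEP_RE.sub(":", text).strip(":")


def extract_proxies(fragment):
    """Return every ip:port pair found in the cleaned fragment."""
    return ["%s:%s" % match for match in PROXY_RE.findall(clean_html(fragment))]
```

With this sketch, the fragment quoted in the commit message is reduced to a single ip:port string. Collapsing all whitespace is a blunt choice that suits table cells but would also join unrelated words in ordinary prose.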
mickael
383ae6f431 fix: no uris were tested because commented out 2019-01-10 00:21:57 +00:00
mickael
da4f228479 discard urls that fail the first test 2019-01-09 23:38:59 +00:00
mickael
15dee0cd73 add -intitle:pdf to searx query 2019-01-09 23:30:55 +00:00
mickael
e94644a60e searx: loop for 10 pages on each searx instance 2019-01-09 22:55:55 +00:00
mickael
8993727f03 changed regex 2019-01-09 20:07:28 +00:00
mickael
33887385f0 is_usable_proxy: group the first 2 lines 2019-01-09 19:23:09 +00:00
mickael
9828db79d4 is_usable_proxy(): don't check twice if A < 1 2019-01-09 19:11:05 +00:00
mickael
6f0d5c1ffa modify and rename should_i_... function
> remove :port from D
> check if octets are within a correct range
2019-01-09 19:01:55 +00:00
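The two steps listed in 6f0d5c1ffa (strip the port from the last octet, then range-check the octets) can be sketched as follows. The exact bounds are assumptions: octets 0..255, first octet at least 1, port 1..65535. ppf's real is_usable_proxy() may apply different or additional rules.

```python
def is_usable_proxy(proxy):
    """Validate an 'A.B.C.D:port' string.

    Illustrative sketch only; the bounds below are assumed, not taken
    from ppf's source.
    """
    try:
        # Split off ":port" first so the last octet D is a bare number.
        host, port_text = proxy.rsplit(":", 1)
        octets = [int(part) for part in host.split(".")]
        port = int(port_text)
    except ValueError:
        # Missing port, empty field, or non-numeric text.
        return False
    if len(octets) != 4:
        return False
    if octets[0] < 1:
        return False
    if any(octet < 0 or octet > 255 for octet in octets):
        return False
    return 1 <= port <= 65535
```

Splitting from the right with `rsplit` keeps the address intact and isolates the port in one step, so a string without a port fails in the same `except` branch as a non-numeric one.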
mickael
a74d6dfce8 do not save invalid IPs 2019-01-09 00:42:28 +00:00
rofl0r
6e4c45175e ppf: add safeguards against tor outage 2019-01-08 15:48:38 +00:00
rofl0r
1f3179de48 ppf: check for valid ports 2019-01-08 04:30:50 +00:00
rofl0r
9ccf8b7854 ppf: write dates as int 2019-01-08 04:19:09 +00:00
rofl0r
38d89f5bd9 ppf: add option for number of http retries 2019-01-08 03:30:31 +00:00
rofl0r
115c4a56f5 ppf: honor timeout 2019-01-08 03:25:52 +00:00
rofl0r
f16f754b0e implement combo config parser
allows all options to be overridden by command line.

e.g.
[watchd]
threads=10
debug=false

--watch.threads=50 --debug=true
2019-01-08 02:17:10 +00:00
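The override behaviour described in f16f754b0e can be approximated with the standard library as below. This is a sketch under assumptions: it is written for Python 3's `configparser`, the function name `load_config` is hypothetical, and it only handles flags of the form `--section.option=value`. How the real parser maps flag names to sections is not specified in this log.

```python
import configparser


def load_config(ini_text, argv):
    """Parse INI text, then let --section.option=value flags override it.

    Hypothetical helper; returns a dict of dicts with string values.
    """
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    config = {name: dict(parser.items(name)) for name in parser.sections()}

    for arg in argv:
        if not arg.startswith("--") or "=" not in arg:
            continue
        key, value = arg[2:].split("=", 1)
        if "." not in key:
            # Flags without a section prefix are ignored in this sketch.
            continue
        section, option = key.split(".", 1)
        # Command-line values win over values read from the file.
        config.setdefault(section, {})[option] = value
    return config
```

For example, `load_config("[watchd]\nthreads=10\ndebug=false\n", ["--watchd.threads=50"])` yields a `watchd` section whose `threads` is the string "50" while `debug` keeps its file value.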
rofl0r
e7b8d526c0 ppf: print url if fetching failed 2019-01-08 00:46:41 +00:00
mickael
1b3ce72efc add and use combining class 2019-01-07 23:19:14 +00:00
mickael
1288dca38f fixme: change var names 2019-01-07 21:41:41 +00:00
mickael
aeff09d2b3 move math function inside the sql statement 2019-01-07 21:11:08 +00:00
rofl0r
898c8f36ee ppf: fix cpu hogs 2019-01-07 15:38:51 +00:00
rofl0r
ad7c7fce67 ppf: use timeout and only 1 try for http 2019-01-07 05:37:44 +00:00
mickael
8b15faf84d ppf: change user-agent; use headers 2019-01-06 23:29:30 +00:00
mickael
3223cc82c4 use http2.py instead of requests 2019-01-06 22:22:42 +00:00
mickael
1a025f102f only load search/bad terms when "search" arg is enabled 2019-01-06 18:31:42 +00:00
mickael
5e9f8baf56 remove unused imports 2019-01-06 18:27:06 +00:00