Commit Graph

81 Commits

Author SHA1 Message Date
Your Name
dd3d3c3518 fix: always check if is_bad_url 2021-02-06 12:20:34 +01:00
Your Name
01bded472f tabs to space 2021-02-06 12:14:22 +01:00
Your Name
78b29a1187 some changes 2021-01-24 03:52:56 +01:00
Mickaël Serneels
eeedf9d0a1 extract urls only from the same domain? (default: False)
setting this option makes ppf not follow external links when extracting uris
2019-05-14 21:24:29 +02:00
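The same-domain option above amounts to a hostname comparison before following a link. A minimal sketch, assuming a helper named same_domain (not the project's actual code):

```python
from urllib.parse import urlparse

def same_domain(base_url, link):
    # Keep a link only if its hostname matches the page it was
    # extracted from; external links are skipped when the
    # same-domain option is enabled.
    return urlparse(base_url).hostname == urlparse(link).hostname

links = ["http://example.com/proxies2.html", "http://other.net/list"]
kept = [l for l in links if same_domain("http://example.com/list", l)]
# kept == ["http://example.com/proxies2.html"]
```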
Mickaël Serneels
b226bc0b03 check if bad url *after* building the url 2019-05-14 19:31:19 +02:00
Mickaël Serneels
eeae849e12 space2tab 2019-05-14 19:29:30 +02:00
Mickaël Serneels
bcaf7af0e7 extract_urls(): only when stale_count = 0 2019-05-13 23:49:35 +02:00
Mickaël Serneels
e2122a27d9 ppf: strip extracted uris 2019-05-13 23:48:55 +02:00
Mickaël Serneels
225b76462c import_from_file: don't add empty url 2019-05-13 23:48:55 +02:00
Mickaël Serneels
c241f1a766 make use of dbs.insert_urls() 2019-05-01 23:19:50 +02:00
Mickaël Serneels
c8d594fb73 add url extraction
urls get extracted from the webpage when the page contains proxies

this allows "learning" as many links as possible from a working website
2019-05-01 22:58:23 +02:00
Mickaël Serneels
0fb706eeae clean code 2019-05-01 17:43:29 +02:00
Mickaël Serneels
9a624819d3 check content type 2019-05-01 17:43:29 +02:00
Mickaël Serneels
0155c6f2ad ppf: check content-type (once) before trying to download/extract proxies
avoid trying to extract stuff from pdf and such (only accept text/*)

REQUIRES:
sqlite3 websites.sqlite "alter table uris add content_type text"

Don't test known uris:
sqlite3 websites.sqlite "update uris set content_type='text/manual' WHERE error=0"
2019-05-01 17:43:28 +02:00
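The content-type gate described above boils down to accepting only text/* media types before downloading a page in full. A sketch of the check on the header value itself (the function name is an assumption):

```python
def is_text_content_type(ctype):
    # Strip parameters like "; charset=utf-8", then accept only
    # text/* media types, so pdfs and other binaries are skipped.
    return ctype.split(";")[0].strip().lower().startswith("text/")

is_text_content_type("text/html; charset=utf-8")  # → True
is_text_content_type("application/pdf")           # → False
```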
mickael
61c3ae6130 fix: define retrievals on import 2019-05-01 17:43:28 +02:00
rofl0r
2bacf77c8c split ppf into two programs, ppf/scraper 2019-01-18 22:53:35 +00:00
rofl0r
8be5ab1567 ppf: move insert function into dbs.py 2019-01-18 21:43:17 +00:00
rofl0r
5fd693a4a2 ppf: remove more unneeded stuff 2019-01-18 19:55:54 +00:00
rofl0r
d926e66092 ppf: remove unneeded stuff 2019-01-18 19:53:55 +00:00
rofl0r
b0f92fcdcd ppf.py: improve urignore code readability 2019-01-18 19:52:15 +00:00
rofl0r
4a41796b19 factor out http related code from ppf.py 2019-01-18 19:30:42 +00:00
rofl0r
0dad0176f3 ppf: add new field proxies_added to be able to rate sites
sqlite3 urls.sqlite "alter table uris add proxies_added INT"
sqlite3 urls.sqlite "update uris set proxies_added=0"
2019-01-18 15:44:09 +00:00
mickael
f489f0c4dd set retrievals to 0 for new uris 2019-01-13 16:50:54 +00:00
rofl0r
69d366f7eb ppf: add retrievals field so we know whether a url is new
use

sqlite3 urls.sqlite "alter table uris add retrievals INT"
sqlite3 urls.sqlite "update uris set retrievals=0"
2019-01-13 16:40:12 +00:00
rofl0r
54e2c2a702 ppf: simplify statement 2019-01-13 16:40:12 +00:00
rofl0r
2f7a730311 ppf: use slice for the 500 rows limitation 2019-01-13 16:40:12 +00:00
mickael
7c7fa8836a patch: 1y4C 2019-01-13 16:40:12 +00:00
rofl0r
24d2c08c9f ppf: make it possible to import a file containing proxies directly
using --file filename.html
2019-01-11 05:45:13 +00:00
rofl0r
ecf587c8f7 ppf: set newly added sites to 0,0 (err/stale)
we use the tuple 0,0 later on to detect whether a site is new or not.
2019-01-11 05:23:05 +00:00
rofl0r
8b10df9c1b ppf.py: start using stale_count 2019-01-11 05:08:32 +00:00
rofl0r
d2cb7441a8 ppf: add optional debug output 2019-01-11 05:03:40 +00:00
rofl0r
b6dba08cf0 ppf: only extract ips with port >= 10 2019-01-11 03:29:13 +00:00
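Requiring port >= 10 means the port must have at least two digits. One way to sketch that constraint directly in the extraction regex (the pattern is an assumption, not the project's actual one):

```python
import re

# \d{2,5} requires at least two digits in the port, so
# single-digit ports (< 10) never match.
PROXY_RE = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3}):(\d{2,5})\b")

text = "1.2.3.4:8 5.6.7.8:1080"
PROXY_RE.findall(text)
# → [('5.6.7.8', '1080')]
```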
rofl0r
122847d888 ppf: fix bug referencing removed db field 2019-01-11 02:53:16 +00:00
mickael
4c6a83373f split databases 2019-01-11 00:25:01 +00:00
rofl0r
087559637e ppf: improve cleanhtml() and cache compiled re's
now it transforms e.g. '<td>118.114.116.36</td>\n<td>1080</td>'
correctly.
(the newline was formerly preventing success)
2019-01-10 19:22:21 +00:00
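The cleanhtml() behavior described above can be sketched with precompiled regexes, matching the commit's own example; this is an illustrative reconstruction, not the project's actual implementation:

```python
import re

# Compile once and cache, as the commit describes.
TAG_RE = re.compile(r"<[^>]+>")
WS_RE = re.compile(r"\s+")

def cleanhtml(raw):
    # Replace tags with spaces, then collapse all whitespace,
    # including the newline that formerly broke ip/port pairs apart.
    return WS_RE.sub(" ", TAG_RE.sub(" ", raw)).strip()

cleanhtml('<td>118.114.116.36</td>\n<td>1080</td>')
# → '118.114.116.36 1080'
```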
mickael
383ae6f431 fix: no uris were tested because commented 2019-01-10 00:21:57 +00:00
mickael
da4f228479 discard urls that fail at first test 2019-01-09 23:38:59 +00:00
mickael
15dee0cd73 add -intitle:pdf to searx query 2019-01-09 23:30:55 +00:00
mickael
e94644a60e searx: loop for 10 pages on each searx instance 2019-01-09 22:55:55 +00:00
mickael
8993727f03 changed regex 2019-01-09 20:07:28 +00:00
mickael
33887385f0 is_usable_proxy: group the first 2 lines 2019-01-09 19:23:09 +00:00
mickael
9828db79d4 is_usable_proxy(): don't check twice if A < 1 2019-01-09 19:11:05 +00:00
mickael
6f0d5c1ffa modify and rename should_i_... function
> remove :port from D
> check if octets are within a correct range
2019-01-09 19:01:55 +00:00
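The two checks listed above (strip :port, then validate octet ranges) can be sketched as one hypothetical helper; the name and shape are assumptions:

```python
def is_valid_ip(addr):
    # Strip a trailing :port, then require exactly four numeric
    # octets, each within 0-255.
    host = addr.split(":")[0]
    parts = host.split(".")
    if len(parts) != 4:
        return False
    return all(p.isdigit() and 0 <= int(p) <= 255 for p in parts)

is_valid_ip("118.114.116.36:1080")  # → True
is_valid_ip("300.1.2.3")            # → False
```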
mickael
a74d6dfce8 do not save invalid IPs 2019-01-09 00:42:28 +00:00
rofl0r
6e4c45175e ppf: add safeguards against tor outage 2019-01-08 15:48:38 +00:00
rofl0r
1f3179de48 ppf: check for valid ports 2019-01-08 04:30:50 +00:00
rofl0r
9ccf8b7854 ppf: write dates as int 2019-01-08 04:19:09 +00:00
rofl0r
38d89f5bd9 ppf: add option for number of http retries 2019-01-08 03:30:31 +00:00
rofl0r
115c4a56f5 ppf: honor timeout 2019-01-08 03:25:52 +00:00
rofl0r
f16f754b0e implement combo config parser
allows all options to be overridden by command line.

e.g.
[watchd]
threads=10
debug=false

--watch.threads=50 --debug=true
2019-01-08 02:17:10 +00:00
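A minimal sketch of such a combo parser, where flags of the form --section.key=value override values read from the ini file. The section and key names come from the example above; the implementation itself is an assumption:

```python
import configparser

def load_config(ini_text, argv):
    # Parse the ini file first, then let command-line flags
    # of the form --section.key=value override its values.
    cp = configparser.ConfigParser()
    cp.read_string(ini_text)
    cfg = {s: dict(cp[s]) for s in cp.sections()}
    for arg in argv:
        if arg.startswith("--") and "=" in arg:
            key, value = arg[2:].split("=", 1)
            if "." in key:
                section, opt = key.split(".", 1)
                cfg.setdefault(section, {})[opt] = value
    return cfg

cfg = load_config("[watchd]\nthreads=10\ndebug=false\n",
                  ["--watchd.threads=50"])
# cfg["watchd"]["threads"] == "50"
```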