Commit Graph

92 Commits

Author SHA1 Message Date
Your Name
15ff16b8d6 force py2 usage 2021-10-30 07:13:04 +02:00
Your Name
ee481ea31e ppf: make scraper use extra proxies if available 2021-07-27 22:36:15 +02:00
Your Name
6b6cd94cec spaces to tabs 2021-06-27 12:31:15 +02:00
Your Name
d3d83e1d90 changes 2021-05-12 08:06:03 +02:00
Your Name
cae6f75643 changes 2021-05-02 00:22:12 +02:00
Your Name
1a4d51f08c ppf: play nice with cpu 2021-02-10 22:26:27 +01:00
Your Name
60c78be3fb import new url as bulk list, misc cleansing 2021-02-06 23:25:12 +01:00
Your Name
7e91ae5237 changes 2021-02-06 21:50:08 +01:00
Your Name
68394da9ab misc changes and fixes 2021-02-06 15:36:14 +01:00
Your Name
b29c734002 fix: url → self.url, make thread option configurable 2021-02-06 14:33:44 +01:00
Your Name
5965312a9a make leeching multithreaded, misc changes 2021-02-06 14:30:07 +01:00
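The multithreaded leeching with a configurable thread count could look roughly like this minimal sketch; `fetch_url` and the url list are invented stand-ins, not ppf's actual API:

```python
# Hypothetical sketch of multithreaded leeching with a configurable
# thread count; fetch_url is a placeholder for the real worker.
from concurrent.futures import ThreadPoolExecutor

def fetch_url(url):
    # placeholder for the real download/extract step
    return (url, len(url))

def leech(urls, threads=4):
    # run fetch_url over all urls with at most `threads` workers;
    # pool.map preserves input order in the results
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(fetch_url, urls))

results = leech(["http://a.example", "http://b.example"], threads=2)
```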
Your Name
dd3d3c3518 fix: always check if is_bad_url 2021-02-06 12:20:34 +01:00
Your Name
01bded472f tabs to space 2021-02-06 12:14:22 +01:00
Your Name
78b29a1187 some changes 2021-01-24 03:52:56 +01:00
Mickaël Serneels
eeedf9d0a1 extract urls only from the same domain? (default: False)
setting this option will make ppf not follow external links when extracting uris
2019-05-14 21:24:29 +02:00
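The "same domain only" option described above amounts to comparing the host of each extracted link against the host of the source page; a minimal sketch (helper name invented):

```python
# Sketch of the same-domain filter: when the option is enabled, links
# whose host differs from the source page's host are not followed.
from urllib.parse import urlparse

def same_domain(base_url, link):
    return urlparse(base_url).netloc == urlparse(link).netloc

links = ["http://proxies.example/list2", "http://other.example/x"]
kept = [l for l in links
        if same_domain("http://proxies.example/list1", l)]
```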
Mickaël Serneels
b226bc0b03 check if bad url *after* building the url 2019-05-14 19:31:19 +02:00
Mickaël Serneels
eeae849e12 space2tab 2019-05-14 19:29:30 +02:00
Mickaël Serneels
bcaf7af0e7 extract_urls(): only when stale_count = 0 2019-05-13 23:49:35 +02:00
Mickaël Serneels
e2122a27d9 ppf: strip extracted uris 2019-05-13 23:48:55 +02:00
Mickaël Serneels
225b76462c import_from_file: don't add empty url 2019-05-13 23:48:55 +02:00
Mickaël Serneels
c241f1a766 make use of dbs.insert_urls() 2019-05-01 23:19:50 +02:00
Mickaël Serneels
c8d594fb73 add url extraction
urls get extracted from the webpage when the page contains proxies

this allows "learning" as many links as possible from a working website
2019-05-01 22:58:23 +02:00
Mickaël Serneels
0fb706eeae clean code 2019-05-01 17:43:29 +02:00
Mickaël Serneels
9a624819d3 check content type 2019-05-01 17:43:29 +02:00
Mickaël Serneels
0155c6f2ad ppf: check content-type (once) before trying to download/extract proxies
avoid trying to extract stuff from pdf and such (only accept text/*)

REQUIRES:
sqlite3 websites.sqlite "alter table uris add content_type text"

Don't test known uris:
sqlite3 websites.sqlite "update uris set content_type='text/manual' WHERE error=0"
2019-05-01 17:43:28 +02:00
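The acceptance test for the content-type gate above can be sketched as a small helper (the name `acceptable` is invented); the real code would probe the header once per uri and cache the result in the new `uris.content_type` column:

```python
# Sketch of the content-type gate: only text/* responses are worth
# scraping, so pdf and other binary types are rejected up front.
def acceptable(content_type):
    # the header may carry parameters, e.g. "text/html; charset=utf-8"
    return content_type.split(";")[0].strip().startswith("text/")

assert acceptable("text/html; charset=utf-8")
assert not acceptable("application/pdf")
```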
mickael
61c3ae6130 fix: define retrievals on import 2019-05-01 17:43:28 +02:00
rofl0r
2bacf77c8c split ppf into two programs, ppf/scraper 2019-01-18 22:53:35 +00:00
rofl0r
8be5ab1567 ppf: move insert function into dbs.py 2019-01-18 21:43:17 +00:00
rofl0r
5fd693a4a2 ppf: remove more unneeded stuff 2019-01-18 19:55:54 +00:00
rofl0r
d926e66092 ppf: remove unneeded stuff 2019-01-18 19:53:55 +00:00
rofl0r
b0f92fcdcd ppf.py: improve urignore code readability 2019-01-18 19:52:15 +00:00
rofl0r
4a41796b19 factor out http related code from ppf.py 2019-01-18 19:30:42 +00:00
rofl0r
0dad0176f3 ppf: add new field proxies_added to be able to rate sites
sqlite3 urls.sqlite "alter table uris add proxies_added INT"
sqlite3 urls.sqlite "update uris set proxies_added=0"
2019-01-18 15:44:09 +00:00
mickael
f489f0c4dd set retrievals to 0 for new uris 2019-01-13 16:50:54 +00:00
rofl0r
69d366f7eb ppf: add retrievals field so we know whether a url is new
use

sqlite3 urls.sqlite "alter table uris add retrievals INT"
sqlite3 urls.sqlite "update uris set retrievals=0"
2019-01-13 16:40:12 +00:00
rofl0r
54e2c2a702 ppf: simplify statement 2019-01-13 16:40:12 +00:00
rofl0r
2f7a730311 ppf: use slice for the 500 rows limitation 2019-01-13 16:40:12 +00:00
mickael
7c7fa8836a patch: 1y4C 2019-01-13 16:40:12 +00:00
rofl0r
24d2c08c9f ppf: make it possible to import a file containing proxies directly
using --file filename.html
2019-01-11 05:45:13 +00:00
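The `--file` option above could be wired up with argparse along these lines (a hedged sketch, not the program's actual option parsing):

```python
# Hypothetical sketch of the --file option: when given, proxies are
# read from a local file instead of being fetched over http.
import argparse

parser = argparse.ArgumentParser(prog="ppf")
parser.add_argument("--file", help="import proxies from a local file")
args = parser.parse_args(["--file", "filename.html"])
# args.file now holds the path, which would then be fed to the same
# extraction routine used for downloaded pages.
```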
rofl0r
ecf587c8f7 ppf: set newly added sites to 0,0 (err/stale)
we use the tuple 0,0 later on to detect whether a site is new or not.
2019-01-11 05:23:05 +00:00
rofl0r
8b10df9c1b ppf.py: start using stale_count 2019-01-11 05:08:32 +00:00
rofl0r
d2cb7441a8 ppf: add optional debug output 2019-01-11 05:03:40 +00:00
rofl0r
b6dba08cf0 ppf: only extract ips with port >= 10 2019-01-11 03:29:13 +00:00
rofl0r
122847d888 ppf: fix bug referencing removed db field 2019-01-11 02:53:16 +00:00
mickael
4c6a83373f split databases 2019-01-11 00:25:01 +00:00
rofl0r
087559637e ppf: improve cleanhtml() and cache compiled re's
now it transforms e.g. '<td>118.114.116.36</td>\n<td>1080</td>'
correctly.
(the newline was formerly preventing success)
2019-01-10 19:22:21 +00:00
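A minimal sketch of what the improved cleanhtml() does, using the example from the commit message (this is an illustration, not the project's actual implementation): the regexes are compiled once at module level, and whitespace collapsing makes the newline between tags harmless.

```python
# Sketch of cleanhtml() with cached compiled regexes: strip tags,
# then collapse whitespace (including newlines) to single spaces.
import re

TAG_RE = re.compile(r"<[^>]+>")
WS_RE = re.compile(r"\s+")

def cleanhtml(html):
    text = TAG_RE.sub(" ", html)
    return WS_RE.sub(" ", text).strip()

cleaned = cleanhtml("<td>118.114.116.36</td>\n<td>1080</td>")
# → "118.114.116.36 1080"
```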
mickael
383ae6f431 fix: no uris were tested because commented out 2019-01-10 00:21:57 +00:00
mickael
da4f228479 discard urls that fail at the first test 2019-01-09 23:38:59 +00:00
mickael
15dee0cd73 add -intitle:pdf to searx query 2019-01-09 23:30:55 +00:00
mickael
e94644a60e searx: loop for 10 pages on each searx instance 2019-01-09 22:55:55 +00:00