Commit Graph

92 Commits

Author SHA1 Message Date
Your Name
15ff16b8d6 force py2 usage 2021-10-30 07:13:04 +02:00
Your Name
ee481ea31e ppf: make scraper use extra proxies if available 2021-07-27 22:36:15 +02:00
Your Name
6b6cd94cec spaces to tabs 2021-06-27 12:31:15 +02:00
Your Name
d3d83e1d90 changes 2021-05-12 08:06:03 +02:00
Your Name
cae6f75643 changes 2021-05-02 00:22:12 +02:00
Your Name
1a4d51f08c ppf: play nice with cpu 2021-02-10 22:26:27 +01:00
Your Name
60c78be3fb import new url as bulk list, misc cleansing 2021-02-06 23:25:12 +01:00
Your Name
7e91ae5237 changes 2021-02-06 21:50:08 +01:00
Your Name
68394da9ab misc changes and fixes 2021-02-06 15:36:14 +01:00
Your Name
b29c734002 fix: url → self.url, make thread option configurable 2021-02-06 14:33:44 +01:00
Your Name
5965312a9a make leeching multithreaded, misc changes 2021-02-06 14:30:07 +01:00
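The multithreaded leeching with a configurable thread count could look roughly like this minimal sketch; `fetch_url` and the url list are invented stand-ins, not ppf's actual API:

```python
# Hypothetical sketch of multithreaded leeching with a configurable
# thread count; fetch_url is a placeholder for the real worker.
from concurrent.futures import ThreadPoolExecutor

def fetch_url(url):
    # placeholder for the real download/extract step
    return (url, len(url))

def leech(urls, threads=4):
    # run fetch_url over all urls with at most `threads` workers;
    # pool.map preserves input order in the results
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(fetch_url, urls))

results = leech(["http://a.example", "http://b.example"], threads=2)
```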
Your Name
dd3d3c3518 fix: always check if is_bad_url 2021-02-06 12:20:34 +01:00
Your Name
01bded472f tabs to space 2021-02-06 12:14:22 +01:00
Your Name
78b29a1187 some changes 2021-01-24 03:52:56 +01:00
Mickaël Serneels
eeedf9d0a1 extract urls only from the same domain? (default: False)
setting this option will make ppf not follow external links when extracting uris
2019-05-14 21:24:29 +02:00
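The "same domain only" option described above amounts to comparing the host of each extracted link against the host of the source page; a minimal sketch (helper name invented):

```python
# Sketch of the same-domain filter: when the option is enabled, links
# whose host differs from the source page's host are not followed.
from urllib.parse import urlparse

def same_domain(base_url, link):
    return urlparse(base_url).netloc == urlparse(link).netloc

links = ["http://proxies.example/list2", "http://other.example/x"]
kept = [l for l in links
        if same_domain("http://proxies.example/list1", l)]
```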
Mickaël Serneels
b226bc0b03 check if bad url *after* building the url 2019-05-14 19:31:19 +02:00
Mickaël Serneels
eeae849e12 space2tab 2019-05-14 19:29:30 +02:00
Mickaël Serneels
bcaf7af0e7 extract_urls(): only when stale_count = 0 2019-05-13 23:49:35 +02:00
Mickaël Serneels
e2122a27d9 ppf: strip extracted uris 2019-05-13 23:48:55 +02:00
Mickaël Serneels
225b76462c import_from_file: don't add empty url 2019-05-13 23:48:55 +02:00
Mickaël Serneels
c241f1a766 make use of dbs.insert_urls() 2019-05-01 23:19:50 +02:00
Mickaël Serneels
c8d594fb73 add url extraction
urls get extracted from the webpage when the page contains proxies

this allows "learning" as many links as possible from a working website
2019-05-01 22:58:23 +02:00
Mickaël Serneels
0fb706eeae clean code 2019-05-01 17:43:29 +02:00
Mickaël Serneels
9a624819d3 check content type 2019-05-01 17:43:29 +02:00
Mickaël Serneels
0155c6f2ad ppf: check content-type (once) before trying to download/extract proxies
avoid trying to extract stuff from pdf and such (only accept text/*)

REQUIRES:
sqlite3 websites.sqlite "alter table uris add content_type text"

Don't test known uris:
sqlite3 websites.sqlite "update uris set content_type='text/manual' WHERE error=0"
2019-05-01 17:43:28 +02:00
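The acceptance test for the content-type gate above can be sketched as a small helper (the name `acceptable` is invented); the real code would probe the header once per uri and cache the result in the new `uris.content_type` column:

```python
# Sketch of the content-type gate: only text/* responses are worth
# scraping, so pdf and other binary types are rejected up front.
def acceptable(content_type):
    # the header may carry parameters, e.g. "text/html; charset=utf-8"
    return content_type.split(";")[0].strip().startswith("text/")

assert acceptable("text/html; charset=utf-8")
assert not acceptable("application/pdf")
```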
mickael
61c3ae6130 fix: define retrievals on import 2019-05-01 17:43:28 +02:00
rofl0r
2bacf77c8c split ppf into two programs, ppf/scraper 2019-01-18 22:53:35 +00:00
rofl0r
8be5ab1567 ppf: move insert function into dbs.py 2019-01-18 21:43:17 +00:00
rofl0r
5fd693a4a2 ppf: remove more unneeded stuff 2019-01-18 19:55:54 +00:00
rofl0r
d926e66092 ppf: remove unneeded stuff 2019-01-18 19:53:55 +00:00
rofl0r
b0f92fcdcd ppf.py: improve urignore code readability 2019-01-18 19:52:15 +00:00
rofl0r
4a41796b19 factor out http related code from ppf.py 2019-01-18 19:30:42 +00:00
rofl0r
0dad0176f3 ppf: add new field proxies_added to be able to rate sites
sqlite3 urls.sqlite "alter table uris add proxies_added INT"
sqlite3 urls.sqlite "update uris set proxies_added=0"
2019-01-18 15:44:09 +00:00
mickael
f489f0c4dd set retrievals to 0 for new uris 2019-01-13 16:50:54 +00:00
rofl0r
69d366f7eb ppf: add retrievals field so we know whether a url is new
use

sqlite3 urls.sqlite "alter table uris add retrievals INT"
sqlite3 urls.sqlite "update uris set retrievals=0"
2019-01-13 16:40:12 +00:00
rofl0r
54e2c2a702 ppf: simplify statement 2019-01-13 16:40:12 +00:00
rofl0r
2f7a730311 ppf: use slice for the 500 rows limitation 2019-01-13 16:40:12 +00:00
mickael
7c7fa8836a patch: 1y4C 2019-01-13 16:40:12 +00:00
rofl0r
24d2c08c9f ppf: make it possible to import a file containing proxies directly
using --file filename.html
2019-01-11 05:45:13 +00:00
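The `--file` option above could be wired up with argparse along these lines (a hedged sketch, not the program's actual option parsing):

```python
# Hypothetical sketch of the --file option: when given, proxies are
# read from a local file instead of being fetched over http.
import argparse

parser = argparse.ArgumentParser(prog="ppf")
parser.add_argument("--file", help="import proxies from a local file")
args = parser.parse_args(["--file", "filename.html"])
# args.file now holds the path, which would then be fed to the same
# extraction routine used for downloaded pages.
```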
rofl0r
ecf587c8f7 ppf: set newly added sites to 0,0 (err/stale)
we use the tuple 0,0 later on to detect whether a site is new or not.
2019-01-11 05:23:05 +00:00
rofl0r
8b10df9c1b ppf.py: start using stale_count 2019-01-11 05:08:32 +00:00
rofl0r
d2cb7441a8 ppf: add optional debug output 2019-01-11 05:03:40 +00:00
rofl0r
b6dba08cf0 ppf: only extract ips with port >= 10 2019-01-11 03:29:13 +00:00
rofl0r
122847d888 ppf: fix bug referencing removed db field 2019-01-11 02:53:16 +00:00
mickael
4c6a83373f split databases 2019-01-11 00:25:01 +00:00
rofl0r
087559637e ppf: improve cleanhtml() and cache compiled re's
now it transforms e.g. '<td>118.114.116.36</td>\n<td>1080</td>'
correctly.
(the newline was formerly preventing success)
2019-01-10 19:22:21 +00:00
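A minimal sketch of what the improved cleanhtml() does, using the example from the commit message (this is an illustration, not the project's actual implementation): the regexes are compiled once at module level, and whitespace collapsing makes the newline between tags harmless.

```python
# Sketch of cleanhtml() with cached compiled regexes: strip tags,
# then collapse whitespace (including newlines) to single spaces.
import re

TAG_RE = re.compile(r"<[^>]+>")
WS_RE = re.compile(r"\s+")

def cleanhtml(html):
    text = TAG_RE.sub(" ", html)
    return WS_RE.sub(" ", text).strip()

cleaned = cleanhtml("<td>118.114.116.36</td>\n<td>1080</td>")
# → "118.114.116.36 1080"
```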
mickael
383ae6f431 fix: no uris were tested because commented out 2019-01-10 00:21:57 +00:00
mickael
da4f228479 discard urls that fail at the first test 2019-01-09 23:38:59 +00:00
mickael
15dee0cd73 add -intitle:pdf to searx query 2019-01-09 23:30:55 +00:00
mickael
e94644a60e searx: loop for 10 pages on each searx instance 2019-01-09 22:55:55 +00:00