Commit Graph

103 Commits

Author SHA1 Message Date
Username
269fed55ff refactor core modules, integrate network stats 2025-12-25 11:13:20 +01:00
Username
9360c35add ppf: add format_duration helper and stale log improvements
- Add format_duration() for compact time display
- Improve stale proxy logging with duration info
2025-12-24 00:20:13 +01:00
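The format_duration() helper described in this commit might look like the following sketch. The function name comes from the commit message; the exact output format and signature are assumptions, not the repository's actual code.

```python
# Hypothetical sketch of a compact duration formatter like the
# format_duration() helper named in the commit; output format is assumed.
def format_duration(seconds):
    """Render a duration in seconds as a compact string like '1h02m03s'."""
    seconds = int(seconds)
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return "%dh%02dm%02ds" % (hours, minutes, secs)
    if minutes:
        return "%dm%02ds" % (minutes, secs)
    return "%ds" % secs

# e.g. for the stale-proxy logging mentioned above:
# log("list unchanged for %s" % format_duration(stale_seconds))
```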
Username
68a34f2638 fetch: detect proxy protocol from source URL path
- detect_proto_from_path() infers socks4/socks5/http from URL
- extract_proxies() now returns (address, proto) tuples
- ppf.py updated to handle protocol-tagged proxies
- profiler signal handler for SIGTERM stats dump
2025-12-23 17:23:17 +01:00
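A minimal reconstruction of the protocol detection this commit describes: infer socks4/socks5/http from the source URL's path, falling back to http. Only the function name is taken from the commit message; the heuristics are assumptions.

```python
from urllib.parse import urlparse

# Hypothetical sketch: infer the proxy protocol from the URL path,
# so extract_proxies() can tag each address with a protocol.
def detect_proto_from_path(url, default="http"):
    path = urlparse(url).path.lower()
    for proto in ("socks5", "socks4"):  # check socks5 first; "http" is the fallback
        if proto in path:
            return proto
    return default

# extract_proxies() is said to return (address, proto) tuples, e.g.:
# [("1.2.3.4:1080", detect_proto_from_path(source_url)), ...]
print(detect_proto_from_path("https://example.com/lists/socks5.txt"))  # socks5
```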
Username
267035802a ppf: reset stale_count when content hash changes 2025-12-22 00:05:06 +01:00
Username
f382a4ab6a ppf: add content hash for duplicate proxy list detection 2025-12-22 00:03:12 +01:00
Username
747e6dd7aa ppf: improve exception handling and logging 2025-12-21 23:37:57 +01:00
Username
e24f68500c style: normalize indentation and improve code style
- convert tabs to 4-space indentation
- add docstrings to modules and classes
- remove unused import (copy)
- use explicit object inheritance
- use 'while True' over 'while 1'
- use 'while args' over 'while len(args)'
- use '{}' over 'dict()'
- consistent string formatting
- Python 2/3 compatible Queue import
2025-12-20 23:18:45 +01:00
Username
4780b6f095 fetch: consolidate extract_proxies into single implementation 2025-12-20 22:50:39 +01:00
Username
c759f7197e ppf: use shared proxy cache from fetch module 2025-12-20 22:28:42 +01:00
Username
1d865d5250 ppf: use soup_parser instead of direct bs4 import 2025-12-20 17:33:40 +01:00
Username
57a7687b08 ppf: remove dead http server code 2025-12-20 16:46:08 +01:00
Your Name
15ff16b8d6 force py2 usage 2021-10-30 07:13:04 +02:00
Your Name
ee481ea31e ppf: make scraper use extra proxies if available 2021-07-27 22:36:15 +02:00
Your Name
6b6cd94cec spaces to tabs 2021-06-27 12:31:15 +02:00
Your Name
d3d83e1d90 changes 2021-05-12 08:06:03 +02:00
Your Name
cae6f75643 changes 2021-05-02 00:22:12 +02:00
Your Name
1a4d51f08c ppf: play nice with cpu 2021-02-10 22:26:27 +01:00
Your Name
60c78be3fb import new url as bulk list, misc cleansing 2021-02-06 23:25:12 +01:00
Your Name
7e91ae5237 changes 2021-02-06 21:50:08 +01:00
Your Name
68394da9ab misc changes and fixes 2021-02-06 15:36:14 +01:00
Your Name
b29c734002 fix: url → self.url, make thread option configurable 2021-02-06 14:33:44 +01:00
Your Name
5965312a9a make leeching multithreaded, misc changes 2021-02-06 14:30:07 +01:00
Your Name
dd3d3c3518 fix: always check if is_bad_url 2021-02-06 12:20:34 +01:00
Your Name
01bded472f tabs to space 2021-02-06 12:14:22 +01:00
Your Name
78b29a1187 some changes 2021-01-24 03:52:56 +01:00
Mickaël Serneels
eeedf9d0a1 extract urls only from the same domain? (default: False)
setting this option will make ppf not follow external links when extracting uris
2019-05-14 21:24:29 +02:00
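The same-domain option described above could be illustrated as follows; this is a sketch under assumed names, comparing hostnames of the source page and the extracted link.

```python
from urllib.parse import urlparse

# Hypothetical sketch: with the same-domain option enabled, drop
# extracted links whose host differs from the page they came from.
def keep_link(source_url, link, same_domain_only=False):
    if not same_domain_only:
        return True
    return urlparse(source_url).netloc == urlparse(link).netloc
```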
Mickaël Serneels
b226bc0b03 check if bad url *after* building the url 2019-05-14 19:31:19 +02:00
Mickaël Serneels
eeae849e12 space2tab 2019-05-14 19:29:30 +02:00
Mickaël Serneels
bcaf7af0e7 extract_urls(): only when stale_count = 0 2019-05-13 23:49:35 +02:00
Mickaël Serneels
e2122a27d9 ppf: strip extracted uris 2019-05-13 23:48:55 +02:00
Mickaël Serneels
225b76462c import_from_file: don't add empty url 2019-05-13 23:48:55 +02:00
Mickaël Serneels
c241f1a766 make use of dbs.insert_urls() 2019-05-01 23:19:50 +02:00
Mickaël Serneels
c8d594fb73 add url extraction
urls get extracted from the webpage when the page contains proxies

this allows ppf to "learn" as many links as possible from a working website
2019-05-01 22:58:23 +02:00
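The "learn links from a working site" idea above can be sketched with a plain stdlib link extractor: harvest the href targets of a page that yielded proxies for later crawling. The real code reportedly goes through bs4/soup_parser; this version and its names are assumptions.

```python
from html.parser import HTMLParser

# Hypothetical sketch: collect <a href> targets from a page,
# stripping whitespace from each extracted uri.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value.strip())

def extract_urls(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```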
Mickaël Serneels
0fb706eeae clean code 2019-05-01 17:43:29 +02:00
Mickaël Serneels
9a624819d3 check content type 2019-05-01 17:43:29 +02:00
Mickaël Serneels
0155c6f2ad ppf: check content-type (once) before trying to download/extract proxies
avoid trying to extract stuff from pdf and such (only accept text/*)

REQUIRES:
sqlite3 websites.sqlite "alter table uris add content_type text"

Don't test known uris:
sqlite3 websites.sqlite "update uris set content_type='text/manual' WHERE error=0"
2019-05-01 17:43:28 +02:00
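The content-type gate described in this commit (only accept text/*, checked once and cached in the uris table) might reduce to a predicate like this; the function name and exact parsing are assumptions.

```python
# Hypothetical sketch of the text/* filter: skip pdf, images, and other
# binary responses before trying to extract proxies from them.
def acceptable_content_type(ctype):
    """True only for text/* content types (parameters like charset are ignored)."""
    return bool(ctype) and ctype.split(";")[0].strip().lower().startswith("text/")
```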
mickael
61c3ae6130 fix: define retrievals on import 2019-05-01 17:43:28 +02:00
rofl0r
2bacf77c8c split ppf into two programs, ppf/scraper 2019-01-18 22:53:35 +00:00
rofl0r
8be5ab1567 ppf: move insert function into dbs.py 2019-01-18 21:43:17 +00:00
rofl0r
5fd693a4a2 ppf: remove more unneeded stuff 2019-01-18 19:55:54 +00:00
rofl0r
d926e66092 ppf: remove unneeded stuff 2019-01-18 19:53:55 +00:00
rofl0r
b0f92fcdcd ppf.py: improve urignore code readability 2019-01-18 19:52:15 +00:00
rofl0r
4a41796b19 factor out http related code from ppf.py 2019-01-18 19:30:42 +00:00
rofl0r
0dad0176f3 ppf: add new field proxies_added to be able to rate sites
sqlite3 urls.sqlite "alter table uris add proxies_added INT"
sqlite3 urls.sqlite "update uris set proxies_added=0"
2019-01-18 15:44:09 +00:00
mickael
f489f0c4dd set retrievals to 0 for new uris 2019-01-13 16:50:54 +00:00
rofl0r
69d366f7eb ppf: add retrievals field so we know whether a url is new
use

sqlite3 urls.sqlite "alter table uris add retrievals INT"
sqlite3 urls.sqlite "update uris set retrievals=0"
2019-01-13 16:40:12 +00:00
rofl0r
54e2c2a702 ppf: simplify statement 2019-01-13 16:40:12 +00:00
rofl0r
2f7a730311 ppf: use slice for the 500 rows limitation 2019-01-13 16:40:12 +00:00
mickael
7c7fa8836a patch: 1y4C 2019-01-13 16:40:12 +00:00
rofl0r
24d2c08c9f ppf: make it possible to import a file containing proxies directly
using --file filename.html
2019-01-11 05:45:13 +00:00