Username
269fed55ff
refactor core modules, integrate network stats
2025-12-25 11:13:20 +01:00
Username
9360c35add
ppf: add format_duration helper and stale log improvements
...
- Add format_duration() for compact time display
- Improve stale proxy logging with duration info
2025-12-24 00:20:13 +01:00
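Note: the commit above names format_duration() but the log does not show its body. A minimal sketch of what such a helper could look like, assuming it takes a number of seconds and prints at most two units (the exact output format is an assumption, not taken from the repository):

```python
def format_duration(seconds):
    """Render a duration in seconds as a compact string (hypothetical format)."""
    seconds = int(seconds)
    if seconds < 60:
        return "%ds" % seconds
    minutes, secs = divmod(seconds, 60)
    if minutes < 60:
        # Drop the smaller unit when it is zero to keep the output short.
        return "%dm%ds" % (minutes, secs) if secs else "%dm" % minutes
    hours, minutes = divmod(minutes, 60)
    if hours < 24:
        return "%dh%dm" % (hours, minutes) if minutes else "%dh" % hours
    days, hours = divmod(hours, 24)
    return "%dd%dh" % (days, hours) if hours else "%dd" % days
```

A stale-proxy log line could then include something like "stale for " + format_duration(age) instead of a raw second count.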
Username
68a34f2638
fetch: detect proxy protocol from source URL path
...
- detect_proto_from_path() infers socks4/socks5/http from URL
- extract_proxies() now returns (address, proto) tuples
- ppf.py updated to handle protocol-tagged proxies
- profiler signal handler for SIGTERM stats dump
2025-12-23 17:23:17 +01:00
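Note: a rough sketch of how protocol detection from a source URL path might work, based only on the commit description above. The keyword matching and the None fallback are assumptions; the real detect_proto_from_path() may use different rules.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

# Checked in order; the keyword list is an assumption for illustration.
_PROTO_KEYWORDS = ("socks5", "socks4", "http")


def detect_proto_from_path(url):
    """Guess the proxy protocol from keywords in the URL path, or return None."""
    path = urlparse(url).path.lower()
    for proto in _PROTO_KEYWORDS:
        if proto in path:
            return proto
    return None
```

Only the path is inspected, so the "https" scheme of the source URL itself does not count as a hint.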
Username
267035802a
ppf: reset stale_count when content hash changes
2025-12-22 00:05:06 +01:00
Username
f382a4ab6a
ppf: add content hash for duplicate proxy list detection
2025-12-22 00:03:12 +01:00
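Note: the two commits above describe hashing fetched content to spot unchanged proxy lists and resetting stale_count when the hash changes. A minimal sketch of that idea follows; the choice of SHA-256, the helper names content_hash() and update_stale_count(), and the increment-on-unchanged behaviour are all assumptions.

```python
import hashlib


def content_hash(body):
    """Return a hex digest of the page body (SHA-256 is an assumed choice)."""
    if not isinstance(body, bytes):
        body = body.encode("utf-8")
    return hashlib.sha256(body).hexdigest()


def update_stale_count(prev_hash, stale_count, body):
    """Return (new_hash, new_stale_count) for a freshly fetched body."""
    new_hash = content_hash(body)
    if new_hash == prev_hash:
        # Same content as last time: the list is a duplicate, so it ages.
        return new_hash, stale_count + 1
    # Content changed: start counting staleness from zero again.
    return new_hash, 0
```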
Username
747e6dd7aa
ppf: improve exception handling and logging
2025-12-21 23:37:57 +01:00
Username
e24f68500c
style: normalize indentation and improve code style
...
- convert tabs to 4-space indentation
- add docstrings to modules and classes
- remove unused import (copy)
- use explicit object inheritance
- use 'while True' over 'while 1'
- use 'while args' over 'while len(args)'
- use '{}' over 'dict()'
- consistent string formatting
- Python 2/3 compatible Queue import
2025-12-20 23:18:45 +01:00
Username
4780b6f095
fetch: consolidate extract_proxies into single implementation
2025-12-20 22:50:39 +01:00
Username
c759f7197e
ppf: use shared proxy cache from fetch module
2025-12-20 22:28:42 +01:00
Username
1d865d5250
ppf: use soup_parser instead of direct bs4 import
2025-12-20 17:33:40 +01:00
Username
57a7687b08
ppf: remove dead http server code
2025-12-20 16:46:08 +01:00
Your Name
15ff16b8d6
force py2 usage
2021-10-30 07:13:04 +02:00
Your Name
ee481ea31e
ppf: make scraper use extra proxies if available
2021-07-27 22:36:15 +02:00
Your Name
6b6cd94cec
spaces to tabs
2021-06-27 12:31:15 +02:00
Your Name
d3d83e1d90
changes
2021-05-12 08:06:03 +02:00
Your Name
cae6f75643
changes
2021-05-02 00:22:12 +02:00
Your Name
1a4d51f08c
ppf: play nice with cpu
2021-02-10 22:26:27 +01:00
Your Name
60c78be3fb
import new url as bulk list, misc cleansing
2021-02-06 23:25:12 +01:00
Your Name
7e91ae5237
changes
2021-02-06 21:50:08 +01:00
Your Name
68394da9ab
misc changes and fixes
2021-02-06 15:36:14 +01:00
Your Name
b29c734002
fix: url → self.url, make thread option configurable
2021-02-06 14:33:44 +01:00
Your Name
5965312a9a
make leeching multithreaded, misc changes
2021-02-06 14:30:07 +01:00
Your Name
dd3d3c3518
fix: always check if is_bad_url
2021-02-06 12:20:34 +01:00
Your Name
01bded472f
tabs to space
2021-02-06 12:14:22 +01:00
Your Name
78b29a1187
some changes
2021-01-24 03:52:56 +01:00
Mickaël Serneels
eeedf9d0a1
extract urls only from the same domain? (default: False)
...
setting this option will make ppf not follow external links when extracting uris
2019-05-14 21:24:29 +02:00
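Note: a sketch of what a same-domain restriction on extracted links could look like. The function name filter_same_domain() and the same_domain_only parameter are hypothetical; the commit above only states that the option exists and defaults to False.

```python
try:
    from urllib.parse import urljoin, urlparse  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse  # Python 2


def filter_same_domain(base_url, links, same_domain_only=False):
    """Resolve links against base_url and optionally drop external hosts."""
    base_host = urlparse(base_url).netloc.lower()
    kept = []
    for link in links:
        link = link.strip()
        if not link:
            continue  # skip empty hrefs
        absolute = urljoin(base_url, link)
        if same_domain_only and urlparse(absolute).netloc.lower() != base_host:
            continue  # external link, not followed when the option is set
        kept.append(absolute)
    return kept
```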
Mickaël Serneels
b226bc0b03
check if bad url *after* building the url
2019-05-14 19:31:19 +02:00
Mickaël Serneels
eeae849e12
space2tab
2019-05-14 19:29:30 +02:00
Mickaël Serneels
bcaf7af0e7
extract_urls(): only when stale_count = 0
2019-05-13 23:49:35 +02:00
Mickaël Serneels
e2122a27d9
ppf: strip extracted uris
2019-05-13 23:48:55 +02:00
Mickaël Serneels
225b76462c
import_from_file: don't add empty url
2019-05-13 23:48:55 +02:00
Mickaël Serneels
c241f1a766
make use of dbs.insert_urls()
2019-05-01 23:19:50 +02:00
Mickaël Serneels
c8d594fb73
add url extraction
...
urls get extracted from a webpage when the page contains proxies
this allows ppf to "learn" as many links as possible from a working website
2019-05-01 22:58:23 +02:00
Mickaël Serneels
0fb706eeae
clean code
2019-05-01 17:43:29 +02:00
Mickaël Serneels
9a624819d3
check content type
2019-05-01 17:43:29 +02:00
Mickaël Serneels
0155c6f2ad
ppf: check content-type (once) before trying to download/extract proxies
...
avoid trying to extract stuff from pdf and such (only accept text/*)
REQUIRES:
sqlite3 websites.sqlite "alter table uris add content_type text"
Don't test known uris:
sqlite3 websites.sqlite "update uris set content_type='text/manual' WHERE error=0"
2019-05-01 17:43:28 +02:00
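Note: the commit above says only text/* content types are accepted before download and extraction. A minimal sketch of such a check, assuming the raw Content-Type header value is available as a string (the function name is hypothetical):

```python
def is_text_content_type(content_type):
    """Return True when the header value denotes a text/* media type."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith("text/")
```

Under this check the 'text/manual' marker from the update statement above also passes, which matches its stated purpose of not re-testing known uris.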
mickael
61c3ae6130
fix: define retrievals on import
2019-05-01 17:43:28 +02:00
rofl0r
2bacf77c8c
split ppf into two programs, ppf/scraper
2019-01-18 22:53:35 +00:00
rofl0r
8be5ab1567
ppf: move insert function into dbs.py
2019-01-18 21:43:17 +00:00
rofl0r
5fd693a4a2
ppf: remove more unneeded stuff
2019-01-18 19:55:54 +00:00
rofl0r
d926e66092
ppf: remove unneeded stuff
2019-01-18 19:53:55 +00:00
rofl0r
b0f92fcdcd
ppf.py: improve urignore code readability
2019-01-18 19:52:15 +00:00
rofl0r
4a41796b19
factor out http related code from ppf.py
2019-01-18 19:30:42 +00:00
rofl0r
0dad0176f3
ppf: add new field proxies_added to be able to rate sites
...
sqlite3 urls.sqlite "alter table uris add proxies_added INT"
sqlite3 urls.sqlite "update uris set proxies_added=0"
2019-01-18 15:44:09 +00:00
mickael
f489f0c4dd
set retrievals to 0 for new uris
2019-01-13 16:50:54 +00:00
rofl0r
69d366f7eb
ppf: add retrievals field so we know whether an url is new
...
use
sqlite3 urls.sqlite "alter table uris add retrievals INT"
sqlite3 urls.sqlite "update uris set retrievals=0"
2019-01-13 16:40:12 +00:00
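Note: the manual sqlite3 commands above could also be run from Python in a way that is safe to repeat. This is only a sketch using the standard sqlite3 module; the helper add_column_if_missing() is hypothetical and not part of the repository, and the table and column names must come from trusted code since they are interpolated into the SQL.

```python
import sqlite3


def add_column_if_missing(conn, table, column, decl, default):
    """Add a column and backfill it; return False if it already exists."""
    # PRAGMA table_info yields one row per column; index 1 is the name.
    existing = [row[1] for row in conn.execute("PRAGMA table_info(%s)" % table)]
    if column in existing:
        return False
    conn.execute("ALTER TABLE %s ADD COLUMN %s %s" % (table, column, decl))
    conn.execute("UPDATE %s SET %s = ?" % (table, column), (default,))
    conn.commit()
    return True
```

For the commit above this would be called as add_column_if_missing(conn, "uris", "retrievals", "INT", 0) on a connection to urls.sqlite.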
rofl0r
54e2c2a702
ppf: simplify statement
2019-01-13 16:40:12 +00:00
rofl0r
2f7a730311
ppf: use slice for the 500 rows limitation
2019-01-13 16:40:12 +00:00
mickael
7c7fa8836a
patch: 1y4C
2019-01-13 16:40:12 +00:00
rofl0r
24d2c08c9f
ppf: make it possible to import a file containing proxies directly
...
using --file filename.html
2019-01-11 05:45:13 +00:00