28 Commits

Author SHA1 Message Date
Username
98b232f3d3 fetch: add short-circuit guards to extraction functions
Skip expensive regex scans when content lacks required markers:
- extract_auth_proxies: skip if no '@' in content
- extract_proxies_from_table: skip if no '<table' tag
- extract_proxies_from_json: skip if no '{' or '['
- Hoist table regexes to module-level precompiled constants
2026-02-22 13:50:29 +01:00
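
The guard pattern described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from fetch.py: the regex constants and the row/cell parsing are assumptions, and only the function name extract_proxies_from_table and the '<table' check come from the commit message.

```python
import re

# Hypothetical module-level patterns, compiled once at import time.
# The actual regexes used in fetch.py are not shown in the commit log.
TABLE_ROW_RE = re.compile(r"<tr[^>]*>(.*?)</tr>", re.I | re.S)
TABLE_CELL_RE = re.compile(r"<td[^>]*>(.*?)</td>", re.I | re.S)


def extract_proxies_from_table(content):
    """Return (ip, port) string pairs found in HTML table rows."""
    # Cheap substring test first: without a '<table' tag there is
    # nothing for the regexes to find, so skip the scan entirely.
    if "<table" not in content.lower():
        return []
    found = []
    for row in TABLE_ROW_RE.findall(content):
        cells = [cell.strip() for cell in TABLE_CELL_RE.findall(row)]
        # Assumed layout: first cell is the address, second is the port.
        if len(cells) >= 2 and cells[1].isdigit():
            found.append((cells[0], cells[1]))
    return found
```

The substring test is linear and has a very small constant factor, so it pays for itself whenever a meaningful share of fetched pages contain no table at all.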
Username
0311abb46a fetch: encode unicode URLs to bytes before HTTP/SOCKS ops
When URLs arrive as unicode (e.g. from JSON API responses), the unicode
type propagates through _parse_url into the SOCKS5 packet construction
in rocksock. Port bytes > 127, formatted via %c into a unicode string,
become non-ASCII characters, and the implicit ASCII encode in socket
sendall() then fails.

Encode URLs to UTF-8 bytes at the fetch entry points so that the entire
request pipeline stays in the str (bytes) domain.
2026-02-17 16:43:26 +01:00
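
A minimal sketch of the entry-point encoding this commit describes, written so it runs on both Python 2 and Python 3. The helper name to_bytes is hypothetical; the commit does not say how fetch.py spells it. The port example only illustrates why the problem shows up: port 8080 packs to a byte above 127.

```python
import struct


def to_bytes(url, encoding="utf-8"):
    """Return url as a byte string, encoding it only if it is text."""
    text_type = type(u"")  # unicode on Python 2, str on Python 3
    if isinstance(url, text_type):
        return url.encode(encoding)
    return url  # already bytes, pass through unchanged


# A SOCKS5 request carries the port as two big-endian bytes.
# For port 8080 the low byte is 0x90, which is outside ASCII.
PORT_BYTES = struct.pack(">H", 8080)
```

Converting once at the boundary means no later string formatting step can silently promote the request to unicode.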
Username
12174b0d9d fetch: fix LRU cache for python 2 compatibility 2026-01-08 09:05:59 +01:00
Username
e758ce7178 dashboard: add keyboard shortcuts and optimize polling
- fetch.py: convert proxy validation cache to LRU with OrderedDict
  - thread-safe lock, move_to_end() on hits, evict oldest when full
- dashboard.js: add keyboard shortcuts (r=refresh, 1-9=tabs, t=theme, p=pause)
- dashboard.js: skip chart rendering for inactive tabs (reduces CPU)
2025-12-28 16:52:52 +01:00
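
A minimal sketch of an OrderedDict-based LRU cache with a lock, along the lines this commit describes. The class and method names are assumptions. It deliberately uses pop() and re-insertion instead of move_to_end(), because OrderedDict.move_to_end() only exists on Python 3.2 and later; that may be what the later Python 2 compatibility fix addressed, though the log does not say so.

```python
import threading
from collections import OrderedDict


class LRUCache(object):
    """Small thread-safe LRU cache (illustrative, not the fetch.py code)."""

    def __init__(self, maxsize=1024):
        self.maxsize = maxsize
        self._data = OrderedDict()
        self._lock = threading.Lock()

    def get(self, key, default=None):
        with self._lock:
            try:
                value = self._data.pop(key)
            except KeyError:
                return default
            # Re-insert so the key becomes the most recently used entry.
            self._data[key] = value
            return value

    def put(self, key, value):
        with self._lock:
            self._data.pop(key, None)
            self._data[key] = value
            # Evict from the front (oldest entry) while over capacity.
            while len(self._data) > self.maxsize:
                self._data.popitem(last=False)
```

Usage would look like cache.put(proxy, True) after a validation and cache.get(proxy) before the next one, with None meaning "not cached".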
Username
3b361916fa fetch, dbs: minor refactoring 2025-12-28 15:18:42 +01:00
Username
d2bd7d4f34 fetch: retry with different Tor circuit on failure
2025-12-26 20:57:28 +01:00
Username
906d1b33ae fetch: cache is_usable_proxy results
2025-12-26 20:04:01 +01:00
Username
481dc514fb fetch: add IPv6, auth proxy, and confidence scoring support 2025-12-26 19:13:36 +01:00
Username
272eba0f05 scraper: reuse connections, cycle circuit on block
2025-12-25 19:26:23 +01:00
Username
68e8b88afa tor: use random credentials for circuit isolation
2025-12-25 19:18:25 +01:00
Username
269fed55ff refactor core modules, integrate network stats 2025-12-25 11:13:20 +01:00
Username
97a7dc3316 fetch: use raw strings for regex patterns
2025-12-24 01:06:49 +01:00
Username
5e788c06d1 fetch: precompile proxy extraction regex
Move regex pattern compilation to module load time
for better performance in repeated calls.
2025-12-24 00:20:06 +01:00
Username
68a34f2638 fetch: detect proxy protocol from source URL path
- detect_proto_from_path() infers socks4/socks5/http from URL
- extract_proxies() now returns (address, proto) tuples
- ppf.py updated to handle protocol-tagged proxies
- profiler signal handler for SIGTERM stats dump
2025-12-23 17:23:17 +01:00
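
One way detect_proto_from_path() could work is sketched below. Only the function name and the socks4/socks5/http result set come from the commit message; the regex, the None fallback, and the folding of "https" into "http" are assumptions.

```python
import re

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

# Hypothetical pattern: protocol keywords that may appear in a list URL.
_PROTO_RE = re.compile(r"(socks5|socks4|https?)", re.I)


def detect_proto_from_path(url):
    """Guess 'socks4', 'socks5' or 'http' from the URL path, else None."""
    # Only the path is inspected, so the URL scheme itself is ignored.
    path = urlparse(url).path.lower()
    match = _PROTO_RE.search(path)
    if match is None:
        return None
    token = match.group(1)
    return "http" if token.startswith("http") else token
```

With a hint like this, extract_proxies() can tag each address as an (address, proto) tuple instead of leaving the protocol to be probed later.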
Username
6b5eb83bf4 fetch: add robust proxy string validation 2025-12-21 23:49:02 +01:00
Username
9e7c8d78b3 fetch: unify known proxies cache 2025-12-21 23:37:58 +01:00
Username
e24f68500c style: normalize indentation and improve code style
- convert tabs to 4-space indentation
- add docstrings to modules and classes
- remove unused import (copy)
- use explicit object inheritance
- use 'while True' over 'while 1'
- use 'while args' over 'while len(args)'
- use '{}' over 'dict()'
- consistent string formatting
- Python 2/3 compatible Queue import
2025-12-20 23:18:45 +01:00
Username
4780b6f095 fetch: consolidate extract_proxies into single implementation 2025-12-20 22:50:39 +01:00
Username
3c88bc3298 fetch: add unified proxy cache functions 2025-12-20 22:28:37 +01:00
Your Name
d7db366857 split to ip/port, "cleanse" ips and ports, bugfixes 2021-08-22 20:39:50 +02:00
Your Name
ee481ea31e ppf: make scraper use extra proxies if available 2021-07-27 22:36:15 +02:00
Your Name
6b6cd94cec spaces to tabs 2021-06-27 12:31:15 +02:00
Your Name
f321e5a934 fetch: more descriptive debug message 2021-02-06 23:23:47 +01:00
Your Name
abd9b5bb9f tabs to spaces 2021-02-06 14:30:18 +01:00
Mickaël Serneels
0155c6f2ad ppf: check content-type (once) before trying to download/extract proxies
Avoid trying to extract proxies from PDFs and other non-text responses (only text/* is accepted).

REQUIRES:
sqlite3 websites.sqlite "alter table uris add content_type text"

Don't test known uris:
sqlite3 websites.sqlite "update uris set content_type='text/manual' WHERE error=0"
2019-05-01 17:43:28 +02:00
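
The text/* filter from this commit can be sketched as a small predicate. The function name is hypothetical, and how ppf.py obtains or stores the header value is not shown in the log.

```python
def is_text_content_type(header_value):
    """Return True if a Content-Type header value denotes text/*."""
    if not header_value:
        return False
    # Drop parameters such as '; charset=utf-8' before comparing.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type.startswith("text/")
```

The 'text/manual' value set by the second SQL statement above passes this check, which is consistent with marking already-known URIs as acceptable without fetching their headers again.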
rofl0r
bf7ec03fbf fetch.py: factor out twice used var 2019-05-01 17:43:28 +02:00
rofl0r
b99f83a991 fetch.py: improve readability of extract_urls 2019-01-18 19:32:37 +00:00
rofl0r
4a41796b19 factor out http related code from ppf.py 2019-01-18 19:30:42 +00:00