Username 4547ec3188 roadmap: update completed work

2025-12-20 18:25:55 +01:00

PPF Project Roadmap

Project Purpose

PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework designed to:

Discover proxy addresses by crawling websites and search engines
Validate proxies through multi-target testing via Tor
Maintain a database of working proxies with protocol detection (SOCKS4/SOCKS5/HTTP)

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                              PPF Architecture                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                     │
│  │ scraper.py  │    │   ppf.py    │    │proxywatchd  │                     │
│  │             │    │             │    │             │                     │
│  │ Searx query │───>│ URL harvest │───>│ Proxy test  │                     │
│  │ URL finding │    │Proxy extract│    │ Validation  │                     │
│  └─────────────┘    └─────────────┘    └─────────────┘                     │
│         │                  │                  │                             │
│         v                  v                  v                             │
│  ┌─────────────────────────────────────────────────────────────────┐       │
│  │                        SQLite Databases                          │       │
│  │  uris.db (URLs)                    proxies.db (proxy list)       │       │
│  └─────────────────────────────────────────────────────────────────┘       │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────┐       │
│  │                         Network Layer                            │       │
│  │  rocksock.py ─── Tor SOCKS ─── Test Proxy ─── Target Server      │       │
│  └─────────────────────────────────────────────────────────────────┘       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Constraints

Python 2.7 compatibility required
Minimal external dependencies (avoid adding new modules)
Current dependencies: beautifulsoup4 (optional when running with --nobs, which falls back to the stdlib HTMLParser)
Optional: IP2Location (for proxy geolocation)

Phase 1: Stability & Code Quality

Objective: Establish a solid, maintainable codebase

1.1 Error Handling Improvements

Task	Description	File(s)
Add connection retry logic	Implement exponential backoff for failed connections	rocksock.py, fetch.py
Graceful database errors	Handle SQLite lock/busy errors with retry	mysqlite.py
Timeout standardization	Consistent timeout handling across all network ops	proxywatchd.py, fetch.py
Exception logging	Log exceptions with context instead of silently catching them	all files
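
The retry task above could take roughly the following shape. This is a minimal sketch, not code from rocksock.py or fetch.py: the helper name retry_with_backoff, its parameters, and the default delays are assumptions chosen for illustration. It runs unchanged on Python 2.7 and Python 3.

```python
import random
import time


def retry_with_backoff(func, retries=4, base_delay=0.5, max_delay=30.0,
                       exceptions=(IOError, OSError), sleep=time.sleep):
    """Call func() until it succeeds or the retries are exhausted.

    The delay doubles after each failure (base_delay, 2x, 4x, ...), is
    capped at max_delay, and gets up to 10% random jitter so that many
    worker threads do not retry in lockstep.
    """
    attempt = 0
    while True:
        try:
            return func()
        except exceptions:
            if attempt >= retries:
                raise  # out of retries: let the caller see the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay + random.uniform(0, delay * 0.1))
            attempt += 1
```

socket.error is a subclass of IOError on Python 2.6+ and an alias of OSError on Python 3, so the default exception tuple covers ordinary connection failures. The same wrapper could be pointed at sqlite3.OperationalError for the lock/busy case, although whether that fits mysqlite.py is an open question.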

1.2 Code Consolidation

Task	Description	File(s)
Unify _known_proxies	Single source of truth for known proxy cache	ppf.py, fetch.py
Extract proxy utils	Create proxy_utils.py with cleanse/validate functions	new file
Remove global config pattern	Pass config explicitly instead of set_config()	fetch.py
Standardize logging	Consistent _log() usage with levels across all modules	all files
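
As an illustration of what a validate function in the proposed proxy_utils.py might look like, here is a minimal sketch. The function name parse_proxy and the exact rules (IPv4 only, ports 1 to 65535, leading zeros normalized) are assumptions; the existing cleansing logic in ppf.py may apply different rules.

```python
import re

# Four dotted decimal groups, a colon, and a port of up to five digits.
_PROXY_RE = re.compile(
    r'^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3}):([0-9]{1,5})$')


def parse_proxy(text):
    """Return (ip, port) for a well-formed 'ip:port' string, else None."""
    match = _PROXY_RE.match(text.strip())
    if match is None:
        return None
    numbers = [int(group) for group in match.groups()]
    octets, port = numbers[:4], numbers[4]
    if any(octet > 255 for octet in octets):
        return None
    if not 1 <= port <= 65535:
        return None
    # Re-join the parsed integers so leading zeros are normalized away.
    return '.'.join(str(octet) for octet in octets), port
```

Returning None rather than raising keeps the caller's loop simple when scraping noisy pages: malformed candidates are dropped instead of crashing the harvester.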

1.3 Testing Infrastructure

Task	Description	File(s)
Add unit tests	Test proxy parsing, URL extraction, IP validation	tests/
Mock network layer	Allow testing without live network/Tor	tests/
Validation test suite	Verify multi-target voting logic	tests/

Phase 2: Performance Optimization

Objective: Improve throughput and resource efficiency

2.1 Connection Pooling

Task	Description	File(s)
Tor connection reuse	Pool Tor SOCKS connections instead of reconnecting	proxywatchd.py
HTTP keep-alive	Reuse connections to same target servers	http2.py
Connection warm-up	Pre-establish connections before job assignment	proxywatchd.py

2.2 Database Optimization

Task	Description	File(s)
Batch inserts	Group INSERT operations (partly done, see Completed Work)	dbs.py
Index optimization	Add indexes for frequent query patterns (done, see Completed Work)	dbs.py
WAL mode	Enable Write-Ahead Logging for better concurrency (done, see Completed Work)	mysqlite.py
Prepared statements	Cache compiled SQL statements	mysqlite.py

2.3 Threading Improvements

Task	Description	File(s)
Dynamic thread scaling	Adjust thread count based on success rate	proxywatchd.py
Priority queue	Test high-value proxies (low fail count) first	proxywatchd.py
Stale proxy cleanup	Background thread to remove long-dead proxies	proxywatchd.py
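
The priority queue task could be sketched with the stdlib heapq module, as below. The class name ProxyPriorityQueue and its methods are hypothetical, and the sketch is not thread-safe on its own; a shared worker queue would need a lock around it, or the thread-safe PriorityQueue class from the Python 2 Queue module instead.

```python
import heapq
import itertools


class ProxyPriorityQueue(object):
    """Min-heap keyed on fail count; ties are served in insertion order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def push(self, proxy, failed):
        # The counter breaks ties so equal fail counts stay first-in,
        # first-out and proxy values are never compared with each other.
        heapq.heappush(self._heap, (failed, next(self._counter), proxy))

    def pop(self):
        """Remove and return the proxy with the lowest fail count."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

With this ordering, a proxy that has never failed is tested before one with several recorded failures, which matches the "low fail count first" intent of the task.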

Phase 3: Reliability & Accuracy

Objective: Improve proxy validation accuracy and system reliability

3.1 Enhanced Validation

Task	Description	File(s)
Latency tracking	Store and use connection latency metrics	proxywatchd.py, dbs.py
Geographic validation	Verify proxy actually routes through claimed location	proxywatchd.py
Protocol fingerprinting	Better SOCKS4/SOCKS5/HTTP detection	rocksock.py
HTTPS/SSL testing	Validate SSL proxy capabilities	proxywatchd.py

3.2 Target Management

Task	Description	File(s)
Dynamic target pool	Auto-discover and rotate validation targets	proxywatchd.py
Target health tracking	Remove unresponsive targets from pool	proxywatchd.py
Geographic target spread	Ensure targets span multiple regions	config.py

3.3 Failure Analysis

Task	Description	File(s)
Failure categorization	Distinguish timeout vs refused vs auth-fail	proxywatchd.py
Retry strategies	Different retry logic per failure type	proxywatchd.py
Dead proxy quarantine	Separate storage for likely-dead proxies	dbs.py
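
Failure categorization could start from the exception raised by the socket layer, as in the following sketch. The function name categorize_failure and the category strings are assumptions. Authentication failures are not covered, because how rocksock.py reports them is not described in this roadmap.

```python
import errno
import socket


def categorize_failure(exc):
    """Map a socket-level exception to a coarse failure category."""
    if isinstance(exc, socket.timeout):
        return 'timeout'
    code = getattr(exc, 'errno', None)
    if code == errno.ETIMEDOUT:
        return 'timeout'  # the OS gave up on the connection attempt
    if code == errno.ECONNREFUSED:
        return 'refused'
    if code in (errno.ECONNRESET, errno.EPIPE):
        return 'reset'
    return 'other'
```

A per-category retry strategy can then key off the returned string, for example retrying timeouts sooner than refused connections.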

Phase 4: Features & Usability

Objective: Add useful features while maintaining simplicity

4.1 Reporting & Monitoring

Task	Description	File(s)
Statistics collection	Track success rates, throughput, latency	proxywatchd.py
Periodic status output	Log summary stats every N minutes	ppf.py, proxywatchd.py
Export functionality	Export working proxies to file (txt, json)	new: export.py
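
A possible shape for the export task is sketched below. The function name export_proxies, the (proto, ip, port) tuple layout, and both output layouts are assumptions; the roadmap only states that txt and json should be supported.

```python
import json


def export_proxies(proxies, fmt='txt'):
    """Serialize an iterable of (proto, ip, port) tuples as text or JSON."""
    if fmt == 'txt':
        # One proxy per line, e.g. socks5://1.2.3.4:1080
        return ''.join('%s://%s:%d\n' % (proto, ip, port)
                       for proto, ip, port in proxies)
    if fmt == 'json':
        records = [{'proto': proto, 'ip': ip, 'port': port}
                   for proto, ip, port in proxies]
        return json.dumps(records, indent=2, sort_keys=True)
    raise ValueError('unsupported format: %r' % (fmt,))
```

Returning a string instead of writing to a file keeps the function easy to test; the caller decides where the output goes.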

4.2 Configuration

Task	Description	File(s)
Config validation	Validate config.ini on startup	config.py
Runtime reconfiguration	Reload config without restart (SIGHUP)	proxywatchd.py
Sensible defaults	Document and improve default values	config.py

4.3 Proxy Source Expansion

Task	Description	File(s)
Additional scrapers	Support more search engines beyond Searx	scraper.py
API sources	Integrate free proxy API endpoints	new: api_sources.py
Import formats	Support various proxy list formats	ppf.py

Implementation Priority

┌─────────────────────────────────────────────────────────────────────────────┐
│ Priority Matrix                                                             │
├──────────────────────────┬──────────────────────────────────────────────────┤
│ HIGH IMPACT / LOW EFFORT │ HIGH IMPACT / HIGH EFFORT                        │
│                          │                                                  │
│ ● Unify _known_proxies   │ ● Connection pooling                             │
│ ● Graceful DB errors     │ ● Dynamic thread scaling                         │
│ ● Batch inserts          │ ● Unit test infrastructure                       │
│ ● WAL mode for SQLite    │ ● Latency tracking                               │
│                          │                                                  │
├──────────────────────────┼──────────────────────────────────────────────────┤
│ LOW IMPACT / LOW EFFORT  │ LOW IMPACT / HIGH EFFORT                         │
│                          │                                                  │
│ ● Standardize logging    │ ● Geographic validation                          │
│ ● Config validation      │ ● Additional scrapers                            │
│ ● Export functionality   │ ● API sources                                    │
│ ● Status output          │ ● Protocol fingerprinting                        │
│                          │                                                  │
└──────────────────────────┴──────────────────────────────────────────────────┘

Completed Work

Multi-Target Validation (Done)

Work-stealing queue with shared Queue.Queue()
Multi-target validation (2/3 majority voting)
Interleaved testing (jobs shuffled across proxies)
ProxyTestState and TargetTestJob classes
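
The idea behind interleaved testing and majority voting can be sketched as follows. This is not the ProxyTestState or TargetTestJob code; the function names build_jobs and is_working are hypothetical, and the real classes may track more state.

```python
import random


def build_jobs(proxies, targets, rng=random):
    """Create one (proxy, target) job per pair and shuffle them.

    Shuffling interleaves the jobs so that consecutive tests rarely hit
    the same proxy, which spreads the load across proxies and targets.
    """
    jobs = [(proxy, target) for proxy in proxies for target in targets]
    rng.shuffle(jobs)
    return jobs


def is_working(results, needed=2):
    """Accept a proxy when at least `needed` target tests succeeded."""
    return sum(1 for ok in results if ok) >= needed
```

With three targets and needed=2, a proxy survives one failing target, so a single unresponsive target does not mark a good proxy as dead.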

Code Cleanup (Done)

Removed dead HTTP server code from ppf.py
Removed dead gumbo code from soup_parser.py
Removed test code from comboparse.py
Removed unused functions from misc.py
Fixed IP/port cleansing in ppf.py extract_proxies()
Updated .gitignore, removed .pyc files

Database Optimization (Done)

Enabled SQLite WAL mode for better concurrency
Added indexes for common query patterns (failed, tested, proto, error, check_time)
Optimized batch inserts (removed redundant SELECT before INSERT OR IGNORE)
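
The three database changes above can be illustrated with the stdlib sqlite3 module. This is a sketch under assumptions: the table name proxylist and the column types are hypothetical, and only the indexed column names come from this roadmap.

```python
import os
import sqlite3
import tempfile

# WAL needs a file-backed database; an in-memory database stays in
# "memory" journal mode.
path = os.path.join(tempfile.mkdtemp(), 'proxies.db')
db = sqlite3.connect(path)

# The PRAGMA returns the journal mode that is now in effect.
mode = db.execute('PRAGMA journal_mode=WAL').fetchone()[0]

# Hypothetical schema for illustration.
db.execute('CREATE TABLE IF NOT EXISTS proxylist ('
           'proxy TEXT PRIMARY KEY, proto TEXT, failed INTEGER DEFAULT 0, '
           'tested INTEGER, error TEXT, check_time INTEGER)')
db.execute('CREATE INDEX IF NOT EXISTS idx_failed ON proxylist(failed)')
db.execute('CREATE INDEX IF NOT EXISTS idx_check_time '
           'ON proxylist(check_time)')

# INSERT OR IGNORE skips rows whose primary key already exists, so no
# SELECT is needed beforehand; executemany submits the whole batch.
rows = [('1.2.3.4:1080',), ('5.6.7.8:3128',), ('1.2.3.4:1080',)]
db.executemany('INSERT OR IGNORE INTO proxylist (proxy) VALUES (?)', rows)
db.commit()

count = db.execute('SELECT COUNT(*) FROM proxylist').fetchone()[0]
```

The duplicate third row is silently dropped by the primary key constraint, which is why the separate existence check became redundant.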

Dependency Reduction (Done)

Made lxml optional (removed from requirements)
Made IP2Location optional (graceful fallback)
Added --nobs flag for stdlib HTMLParser fallback (bs4 optional)
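
An optional dependency with a stdlib fallback usually follows the pattern sketched below. The names HAVE_BS4, LinkCollector, and extract_links are hypothetical, and the use_bs4 parameter only stands in for the --nobs flag; how soup_parser.py actually wires the flag is not shown in this roadmap.

```python
try:
    from bs4 import BeautifulSoup
    HAVE_BS4 = True
except ImportError:
    HAVE_BS4 = False

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class LinkCollector(HTMLParser):
    """Collect non-empty href values from <a> tags using only the stdlib."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)


def extract_links(html, use_bs4=True):
    """Return href values, preferring bs4 when installed and allowed."""
    if use_bs4 and HAVE_BS4:
        soup = BeautifulSoup(html, 'html.parser')
        return [a['href'] for a in soup.find_all('a', href=True)
                if a['href']]
    parser = LinkCollector()
    parser.feed(html)
    return parser.links
```

Calling extract_links(page, use_bs4=False) forces the stdlib path even when bs4 is installed, which is the behavior a flag such as --nobs would be expected to select.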

Technical Debt

Item	Description	Risk
Dual _known_proxies	ppf.py and fetch.py maintain separate caches	Medium - duplicates possible
Global config in fetch.py	set_config() pattern is fragile	Low - works but not clean
No input validation	Proxy strings parsed without validation	Medium - could crash on bad data
Silent exception catching	Some except: pass patterns hide errors	High - hard to debug
Hardcoded timeouts	Various timeout values scattered in code	Low - works but not configurable

File Reference

File	Purpose	Status
ppf.py	Main URL harvester daemon	Active, cleaned
proxywatchd.py	Proxy validation daemon	Active, enhanced
scraper.py	Searx search integration	Active, cleaned
fetch.py	HTTP fetching with proxy support	Active
dbs.py	Database schema and inserts	Active
mysqlite.py	SQLite wrapper	Active
rocksock.py	Socket/proxy abstraction (3rd party)	Stable
http2.py	HTTP client implementation	Stable
config.py	Configuration management	Active
comboparse.py	Config/arg parser framework	Stable, cleaned
soup_parser.py	BeautifulSoup wrapper	Stable, cleaned
misc.py	Utilities (timestamp, logging)	Stable, cleaned
