ppf/ROADMAP.md
2025-12-20 18:25:55 +01:00

PPF Project Roadmap

Project Purpose

PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework designed to:

  1. Discover proxy addresses by crawling websites and search engines
  2. Validate proxies through multi-target testing via Tor
  3. Maintain a database of working proxies with protocol detection (SOCKS4/SOCKS5/HTTP)

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                              PPF Architecture                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                     │
│  │ scraper.py  │    │   ppf.py    │    │proxywatchd  │                     │
│  │             │    │             │    │             │                     │
│  │ Searx query │───>│ URL harvest │───>│ Proxy test  │                     │
│  │ URL finding │    │Proxy extract│    │ Validation  │                     │
│  └─────────────┘    └─────────────┘    └─────────────┘                     │
│         │                  │                  │                             │
│         v                  v                  v                             │
│  ┌─────────────────────────────────────────────────────────────────┐       │
│  │                        SQLite Databases                          │       │
│  │  uris.db (URLs)                    proxies.db (proxy list)       │       │
│  └─────────────────────────────────────────────────────────────────┘       │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────┐       │
│  │                         Network Layer                            │       │
│  │  rocksock.py ─── Tor SOCKS ─── Test Proxy ─── Target Server      │       │
│  └─────────────────────────────────────────────────────────────────┘       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Constraints

  • Python 2.7 compatibility required
  • Minimal external dependencies (avoid adding new modules)
  • Current dependencies: beautifulsoup4 (optional when running with --nobs, which falls back to the stdlib HTMLParser)
  • Optional: IP2Location (for proxy geolocation)

Phase 1: Stability & Code Quality

Objective: Establish a solid, maintainable codebase

1.1 Error Handling Improvements

| Task | Description | File(s) |
|---|---|---|
| Add connection retry logic | Implement exponential backoff for failed connections | rocksock.py, fetch.py |
| Graceful database errors | Handle SQLite lock/busy errors with retry | mysqlite.py |
| Timeout standardization | Consistent timeout handling across all network ops | proxywatchd.py, fetch.py |
| Exception logging | Log exceptions with context instead of silently catching them | all files |
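
The retry task above could be approached roughly as follows. This is a minimal sketch, not the project's code: the helper name retry_with_backoff and all of its parameters and defaults are assumptions made for illustration, and it is written to run on Python 2.7 as well as Python 3.

```python
import random
import time


def retry_with_backoff(func, retries=4, base_delay=0.5, max_delay=30.0,
                       exceptions=(IOError, OSError), sleep=time.sleep):
    """Call func(); on a listed exception wait and retry, doubling the delay.

    Hypothetical helper: the name, parameters and defaults are illustrative.
    """
    attempt = 0
    while True:
        try:
            return func()
        except exceptions:
            if attempt >= retries:
                raise  # give up and let the caller log the error with context
            delay = min(max_delay, base_delay * (2 ** attempt))
            # A little jitter keeps many worker threads from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 10.0))
            attempt += 1
```

The same helper could cover the SQLite lock/busy case by passing exceptions=(sqlite3.OperationalError,), although a real implementation would probably inspect the error message before retrying.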

1.2 Code Consolidation

| Task | Description | File(s) |
|---|---|---|
| Unify _known_proxies | Single source of truth for known proxy cache | ppf.py, fetch.py |
| Extract proxy utils | Create proxy_utils.py with cleanse/validate functions | new file |
| Remove global config pattern | Pass config explicitly instead of set_config() | fetch.py |
| Standardize logging | Consistent _log() usage with levels across all modules | all files |
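
As an illustration of what a cleanse/validate function in the planned proxy_utils.py might look like, here is a minimal sketch. The function name cleanse_proxy and its exact rules (IPv4 only, ip:port form, leading zeros stripped) are assumptions, not the behavior of the existing extract_proxies() code.

```python
import re

# Strict ASCII digits only: four octets, a colon, then a port.
_PROXY_RE = re.compile(
    r'^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3}):([0-9]{1,5})$')


def cleanse_proxy(raw):
    """Return a normalized 'ip:port' string, or None if raw is not valid.

    Hypothetical helper for illustration; IPv6 and hostnames are rejected.
    """
    match = _PROXY_RE.match(raw.strip())
    if match is None:
        return None
    octets = [int(g, 10) for g in match.groups()[:4]]
    port = int(match.group(5), 10)
    if any(o > 255 for o in octets):
        return None
    if not 1 <= port <= 65535:
        return None
    # Re-serializing from integers strips leading zeros such as '010.001'.
    return "%s:%d" % (".".join(str(o) for o in octets), port)
```

Returning None instead of raising keeps the caller's loop simple: invalid candidates can be skipped and counted without a try/except around every string.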

1.3 Testing Infrastructure

| Task | Description | File(s) |
|---|---|---|
| Add unit tests | Test proxy parsing, URL extraction, IP validation | tests/ |
| Mock network layer | Allow testing without live network/Tor | tests/ |
| Validation test suite | Verify multi-target voting logic | tests/ |

Phase 2: Performance Optimization

Objective: Improve throughput and resource efficiency

2.1 Connection Pooling

| Task | Description | File(s) |
|---|---|---|
| Tor connection reuse | Pool Tor SOCKS connections instead of reconnecting | proxywatchd.py |
| HTTP keep-alive | Reuse connections to same target servers | http2.py |
| Connection warm-up | Pre-establish connections before job assignment | proxywatchd.py |

2.2 Database Optimization

| Task | Description | File(s) |
|---|---|---|
| Batch inserts | Group INSERT operations (already partial) | dbs.py |
| Index optimization | Add indexes for frequent query patterns | dbs.py |
| WAL mode | Enable Write-Ahead Logging for better concurrency | mysqlite.py |
| Prepared statements | Cache compiled SQL statements | mysqlite.py |
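
For readers unfamiliar with these SQLite settings, the following sketch shows roughly what WAL mode, an index, and a batched INSERT OR IGNORE look like with the stdlib sqlite3 module. The table name proxylist, its columns, and the function names are assumptions for illustration and do not reflect the actual schema in dbs.py.

```python
import sqlite3


def open_db(path, timeout=30.0):
    """Open a file-backed database with WAL enabled (hypothetical helper)."""
    # timeout makes sqlite3 wait on a locked database instead of failing at once.
    conn = sqlite3.connect(path, timeout=timeout)
    # WAL lets readers keep working while a single writer is active.
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


def init_schema(conn):
    # Illustrative schema only; the real tables live in dbs.py.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS proxylist ("
        "proxy TEXT PRIMARY KEY, proto TEXT, "
        "failed INTEGER DEFAULT 0, tested INTEGER DEFAULT 0)")
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_proxylist_failed ON proxylist (failed)")
    conn.commit()


def insert_proxies(conn, proxies):
    # One executemany call and one commit per batch; duplicates are skipped
    # by the PRIMARY KEY, so no SELECT is needed before the INSERT.
    conn.executemany(
        "INSERT OR IGNORE INTO proxylist (proxy) VALUES (?)",
        [(p,) for p in proxies])
    conn.commit()
```

The sqlite3 module already keeps a small per-connection cache of compiled statements, so reusing the same SQL string with different parameters covers much of the "prepared statements" item.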

2.3 Threading Improvements

| Task | Description | File(s) |
|---|---|---|
| Dynamic thread scaling | Adjust thread count based on success rate | proxywatchd.py |
| Priority queue | Test high-value proxies (low fail count) first | proxywatchd.py |
| Stale proxy cleanup | Background thread to remove long-dead proxies | proxywatchd.py |
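
The priority-queue item could be sketched with the standard library's PriorityQueue, which returns the smallest entry first, so using the fail count as the sort key tests the most promising proxies first. The function names and the (proxy, fail_count) input shape are assumptions for illustration.

```python
try:
    import Queue as queue  # Python 2
except ImportError:
    import queue  # Python 3


def build_test_queue(proxies):
    """Build a queue from (proxy, fail_count) pairs, lowest fail count first."""
    q = queue.PriorityQueue()
    for seq, (proxy, failed) in enumerate(proxies):
        # seq breaks ties so proxies with equal fail counts keep input order.
        q.put((failed, seq, proxy))
    return q


def drain(q):
    """Pop everything from the queue and return the proxies in test order."""
    order = []
    while not q.empty():
        _failed, _seq, proxy = q.get()
        order.append(proxy)
    return order
```

PriorityQueue is thread-safe like Queue.Queue, so worker threads could pull from it the same way they pull from a shared queue today.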

Phase 3: Reliability & Accuracy

Objective: Improve proxy validation accuracy and system reliability

3.1 Enhanced Validation

| Task | Description | File(s) |
|---|---|---|
| Latency tracking | Store and use connection latency metrics | proxywatchd.py, dbs.py |
| Geographic validation | Verify proxy actually routes through claimed location | proxywatchd.py |
| Protocol fingerprinting | Better SOCKS4/SOCKS5/HTTP detection | rocksock.py |
| HTTPS/SSL testing | Validate SSL proxy capabilities | proxywatchd.py |

3.2 Target Management

| Task | Description | File(s) |
|---|---|---|
| Dynamic target pool | Auto-discover and rotate validation targets | proxywatchd.py |
| Target health tracking | Remove unresponsive targets from pool | proxywatchd.py |
| Geographic target spread | Ensure targets span multiple regions | config.py |

3.3 Failure Analysis

| Task | Description | File(s) |
|---|---|---|
| Failure categorization | Distinguish timeout vs refused vs auth-fail | proxywatchd.py |
| Retry strategies | Different retry logic per failure type | proxywatchd.py |
| Dead proxy quarantine | Separate storage for likely-dead proxies | dbs.py |
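
One possible shape for failure categorization and per-type retries is sketched below, using only stdlib socket errors. The category names, the classify_failure and retries_for helpers, and the retry counts are all assumptions; auth failures are not covered because they would depend on how rocksock reports them.

```python
import errno
import socket

# Placeholder retry budget per category; real values would need tuning.
RETRY_POLICY = {
    "timeout": 2,      # may be transient congestion, worth another try
    "reset": 1,
    "refused": 0,      # nothing is listening, retrying rarely helps
    "unreachable": 0,
    "other": 0,
}


def classify_failure(exc):
    """Map a socket-level exception to a coarse failure category."""
    if isinstance(exc, socket.timeout):
        return "timeout"
    err = getattr(exc, "errno", None)
    if err == errno.ECONNREFUSED:
        return "refused"
    if err in (errno.ECONNRESET, errno.EPIPE):
        return "reset"
    if err in (errno.ENETUNREACH, errno.EHOSTUNREACH):
        return "unreachable"
    return "other"


def retries_for(exc):
    """Return how many retries the placeholder policy allows for exc."""
    return RETRY_POLICY[classify_failure(exc)]
```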

Phase 4: Features & Usability

Objective: Add useful features while maintaining simplicity

4.1 Reporting & Monitoring

| Task | Description | File(s) |
|---|---|---|
| Statistics collection | Track success rates, throughput, latency | proxywatchd.py |
| Periodic status output | Log summary stats every N minutes | ppf.py, proxywatchd.py |
| Export functionality | Export working proxies to file (txt, json) | new: export.py |
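
A possible starting point for the planned export.py is shown below. The function name, the dict keys (proto, ip, port), and the proto://ip:port text layout are assumptions for illustration; the real row shape would come from proxies.db.

```python
import json


def export_proxies(proxies, path, fmt="txt"):
    """Write proxies to path as one URL per line (txt) or as a JSON list.

    proxies is a list of dicts with 'proto', 'ip' and 'port' keys
    (a hypothetical row shape, not the actual database schema).
    """
    if fmt == "json":
        with open(path, "w") as fh:
            json.dump(proxies, fh, indent=2, sort_keys=True)
    elif fmt == "txt":
        with open(path, "w") as fh:
            for p in proxies:
                fh.write("%s://%s:%d\n" % (p["proto"], p["ip"], p["port"]))
    else:
        raise ValueError("unsupported format: %s" % fmt)
```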

4.2 Configuration

| Task | Description | File(s) |
|---|---|---|
| Config validation | Validate config.ini on startup | config.py |
| Runtime reconfiguration | Reload config without restart (SIGHUP) | proxywatchd.py |
| Sensible defaults | Document and improve default values | config.py |

4.3 Proxy Source Expansion

| Task | Description | File(s) |
|---|---|---|
| Additional scrapers | Support more search engines beyond Searx | scraper.py |
| API sources | Integrate free proxy API endpoints | new: api_sources.py |
| Import formats | Support various proxy list formats | ppf.py |

Implementation Priority

┌─────────────────────────────────────────────────────────────────────────────┐
│ Priority Matrix                                                             │
├──────────────────────────┬──────────────────────────────────────────────────┤
│ HIGH IMPACT / LOW EFFORT │ HIGH IMPACT / HIGH EFFORT                        │
│                          │                                                  │
│ ● Unify _known_proxies   │ ● Connection pooling                             │
│ ● Graceful DB errors     │ ● Dynamic thread scaling                         │
│ ● Batch inserts          │ ● Unit test infrastructure                       │
│ ● WAL mode for SQLite    │ ● Latency tracking                               │
│                          │                                                  │
├──────────────────────────┼──────────────────────────────────────────────────┤
│ LOW IMPACT / LOW EFFORT  │ LOW IMPACT / HIGH EFFORT                         │
│                          │                                                  │
│ ● Standardize logging    │ ● Geographic validation                          │
│ ● Config validation      │ ● Additional scrapers                            │
│ ● Export functionality   │ ● API sources                                    │
│ ● Status output          │ ● Protocol fingerprinting                        │
│                          │                                                  │
└──────────────────────────┴──────────────────────────────────────────────────┘

Completed Work

Multi-Target Validation (Done)

  • Work-stealing queue with shared Queue.Queue()
  • Multi-target validation (2/3 majority voting)
  • Interleaved testing (jobs shuffled across proxies)
  • ProxyTestState and TargetTestJob classes

Code Cleanup (Done)

  • Removed dead HTTP server code from ppf.py
  • Removed dead gumbo code from soup_parser.py
  • Removed test code from comboparse.py
  • Removed unused functions from misc.py
  • Fixed IP/port cleansing in ppf.py extract_proxies()
  • Updated .gitignore, removed .pyc files

Database Optimization (Done)

  • Enabled SQLite WAL mode for better concurrency
  • Added indexes for common query patterns (failed, tested, proto, error, check_time)
  • Optimized batch inserts (removed redundant SELECT before INSERT OR IGNORE)

Dependency Reduction (Done)

  • Made lxml optional (removed from requirements)
  • Made IP2Location optional (graceful fallback)
  • Added --nobs flag for stdlib HTMLParser fallback (bs4 optional)

Technical Debt

| Item | Description | Risk |
|---|---|---|
| Dual _known_proxies | ppf.py and fetch.py maintain separate caches | Medium - duplicates possible |
| Global config in fetch.py | set_config() pattern is fragile | Low - works but not clean |
| No input validation | Proxy strings parsed without validation | Medium - could crash on bad data |
| Silent exception catching | Some except: pass patterns hide errors | High - hard to debug |
| Hardcoded timeouts | Various timeout values scattered in code | Low - works but not configurable |

File Reference

| File | Purpose | Status |
|---|---|---|
| ppf.py | Main URL harvester daemon | Active, cleaned |
| proxywatchd.py | Proxy validation daemon | Active, enhanced |
| scraper.py | Searx search integration | Active, cleaned |
| fetch.py | HTTP fetching with proxy support | Active |
| dbs.py | Database schema and inserts | Active |
| mysqlite.py | SQLite wrapper | Active |
| rocksock.py | Socket/proxy abstraction (3rd party) | Stable |
| http2.py | HTTP client implementation | Stable |
| config.py | Configuration management | Active |
| comboparse.py | Config/arg parser framework | Stable, cleaned |
| soup_parser.py | BeautifulSoup wrapper | Stable, cleaned |
| misc.py | Utilities (timestamp, logging) | Stable, cleaned |