
PPF Project Roadmap

Project Purpose

PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework designed to:

  1. Discover proxy addresses by crawling websites and search engines
  2. Validate proxies through multi-target testing via Tor
  3. Maintain a database of working proxies with protocol detection (SOCKS4/SOCKS5/HTTP)

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                              PPF Architecture                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                     │
│  │ scraper.py  │    │   ppf.py    │    │proxywatchd  │                     │
│  │             │    │             │    │             │                     │
│  │ Searx query │───>│ URL harvest │───>│ Proxy test  │                     │
│  │ URL finding │    │Proxy extract│    │ Validation  │                     │
│  └─────────────┘    └─────────────┘    └─────────────┘                     │
│         │                  │                  │                             │
│         v                  v                  v                             │
│  ┌─────────────────────────────────────────────────────────────────┐       │
│  │                        SQLite Databases                          │       │
│  │  uris.db (URLs)                    proxies.db (proxy list)       │       │
│  └─────────────────────────────────────────────────────────────────┘       │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────┐       │
│  │                         Network Layer                            │       │
│  │  rocksock.py ─── Tor SOCKS ─── Test Proxy ─── Target Server      │       │
│  └─────────────────────────────────────────────────────────────────┘       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Constraints

  • Python 2.7 compatibility required
  • Minimal external dependencies (avoid adding new modules)
  • Current dependencies: beautifulsoup4, pyasn, IP2Location
  • Data files: IP2LOCATION-LITE-DB1.BIN (country), ipasn.dat (ASN)

Phase 1: Stability & Code Quality

Objective: Establish a solid, maintainable codebase

1.1 Error Handling Improvements

| Task | Description | File(s) |
|------|-------------|---------|
| Add connection retry logic | Implement exponential backoff for failed connections | rocksock.py, fetch.py |
| Graceful database errors | Handle SQLite lock/busy errors with retry | mysqlite.py |
| Timeout standardization | Consistent timeout handling across all network ops | proxywatchd.py, fetch.py |
| Exception logging | Log exceptions with context, not just silently catch | all files |
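The retry task above could be sketched roughly as follows; the function name and defaults are illustrative, not taken from rocksock.py or fetch.py:

```python
import time


def retry(func, attempts=4, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call func(), retrying on IOError with exponential backoff.

    The delay doubles each attempt (base_delay, 2x, 4x, ...) and is
    capped at max_delay. The last exception is re-raised when the
    attempt budget is exhausted.
    """
    for i in range(attempts):
        try:
            return func()
        except IOError:
            if i == attempts - 1:
                raise
            sleep(min(base_delay * (2 ** i), max_delay))
```

Passing `sleep` as a parameter keeps the helper testable without real delays.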

1.2 Code Consolidation

| Task | Description | File(s) |
|------|-------------|---------|
| Unify _known_proxies | Single source of truth for known proxy cache | ppf.py, fetch.py |
| Extract proxy utils | Create proxy_utils.py with cleanse/validate functions | new file |
| Remove global config pattern | Pass config explicitly instead of set_config() | fetch.py |
| Standardize logging | Consistent _log() usage with levels across all modules | all files |

1.3 Testing Infrastructure

| Task | Description | File(s) |
|------|-------------|---------|
| Add unit tests | Test proxy parsing, URL extraction, IP validation | tests/ |
| Mock network layer | Allow testing without live network/Tor | tests/ |
| Validation test suite | Verify multi-target voting logic | tests/ |

Phase 2: Performance Optimization

Objective: Improve throughput and resource efficiency

2.1 Connection Pooling

| Task | Description | File(s) |
|------|-------------|---------|
| Tor connection reuse | Pool Tor SOCKS connections instead of reconnecting | proxywatchd.py |
| HTTP keep-alive | Reuse connections to same target servers | http2.py |
| Connection warm-up | Pre-establish connections before job assignment | proxywatchd.py |

2.2 Database Optimization

| Task | Description | File(s) |
|------|-------------|---------|
| Batch inserts | Group INSERT operations (already partial) | dbs.py |
| Index optimization | Add indexes for frequent query patterns | dbs.py |
| WAL mode | Enable Write-Ahead Logging for better concurrency | mysqlite.py |
| Prepared statements | Cache compiled SQL statements | mysqlite.py |
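The WAL-mode and batch-insert items combine naturally; a minimal sketch with the stdlib `sqlite3` module, using an assumed simplified schema (the real proxylist table has more columns):

```python
import sqlite3


def open_db(path):
    """Open a SQLite database with WAL mode for better read/write concurrency."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS proxylist ("
        " ip TEXT, port INTEGER, proto TEXT,"
        " UNIQUE(ip, port))"
    )
    return conn


def insert_proxies(conn, proxies):
    """Batch-insert (ip, port, proto) tuples in one executemany() call.
    Duplicates are skipped by the UNIQUE constraint, so no
    SELECT-before-INSERT round trip is needed."""
    conn.executemany(
        "INSERT OR IGNORE INTO proxylist (ip, port, proto) VALUES (?, ?, ?)",
        proxies,
    )
    conn.commit()
```

Note that WAL mode requires an on-disk database; `:memory:` databases silently keep their own journal mode.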

2.3 Threading Improvements

| Task | Description | File(s) |
|------|-------------|---------|
| Dynamic thread scaling | Adjust thread count based on success rate | proxywatchd.py |
| Priority queue | Test high-value proxies (low fail count) first | proxywatchd.py |
| Stale proxy cleanup | Background thread to remove long-dead proxies | proxywatchd.py |

Phase 3: Reliability & Accuracy

Objective: Improve proxy validation accuracy and system reliability

3.1 Enhanced Validation

| Task | Description | File(s) |
|------|-------------|---------|
| Latency tracking | Store and use connection latency metrics | proxywatchd.py, dbs.py |
| Geographic validation | Verify proxy actually routes through claimed location | proxywatchd.py |
| Protocol fingerprinting | Better SOCKS4/SOCKS5/HTTP detection | rocksock.py |
| HTTPS/SSL testing | Validate SSL proxy capabilities | proxywatchd.py |

3.2 Target Management

| Task | Description | File(s) |
|------|-------------|---------|
| Dynamic target pool | Auto-discover and rotate validation targets | proxywatchd.py |
| Target health tracking | Remove unresponsive targets from pool | proxywatchd.py |
| Geographic target spread | Ensure targets span multiple regions | config.py |

3.3 Failure Analysis

| Task | Description | File(s) |
|------|-------------|---------|
| Failure categorization | Distinguish timeout vs refused vs auth-fail | proxywatchd.py |
| Retry strategies | Different retry logic per failure type | proxywatchd.py |
| Dead proxy quarantine | Separate storage for likely-dead proxies | dbs.py |
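Failure categorization boils down to mapping low-level exceptions onto a few coarse buckets that the retry policy can key on. A sketch with assumed category names (the project's actual labels may differ):

```python
import errno
import socket


def categorize_failure(exc):
    """Map a low-level exception to a coarse failure category so the
    retry strategy can differ per type. Categories are illustrative."""
    if isinstance(exc, socket.timeout):
        return "timeout"
    if isinstance(exc, (socket.error, OSError)):
        code = getattr(exc, "errno", None)
        if code == errno.ECONNREFUSED:
            return "refused"
        if code == errno.ECONNRESET:
            return "closed"
    # Anything else (protocol violations, auth failures, parse errors)
    return "proxy-error"
```

Checking `socket.timeout` before the generic `OSError` branch matters, since timeouts are a subclass of `OSError` on modern Pythons.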

Phase 4: Features & Usability

Objective: Add useful features while maintaining simplicity

4.1 Reporting & Monitoring

| Task | Description | File(s) |
|------|-------------|---------|
| Statistics collection | Track success rates, throughput, latency | proxywatchd.py |
| Periodic status output | Log summary stats every N minutes | ppf.py, proxywatchd.py |
| Export functionality | Export working proxies to file (txt, json) | new: export.py |

4.2 Configuration

| Task | Description | File(s) |
|------|-------------|---------|
| Config validation | Validate config.ini on startup | config.py |
| Runtime reconfiguration | Reload config without restart (SIGHUP) | proxywatchd.py |
| Sensible defaults | Document and improve default values | config.py |
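Startup config validation amounts to walking the parsed ini and collecting problems before any daemon thread starts. A sketch using stdlib configparser; the section and option names here are illustrative, not the project's actual keys:

```python
try:
    import configparser          # Python 3
except ImportError:
    import ConfigParser as configparser  # Python 2

def validate_config(parser, section="ppf"):
    """Return a list of human-readable problems found in the config.
    Option names are illustrative, not the project's actual keys."""
    problems = []
    if not parser.has_section(section):
        return ["missing [%s] section" % section]
    for key in ("timeout_connect", "timeout_read"):
        if parser.has_option(section, key):
            try:
                if parser.getfloat(section, key) <= 0:
                    problems.append("%s must be positive" % key)
            except ValueError:
                problems.append("%s is not a number" % key)
    return problems
```

Returning a list (rather than raising on the first error) lets startup report every problem at once.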

4.3 Proxy Source Expansion

| Task | Description | File(s) |
|------|-------------|---------|
| Additional scrapers | Support more search engines beyond Searx | scraper.py |
| API sources | Integrate free proxy API endpoints | new: api_sources.py |
| Import formats | Support various proxy list formats | ppf.py |

Implementation Priority

┌─────────────────────────────────────────────────────────────────────────────┐
│ Priority Matrix                                                             │
├──────────────────────────┬──────────────────────────────────────────────────┤
│ HIGH IMPACT / LOW EFFORT │ HIGH IMPACT / HIGH EFFORT                        │
│                          │                                                  │
│ [x] Unify _known_proxies │ [x] Connection pooling                           │
│ [x] Graceful DB errors   │ [x] Dynamic thread scaling                       │
│ [x] Batch inserts        │ [ ] Unit test infrastructure                     │
│ [x] WAL mode for SQLite  │ [x] Latency tracking                             │
│                          │                                                  │
├──────────────────────────┼──────────────────────────────────────────────────┤
│ LOW IMPACT / LOW EFFORT  │ LOW IMPACT / HIGH EFFORT                         │
│                          │                                                  │
│ [x] Standardize logging  │ [x] Geographic validation                        │
│ [x] Config validation    │ [x] Additional scrapers                          │
│ [x] Export functionality │ [ ] API sources                                  │
│ [x] Status output        │ [ ] Protocol fingerprinting                      │
│                          │                                                  │
└──────────────────────────┴──────────────────────────────────────────────────┘

Completed Work

Multi-Target Validation (Done)

  • Work-stealing queue with shared Queue.Queue()
  • Multi-target validation (2/3 majority voting)
  • Interleaved testing (jobs shuffled across proxies)
  • ProxyTestState and TargetTestJob classes
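The 2/3 majority vote described above reduces to a simple quorum check over per-target results; a sketch (the function name is illustrative, not the project's):

```python
def proxy_verdict(results, quorum=2):
    """Decide pass/fail from per-target test results (a list of booleans).
    A proxy passes when at least `quorum` targets succeeded; with three
    targets and quorum=2 this is the 2/3 majority vote."""
    return sum(1 for ok in results if ok) >= quorum
```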

Code Cleanup (Done)

  • Removed dead HTTP server code from ppf.py
  • Removed dead gumbo code from soup_parser.py
  • Removed test code from comboparse.py
  • Removed unused functions from misc.py
  • Fixed IP/port cleansing in ppf.py extract_proxies()
  • Updated .gitignore, removed .pyc files
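The IP/port cleansing fix mentioned above is essentially regex extraction plus range checks; a sketch of the idea, not the project's actual `extract_proxies()`:

```python
import re

# Loose digit groups; octet/port ranges are checked in code because
# regex-only range checks get unreadable.
_PROXY_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})\b")


def extract_proxies(text):
    """Pull plausible ip:port pairs out of free-form page text,
    rejecting impossible octets (>255) and ports (0 or >65535)."""
    found = []
    for ip, port in _PROXY_RE.findall(text):
        if (all(0 <= int(o) <= 255 for o in ip.split("."))
                and 1 <= int(port) <= 65535):
            found.append("%s:%s" % (ip, port))
    return found
```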

Database Optimization (Done)

  • Enable SQLite WAL mode for better concurrency
  • Add indexes for common query patterns (failed, tested, proto, error, check_time)
  • Optimize batch inserts (remove redundant SELECT before INSERT OR IGNORE)

Dependency Reduction (Done)

  • Make lxml optional (removed from requirements)
  • Make IP2Location optional (graceful fallback)
  • Add --nobs flag for stdlib HTMLParser fallback (bs4 optional)

Rate Limiting & Stability (Done)

  • InstanceTracker class in scraper.py with exponential backoff
  • Configurable backoff_base, backoff_max, fail_threshold
  • Exception logging with context (replaced bare except blocks)
  • Unified _known_proxies cache in fetch.py

Monitoring & Maintenance (Done)

  • Stats class in proxywatchd.py (tested/passed/failed tracking)
  • Periodic stats reporting (configurable stats_interval)
  • Stale proxy cleanup (cleanup_stale() with configurable stale_days)
  • Timeout config options (timeout_connect, timeout_read)

Connection Pooling (Done)

  • TorHostState class tracking per-host health and latency
  • TorConnectionPool with worker affinity for circuit reuse
  • Exponential backoff (5s, 10s, 20s, 40s, max 60s) on failures
  • Pool warmup and health status reporting
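The 5s/10s/20s/40s/60s backoff schedule above is a plain doubling with a cap:

```python
def tor_backoff(failures, base=5, cap=60):
    """Seconds to wait before retrying a Tor host after `failures`
    consecutive failures: 5, 10, 20, 40, then capped at 60, matching
    the schedule described above."""
    return min(base * (2 ** failures), cap)
```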

Priority Queue (Done)

  • PriorityJobQueue class with heap-based ordering
  • calculate_priority() assigns priority 0-4 by proxy state
  • New proxies tested first, high-fail proxies last
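A heap-based queue of this shape can be sketched with stdlib `heapq`; the priority formula below is illustrative (the real `calculate_priority()` may weigh state differently):

```python
import heapq
import itertools


class PriorityJobQueue(object):
    """Heap-ordered job queue: lower priority number pops first.
    A monotonic counter breaks ties so insertion order is stable."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def put(self, priority, job):
        heapq.heappush(self._heap, (priority, next(self._counter), job))

    def get(self):
        return heapq.heappop(self._heap)[2]


def calculate_priority(fail_count, tested):
    """Illustrative 0-4 priority: never-tested proxies first,
    proxies with many failures last."""
    if not tested:
        return 0
    return min(1 + fail_count, 4)
```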

Dynamic Thread Scaling (Done)

  • ThreadScaler class adjusts thread count dynamically
  • Scales up when queue deep and success rate acceptable
  • Scales down when queue shallow or success rate drops
  • Respects min/max bounds with cooldown period
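The scaling rule can be expressed as a pure decision function (cooldown handling omitted); all thresholds below are assumed values, not the ones in proxywatchd.py:

```python
def scale_decision(threads, queue_depth, success_rate,
                   min_threads=4, max_threads=64,
                   deep=100, shallow=10, ok_rate=0.05):
    """Return the next thread count: scale up when the queue is deep
    and the success rate is acceptable, scale down when the queue is
    shallow or the success rate drops, otherwise hold steady."""
    if queue_depth > deep and success_rate >= ok_rate:
        return min(threads + 4, max_threads)
    if queue_depth < shallow or success_rate < ok_rate:
        return max(threads - 4, min_threads)
    return threads
```

Keeping the decision pure (no sleeping, no thread management inside) makes the policy easy to unit-test in isolation.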

Latency Tracking (Done)

  • avg_latency, latency_samples columns in proxylist
  • Exponential moving average calculation
  • Migration function for existing databases
  • Latency recorded for successful proxy tests
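The exponential moving average above blends each new sample into the stored average; a sketch, with the smoothing factor `alpha` as an assumed value:

```python
def update_latency(avg_latency, samples, new_latency, alpha=0.2):
    """Exponential moving average of proxy latency. The first sample
    seeds the average; later samples are blended with weight alpha
    (alpha=0.2 is an assumed smoothing factor, not taken from the code).
    Returns the new (avg_latency, samples) pair for the proxylist row."""
    if samples == 0:
        return new_latency, 1
    return (1 - alpha) * avg_latency + alpha * new_latency, samples + 1
```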

Container Support (Done)

  • Dockerfile with Python 2.7-slim base
  • docker-compose.yml for local development
  • Rootless podman deployment documentation
  • Volume mounts for persistent data

Code Style (Done)

  • Normalized indentation (4-space, no tabs)
  • Removed dead code and unused imports
  • Added docstrings to classes and functions
  • Python 2/3 compatible imports (Queue/queue)

Geographic Validation (Done)

  • IP2Location integration for country lookup
  • pyasn integration for ASN lookup
  • Graceful fallback when database files missing
  • Country codes displayed in test output: (US), (IN), etc.
  • Data files: IP2LOCATION-LITE-DB1.BIN, ipasn.dat

SSL Proxy Testing (Done)

  • Default checktype changed to 'ssl'
  • ssl_targets list with major HTTPS sites
  • TLS handshake validation with certificate verification
  • Detects MITM proxies that intercept SSL connections
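Certificate-verifying handshake validation comes down to a strict `ssl` context: a MITM proxy presenting its own certificate then fails hostname or chain verification. A sketch using the stdlib `ssl` module (`create_default_context` needs Python 2.7.9+ or Python 3); the project's actual handshake code may differ:

```python
import ssl


def make_tls_context():
    """A verifying TLS context for SSL proxy tests: hostname checking
    and certificate chain verification both enabled, so an intercepting
    proxy's self-signed certificate aborts the handshake."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = True
    ctx.verify_mode = ssl.CERT_REQUIRED
    return ctx
```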

Export Functionality (Done)

  • export.py CLI tool for exporting working proxies
  • Multiple formats: txt, json, csv, len (length-prefixed)
  • Filters: proto, country, anonymity, max_latency
  • Sort options: latency, added, tested, success
  • Output to stdout or file
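The format dispatch in export.py could look roughly like this (Python 3 style shown; txt/json/csv only, the length-prefixed 'len' format is omitted, and the row shape is an assumption):

```python
import csv
import io
import json


def export_proxies(rows, fmt="txt"):
    """Render (ip, port, proto) rows in one of the listed formats."""
    if fmt == "txt":
        return "\n".join("%s:%d" % (ip, port) for ip, port, _ in rows)
    if fmt == "json":
        return json.dumps(
            [{"ip": ip, "port": port, "proto": proto}
             for ip, port, proto in rows]
        )
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(["ip", "port", "proto"])
        writer.writerows(rows)
        return buf.getvalue()
    raise ValueError("unknown format: %s" % fmt)
```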

Web Dashboard (Done)

  • /dashboard endpoint with dark theme HTML UI
  • /api/stats endpoint for JSON runtime statistics
  • Auto-refresh with JavaScript fetch every 5 seconds
  • Stats provider callback from proxywatchd.py to httpd.py
  • Displays: tested/passed/success rate, thread count, uptime
  • Tor pool health: per-host latency, success rate, availability
  • Failure categories breakdown: timeout, proxy, ssl, closed

Technical Debt

| Item | Description | Risk |
|------|-------------|------|
| Dual _known_proxies | ppf.py and fetch.py maintain separate caches | Resolved |
| Global config in fetch.py | set_config() pattern is fragile | Low - works but not clean |
| No input validation | Proxy strings parsed without validation | Resolved |
| Silent exception catching | Some `except: pass` patterns hide errors | Resolved |
| Hardcoded timeouts | Various timeout values scattered in code | Resolved |

File Reference

| File | Purpose | Status |
|------|---------|--------|
| ppf.py | Main URL harvester daemon | Active, cleaned |
| proxywatchd.py | Proxy validation daemon | Active, enhanced |
| scraper.py | Searx search integration | Active, cleaned |
| fetch.py | HTTP fetching with proxy support | Active |
| dbs.py | Database schema and inserts | Active |
| mysqlite.py | SQLite wrapper | Active |
| rocksock.py | Socket/proxy abstraction (3rd party) | Stable |
| http2.py | HTTP client implementation | Stable |
| config.py | Configuration management | Active |
| comboparse.py | Config/arg parser framework | Stable, cleaned |
| soup_parser.py | BeautifulSoup wrapper | Stable, cleaned |
| misc.py | Utilities (timestamp, logging) | Stable, cleaned |
| export.py | Proxy export CLI tool | Active |