ppf/ROADMAP.md
2026-01-08 09:05:39 +01:00

PPF Project Roadmap

Project Purpose

PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework designed to:

  1. Discover proxy addresses by crawling websites and search engines
  2. Validate proxies through multi-target testing via Tor
  3. Maintain a database of working proxies with protocol detection (SOCKS4/SOCKS5/HTTP)

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                              PPF Architecture                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                     │
│  │ scraper.py  │    │   ppf.py    │    │proxywatchd  │                     │
│  │             │    │             │    │             │                     │
│  │ Searx query │───>│ URL harvest │───>│ Proxy test  │                     │
│  │ URL finding │    │Proxy extract│    │ Validation  │                     │
│  └─────────────┘    └─────────────┘    └─────────────┘                     │
│         │                  │                  │                             │
│         v                  v                  v                             │
│  ┌─────────────────────────────────────────────────────────────────┐       │
│  │                        SQLite Databases                          │       │
│  │  uris.db (URLs)                    proxies.db (proxy list)       │       │
│  └─────────────────────────────────────────────────────────────────┘       │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────┐       │
│  │                         Network Layer                            │       │
│  │  rocksock.py ─── Tor SOCKS ─── Test Proxy ─── Target Server      │       │
│  └─────────────────────────────────────────────────────────────────┘       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Constraints

  • Python 2.7 compatibility required
  • Minimal external dependencies (avoid adding new modules)
  • Current dependencies: beautifulsoup4 (optional with --nobs), pyasn, IP2Location (optional)
  • Data files: IP2LOCATION-LITE-DB1.BIN (country), ipasn.dat (ASN)

Phase 1: Stability & Code Quality

Objective: Establish a solid, maintainable codebase

1.1 Error Handling Improvements

| Task | Description | File(s) |
|------|-------------|---------|
| Add connection retry logic | Implement exponential backoff for failed connections | rocksock.py, fetch.py |
| Graceful database errors | Handle SQLite lock/busy errors with retry | mysqlite.py |
| Timeout standardization | Consistent timeout handling across all network ops | proxywatchd.py, fetch.py |
| Exception logging | Log exceptions with context instead of silently catching them | all files |
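
The "Graceful database errors" task could take a shape like the sketch below. It is not the code in mysqlite.py: the helper name `execute_with_retry` and its parameters (`retries`, `base_delay`, `sleep`) are hypothetical, and it assumes lock contention surfaces as `sqlite3.OperationalError` with "locked" or "busy" in the message.

```python
import sqlite3
import time


def execute_with_retry(conn, sql, params=(), retries=5, base_delay=0.05,
                       sleep=time.sleep):
    """Run one statement, retrying with exponential backoff on lock errors.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    for attempt in range(retries + 1):
        try:
            return conn.execute(sql, params)
        except sqlite3.OperationalError as exc:
            message = str(exc).lower()
            is_lock = "locked" in message or "busy" in message
            # Anything that is not lock contention is re-raised unchanged,
            # as is the final failed attempt.
            if not is_lock or attempt == retries:
                raise
            # Delay doubles on every failed attempt: 0.05s, 0.1s, 0.2s, ...
            sleep(base_delay * (2 ** attempt))
```

Passing `sleep` as a parameter keeps the helper testable without real delays.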

1.2 Code Consolidation

| Task | Description | File(s) |
|------|-------------|---------|
| Unify _known_proxies | Single source of truth for known proxy cache | ppf.py, fetch.py |
| Extract proxy utils | Create proxy_utils.py with cleanse/validate functions | new file |
| Remove global config pattern | Pass config explicitly instead of set_config() | fetch.py |
| Standardize logging | Consistent _log() usage with levels across all modules | all files |

1.3 Testing Infrastructure

| Task | Description | File(s) |
|------|-------------|---------|
| Add unit tests | Test proxy parsing, URL extraction, IP validation | tests/ |
| Mock network layer | Allow testing without live network/Tor | tests/ |
| Validation test suite | Verify multi-target voting logic | tests/ |

Phase 2: Performance Optimization

Objective: Improve throughput and resource efficiency

2.1 Connection Pooling

| Task | Description | File(s) |
|------|-------------|---------|
| Tor connection reuse | Pool Tor SOCKS connections instead of reconnecting | proxywatchd.py |
| HTTP keep-alive | Reuse connections to same target servers | http2.py |
| Connection warm-up | Pre-establish connections before job assignment | proxywatchd.py |

2.2 Database Optimization

| Task | Description | File(s) |
|------|-------------|---------|
| Batch inserts | Group INSERT operations (partially implemented) | dbs.py |
| Index optimization | Add indexes for frequent query patterns | dbs.py |
| WAL mode | Enable Write-Ahead Logging for better concurrency | mysqlite.py |
| Prepared statements | Cache compiled SQL statements | mysqlite.py |

2.3 Threading Improvements

| Task | Description | File(s) |
|------|-------------|---------|
| Dynamic thread scaling | Adjust thread count based on success rate | proxywatchd.py |
| Priority queue | Test high-value proxies (low fail count) first | proxywatchd.py |
| Stale proxy cleanup | Background thread to remove long-dead proxies | proxywatchd.py |

Phase 3: Reliability & Accuracy

Objective: Improve proxy validation accuracy and system reliability

3.1 Enhanced Validation

| Task | Description | File(s) |
|------|-------------|---------|
| Latency tracking | Store and use connection latency metrics | proxywatchd.py, dbs.py |
| Geographic validation | Verify proxy actually routes through claimed location | proxywatchd.py |
| Protocol fingerprinting | Better SOCKS4/SOCKS5/HTTP detection | rocksock.py |
| HTTPS/SSL testing | Validate SSL proxy capabilities | proxywatchd.py |

3.2 Target Management

| Task | Description | File(s) |
|------|-------------|---------|
| Dynamic target pool | Auto-discover and rotate validation targets | proxywatchd.py |
| Target health tracking | Remove unresponsive targets from pool | proxywatchd.py |
| Geographic target spread | Ensure targets span multiple regions | config.py |

3.3 Failure Analysis

| Task | Description | File(s) |
|------|-------------|---------|
| Failure categorization | Distinguish timeout vs refused vs auth-fail | proxywatchd.py |
| Retry strategies | Different retry logic per failure type | proxywatchd.py |
| Dead proxy quarantine | Separate storage for likely-dead proxies | dbs.py |
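
As a rough illustration of failure categorization with per-category retry limits, the sketch below classifies socket-level exceptions. The category names and the `RETRY_LIMITS` values are hypothetical, not taken from proxywatchd.py. Auth failures are left out because detecting them requires inspecting the proxy protocol reply rather than the exception type.

```python
import errno
import socket


def categorize_failure(exc):
    """Map a socket-level exception to a coarse failure category.

    Category names are hypothetical, not PPF's actual labels.
    """
    # Check timeout first: on Python 3, socket.timeout is an OSError subclass.
    if isinstance(exc, socket.timeout):
        return "timeout"
    if isinstance(exc, socket.error):
        if getattr(exc, "errno", None) == errno.ECONNREFUSED:
            return "refused"
        return "network"
    return "other"


# Per-category retry policy (hypothetical values): retry flaky failures,
# give up immediately on a refused connection.
RETRY_LIMITS = {"timeout": 2, "network": 1, "refused": 0, "other": 0}
```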

Phase 4: Features & Usability

Objective: Add useful features while maintaining simplicity

4.1 Reporting & Monitoring

| Task | Description | File(s) |
|------|-------------|---------|
| Statistics collection | Track success rates, throughput, latency | proxywatchd.py |
| Periodic status output | Log summary stats every N minutes | ppf.py, proxywatchd.py |
| Export functionality | Export working proxies to file (txt, json) | new: export.py |

4.2 Configuration

| Task | Description | File(s) |
|------|-------------|---------|
| Config validation | Validate config.ini on startup | config.py |
| Runtime reconfiguration | Reload config without restart (SIGHUP) | proxywatchd.py |
| Sensible defaults | Document and improve default values | config.py |

4.3 Proxy Source Expansion

| Task | Description | File(s) |
|------|-------------|---------|
| Additional scrapers | Support more search engines beyond Searx | scraper.py |
| API sources | Integrate free proxy API endpoints | new: api_sources.py |
| Import formats | Support various proxy list formats | ppf.py |

Implementation Priority

┌─────────────────────────────────────────────────────────────────────────────┐
│ Priority Matrix                                                             │
├──────────────────────────┬──────────────────────────────────────────────────┤
│ HIGH IMPACT / LOW EFFORT │ HIGH IMPACT / HIGH EFFORT                        │
│                          │                                                  │
│ [x] Unify _known_proxies │ [x] Connection pooling                           │
│ [x] Graceful DB errors   │ [x] Dynamic thread scaling                       │
│ [x] Batch inserts        │ [x] Unit test infrastructure                     │
│ [x] WAL mode for SQLite  │ [x] Latency tracking                             │
│                          │                                                  │
├──────────────────────────┼──────────────────────────────────────────────────┤
│ LOW IMPACT / LOW EFFORT  │ LOW IMPACT / HIGH EFFORT                         │
│                          │                                                  │
│ [x] Standardize logging  │ [x] Geographic validation                        │
│ [x] Config validation    │ [x] Additional scrapers (Bing, Yahoo, Mojeek)    │
│ [x] Export functionality │ [ ] API sources                                  │
│ [x] Status output        │ [ ] Protocol fingerprinting                      │
│                          │                                                  │
└──────────────────────────┴──────────────────────────────────────────────────┘

Completed Work

Multi-Target Validation (Done)

  • Work-stealing queue with shared Queue.Queue()
  • Multi-target validation (2/3 majority voting)
  • Interleaved testing (jobs shuffled across proxies)
  • ProxyTestState and TargetTestJob classes

Code Cleanup (Done)

  • Removed dead HTTP server code from ppf.py
  • Removed dead gumbo code from soup_parser.py
  • Removed test code from comboparse.py
  • Removed unused functions from misc.py
  • Fixed IP/port cleansing in ppf.py extract_proxies()
  • Updated .gitignore, removed .pyc files

Database Optimization (Done)

  • Enable SQLite WAL mode for better concurrency
  • Add indexes for common query patterns (failed, tested, proto, error, check_time)
  • Optimize batch inserts (remove redundant SELECT before INSERT OR IGNORE)
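
A minimal sketch of the three optimizations, using only the standard sqlite3 module. The table name `proxylist` and the `failed` column appear in this roadmap; the `proxy` column, the index name, and both function names are assumptions for illustration.

```python
import sqlite3


def open_db(path):
    """Open a database in WAL mode and return (connection, journal_mode)."""
    conn = sqlite3.connect(path)
    # PRAGMA journal_mode returns the mode actually in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return conn, mode


def add_proxies(conn, proxies):
    """Batch-insert proxies; duplicates are dropped by the UNIQUE constraint."""
    conn.execute("CREATE TABLE IF NOT EXISTS proxylist ("
                 "proxy TEXT UNIQUE, failed INTEGER DEFAULT 0)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_failed ON proxylist(failed)")
    # No SELECT beforehand: INSERT OR IGNORE skips rows that already exist.
    conn.executemany("INSERT OR IGNORE INTO proxylist (proxy) VALUES (?)",
                     [(p,) for p in proxies])
    conn.commit()
```

WAL mode only applies to file-backed databases; an in-memory database reports `memory` as its journal mode.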

Dependency Reduction (Done)

  • Make lxml optional (removed from requirements)
  • Make IP2Location optional (graceful fallback)
  • Add --nobs flag for stdlib HTMLParser fallback (bs4 optional)

Rate Limiting & Stability (Done)

  • InstanceTracker class in scraper.py with exponential backoff
  • Configurable backoff_base, backoff_max, fail_threshold
  • Exception logging with context (replaced bare except blocks)
  • Unified _known_proxies cache in fetch.py

Monitoring & Maintenance (Done)

  • Stats class in proxywatchd.py (tested/passed/failed tracking)
  • Periodic stats reporting (configurable stats_interval)
  • Stale proxy cleanup (cleanup_stale() with configurable stale_days)
  • Timeout config options (timeout_connect, timeout_read)

Connection Pooling (Done)

  • TorHostState class tracking per-host health and latency
  • TorConnectionPool with worker affinity for circuit reuse
  • Exponential backoff (5s, 10s, 20s, 40s, max 60s) on failures
  • Pool warmup and health status reporting
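
The backoff schedule listed above (5s, 10s, 20s, 40s, capped at 60s) follows from doubling a 5-second base. A sketch, with the function name `backoff_delay` being hypothetical:

```python
def backoff_delay(failures, base=5, cap=60):
    """Seconds to wait after `failures` consecutive failures (1-based)."""
    if failures < 1:
        return 0
    # Double the base for each additional failure, never exceeding the cap.
    return min(base * 2 ** (failures - 1), cap)
```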

Priority Queue (Done)

  • PriorityJobQueue class with heap-based ordering
  • calculate_priority() assigns priority 0-4 by proxy state
  • New proxies tested first, high-fail proxies last
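
A heap-based queue along these lines could look like the sketch below. The exact 0-4 mapping in `calculate_priority()` is not given here, so the thresholds shown are hypothetical. The class is also simplified: it has no locking, whereas a queue shared between worker threads would need it.

```python
import heapq
import itertools


def calculate_priority(failed, tested):
    """Hypothetical 0-4 mapping: a lower value is served first."""
    if not tested:
        return 0          # never tested: highest priority
    if failed == 0:
        return 1
    if failed < 3:
        return 2
    if failed < 10:
        return 3
    return 4              # many failures: lowest priority


class PriorityJobQueue(object):
    """Minimal heap-backed queue; ties keep insertion order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def put(self, job, priority):
        # The counter breaks ties so jobs themselves are never compared.
        heapq.heappush(self._heap, (priority, next(self._counter), job))

    def get(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```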

Dynamic Thread Scaling (Done)

  • ThreadScaler class adjusts thread count dynamically
  • Scales up when queue deep and success rate acceptable
  • Scales down when queue shallow or success rate drops
  • Respects min/max bounds with cooldown period
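
The scaling rules above can be illustrated with a small decision function. This is a sketch under stated assumptions, not the ThreadScaler in proxywatchd.py: every threshold (`deep_queue`, `min_success`, `step`, `cooldown`) is a hypothetical default, and "shallow" is assumed to mean less than half of `deep_queue`.

```python
class ThreadScaler(object):
    """Sketch of a scaler; every threshold below is a hypothetical default."""

    def __init__(self, min_threads=2, max_threads=50, cooldown=30.0,
                 deep_queue=100, min_success=0.1, step=2):
        self.min_threads = min_threads
        self.max_threads = max_threads
        self.cooldown = cooldown
        self.deep_queue = deep_queue
        self.min_success = min_success
        self.step = step
        self._last_change = None

    def target(self, current, queue_depth, success_rate, now):
        """Return the thread count to use for the current load."""
        in_cooldown = (self._last_change is not None and
                       now - self._last_change < self.cooldown)
        if in_cooldown:
            return current
        if success_rate < self.min_success or queue_depth < self.deep_queue // 2:
            wanted = current - self.step      # shallow queue or poor results
        elif queue_depth >= self.deep_queue:
            wanted = current + self.step      # deep queue, acceptable results
        else:
            wanted = current
        # Clamp to the configured bounds.
        wanted = max(self.min_threads, min(self.max_threads, wanted))
        if wanted != current:
            self._last_change = now
        return wanted
```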

Latency Tracking (Done)

  • avg_latency, latency_samples columns in proxylist
  • Exponential moving average calculation
  • Migration function for existing databases
  • Latency recorded for successful proxy tests
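
For reference, an exponential moving average update over the `avg_latency` and `latency_samples` columns might look like this. The smoothing factor `alpha` and the function name are assumptions; the roadmap does not state which value PPF uses.

```python
def update_latency(avg_latency, latency_samples, new_latency, alpha=0.2):
    """Return (avg_latency, latency_samples) after one successful test."""
    if latency_samples == 0 or avg_latency is None:
        # First sample seeds the average.
        return new_latency, 1
    # EMA: recent samples weigh `alpha`, history weighs (1 - alpha).
    avg = alpha * new_latency + (1.0 - alpha) * avg_latency
    return avg, latency_samples + 1
```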

Container Support (Done)

  • Dockerfile with Python 2.7-slim base
  • docker-compose.yml for local development
  • Rootless podman deployment documentation
  • Volume mounts for persistent data

Code Style (Done)

  • Normalized indentation (4-space, no tabs)
  • Removed dead code and unused imports
  • Added docstrings to classes and functions
  • Python 2/3 compatible imports (Queue/queue)
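
The Queue/queue import pattern mentioned above is typically written as a guarded import. A small sketch (the variable `jobs` and the sample proxy string are illustrative):

```python
try:
    import Queue as queue   # Python 2 module name
except ImportError:
    import queue            # Python 3 module name

jobs = queue.Queue()
jobs.put("1.2.3.4:8080")
```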

Geographic Validation (Done)

  • IP2Location integration for country lookup
  • pyasn integration for ASN lookup
  • Graceful fallback when database files missing
  • Country codes displayed in test output: (US), (IN), etc.
  • Data files: IP2LOCATION-LITE-DB1.BIN, ipasn.dat

SSL Proxy Testing (Done)

  • Default checktype changed to 'ssl'
  • ssl_targets list with major HTTPS sites
  • TLS handshake validation with certificate verification
  • Detects MITM proxies that intercept SSL connections

Export Functionality (Done)

  • export.py CLI tool for exporting working proxies
  • Multiple formats: txt, json, csv, len (length-prefixed)
  • Filters: proto, country, anonymity, max_latency
  • Sort options: latency, added, tested, success
  • Output to stdout or file

Web Dashboard (Done)

  • /dashboard endpoint with dark theme HTML UI
  • /api/stats endpoint for JSON runtime statistics
  • Auto-refresh with JavaScript fetch every 3 seconds
  • Stats provider callback from proxywatchd.py to httpd.py
  • Displays: tested/passed/success rate, thread count, uptime
  • Tor pool health: per-host latency, success rate, availability
  • Failure categories breakdown: timeout, proxy, ssl, closed

Dashboard Enhancements v2 (Done)

  • Prominent check type badge in header (SSL/JUDGES/HTTP/IRC)
  • System monitor bar: load, memory, disk, process RSS
  • Anonymity breakdown: elite/anonymous/transparent counts
  • Database health: size, tested/hour, added/day, dead count
  • Enhanced Tor pool stats: requests, success rate, healthy nodes, latency
  • SQLite ANALYZE/VACUUM functions for query optimization
  • Lightweight design: client-side polling, minimal DOM updates

Dashboard Enhancements v3 (Done)

  • Electric cyan theme with translucent glass-morphism effects
  • Unified wrapper styling (.chart-wrap, .histo-wrap, .stats-wrap, .lb-wrap, .pie-wrap)
  • Consistent backdrop-filter blur and electric glow borders
  • Tor Exit Nodes cards with hover effects (.tor-card)
  • Lighter background/tile color scheme (#1e2738 bg, #181f2a card)
  • Map endpoint restyled to match dashboard (electric cyan theme)
  • Map markers updated from gold to cyan for approximate locations

Memory Profiling & Analysis (Done)

  • /api/memory endpoint with comprehensive memory stats
  • objgraph integration for object type counting
  • pympler integration for memory summaries
  • Memory sample history tracking (RSS over time)
  • Process memory from /proc/self/status (VmRSS, VmPeak, VmData, etc.)
  • GC statistics (collections, objects, thresholds)
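
Reading process memory from /proc/self/status amounts to parsing `Key: value kB` lines. The sketch below is illustrative only; the function names are hypothetical, and the file exists only on Linux.

```python
def parse_proc_status(text, fields=("VmRSS", "VmPeak", "VmData")):
    """Extract selected memory fields (in KiB) from /proc/<pid>/status text."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key in fields:
            parts = value.split()
            # Memory lines look like "VmRSS:      1234 kB".
            if parts and parts[0].isdigit():
                stats[key] = int(parts[0])
    return stats


def read_own_memory(path="/proc/self/status"):
    """Linux only: returns {} when the file is unavailable."""
    try:
        with open(path) as handle:
            return parse_proc_status(handle.read())
    except IOError:
        return {}
```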

MITM Detection Optimization (Done)

  • MITM re-test skip optimization - avoid redundant SSL checks for known MITM proxies
  • mitm_retest_skipped stats counter for tracking optimization effectiveness
  • Content hash deduplication for stale proxy list detection
  • stale_count reset when content hash changes
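
Content hash deduplication can be sketched as below: a source page whose hash is unchanged counts as one more stale fetch, and any change resets the counter. The choice of SHA-256 and the function name `update_stale_count` are assumptions; the roadmap does not say which hash PPF uses.

```python
import hashlib


def update_stale_count(prev_hash, stale_count, content):
    """Return (content_hash, stale_count) after fetching a proxy list page.

    `content` is the raw page body as bytes.
    """
    content_hash = hashlib.sha256(content).hexdigest()
    if content_hash == prev_hash:
        # Unchanged page: one more stale fetch.
        return content_hash, stale_count + 1
    # New content: reset the counter.
    return content_hash, 0
```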

Distributed Workers (Done)

  • Worker registration and heartbeat system
  • /api/workers endpoint for worker status monitoring
  • Tor connectivity check before workers claim work
  • Worker test rate tracking with sliding window calculation
  • Combined rate aggregation across all workers
  • Dashboard worker cards showing per-worker stats
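
A sliding-window test rate with aggregation across workers could be computed roughly as follows. `RateTracker` and `combined_rate` are hypothetical names and the 60-second window is an assumed default.

```python
import collections
import time


class RateTracker(object):
    """Tests per second over the last `window` seconds (hypothetical class)."""

    def __init__(self, window=60.0):
        self.window = window
        self._events = collections.deque()   # (timestamp, count) pairs

    def record(self, count, now=None):
        now = time.time() if now is None else now
        self._events.append((now, count))

    def rate(self, now=None):
        now = time.time() if now is None else now
        # Drop samples that have slid out of the window.
        while self._events and self._events[0][0] <= now - self.window:
            self._events.popleft()
        return sum(count for _, count in self._events) / float(self.window)


def combined_rate(trackers, now=None):
    """Aggregate rate across all workers."""
    return sum(tracker.rate(now) for tracker in trackers)
```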

Dashboard Performance (Done)

  • Keyboard shortcuts: r=refresh, 1-9=tabs, t=theme, p=pause
  • Tab-aware chart rendering - skip expensive renders for hidden tabs
  • Visibility API - pause polling when browser tab hidden
  • Dark/muted-dark/light theme cycling
  • Stats export endpoint (/api/stats/export?format=json|csv)

Proxy Validation Cache (Done)

  • LRU cache for is_usable_proxy() using OrderedDict
  • Thread-safe with lock for concurrent access
  • Proper LRU eviction (move_to_end on hits, popitem oldest when full)
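
One way to build such a cache is sketched below. It is not the code in fetch.py, and the class name `LRUCache` is hypothetical. Because `OrderedDict.move_to_end()` only exists on Python 3.2 and later, this version marks a hit as most recently used by popping and re-inserting the key, which behaves the same on Python 2.7.

```python
import collections
import threading


class LRUCache(object):
    """Small thread-safe LRU cache (illustrative; not PPF's implementation)."""

    def __init__(self, maxsize=1024):
        self.maxsize = maxsize
        self._data = collections.OrderedDict()
        self._lock = threading.Lock()

    def get(self, key, default=None):
        with self._lock:
            if key not in self._data:
                return default
            # Re-insert to mark the entry as most recently used.
            value = self._data.pop(key)
            self._data[key] = value
            return value

    def put(self, key, value):
        with self._lock:
            self._data.pop(key, None)
            self._data[key] = value
            if len(self._data) > self.maxsize:
                # Evict the least recently used entry (front of the dict).
                self._data.popitem(last=False)
```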

Database Context Manager (Done)

  • Refactored all DB operations to use _db_context() context manager
  • Connections guaranteed to close even on exceptions
  • Removed deprecated _prep_db() and _close_db() methods
  • fetch_rows() now accepts db parameter for cleaner dependency injection
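
The context-manager pattern described above can be sketched with contextlib. The real `_db_context()` is a method and its signature is not shown here, so the standalone `db_context(path)` below is a simplified assumption, as is the `fetch_rows` signature.

```python
import contextlib
import sqlite3


@contextlib.contextmanager
def db_context(path):
    """Yield a connection that is always closed, even if the body raises."""
    conn = sqlite3.connect(path)
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()


def fetch_rows(db, sql, params=()):
    """Takes the connection as a parameter instead of opening its own."""
    return db.execute(sql, params).fetchall()
```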

Additional Search Engines (Done)

  • Bing and Yahoo engine implementations in scraper.py
  • Engine rotation for rate limit avoidance
  • engines.py module with SearchEngine base class
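
A base class with round-robin rotation might be organized like the sketch below. The engines and URL templates are placeholders on reserved `.example` hosts, not the Bing or Yahoo implementations, and the `build_url` and `rotate` names are assumptions about the interface.

```python
import itertools

try:
    from urllib import quote_plus          # Python 2
except ImportError:
    from urllib.parse import quote_plus    # Python 3


class SearchEngine(object):
    """Base class: subclasses set `name` and `url_template`."""
    name = "base"
    url_template = ""

    def build_url(self, query, page=0):
        return self.url_template % (quote_plus(query), page)


class EngineA(SearchEngine):
    name = "engine-a"
    url_template = "https://engine-a.example/search?q=%s&page=%d"


class EngineB(SearchEngine):
    name = "engine-b"
    url_template = "https://engine-b.example/find?query=%s&p=%d"


def rotate(engines):
    """Return an endless round-robin iterator over the engines."""
    return itertools.cycle(engines)
```

Rotating per query spreads requests across hosts so that no single engine sees consecutive requests.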

Worker Health Improvements (Done)

  • Tor connectivity check before workers claim work
  • Fixed-interval Tor check (30s) instead of exponential backoff
  • Graceful handling when Tor unavailable

Memory Optimization (Done)

  • __slots__ on ProxyTestState (27 attrs) and TargetTestJob (4 attrs)
  • Reduced per-object memory overhead for hot path objects
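
The effect of `__slots__` can be shown on a small stand-in class. `SlottedJob` and its four attribute names are hypothetical; they are not the actual attributes of TargetTestJob.

```python
class SlottedJob(object):
    """Stand-in for a hot-path object; attribute names are illustrative."""
    __slots__ = ("proxy", "target", "result", "latency")

    def __init__(self, proxy, target):
        self.proxy = proxy
        self.target = target
        self.result = None
        self.latency = None
```

Instances of a slotted class carry no per-instance `__dict__`, which is where the memory saving comes from. Assigning an attribute that is not listed in `__slots__` raises AttributeError.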

Technical Debt

| Item | Description | Risk / Status |
|------|-------------|---------------|
| Dual _known_proxies | ppf.py and fetch.py maintain separate caches | Resolved |
| Global config in fetch.py | set_config() pattern is fragile | Low - works but not clean |
| No input validation | Proxy strings parsed without validation | Resolved |
| Silent exception catching | Some except: pass patterns hide errors | Resolved |
| Hardcoded timeouts | Various timeout values scattered in code | Resolved |

File Reference

| File | Purpose | Status |
|------|---------|--------|
| ppf.py | Main URL harvester daemon | Active, cleaned |
| proxywatchd.py | Proxy validation daemon | Active, enhanced |
| scraper.py | Searx search integration | Active, cleaned |
| fetch.py | HTTP fetching with proxy support | Active, LRU cache |
| dbs.py | Database schema and inserts | Active |
| mysqlite.py | SQLite wrapper | Active |
| rocksock.py | Socket/proxy abstraction (3rd party) | Stable |
| http2.py | HTTP client implementation | Stable |
| httpd.py | Web dashboard and REST API server | Active, enhanced |
| config.py | Configuration management | Active |
| comboparse.py | Config/arg parser framework | Stable, cleaned |
| soup_parser.py | BeautifulSoup wrapper | Stable, cleaned |
| misc.py | Utilities (timestamp, logging) | Stable, cleaned |
| export.py | Proxy export CLI tool | Active |
| engines.py | Search engine implementations | Active |
| connection_pool.py | Tor connection pooling | Active |
| network_stats.py | Network statistics tracking | Active |
| dns.py | DNS resolution with caching | Active |
| mitm.py | MITM certificate detection | Active |
| job.py | Priority job queue | Active |
| static/dashboard.js | Dashboard frontend logic | Active, enhanced |
| static/dashboard.html | Dashboard HTML template | Active |