# PPF Project Roadmap

## Project Purpose

PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework designed to:

1. **Discover** proxy addresses by crawling websites and search engines
2. **Validate** proxies through multi-target testing via Tor
3. **Maintain** a database of working proxies with protocol detection (SOCKS4/SOCKS5/HTTP)

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              PPF Architecture                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                      │
│  │ scraper.py  │    │   ppf.py    │    │proxywatchd  │                      │
│  │             │    │             │    │             │                      │
│  │ Searx query │───>│ URL harvest │───>│ Proxy test  │                      │
│  │ URL finding │    │Proxy extract│    │ Validation  │                      │
│  └─────────────┘    └─────────────┘    └─────────────┘                      │
│         │                  │                  │                             │
│         v                  v                  v                             │
│  ┌─────────────────────────────────────────────────────────────────┐        │
│  │                        SQLite Databases                         │        │
│  │  uris.db (URLs)              proxies.db (proxy list)            │        │
│  └─────────────────────────────────────────────────────────────────┘        │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────┐        │
│  │                          Network Layer                          │        │
│  │  rocksock.py ─── Tor SOCKS ─── Test Proxy ─── Target Server     │        │
│  └─────────────────────────────────────────────────────────────────┘        │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

## Constraints

- **Python 2.7** compatibility required
- **Minimal external dependencies** (avoid adding new modules)
- Current dependencies: beautifulsoup4, pyasn, IP2Location
- Data files: IP2LOCATION-LITE-DB1.BIN (country), ipasn.dat (ASN)

---

## Phase 1: Stability & Code Quality

**Objective:** Establish a solid, maintainable codebase

### 1.1 Error Handling Improvements

| Task | Description | File(s) |
|------|-------------|---------|
| Add connection retry logic | Implement exponential backoff for failed connections | rocksock.py, fetch.py |
| Graceful database errors | Handle SQLite lock/busy errors with retry | mysqlite.py |
| Timeout standardization | Consistent timeout handling across all network ops | proxywatchd.py, fetch.py |
| Exception logging | Log exceptions with context, not just silently catch | all files |

### 1.2 Code Consolidation

| Task | Description | File(s) |
|------|-------------|---------|
| Unify _known_proxies | Single source of truth for known proxy cache | ppf.py, fetch.py |
| Extract proxy utils | Create proxy_utils.py with cleanse/validate functions | new file |
| Remove global config pattern | Pass config explicitly instead of set_config() | fetch.py |
| Standardize logging | Consistent _log() usage with levels across all modules | all files |

### 1.3 Testing Infrastructure

| Task | Description | File(s) |
|------|-------------|---------|
| Add unit tests | Test proxy parsing, URL extraction, IP validation | tests/ |
| Mock network layer | Allow testing without live network/Tor | tests/ |
| Validation test suite | Verify multi-target voting logic | tests/ |

---

## Phase 2: Performance Optimization

**Objective:** Improve throughput and resource efficiency

### 2.1 Connection Pooling

| Task | Description | File(s) |
|------|-------------|---------|
| Tor connection reuse | Pool Tor SOCKS connections instead of reconnecting | proxywatchd.py |
| HTTP keep-alive | Reuse connections to same target servers | http2.py |
| Connection warm-up | Pre-establish connections before job assignment | proxywatchd.py |

### 2.2 Database Optimization

| Task | Description | File(s) |
|------|-------------|---------|
| Batch inserts | Group INSERT operations (already partial) | dbs.py |
| Index optimization | Add indexes for frequent query patterns | dbs.py |
| WAL mode | Enable Write-Ahead Logging for better concurrency | mysqlite.py |
| Prepared statements | Cache compiled SQL statements | mysqlite.py |

### 2.3 Threading Improvements

| Task | Description | File(s) |
|------|-------------|---------|
| Dynamic thread scaling | Adjust thread count based on success rate | proxywatchd.py |
| Priority queue | Test high-value proxies (low fail count) first | proxywatchd.py |
| Stale proxy cleanup | Background thread to remove long-dead proxies | proxywatchd.py |

---

## Phase 3: Reliability & Accuracy

**Objective:** Improve proxy validation accuracy and system reliability

### 3.1 Enhanced Validation

| Task | Description | File(s) |
|------|-------------|---------|
| Latency tracking | Store and use connection latency metrics | proxywatchd.py, dbs.py |
| Geographic validation | Verify proxy actually routes through claimed location | proxywatchd.py |
| Protocol fingerprinting | Better SOCKS4/SOCKS5/HTTP detection | rocksock.py |
| HTTPS/SSL testing | Validate SSL proxy capabilities | proxywatchd.py |

### 3.2 Target Management

| Task | Description | File(s) |
|------|-------------|---------|
| Dynamic target pool | Auto-discover and rotate validation targets | proxywatchd.py |
| Target health tracking | Remove unresponsive targets from pool | proxywatchd.py |
| Geographic target spread | Ensure targets span multiple regions | config.py |

### 3.3 Failure Analysis

| Task | Description | File(s) |
|------|-------------|---------|
| Failure categorization | Distinguish timeout vs refused vs auth-fail | proxywatchd.py |
| Retry strategies | Different retry logic per failure type | proxywatchd.py |
| Dead proxy quarantine | Separate storage for likely-dead proxies | dbs.py |

---

## Phase 4: Features & Usability

**Objective:** Add useful features while maintaining simplicity

### 4.1 Reporting & Monitoring

| Task | Description | File(s) |
|------|-------------|---------|
| Statistics collection | Track success rates, throughput, latency | proxywatchd.py |
| Periodic status output | Log summary stats every N minutes | ppf.py, proxywatchd.py |
| Export functionality | Export working proxies to file (txt, json) | new: export.py |

### 4.2 Configuration

| Task | Description | File(s) |
|------|-------------|---------|
| Config validation | Validate config.ini on startup | config.py |
| Runtime reconfiguration | Reload config without restart (SIGHUP) | proxywatchd.py |
| Sensible defaults | Document and improve default values | config.py |

### 4.3 Proxy Source Expansion

| Task | Description | File(s) |
|------|-------------|---------|
| Additional scrapers | Support more search engines beyond Searx | scraper.py |
| API sources | Integrate free proxy API endpoints | new: api_sources.py |
| Import formats | Support various proxy list formats | ppf.py |

---

## Implementation Priority

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                               Priority Matrix                               │
├──────────────────────────┬──────────────────────────────────────────────────┤
│ HIGH IMPACT / LOW EFFORT │ HIGH IMPACT / HIGH EFFORT                        │
│                          │                                                  │
│ [x] Unify _known_proxies │ [x] Connection pooling                           │
│ [x] Graceful DB errors   │ [x] Dynamic thread scaling                       │
│ [x] Batch inserts        │ [ ] Unit test infrastructure                     │
│ [x] WAL mode for SQLite  │ [x] Latency tracking                             │
│                          │                                                  │
├──────────────────────────┼──────────────────────────────────────────────────┤
│ LOW IMPACT / LOW EFFORT  │ LOW IMPACT / HIGH EFFORT                         │
│                          │                                                  │
│ [x] Standardize logging  │ [x] Geographic validation                        │
│ [x] Config validation    │ [x] Additional scrapers                          │
│ [x] Export functionality │ [ ] API sources                                  │
│ [x] Status output        │ [ ] Protocol fingerprinting                      │
│                          │                                                  │
└──────────────────────────┴──────────────────────────────────────────────────┘
```

---

## Completed Work

### Multi-Target Validation (Done)

- [x] Work-stealing queue with shared Queue.Queue()
- [x] Multi-target validation (2/3 majority voting)
- [x] Interleaved testing (jobs shuffled across proxies)
- [x] ProxyTestState and TargetTestJob classes

### Code Cleanup (Done)

- [x] Removed dead HTTP server code from ppf.py
- [x] Removed dead gumbo code from soup_parser.py
- [x] Removed test code from comboparse.py
- [x] Removed unused functions from misc.py
- [x] Fixed IP/port cleansing in ppf.py extract_proxies()
- [x] Updated .gitignore, removed .pyc files

### Database Optimization (Done)

- [x] Enable SQLite WAL mode for better concurrency
- [x] Add indexes for common query patterns (failed, tested, proto, error, check_time)
- [x] Optimize batch inserts (remove redundant SELECT before INSERT OR IGNORE)

### Dependency Reduction (Done)

- [x] Make lxml optional (removed from requirements)
- [x] Make IP2Location optional (graceful fallback)
- [x] Add --nobs flag for stdlib HTMLParser fallback (bs4 optional)

### Rate Limiting & Stability (Done)

- [x] InstanceTracker class in scraper.py with exponential backoff
- [x] Configurable backoff_base, backoff_max, fail_threshold
- [x] Exception logging with context (replaced bare except blocks)
- [x] Unified _known_proxies cache in fetch.py

### Monitoring & Maintenance (Done)

- [x] Stats class in proxywatchd.py (tested/passed/failed tracking)
- [x] Periodic stats reporting (configurable stats_interval)
- [x] Stale proxy cleanup (cleanup_stale() with configurable stale_days)
- [x] Timeout config options (timeout_connect, timeout_read)

### Connection Pooling (Done)

- [x] TorHostState class tracking per-host health and latency
- [x] TorConnectionPool with worker affinity for circuit reuse
- [x] Exponential backoff (5s, 10s, 20s, 40s, max 60s) on failures
- [x] Pool warmup and health status reporting

### Priority Queue (Done)

- [x] PriorityJobQueue class with heap-based ordering
- [x] calculate_priority() assigns priority 0-4 by proxy state
- [x] New proxies tested first, high-fail proxies last

### Dynamic Thread Scaling (Done)

- [x] ThreadScaler class adjusts thread count dynamically
- [x] Scales up when queue deep and success rate acceptable
- [x] Scales down when queue shallow or success rate drops
- [x] Respects min/max bounds with cooldown period

### Latency Tracking (Done)

- [x] avg_latency, latency_samples columns in proxylist
- [x] Exponential moving average calculation
- [x] Migration function for existing databases
- [x] Latency recorded for successful proxy tests

### Container Support (Done)

- [x] Dockerfile with Python 2.7-slim base
- [x] docker-compose.yml for local development
- [x] Rootless podman deployment documentation
- [x] Volume mounts for persistent data

### Code Style (Done)

- [x] Normalized indentation (4-space, no tabs)
- [x] Removed dead code and unused imports
- [x] Added docstrings to classes and functions
- [x] Python 2/3 compatible imports (Queue/queue)

### Geographic Validation (Done)

- [x] IP2Location integration for country lookup
- [x] pyasn integration for ASN lookup
- [x] Graceful fallback when database files missing
- [x] Country codes displayed in test output: `(US)`, `(IN)`, etc.
- [x] Data files: IP2LOCATION-LITE-DB1.BIN, ipasn.dat

### SSL Proxy Testing (Done)

- [x] Default checktype changed to 'ssl'
- [x] ssl_targets list with major HTTPS sites
- [x] TLS handshake validation with certificate verification
- [x] Detects MITM proxies that intercept SSL connections

### Export Functionality (Done)

- [x] export.py CLI tool for exporting working proxies
- [x] Multiple formats: txt, json, csv, len (length-prefixed)
- [x] Filters: proto, country, anonymity, max_latency
- [x] Sort options: latency, added, tested, success
- [x] Output to stdout or file

### Web Dashboard (Done)

- [x] /dashboard endpoint with dark theme HTML UI
- [x] /api/stats endpoint for JSON runtime statistics
- [x] Auto-refresh with JavaScript fetch every 3 seconds
- [x] Stats provider callback from proxywatchd.py to httpd.py
- [x] Displays: tested/passed/success rate, thread count, uptime
- [x] Tor pool health: per-host latency, success rate, availability
- [x] Failure categories breakdown: timeout, proxy, ssl, closed

### Dashboard Enhancements v2 (Done)

- [x] Prominent check type badge in header (SSL/JUDGES/HTTP/IRC)
- [x] System monitor bar: load, memory, disk, process RSS
- [x] Anonymity breakdown: elite/anonymous/transparent counts
- [x] Database health: size, tested/hour, added/day, dead count
- [x] Enhanced Tor pool stats: requests, success rate, healthy nodes, latency
- [x] SQLite ANALYZE/VACUUM functions for query optimization
- [x] Lightweight design: client-side polling, minimal DOM updates

### Dashboard Enhancements v3 (Done)

- [x] Electric cyan theme with translucent glass-morphism effects
- [x] Unified wrapper styling (.chart-wrap, .histo-wrap, .stats-wrap, .lb-wrap, .pie-wrap)
- [x] Consistent backdrop-filter blur and electric glow borders
- [x] Tor Exit Nodes cards with hover effects (.tor-card)
- [x] Lighter background/tile color scheme (#1e2738 bg, #181f2a card)
- [x] Map endpoint restyled to match dashboard (electric cyan theme)
- [x] Map markers updated from gold to cyan for approximate locations

### Memory Profiling & Analysis (Done)

- [x] /api/memory endpoint with comprehensive memory stats
- [x] objgraph integration for object type counting
- [x] pympler integration for memory summaries
- [x] Memory sample history tracking (RSS over time)
- [x] Process memory from /proc/self/status (VmRSS, VmPeak, VmData, etc.)
- [x] GC statistics (collections, objects, thresholds)

### MITM Detection Optimization (Done)

- [x] MITM re-test skip optimization - avoid redundant SSL checks for known MITM proxies
- [x] mitm_retest_skipped stats counter for tracking optimization effectiveness
- [x] Content hash deduplication for stale proxy list detection
- [x] stale_count reset when content hash changes

---

## Technical Debt

| Item | Description | Risk |
|------|-------------|------|
| ~~Dual _known_proxies~~ | ~~ppf.py and fetch.py maintain separate caches~~ | **Resolved** |
| Global config in fetch.py | set_config() pattern is fragile | Low - works but not clean |
| ~~No input validation~~ | ~~Proxy strings parsed without validation~~ | **Resolved** |
| ~~Silent exception catching~~ | ~~Some except: pass patterns hide errors~~ | **Resolved** |
| ~~Hardcoded timeouts~~ | ~~Various timeout values scattered in code~~ | **Resolved** |

---

## File Reference

| File | Purpose | Status |
|------|---------|--------|
| ppf.py | Main URL harvester daemon | Active, cleaned |
| proxywatchd.py | Proxy validation daemon | Active, enhanced |
| scraper.py | Searx search integration | Active, cleaned |
| fetch.py | HTTP fetching with proxy support | Active |
| dbs.py | Database schema and inserts | Active |
| mysqlite.py | SQLite wrapper | Active |
| rocksock.py | Socket/proxy abstraction (3rd party) | Stable |
| http2.py | HTTP client implementation | Stable |
| config.py | Configuration management | Active |
| comboparse.py | Config/arg parser framework | Stable, cleaned |
| soup_parser.py | BeautifulSoup wrapper | Stable, cleaned |
| misc.py | Utilities (timestamp, logging) | Stable, cleaned |
| export.py | Proxy export CLI tool | Active |
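
As a reference for the Tor pool backoff schedule listed under Connection Pooling (5s, 10s, 20s, 40s, max 60s), the following is a minimal sketch of how such a schedule could be computed. The function name `backoff_delay` and its `base`/`cap` parameters are hypothetical and chosen for illustration; they are not taken from proxywatchd.py, and the real implementation may differ.

```python
# Illustrative sketch only: backoff_delay, base and cap are assumed names,
# not identifiers from the PPF codebase. Runs on Python 2.7 and Python 3.

def backoff_delay(failures, base=5, cap=60):
    """Return the wait in seconds after `failures` consecutive failures.

    The delay doubles with each failure, starting at `base`, and is
    clamped to `cap`. Zero or negative failure counts mean no wait.
    """
    if failures < 1:
        return 0
    return min(base * 2 ** (failures - 1), cap)


# Delays for the first six consecutive failures.
schedule = [backoff_delay(n) for n in range(1, 7)]
print(schedule)  # [5, 10, 20, 40, 60, 60]
```

Clamping with `min()` keeps the delay bounded, so a Tor host that has been failing for a long time is still retried at least once a minute rather than being postponed indefinitely.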