# PPF Project Roadmap

## Project Purpose

PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework designed to:

1. **Discover** proxy addresses by crawling websites and search engines
2. **Validate** proxies through multi-target testing via Tor
3. **Maintain** a database of working proxies with protocol detection (SOCKS4/SOCKS5/HTTP)

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              PPF Architecture                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                      │
│  │ scraper.py  │    │   ppf.py    │    │ proxywatchd │                      │
│  │             │    │             │    │             │                      │
│  │ Searx query │───>│ URL harvest │───>│ Proxy test  │                      │
│  │ URL finding │    │Proxy extract│    │ Validation  │                      │
│  └─────────────┘    └─────────────┘    └─────────────┘                      │
│         │                  │                  │                             │
│         v                  v                  v                             │
│  ┌─────────────────────────────────────────────────────────────────┐        │
│  │                        SQLite Databases                         │        │
│  │        uris.db (URLs)            proxies.db (proxy list)        │        │
│  └─────────────────────────────────────────────────────────────────┘        │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────┐        │
│  │                          Network Layer                          │        │
│  │   rocksock.py ─── Tor SOCKS ─── Test Proxy ─── Target Server    │        │
│  └─────────────────────────────────────────────────────────────────┘        │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

## Constraints

- **Python 2.7** compatibility required
- **Minimal external dependencies** (avoid adding new modules)
- Current dependencies: beautifulsoup4
- Optional: IP2Location (for proxy geolocation)

---

## Phase 1: Stability & Code Quality

**Objective:** Establish a solid, maintainable codebase

### 1.1 Error Handling Improvements

| Task | Description | File(s) |
|------|-------------|---------|
| Add connection retry logic | Implement exponential backoff for failed connections | rocksock.py, fetch.py |
| Graceful database errors | Handle SQLite lock/busy errors with retry | mysqlite.py |
| Timeout standardization | Consistent timeout handling across all network ops | proxywatchd.py, fetch.py |
| Exception logging | Log exceptions with context, not just silently catch | all files |

### 1.2 Code Consolidation

| Task | Description | File(s) |
|------|-------------|---------|
| Unify _known_proxies | Single source of truth for known proxy cache | ppf.py, fetch.py |
| Extract proxy utils | Create proxy_utils.py with cleanse/validate functions | new file |
| Remove global config pattern | Pass config explicitly instead of set_config() | fetch.py |
| Standardize logging | Consistent _log() usage with levels across all modules | all files |

### 1.3 Testing Infrastructure

| Task | Description | File(s) |
|------|-------------|---------|
| Add unit tests | Test proxy parsing, URL extraction, IP validation | tests/ |
| Mock network layer | Allow testing without live network/Tor | tests/ |
| Validation test suite | Verify multi-target voting logic | tests/ |

---

## Phase 2: Performance Optimization

**Objective:** Improve throughput and resource efficiency

### 2.1 Connection Pooling

| Task | Description | File(s) |
|------|-------------|---------|
| Tor connection reuse | Pool Tor SOCKS connections instead of reconnecting | proxywatchd.py |
| HTTP keep-alive | Reuse connections to same target servers | http2.py |
| Connection warm-up | Pre-establish connections before job assignment | proxywatchd.py |

### 2.2 Database Optimization

| Task | Description | File(s) |
|------|-------------|---------|
| Batch inserts | Group INSERT operations (already partial) | dbs.py |
| Index optimization | Add indexes for frequent query patterns | dbs.py |
| WAL mode | Enable Write-Ahead Logging for better concurrency | mysqlite.py |
| Prepared statements | Cache compiled SQL statements | mysqlite.py |

### 2.3 Threading Improvements

| Task | Description | File(s) |
|------|-------------|---------|
| Dynamic thread scaling | Adjust thread count based on success rate | proxywatchd.py |
| Priority queue | Test high-value proxies (low fail count) first | proxywatchd.py |
| Stale proxy cleanup | Background thread to remove long-dead proxies | proxywatchd.py |

---

## Phase 3: Reliability & Accuracy

**Objective:** Improve proxy validation accuracy and system reliability

### 3.1 Enhanced Validation

| Task | Description | File(s) |
|------|-------------|---------|
| Latency tracking | Store and use connection latency metrics | proxywatchd.py, dbs.py |
| Geographic validation | Verify proxy actually routes through claimed location | proxywatchd.py |
| Protocol fingerprinting | Better SOCKS4/SOCKS5/HTTP detection | rocksock.py |
| HTTPS/SSL testing | Validate SSL proxy capabilities | proxywatchd.py |

### 3.2 Target Management

| Task | Description | File(s) |
|------|-------------|---------|
| Dynamic target pool | Auto-discover and rotate validation targets | proxywatchd.py |
| Target health tracking | Remove unresponsive targets from pool | proxywatchd.py |
| Geographic target spread | Ensure targets span multiple regions | config.py |

### 3.3 Failure Analysis

| Task | Description | File(s) |
|------|-------------|---------|
| Failure categorization | Distinguish timeout vs refused vs auth-fail | proxywatchd.py |
| Retry strategies | Different retry logic per failure type | proxywatchd.py |
| Dead proxy quarantine | Separate storage for likely-dead proxies | dbs.py |

---

## Phase 4: Features & Usability

**Objective:** Add useful features while maintaining simplicity

### 4.1 Reporting & Monitoring

| Task | Description | File(s) |
|------|-------------|---------|
| Statistics collection | Track success rates, throughput, latency | proxywatchd.py |
| Periodic status output | Log summary stats every N minutes | ppf.py, proxywatchd.py |
| Export functionality | Export working proxies to file (txt, json) | new: export.py |

### 4.2 Configuration

| Task | Description | File(s) |
|------|-------------|---------|
| Config validation | Validate config.ini on startup | config.py |
| Runtime reconfiguration | Reload config without restart (SIGHUP) | proxywatchd.py |
| Sensible defaults | Document and improve default values | config.py |

### 4.3 Proxy Source Expansion

| Task | Description | File(s) |
|------|-------------|---------|
| Additional scrapers | Support more search engines beyond Searx | scraper.py |
| API sources | Integrate free proxy API endpoints | new: api_sources.py |
| Import formats | Support various proxy list formats | ppf.py |

---

## Implementation Priority

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                               Priority Matrix                               │
├──────────────────────────┬──────────────────────────────────────────────────┤
│ HIGH IMPACT / LOW EFFORT │ HIGH IMPACT / HIGH EFFORT                        │
│                          │                                                  │
│ ● Unify _known_proxies   │ ● Connection pooling                             │
│ ● Graceful DB errors     │ ● Dynamic thread scaling                         │
│ ● Batch inserts          │ ● Unit test infrastructure                       │
│ ● WAL mode for SQLite    │ ● Latency tracking                               │
│                          │                                                  │
├──────────────────────────┼──────────────────────────────────────────────────┤
│ LOW IMPACT / LOW EFFORT  │ LOW IMPACT / HIGH EFFORT                         │
│                          │                                                  │
│ ● Standardize logging    │ ● Geographic validation                          │
│ ● Config validation      │ ● Additional scrapers                            │
│ ● Export functionality   │ ● API sources                                    │
│ ● Status output          │ ● Protocol fingerprinting                        │
│                          │                                                  │
└──────────────────────────┴──────────────────────────────────────────────────┘
```

---

## Completed Work

### Multi-Target Validation (Done)

- [x] Work-stealing queue with shared Queue.Queue()
- [x] Multi-target validation (2/3 majority voting)
- [x] Interleaved testing (jobs shuffled across proxies)
- [x] ProxyTestState and TargetTestJob classes

### Code Cleanup (Done)

- [x] Removed dead HTTP server code from ppf.py
- [x] Removed dead gumbo code from soup_parser.py
- [x] Removed test code from comboparse.py
- [x] Removed unused functions from misc.py
- [x] Fixed IP/port cleansing in ppf.py extract_proxies()
- [x] Updated .gitignore, removed .pyc files

### Database Optimization (Done)

- [x] Enable SQLite WAL mode for better concurrency
- [x] Add indexes for common query patterns (failed, tested, proto, error, check_time)
- [x] Optimize batch inserts (remove redundant SELECT before INSERT OR IGNORE)

### Dependency Reduction (Done)

- [x] Make lxml optional (removed from requirements)
- [x] Make IP2Location optional (graceful fallback)
- [x] Add --nobs flag for stdlib HTMLParser fallback (bs4 optional)

### Rate Limiting & Stability (Done)

- [x] InstanceTracker class in scraper.py with exponential backoff
- [x] Configurable backoff_base, backoff_max, fail_threshold
- [x] Exception logging with context (replaced bare except blocks)
- [x] Unified _known_proxies cache in fetch.py

### Monitoring & Maintenance (Done)

- [x] Stats class in proxywatchd.py (tested/passed/failed tracking)
- [x] Periodic stats reporting (configurable stats_interval)
- [x] Stale proxy cleanup (cleanup_stale() with configurable stale_days)
- [x] Timeout config options (timeout_connect, timeout_read)

---

## Technical Debt

| Item | Description | Risk |
|------|-------------|------|
| ~~Dual _known_proxies~~ | ~~ppf.py and fetch.py maintain separate caches~~ | **Resolved** |
| Global config in fetch.py | set_config() pattern is fragile | Low - works but not clean |
| No input validation | Proxy strings parsed without validation | Medium - could crash on bad data |
| ~~Silent exception catching~~ | ~~Some except: pass patterns hide errors~~ | **Resolved** |
| ~~Hardcoded timeouts~~ | ~~Various timeout values scattered in code~~ | **Resolved** |

---

## File Reference

| File | Purpose | Status |
|------|---------|--------|
| ppf.py | Main URL harvester daemon | Active, cleaned |
| proxywatchd.py | Proxy validation daemon | Active, enhanced |
| scraper.py | Searx search integration | Active, cleaned |
| fetch.py | HTTP fetching with proxy support | Active |
| dbs.py | Database schema and inserts | Active |
| mysqlite.py | SQLite wrapper | Active |
| rocksock.py | Socket/proxy abstraction (3rd party) | Stable |
| http2.py | HTTP client implementation | Stable |
| config.py | Configuration management | Active |
| comboparse.py | Config/arg parser framework | Stable, cleaned |
| soup_parser.py | BeautifulSoup wrapper | Stable, cleaned |
| misc.py | Utilities (timestamp, logging) | Stable, cleaned |
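
---

## Appendix: Sketch for "Graceful database errors"

The Phase 1.1 task "Handle SQLite lock/busy errors with retry" could look roughly like the sketch below. It is an illustration only, not the existing mysqlite.py API: the function name `execute_with_retry` and its parameters are hypothetical, and reusing the `backoff_base`/`backoff_max` naming from the scraper's options is an assumption. It uses only the standard-library `sqlite3` module and is written to run under Python 2.7 as well as Python 3.

```python
import sqlite3
import time


def execute_with_retry(conn, sql, params=(), retries=5,
                       backoff_base=0.05, backoff_max=1.0, sleep=time.sleep):
    """Run one statement, retrying while SQLite reports a locked/busy database.

    Hypothetical helper: names and defaults are assumptions, not PPF code.
    `sleep` is injectable so tests can avoid real waiting.
    """
    delay = backoff_base
    for attempt in range(retries + 1):
        try:
            return conn.execute(sql, params)
        except sqlite3.OperationalError as e:
            msg = str(e).lower()
            # Only lock/busy conditions are worth retrying; anything else
            # (bad SQL, missing table) is re-raised immediately.
            if 'locked' not in msg and 'busy' not in msg:
                raise
            if attempt == retries:
                raise
            sleep(delay)
            # Exponential backoff, capped at backoff_max.
            delay = min(delay * 2, backoff_max)


if __name__ == '__main__':
    db = sqlite3.connect(':memory:')
    execute_with_retry(db, 'CREATE TABLE proxylist (proxy TEXT)')
    execute_with_retry(db, 'INSERT INTO proxylist VALUES (?)', ('127.0.0.1:9050',))
    print(execute_with_retry(db, 'SELECT COUNT(*) FROM proxylist').fetchone())
```

Matching on the error message is a pragmatic choice here because Python 2.7's `sqlite3` does not expose the SQLite error code on the exception. Setting a connection `timeout` in `sqlite3.connect()` is a complementary measure, since SQLite then waits for the lock itself before raising.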