PPF Project Roadmap
Project Purpose
PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework designed to:
- Discover proxy addresses by crawling websites and search engines
- Validate proxies through multi-target testing via Tor
- Maintain a database of working proxies with protocol detection (SOCKS4/SOCKS5/HTTP)
Architecture Overview
Constraints
- Python 2.7 compatibility required
- Minimal external dependencies (avoid adding new modules)
- Current dependencies: beautifulsoup4, pyasn, IP2Location
- Data files: IP2LOCATION-LITE-DB1.BIN (country), ipasn.dat (ASN)
Phase 1: Stability & Code Quality
Objective: Establish a solid, maintainable codebase
1.1 Error Handling Improvements
| Task | Description | File(s) |
|------|-------------|---------|
| Add connection retry logic | Implement exponential backoff for failed connections | rocksock.py, fetch.py |
| Graceful database errors | Handle SQLite lock/busy errors with retry | mysqlite.py |
| Timeout standardization | Consistent timeout handling across all network ops | proxywatchd.py, fetch.py |
| Exception logging | Log exceptions with context instead of silently catching them | all files |
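
The retry task in this table could look roughly like the sketch below. It is an illustration only, not code taken from rocksock.py or fetch.py: the helper name `retry_with_backoff` and all of its parameters are hypothetical, and the jitter is an added assumption to keep worker threads from retrying in lockstep.

```python
import random
import time


def retry_with_backoff(func, retries=4, base_delay=0.5, max_delay=30.0,
                       exceptions=(IOError, OSError), sleep=time.sleep):
    """Call func() until it succeeds or the retries are exhausted.

    The delay doubles after each failure (base_delay, 2 * base_delay, ...),
    is capped at max_delay, and gets up to 10% random jitter on top.
    """
    attempt = 0
    while True:
        try:
            return func()
        except exceptions:
            if attempt >= retries:
                raise  # out of retries: let the caller see the error
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay + random.uniform(0, delay * 0.1))
            attempt += 1
```

Socket errors are subclasses of `IOError` on Python 2.7 and of `OSError` on Python 3, so the default `exceptions` tuple covers both. The `sleep` parameter exists so tests can substitute a fake and run without waiting.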
1.2 Code Consolidation
| Task | Description | File(s) |
|------|-------------|---------|
| Unify _known_proxies | Single source of truth for known proxy cache | ppf.py, fetch.py |
| Extract proxy utils | Create proxy_utils.py with cleanse/validate functions | new file |
| Remove global config pattern | Pass config explicitly instead of set_config() | fetch.py |
| Standardize logging | Consistent _log() usage with levels across all modules | all files |
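
A possible shape for the cleanse/validate helpers in the proposed proxy_utils.py is sketched below. The function names `cleanse_proxy` and `is_valid_proxy`, the IPv4-only `ip:port` format, and the decision to reject private and reserved ranges are all assumptions made for illustration; the roadmap does not define them.

```python
import re

# ip:port with four dotted decimal octets; range checks are done below.
_PROXY_RE = re.compile(r'^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}):(\d{1,5})$')


def cleanse_proxy(raw):
    """Normalize ' 010.0.0.1:0080 ' to '10.0.0.1:80'.

    Returns None if the string is not a well-formed ip:port pair.
    """
    match = _PROXY_RE.match(raw.strip())
    if match is None:
        return None
    parts = [int(g) for g in match.groups()]
    octets, port = parts[:4], parts[4]
    if any(o > 255 for o in octets) or not 1 <= port <= 65535:
        return None
    return '%d.%d.%d.%d:%d' % (octets[0], octets[1], octets[2], octets[3], port)


def is_valid_proxy(raw):
    """True if raw parses as ip:port and the address is plausibly public."""
    cleaned = cleanse_proxy(raw)
    if cleaned is None:
        return False
    a, b = [int(x) for x in cleaned.split(':')[0].split('.')[:2]]
    if a in (0, 10, 127) or a >= 224:
        return False  # "this" network, private, loopback, multicast/reserved
    if (a == 172 and 16 <= b <= 31) or (a == 192 and b == 168) \
            or (a == 169 and b == 254):
        return False  # private and link-local ranges
    return True
```

Keeping normalization separate from the public-address check lets callers deduplicate on the cleansed string before deciding whether an entry is worth testing.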
1.3 Testing Infrastructure
| Task | Description | File(s) |
|------|-------------|---------|
| Add unit tests | Test proxy parsing, URL extraction, IP validation | tests/ |
| Mock network layer | Allow testing without live network/Tor | tests/ |
| Validation test suite | Verify multi-target voting logic | tests/ |
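
A test for the voting logic might be structured as in the sketch below. The roadmap does not spell out the voting rule, so the strict-majority rule and the name `majority_vote` are assumptions; the real logic in proxywatchd.py may differ.

```python
import unittest


def majority_vote(results):
    """Return True if strictly more than half of the target checks passed.

    results is a list of booleans, one per validation target.
    An empty list counts as a failure.
    """
    if not results:
        return False
    return sum(1 for ok in results if ok) * 2 > len(results)


class MajorityVoteTest(unittest.TestCase):
    def test_majority_passes(self):
        self.assertTrue(majority_vote([True, True, False]))

    def test_tie_fails(self):
        self.assertFalse(majority_vote([True, False]))

    def test_no_results_fails(self):
        self.assertFalse(majority_vote([]))
```

Because the function takes plain booleans, it can be tested without any network or Tor access. Placed under tests/, such a case can be run with `python -m unittest discover tests`, which is available from Python 2.7 onward.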
Phase 2: Performance Optimization
Objective: Improve throughput and resource efficiency
2.1 Connection Pooling
| Task | Description | File(s) |
|------|-------------|---------|
| Tor connection reuse | Pool Tor SOCKS connections instead of reconnecting | proxywatchd.py |
| HTTP keep-alive | Reuse connections to same target servers | http2.py |
| Connection warm-up | Pre-establish connections before job assignment | proxywatchd.py |
2.2 Database Optimization
| Task | Description | File(s) |
|------|-------------|---------|
| Batch inserts | Group INSERT operations (partly implemented) | dbs.py |
| Index optimization | Add indexes for frequent query patterns | dbs.py |
| WAL mode | Enable Write-Ahead Logging for better concurrency | mysqlite.py |
| Prepared statements | Cache compiled SQL statements | mysqlite.py |
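
The WAL and lock-handling items can be illustrated with the standard `sqlite3` module. This is a sketch under assumptions, not the mysqlite.py wrapper: the helper names `open_db` and `execute_with_retry`, the retry count, and the delays are hypothetical.

```python
import sqlite3
import time


def open_db(path, busy_timeout=5.0):
    """Open a SQLite database and request WAL journaling."""
    # timeout makes SQLite itself wait this long for a lock before raising
    conn = sqlite3.connect(path, timeout=busy_timeout)
    conn.execute('PRAGMA journal_mode=WAL')
    return conn


def execute_with_retry(conn, sql, params=(), retries=5, delay=0.1):
    """Run one statement, retrying while the database is locked or busy."""
    for attempt in range(retries + 1):
        try:
            return conn.execute(sql, params)
        except sqlite3.OperationalError as exc:
            message = str(exc).lower()
            if 'locked' not in message and 'busy' not in message:
                raise  # unrelated error, e.g. a syntax problem
            if attempt == retries:
                raise  # still locked after the last retry
            time.sleep(delay * (2 ** attempt))
```

WAL only takes effect for file-backed databases; an in-memory database keeps its own journal mode. For batch inserts, `conn.executemany('INSERT INTO ... VALUES (?)', rows)` followed by a single `conn.commit()` groups the writes into one transaction.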
2.3 Threading Improvements
| Task | Description | File(s) |
|------|-------------|---------|
| Dynamic thread scaling | Adjust thread count based on success rate | proxywatchd.py |
| Priority queue | Test high-value proxies (low fail count) first | proxywatchd.py |
| Stale proxy cleanup | Background thread to remove long-dead proxies | proxywatchd.py |
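
The priority-queue idea (lowest fail count first) can be sketched with the standard library's thread-safe `PriorityQueue`. This is not the queue used in proxywatchd.py; the `enqueue` helper and the sample addresses are made up for illustration.

```python
import itertools

try:
    import queue  # Python 3
except ImportError:
    import Queue as queue  # Python 2.7

# Monotonic tie-breaker: keeps FIFO order among equal fail counts.
_counter = itertools.count()


def enqueue(jobs, proxy, fail_count):
    """Queue a proxy so that lower fail counts are dequeued first."""
    jobs.put((fail_count, next(_counter), proxy))


jobs = queue.PriorityQueue()
enqueue(jobs, '5.6.7.8:1080', fail_count=3)
enqueue(jobs, '1.2.3.4:8080', fail_count=0)
enqueue(jobs, '9.9.9.9:3128', fail_count=0)

order = []
while not jobs.empty():
    order.append(jobs.get()[2])
# order == ['1.2.3.4:8080', '9.9.9.9:3128', '5.6.7.8:1080']
```

The counter in the middle of the tuple means two entries never compare on the proxy payload itself, which matters once the payload becomes an object that does not define ordering.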
Phase 3: Reliability & Accuracy
Objective: Improve proxy validation accuracy and system reliability
3.1 Enhanced Validation
| Task | Description | File(s) |
|------|-------------|---------|
| Latency tracking | Store and use connection latency metrics | proxywatchd.py, dbs.py |
| Geographic validation | Verify proxy actually routes through claimed location | proxywatchd.py |
| Protocol fingerprinting | Better SOCKS4/SOCKS5/HTTP detection | rocksock.py |
| HTTPS/SSL testing | Validate SSL proxy capabilities | proxywatchd.py |
3.2 Target Management
| Task | Description | File(s) |
|------|-------------|---------|
| Dynamic target pool | Auto-discover and rotate validation targets | proxywatchd.py |
| Target health tracking | Remove unresponsive targets from pool | proxywatchd.py |
| Geographic target spread | Ensure targets span multiple regions | config.py |
3.3 Failure Analysis
| Task | Description | File(s) |
|------|-------------|---------|
| Failure categorization | Distinguish timeout vs refused vs auth-fail | proxywatchd.py |
| Retry strategies | Different retry logic per failure type | proxywatchd.py |
| Dead proxy quarantine | Separate storage for likely-dead proxies | dbs.py |
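
Failure categorization at the socket level could work roughly as sketched below. The category names, the `categorize_failure` helper, and the `RETRY_POLICY` values are assumptions for illustration. Authentication failures are reported by the proxy protocol layer (rocksock.py) rather than by socket errors, so they are not covered here.

```python
import errno
import socket


def categorize_failure(exc):
    """Map a socket-level exception to a coarse failure category."""
    if isinstance(exc, socket.timeout):
        return 'timeout'
    code = getattr(exc, 'errno', None)
    if code == errno.ECONNREFUSED:
        return 'refused'
    if code == errno.ECONNRESET:
        return 'reset'
    if code in (errno.EHOSTUNREACH, errno.ENETUNREACH):
        return 'unreachable'
    if code == errno.ETIMEDOUT:
        return 'timeout'
    return 'other'


# Hypothetical per-category retry budget: transient failures get another
# attempt, a refused connection is treated as final for this round.
RETRY_POLICY = {
    'timeout': 2,
    'reset': 1,
    'refused': 0,
    'unreachable': 0,
    'other': 0,
}
```

A worker would call `categorize_failure(exc)` in its `except` block and look up `RETRY_POLICY[category]` to decide whether to requeue the proxy.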
Phase 4: Features & Usability
Objective: Add useful features while maintaining simplicity
4.1 Reporting & Monitoring
| Task | Description | File(s) |
|------|-------------|---------|
| Statistics collection | Track success rates, throughput, latency | proxywatchd.py |
| Periodic status output | Log summary stats every N minutes | ppf.py, proxywatchd.py |
| Export functionality | Export working proxies to file (txt, json) | new: export.py |
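
The txt and json export formats might be produced along these lines. This sketch is not the actual export.py: the `export_proxies` name and the record shape (a dict with `proto`, `ip` and `port` keys) are hypothetical, and the real schema lives in dbs.py.

```python
import json


def export_proxies(proxies, fmt='txt'):
    """Serialize proxy records as plain text (one per line) or JSON.

    Each record is a dict with 'proto', 'ip' and 'port' keys
    (an assumed shape, chosen only for this example).
    """
    if fmt == 'txt':
        return ''.join('%s://%s:%d\n' % (p['proto'], p['ip'], p['port'])
                       for p in proxies)
    if fmt == 'json':
        return json.dumps(proxies, indent=2, sort_keys=True)
    raise ValueError('unsupported export format: %r' % (fmt,))
```

Returning a string instead of writing to a file keeps the formatting logic testable; the caller decides whether the result goes to stdout or to disk.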
4.2 Configuration
| Task | Description | File(s) |
|------|-------------|---------|
| Config validation | Validate config.ini on startup | config.py |
| Runtime reconfiguration | Reload config without restart (SIGHUP) | proxywatchd.py |
| Sensible defaults | Document and improve default values | config.py |
4.3 Proxy Source Expansion
| Task | Description | File(s) |
|------|-------------|---------|
| Additional scrapers | Support more search engines beyond Searx | scraper.py |
| API sources | Integrate free proxy API endpoints | new: api_sources.py |
| Import formats | Support various proxy list formats | ppf.py |
Implementation Priority
Completed Work
Multi-Target Validation (Done)
Code Cleanup (Done)
Database Optimization (Done)
Dependency Reduction (Done)
Rate Limiting & Stability (Done)
Monitoring & Maintenance (Done)
Connection Pooling (Done)
Priority Queue (Done)
Dynamic Thread Scaling (Done)
Latency Tracking (Done)
Container Support (Done)
Code Style (Done)
Geographic Validation (Done)
SSL Proxy Testing (Done)
Export Functionality (Done)
Web Dashboard (Done)
Dashboard Enhancements v2 (Done)
- Prominent check type badge in header (SSL/JUDGES/HTTP/IRC)
- System monitor bar: load, memory, disk, process RSS
- Anonymity breakdown: elite/anonymous/transparent counts
- Database health: size, tested/hour, added/day, dead count
- Enhanced Tor pool stats: requests, success rate, healthy nodes, latency
- SQLite ANALYZE/VACUUM functions for query optimization
- Lightweight design: client-side polling, minimal DOM updates
Technical Debt
| Item | Description | Risk |
|------|-------------|------|
| Dual _known_proxies | ppf.py and fetch.py maintain separate caches | Resolved |
| Global config in fetch.py | set_config() pattern is fragile | Low - works but not clean |
| No input validation | Proxy strings parsed without validation | Resolved |
| Silent exception catching | Some except: pass patterns hide errors | Resolved |
| Hardcoded timeouts | Various timeout values scattered in code | Resolved |
File Reference
| File | Purpose | Status |
|------|---------|--------|
| ppf.py | Main URL harvester daemon | Active, cleaned |
| proxywatchd.py | Proxy validation daemon | Active, enhanced |
| scraper.py | Searx search integration | Active, cleaned |
| fetch.py | HTTP fetching with proxy support | Active |
| dbs.py | Database schema and inserts | Active |
| mysqlite.py | SQLite wrapper | Active |
| rocksock.py | Socket/proxy abstraction (3rd party) | Stable |
| http2.py | HTTP client implementation | Stable |
| config.py | Configuration management | Active |
| comboparse.py | Config/arg parser framework | Stable, cleaned |
| soup_parser.py | BeautifulSoup wrapper | Stable, cleaned |
| misc.py | Utilities (timestamp, logging) | Stable, cleaned |
| export.py | Proxy export CLI tool | Active |