PPF Project Roadmap
Project Purpose
PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework designed to:
- Discover proxy addresses by crawling websites and search engines
- Validate proxies through multi-target testing via Tor
- Maintain a database of working proxies with protocol detection (SOCKS4/SOCKS5/HTTP)
Architecture Overview
Constraints
- Python 2.7 compatibility required
- Minimal external dependencies (avoid adding new modules)
- Current dependencies: beautifulsoup4
- Optional: IP2Location (for proxy geolocation)
Phase 1: Stability & Code Quality
Objective: Establish a solid, maintainable codebase
1.1 Error Handling Improvements
| Task | Description | File(s) |
|------|-------------|---------|
| Add connection retry logic | Implement exponential backoff for failed connections | rocksock.py, fetch.py |
| Graceful database errors | Handle SQLite lock/busy errors with retry | mysqlite.py |
| Timeout standardization | Consistent timeout handling across all network ops | proxywatchd.py, fetch.py |
| Exception logging | Log exceptions with context instead of silently catching them | all files |
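
The retry task above could look roughly like the following. This is a minimal sketch, not code from the project: the helper name `retry_with_backoff`, its parameters, and the default exception tuple are all assumptions, and the real modules may raise their own exception types.

```python
import random
import time


def retry_with_backoff(func, retries=4, base_delay=0.5, max_delay=30.0,
                       exceptions=(IOError, OSError), sleep=time.sleep):
    """Call func() until it succeeds or `retries` extra attempts are used up.

    The wait before retry n (0-based) is base_delay * 2**n, capped at
    max_delay, plus up to 10% random jitter so that many workers do not
    retry in lockstep. All names and defaults here are hypothetical.
    """
    attempt = 0
    while True:
        try:
            return func()
        except exceptions:
            if attempt >= retries:
                raise  # out of attempts: let the caller see the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay + random.uniform(0, delay * 0.1))
            attempt += 1
```

The same helper might also cover the SQLite lock/busy case by passing `exceptions=(sqlite3.OperationalError,)`, with the caveat that `OperationalError` covers more than lock contention, so a real version would probably inspect the error message before retrying.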
1.2 Code Consolidation
| Task | Description | File(s) |
|------|-------------|---------|
| Unify _known_proxies | Single source of truth for known proxy cache | ppf.py, fetch.py |
| Extract proxy utils | Create proxy_utils.py with cleanse/validate functions | new file |
| Remove global config pattern | Pass config explicitly instead of set_config() | fetch.py |
| Standardize logging | Consistent _log() usage with levels across all modules | all files |
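
As an illustration of what the proposed proxy_utils.py might contain, here is a small validation sketch. The function names `parse_proxy` and `is_public` are hypothetical, and the sketch assumes proxies arrive as IPv4 `a.b.c.d:port` strings.

```python
import re

_PROXY_RE = re.compile(r"^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}):(\d{1,5})$")


def parse_proxy(text):
    """Return (ip, port) for a well-formed 'a.b.c.d:port' string, else None."""
    match = _PROXY_RE.match(text.strip())
    if match is None:
        return None
    octets = [int(g) for g in match.groups()[:4]]
    port = int(match.group(5))
    if any(o > 255 for o in octets) or not 1 <= port <= 65535:
        return None
    # Rebuild the address from the integers so leading zeros are dropped.
    return ".".join(str(o) for o in octets), port


def is_public(ip):
    """Reject addresses that cannot be public proxies (private, loopback, ...)."""
    a, b = [int(x) for x in ip.split(".")[:2]]
    if a in (0, 10, 127) or a >= 224:
        return False  # "this network", private, loopback, multicast/reserved
    if a == 172 and 16 <= b <= 31:
        return False  # private 172.16.0.0/12
    if a == 192 and b == 168:
        return False  # private 192.168.0.0/16
    if a == 169 and b == 254:
        return False  # link-local
    return True
```

Returning `None` for bad input, rather than raising, lets a caller filter a scraped list in one pass without wrapping every item in `try`/`except`.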
1.3 Testing Infrastructure
| Task | Description | File(s) |
|------|-------------|---------|
| Add unit tests | Test proxy parsing, URL extraction, IP validation | tests/ |
| Mock network layer | Allow testing without live network/Tor | tests/ |
| Validation test suite | Verify multi-target voting logic | tests/ |
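
One way the mocked network layer and a voting test could fit together is sketched below. The voting rule here (`majority_ok`, a simple majority) and the `FakeChecker` class are assumptions made for illustration; the actual multi-target logic in proxywatchd.py may differ.

```python
import unittest


def majority_ok(results, min_ratio=0.5):
    """Hypothetical voting rule: pass if more than min_ratio of targets succeeded."""
    if not results:
        return False
    passed = sum(1 for ok in results if ok)
    return passed > len(results) * min_ratio


class FakeChecker(object):
    """Network stand-in that returns a canned outcome per target."""

    def __init__(self, outcomes):
        self.outcomes = outcomes

    def check(self, proxy, target):
        return self.outcomes[target]


def validate(proxy, targets, checker):
    """Ask the checker about every target and apply the voting rule."""
    return majority_ok([checker.check(proxy, t) for t in targets])


class VotingTest(unittest.TestCase):
    TARGETS = ["t1", "t2", "t3"]

    def test_two_of_three_passes(self):
        checker = FakeChecker({"t1": True, "t2": True, "t3": False})
        self.assertTrue(validate("1.2.3.4:1080", self.TARGETS, checker))

    def test_one_of_three_fails(self):
        checker = FakeChecker({"t1": True, "t2": False, "t3": False})
        self.assertFalse(validate("1.2.3.4:1080", self.TARGETS, checker))

    def test_no_targets_fails(self):
        self.assertFalse(validate("1.2.3.4:1080", [], FakeChecker({})))
```

Because the checker is passed in rather than imported, the tests never touch the network or Tor. Saved under tests/, a file like this can be run with `python -m unittest discover tests`.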
Phase 2: Performance Optimization
Objective: Improve throughput and resource efficiency
2.1 Connection Pooling
| Task | Description | File(s) |
|------|-------------|---------|
| Tor connection reuse | Pool Tor SOCKS connections instead of reconnecting | proxywatchd.py |
| HTTP keep-alive | Reuse connections to same target servers | http2.py |
| Connection warm-up | Pre-establish connections before job assignment | proxywatchd.py |
2.2 Database Optimization
| Task | Description | File(s) |
|------|-------------|---------|
| Batch inserts | Group INSERT operations (already partly implemented) | dbs.py |
| Index optimization | Add indexes for frequent query patterns | dbs.py |
| WAL mode | Enable Write-Ahead Logging for better concurrency | mysqlite.py |
| Prepared statements | Cache compiled SQL statements | mysqlite.py |
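
The database tasks above map onto a few calls in the standard `sqlite3` module. The sketch below is illustrative only: the `proxylist` table, its columns, and the index name are hypothetical, and the real schema lives in dbs.py.

```python
import sqlite3


def open_db(path):
    """Open a SQLite database and switch it to write-ahead logging."""
    conn = sqlite3.connect(path, timeout=10.0)  # wait up to 10 s on a locked db
    # The pragma returns the journal mode actually in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return conn, mode


def create_schema(conn):
    # Hypothetical table, used only for this example.
    conn.execute("CREATE TABLE IF NOT EXISTS proxylist ("
                 "proxy TEXT PRIMARY KEY, failed INTEGER DEFAULT 0, "
                 "tested INTEGER DEFAULT 0)")
    # Index matching a "fewest failures, least recently tested first" query.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_proxylist_failed_tested "
                 "ON proxylist (failed, tested)")
    conn.commit()


def insert_proxies(conn, proxies):
    """Insert many proxies in one transaction; duplicates are ignored."""
    conn.executemany("INSERT OR IGNORE INTO proxylist (proxy) VALUES (?)",
                     [(p,) for p in proxies])
    conn.commit()
```

WAL mode only applies to file-backed databases, which is why `open_db` returns the mode SQLite reports instead of assuming success. On prepared statements, `sqlite3` connections already keep an internal statement cache (sized by the `cached_statements` argument to `connect`), so reusing identical SQL strings with `?` placeholders may cover much of that task.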
2.3 Threading Improvements
| Task | Description | File(s) |
|------|-------------|---------|
| Dynamic thread scaling | Adjust thread count based on success rate | proxywatchd.py |
| Priority queue | Test high-value proxies (low fail count) first | proxywatchd.py |
| Stale proxy cleanup | Background thread to remove long-dead proxies | proxywatchd.py |
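
The priority-queue idea can be sketched with `heapq`, ordering jobs by fail count so the most reliable proxies are tested first. The class name `ProxyQueue` and its methods are hypothetical, and this version is not thread-safe.

```python
import heapq
import itertools


class ProxyQueue(object):
    """Min-heap of proxies keyed by fail count (lowest first)."""

    def __init__(self):
        self._heap = []
        # Monotonic counter: breaks ties so equal fail counts stay FIFO.
        self._counter = itertools.count()

    def push(self, proxy, fail_count):
        heapq.heappush(self._heap, (fail_count, next(self._counter), proxy))

    def pop(self):
        """Remove and return the proxy with the lowest fail count."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

A short usage example:

```python
queue = ProxyQueue()
queue.push("10.0.0.1:1080", 3)
queue.push("10.0.0.2:1080", 0)
first = queue.pop()  # the proxy pushed with fail count 0
```

For use across worker threads, the standard library's `PriorityQueue` (in the `Queue` module on Python 2, `queue` on Python 3) provides the same ordering with locking built in.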
Phase 3: Reliability & Accuracy
Objective: Improve proxy validation accuracy and system reliability
3.1 Enhanced Validation
| Task | Description | File(s) |
|------|-------------|---------|
| Latency tracking | Store and use connection latency metrics | proxywatchd.py, dbs.py |
| Geographic validation | Verify proxy actually routes through claimed location | proxywatchd.py |
| Protocol fingerprinting | Better SOCKS4/SOCKS5/HTTP detection | rocksock.py |
| HTTPS/SSL testing | Validate SSL proxy capabilities | proxywatchd.py |
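
For latency tracking, one simple option is to time each check and fold the sample into an exponentially weighted moving average, so a single stored number per proxy reflects recent behaviour. The helpers `timed` and `update_avg` and the smoothing factor are assumptions, not part of the existing code.

```python
import time


def timed(func, clock=time.time):
    """Run func() and return (result, elapsed_seconds)."""
    start = clock()
    result = func()
    return result, clock() - start


def update_avg(prev_avg, sample, alpha=0.3):
    """Exponentially weighted moving average; alpha weights the new sample."""
    if prev_avg is None:
        return sample  # first measurement seeds the average
    return alpha * sample + (1 - alpha) * prev_avg
```

An average like this needs only one extra column per proxy, whereas percentiles would require storing individual samples.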
3.2 Target Management
| Task | Description | File(s) |
|------|-------------|---------|
| Dynamic target pool | Auto-discover and rotate validation targets | proxywatchd.py |
| Target health tracking | Remove unresponsive targets from pool | proxywatchd.py |
| Geographic target spread | Ensure targets span multiple regions | config.py |
3.3 Failure Analysis
| Task | Description | File(s) |
|------|-------------|---------|
| Failure categorization | Distinguish timeout vs refused vs auth-fail | proxywatchd.py |
| Retry strategies | Different retry logic per failure type | proxywatchd.py |
| Dead proxy quarantine | Separate storage for likely-dead proxies | dbs.py |
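
Failure categorization at the socket level could start from the exception type and `errno`, as in the sketch below. The category names and the `RETRY_POLICY` values are hypothetical. Authentication failures happen inside the SOCKS/HTTP handshake and would have to come from the proxy library's own error reporting, which is not modelled here.

```python
import errno
import socket


def classify_failure(exc):
    """Map a socket-level exception to a coarse failure category."""
    if isinstance(exc, socket.timeout):
        return "timeout"
    code = getattr(exc, "errno", None)
    if code == errno.ETIMEDOUT:
        return "timeout"
    if code == errno.ECONNREFUSED:
        return "refused"
    if code in (errno.ECONNRESET, errno.EPIPE):
        return "reset"
    if code in (errno.EHOSTUNREACH, errno.ENETUNREACH):
        return "unreachable"
    return "other"


# Hypothetical policy: how many immediate retries each category deserves.
RETRY_POLICY = {
    "timeout": 2,      # may be transient congestion
    "reset": 1,
    "refused": 0,      # port closed: retrying right away is unlikely to help
    "unreachable": 0,
    "other": 0,
}
```

Keeping the classification separate from the policy table means the retry numbers can later move into configuration without touching the detection code.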
Phase 4: Features & Usability
Objective: Add useful features while maintaining simplicity
4.1 Reporting & Monitoring
| Task | Description | File(s) |
|------|-------------|---------|
| Statistics collection | Track success rates, throughput, latency | proxywatchd.py |
| Periodic status output | Log summary stats every N minutes | ppf.py, proxywatchd.py |
| Export functionality | Export working proxies to file (txt, json) | new: export.py |
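
A first cut of the proposed export.py might be as small as the following. The entry layout (dicts with `proto`, `ip`, and `port` keys) and the function names are assumptions; in practice the rows would come from the proxy database.

```python
import json


def format_proxy(entry):
    """Render one entry as 'proto://ip:port'."""
    return "%s://%s:%d" % (entry["proto"], entry["ip"], entry["port"])


def export_txt(entries, path):
    """Write one proxy URL per line."""
    with open(path, "w") as handle:
        for entry in entries:
            handle.write(format_proxy(entry) + "\n")


def export_json(entries, path):
    """Write the entries as a JSON array with stable key order."""
    with open(path, "w") as handle:
        json.dump(list(entries), handle, indent=2, sort_keys=True)
        handle.write("\n")
```

Both writers use only the standard library, in line with the constraint on new dependencies.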
4.2 Configuration
| Task | Description | File(s) |
|------|-------------|---------|
| Config validation | Validate config.ini on startup | config.py |
| Runtime reconfiguration | Reload config without restart (SIGHUP) | proxywatchd.py |
| Sensible defaults | Document and improve default values | config.py |
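
Startup validation could be driven by a small rule table, as sketched here with the standard library's config parser. The section and option names in `RULES` and their ranges are invented for the example, and the project's own comboparse.py may expose values differently.

```python
try:
    import configparser  # Python 3
except ImportError:
    import ConfigParser as configparser  # Python 2

# Hypothetical rules: (section, option) -> (type, minimum, maximum)
RULES = {
    ("watchd", "threads"): (int, 1, 1000),
    ("watchd", "timeout"): (float, 0.1, 300.0),
}


def validate_config(parser):
    """Return a list of human-readable problems; empty means the config is fine."""
    errors = []
    for (section, option), (cast, low, high) in sorted(RULES.items()):
        if not parser.has_option(section, option):
            errors.append("missing [%s] %s" % (section, option))
            continue
        raw = parser.get(section, option)
        try:
            value = cast(raw)
        except ValueError:
            errors.append("[%s] %s: %r is not a valid %s"
                          % (section, option, raw, cast.__name__))
            continue
        if not low <= value <= high:
            errors.append("[%s] %s: %s outside %s..%s"
                          % (section, option, value, low, high))
    return errors
```

Collecting all problems before exiting lets the operator fix the whole file in one pass. The same function could be called again from a SIGHUP-triggered reload, keeping the old settings if the new file fails validation.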
4.3 Proxy Source Expansion
| Task | Description | File(s) |
|------|-------------|---------|
| Additional scrapers | Support more search engines beyond Searx | scraper.py |
| API sources | Integrate free proxy API endpoints | new: api_sources.py |
| Import formats | Support various proxy list formats | ppf.py |
Implementation Priority
Completed Work
Multi-Target Validation (Done)
Code Cleanup (Done)
Database Optimization (Done)
Dependency Reduction (Done)
Technical Debt
| Item | Description | Risk |
|------|-------------|------|
| Dual _known_proxies | ppf.py and fetch.py maintain separate caches | Medium - duplicates possible |
| Global config in fetch.py | set_config() pattern is fragile | Low - works but not clean |
| No input validation | Proxy strings parsed without validation | Medium - could crash on bad data |
| Silent exception catching | Some except: pass patterns hide errors | High - hard to debug |
| Hardcoded timeouts | Various timeout values scattered in code | Low - works but not configurable |
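
For the highest-risk item, silent exception catching, the change is usually mechanical: catch a specific exception and log it with enough context to trace the input. The sketch below uses the standard `logging` module as a stand-in for the project's own `_log()` helper, whose signature is not shown in this document. The function `parse_port` and the logger name are hypothetical.

```python
import logging

log = logging.getLogger("ppf.sketch")


def parse_port(text, source):
    """Return the port as an int, or None after logging why parsing failed."""
    try:
        return int(text)
    except ValueError:
        # Record the bad value, where it came from, and the traceback,
        # instead of swallowing the error with a bare `except: pass`.
        log.warning("bad port %r from %s", text, source, exc_info=True)
        return None
```

Catching `ValueError` rather than everything also keeps unrelated bugs, such as a `NameError` from a typo, from being hidden by the same handler.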
File Reference
| File | Purpose | Status |
|------|---------|--------|
| ppf.py | Main URL harvester daemon | Active, cleaned |
| proxywatchd.py | Proxy validation daemon | Active, enhanced |
| scraper.py | Searx search integration | Active, cleaned |
| fetch.py | HTTP fetching with proxy support | Active |
| dbs.py | Database schema and inserts | Active |
| mysqlite.py | SQLite wrapper | Active |
| rocksock.py | Socket/proxy abstraction (3rd party) | Stable |
| http2.py | HTTP client implementation | Stable |
| config.py | Configuration management | Active |
| comboparse.py | Config/arg parser framework | Stable, cleaned |
| soup_parser.py | BeautifulSoup wrapper | Stable, cleaned |
| misc.py | Utilities (timestamp, logging) | Stable, cleaned |