# PPF Project Roadmap

## Project Purpose

PPF (Proxy Fetcher) is a Python 2.7 proxy scraping and validation framework designed to:

- Discover proxy addresses by crawling websites and search engines
- Validate proxies through multi-target testing via Tor
- Maintain a database of working proxies with protocol detection (SOCKS4/SOCKS5/HTTP)

## Architecture Overview
```
┌─────────────────────────────────────────────────────────────┐
│                      PPF Architecture                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌───────────────┐   ┌───────────────┐   ┌───────────────┐  │
│  │ scraper.py    │   │ ppf.py        │   │ proxywatchd   │  │
│  │               │   │               │   │               │  │
│  │ Searx query   │──>│ URL harvest   │──>│ Proxy test    │  │
│  │ URL finding   │   │ Proxy extract │   │ Validation    │  │
│  └───────┬───────┘   └───────┬───────┘   └───────┬───────┘  │
│          │                   │                   │          │
│          v                   v                   v          │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                   SQLite Databases                    │  │
│  │     uris.db (URLs)        proxies.db (proxy list)     │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                     Network Layer                     │  │
│  │ rocksock.py ── Tor SOCKS ── Test Proxy ── Target Srv  │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
## Constraints
- Python 2.7 compatibility required
- Minimal external dependencies (avoid adding new modules)
- Current dependencies: beautifulsoup4, pyasn, IP2Location
- Data files: IP2LOCATION-LITE-DB1.BIN (country), ipasn.dat (ASN)
## Phase 1: Stability & Code Quality

**Objective:** Establish a solid, maintainable codebase.

### 1.1 Error Handling Improvements
| Task | Description | File(s) |
|---|---|---|
| Add connection retry logic | Implement exponential backoff for failed connections | rocksock.py, fetch.py |
| Graceful database errors | Handle SQLite lock/busy errors with retry | mysqlite.py |
| Timeout standardization | Consistent timeout handling across all network ops | proxywatchd.py, fetch.py |
| Exception logging | Log exceptions with context, not just silently catch | all files |
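The retry row above can be sketched as a small helper. This is a minimal sketch, not the rocksock.py API: the name `retry_with_backoff`, the retryable exception set, and the jitter factor are all assumptions.

```python
import random
import time


def retry_with_backoff(func, attempts=4, base=1.0, cap=30.0, sleep=time.sleep):
    """Call func(), retrying on IOError/OSError with exponential backoff.

    The delay doubles per attempt (base, 2*base, 4*base, ...) capped at
    `cap`, plus a little jitter so many workers do not retry in lockstep.
    The last failure is re-raised once attempts are exhausted.
    """
    for attempt in range(attempts):
        try:
            return func()
        except (IOError, OSError):
            if attempt == attempts - 1:
                raise  # out of retries, propagate the last error
            delay = min(cap, base * (2 ** attempt))
            sleep(delay + random.uniform(0, delay * 0.1))
```

Passing `sleep` as a parameter keeps the helper testable without real delays.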
### 1.2 Code Consolidation
| Task | Description | File(s) |
|---|---|---|
| Unify _known_proxies | Single source of truth for known proxy cache | ppf.py, fetch.py |
| Extract proxy utils | Create proxy_utils.py with cleanse/validate functions | new file |
| Remove global config pattern | Pass config explicitly instead of set_config() | fetch.py |
| Standardize logging | Consistent _log() usage with levels across all modules | all files |
### 1.3 Testing Infrastructure
| Task | Description | File(s) |
|---|---|---|
| Add unit tests | Test proxy parsing, URL extraction, IP validation | tests/ |
| Mock network layer | Allow testing without live network/Tor | tests/ |
| Validation test suite | Verify multi-target voting logic | tests/ |
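A unit-testable proxy parser might look like the sketch below. The function `parse_proxy_line` and its regex are hypothetical; the real extraction lives in ppf.py's extract_proxies() and may behave differently.

```python
import re

# Plausible "ip:port" pattern; the regex alone cannot reject octets > 255
# or ports > 65535, so those are checked explicitly afterwards.
_PROXY_RE = re.compile(r'\b(\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})\b')


def parse_proxy_line(line):
    """Return (ip, port) if the line holds a plausible proxy, else None."""
    m = _PROXY_RE.search(line)
    if not m:
        return None
    ip, port = m.group(1), int(m.group(2))
    if port < 1 or port > 65535:
        return None
    if any(int(octet) > 255 for octet in ip.split('.')):
        return None
    return ip, port
```

Pure functions like this need no network or Tor, which is exactly what the mock-network-layer task is after.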
## Phase 2: Performance Optimization

**Objective:** Improve throughput and resource efficiency.

### 2.1 Connection Pooling
| Task | Description | File(s) |
|---|---|---|
| Tor connection reuse | Pool Tor SOCKS connections instead of reconnecting | proxywatchd.py |
| HTTP keep-alive | Reuse connections to same target servers | http2.py |
| Connection warm-up | Pre-establish connections before job assignment | proxywatchd.py |
### 2.2 Database Optimization
| Task | Description | File(s) |
|---|---|---|
| Batch inserts | Group INSERT operations (already partial) | dbs.py |
| Index optimization | Add indexes for frequent query patterns | dbs.py |
| WAL mode | Enable Write-Ahead Logging for better concurrency | mysqlite.py |
| Prepared statements | Cache compiled SQL statements | mysqlite.py |
### 2.3 Threading Improvements
| Task | Description | File(s) |
|---|---|---|
| Dynamic thread scaling | Adjust thread count based on success rate | proxywatchd.py |
| Priority queue | Test high-value proxies (low fail count) first | proxywatchd.py |
| Stale proxy cleanup | Background thread to remove long-dead proxies | proxywatchd.py |
## Phase 3: Reliability & Accuracy

**Objective:** Improve proxy validation accuracy and system reliability.

### 3.1 Enhanced Validation
| Task | Description | File(s) |
|---|---|---|
| Latency tracking | Store and use connection latency metrics | proxywatchd.py, dbs.py |
| Geographic validation | Verify proxy actually routes through claimed location | proxywatchd.py |
| Protocol fingerprinting | Better SOCKS4/SOCKS5/HTTP detection | rocksock.py |
| HTTPS/SSL testing | Validate SSL proxy capabilities | proxywatchd.py |
### 3.2 Target Management
| Task | Description | File(s) |
|---|---|---|
| Dynamic target pool | Auto-discover and rotate validation targets | proxywatchd.py |
| Target health tracking | Remove unresponsive targets from pool | proxywatchd.py |
| Geographic target spread | Ensure targets span multiple regions | config.py |
### 3.3 Failure Analysis
| Task | Description | File(s) |
|---|---|---|
| Failure categorization | Distinguish timeout vs refused vs auth-fail | proxywatchd.py |
| Retry strategies | Different retry logic per failure type | proxywatchd.py |
| Dead proxy quarantine | Separate storage for likely-dead proxies | dbs.py |
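The timeout/refused/closed distinction above can be sketched with stdlib errno codes. The category names mirror those used by the dashboard's failure breakdown, but the mapping itself is an assumption.

```python
import errno
import socket


def categorize_failure(exc):
    """Map a connection exception to a coarse failure category so retry
    logic can differ per type (e.g. retry timeouts, quarantine refused)."""
    if isinstance(exc, socket.timeout):
        return "timeout"
    if isinstance(exc, socket.error):  # OSError on modern Pythons
        code = getattr(exc, "errno", None)
        if code == errno.ECONNREFUSED:
            return "refused"
        if code == errno.ECONNRESET:
            return "closed"
    return "other"
```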
## Phase 4: Features & Usability

**Objective:** Add useful features while maintaining simplicity.

### 4.1 Reporting & Monitoring
| Task | Description | File(s) |
|---|---|---|
| Statistics collection | Track success rates, throughput, latency | proxywatchd.py |
| Periodic status output | Log summary stats every N minutes | ppf.py, proxywatchd.py |
| Export functionality | Export working proxies to file (txt, json) | new: export.py |
### 4.2 Configuration
| Task | Description | File(s) |
|---|---|---|
| Config validation | Validate config.ini on startup | config.py |
| Runtime reconfiguration | Reload config without restart (SIGHUP) | proxywatchd.py |
| Sensible defaults | Document and improve default values | config.py |
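Startup validation could follow this shape. The option names checked here (`threads`, `timeout_connect`, `stale_days`) are borrowed from options mentioned elsewhere in this roadmap; the bounds and the dict-based interface are illustrative, not the config.py API.

```python
def validate_config(cfg):
    """Return a list of human-readable problems with a config mapping;
    an empty list means the config looks sane. Missing keys are allowed
    and fall back to defaults."""
    problems = []
    for key, lo, hi in (("threads", 1, 512),
                        ("timeout_connect", 1, 300),
                        ("stale_days", 1, 365)):
        raw = cfg.get(key)
        if raw is None:
            continue  # absent: use the documented default
        try:
            val = int(raw)
        except (TypeError, ValueError):
            problems.append("%s: not an integer (%r)" % (key, raw))
            continue
        if not (lo <= val <= hi):
            problems.append("%s: %d outside [%d, %d]" % (key, val, lo, hi))
    return problems
```

Reporting all problems at once, instead of bailing at the first, makes a bad config.ini fixable in one pass.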
### 4.3 Proxy Source Expansion
| Task | Description | File(s) |
|---|---|---|
| Additional scrapers | Support more search engines beyond Searx | scraper.py |
| API sources | Integrate free proxy API endpoints | new: api_sources.py |
| Import formats | Support various proxy list formats | ppf.py |
## Implementation Priority
```
┌─────────────────────────────────────────────────────────────────────────────┐
│                               Priority Matrix                               │
├──────────────────────────┬──────────────────────────────────────────────────┤
│ HIGH IMPACT / LOW EFFORT │ HIGH IMPACT / HIGH EFFORT                        │
│                          │                                                  │
│ [x] Unify _known_proxies │ [x] Connection pooling                           │
│ [x] Graceful DB errors   │ [x] Dynamic thread scaling                       │
│ [x] Batch inserts        │ [ ] Unit test infrastructure                     │
│ [x] WAL mode for SQLite  │ [x] Latency tracking                             │
│                          │                                                  │
├──────────────────────────┼──────────────────────────────────────────────────┤
│ LOW IMPACT / LOW EFFORT  │ LOW IMPACT / HIGH EFFORT                         │
│                          │                                                  │
│ [x] Standardize logging  │ [x] Geographic validation                        │
│ [x] Config validation    │ [x] Additional scrapers                          │
│ [x] Export functionality │ [ ] API sources                                  │
│ [x] Status output        │ [ ] Protocol fingerprinting                      │
│                          │                                                  │
└──────────────────────────┴──────────────────────────────────────────────────┘
```
## Completed Work

### Multi-Target Validation (Done)
- Work-stealing queue with shared Queue.Queue()
- Multi-target validation (2/3 majority voting)
- Interleaved testing (jobs shuffled across proxies)
- ProxyTestState and TargetTestJob classes
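The 2/3 majority voting described above reduces to a small pure function. This is a sketch of the verdict step only; the real code tracks per-target results in ProxyTestState, whose internals are not shown here.

```python
def majority_verdict(results, quorum=2):
    """Combine per-target test results into one pass/fail verdict.

    `results` holds one boolean per validation target; the proxy passes
    only if at least `quorum` targets succeeded (2-of-3 by default).
    This makes a single flaky target unable to pass or fail a proxy
    on its own.
    """
    return sum(1 for ok in results if ok) >= quorum
```

For example, a proxy that worked against two of three targets passes, while one lone success does not.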
### Code Cleanup (Done)
- Removed dead HTTP server code from ppf.py
- Removed dead gumbo code from soup_parser.py
- Removed test code from comboparse.py
- Removed unused functions from misc.py
- Fixed IP/port cleansing in ppf.py extract_proxies()
- Updated .gitignore, removed .pyc files
### Database Optimization (Done)
- Enable SQLite WAL mode for better concurrency
- Add indexes for common query patterns (failed, tested, proto, error, check_time)
- Optimize batch inserts (remove redundant SELECT before INSERT OR IGNORE)
### Dependency Reduction (Done)
- Make lxml optional (removed from requirements)
- Make IP2Location optional (graceful fallback)
- Add --nobs flag for stdlib HTMLParser fallback (bs4 optional)
### Rate Limiting & Stability (Done)
- InstanceTracker class in scraper.py with exponential backoff
- Configurable backoff_base, backoff_max, fail_threshold
- Exception logging with context (replaced bare except blocks)
- Unified _known_proxies cache in fetch.py
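The InstanceTracker behaviour above might be sketched like this. The parameter names (`backoff_base`, `backoff_max`, `fail_threshold`) follow the config options listed, but the exact semantics here are assumed; the real class lives in scraper.py.

```python
import time


class InstanceTracker(object):
    """Per-instance rate limiting: once failures reach fail_threshold,
    the instance is blocked for an exponentially growing delay, capped
    at backoff_max. A success resets everything."""

    def __init__(self, backoff_base=2.0, backoff_max=300.0, fail_threshold=3):
        self.backoff_base = backoff_base
        self.backoff_max = backoff_max
        self.fail_threshold = fail_threshold
        self.fails = 0
        self.blocked_until = 0.0

    def record_failure(self, now=None):
        now = time.time() if now is None else now
        self.fails += 1
        if self.fails >= self.fail_threshold:
            over = self.fails - self.fail_threshold
            delay = min(self.backoff_max, self.backoff_base * (2 ** over))
            self.blocked_until = now + delay

    def record_success(self):
        self.fails = 0
        self.blocked_until = 0.0

    def usable(self, now=None):
        now = time.time() if now is None else now
        return now >= self.blocked_until
```

Injecting `now` keeps the backoff schedule testable without sleeping.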
### Monitoring & Maintenance (Done)
- Stats class in proxywatchd.py (tested/passed/failed tracking)
- Periodic stats reporting (configurable stats_interval)
- Stale proxy cleanup (cleanup_stale() with configurable stale_days)
- Timeout config options (timeout_connect, timeout_read)
### Connection Pooling (Done)
- TorHostState class tracking per-host health and latency
- TorConnectionPool with worker affinity for circuit reuse
- Exponential backoff (5s, 10s, 20s, 40s, max 60s) on failures
- Pool warmup and health status reporting
### Priority Queue (Done)
- PriorityJobQueue class with heap-based ordering
- calculate_priority() assigns priority 0-4 by proxy state
- New proxies tested first, high-fail proxies last
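A heap-based queue with the 0-4 priority scheme described above can be sketched as follows; the exact `calculate_priority()` mapping is an assumption based on the bullets, not the proxywatchd.py code.

```python
import heapq
import itertools


class PriorityJobQueue(object):
    """Heap-backed job queue: lower priority number is served first.
    A monotonic counter breaks ties so equal-priority jobs stay FIFO
    (and unorderable job objects are never compared directly)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def put(self, priority, job):
        heapq.heappush(self._heap, (priority, next(self._counter), job))

    def get(self):
        return heapq.heappop(self._heap)[2]


def calculate_priority(failcount, tested):
    """Assumed mapping onto the 0-4 scheme: never-tested proxies first,
    heavily failing proxies last."""
    if not tested:
        return 0
    return min(4, 1 + failcount)
```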
### Dynamic Thread Scaling (Done)
- ThreadScaler class adjusts thread count dynamically
- Scales up when queue deep and success rate acceptable
- Scales down when queue shallow or success rate drops
- Respects min/max bounds with cooldown period
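The scale-up/scale-down rule above reduces to a small decision step; thresholds and step size here are illustrative, and the cooldown period the bullets mention is omitted for brevity.

```python
class ThreadScaler(object):
    """Adjust a worker-thread count: grow when the queue is deep and the
    success rate is acceptable, shrink when the queue is shallow or the
    success rate collapses, always within [min_threads, max_threads]."""

    def __init__(self, min_threads=4, max_threads=64, step=4):
        self.min_threads = min_threads
        self.max_threads = max_threads
        self.step = step
        self.threads = min_threads

    def adjust(self, queue_depth, success_rate):
        if queue_depth > 100 and success_rate >= 0.05:
            self.threads = min(self.max_threads, self.threads + self.step)
        elif queue_depth < 10 or success_rate < 0.01:
            self.threads = max(self.min_threads, self.threads - self.step)
        return self.threads
```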
### Latency Tracking (Done)
- avg_latency, latency_samples columns in proxylist
- Exponential moving average calculation
- Migration function for existing databases
- Latency recorded for successful proxy tests
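The exponential-moving-average update for the `avg_latency` / `latency_samples` columns can be sketched as below; the smoothing factor `alpha=0.3` is an assumption, not the value used in dbs.py.

```python
def update_latency(avg_latency, latency_samples, new_sample, alpha=0.3):
    """Return the updated (avg_latency, latency_samples) pair.

    The first sample seeds the average; afterwards the newest sample is
    weighted by alpha, so old measurements decay instead of dominating.
    """
    if latency_samples == 0:
        return new_sample, 1
    avg = alpha * new_sample + (1.0 - alpha) * avg_latency
    return avg, latency_samples + 1
```

An EMA needs only two columns per proxy, which is why it fits the schema above better than storing raw sample history.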
### Container Support (Done)
- Dockerfile with Python 2.7-slim base
- docker-compose.yml for local development
- Rootless podman deployment documentation
- Volume mounts for persistent data
### Code Style (Done)
- Normalized indentation (4-space, no tabs)
- Removed dead code and unused imports
- Added docstrings to classes and functions
- Python 2/3 compatible imports (Queue/queue)
### Geographic Validation (Done)
- IP2Location integration for country lookup
- pyasn integration for ASN lookup
- Graceful fallback when database files missing
- Country codes displayed in test output: (US), (IN), etc.
- Data files: IP2LOCATION-LITE-DB1.BIN, ipasn.dat
### SSL Proxy Testing (Done)
- Default checktype changed to 'ssl'
- ssl_targets list with major HTTPS sites
- TLS handshake validation with certificate verification
- Detects MITM proxies that intercept SSL connections
### Export Functionality (Done)
- export.py CLI tool for exporting working proxies
- Multiple formats: txt, json, csv, len (length-prefixed)
- Filters: proto, country, anonymity, max_latency
- Sort options: latency, added, tested, success
- Output to stdout or file
### Web Dashboard (Done)
- /dashboard endpoint with dark theme HTML UI
- /api/stats endpoint for JSON runtime statistics
- Auto-refresh with JavaScript fetch every 3 seconds
- Stats provider callback from proxywatchd.py to httpd.py
- Displays: tested/passed/success rate, thread count, uptime
- Tor pool health: per-host latency, success rate, availability
- Failure categories breakdown: timeout, proxy, ssl, closed
### Dashboard Enhancements v2 (Done)
- Prominent check type badge in header (SSL/JUDGES/HTTP/IRC)
- System monitor bar: load, memory, disk, process RSS
- Anonymity breakdown: elite/anonymous/transparent counts
- Database health: size, tested/hour, added/day, dead count
- Enhanced Tor pool stats: requests, success rate, healthy nodes, latency
- SQLite ANALYZE/VACUUM functions for query optimization
- Lightweight design: client-side polling, minimal DOM updates
### Dashboard Enhancements v3 (Done)
- Electric cyan theme with translucent glass-morphism effects
- Unified wrapper styling (.chart-wrap, .histo-wrap, .stats-wrap, .lb-wrap, .pie-wrap)
- Consistent backdrop-filter blur and electric glow borders
- Tor Exit Nodes cards with hover effects (.tor-card)
- Lighter background/tile color scheme (#1e2738 bg, #181f2a card)
- Map endpoint restyled to match dashboard (electric cyan theme)
- Map markers updated from gold to cyan for approximate locations
### Memory Profiling & Analysis (Done)
- /api/memory endpoint with comprehensive memory stats
- objgraph integration for object type counting
- pympler integration for memory summaries
- Memory sample history tracking (RSS over time)
- Process memory from /proc/self/status (VmRSS, VmPeak, VmData, etc.)
- GC statistics (collections, objects, thresholds)
### MITM Detection Optimization (Done)
- MITM re-test skip optimization - avoid redundant SSL checks for known MITM proxies
- mitm_retest_skipped stats counter for tracking optimization effectiveness
- Content hash deduplication for stale proxy list detection
- stale_count reset when content hash changes
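The content-hash deduplication with stale-count reset described above can be sketched with a small tracker; the class name and `observe()` method are illustrative, though the `stale_count` field mirrors the bullet wording.

```python
import hashlib


class StaleTracker(object):
    """Detect stale proxy lists: if a source keeps serving byte-identical
    content, stale_count grows on each fetch; any change resets it."""

    def __init__(self):
        self.last_hash = None
        self.stale_count = 0

    def observe(self, content):
        digest = hashlib.sha256(content).hexdigest()
        if digest == self.last_hash:
            self.stale_count += 1   # same bytes as last fetch
        else:
            self.stale_count = 0    # content changed: reset the counter
            self.last_hash = digest
        return self.stale_count
```

Hashing avoids storing full page bodies per source while still detecting that nothing new is being published.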
## Technical Debt

| Item | Description | Risk |
|---|---|---|
| Global config in fetch.py | set_config() pattern is fragile | Low: works, but not clean |

All other previously tracked debt items have been resolved.
## File Reference
| File | Purpose | Status |
|---|---|---|
| ppf.py | Main URL harvester daemon | Active, cleaned |
| proxywatchd.py | Proxy validation daemon | Active, enhanced |
| scraper.py | Searx search integration | Active, cleaned |
| fetch.py | HTTP fetching with proxy support | Active |
| dbs.py | Database schema and inserts | Active |
| mysqlite.py | SQLite wrapper | Active |
| rocksock.py | Socket/proxy abstraction (3rd party) | Stable |
| http2.py | HTTP client implementation | Stable |
| config.py | Configuration management | Active |
| comboparse.py | Config/arg parser framework | Stable, cleaned |
| soup_parser.py | BeautifulSoup wrapper | Stable, cleaned |
| misc.py | Utilities (timestamp, logging) | Stable, cleaned |
| export.py | Proxy export CLI tool | Active |