# PPF Project Roadmap
## Project Purpose
PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework designed to:
1. **Discover** proxy addresses by crawling websites and search engines
2. **Validate** proxies through multi-target testing via Tor
3. **Maintain** a database of working proxies with protocol detection (SOCKS4/SOCKS5/HTTP)
## Architecture Overview
```
┌──────────────────────────────────────────────────────────────────┐
│                         PPF Architecture                         │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────────┐      ┌───────────────┐      ┌─────────────┐     │
│  │ scraper.py  │      │    ppf.py     │      │ proxywatchd │     │
│  │             │      │               │      │             │     │
│  │ Searx query │─────>│ URL harvest   │─────>│ Proxy test  │     │
│  │ URL finding │      │ Proxy extract │      │ Validation  │     │
│  └─────────────┘      └───────────────┘      └─────────────┘     │
│         │                     │                     │            │
│         v                     v                     v            │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                      SQLite Databases                      │  │
│  │        uris.db (URLs)          proxies.db (proxy list)     │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                       Network Layer                        │  │
│  │ rocksock.py ─── Tor SOCKS ─── Test Proxy ─── Target Server │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
```
## Constraints
- **Python 2.7** compatibility required
- **Minimal external dependencies** (avoid adding new modules)
- Current dependencies: beautifulsoup4, pyasn, IP2Location
- Data files: IP2LOCATION-LITE-DB1.BIN (country), ipasn.dat (ASN)
---
## Phase 1: Stability & Code Quality
**Objective:** Establish a solid, maintainable codebase
### 1.1 Error Handling Improvements
| Task | Description | File(s) |
|------|-------------|---------|
| Add connection retry logic | Implement exponential backoff for failed connections | rocksock.py, fetch.py |
| Graceful database errors | Handle SQLite lock/busy errors with retry | mysqlite.py |
| Timeout standardization | Consistent timeout handling across all network ops | proxywatchd.py, fetch.py |
| Exception logging | Log exceptions with context instead of silently catching them | all files |
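The retry task could be sketched roughly like this (illustrative only: `connect` stands in for whatever rocksock.py/fetch.py call fails, and the parameter names and defaults are assumptions, not the existing API):

```python
import random
import time


def retry_with_backoff(connect, max_attempts=4, base_delay=1.0, max_delay=30.0):
    """Call connect() until it succeeds, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except IOError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the last error
            # exponential backoff, capped, with jitter to avoid thundering herd
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, 0.1))
```

Catching `IOError` keeps the sketch Python 2.7-friendly; on Python 3 it is an alias of `OSError`, so socket-level failures are covered in both.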
### 1.2 Code Consolidation
| Task | Description | File(s) |
|------|-------------|---------|
| Unify _known_proxies | Single source of truth for known proxy cache | ppf.py, fetch.py |
| Extract proxy utils | Create proxy_utils.py with cleanse/validate functions | new file |
| Remove global config pattern | Pass config explicitly instead of set_config() | fetch.py |
| Standardize logging | Consistent _log() usage with levels across all modules | all files |
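As a sketch of what the proposed proxy_utils.py could contain (the regex and bounds checks are assumptions for illustration, not the existing cleanse logic in ppf.py):

```python
import re

# plain "ip:port" IPv4 proxies, e.g. "1.2.3.4:8080"
_PROXY_RE = re.compile(r'^(\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})$')


def cleanse_proxy(s):
    """Return a normalized 'ip:port' string, or None if s is not a valid proxy."""
    m = _PROXY_RE.match(s.strip())
    if not m:
        return None
    ip, port = m.group(1), int(m.group(2))
    if not 1 <= port <= 65535:
        return None  # port out of range
    if any(int(octet) > 255 for octet in ip.split('.')):
        return None  # malformed IPv4 octet
    return '%s:%d' % (ip, port)
```

Returning `None` rather than raising keeps the helper convenient in bulk-extraction loops, where most candidate strings are garbage.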
### 1.3 Testing Infrastructure
| Task | Description | File(s) |
|------|-------------|---------|
| Add unit tests | Test proxy parsing, URL extraction, IP validation | tests/ |
| Mock network layer | Allow testing without live network/Tor | tests/ |
| Validation test suite | Verify multi-target voting logic | tests/ |
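A minimal fake socket along these lines would let the validation logic run under test without Tor or a live network (the method names mirror a generic socket interface and are assumptions, not rocksock.py's actual API):

```python
class FakeSock(object):
    """Stand-in for a socket-like connection so tests need no network or Tor."""

    def __init__(self, responses):
        self.responses = list(responses)  # canned replies, returned in order
        self.sent = []                    # everything the code under test wrote

    def send(self, data):
        self.sent.append(data)

    def recv(self, n):
        # pop the next canned reply; empty bytes signals connection closed
        return self.responses.pop(0) if self.responses else b''

    def close(self):
        pass
```

Code under test writes a request and reads canned replies, and the test then asserts on `sent` to verify the protocol bytes without any I/O.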
---
## Phase 2: Performance Optimization
**Objective:** Improve throughput and resource efficiency
### 2.1 Connection Pooling
| Task | Description | File(s) |
|------|-------------|---------|
| Tor connection reuse | Pool Tor SOCKS connections instead of reconnecting | proxywatchd.py |
| HTTP keep-alive | Reuse connections to same target servers | http2.py |
| Connection warm-up | Pre-establish connections before job assignment | proxywatchd.py |
### 2.2 Database Optimization
| Task | Description | File(s) |
|------|-------------|---------|
| Batch inserts | Group INSERT operations (already partial) | dbs.py |
| Index optimization | Add indexes for frequent query patterns | dbs.py |
| WAL mode | Enable Write-Ahead Logging for better concurrency | mysqlite.py |
| Prepared statements | Cache compiled SQL statements | mysqlite.py |
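The WAL task amounts to a couple of PRAGMAs at connection time; a sketch (the function name is illustrative, not mysqlite.py's actual wrapper):

```python
import sqlite3


def open_db(path):
    """Open a SQLite connection tuned for concurrent reader/writer access."""
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a writer appends to the log
    conn.execute("PRAGMA journal_mode=WAL")
    # NORMAL is safe under WAL and avoids an fsync on every transaction
    conn.execute("PRAGMA synchronous=NORMAL")
    return conn
```

Indexes for the frequent query patterns can be created the same way at schema-setup time, e.g. `CREATE INDEX IF NOT EXISTS ... ` statements guarded by `IF NOT EXISTS` so they are idempotent across restarts.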
### 2.3 Threading Improvements
| Task | Description | File(s) |
|------|-------------|---------|
| Dynamic thread scaling | Adjust thread count based on success rate | proxywatchd.py |
| Priority queue | Test high-value proxies (low fail count) first | proxywatchd.py |
| Stale proxy cleanup | Background thread to remove long-dead proxies | proxywatchd.py |
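The priority-queue idea can be captured in a few lines; the names match the roadmap's own `PriorityJobQueue` / `calculate_priority`, but the bodies below are a sketch, not the actual proxywatchd.py code:

```python
import heapq
import itertools


class PriorityJobQueue(object):
    """Heap-ordered job queue: lower priority number is tested sooner."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order stable

    def put(self, priority, job):
        heapq.heappush(self._heap, (priority, next(self._counter), job))

    def get(self):
        return heapq.heappop(self._heap)[2]


def calculate_priority(tested, failed):
    """0 = never tested (try first) ... 4 = many failures (try last)."""
    if tested == 0:
        return 0
    return min(4, 1 + failed)
```

The monotonic counter in each heap entry prevents `heapq` from ever comparing two job objects directly, which matters when jobs are arbitrary class instances.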
---
## Phase 3: Reliability & Accuracy
**Objective:** Improve proxy validation accuracy and system reliability
### 3.1 Enhanced Validation
| Task | Description | File(s) |
|------|-------------|---------|
| Latency tracking | Store and use connection latency metrics | proxywatchd.py, dbs.py |
| Geographic validation | Verify proxy actually routes through claimed location | proxywatchd.py |
| Protocol fingerprinting | Better SOCKS4/SOCKS5/HTTP detection | rocksock.py |
| HTTPS/SSL testing | Validate SSL proxy capabilities | proxywatchd.py |
### 3.2 Target Management
| Task | Description | File(s) |
|------|-------------|---------|
| Dynamic target pool | Auto-discover and rotate validation targets | proxywatchd.py |
| Target health tracking | Remove unresponsive targets from pool | proxywatchd.py |
| Geographic target spread | Ensure targets span multiple regions | config.py |
### 3.3 Failure Analysis
| Task | Description | File(s) |
|------|-------------|---------|
| Failure categorization | Distinguish timeout vs refused vs auth-fail | proxywatchd.py |
| Retry strategies | Different retry logic per failure type | proxywatchd.py |
| Dead proxy quarantine | Separate storage for likely-dead proxies | dbs.py |
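Failure categorization could map exceptions onto the buckets the dashboard reports (timeout, proxy, ssl, closed); the mapping below is a sketch, not the daemon's actual logic:

```python
import errno
import socket
import ssl


def categorize_failure(exc):
    """Map an exception from a proxy test to a coarse failure bucket."""
    if isinstance(exc, socket.timeout):
        return 'timeout'
    if isinstance(exc, ssl.SSLError):
        return 'ssl'
    err = getattr(exc, 'errno', None)
    if err == errno.ECONNREFUSED:
        return 'proxy'   # nothing listening at the advertised address
    if err in (errno.ECONNRESET, errno.EPIPE):
        return 'closed'  # connection dropped mid-test
    return 'other'
```

Checking `socket.timeout` before `ssl.SSLError` matters: both are `OSError` subclasses, so order decides which bucket wins for overlapping cases.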
---
## Phase 4: Features & Usability
**Objective:** Add useful features while maintaining simplicity
### 4.1 Reporting & Monitoring
| Task | Description | File(s) |
|------|-------------|---------|
| Statistics collection | Track success rates, throughput, latency | proxywatchd.py |
| Periodic status output | Log summary stats every N minutes | ppf.py, proxywatchd.py |
| Export functionality | Export working proxies to file (txt, json) | new: export.py |
### 4.2 Configuration
| Task | Description | File(s) |
|------|-------------|---------|
| Config validation | Validate config.ini on startup | config.py |
| Runtime reconfiguration | Reload config without restart (SIGHUP) | proxywatchd.py |
| Sensible defaults | Document and improve default values | config.py |
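Runtime reconfiguration via SIGHUP usually means setting a flag in the handler and reloading at a safe point in the main loop; a POSIX-only sketch (class name and loop hook are assumptions):

```python
import signal


class ReloadFlag(object):
    """SIGHUP sets a flag; the main loop applies the reload at a safe point."""

    def __init__(self):
        self.pending = False
        signal.signal(signal.SIGHUP, self._on_hup)

    def _on_hup(self, signum, frame):
        # do no real work inside the handler; just record the request
        self.pending = True

    def check(self, reload_config):
        """Call once per main-loop iteration; runs reload_config() if needed."""
        if self.pending:
            self.pending = False
            reload_config()
            return True
        return False
```

Deferring the actual reload out of the signal handler avoids re-entrancy problems with SQLite handles and open sockets.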
### 4.3 Proxy Source Expansion
| Task | Description | File(s) |
|------|-------------|---------|
| Additional scrapers | Support more search engines beyond Searx | scraper.py |
| API sources | Integrate free proxy API endpoints | new: api_sources.py |
| Import formats | Support various proxy list formats | ppf.py |
---
## Implementation Priority
```
┌─────────────────────────────────────────────────────────┐
│                     Priority Matrix                     │
├──────────────────────────┬──────────────────────────────┤
│ HIGH IMPACT / LOW EFFORT │ HIGH IMPACT / HIGH EFFORT    │
│                          │                              │
│ [x] Unify _known_proxies │ [x] Connection pooling       │
│ [x] Graceful DB errors   │ [x] Dynamic thread scaling   │
│ [x] Batch inserts        │ [ ] Unit test infrastructure │
│ [x] WAL mode for SQLite  │ [x] Latency tracking         │
│                          │                              │
├──────────────────────────┼──────────────────────────────┤
│ LOW IMPACT / LOW EFFORT  │ LOW IMPACT / HIGH EFFORT     │
│                          │                              │
│ [x] Standardize logging  │ [x] Geographic validation    │
│ [x] Config validation    │ [x] Additional scrapers      │
│ [x] Export functionality │ [ ] API sources              │
│ [x] Status output        │ [ ] Protocol fingerprinting  │
│                          │                              │
└──────────────────────────┴──────────────────────────────┘
```
---
## Completed Work
### Multi-Target Validation (Done)
- [x] Work-stealing queue with shared Queue.Queue()
- [x] Multi-target validation (2/3 majority voting)
- [x] Interleaved testing (jobs shuffled across proxies)
- [x] ProxyTestState and TargetTestJob classes
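The 2/3 voting rule reduces to a small helper; a sketch, not the actual ProxyTestState code:

```python
def majority_verdict(results, required=2):
    """Declare a proxy working if at least `required` of the target tests passed."""
    passes = sum(1 for ok in results if ok)
    return passes >= required
```

Requiring agreement from two of three independent targets filters out proxies that happen to work against a single lenient endpoint.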
### Code Cleanup (Done)
- [x] Removed dead HTTP server code from ppf.py
- [x] Removed dead gumbo code from soup_parser.py
- [x] Removed test code from comboparse.py
- [x] Removed unused functions from misc.py
- [x] Fixed IP/port cleansing in ppf.py extract_proxies()
- [x] Updated .gitignore, removed .pyc files
### Database Optimization (Done)
- [x] Enable SQLite WAL mode for better concurrency
- [x] Add indexes for common query patterns (failed, tested, proto, error, check_time)
- [x] Optimize batch inserts (remove redundant SELECT before INSERT OR IGNORE)
### Dependency Reduction (Done)
- [x] Make lxml optional (removed from requirements)
- [x] Make IP2Location optional (graceful fallback)
- [x] Add --nobs flag for stdlib HTMLParser fallback (bs4 optional)
### Rate Limiting & Stability (Done)
- [x] InstanceTracker class in scraper.py with exponential backoff
- [x] Configurable backoff_base, backoff_max, fail_threshold
- [x] Exception logging with context (replaced bare except blocks)
- [x] Unified _known_proxies cache in fetch.py
### Monitoring & Maintenance (Done)
- [x] Stats class in proxywatchd.py (tested/passed/failed tracking)
- [x] Periodic stats reporting (configurable stats_interval)
- [x] Stale proxy cleanup (cleanup_stale() with configurable stale_days)
- [x] Timeout config options (timeout_connect, timeout_read)
### Connection Pooling (Done)
- [x] TorHostState class tracking per-host health and latency
- [x] TorConnectionPool with worker affinity for circuit reuse
- [x] Exponential backoff (5s, 10s, 20s, 40s, max 60s) on failures
- [x] Pool warmup and health status reporting
### Priority Queue (Done)
- [x] PriorityJobQueue class with heap-based ordering
- [x] calculate_priority() assigns priority 0-4 by proxy state
- [x] New proxies tested first, high-fail proxies last
### Dynamic Thread Scaling (Done)
- [x] ThreadScaler class adjusts thread count dynamically
- [x] Scales up when queue deep and success rate acceptable
- [x] Scales down when queue shallow or success rate drops
- [x] Respects min/max bounds with cooldown period
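The scaling policy can be captured in a few lines; the thresholds below are placeholders, not the tuned values in proxywatchd.py:

```python
class ThreadScaler(object):
    """Adjust worker count between bounds based on queue depth and success rate."""

    def __init__(self, min_threads=4, max_threads=64, step=4):
        self.min = min_threads
        self.max = max_threads
        self.step = step
        self.current = min_threads

    def decide(self, queue_depth, success_rate):
        # scale up: deep backlog and the proxies are not all dead
        if queue_depth > self.current * 10 and success_rate >= 0.05:
            self.current = min(self.max, self.current + self.step)
        # scale down: shallow backlog, or almost nothing is passing
        elif queue_depth < self.current or success_rate < 0.01:
            self.current = max(self.min, self.current - self.step)
        return self.current
```

In the real daemon a cooldown period between decisions (as the checklist above notes) prevents the count from oscillating every cycle.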
### Latency Tracking (Done)
- [x] avg_latency, latency_samples columns in proxylist
- [x] Exponential moving average calculation
- [x] Migration function for existing databases
- [x] Latency recorded for successful proxy tests
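The moving-average update that feeds `avg_latency` could look like this (the smoothing factor `alpha` is an assumption; the real value lives in proxywatchd.py):

```python
def update_latency(avg_latency, latency_samples, new_sample, alpha=0.3):
    """Fold one new measurement into an exponential moving average."""
    if latency_samples == 0:
        return new_sample, 1  # first sample seeds the average
    updated = alpha * new_sample + (1.0 - alpha) * avg_latency
    return updated, latency_samples + 1
```

An EMA weighs recent tests more heavily than old ones, so a proxy that suddenly slows down is reflected within a few samples instead of being diluted by its history.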
### Container Support (Done)
- [x] Dockerfile with Python 2.7-slim base
- [x] docker-compose.yml for local development
- [x] Rootless podman deployment documentation
- [x] Volume mounts for persistent data
### Code Style (Done)
- [x] Normalized indentation (4-space, no tabs)
- [x] Removed dead code and unused imports
- [x] Added docstrings to classes and functions
- [x] Python 2/3 compatible imports (Queue/queue)
### Geographic Validation (Done)
- [x] IP2Location integration for country lookup
- [x] pyasn integration for ASN lookup
- [x] Graceful fallback when database files missing
- [x] Country codes displayed in test output: `(US)`, `(IN)`, etc.
- [x] Data files: IP2LOCATION-LITE-DB1.BIN, ipasn.dat
### SSL Proxy Testing (Done)
- [x] Default checktype changed to 'ssl'
- [x] ssl_targets list with major HTTPS sites
- [x] TLS handshake validation with certificate verification
- [x] Detects MITM proxies that intercept SSL connections
### Export Functionality (Done)
- [x] export.py CLI tool for exporting working proxies
- [x] Multiple formats: txt, json, csv, len (length-prefixed)
- [x] Filters: proto, country, anonymity, max_latency
- [x] Sort options: latency, added, tested, success
- [x] Output to stdout or file
### Web Dashboard (Done)
- [x] /dashboard endpoint with dark theme HTML UI
- [x] /api/stats endpoint for JSON runtime statistics
- [x] Auto-refresh with JavaScript fetch every 5 seconds
- [x] Stats provider callback from proxywatchd.py to httpd.py
- [x] Displays: tested/passed/success rate, thread count, uptime
- [x] Tor pool health: per-host latency, success rate, availability
- [x] Failure categories breakdown: timeout, proxy, ssl, closed
---
## Technical Debt
| Item | Description | Risk |
|------|-------------|------|
| ~~Dual _known_proxies~~ | ~~ppf.py and fetch.py maintain separate caches~~ | **Resolved** |
| Global config in fetch.py | set_config() pattern is fragile | Low - works but not clean |
| ~~No input validation~~ | ~~Proxy strings parsed without validation~~ | **Resolved** |
| ~~Silent exception catching~~ | ~~Some except: pass patterns hide errors~~ | **Resolved** |
| ~~Hardcoded timeouts~~ | ~~Various timeout values scattered in code~~ | **Resolved** |
---
## File Reference
| File | Purpose | Status |
|------|---------|--------|
| ppf.py | Main URL harvester daemon | Active, cleaned |
| proxywatchd.py | Proxy validation daemon | Active, enhanced |
| scraper.py | Searx search integration | Active, cleaned |
| fetch.py | HTTP fetching with proxy support | Active |
| dbs.py | Database schema and inserts | Active |
| mysqlite.py | SQLite wrapper | Active |
| rocksock.py | Socket/proxy abstraction (3rd party) | Stable |
| http2.py | HTTP client implementation | Stable |
| config.py | Configuration management | Active |
| comboparse.py | Config/arg parser framework | Stable, cleaned |
| soup_parser.py | BeautifulSoup wrapper | Stable, cleaned |
| misc.py | Utilities (timestamp, logging) | Stable, cleaned |
| export.py | Proxy export CLI tool | Active |