docs: remove completed items from TODO and ROADMAP
362
ROADMAP.md
@@ -45,83 +45,15 @@ PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework design

---

## Phase 1: Stability & Code Quality
## Open Work

**Objective:** Establish a solid, maintainable codebase

### 1.1 Error Handling Improvements
### Validation

| Task | Description | File(s) |
|------|-------------|---------|
| Add connection retry logic | Implement exponential backoff for failed connections | rocksock.py, fetch.py |
| Graceful database errors | Handle SQLite lock/busy errors with retry | mysqlite.py |
| Timeout standardization | Consistent timeout handling across all network ops | proxywatchd.py, fetch.py |
| Exception logging | Log exceptions with context, not just silently catch | all files |

### 1.2 Code Consolidation

| Task | Description | File(s) |
|------|-------------|---------|
| Unify _known_proxies | Single source of truth for known proxy cache | ppf.py, fetch.py |
| Extract proxy utils | Create proxy_utils.py with cleanse/validate functions | new file |
| Remove global config pattern | Pass config explicitly instead of set_config() | fetch.py |
| Standardize logging | Consistent _log() usage with levels across all modules | all files |

### 1.3 Testing Infrastructure

| Task | Description | File(s) |
|------|-------------|---------|
| Add unit tests | Test proxy parsing, URL extraction, IP validation | tests/ |
| Mock network layer | Allow testing without live network/Tor | tests/ |
| Validation test suite | Verify multi-target voting logic | tests/ |

---

## Phase 2: Performance Optimization

**Objective:** Improve throughput and resource efficiency

### 2.1 Connection Pooling

| Task | Description | File(s) |
|------|-------------|---------|
| Tor connection reuse | Pool Tor SOCKS connections instead of reconnecting | proxywatchd.py |
| HTTP keep-alive | Reuse connections to same target servers | http2.py |
| Connection warm-up | Pre-establish connections before job assignment | proxywatchd.py |

### 2.2 Database Optimization

| Task | Description | File(s) |
|------|-------------|---------|
| Batch inserts | Group INSERT operations (already partial) | dbs.py |
| Index optimization | Add indexes for frequent query patterns | dbs.py |
| WAL mode | Enable Write-Ahead Logging for better concurrency | mysqlite.py |
| Prepared statements | Cache compiled SQL statements | mysqlite.py |

### 2.3 Threading Improvements

| Task | Description | File(s) |
|------|-------------|---------|
| Dynamic thread scaling | Adjust thread count based on success rate | proxywatchd.py |
| Priority queue | Test high-value proxies (low fail count) first | proxywatchd.py |
| Stale proxy cleanup | Background thread to remove long-dead proxies | proxywatchd.py |

---

## Phase 3: Reliability & Accuracy

**Objective:** Improve proxy validation accuracy and system reliability

### 3.1 Enhanced Validation

| Task | Description | File(s) |
|------|-------------|---------|
| Latency tracking | Store and use connection latency metrics | proxywatchd.py, dbs.py |
| Geographic validation | Verify proxy actually routes through claimed location | proxywatchd.py |
| Protocol fingerprinting | Better SOCKS4/SOCKS5/HTTP detection | rocksock.py |
| HTTPS/SSL testing | Validate SSL proxy capabilities | proxywatchd.py |

### 3.2 Target Management
### Target Management

| Task | Description | File(s) |
|------|-------------|---------|
@@ -129,245 +61,11 @@ PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework design
| Target health tracking | Remove unresponsive targets from pool | proxywatchd.py |
| Geographic target spread | Ensure targets span multiple regions | config.py |

### 3.3 Failure Analysis
### Proxy Source Expansion

| Task | Description | File(s) |
|------|-------------|---------|
| Failure categorization | Distinguish timeout vs refused vs auth-fail | proxywatchd.py |
| Retry strategies | Different retry logic per failure type | proxywatchd.py |
| Dead proxy quarantine | Separate storage for likely-dead proxies | dbs.py |

---

## Phase 4: Features & Usability

**Objective:** Add useful features while maintaining simplicity

### 4.1 Reporting & Monitoring

| Task | Description | File(s) |
|------|-------------|---------|
| Statistics collection | Track success rates, throughput, latency | proxywatchd.py |
| Periodic status output | Log summary stats every N minutes | ppf.py, proxywatchd.py |
| Export functionality | Export working proxies to file (txt, json) | new: export.py |

### 4.2 Configuration

| Task | Description | File(s) |
|------|-------------|---------|
| Config validation | Validate config.ini on startup | config.py |
| Runtime reconfiguration | Reload config without restart (SIGHUP) | proxywatchd.py |
| Sensible defaults | Document and improve default values | config.py |

### 4.3 Proxy Source Expansion

| Task | Description | File(s) |
|------|-------------|---------|
| Additional scrapers | Support more search engines beyond Searx | scraper.py |
| API sources | Integrate free proxy API endpoints | new: api_sources.py |
| Import formats | Support various proxy list formats | ppf.py |

---

## Implementation Priority

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                               Priority Matrix                               │
├──────────────────────────┬──────────────────────────────────────────────────┤
│ HIGH IMPACT / LOW EFFORT │ HIGH IMPACT / HIGH EFFORT                        │
│                          │                                                  │
│ [x] Unify _known_proxies │ [x] Connection pooling                           │
│ [x] Graceful DB errors   │ [x] Dynamic thread scaling                       │
│ [x] Batch inserts        │ [x] Unit test infrastructure                     │
│ [x] WAL mode for SQLite  │ [x] Latency tracking                             │
│                          │                                                  │
├──────────────────────────┼──────────────────────────────────────────────────┤
│ LOW IMPACT / LOW EFFORT  │ LOW IMPACT / HIGH EFFORT                         │
│                          │                                                  │
│ [x] Standardize logging  │ [x] Geographic validation                        │
│ [x] Config validation    │ [x] Additional scrapers (Bing, Yahoo, Mojeek)    │
│ [x] Export functionality │ [ ] API sources                                  │
│ [x] Status output        │ [ ] Protocol fingerprinting                      │
│                          │                                                  │
└──────────────────────────┴──────────────────────────────────────────────────┘
```

---

## Completed Work

### Multi-Target Validation (Done)
- [x] Work-stealing queue with shared Queue.Queue()
- [x] Multi-target validation (2/3 majority voting)
- [x] Interleaved testing (jobs shuffled across proxies)
- [x] ProxyTestState and TargetTestJob classes

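The 2/3 majority voting listed above can be illustrated with a minimal sketch. The function name `majority_passed` and its `required` parameter are hypothetical and are not taken from proxywatchd.py; the only detail assumed from the roadmap is that a proxy counts as working when at least two of three target checks succeed.

```python
def majority_passed(results, required=2):
    """Return True when at least `required` target checks succeeded.

    `results` is an iterable of booleans, one per target tested
    through the proxy (hypothetical representation).
    """
    successes = sum(1 for ok in results if ok)
    return successes >= required


# Three targets were tried through the same proxy; two succeeded.
verdict = majority_passed([True, False, True])
```

Counting successes rather than requiring unanimity means a single flaky target does not mark an otherwise healthy proxy as dead.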
### Code Cleanup (Done)
- [x] Removed dead HTTP server code from ppf.py
- [x] Removed dead gumbo code from soup_parser.py
- [x] Removed test code from comboparse.py
- [x] Removed unused functions from misc.py
- [x] Fixed IP/port cleansing in ppf.py extract_proxies()
- [x] Updated .gitignore, removed .pyc files

### Database Optimization (Done)
- [x] Enable SQLite WAL mode for better concurrency
- [x] Add indexes for common query patterns (failed, tested, proto, error, check_time)
- [x] Optimize batch inserts (remove redundant SELECT before INSERT OR IGNORE)

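As a rough illustration of the database items above, the sketch below enables WAL mode, creates one index, and batches inserts with `INSERT OR IGNORE` using Python's standard `sqlite3` module. The `proxylist` table name and the `failed` column appear in this roadmap; the `proxy` column, the index name, and the helper names `open_db` and `insert_proxies` are assumptions made for the example, not the project's actual schema or API.

```python
import os
import sqlite3
import tempfile


def open_db(path):
    """Open a database, switch it to WAL mode, and ensure the demo schema."""
    conn = sqlite3.connect(path)
    # The pragma returns the journal mode now in effect ("wal" on success).
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    conn.execute(
        "CREATE TABLE IF NOT EXISTS proxylist "
        "(proxy TEXT PRIMARY KEY, failed INTEGER DEFAULT 0)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_failed ON proxylist (failed)")
    return conn, mode


def insert_proxies(conn, proxies):
    """Insert many rows in one transaction; duplicates are skipped."""
    # INSERT OR IGNORE drops rows that would violate the primary key,
    # so no SELECT is needed beforehand.
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO proxylist (proxy) VALUES (?)",
            [(p,) for p in proxies],
        )


path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
conn, mode = open_db(path)
insert_proxies(conn, ["1.2.3.4:8080", "1.2.3.4:8080", "5.6.7.8:1080"])
count = conn.execute("SELECT COUNT(*) FROM proxylist").fetchone()[0]
conn.close()
```

WAL mode only applies to file-backed databases, which is why the example uses a temporary file rather than `:memory:`.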
### Dependency Reduction (Done)
- [x] Make lxml optional (removed from requirements)
- [x] Make IP2Location optional (graceful fallback)
- [x] Add --nobs flag for stdlib HTMLParser fallback (bs4 optional)

### Rate Limiting & Stability (Done)
- [x] InstanceTracker class in scraper.py with exponential backoff
- [x] Configurable backoff_base, backoff_max, fail_threshold
- [x] Exception logging with context (replaced bare except blocks)
- [x] Unified _known_proxies cache in fetch.py

### Monitoring & Maintenance (Done)
- [x] Stats class in proxywatchd.py (tested/passed/failed tracking)
- [x] Periodic stats reporting (configurable stats_interval)
- [x] Stale proxy cleanup (cleanup_stale() with configurable stale_days)
- [x] Timeout config options (timeout_connect, timeout_read)

### Connection Pooling (Done)
- [x] TorHostState class tracking per-host health and latency
- [x] TorConnectionPool with worker affinity for circuit reuse
- [x] Exponential backoff (5s, 10s, 20s, 40s, max 60s) on failures
- [x] Pool warmup and health status reporting

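The backoff schedule quoted above (5s, 10s, 20s, 40s, max 60s) is consistent with doubling a 5-second base and capping the result at 60 seconds. The sketch below shows that arithmetic; the function name `backoff_delay` and its parameters are hypothetical, and the real TorConnectionPool may compute its delays differently.

```python
def backoff_delay(failures, base=5, cap=60):
    """Seconds to wait after `failures` consecutive failures.

    The delay doubles with each failure, starting at `base`,
    and never exceeds `cap`.
    """
    if failures <= 0:
        return 0
    return min(base * 2 ** (failures - 1), cap)


delays = [backoff_delay(n) for n in range(1, 7)]
# delays == [5, 10, 20, 40, 60, 60]
```

The fifth failure would nominally wait 80 seconds, so the cap is what produces the "max 60s" plateau.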
### Priority Queue (Done)
- [x] PriorityJobQueue class with heap-based ordering
- [x] calculate_priority() assigns priority 0-4 by proxy state
- [x] New proxies tested first, high-fail proxies last

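A heap-based job queue of the kind described above can be sketched with the standard `heapq` module. The class name `HeapJobQueue` is hypothetical and this is not the project's PriorityJobQueue; the sketch only assumes that lower numbers (0) are served before higher ones (4), as the list suggests for new versus high-fail proxies.

```python
import heapq
import itertools


class HeapJobQueue(object):
    """Minimal priority queue: lowest priority number is served first."""

    def __init__(self):
        self._heap = []
        # The counter breaks ties so equal-priority jobs keep insertion
        # order and the job objects themselves are never compared.
        self._counter = itertools.count()

    def put(self, priority, job):
        heapq.heappush(self._heap, (priority, next(self._counter), job))

    def get(self):
        _priority, _seq, job = heapq.heappop(self._heap)
        return job

    def __len__(self):
        return len(self._heap)


queue = HeapJobQueue()
queue.put(4, "high-fail proxy")
queue.put(0, "new proxy A")
queue.put(0, "new proxy B")
order = [queue.get() for _ in range(len(queue))]
```

This sketch is not thread-safe; a shared queue used by worker threads would need a lock around `put` and `get`.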
### Dynamic Thread Scaling (Done)
- [x] ThreadScaler class adjusts thread count dynamically
- [x] Scales up when queue deep and success rate acceptable
- [x] Scales down when queue shallow or success rate drops
- [x] Respects min/max bounds with cooldown period

### Latency Tracking (Done)
- [x] avg_latency, latency_samples columns in proxylist
- [x] Exponential moving average calculation
- [x] Migration function for existing databases
- [x] Latency recorded for successful proxy tests

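For the exponential moving average mentioned above, a minimal sketch follows. The names `avg_latency` and `latency_samples` mirror the columns listed in the roadmap, but the function `update_avg_latency` and the smoothing factor `alpha=0.3` are assumptions; the source does not state which weight the project uses.

```python
def update_avg_latency(avg_latency, latency_samples, new_latency, alpha=0.3):
    """Fold one new latency measurement into an exponential moving average.

    Returns the updated (avg_latency, latency_samples) pair.
    """
    if latency_samples == 0:
        # The first sample seeds the average directly.
        return new_latency, 1
    avg = alpha * new_latency + (1 - alpha) * avg_latency
    return avg, latency_samples + 1


avg, samples = update_avg_latency(0.0, 0, 2.0)
avg, samples = update_avg_latency(avg, samples, 4.0, alpha=0.5)
```

An exponential moving average needs only the previous average to update, so no per-proxy history has to be stored.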
### Container Support (Done)
- [x] Dockerfile with Python 2.7-slim base
- [x] docker-compose.yml for local development
- [x] Rootless podman deployment documentation
- [x] Volume mounts for persistent data

### Code Style (Done)
- [x] Normalized indentation (4-space, no tabs)
- [x] Removed dead code and unused imports
- [x] Added docstrings to classes and functions
- [x] Python 2/3 compatible imports (Queue/queue)

### Geographic Validation (Done)
- [x] IP2Location integration for country lookup
- [x] pyasn integration for ASN lookup
- [x] Graceful fallback when database files missing
- [x] Country codes displayed in test output: `(US)`, `(IN)`, etc.
- [x] Data files: IP2LOCATION-LITE-DB1.BIN, ipasn.dat

### SSL Proxy Testing (Done)
- [x] Default checktype changed to 'ssl'
- [x] ssl_targets list with major HTTPS sites
- [x] TLS handshake validation with certificate verification
- [x] Detects MITM proxies that intercept SSL connections

### Export Functionality (Done)
- [x] export.py CLI tool for exporting working proxies
- [x] Multiple formats: txt, json, csv, len (length-prefixed)
- [x] Filters: proto, country, anonymity, max_latency
- [x] Sort options: latency, added, tested, success
- [x] Output to stdout or file

### Web Dashboard (Done)
- [x] /dashboard endpoint with dark theme HTML UI
- [x] /api/stats endpoint for JSON runtime statistics
- [x] Auto-refresh with JavaScript fetch every 3 seconds
- [x] Stats provider callback from proxywatchd.py to httpd.py
- [x] Displays: tested/passed/success rate, thread count, uptime
- [x] Tor pool health: per-host latency, success rate, availability
- [x] Failure categories breakdown: timeout, proxy, ssl, closed

### Dashboard Enhancements v2 (Done)
- [x] Prominent check type badge in header (SSL/JUDGES/HTTP/IRC)
- [x] System monitor bar: load, memory, disk, process RSS
- [x] Anonymity breakdown: elite/anonymous/transparent counts
- [x] Database health: size, tested/hour, added/day, dead count
- [x] Enhanced Tor pool stats: requests, success rate, healthy nodes, latency
- [x] SQLite ANALYZE/VACUUM functions for query optimization
- [x] Lightweight design: client-side polling, minimal DOM updates

### Dashboard Enhancements v3 (Done)
- [x] Electric cyan theme with translucent glass-morphism effects
- [x] Unified wrapper styling (.chart-wrap, .histo-wrap, .stats-wrap, .lb-wrap, .pie-wrap)
- [x] Consistent backdrop-filter blur and electric glow borders
- [x] Tor Exit Nodes cards with hover effects (.tor-card)
- [x] Lighter background/tile color scheme (#1e2738 bg, #181f2a card)
- [x] Map endpoint restyled to match dashboard (electric cyan theme)
- [x] Map markers updated from gold to cyan for approximate locations

### Memory Profiling & Analysis (Done)
- [x] /api/memory endpoint with comprehensive memory stats
- [x] objgraph integration for object type counting
- [x] pympler integration for memory summaries
- [x] Memory sample history tracking (RSS over time)
- [x] Process memory from /proc/self/status (VmRSS, VmPeak, VmData, etc.)
- [x] GC statistics (collections, objects, thresholds)

### MITM Detection Optimization (Done)
- [x] MITM re-test skip optimization - avoid redundant SSL checks for known MITM proxies
- [x] mitm_retest_skipped stats counter for tracking optimization effectiveness
- [x] Content hash deduplication for stale proxy list detection
- [x] stale_count reset when content hash changes

### Distributed Workers (Done)
- [x] Worker registration and heartbeat system
- [x] /api/workers endpoint for worker status monitoring
- [x] Tor connectivity check before workers claim work
- [x] Worker test rate tracking with sliding window calculation
- [x] Combined rate aggregation across all workers
- [x] Dashboard worker cards showing per-worker stats

### Dashboard Performance (Done)
- [x] Keyboard shortcuts: r=refresh, 1-9=tabs, t=theme, p=pause
- [x] Tab-aware chart rendering - skip expensive renders for hidden tabs
- [x] Visibility API - pause polling when browser tab hidden
- [x] Dark/muted-dark/light theme cycling
- [x] Stats export endpoint (/api/stats/export?format=json|csv)

### Proxy Validation Cache (Done)
- [x] LRU cache for is_usable_proxy() using OrderedDict
- [x] Thread-safe with lock for concurrent access
- [x] Proper LRU eviction (move_to_end on hits, popitem oldest when full)

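The three cache items above describe a standard pattern, sketched here under stated assumptions. The class name `LRUCache`, its `get`/`put` methods, and the `maxsize` default are hypothetical and are not the code in fetch.py. Note that `OrderedDict.move_to_end` exists only in Python 3, so this sketch targets Python 3.

```python
import threading
from collections import OrderedDict


class LRUCache(object):
    """Small thread-safe least-recently-used cache."""

    def __init__(self, maxsize=1024):
        self._data = OrderedDict()
        self._lock = threading.Lock()
        self._maxsize = maxsize

    def get(self, key, default=None):
        with self._lock:
            if key not in self._data:
                return default
            # A hit marks the entry as most recently used.
            self._data.move_to_end(key)
            return self._data[key]

    def put(self, key, value):
        with self._lock:
            self._data[key] = value
            self._data.move_to_end(key)
            if len(self._data) > self._maxsize:
                # Evict the least recently used entry (front of the dict).
                self._data.popitem(last=False)


cache = LRUCache(maxsize=2)
cache.put("1.2.3.4:8080", True)
cache.put("5.6.7.8:1080", False)
cache.get("1.2.3.4:8080")        # refreshes the first entry
cache.put("9.9.9.9:3128", True)  # evicts "5.6.7.8:1080"
```

Holding one lock around both lookup and reordering keeps the recency order consistent when several worker threads query the cache at once.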
### Database Context Manager (Done)
- [x] Refactored all DB operations to use `_db_context()` context manager
- [x] Connections guaranteed to close even on exceptions
- [x] Removed deprecated `_prep_db()` and `_close_db()` methods
- [x] `fetch_rows()` now accepts db parameter for cleaner dependency injection

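The guarantee that connections close even on exceptions is what a `try`/`finally` inside a context manager provides. The sketch below is a generic version built on `contextlib` and `sqlite3`; the name `db_context` and its `path` argument are assumptions and do not reproduce the project's `_db_context()` method.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def db_context(path):
    """Yield a connection that is always closed when the block exits."""
    conn = sqlite3.connect(path)
    try:
        yield conn
        # Commit only when the block finished without raising.
        conn.commit()
    finally:
        conn.close()


with db_context(":memory:") as db:
    db.execute("CREATE TABLE t (x INTEGER)")
    db.execute("INSERT INTO t VALUES (1)")
    rows = db.execute("SELECT x FROM t").fetchall()
```

Passing the yielded connection into helper functions, rather than having them open their own, is the same dependency-injection idea the `fetch_rows()` item describes.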
### Additional Search Engines (Done)
- [x] Bing and Yahoo engine implementations in scraper.py
- [x] Engine rotation for rate limit avoidance
- [x] engines.py module with SearchEngine base class

### Worker Health Improvements (Done)
- [x] Tor connectivity check before workers claim work
- [x] Fixed interval Tor check (30s) instead of exponential backoff
- [x] Graceful handling when Tor unavailable

### Memory Optimization (Done)
- [x] `__slots__` on ProxyTestState (27 attrs) and TargetTestJob (4 attrs)
- [x] Reduced per-object memory overhead for hot path objects

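To show why `__slots__` reduces per-object overhead, here is a small sketch. The class `TargetJobSketch` and its four attribute names are hypothetical; the roadmap says TargetTestJob has four attributes but does not name them.

```python
class TargetJobSketch(object):
    """Job object with a fixed attribute set and no per-instance __dict__."""

    # Declaring __slots__ stores attributes in fixed slots instead of a
    # dictionary, which saves memory when many instances are alive.
    __slots__ = ("proxy", "target", "result", "latency")

    def __init__(self, proxy, target):
        self.proxy = proxy
        self.target = target
        self.result = None
        self.latency = None


job = TargetJobSketch("1.2.3.4:8080", "example.org")
has_dict = hasattr(job, "__dict__")
```

The trade-off is that attributes not listed in `__slots__` cannot be added later; assigning one raises `AttributeError`.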
---

@@ -375,37 +73,33 @@ PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework design

| Item | Description | Risk |
|------|-------------|------|
| ~~Dual _known_proxies~~ | ~~ppf.py and fetch.py maintain separate caches~~ | **Resolved** |
| Global config in fetch.py | set_config() pattern is fragile | Low - works but not clean |
| ~~No input validation~~ | ~~Proxy strings parsed without validation~~ | **Resolved** |
| ~~Silent exception catching~~ | ~~Some except: pass patterns hide errors~~ | **Resolved** |
| ~~Hardcoded timeouts~~ | ~~Various timeout values scattered in code~~ | **Resolved** |

---

## File Reference

| File | Purpose | Status |
|------|---------|--------|
| ppf.py | Main URL harvester daemon | Active, cleaned |
| proxywatchd.py | Proxy validation daemon | Active, enhanced |
| scraper.py | Searx search integration | Active, cleaned |
| fetch.py | HTTP fetching with proxy support | Active, LRU cache |
| dbs.py | Database schema and inserts | Active |
| mysqlite.py | SQLite wrapper | Active |
| rocksock.py | Socket/proxy abstraction (3rd party) | Stable |
| http2.py | HTTP client implementation | Stable |
| httpd.py | Web dashboard and REST API server | Active, enhanced |
| config.py | Configuration management | Active |
| comboparse.py | Config/arg parser framework | Stable, cleaned |
| soup_parser.py | BeautifulSoup wrapper | Stable, cleaned |
| misc.py | Utilities (timestamp, logging) | Stable, cleaned |
| export.py | Proxy export CLI tool | Active |
| engines.py | Search engine implementations | Active |
| connection_pool.py | Tor connection pooling | Active |
| network_stats.py | Network statistics tracking | Active |
| dns.py | DNS resolution with caching | Active |
| mitm.py | MITM certificate detection | Active |
| job.py | Priority job queue | Active |
| static/dashboard.js | Dashboard frontend logic | Active, enhanced |
| static/dashboard.html | Dashboard HTML template | Active |
| File | Purpose |
|------|---------|
| ppf.py | Main URL harvester daemon |
| proxywatchd.py | Proxy validation daemon |
| scraper.py | Searx search integration |
| fetch.py | HTTP fetching with proxy support |
| dbs.py | Database schema and inserts |
| mysqlite.py | SQLite wrapper |
| rocksock.py | Socket/proxy abstraction (3rd party) |
| http2.py | HTTP client implementation |
| httpd.py | Web dashboard and REST API server |
| config.py | Configuration management |
| comboparse.py | Config/arg parser framework |
| soup_parser.py | BeautifulSoup wrapper |
| misc.py | Utilities (timestamp, logging) |
| export.py | Proxy export CLI tool |
| engines.py | Search engine implementations |
| connection_pool.py | Tor connection pooling |
| network_stats.py | Network statistics tracking |
| dns.py | DNS resolution with caching |
| mitm.py | MITM certificate detection |
| job.py | Priority job queue |
| static/dashboard.js | Dashboard frontend logic |
| static/dashboard.html | Dashboard HTML template |