# PPF Implementation Tasks

## Legend

```
[ ] Not started
[~] In progress
[x] Completed
[!] Blocked/needs discussion
```

---

## Immediate Priority (Next Sprint)

### [x] 1. Unify _known_proxies Cache

**Completed.** Added `init_known_proxies()`, `add_known_proxies()`, `is_known_proxy()` to fetch.py. Updated ppf.py to use these functions instead of local cache.

---

### [x] 2. Graceful SQLite Error Handling

**Completed.** mysqlite.py now retries on "locked" errors with exponential backoff.

---

### [x] 3. Enable SQLite WAL Mode

**Completed.** mysqlite.py enables WAL mode and NORMAL synchronous on init.

---

### [x] 4. Batch Database Inserts

**Completed.** dbs.py uses executemany() for batch inserts.

---

### [x] 5. Add Database Indexes

**Completed.** dbs.py creates indexes on failed, tested, proto, error, check_time.

---

## Short Term (This Month)

### [x] 6. Log Level Filtering

**Completed.** Added log level filtering with -q/--quiet and -v/--verbose CLI flags.

- misc.py: LOG_LEVELS dict, set_log_level(), get_log_level()
- config.py: Added -q/--quiet and -v/--verbose arguments
- Log levels: debug=0, info=1, warn=2, error=3
- --quiet: only show warn/error
- --verbose: show debug messages

---

### [x] 7. Connection Timeout Standardization

**Completed.** Added timeout_connect and timeout_read to [common] section in config.py.

---

### [x] 8. Failure Categorization

**Completed.** Added failure categorization for proxy errors.

- misc.py: categorize_error() function, FAIL_* constants
- Categories: timeout, refused, auth, unreachable, dns, ssl, closed, proxy, other
- proxywatchd.py: Stats.record() now accepts category parameter
- Stats.report() shows failure breakdown by category
- ProxyTestState.evaluate() returns (success, category) tuple

---

### [x] 9. Priority Queue for Proxy Testing

**Completed.** Added priority-based job scheduling for proxy tests.

- PriorityJobQueue class with heap-based ordering
- calculate_priority() assigns priority 0-4 based on proxy state
  - Priority 0: New proxies (never tested)
  - Priority 1: Working proxies (no failures)
  - Priority 2: Low fail count (< 3)
  - Priority 3-4: Medium/high fail count
- Integrated into prepare_jobs() for automatic prioritization

---

### [x] 10. Periodic Statistics Output

**Completed.** Added Stats class to proxywatchd.py with record(), should_report(), and report() methods. Integrated into main loop with configurable stats_interval.

---

## Medium Term (Next Quarter)

### [x] 11. Tor Connection Pooling

**Completed.** Added connection pooling with worker-Tor affinity and health monitoring.

- connection_pool.py: TorHostState class tracks per-host health, latency, backoff
- connection_pool.py: TorConnectionPool with worker affinity, warmup, statistics
- proxywatchd.py: Workers get consistent Tor host assignment for circuit reuse
- Success/failure tracking with exponential backoff (5s, 10s, 20s, 40s, max 60s)
- Latency tracking with rolling averages
- Pool status reported alongside periodic stats

---

### [ ] 12. Dynamic Thread Scaling

**Problem:** Fixed thread count regardless of success rate or system load.

**Implementation:**

```python
# proxywatchd.py
from collections import deque


class ThreadScaler:
    """Dynamically adjust thread count based on performance."""

    def __init__(self, min_threads=5, max_threads=50):
        self.min = min_threads
        self.max = max_threads
        self.current = min_threads
        # Sliding window of the last 100 results; older entries drop off automatically.
        self.success_rate_window = deque(maxlen=100)

    def record_result(self, success):
        self.success_rate_window.append(bool(success))

    def recommended_threads(self):
        # Too few samples to judge; keep the current count.
        if len(self.success_rate_window) < 20:
            return self.current
        # float() avoids integer division under Python 2.
        success_rate = sum(self.success_rate_window) / float(len(self.success_rate_window))
        # High success rate -> can handle more threads (never above max)
        if success_rate > 0.7 and self.current < self.max:
            return min(self.current + 5, self.max)
        # Low success rate -> reduce load (never below min)
        elif success_rate < 0.3 and self.current > self.min:
            return max(self.current - 5, self.min)
        return self.current
```

**Files:** proxywatchd.py
**Effort:** Medium
**Risk:** Medium

---

### [ ] 13. Latency Tracking

**Problem:** No visibility into proxy speed. A working but slow proxy may be less useful than a fast one.

**Implementation:**

```python
# dbs.py - add columns
# ALTER TABLE proxylist ADD COLUMN avg_latency REAL DEFAULT 0
# ALTER TABLE proxylist ADD COLUMN latency_samples INTEGER DEFAULT 0

def update_proxy_latency(proxydb, proxy, latency):
    """Update rolling average latency for proxy."""
    row = proxydb.execute(
        'SELECT avg_latency, latency_samples FROM proxylist WHERE proxy=?',
        (proxy,)
    ).fetchone()
    if row:
        old_avg, samples = row
        # Cumulative mean, not an exponential moving average. Once
        # latency_samples reaches the cap of 100, each new sample gets a
        # fixed weight of 1/101. float() avoids integer division on Python 2.
        new_avg = (old_avg * samples + latency) / float(samples + 1)
        new_samples = min(samples + 1, 100)  # Cap at 100 samples
        proxydb.execute(
            'UPDATE proxylist SET avg_latency=?, latency_samples=? WHERE proxy=?',
            (new_avg, new_samples, proxy)
        )
```

**Files:** dbs.py, proxywatchd.py
**Effort:** Medium
**Risk:** Low

---

### [ ] 14. Export Functionality

**Problem:** No easy way to export working proxies for use elsewhere.

**Implementation:**

```python
# new file: export.py
import json


def export_proxies(proxydb, format='txt', filters=None):
    """Export working proxies to various formats."""
    query = 'SELECT proto, proxy FROM proxylist WHERE failed=0'
    params = []
    if filters:
        if 'proto' in filters:
            query += ' AND proto=?'
            # Bind the value for the placeholder added above.
            params.append(filters['proto'])
    rows = proxydb.execute(query, tuple(params)).fetchall()

    if format == 'txt':
        return '\n'.join('%s://%s' % (r[0], r[1]) for r in rows)
    elif format == 'json':
        return json.dumps([{'proto': r[0], 'address': r[1]} for r in rows])
    elif format == 'csv':
        return 'proto,address\n' + '\n'.join('%s,%s' % (r[0], r[1]) for r in rows)
    # Fail loudly instead of silently returning None.
    raise ValueError('unsupported export format: %s' % format)

# CLI: python export.py --format json --proto socks5 > proxies.json
```

**Files:** new export.py
**Effort:** Low
**Risk:** Low

---

### [ ] 15. Unit Test Infrastructure

**Problem:** No automated tests. Changes can break existing functionality silently.

**Implementation:**

```
tests/
├── __init__.py
├── test_proxy_utils.py    # Test IP validation, cleansing
├── test_extract.py        # Test proxy/URL extraction
├── test_database.py       # Test DB operations with temp DB
└── mock_network.py        # Mock rocksock for offline testing
```

```python
# tests/test_proxy_utils.py
import os
import sys
import unittest

# Make the project root importable regardless of the current working directory.
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), '..'))
import fetch


class TestProxyValidation(unittest.TestCase):
    def test_valid_proxy(self):
        self.assertTrue(fetch.is_usable_proxy('8.8.8.8:8080'))

    def test_private_ip_rejected(self):
        self.assertFalse(fetch.is_usable_proxy('192.168.1.1:8080'))
        self.assertFalse(fetch.is_usable_proxy('10.0.0.1:8080'))
        self.assertFalse(fetch.is_usable_proxy('172.16.0.1:8080'))

    def test_invalid_port_rejected(self):
        self.assertFalse(fetch.is_usable_proxy('8.8.8.8:0'))
        self.assertFalse(fetch.is_usable_proxy('8.8.8.8:99999'))


if __name__ == '__main__':
    unittest.main()
```

**Files:** tests/ directory
**Effort:** High (initial), Low (ongoing)
**Risk:** Low

---

## Long Term (Future)

### [ ] 16. Geographic Validation

Verify that a proxy actually routes through its claimed location using IP geolocation.

### [ ] 17. HTTPS/SSL Proxy Testing

Add capability to test HTTPS CONNECT proxies.

### [x] 18. Additional Search Engines

**Completed.** Added modular search engine architecture.

- engines.py: SearchEngine base class with build_url(), extract_urls(), is_rate_limited()
- Engines: DuckDuckGo, Startpage, Mojeek (UK), Qwant (FR), Yandex (RU), Ecosia, Brave
- Git hosters: GitHub, GitLab, Codeberg, Gitea
- scraper.py: EngineTracker class for multi-engine rate limiting
- Config: [scraper] engines, max_pages settings
- searx.instances: Updated with 51 active SearXNG instances

### [x] 19. REST API

**Completed.** Added HTTP API server for querying working proxies.

- httpd.py: ProxyAPIServer class with BaseHTTPServer
- Endpoints: /proxies, /proxies/count, /health
- Params: limit, proto, country, format (json/plain)
- Integrated into proxywatchd.py (starts when httpd.enabled=True)
- Config: [httpd] section with listenip, port, enabled

### [ ] 20. Web Dashboard

Status page showing live statistics.

---

## Completed

### [x] Work-Stealing Queue

- Implemented shared Queue.Queue() for job distribution
- Workers pull from shared queue instead of pre-assigned lists
- Better utilization across threads

### [x] Multi-Target Validation

- Test each proxy against 3 random targets
- 2/3 majority required for success
- Reduces false negatives from single target failures

### [x] Interleaved Testing

- Jobs shuffled across all proxies before queueing
- Prevents burst of 3 connections to same proxy
- ProxyTestState accumulates results from TargetTestJobs

### [x] Code Cleanup

- Removed 93 lines of dead HTTP server code (ppf.py)
- Removed dead gumbo parser (soup_parser.py)
- Removed test code (comboparse.py)
- Removed unused functions (misc.py)
- Fixed IP/port cleansing (ppf.py)
- Updated .gitignore

### [x] Rate Limiting & Instance Tracking (scraper.py)

- InstanceTracker class with exponential backoff
- Configurable backoff_base, backoff_max, fail_threshold
- Instance cycling when rate limited

### [x] Exception Logging with Context

- Replaced bare `except:` with typed exceptions across all files
- Added context logging to exception handlers (e.g., URL, error message)

### [x] Timeout Standardization

- Added timeout_connect, timeout_read to [common] config section
- Added stale_days, stats_interval to [watchd] config section

### [x] Periodic Stats & Stale Cleanup (proxywatchd.py)

- Stats class tracks tested/passed/failed with thread-safe counters
- Configurable stats_interval (default: 300s)
- cleanup_stale() removes dead proxies older than stale_days (default: 30)

### [x] Unified Proxy Cache

- Moved _known_proxies to fetch.py with helper functions
- init_known_proxies(), add_known_proxies(), is_known_proxy()
- ppf.py now uses shared cache via fetch module

### [x] Config Validation

- config.py: validate() method checks config values on startup
- Validates: port ranges, timeout values, thread counts, engine names
- Warns on missing source_file, unknown engines
- Errors on unwritable database directories
- Integrated into ppf.py, proxywatchd.py, scraper.py main entry points
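
The startup checks listed under Config Validation could look roughly like the sketch below. It is an illustration only, not the actual config.py code: the function name `validate_settings`, the flat dict of settings, the key names (`port`, `threads`, `database`, and so on), and the `KNOWN_ENGINES` set are all assumptions made for the example.

```python
import os

# Hypothetical engine names for illustration; the real list would come from engines.py.
KNOWN_ENGINES = {'duckduckgo', 'startpage', 'mojeek', 'qwant', 'yandex', 'ecosia', 'brave'}


def validate_settings(settings):
    """Return (errors, warnings) for a flat dict of config values.

    Hypothetical helper: the key names are assumptions, not the real schema.
    """
    errors, warnings = [], []

    # Port must be a valid TCP port.
    port = settings.get('port', 0)
    if not 1 <= port <= 65535:
        errors.append('port out of range: %r' % (port,))

    # Timeouts must be positive.
    for key in ('timeout_connect', 'timeout_read'):
        value = settings.get(key, 0)
        if value <= 0:
            errors.append('%s must be positive: %r' % (key, value))

    # At least one worker thread is required.
    threads = settings.get('threads', 0)
    if threads < 1:
        errors.append('threads must be >= 1: %r' % (threads,))

    # Unknown engines only produce warnings.
    for engine in settings.get('engines', []):
        if engine.lower() not in KNOWN_ENGINES:
            warnings.append('unknown engine: %s' % engine)

    # A configured but missing source file is a warning, not an error.
    source_file = settings.get('source_file')
    if source_file and not os.path.isfile(source_file):
        warnings.append('missing source_file: %s' % source_file)

    # The directory holding the database must be writable.
    database = settings.get('database', 'proxies.sqlite')
    db_dir = os.path.dirname(os.path.abspath(database))
    if not os.access(db_dir, os.W_OK):
        errors.append('database directory not writable: %s' % db_dir)

    return errors, warnings
```

An entry point would call such a helper once at startup, log the warnings, and exit if the error list is non-empty. Returning two lists rather than raising on the first problem lets all issues be reported in a single run.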