PPF Implementation Tasks
Legend
[ ] Not started
[~] In progress
[x] Completed
[!] Blocked/needs discussion
Immediate Priority (Next Sprint)
[x] 1. Unify _known_proxies Cache
Completed. Added init_known_proxies(), add_known_proxies(), is_known_proxy()
to fetch.py. Updated ppf.py to use these functions instead of local cache.
[x] 2. Graceful SQLite Error Handling
Completed. mysqlite.py now retries on "locked" errors with exponential backoff.
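A minimal sketch of the retry loop; the function name and delay values are illustrative, not the exact mysqlite.py code:
import sqlite3
import time

def execute_with_retry(conn, query, params=(), retries=5):
    delay = 0.1
    for attempt in range(retries):
        try:
            return conn.execute(query, params)
        except sqlite3.OperationalError as e:
            if 'locked' not in str(e) or attempt == retries - 1:
                raise
            time.sleep(delay)   # exponential backoff: 0.1s, 0.2s, 0.4s, ...
            delay *= 2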
[x] 3. Enable SQLite WAL Mode
Completed. mysqlite.py enables WAL mode and NORMAL synchronous on init.
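The two pragmas amount to (conn being the sqlite3 connection):
conn.execute('PRAGMA journal_mode=WAL')     # readers no longer block the writer
conn.execute('PRAGMA synchronous=NORMAL')   # fewer fsyncs; safe under WAL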
[x] 4. Batch Database Inserts
Completed. dbs.py uses executemany() for batch inserts.
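The pattern, with column names taken from the proxylist schema used elsewhere in this document (OR IGNORE as duplicate handling is an assumption):
rows = [('http', '1.2.3.4:8080'), ('socks5', '5.6.7.8:1080')]
conn.executemany('INSERT OR IGNORE INTO proxylist (proto, proxy) VALUES (?, ?)', rows)
conn.commit()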
[x] 5. Add Database Indexes
Completed. dbs.py creates indexes on failed, tested, proto, error, check_time.
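Roughly (index names are illustrative):
for col in ('failed', 'tested', 'proto', 'error', 'check_time'):
    conn.execute('CREATE INDEX IF NOT EXISTS idx_%s ON proxylist (%s)' % (col, col))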
Short Term (This Month)
[x] 6. Log Level Filtering
Completed. Added log level filtering with -q/--quiet and -v/--verbose CLI flags.
- misc.py: LOG_LEVELS dict, set_log_level(), get_log_level()
- config.py: Added -q/--quiet and -v/--verbose arguments
- Log levels: debug=0, info=1, warn=2, error=3
- --quiet: only show warn/error
- --verbose: show debug messages
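A minimal sketch of the gate; LOG_LEVELS, set_log_level(), and get_log_level() match the names above, while the log() call site is hypothetical:
LOG_LEVELS = {'debug': 0, 'info': 1, 'warn': 2, 'error': 3}
_current_level = LOG_LEVELS['info']   # default: info and above

def set_log_level(name):
    global _current_level
    _current_level = LOG_LEVELS[name]

def get_log_level():
    return _current_level

def log(level, msg):
    # hypothetical helper: drop messages below the active threshold
    if LOG_LEVELS[level] >= _current_level:
        print('[%s] %s' % (level, msg))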
[x] 7. Connection Timeout Standardization
Completed. Added timeout_connect and timeout_read to [common] section in config.py.
[x] 8. Failure Categorization
Completed. Added failure categorization for proxy errors.
- misc.py: categorize_error() function, FAIL_* constants
- Categories: timeout, refused, auth, unreachable, dns, ssl, closed, proxy, other
- proxywatchd.py: Stats.record() now accepts category parameter
- Stats.report() shows failure breakdown by category
- ProxyTestState.evaluate() returns (success, category) tuple
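A sketch of the string-matching approach, covering a few of the nine categories; the actual match strings and constant values in misc.py may differ:
FAIL_TIMEOUT = 'timeout'
FAIL_REFUSED = 'refused'
FAIL_DNS = 'dns'
FAIL_OTHER = 'other'

def categorize_error(err_msg):
    msg = err_msg.lower()
    if 'timed out' in msg or 'timeout' in msg:
        return FAIL_TIMEOUT
    if 'refused' in msg:
        return FAIL_REFUSED
    if 'getaddrinfo' in msg or 'name or service not known' in msg:
        return FAIL_DNS
    return FAIL_OTHER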
[x] 9. Priority Queue for Proxy Testing
Completed. Added priority-based job scheduling for proxy tests.
- PriorityJobQueue class with heap-based ordering
- calculate_priority() assigns priority 0-4 based on proxy state
- Priority 0: New proxies (never tested)
- Priority 1: Working proxies (no failures)
- Priority 2: Low fail count (< 3)
- Priority 3-4: Medium/high fail count
- Integrated into prepare_jobs() for automatic prioritization
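A sketch of the queue and the priority mapping; the medium/high split at 10 failures and the calculate_priority() signature are assumptions:
import heapq
import itertools

class PriorityJobQueue(object):
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # tie-breaker keeps FIFO order per priority

    def put(self, priority, job):
        heapq.heappush(self._heap, (priority, next(self._seq), job))

    def get(self):
        return heapq.heappop(self._heap)[2]

def calculate_priority(tested, fail_count):
    if not tested:
        return 0                  # new proxy, never tested
    if fail_count == 0:
        return 1                  # working proxy
    if fail_count < 3:
        return 2                  # low fail count
    return 3 if fail_count < 10 else 4   # medium/high (threshold assumed)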
[x] 10. Periodic Statistics Output
Completed. Added Stats class to proxywatchd.py with record(), should_report(), and report() methods. Integrated into main loop with configurable stats_interval.
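A sketch of the Stats shape; field names beyond the three listed methods are assumptions:
import threading
import time

class Stats(object):
    def __init__(self, interval=300):
        self._lock = threading.Lock()
        self.interval = interval
        self._last_report = time.time()
        self.tested = self.passed = self.failed = 0

    def record(self, success, category=None):
        with self._lock:
            self.tested += 1
            if success:
                self.passed += 1
            else:
                self.failed += 1

    def should_report(self):
        return time.time() - self._last_report >= self.interval

    def report(self):
        with self._lock:
            print('tested=%d passed=%d failed=%d' % (self.tested, self.passed, self.failed))
            self._last_report = time.time()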
Medium Term (Next Quarter)
[x] 11. Tor Connection Pooling
Completed. Added connection pooling with worker-Tor affinity and health monitoring.
- connection_pool.py: TorHostState class tracks per-host health, latency, backoff
- connection_pool.py: TorConnectionPool with worker affinity, warmup, statistics
- proxywatchd.py: Workers get consistent Tor host assignment for circuit reuse
- Success/failure tracking with exponential backoff (5s, 10s, 20s, 40s, max 60s)
- Latency tracking with rolling averages
- Pool status reported alongside periodic stats
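The backoff math from the notes above, as a sketch of the TorHostState bookkeeping:
import time

class TorHostState(object):
    def __init__(self, host):
        self.host = host
        self.fails = 0
        self.blocked_until = 0

    def record_failure(self):
        self.fails += 1
        delay = min(5 * 2 ** (self.fails - 1), 60)   # 5s, 10s, 20s, 40s, capped at 60s
        self.blocked_until = time.time() + delay

    def record_success(self):
        self.fails = 0
        self.blocked_until = 0

    def available(self):
        return time.time() >= self.blocked_until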
[x] 12. Dynamic Thread Scaling
Completed. Added dynamic thread scaling based on queue depth and success rate.
- ThreadScaler class in proxywatchd.py with should_scale(), status_line()
- Scales up when queue is deep (2x target) and success rate > 10%
- Scales down when queue is shallow or success rate drops
- Min/max threads derived from config.watchd.threads (1/4x to 2x)
- 30-second cooldown between scaling decisions
- _spawn_thread(), _remove_thread(), _adjust_threads() helper methods
- Scaler status reported alongside periodic stats
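A sketch of the decision logic; "queue is deep (2x target)" is read here as twice the current thread count, which is an interpretation:
import time

class ThreadScaler(object):
    def __init__(self, base_threads):
        self.min_threads = max(1, base_threads // 4)   # 1/4x of config.watchd.threads
        self.max_threads = base_threads * 2            # 2x
        self._last_decision = 0

    def should_scale(self, current, queue_depth, success_rate):
        # returns +1 (scale up), -1 (scale down) or 0; 30s cooldown
        if time.time() - self._last_decision < 30:
            return 0
        decision = 0
        if queue_depth > current * 2 and success_rate > 0.10 and current < self.max_threads:
            decision = 1
        elif (queue_depth < current or success_rate <= 0.10) and current > self.min_threads:
            decision = -1
        if decision:
            self._last_decision = time.time()
        return decision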
[x] 13. Latency Tracking
Completed. Added per-proxy latency tracking with exponential moving average.
- dbs.py: avg_latency, latency_samples columns added to proxylist schema
- dbs.py: _migrate_latency_columns() for backward-compatible migration
- dbs.py: update_proxy_latency() with EMA (alpha = 2/(samples+1))
- proxywatchd.py: ProxyTestState.last_latency_ms field
- proxywatchd.py: evaluate() calculates average latency from successful tests
- proxywatchd.py: submit_collected() records latency for passing proxies
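A sketch of the EMA update; the assumption made explicit here is that latency_samples counts the new measurement, so the first sample yields alpha = 1:
def update_proxy_latency(avg_latency, latency_samples, new_ms):
    latency_samples += 1
    alpha = 2.0 / (latency_samples + 1)      # alpha = 2/(samples+1)
    if avg_latency is None:                  # first measurement: avg starts at new_ms
        avg_latency = float(new_ms)
    else:
        avg_latency = alpha * new_ms + (1.0 - alpha) * avg_latency
    return avg_latency, latency_samples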
[ ] 14. Export Functionality
Problem: No easy way to export working proxies for use elsewhere.
Implementation:
# new file: export.py
import json

def export_proxies(proxydb, format='txt', filters=None):
    """Export working proxies to various formats."""
    query = 'SELECT proto, proxy FROM proxylist WHERE failed=0'
    params = []
    if filters and 'proto' in filters:
        query += ' AND proto=?'
        params.append(filters['proto'])
    rows = proxydb.execute(query, params).fetchall()
    if format == 'txt':
        return '\n'.join('%s://%s' % (r[0], r[1]) for r in rows)
    elif format == 'json':
        return json.dumps([{'proto': r[0], 'address': r[1]} for r in rows])
    elif format == 'csv':
        return 'proto,address\n' + '\n'.join('%s,%s' % (r[0], r[1]) for r in rows)
    raise ValueError('unknown format: %s' % format)
# CLI: python export.py --format json --proto socks5 > proxies.json
Files: new export.py Effort: Low Risk: Low
[ ] 15. Unit Test Infrastructure
Problem: No automated tests. Changes can break existing functionality silently.
Implementation:
tests/
├── __init__.py
├── test_proxy_utils.py # Test IP validation, cleansing
├── test_extract.py # Test proxy/URL extraction
├── test_database.py # Test DB operations with temp DB
└── mock_network.py # Mock rocksock for offline testing
# tests/test_proxy_utils.py
import unittest
import sys
sys.path.insert(0, '..')
import fetch

class TestProxyValidation(unittest.TestCase):
    def test_valid_proxy(self):
        self.assertTrue(fetch.is_usable_proxy('8.8.8.8:8080'))

    def test_private_ip_rejected(self):
        self.assertFalse(fetch.is_usable_proxy('192.168.1.1:8080'))
        self.assertFalse(fetch.is_usable_proxy('10.0.0.1:8080'))
        self.assertFalse(fetch.is_usable_proxy('172.16.0.1:8080'))

    def test_invalid_port_rejected(self):
        self.assertFalse(fetch.is_usable_proxy('8.8.8.8:0'))
        self.assertFalse(fetch.is_usable_proxy('8.8.8.8:99999'))

if __name__ == '__main__':
    unittest.main()
Files: tests/ directory Effort: High (initial), Low (ongoing) Risk: Low
Long Term (Future)
[x] 16. Geographic Validation
Completed. Added IP2Location and pyasn for proxy geolocation.
- requirements.txt: Added IP2Location package
- proxywatchd.py: IP2Location for country lookup, pyasn for ASN lookup
- proxywatchd.py: Fixed ValueError handling when database files missing
- data/: IP2LOCATION-LITE-DB1.BIN (2.7M), ipasn.dat (23M)
- Output shows country codes, e.g. http://1.2.3.4:8080 (US), (IN), (DE), etc.
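A sketch of the lookup calls, assuming the stock IP2Location and pyasn Python APIs; annotate() is a hypothetical helper, and the paths match the data/ files listed above:
import IP2Location
import pyasn

geodb = IP2Location.IP2Location('data/IP2LOCATION-LITE-DB1.BIN')
asndb = pyasn.pyasn('data/ipasn.dat')

def annotate(ip):
    try:
        country = geodb.get_country_short(ip)   # e.g. 'US'
        asn, prefix = asndb.lookup(ip)          # e.g. (15169, '8.8.8.0/24')
    except ValueError:
        return None   # raised when the database files are missing/unusable
    return country, asn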
[x] 17. SSL Proxy Testing
Completed. Added SSL checktype for TLS handshake validation.
- config.py: Default checktype changed to 'ssl'
- proxywatchd.py: ssl_targets list with major HTTPS sites
- Validates TLS handshake with certificate verification
- Detects MITM proxies that intercept SSL connections
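A stdlib-only sketch of the certificate-verifying handshake; the real test runs over the proxied connection, and create_default_context() needs Python 2.7.9+:
import socket
import ssl

def ssl_handshake_ok(sock, hostname):
    # a verified handshake fails if the proxy substitutes its own cert (MITM)
    ctx = ssl.create_default_context()
    try:
        wrapped = ctx.wrap_socket(sock, server_hostname=hostname)
        wrapped.close()
        return True
    except (ssl.SSLError, ssl.CertificateError, socket.error):
        return False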
[x] 18. Additional Search Engines
Completed. Added modular search engine architecture.
- engines.py: SearchEngine base class with build_url(), extract_urls(), is_rate_limited()
- Engines: DuckDuckGo, Startpage, Mojeek (UK), Qwant (FR), Yandex (RU), Ecosia, Brave
- Git hosters: GitHub, GitLab, Codeberg, Gitea
- scraper.py: EngineTracker class for multi-engine rate limiting
- Config: [scraper] engines, max_pages settings
- searx.instances: Updated with 51 active SearXNG instances
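A sketch of the base-class contract; the method bodies and the DuckDuckGo URL shown are illustrative, not the engines.py code:
import urllib

class SearchEngine(object):
    name = 'base'

    def build_url(self, query, page=0):
        raise NotImplementedError   # per-engine search URL

    def extract_urls(self, html):
        raise NotImplementedError   # parse result links from a page

    def is_rate_limited(self, status, html):
        return status == 429        # engines override to detect CAPTCHA pages

class DuckDuckGo(SearchEngine):
    name = 'duckduckgo'

    def build_url(self, query, page=0):
        return 'https://html.duckduckgo.com/html/?q=%s&s=%d' % (
            urllib.quote_plus(query), page * 30)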
[x] 19. REST API
Completed. Added HTTP API server for querying working proxies.
- httpd.py: ProxyAPIServer class with BaseHTTPServer
- Endpoints: /proxies, /proxies/count, /health
- Params: limit, proto, country, format (json/plain)
- Integrated into proxywatchd.py (starts when httpd.enabled=True)
- Config: [httpd] section with listenip, port, enabled
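A sketch of the endpoint dispatch on the Python 2 stdlib BaseHTTPServer that httpd.py builds on; the proxydb_* helpers are hypothetical stand-ins for the real database queries:
import json
from BaseHTTPServer import BaseHTTPRequestHandler

class ProxyAPIHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        path = self.path.split('?')[0]
        if path == '/health':
            body = json.dumps({'status': 'ok'})
        elif path == '/proxies/count':
            body = json.dumps({'count': self.server.proxydb_count()})
        elif path == '/proxies':
            body = json.dumps(self.server.proxydb_rows())
        else:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.end_headers()
        self.wfile.write(body)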
[ ] 20. Web Dashboard
Status page showing live statistics.
Completed
[x] Work-Stealing Queue
- Implemented shared Queue.Queue() for job distribution
- Workers pull from shared queue instead of pre-assigned lists
- Better utilization across threads
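The worker side of the pattern, sketched (the job interface is hypothetical):
import Queue   # Python 2 stdlib queue

jobs = Queue.Queue()

def worker():
    # each worker pulls the next job from the shared queue; an idle worker
    # naturally picks up work that a pre-assigned list would have left waiting
    while True:
        try:
            job = jobs.get(timeout=1)
        except Queue.Empty:
            return            # queue drained
        job.run()             # hypothetical job interface
        jobs.task_done()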
[x] Multi-Target Validation
- Test each proxy against 3 random targets
- 2/3 majority required for success
- Reduces false negatives from single target failures
[x] Interleaved Testing
- Jobs shuffled across all proxies before queueing
- Prevents burst of 3 connections to same proxy
- ProxyTestState accumulates results from TargetTestJobs
[x] Code Cleanup
- Removed 93 lines of dead HTTP server code (ppf.py)
- Removed dead gumbo parser (soup_parser.py)
- Removed test code (comboparse.py)
- Removed unused functions (misc.py)
- Fixed IP/port cleansing (ppf.py)
- Updated .gitignore
[x] Rate Limiting & Instance Tracking (scraper.py)
- InstanceTracker class with exponential backoff
- Configurable backoff_base, backoff_max, fail_threshold
- Instance cycling when rate limited
[x] Exception Logging with Context
- Replaced bare except: with typed exceptions across all files
- Added context logging to exception handlers (e.g., URL, error message)
[x] Timeout Standardization
- Added timeout_connect, timeout_read to [common] config section
- Added stale_days, stats_interval to [watchd] config section
[x] Periodic Stats & Stale Cleanup (proxywatchd.py)
- Stats class tracks tested/passed/failed with thread-safe counters
- Configurable stats_interval (default: 300s)
- cleanup_stale() removes dead proxies older than stale_days (default: 30)
[x] Unified Proxy Cache
- Moved _known_proxies to fetch.py with helper functions
- init_known_proxies(), add_known_proxies(), is_known_proxy()
- ppf.py now uses shared cache via fetch module
[x] Config Validation
- config.py: validate() method checks config values on startup
- Validates: port ranges, timeout values, thread counts, engine names
- Warns on missing source_file, unknown engines
- Errors on unwritable database directories
- Integrated into ppf.py, proxywatchd.py, scraper.py main entry points
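A sketch of representative checks; the dotted config access mirrors config.watchd.threads above, but config.common.dbfile and the exact messages are assumptions:
import os

def validate(config):
    errors = []
    if not 1 <= config.httpd.port <= 65535:
        errors.append('httpd.port out of range')
    if config.common.timeout_connect <= 0:
        errors.append('common.timeout_connect must be positive')
    if config.watchd.threads < 1:
        errors.append('watchd.threads must be >= 1')
    dbdir = os.path.dirname(os.path.abspath(config.common.dbfile))
    if not os.access(dbdir, os.W_OK):
        errors.append('database directory not writable: %s' % dbdir)
    return errors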
[x] Profiling Support
- config.py: Added --profile CLI argument
- ppf.py: Refactored main logic into main() function
- ppf.py: cProfile wrapper with stats output to profile.stats
- Prints top 20 functions by cumulative time on exit
- Usage: python2 ppf.py --profile
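The wrapper pattern, sketched around the refactored main() entry point, with the stats file and top-20 report per the notes above:
import cProfile
import pstats

def run_profiled():
    prof = cProfile.Profile()
    try:
        prof.runcall(main)   # main() as refactored in ppf.py
    finally:
        prof.dump_stats('profile.stats')
        pstats.Stats('profile.stats').sort_stats('cumulative').print_stats(20)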
[x] SIGTERM Graceful Shutdown
- ppf.py: Added signal handler converting SIGTERM to KeyboardInterrupt
- Ensures profile stats are written before container exit
- Allows clean thread shutdown in containerized environments
- Podman stop now triggers proper cleanup instead of SIGKILL
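The conversion itself is a few lines; handler name here is illustrative:
import signal

def _on_sigterm(signum, frame):
    # reuse the existing KeyboardInterrupt cleanup path (profile dump, thread joins)
    raise KeyboardInterrupt

signal.signal(signal.SIGTERM, _on_sigterm)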
[x] Unicode Exception Handling (Python 2)
- Problem: repr(e) on exceptions with unicode content caused encoding errors
- Files affected: ppf.py, scraper.py (3 exception handlers)
- Solution: check isinstance(err_msg, unicode), then encode with 'backslashreplace'
- Pattern applied:
  try:
      err_msg = repr(e)
      if isinstance(err_msg, unicode):
          err_msg = err_msg.encode('ascii', 'backslashreplace')
  except:
      err_msg = type(e).__name__
- Handles Korean/CJK characters in search queries without crashing