ppf/TODO.md
2025-12-24 00:20:40 +01:00

PPF Implementation Tasks

Legend

[ ] Not started
[~] In progress
[x] Completed
[!] Blocked/needs discussion

Immediate Priority (Next Sprint)

[x] 1. Unify _known_proxies Cache

Completed. Added init_known_proxies(), add_known_proxies(), is_known_proxy() to fetch.py. Updated ppf.py to use these functions instead of local cache.


[x] 2. Graceful SQLite Error Handling

Completed. mysqlite.py now retries on "locked" errors with exponential backoff.
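The retry pattern can be sketched as follows (execute_with_retry and its parameters are illustrative names, not the actual mysqlite.py API):

```python
import sqlite3
import time

def execute_with_retry(cursor, sql, params=(), retries=5, base_delay=0.1):
    """Retry a statement while SQLite reports 'database is locked'.

    The delay doubles on each attempt: 0.1s, 0.2s, 0.4s, ...
    """
    for attempt in range(retries):
        try:
            return cursor.execute(sql, params)
        except sqlite3.OperationalError as e:
            # Re-raise anything that is not a lock, and give up on the last try.
            if 'locked' not in str(e) or attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
execute_with_retry(cur, 'CREATE TABLE t (x INTEGER)')
execute_with_retry(cur, 'INSERT INTO t VALUES (?)', (1,))
```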


[x] 3. Enable SQLite WAL Mode

Completed. mysqlite.py enables WAL mode and NORMAL synchronous on init.
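The pragmas involved, as a minimal standalone sketch (a file path is needed; in-memory databases cannot use WAL):

```python
import os
import sqlite3
import tempfile

# WAL lets readers proceed concurrently with a single writer;
# synchronous=NORMAL is safe in WAL mode and cuts fsync overhead.
path = os.path.join(tempfile.mkdtemp(), 'demo.db')
conn = sqlite3.connect(path)
conn.execute('PRAGMA journal_mode=WAL')
conn.execute('PRAGMA synchronous=NORMAL')
print(conn.execute('PRAGMA journal_mode').fetchone()[0])  # wal
```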


[x] 4. Batch Database Inserts

Completed. dbs.py uses executemany() for batch inserts.


[x] 5. Add Database Indexes

Completed. dbs.py creates indexes on failed, tested, proto, error, check_time.


Short Term (This Month)

[x] 6. Log Level Filtering

Completed. Added log level filtering with -q/--quiet and -v/--verbose CLI flags.

  • misc.py: LOG_LEVELS dict, set_log_level(), get_log_level()
  • config.py: Added -q/--quiet and -v/--verbose arguments
  • Log levels: debug=0, info=1, warn=2, error=3
  • --quiet: only show warn/error
  • --verbose: show debug messages

[x] 7. Connection Timeout Standardization

Completed. Added timeout_connect and timeout_read to [common] section in config.py.


[x] 8. Failure Categorization

Completed. Added failure categorization for proxy errors.

  • misc.py: categorize_error() function, FAIL_* constants
  • Categories: timeout, refused, auth, unreachable, dns, ssl, closed, proxy, other
  • proxywatchd.py: Stats.record() now accepts category parameter
  • Stats.report() shows failure breakdown by category
  • ProxyTestState.evaluate() returns (success, category) tuple
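The substring-matching shape of categorize_error(), sketched with a subset of the categories (the match strings here are assumptions; the real table in misc.py covers all nine categories):

```python
FAIL_TIMEOUT = 'timeout'
FAIL_REFUSED = 'refused'
FAIL_DNS = 'dns'
FAIL_SSL = 'ssl'
FAIL_OTHER = 'other'

# Substring -> category, checked in order; first match wins.
_ERROR_PATTERNS = [
    ('timed out', FAIL_TIMEOUT),
    ('refused', FAIL_REFUSED),
    ('name or service not known', FAIL_DNS),
    ('certificate', FAIL_SSL),
]

def categorize_error(message):
    msg = message.lower()
    for pattern, category in _ERROR_PATTERNS:
        if pattern in msg:
            return category
    return FAIL_OTHER
```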

[x] 9. Priority Queue for Proxy Testing

Completed. Added priority-based job scheduling for proxy tests.

  • PriorityJobQueue class with heap-based ordering
  • calculate_priority() assigns priority 0-4 based on proxy state
  • Priority 0: New proxies (never tested)
  • Priority 1: Working proxies (no failures)
  • Priority 2: Low fail count (< 3)
  • Priority 3-4: Medium/high fail count
  • Integrated into prepare_jobs() for automatic prioritization
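A minimal sketch of the heap-based queue and the priority function (the fail-count cutoff between priority 3 and 4 is an assumption; the source only says medium/high):

```python
import heapq
import itertools

_counter = itertools.count()  # tie-breaker: FIFO within the same priority

def calculate_priority(tested, fail_count):
    if not tested:
        return 0                        # new proxy, never tested
    if fail_count == 0:
        return 1                        # working proxy
    if fail_count < 3:
        return 2                        # low fail count
    return 3 if fail_count < 10 else 4  # medium/high fail count (assumed split)

class PriorityJobQueue(object):
    def __init__(self):
        self._heap = []

    def push(self, job, priority):
        heapq.heappush(self._heap, (priority, next(_counter), job))

    def pop(self):
        return heapq.heappop(self._heap)[2]

q = PriorityJobQueue()
q.push('old-failing', calculate_priority(True, 7))
q.push('brand-new', calculate_priority(False, 0))
q.push('working', calculate_priority(True, 0))
print(q.pop())  # brand-new
```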

[x] 10. Periodic Statistics Output

Completed. Added Stats class to proxywatchd.py with record(), should_report(), and report() methods. Integrated into main loop with configurable stats_interval.


Medium Term (Next Quarter)

[x] 11. Tor Connection Pooling

Completed. Added connection pooling with worker-Tor affinity and health monitoring.

  • connection_pool.py: TorHostState class tracks per-host health, latency, backoff
  • connection_pool.py: TorConnectionPool with worker affinity, warmup, statistics
  • proxywatchd.py: Workers get consistent Tor host assignment for circuit reuse
  • Success/failure tracking with exponential backoff (5s, 10s, 20s, 40s, max 60s)
  • Latency tracking with rolling averages
  • Pool status reported alongside periodic stats

[x] 12. Dynamic Thread Scaling

Completed. Added dynamic thread scaling based on queue depth and success rate.

  • ThreadScaler class in proxywatchd.py with should_scale(), status_line()
  • Scales up when queue is deep (2x target) and success rate > 10%
  • Scales down when queue is shallow or success rate drops
  • Min/max threads derived from config.watchd.threads (1/4x to 2x)
  • 30-second cooldown between scaling decisions
  • _spawn_thread(), _remove_thread(), _adjust_threads() helper methods
  • Scaler status reported alongside periodic stats

[x] 13. Latency Tracking

Completed. Added per-proxy latency tracking with exponential moving average.

  • dbs.py: avg_latency, latency_samples columns added to proxylist schema
  • dbs.py: _migrate_latency_columns() for backward-compatible migration
  • dbs.py: update_proxy_latency() with EMA (alpha = 2/(samples+1))
  • proxywatchd.py: ProxyTestState.last_latency_ms field
  • proxywatchd.py: evaluate() calculates average latency from successful tests
  • proxywatchd.py: submit_collected() records latency for passing proxies
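One EMA step, using the alpha = 2/(samples+1) rule from above (parameter names assumed, not the exact dbs.py signature):

```python
def update_latency(avg_latency, latency_samples, new_ms):
    """One EMA step with alpha = 2/(samples+1)."""
    samples = latency_samples + 1
    if latency_samples == 0:
        return float(new_ms), samples  # first sample seeds the average
    alpha = 2.0 / (samples + 1)
    return alpha * new_ms + (1 - alpha) * avg_latency, samples

avg, n = 0.0, 0
for ms in (100, 200, 150):
    avg, n = update_latency(avg, n, ms)
# avg is now ~158.3: recent samples weigh more than old ones
```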

[x] 14. Export Functionality

Completed. Added export.py CLI tool for exporting working proxies.

  • Formats: txt (default), json, csv, len (length-prefixed)
  • Filters: --proto, --country, --anonymity, --max-latency
  • Options: --sort (latency, added, tested, success), --limit, --pretty
  • Output: stdout or --output file
  • Usage: python export.py --proto http --country US --sort latency --limit 100

[ ] 15. Unit Test Infrastructure

Problem: No automated tests. Changes can break existing functionality silently.

Implementation:

tests/
├── __init__.py
├── test_proxy_utils.py    # Test IP validation, cleansing
├── test_extract.py        # Test proxy/URL extraction
├── test_database.py       # Test DB operations with temp DB
└── mock_network.py        # Mock rocksock for offline testing
# tests/test_proxy_utils.py
import os
import sys
import unittest

# Make the project root importable regardless of the working directory.
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), '..'))
import fetch

class TestProxyValidation(unittest.TestCase):
    def test_valid_proxy(self):
        self.assertTrue(fetch.is_usable_proxy('8.8.8.8:8080'))

    def test_private_ip_rejected(self):
        self.assertFalse(fetch.is_usable_proxy('192.168.1.1:8080'))
        self.assertFalse(fetch.is_usable_proxy('10.0.0.1:8080'))
        self.assertFalse(fetch.is_usable_proxy('172.16.0.1:8080'))

    def test_invalid_port_rejected(self):
        self.assertFalse(fetch.is_usable_proxy('8.8.8.8:0'))
        self.assertFalse(fetch.is_usable_proxy('8.8.8.8:99999'))

if __name__ == '__main__':
    unittest.main()

Files: tests/ directory. Effort: High (initial), Low (ongoing). Risk: Low.


Long Term (Future)

[x] 16. Geographic Validation

Completed. Added IP2Location and pyasn for proxy geolocation.

  • requirements.txt: Added IP2Location package
  • proxywatchd.py: IP2Location for country lookup, pyasn for ASN lookup
  • proxywatchd.py: Fixed ValueError handling when database files missing
  • data/: IP2LOCATION-LITE-DB1.BIN (2.7M), ipasn.dat (23M)
  • Output shows country codes: http://1.2.3.4:8080 (US) or (IN), (DE), etc.

[x] 17. SSL Proxy Testing

Completed. Added SSL checktype for TLS handshake validation.

  • config.py: Default checktype changed to 'ssl'
  • proxywatchd.py: ssl_targets list with major HTTPS sites
  • Validates TLS handshake with certificate verification
  • Detects MITM proxies that intercept SSL connections

[x] 18. Additional Search Engines

Completed. Added modular search engine architecture.

  • engines.py: SearchEngine base class with build_url(), extract_urls(), is_rate_limited()
  • Engines: DuckDuckGo, Startpage, Mojeek (UK), Qwant (FR), Yandex (RU), Ecosia, Brave
  • Git hosters: GitHub, GitLab, Codeberg, Gitea
  • scraper.py: EngineTracker class for multi-engine rate limiting
  • Config: [scraper] engines, max_pages settings
  • searx.instances: Updated with 51 active SearXNG instances

[x] 19. REST API

Completed. Added HTTP API server for querying working proxies.

  • httpd.py: ProxyAPIServer class with BaseHTTPServer
  • Endpoints: /proxies, /proxies/count, /health
  • Params: limit, proto, country, format (json/plain)
  • Integrated into proxywatchd.py (starts when httpd.enabled=True)
  • Config: [httpd] section with listenip, port, enabled

[x] 20. Web Dashboard

Completed. Added web dashboard with live statistics.

  • httpd.py: DASHBOARD_HTML template with dark theme UI
  • Endpoint: /dashboard (HTML page with auto-refresh)
  • Endpoint: /api/stats (JSON runtime statistics)
  • Stats include: tested/passed counts, success rate, thread count, uptime
  • Tor pool health: per-host latency, success rate, availability
  • Failure categories: timeout, proxy, ssl, closed, etc.
  • proxywatchd.py: get_runtime_stats() method provides stats callback

[x] 21. Dashboard Enhancements (v2)

Completed. Major dashboard improvements for better visibility.

  • Prominent check type badge in header (SSL/JUDGES/HTTP/IRC with color coding)
  • System monitor bar: load average, memory usage, disk usage, process RSS
  • Anonymity breakdown: elite/anonymous/transparent proxy counts
  • Database health indicators: size, tested/hour, added/day, dead count
  • Enhanced Tor pool: total requests, success rate, healthy nodes, avg latency
  • SQLite ANALYZE/VACUUM functions for query optimization (dbs.py)
  • Database statistics API (get_database_stats())

[x] 22. Completion Queue Optimization

Completed. Eliminated polling bottleneck in proxy test collection.

  • Added completion_queue for event-driven state signaling
  • ProxyTestState.record_result() signals when all targets complete
  • collect_work() drains queue instead of polling all pending states
  • Changed pending_states from list to dict for O(1) removal
  • Result: is_complete() eliminated from hot path, collect_work() 54x faster
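The event-driven shape, reduced to a sketch (the real ProxyTestState carries targets, latency, and more; `queue` is Python 3's name for the Queue module used in this Python 2 codebase):

```python
import queue

completion_queue = queue.Queue()

class ProxyTestState(object):
    def __init__(self, proxy, n_targets):
        self.proxy = proxy
        self.remaining = n_targets
        self.results = []

    def record_result(self, ok):
        self.results.append(ok)
        self.remaining -= 1
        if self.remaining == 0:
            completion_queue.put(self)  # signal completion; no polling

def collect_work():
    # Drain only the states that signalled completion.
    done = []
    while True:
        try:
            done.append(completion_queue.get_nowait())
        except queue.Empty:
            return done

state = ProxyTestState('1.2.3.4:8080', 3)
for ok in (True, True, False):
    state.record_result(ok)
print([s.proxy for s in collect_work()])  # ['1.2.3.4:8080']
```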

Profiling-Based Performance Optimizations

Baseline: 30-minute profiling session, 25.6M function calls, 1842s runtime

The following optimizations were identified through cProfile analysis. Each is assessed for real-world impact based on measured data.

[x] 1. SQLite Query Batching

Completed. Added batch update functions and optimized submit_collected().

Implementation:

  • batch_update_proxy_latency(): Single SELECT with IN clause, compute EMA in Python, batch UPDATE with executemany()
  • batch_update_proxy_anonymity(): Batch all anonymity updates in single executemany()
  • submit_collected(): Uses batch functions instead of per-proxy loops

Previous State:

  • 18,182 execute() calls consuming 50.6s (2.7% of runtime)
  • Individual UPDATE for each proxy latency and anonymity

Improvement:

  • Reduced from N execute() + N commit() to 1 SELECT + 1 executemany() per batch
  • Estimated 15-25% reduction in SQLite overhead
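The three steps of the batched latency update, sketched against a minimal proxylist schema (table layout simplified; the real schema has more columns):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE proxylist '
             '(proxy TEXT PRIMARY KEY, avg_latency REAL, latency_samples INTEGER)')
conn.executemany('INSERT INTO proxylist VALUES (?, ?, ?)',
                 [('a:1', 100.0, 1), ('b:2', 200.0, 1)])

measurements = {'a:1': 140.0, 'b:2': 180.0}  # new latency per proxy

# 1) Single SELECT with an IN clause instead of one SELECT per proxy.
placeholders = ','.join('?' * len(measurements))
rows = conn.execute('SELECT proxy, avg_latency, latency_samples FROM proxylist '
                    'WHERE proxy IN (%s)' % placeholders,
                    list(measurements)).fetchall()

# 2) Compute the EMA in Python.
updates = []
for proxy, avg, n in rows:
    n += 1
    alpha = 2.0 / (n + 1)
    updates.append((alpha * measurements[proxy] + (1 - alpha) * avg, n, proxy))

# 3) One executemany() and one commit for the whole batch.
conn.executemany('UPDATE proxylist SET avg_latency=?, latency_samples=? '
                 'WHERE proxy=?', updates)
conn.commit()
```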

[ ] 2. Proxy Validation Caching

Current State:

  • is_usable_proxy(): 174,620 calls, 1.79s total
  • fetch.py:242 <genexpr>: 3,403,165 calls, 3.66s total (proxy iteration)
  • Many repeated validations for same proxy strings

Proposed Change:

  • Add LRU cache decorator to is_usable_proxy()
  • Cache size: 10,000 entries (covers typical working set)
  • TTL: None needed (IP validity doesn't change)
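The codebase is Python 2, where functools.lru_cache is unavailable (a small dict cache would be needed there); the idea in Python 3 terms, with a simplified stand-in for the validation logic:

```python
from functools import lru_cache

@lru_cache(maxsize=10000)
def is_usable_proxy(proxy):
    # Simplified stand-in for the real fetch.is_usable_proxy().
    host, _, port = proxy.rpartition(':')
    if not port.isdigit() or not 1 <= int(port) <= 65535:
        return False
    return not host.startswith(('10.', '127.', '192.168.'))

is_usable_proxy('8.8.8.8:8080')
is_usable_proxy('8.8.8.8:8080')           # second call is a cache hit
print(is_usable_proxy.cache_info().hits)  # 1
```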

Assessment:

Current cost:     5.5s per 30min = 11s/hour = 4.4min/day
Potential saving: 50-70% cache hit rate = 2.7-3.8s per 30min = 5-8s/hour
Effort:           Very low (add @lru_cache decorator)
Risk:             None (pure function, deterministic output)

Verdict: LOW PRIORITY. Minimal gain for minimal effort. Do if convenient.


[x] 3. Regex Pattern Pre-compilation

Completed. Pre-compiled proxy extraction pattern at module load.

Implementation:

  • fetch.py: Added PROXY_PATTERN = re.compile(r'...') at module level
  • extract_proxies(): Changed re.findall(pattern, ...) to PROXY_PATTERN.findall(...)
  • Pattern compiled once at import, not on each call
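The change in miniature (the pattern below is illustrative; the real PROXY_PATTERN in fetch.py differs):

```python
import re

# Compiled once at import time rather than inside extract_proxies().
PROXY_PATTERN = re.compile(r'(\d{1,3}(?:\.\d{1,3}){3}):(\d{2,5})')

def extract_proxies(text):
    return ['%s:%s' % pair for pair in PROXY_PATTERN.findall(text)]

print(extract_proxies('found 8.8.8.8:8080 and 1.2.3.4:3128'))
# ['8.8.8.8:8080', '1.2.3.4:3128']
```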

Previous State:

  • extract_proxies(): 166 calls, 2.87s total (17.3ms each)
  • Pattern recompiled on each call

Improvement:

  • Eliminated per-call regex compilation overhead
  • Estimated 30-50% reduction in extract_proxies() time

[ ] 4. JSON Stats Response Caching

Current State:

  • 1.9M calls to JSON encoder functions
  • _iterencode_dict: 1.4s, _iterencode_list: 0.8s
  • Dashboard polls every 3 seconds = 600 requests per 30min
  • Most stats data unchanged between requests

Proposed Change:

  • Cache serialized JSON response with short TTL (1-2 seconds)
  • Only regenerate when underlying stats change
  • Use ETag/If-None-Match for client-side caching
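A sketch of the TTL cache around the JSON serialization (class and attribute names are hypothetical; the ETag half is omitted):

```python
import json
import time

class CachedStats(object):
    """Serialize the stats dict at most once per ttl seconds."""
    def __init__(self, get_stats, ttl=2.0):
        self._get_stats = get_stats
        self._ttl = ttl
        self._body = None
        self._expires = 0.0
        self.encodes = 0   # counter for demonstration only

    def response(self):
        now = time.time()
        if self._body is None or now >= self._expires:
            self._body = json.dumps(self._get_stats())
            self._expires = now + self._ttl
            self.encodes += 1
        return self._body

cache = CachedStats(lambda: {'tested': 100, 'passed': 7}, ttl=60)
for _ in range(5):
    body = cache.response()
print(cache.encodes)  # 1: five requests, one JSON encode
```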

Assessment:

Current cost:     ~5.5s per 30min (JSON encoding overhead)
Potential saving: 60-80% = 3.3-4.4s per 30min = 6.6-8.8s/hour
Effort:           Medium (add caching layer to httpd.py)
Risk:             Low (stale stats for 1-2 seconds acceptable)

Verdict: LOW PRIORITY. Only matters with frequent dashboard access.


[ ] 5. Object Pooling for Test States

Current State:

  • __new__ calls: 43,413 at 10.1s total
  • ProxyTestState.__init__: 18,150 calls, 0.87s
  • TargetTestJob creation: similar overhead
  • Objects created and discarded each test cycle

Proposed Change:

  • Implement object pool for ProxyTestState and TargetTestJob
  • Reset and reuse objects instead of creating new
  • Pool size: 2x thread count

Assessment:

Current cost:     ~11s per 30min = 22s/hour = ~9min/day
Potential saving: 50-70% = 5.5-7.7s per 30min = 11-15s/hour = 4.4-6min/day
Effort:           High (significant refactoring, reset logic needed)
Risk:             Medium (state leakage bugs if reset incomplete)

Verdict: NOT RECOMMENDED. High effort, medium risk, modest gain. Python's object creation is already optimized. Focus elsewhere.


[ ] 6. SQLite Connection Reuse

Current State:

  • 718 connection opens in 30min session
  • Each open: 0.26ms (total 0.18s for connects)
  • Connection per operation pattern in mysqlite.py

Proposed Change:

  • Maintain persistent connection per thread
  • Implement connection pool with health checks
  • Reuse connections across operations

Assessment:

Current cost:     0.18s per 30min (connection overhead only)
Potential saving: 90% = 0.16s per 30min = 0.32s/hour
Effort:           Medium (thread-local storage, lifecycle management)
Risk:             Medium (connection state, locking issues)

Verdict: NOT RECOMMENDED. Negligible time savings (0.16s per 30min). SQLite's lightweight connections don't justify pooling complexity.


Summary: Optimization Priority Matrix

┌─────────────────────────────────────┬────────┬────────┬─────────┬────────┐
│ Optimization                        │ Effort │ Risk   │ Savings │ Status │
├─────────────────────────────────────┼────────┼────────┼─────────┼────────┤
│ 1. SQLite Query Batching            │ Low    │ Low    │ 20-34s/h│ DONE   │
│ 2. Proxy Validation Caching         │ V.Low  │ None   │ 5-8s/h  │ Maybe  │
│ 3. Regex Pre-compilation            │ Low    │ None   │ 5-8s/h  │ DONE   │
│ 4. JSON Response Caching            │ Medium │ Low    │ 7-9s/h  │ Later  │
│ 5. Object Pooling                   │ High   │ Medium │ 11-15s/h│ Skip   │
│ 6. SQLite Connection Reuse          │ Medium │ Medium │ 0.3s/h  │ Skip   │
└─────────────────────────────────────┴────────┴────────┴─────────┴────────┘

Completed: 1 (SQLite Batching), 3 (Regex Pre-compilation)
Remaining: 2 (Proxy Caching - Maybe), 4 (JSON Caching - Later)

Realized savings from completed optimizations:
  Per hour:   25-42 seconds saved
  Per day:    10-17 minutes saved
  Per week:   1.2-2.0 hours saved

Note: 68.7% of runtime is socket I/O (recv/send) which cannot be optimized
without changing the fundamental network architecture. The optimizations
above target the remaining 31.3% of CPU-bound operations.

Potential Dashboard Improvements

[ ] Dashboard Performance Optimizations

Goal: Ensure dashboard remains lightweight and doesn't impact system performance.

Current safeguards:

  • No polling on server side (client-initiated via fetch)
  • 3-second refresh interval (configurable)
  • Minimal DOM updates (targeted element updates, not full re-render)
  • Static CSS/JS (no server-side templating per request)
  • No persistent connections (stateless HTTP)

Future considerations:

  • Add rate limiting on /api/stats endpoint
  • Cache expensive DB queries (top countries, protocol breakdown)
  • Lazy-load historical data (only when scrolled into view)
  • WebSocket option for push updates (reduce polling overhead)
  • Configurable refresh interval via URL param or localStorage
  • Disable auto-refresh when tab not visible (Page Visibility API)

[ ] Dashboard Feature Ideas

Low priority - consider when time permits:

  • Dark/light theme toggle
  • Export stats as CSV/JSON from dashboard
  • Historical graphs (24h, 7d) using stats_history table
  • Per-ASN performance analysis
  • Geographic map visualization (requires JS library)
  • Alert thresholds (success rate < X%, MITM detected)
  • Mobile-responsive improvements
  • Keyboard shortcuts (r=refresh, t=toggle sections)

Completed

[x] Work-Stealing Queue

  • Implemented shared Queue.Queue() for job distribution
  • Workers pull from shared queue instead of pre-assigned lists
  • Better utilization across threads

[x] Multi-Target Validation

  • Test each proxy against 3 random targets
  • 2/3 majority required for success
  • Reduces false negatives from single target failures

[x] Interleaved Testing

  • Jobs shuffled across all proxies before queueing
  • Prevents burst of 3 connections to same proxy
  • ProxyTestState accumulates results from TargetTestJobs

[x] Code Cleanup

  • Removed 93 lines of dead HTTP server code (ppf.py)
  • Removed dead gumbo parser (soup_parser.py)
  • Removed test code (comboparse.py)
  • Removed unused functions (misc.py)
  • Fixed IP/port cleansing (ppf.py)
  • Updated .gitignore

[x] Rate Limiting & Instance Tracking (scraper.py)

  • InstanceTracker class with exponential backoff
  • Configurable backoff_base, backoff_max, fail_threshold
  • Instance cycling when rate limited

[x] Exception Logging with Context

  • Replaced bare except: with typed exceptions across all files
  • Added context logging to exception handlers (e.g., URL, error message)

[x] Timeout Standardization

  • Added timeout_connect, timeout_read to [common] config section
  • Added stale_days, stats_interval to [watchd] config section

[x] Periodic Stats & Stale Cleanup (proxywatchd.py)

  • Stats class tracks tested/passed/failed with thread-safe counters
  • Configurable stats_interval (default: 300s)
  • cleanup_stale() removes dead proxies older than stale_days (default: 30)

[x] Unified Proxy Cache

  • Moved _known_proxies to fetch.py with helper functions
  • init_known_proxies(), add_known_proxies(), is_known_proxy()
  • ppf.py now uses shared cache via fetch module

[x] Config Validation

  • config.py: validate() method checks config values on startup
  • Validates: port ranges, timeout values, thread counts, engine names
  • Warns on missing source_file, unknown engines
  • Errors on unwritable database directories
  • Integrated into ppf.py, proxywatchd.py, scraper.py main entry points

[x] Profiling Support

  • config.py: Added --profile CLI argument
  • ppf.py: Refactored main logic into main() function
  • ppf.py: cProfile wrapper with stats output to profile.stats
  • Prints top 20 functions by cumulative time on exit
  • Usage: python2 ppf.py --profile

[x] SIGTERM Graceful Shutdown

  • ppf.py: Added signal handler converting SIGTERM to KeyboardInterrupt
  • Ensures profile stats are written before container exit
  • Allows clean thread shutdown in containerized environments
  • Podman stop now triggers proper cleanup instead of SIGKILL

[x] Unicode Exception Handling (Python 2)

  • Problem: repr(e) on exceptions with unicode content caused encoding errors
  • Files affected: ppf.py, scraper.py (3 exception handlers)
  • Solution: Check isinstance(err_msg, unicode) then encode with 'backslashreplace'
  • Pattern applied:
    try:
        err_msg = repr(e)
        if isinstance(err_msg, unicode):
            err_msg = err_msg.encode('ascii', 'backslashreplace')
    except:
        err_msg = type(e).__name__
    
  • Handles Korean/CJK characters in search queries without crashing