Username d356cdf6ee docs: mark priority queue complete

2025-12-20 23:11:54 +01:00

PPF Implementation Tasks

Legend

[ ] Not started
[~] In progress
[x] Completed
[!] Blocked/needs discussion

Immediate Priority (Next Sprint)

[x] 1. Unify _known_proxies Cache

Completed. Added init_known_proxies(), add_known_proxies(), is_known_proxy() to fetch.py. Updated ppf.py to use these functions instead of local cache.

[x] 2. Graceful SQLite Error Handling

Completed. mysqlite.py now retries on "locked" errors with exponential backoff.
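A minimal sketch of what such a retry wrapper could look like. The function name `execute_with_retry`, the retry count, and the base delay are assumptions for illustration; the actual code in mysqlite.py may differ.

```python
import sqlite3
import time


def execute_with_retry(conn, sql, params=(), retries=5, base_delay=0.1,
                       sleep=time.sleep):
    """Run a statement, retrying with exponential backoff while the DB is locked.

    Hypothetical helper: name, defaults and signature are illustrative only.
    """
    for attempt in range(retries):
        try:
            return conn.execute(sql, params)
        except sqlite3.OperationalError as e:
            # Only lock contention is retried; other errors propagate at once,
            # as does the final failed attempt.
            if 'locked' not in str(e) or attempt == retries - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
```

Passing `sleep` as a parameter keeps the helper testable without real delays.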

[x] 3. Enable SQLite WAL Mode

Completed. mysqlite.py enables WAL mode and NORMAL synchronous on init.
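For reference, the two settings map to standard SQLite pragmas. The sketch below assumes a plain `sqlite3` connection; `open_db` is a hypothetical name, not the wrapper used in mysqlite.py.

```python
import sqlite3


def open_db(path):
    """Open a database with WAL journaling and NORMAL synchronous mode."""
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a writer is active; the setting is
    # stored in the database file and persists across connections.
    conn.execute('PRAGMA journal_mode=WAL')
    # NORMAL syncs less often than FULL and is the usual pairing with WAL.
    conn.execute('PRAGMA synchronous=NORMAL')
    return conn
```

WAL applies only to file-backed databases; an in-memory database reports `memory` as its journal mode.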

[x] 4. Batch Database Inserts

Completed. dbs.py uses executemany() for batch inserts.
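A short sketch of the batching pattern. The `proxylist` table and its `proto`/`proxy` columns appear elsewhere in this list; the function name `insert_proxies` and the `INSERT OR IGNORE` choice are assumptions.

```python
def insert_proxies(conn, rows):
    """Insert many (proto, proxy) rows with one statement and one commit.

    Hypothetical helper; `conn` is assumed to be a sqlite3 connection.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            'INSERT OR IGNORE INTO proxylist (proto, proxy) VALUES (?, ?)',
            rows,
        )
```

Compared with one `execute()` per row, this avoids per-statement overhead and keeps the whole batch in a single transaction.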

[x] 5. Add Database Indexes

Completed. dbs.py creates indexes on failed, tested, proto, error, check_time.
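One way the index creation could be written so that it is safe to run on every startup. The column list comes from the note above; the `idx_<table>_<column>` naming scheme and the function name are assumptions.

```python
INDEXED_COLUMNS = ('failed', 'tested', 'proto', 'error', 'check_time')


def create_indexes(conn, table='proxylist'):
    """Create one single-column index per frequently filtered column."""
    for col in INDEXED_COLUMNS:
        # IF NOT EXISTS makes the call idempotent across restarts.
        conn.execute(
            'CREATE INDEX IF NOT EXISTS idx_%s_%s ON %s (%s)'
            % (table, col, table, col)
        )
```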

Short Term (This Month)

[x] 6. Log Level Filtering

Completed. Added log level filtering with -q/--quiet and -v/--verbose CLI flags.

misc.py: LOG_LEVELS dict, set_log_level(), get_log_level()
config.py: Added -q/--quiet and -v/--verbose arguments
Log levels: debug=0, info=1, warn=2, error=3
--quiet: only show warn/error
--verbose: show debug messages
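The level mapping above could be applied roughly as follows. `LOG_LEVELS`, `set_log_level()` and `get_log_level()` are named in the notes; `should_log()` and the default level of `info` are assumptions added for illustration.

```python
LOG_LEVELS = {'debug': 0, 'info': 1, 'warn': 2, 'error': 3}
_log_level = LOG_LEVELS['info']  # assumed default


def set_log_level(name):
    """Set the minimum level that will be printed."""
    global _log_level
    _log_level = LOG_LEVELS[name]


def get_log_level():
    return _log_level


def should_log(level):
    """Hypothetical helper: True if a message at `level` passes the filter."""
    return LOG_LEVELS[level] >= _log_level
```

Under this sketch, `--quiet` would correspond to `set_log_level('warn')` and `--verbose` to `set_log_level('debug')`.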

[x] 7. Connection Timeout Standardization

Completed. Added timeout_connect and timeout_read to [common] section in config.py.

[x] 8. Failure Categorization

Completed. Added failure categorization for proxy errors.

misc.py: categorize_error() function, FAIL_* constants
Categories: timeout, refused, auth, unreachable, dns, ssl, closed, proxy, other
proxywatchd.py: Stats.record() now accepts category parameter
Stats.report() shows failure breakdown by category
ProxyTestState.evaluate() returns (success, category) tuple
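As a rough illustration of the categorization step, the sketch below matches substrings of the error message. The category names come from the list above; the matching rules and the exact constant names are assumptions, and the real `categorize_error()` may inspect errno values or exception types instead.

```python
FAIL_TIMEOUT = 'timeout'
FAIL_REFUSED = 'refused'
FAIL_AUTH = 'auth'
FAIL_UNREACHABLE = 'unreachable'
FAIL_DNS = 'dns'
FAIL_SSL = 'ssl'
FAIL_CLOSED = 'closed'
FAIL_PROXY = 'proxy'
FAIL_OTHER = 'other'

# Checked in order; the first category with a matching substring wins.
_PATTERNS = [
    (FAIL_TIMEOUT, ('timed out', 'timeout')),
    (FAIL_REFUSED, ('refused',)),
    (FAIL_AUTH, ('auth',)),
    (FAIL_UNREACHABLE, ('unreachable',)),
    (FAIL_DNS, ('name or service not known', 'getaddrinfo')),
    (FAIL_SSL, ('ssl', 'certificate')),
    (FAIL_CLOSED, ('reset by peer', 'broken pipe', 'closed')),
    (FAIL_PROXY, ('socks', 'proxy')),
]


def categorize_error(exc):
    """Map an exception (or message string) to one failure category."""
    msg = str(exc).lower()
    for category, needles in _PATTERNS:
        if any(needle in msg for needle in needles):
            return category
    return FAIL_OTHER
```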

[x] 9. Priority Queue for Proxy Testing

Completed. Added priority-based job scheduling for proxy tests.

PriorityJobQueue class with heap-based ordering
calculate_priority() assigns priority 0-4 based on proxy state
Priority 0: New proxies (never tested)
Priority 1: Working proxies (no failures)
Priority 2: Low fail count (< 3)
Priority 3-4: Medium/high fail count
Integrated into prepare_jobs() for automatic prioritization
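The scheme above can be sketched with `heapq`. The class and function names match the notes, but the method names, the `calculate_priority()` arguments, and the threshold separating priority 3 from 4 are assumptions.

```python
import heapq
import itertools
import threading


def calculate_priority(tested, failed):
    """Return 0 (most urgent) to 4 based on proxy state."""
    if not tested:
        return 0   # new proxy, never tested
    if failed == 0:
        return 1   # working proxy
    if failed < 3:
        return 2   # low fail count
    if failed < 10:
        return 3   # medium fail count (threshold is an assumption)
    return 4       # high fail count


class PriorityJobQueue(object):
    """Thread-safe queue that always hands out the lowest priority number."""

    def __init__(self):
        self._heap = []
        # Monotonic counter breaks ties so equal priorities stay FIFO and
        # job objects themselves are never compared.
        self._counter = itertools.count()
        self._lock = threading.Lock()

    def put(self, job, priority):
        with self._lock:
            heapq.heappush(self._heap, (priority, next(self._counter), job))

    def get(self):
        with self._lock:
            return heapq.heappop(self._heap)[2]

    def __len__(self):
        with self._lock:
            return len(self._heap)
```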

[x] 10. Periodic Statistics Output

Completed. Added Stats class to proxywatchd.py with record(), should_report(), and report() methods. Integrated into main loop with configurable stats_interval.
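A compact sketch of how the three methods might fit together, including the per-category breakdown from item 8. Attribute names, the report format, and the injectable `clock` parameter are assumptions; only the method names and `stats_interval` come from the notes.

```python
import threading
import time


class Stats(object):
    """Thread-safe counters with a periodic one-line report."""

    def __init__(self, interval=300, clock=time.time):
        self.interval = interval
        self.clock = clock  # injectable so the timing can be tested
        self.lock = threading.Lock()
        self.tested = self.passed = self.failed = 0
        self.by_category = {}
        self.last_report = clock()

    def record(self, success, category=None):
        with self.lock:
            self.tested += 1
            if success:
                self.passed += 1
            else:
                self.failed += 1
                key = category or 'other'
                self.by_category[key] = self.by_category.get(key, 0) + 1

    def should_report(self):
        return self.clock() - self.last_report >= self.interval

    def report(self):
        with self.lock:
            self.last_report = self.clock()
            breakdown = ', '.join(
                '%s=%d' % item for item in sorted(self.by_category.items()))
            return 'tested=%d passed=%d failed=%d [%s]' % (
                self.tested, self.passed, self.failed, breakdown)
```

In a main loop this would be used as `if stats.should_report(): log(stats.report())`, where `log` stands in for whatever logging call the daemon uses.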

Medium Term (Next Quarter)

[x] 11. Tor Connection Pooling

Completed. Added connection pooling with worker-Tor affinity and health monitoring.

connection_pool.py: TorHostState class tracks per-host health, latency, backoff
connection_pool.py: TorConnectionPool with worker affinity, warmup, statistics
proxywatchd.py: Workers get consistent Tor host assignment for circuit reuse
Success/failure tracking with exponential backoff (5s, 10s, 20s, 40s, max 60s)
Latency tracking with rolling averages
Pool status reported alongside periodic stats
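The backoff schedule and worker affinity described above reduce to two small calculations. Both function names are hypothetical, and the modulo-based host assignment is an assumption about how affinity might be achieved.

```python
def backoff_delay(consecutive_failures, base=5, cap=60):
    """Seconds to wait before retrying a Tor host: 5, 10, 20, 40, then 60."""
    if consecutive_failures <= 0:
        return 0
    return min(base * 2 ** (consecutive_failures - 1), cap)


def host_for_worker(worker_id, tor_hosts):
    """Give each worker a stable Tor host so its circuits can be reused."""
    return tor_hosts[worker_id % len(tor_hosts)]
```

The fifth consecutive failure would compute 80s and is capped at 60s, matching the schedule in the notes.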

[ ] 12. Dynamic Thread Scaling

Problem: Fixed thread count regardless of success rate or system load.

Implementation:

# proxywatchd.py
from collections import deque


class ThreadScaler(object):
    """Dynamically adjust thread count based on recent success rate."""

    def __init__(self, min_threads=5, max_threads=50):
        self.min = min_threads
        self.max = max_threads
        self.current = min_threads
        # Sliding window of the last 100 results; old entries drop off automatically
        self.success_rate_window = deque(maxlen=100)

    def record_result(self, success):
        self.success_rate_window.append(bool(success))

    def recommended_threads(self):
        """Return a recommendation only; the caller updates self.current after resizing."""
        # Too few samples to judge: keep the current count
        if len(self.success_rate_window) < 20:
            return self.current

        # float() avoids integer division under Python 2
        success_rate = sum(self.success_rate_window) / float(len(self.success_rate_window))

        # High success rate -> can handle more threads (never above max)
        if success_rate > 0.7:
            return min(self.current + 5, self.max)
        # Low success rate -> reduce load (never below min)
        if success_rate < 0.3:
            return max(self.current - 5, self.min)

        return self.current

Files: proxywatchd.py; Effort: Medium; Risk: Medium

[ ] 13. Latency Tracking

Problem: No visibility into proxy speed. A working but slow proxy may be less useful than a fast one.

Implementation:

# dbs.py - add columns
# ALTER TABLE proxylist ADD COLUMN avg_latency REAL DEFAULT 0
# ALTER TABLE proxylist ADD COLUMN latency_samples INTEGER DEFAULT 0

def update_proxy_latency(proxydb, proxy, latency):
    """Update rolling average latency for proxy."""
    row = proxydb.execute(
        'SELECT avg_latency, latency_samples FROM proxylist WHERE proxy=?',
        (proxy,)
    ).fetchone()

    if row:
        old_avg, samples = row
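        # Cumulative mean until 100 samples; once the sample count is capped,
        # each new value gets weight 1/101, which approximates an exponential moving average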
        new_avg = (old_avg * samples + latency) / (samples + 1)
        new_samples = min(samples + 1, 100)  # Cap at 100 samples

        proxydb.execute(
            'UPDATE proxylist SET avg_latency=?, latency_samples=? WHERE proxy=?',
            (new_avg, new_samples, proxy)
        )

Files: dbs.py, proxywatchd.py; Effort: Medium; Risk: Low

[ ] 14. Export Functionality

Problem: No easy way to export working proxies for use elsewhere.

Implementation:

# new file: export.py
import json


def export_proxies(proxydb, format='txt', filters=None):
    """Export working proxies to various formats."""

    query = 'SELECT proto, proxy FROM proxylist WHERE failed=0'
    params = []
    if filters and 'proto' in filters:
        query += ' AND proto=?'
        params.append(filters['proto'])  # bind the value for the placeholder

    rows = proxydb.execute(query, tuple(params)).fetchall()

    if format == 'txt':
        return '\n'.join('%s://%s' % (r[0], r[1]) for r in rows)
    elif format == 'json':
        return json.dumps([{'proto': r[0], 'address': r[1]} for r in rows])
    elif format == 'csv':
        return 'proto,address\n' + '\n'.join('%s,%s' % (r[0], r[1]) for r in rows)
    raise ValueError('unknown export format: %r' % (format,))

# CLI: python export.py --format json --proto socks5 > proxies.json

Files: new export.py; Effort: Low; Risk: Low
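The CLI shown in the comment above could be parsed with `argparse`. This is a sketch only: `build_parser()` and `filters_from_args()` are hypothetical names, and the flag set is limited to the two flags mentioned in the example command.

```python
import argparse


def build_parser():
    """Parser for: python export.py --format json --proto socks5"""
    parser = argparse.ArgumentParser(description='Export working proxies')
    parser.add_argument('--format', choices=('txt', 'json', 'csv'),
                        default='txt', help='output format')
    parser.add_argument('--proto', default=None,
                        help='only export proxies of this protocol')
    return parser


def filters_from_args(args):
    """Translate parsed arguments into the filters dict export_proxies() expects."""
    return {'proto': args.proto} if args.proto else None
```

Restricting `--format` with `choices` rejects unknown formats at the command line before any database work starts.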

[ ] 15. Unit Test Infrastructure

Problem: No automated tests. Changes can break existing functionality silently.

Implementation:

tests/
├── __init__.py
├── test_proxy_utils.py    # Test IP validation, cleansing
├── test_extract.py        # Test proxy/URL extraction
├── test_database.py       # Test DB operations with temp DB
└── mock_network.py        # Mock rocksock for offline testing

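# tests/test_proxy_utils.py
import os
import sys
import unittest

# Make the project root importable regardless of the current working directory
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), '..'))
import fetch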

class TestProxyValidation(unittest.TestCase):
    def test_valid_proxy(self):
        self.assertTrue(fetch.is_usable_proxy('8.8.8.8:8080'))

    def test_private_ip_rejected(self):
        self.assertFalse(fetch.is_usable_proxy('192.168.1.1:8080'))
        self.assertFalse(fetch.is_usable_proxy('10.0.0.1:8080'))
        self.assertFalse(fetch.is_usable_proxy('172.16.0.1:8080'))

    def test_invalid_port_rejected(self):
        self.assertFalse(fetch.is_usable_proxy('8.8.8.8:0'))
        self.assertFalse(fetch.is_usable_proxy('8.8.8.8:99999'))

if __name__ == '__main__':
    unittest.main()

Files: tests/ directory; Effort: High (initial), Low (ongoing); Risk: Low
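For the planned tests/test_database.py, a starting point might look like the sketch below. It uses an in-memory SQLite database instead of a temp file, and the table schema is an assumption reduced to the columns this list mentions; the real tests would exercise the functions in dbs.py.

```python
import sqlite3
import unittest


class TestDatabase(unittest.TestCase):
    def setUp(self):
        # Fresh in-memory database per test; nothing touches the real DB.
        self.db = sqlite3.connect(':memory:')
        self.db.execute(
            'CREATE TABLE proxylist ('
            'proxy TEXT PRIMARY KEY, proto TEXT, failed INTEGER DEFAULT 0)')

    def tearDown(self):
        self.db.close()

    def test_only_working_proxies_selected(self):
        self.db.executemany(
            'INSERT INTO proxylist (proxy, proto, failed) VALUES (?, ?, ?)',
            [('1.2.3.4:80', 'http', 0), ('5.6.7.8:1080', 'socks5', 3)])
        rows = self.db.execute(
            'SELECT proxy FROM proxylist WHERE failed=0').fetchall()
        self.assertEqual(rows, [('1.2.3.4:80',)])

    def test_duplicate_proxy_rejected(self):
        insert = "INSERT INTO proxylist (proxy, proto) VALUES ('1.2.3.4:80', 'http')"
        self.db.execute(insert)
        with self.assertRaises(sqlite3.IntegrityError):
            self.db.execute(insert)
```

It can be run with `python -m unittest tests.test_database` from the project root.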

Long Term (Future)

[ ] 16. Geographic Validation

Verify proxy actually routes through claimed location using IP geolocation.

[ ] 17. HTTPS/SSL Proxy Testing

Add capability to test HTTPS CONNECT proxies.

[x] 18. Additional Search Engines

Completed. Added modular search engine architecture.

engines.py: SearchEngine base class with build_url(), extract_urls(), is_rate_limited()
Engines: DuckDuckGo, Startpage, Mojeek (UK), Qwant (FR), Yandex (RU), Ecosia, Brave
Git hosters: GitHub, GitLab, Codeberg, Gitea
scraper.py: EngineTracker class for multi-engine rate limiting
Config: [scraper] engines, max_pages settings
searx.instances: Updated with 51 active SearXNG instances

[x] 19. REST API

Completed. Added HTTP API server for querying working proxies.

httpd.py: ProxyAPIServer class with BaseHTTPServer
Endpoints: /proxies, /proxies/count, /health
Params: limit, proto, country, format (json/plain)
Integrated into proxywatchd.py (starts when httpd.enabled=True)
Config: [httpd] section with listenip, port, enabled

[ ] 20. Web Dashboard

Status page showing live statistics.

Completed

[x] Work-Stealing Queue

Implemented shared Queue.Queue() for job distribution
Workers pull from shared queue instead of pre-assigned lists
Better utilization across threads
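The pattern can be sketched as a pool of threads draining one shared queue. `run_workers()` is a hypothetical helper, not a function from proxywatchd.py; the import fallback covers both the Python 2 `Queue` module named above and its Python 3 equivalent.

```python
import threading

try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2


def run_workers(jobs, handler, num_workers=4):
    """Process all jobs with a pool of threads pulling from one shared queue."""
    q = queue.Queue()
    for job in jobs:
        q.put(job)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = q.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            out = handler(job)
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because every worker takes the next job as soon as it is free, a slow job delays only one thread instead of a whole pre-assigned list.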

[x] Multi-Target Validation

Test each proxy against 3 random targets
2/3 majority required for success
Reduces false negatives from single target failures
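The majority rule is small enough to show directly. Both helper names are assumptions; only the "3 random targets, 2 must pass" rule comes from the notes.

```python
import random


def pick_targets(targets, count=3, rng=random):
    """Choose `count` distinct test targets at random."""
    return rng.sample(targets, count)


def majority_passed(results, required=2):
    """True if at least `required` of the per-target results succeeded."""
    return sum(1 for ok in results if ok) >= required
```

With three targets, a single unreachable target no longer marks an otherwise working proxy as failed.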

[x] Interleaved Testing

Jobs shuffled across all proxies before queueing
Prevents burst of 3 connections to same proxy
ProxyTestState accumulates results from TargetTestJobs
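A sketch of the interleaving step, assuming jobs are simple `(proxy, target)` pairs. The real code uses ProxyTestState and TargetTestJob objects; `build_interleaved_jobs()` is a hypothetical stand-in.

```python
import random


def build_interleaved_jobs(proxies, targets, per_proxy=3, rng=random):
    """Create per-target jobs for every proxy, then shuffle them together."""
    jobs = []
    for proxy in proxies:
        for target in rng.sample(targets, per_proxy):
            jobs.append((proxy, target))
    # Shuffling the combined list spreads each proxy's jobs across the queue,
    # so its connections are unlikely to be opened back to back.
    rng.shuffle(jobs)
    return jobs
```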

[x] Code Cleanup

Removed 93 lines of dead HTTP server code (ppf.py)
Removed dead gumbo parser (soup_parser.py)
Removed test code (comboparse.py)
Removed unused functions (misc.py)
Fixed IP/port cleansing (ppf.py)
Updated .gitignore

[x] Rate Limiting & Instance Tracking (scraper.py)

InstanceTracker class with exponential backoff
Configurable backoff_base, backoff_max, fail_threshold
Instance cycling when rate limited
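One possible shape for such a tracker is sketched below. The three parameter names come from the notes; the default values, method names, and the rule that backoff starts once `fail_threshold` is reached are assumptions.

```python
import time


class InstanceTracker(object):
    """Track per-instance failures and skip instances that are backing off."""

    def __init__(self, instances, backoff_base=30, backoff_max=3600,
                 fail_threshold=3, clock=time.time):
        self.instances = list(instances)
        self.backoff_base = backoff_base
        self.backoff_max = backoff_max
        self.fail_threshold = fail_threshold
        self.clock = clock  # injectable for testing
        self.fails = dict.fromkeys(self.instances, 0)
        self.blocked_until = dict.fromkeys(self.instances, 0)

    def mark_failure(self, inst):
        self.fails[inst] += 1
        if self.fails[inst] >= self.fail_threshold:
            # Delay doubles with each failure past the threshold, up to the cap.
            exponent = self.fails[inst] - self.fail_threshold
            delay = min(self.backoff_base * 2 ** exponent, self.backoff_max)
            self.blocked_until[inst] = self.clock() + delay

    def mark_success(self, inst):
        self.fails[inst] = 0
        self.blocked_until[inst] = 0

    def next_available(self):
        """Return the first instance not currently backing off, or None."""
        now = self.clock()
        for inst in self.instances:
            if self.blocked_until[inst] <= now:
                return inst
        return None
```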

[x] Exception Logging with Context

Replaced bare except: with typed exceptions across all files
Added context logging to exception handlers (e.g., URL, error message)

[x] Timeout Standardization

Added timeout_connect, timeout_read to [common] config section
Added stale_days, stats_interval to [watchd] config section

[x] Periodic Stats & Stale Cleanup (proxywatchd.py)

Stats class tracks tested/passed/failed with thread-safe counters
Configurable stats_interval (default: 300s)
cleanup_stale() removes dead proxies older than stale_days (default: 30)
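The stale cleanup could be expressed as a single DELETE. This sketch assumes that "dead" means `failed > 0` and that `check_time` holds a Unix timestamp; both are guesses based on the column names in this list, not confirmed details of proxywatchd.py.

```python
import time


def cleanup_stale(conn, stale_days=30, now=None):
    """Delete failed proxies last checked more than `stale_days` days ago.

    Returns the number of rows removed. `now` can be passed in for testing.
    """
    cutoff = (time.time() if now is None else now) - stale_days * 86400
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            'DELETE FROM proxylist WHERE failed > 0 AND check_time < ?',
            (cutoff,))
    return cur.rowcount
```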

[x] Unified Proxy Cache

Moved _known_proxies to fetch.py with helper functions
init_known_proxies(), add_known_proxies(), is_known_proxy()
ppf.py now uses shared cache via fetch module

[x] Config Validation

config.py: validate() method checks config values on startup
Validates: port ranges, timeout values, thread counts, engine names
Warns on missing source_file, unknown engines
Errors on unwritable database directories
Integrated into ppf.py, proxywatchd.py, scraper.py main entry points
