diff --git a/ROADMAP.md b/ROADMAP.md
index 06c2d38..30d7cab 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -221,17 +221,29 @@ PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework design
 - [x] Make IP2Location optional (graceful fallback)
 - [x] Add --nobs flag for stdlib HTMLParser fallback (bs4 optional)
 
+### Rate Limiting & Stability (Done)
+- [x] InstanceTracker class in scraper.py with exponential backoff
+- [x] Configurable backoff_base, backoff_max, fail_threshold
+- [x] Exception logging with context (replaced bare except blocks)
+- [x] Unified _known_proxies cache in fetch.py
+
+### Monitoring & Maintenance (Done)
+- [x] Stats class in proxywatchd.py (tested/passed/failed tracking)
+- [x] Periodic stats reporting (configurable stats_interval)
+- [x] Stale proxy cleanup (cleanup_stale() with configurable stale_days)
+- [x] Timeout config options (timeout_connect, timeout_read)
+
 ---
 
 ## Technical Debt
 
 | Item | Description | Risk |
 |------|-------------|------|
-| Dual _known_proxies | ppf.py and fetch.py maintain separate caches | Medium - duplicates possible |
+| ~~Dual _known_proxies~~ | ~~ppf.py and fetch.py maintain separate caches~~ | **Resolved** |
 | Global config in fetch.py | set_config() pattern is fragile | Low - works but not clean |
 | No input validation | Proxy strings parsed without validation | Medium - could crash on bad data |
-| Silent exception catching | Some except: pass patterns hide errors | High - hard to debug |
-| Hardcoded timeouts | Various timeout values scattered in code | Low - works but not configurable |
+| ~~Silent exception catching~~ | ~~Some except: pass patterns hide errors~~ | **Resolved** |
+| ~~Hardcoded timeouts~~ | ~~Various timeout values scattered in code~~ | **Resolved** |
 
 ---
 
diff --git a/TODO.md b/TODO.md
index fe08da0..2a8b279 100644
--- a/TODO.md
+++ b/TODO.md
@@ -13,221 +13,64 @@
 
 ## Immediate Priority (Next Sprint)
 
-### [ ] 1. Unify _known_proxies Cache
+### [x] 1. Unify _known_proxies Cache
 
-**Problem:** Both ppf.py and fetch.py maintain separate `_known_proxies` dictionaries.
-Updates to one don't reflect in the other, causing potential duplicate processing.
-
-**Implementation:**
-```python
-# fetch.py - becomes the single source of truth
-_known_proxies = {}
-_known_proxies_lock = threading.Lock()
-
-def get_known_proxies():
-    """Return reference to shared known proxies dict."""
-    return _known_proxies
-
-def is_known_proxy(proxy):
-    """Thread-safe check if proxy is known."""
-    with _known_proxies_lock:
-        return proxy in _known_proxies
-
-def mark_proxy_known(proxy):
-    """Thread-safe mark proxy as known."""
-    with _known_proxies_lock:
-        _known_proxies[proxy] = True
-```
-
-**Files:** fetch.py, ppf.py
-**Effort:** Low
-**Risk:** Low
+**Completed.** Added `init_known_proxies()`, `add_known_proxies()`, `is_known_proxy()`
+to fetch.py. Updated ppf.py to use these functions instead of its local cache.
 
 ---
 
-### [ ] 2. Graceful SQLite Error Handling
+### [x] 2. Graceful SQLite Error Handling
 
-**Problem:** SQLite can throw "database is locked" errors under concurrent access.
-Currently these bubble up and crash the application.
+**Completed.** mysqlite.py now retries on "locked" errors with exponential backoff.
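+
+A minimal sketch of the retry pattern (illustrative only; the function name,
+retry count, and delays in the shipped mysqlite.py may differ):
+
+```python
+import sqlite3
+import time
+
+def execute_with_retry(cur, query, params=(), retries=5, base_delay=0.1):
+    """Run a query, retrying on 'database is locked' with exponential backoff."""
+    for attempt in range(retries):
+        try:
+            return cur.execute(query, params)
+        except sqlite3.OperationalError as e:
+            if 'locked' in str(e) and attempt < retries - 1:
+                time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
+                continue
+            raise
+```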
-
-**Implementation:**
-```python
-# mysqlite.py
-import time
-
-class mysqlite():
-    def execute(self, query, params=None, retries=5):
-        for attempt in range(retries):
-            try:
-                if params:
-                    return self.cur.execute(query, params)
-                return self.cur.execute(query)
-            except sqlite3.OperationalError as e:
-                if 'locked' in str(e) and attempt < retries - 1:
-                    time.sleep(0.1 * (attempt + 1))  # Exponential backoff
-                    continue
-                raise
-```
-
-**Files:** mysqlite.py
-**Effort:** Low
-**Risk:** Low
 
 ---
 
-### [ ] 3. Enable SQLite WAL Mode
+### [x] 3. Enable SQLite WAL Mode
 
-**Problem:** Default SQLite journaling mode blocks concurrent readers during writes.
-
-**Implementation:**
-```python
-# mysqlite.py - in __init__
-def __init__(self, database, rowtype):
-    self.conn = sqlite3.connect(database, check_same_thread=False)
-    self.conn.execute('PRAGMA journal_mode=WAL')
-    self.conn.execute('PRAGMA synchronous=NORMAL')
-    # ...
-```
-
-**Files:** mysqlite.py
-**Effort:** Low
-**Risk:** Low (WAL is well-tested)
+**Completed.** mysqlite.py enables WAL mode and NORMAL synchronous on init.
 
 ---
 
-### [ ] 4. Batch Database Inserts
+### [x] 4. Batch Database Inserts
 
-**Problem:** insert_proxies() and insert_urls() do individual INSERTs, causing
-excessive disk I/O and lock contention.
-
-**Implementation:**
-```python
-# dbs.py
-def insert_proxies(proxydb, proxies, source):
-    """Batch insert proxies."""
-    if not proxies:
-        return
-
-    mytime = int(time.time())
-    values = [(p, source, mytime) for p in proxies]
-
-    proxydb.executemany(
-        'INSERT OR IGNORE INTO proxylist (proxy, source, first_seen) VALUES (?, ?, ?)',
-        values
-    )
-    proxydb.commit()
-```
-
-**Files:** dbs.py
-**Effort:** Low
-**Risk:** Low
+**Completed.** dbs.py uses executemany() for batch inserts.
 
 ---
 
-### [ ] 5. Add Database Indexes
+### [x] 5. Add Database Indexes
 
-**Problem:** Queries on large tables are slow without proper indexes.
-
-**Implementation:**
-```python
-# dbs.py - in create_table_if_not_exists
-def create_indexes(db, table):
-    if table == 'proxylist':
-        db.execute('CREATE INDEX IF NOT EXISTS idx_proxy_failed ON proxylist(failed)')
-        db.execute('CREATE INDEX IF NOT EXISTS idx_proxy_proto ON proxylist(proto)')
-    elif table == 'uris':
-        db.execute('CREATE INDEX IF NOT EXISTS idx_uri_error ON uris(error)')
-        db.execute('CREATE INDEX IF NOT EXISTS idx_uri_checktime ON uris(check_time)')
-    db.commit()
-```
-
-**Files:** dbs.py
-**Effort:** Low
-**Risk:** Low
+**Completed.** dbs.py creates indexes on failed, tested, proto, error, check_time.
 
 ---
 
 ## Short Term (This Month)
 
-### [ ] 6. Standardize Logging
+### [x] 6. Log Level Filtering
 
-**Problem:** Inconsistent logging across files. Some use print(), some use _log(),
-some silently swallow errors.
+**Completed.** Added log level filtering with -q/--quiet and -v/--verbose CLI flags.
+- misc.py: LOG_LEVELS dict, set_log_level(), get_log_level()
+- config.py: Added -q/--quiet and -v/--verbose arguments
+- Log levels: debug=0, info=1, warn=2, error=3
+- --quiet: only show warn/error
+- --verbose: show debug messages
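+
+A minimal sketch of the filtering logic (function names follow the bullets
+above; the actual misc.py implementation may differ):
+
+```python
+import sys
+
+LOG_LEVELS = {'debug': 0, 'info': 1, 'warn': 2, 'error': 3}
+_log_level = LOG_LEVELS['info']  # default threshold
+
+def set_log_level(level):
+    """Set the global threshold; messages below it are dropped."""
+    global _log_level
+    _log_level = LOG_LEVELS.get(level, LOG_LEVELS['info'])
+
+def get_log_level():
+    return _log_level
+
+def _log(msg, level='info'):
+    if LOG_LEVELS.get(level, LOG_LEVELS['info']) < _log_level:
+        return  # filtered out by the -q/--quiet or default threshold
+    stream = sys.stderr if level in ('warn', 'error') else sys.stdout
+    stream.write('%s\t%s\n' % (level, msg))
+```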
-
-**Implementation:**
-```python
-# misc.py - enhanced logging
-import sys
-
-LOG_LEVELS = {'debug': 0, 'info': 1, 'warn': 2, 'error': 3}
-_log_level = 1  # Default: info
-
-def set_log_level(level):
-    global _log_level
-    _log_level = LOG_LEVELS.get(level, 1)
-
-def _log(msg, level='info', module=None):
-    if LOG_LEVELS.get(level, 1) < _log_level:
-        return
-    prefix = '%s/%s' % (timestamp(), level)
-    if module:
-        prefix = '%s/%s' % (prefix, module)
-    output = sys.stderr if level in ('warn', 'error') else sys.stdout
-    print >> output, '\r%s\t%s' % (prefix, msg)
-```
-
-**Files:** misc.py, all other files
-**Effort:** Medium
-**Risk:** Low
 
 ---
 
-### [ ] 7. Connection Timeout Standardization
+### [x] 7. Connection Timeout Standardization
 
-**Problem:** Timeout values are scattered: config.ppf.timeout, config.watchd.timeout,
-hardcoded values in rocksock.py.
-
-**Implementation:**
-- Add to config.py: `[network]` section with timeout_connect, timeout_read, timeout_total
-- Pass timeout explicitly to all network functions
-- Remove hardcoded timeout values
-
-**Files:** config.py, fetch.py, proxywatchd.py, rocksock.py
-**Effort:** Medium
-**Risk:** Low
+**Completed.** Added timeout_connect and timeout_read to [common] section in config.py.
 
 ---
 
-### [ ] 8. Failure Categorization
+### [x] 8. Failure Categorization
 
-**Problem:** All failures treated equally. Timeout vs connection-refused vs
-auth-failure have different implications.
-
-**Implementation:**
-```python
-# proxywatchd.py
-class FailureType:
-    TIMEOUT = 'timeout'          # Retry later
-    REFUSED = 'refused'          # Proxy down, lower priority
-    AUTH_FAIL = 'auth_fail'      # Wrong protocol, try others
-    TARGET_DOWN = 'target_down'  # Not proxy's fault
-    UNKNOWN = 'unknown'
-
-def categorize_failure(exception):
-    """Categorize failure type from exception."""
-    msg = str(exception).lower()
-    if 'timeout' in msg or 'timed out' in msg:
-        return FailureType.TIMEOUT
-    if 'refused' in msg:
-        return FailureType.REFUSED
-    if 'auth' in msg or 'handshake' in msg:
-        return FailureType.AUTH_FAIL
-    return FailureType.UNKNOWN
-```
-
-**Files:** proxywatchd.py
-**Effort:** Medium
-**Risk:** Low
+**Completed.** Added failure categorization for proxy errors.
+- misc.py: categorize_error() function, FAIL_* constants
+- Categories: timeout, refused, auth, unreachable, dns, ssl, closed, proxy, other
+- proxywatchd.py: Stats.record() now accepts category parameter
+- Stats.report() shows failure breakdown by category
+- ProxyTestState.evaluate() returns (success, category) tuple
 
 ---
 
@@ -272,46 +115,10 @@ Priority calculation:
 
 ---
 
-### [ ] 10. Periodic Statistics Output
+### [x] 10. Periodic Statistics Output
 
-**Problem:** No visibility into system performance during operation.
+**Completed.** Added Stats class to proxywatchd.py with record(), should_report(),
+and report() methods. Integrated into main loop with configurable stats_interval.
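+
+A minimal sketch of the class described above (the shipped version also
+records failure categories per item 8; this shows only the core counters):
+
+```python
+import threading
+import time
+
+class Stats(object):
+    def __init__(self, interval=300):
+        self.lock = threading.Lock()
+        self.tested = self.passed = self.failed = 0
+        self.interval = interval
+        self.start_time = self.last_report = time.time()
+
+    def record(self, success):
+        with self.lock:
+            self.tested += 1
+            if success:
+                self.passed += 1
+            else:
+                self.failed += 1
+
+    def should_report(self):
+        return time.time() - self.last_report >= self.interval
+
+    def report(self):
+        with self.lock:
+            self.last_report = time.time()
+            elapsed = self.last_report - self.start_time
+            rate = self.tested / elapsed if elapsed > 0 else 0
+            return 'tested=%d passed=%d failed=%d rate=%.1f/s' % (
+                self.tested, self.passed, self.failed, rate)
+```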
-
-**Implementation:**
-```python
-# proxywatchd.py
-class Stats:
-    def __init__(self):
-        self.lock = threading.Lock()
-        self.tested = 0
-        self.passed = 0
-        self.failed = 0
-        self.start_time = time.time()
-
-    def record(self, success):
-        with self.lock:
-            self.tested += 1
-            if success:
-                self.passed += 1
-            else:
-                self.failed += 1
-
-    def report(self):
-        with self.lock:
-            elapsed = time.time() - self.start_time
-            rate = self.tested / elapsed if elapsed > 0 else 0
-            pct = (self.passed * 100.0 / self.tested) if self.tested > 0 else 0
-            return 'tested=%d passed=%d (%.1f%%) rate=%.1f/s' % (
-                self.tested, self.passed, pct, rate)
-
-# In main loop, every 5 minutes:
-if time.time() - last_stats > 300:
-    _log(stats.report(), 'stats', 'watchd')
-    last_stats = time.time()
-```
-
-**Files:** proxywatchd.py
-**Effort:** Low
-**Risk:** Low
 
 ---
 
@@ -525,11 +332,24 @@ Verify proxy actually routes through claimed location using IP geolocation.
 
 ### [ ] 17. HTTPS/SSL Proxy Testing
 Add capability to test HTTPS CONNECT proxies.
 
-### [ ] 18. Additional Search Engines
-Support Google, Bing, DuckDuckGo beyond Searx.
+### [x] 18. Additional Search Engines
 
-### [ ] 19. REST API
-Simple HTTP API to query proxy database.
+**Completed.** Added modular search engine architecture.
+- engines.py: SearchEngine base class with build_url(), extract_urls(), is_rate_limited()
+- Engines: DuckDuckGo, Startpage, Mojeek (UK), Qwant (FR), Yandex (RU), Ecosia, Brave
+- Git hosters: GitHub, GitLab, Codeberg, Gitea
+- scraper.py: EngineTracker class for multi-engine rate limiting
+- Config: [scraper] engines, max_pages settings
+- searx.instances: Updated with 51 active SearXNG instances
+
+### [x] 19. REST API
+
+**Completed.** Added HTTP API server for querying working proxies.
+- httpd.py: ProxyAPIServer class with BaseHTTPServer
+- Endpoints: /proxies, /proxies/count, /health
+- Params: limit, proto, country, format (json/plain)
+- Integrated into proxywatchd.py (starts when httpd.enabled=True)
+- Config: [httpd] section with listenip, port, enabled
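+
+Example query against a running instance (host, port, and the JSON response
+shape are illustrative assumptions; endpoint and parameter names are those
+listed above):
+
+```python
+import json
+import urllib2  # Python 2 stdlib, matching the rest of the codebase
+
+# Fetch up to 10 working SOCKS5 proxies as JSON.
+url = 'http://127.0.0.1:8080/proxies?proto=socks5&limit=10&format=json'
+proxies = json.load(urllib2.urlopen(url))
+print(proxies)
+```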
 
 ### [ ] 20. Web Dashboard
 Status page showing live statistics.
 
@@ -560,3 +380,33 @@
 - Removed unused functions (misc.py)
 - Fixed IP/port cleansing (ppf.py)
 - Updated .gitignore
+
+### [x] Rate Limiting & Instance Tracking (scraper.py)
+- InstanceTracker class with exponential backoff
+- Configurable backoff_base, backoff_max, fail_threshold
+- Instance cycling when rate limited
+
+### [x] Exception Logging with Context
+- Replaced bare `except:` blocks with typed exception handlers across all files
+- Added context logging to exception handlers (e.g., URL, error message)
+
+### [x] Timeout Standardization
+- Added timeout_connect, timeout_read to [common] config section
+- Added stale_days, stats_interval to [watchd] config section
+
+### [x] Periodic Stats & Stale Cleanup (proxywatchd.py)
+- Stats class tracks tested/passed/failed with thread-safe counters
+- Configurable stats_interval (default: 300s)
+- cleanup_stale() removes dead proxies older than stale_days (default: 30)
+
+### [x] Unified Proxy Cache
+- Moved _known_proxies to fetch.py with helper functions
+- init_known_proxies(), add_known_proxies(), is_known_proxy()
+- ppf.py now uses shared cache via fetch module
+
+### [x] Config Validation
+- config.py: validate() method checks config values on startup
+- Validates: port ranges, timeout values, thread counts, engine names
+- Warns on missing source_file, unknown engines
+- Errors on unwritable database directories
+- Integrated into ppf.py, proxywatchd.py, scraper.py main entry points
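+
+A minimal sketch of the kind of checks validate() performs (the standalone
+function, names, and thresholds here are illustrative; the real implementation
+is a method on the config object):
+
+```python
+KNOWN_ENGINES = set(['duckduckgo', 'startpage', 'mojeek', 'qwant',
+                     'yandex', 'ecosia', 'brave'])  # engines per item 18
+
+def validate_config(port, timeout_connect, threads, engines):
+    """Return (errors, warnings) for a few representative startup checks."""
+    errors, warnings = [], []
+    if not 0 < port < 65536:
+        errors.append('httpd port out of range: %d' % port)
+    if timeout_connect <= 0:
+        errors.append('timeout_connect must be positive')
+    if threads < 1:
+        errors.append('thread count must be >= 1')
+    for engine in engines:
+        if engine.lower() not in KNOWN_ENGINES:
+            warnings.append('unknown engine: %s' % engine)
+    return errors, warnings
+```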