docs: update TODO and ROADMAP with completed work

ROADMAP.md (18 lines changed)

@@ -221,17 +221,29 @@ PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework design

- [x] Make IP2Location optional (graceful fallback)
- [x] Add --nobs flag for stdlib HTMLParser fallback (bs4 optional)

### Rate Limiting & Stability (Done)

- [x] InstanceTracker class in scraper.py with exponential backoff (see the sketch below)
- [x] Configurable backoff_base, backoff_max, fail_threshold
- [x] Exception logging with context (replaced bare except blocks)
- [x] Unified _known_proxies cache in fetch.py
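
A compact sketch of how the tracker's backoff can fit together (backoff_base, backoff_max, and fail_threshold are the real config knobs; the method names and internals here are illustrative, not the actual scraper.py code):

```python
import time

class InstanceTracker(object):
    """Per-instance failure tracking with capped exponential backoff."""

    def __init__(self, backoff_base=2.0, backoff_max=3600, fail_threshold=3):
        self.backoff_base = backoff_base
        self.backoff_max = backoff_max
        self.fail_threshold = fail_threshold
        self.fails = {}          # instance -> consecutive failure count
        self.blocked_until = {}  # instance -> unix time when usable again

    def record_failure(self, instance):
        self.fails[instance] = self.fails.get(instance, 0) + 1
        if self.fails[instance] >= self.fail_threshold:
            delay = min(self.backoff_base ** self.fails[instance], self.backoff_max)
            self.blocked_until[instance] = time.time() + delay

    def record_success(self, instance):
        self.fails[instance] = 0
        self.blocked_until.pop(instance, None)

    def is_available(self, instance):
        return time.time() >= self.blocked_until.get(instance, 0)
```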

### Monitoring & Maintenance (Done)

- [x] Stats class in proxywatchd.py (tested/passed/failed tracking)
- [x] Periodic stats reporting (configurable stats_interval)
- [x] Stale proxy cleanup (cleanup_stale() with configurable stale_days)
- [x] Timeout config options (timeout_connect, timeout_read)

---
## Technical Debt

| Item | Description | Risk |
|------|-------------|------|
| ~~Dual _known_proxies~~ | ~~ppf.py and fetch.py maintain separate caches~~ | **Resolved** |
| Global config in fetch.py | set_config() pattern is fragile | Low - works but not clean |
| No input validation | Proxy strings parsed without validation | Medium - could crash on bad data |
| ~~Silent exception catching~~ | ~~Some except: pass patterns hide errors~~ | **Resolved** |
| ~~Hardcoded timeouts~~ | ~~Various timeout values scattered in code~~ | **Resolved** |

---

TODO.md (304 lines changed)

@@ -13,221 +13,64 @@

## Immediate Priority (Next Sprint)

### [x] 1. Unify _known_proxies Cache

**Problem:** Both ppf.py and fetch.py maintain separate `_known_proxies` dictionaries.
Updates to one don't reflect in the other, causing potential duplicate processing.

**Implementation:**
```python
# fetch.py - becomes the single source of truth
import threading

_known_proxies = {}
_known_proxies_lock = threading.Lock()

def get_known_proxies():
    """Return reference to shared known proxies dict."""
    return _known_proxies

def is_known_proxy(proxy):
    """Thread-safe check if proxy is known."""
    with _known_proxies_lock:
        return proxy in _known_proxies

def mark_proxy_known(proxy):
    """Thread-safe mark proxy as known."""
    with _known_proxies_lock:
        _known_proxies[proxy] = True
```

**Files:** fetch.py, ppf.py
**Effort:** Low
**Risk:** Low

**Completed.** Added `init_known_proxies()`, `add_known_proxies()`, `is_known_proxy()`
to fetch.py. Updated ppf.py to use these functions instead of local cache.

---

### [x] 2. Graceful SQLite Error Handling

**Problem:** SQLite can throw "database is locked" errors under concurrent access.
Currently these bubble up and crash the application.

**Implementation:**
```python
# mysqlite.py
import sqlite3
import time

class mysqlite():
    def execute(self, query, params=None, retries=5):
        for attempt in range(retries):
            try:
                if params:
                    return self.cur.execute(query, params)
                return self.cur.execute(query)
            except sqlite3.OperationalError as e:
                if 'locked' in str(e) and attempt < retries - 1:
                    time.sleep(0.1 * (attempt + 1))  # Back off a little longer each retry
                    continue
                raise
```

**Files:** mysqlite.py
**Effort:** Low
**Risk:** Low

**Completed.** mysqlite.py now retries on "locked" errors with exponential backoff.

---

### [x] 3. Enable SQLite WAL Mode

**Problem:** Default SQLite journaling mode blocks concurrent readers during writes.

**Implementation:**
```python
# mysqlite.py - in __init__
def __init__(self, database, rowtype):
    self.conn = sqlite3.connect(database, check_same_thread=False)
    self.conn.execute('PRAGMA journal_mode=WAL')
    self.conn.execute('PRAGMA synchronous=NORMAL')
    # ...
```

**Files:** mysqlite.py
**Effort:** Low
**Risk:** Low (WAL is well-tested)

**Completed.** mysqlite.py enables WAL mode and NORMAL synchronous on init.

---

### [x] 4. Batch Database Inserts

**Problem:** insert_proxies() and insert_urls() do individual INSERTs, causing
excessive disk I/O and lock contention.

**Implementation:**
```python
# dbs.py
import time

def insert_proxies(proxydb, proxies, source):
    """Batch insert proxies."""
    if not proxies:
        return

    mytime = int(time.time())
    values = [(p, source, mytime) for p in proxies]

    proxydb.executemany(
        'INSERT OR IGNORE INTO proxylist (proxy, source, first_seen) VALUES (?, ?, ?)',
        values
    )
    proxydb.commit()
```

**Files:** dbs.py
**Effort:** Low
**Risk:** Low

**Completed.** dbs.py uses executemany() for batch inserts.

---

### [x] 5. Add Database Indexes

**Problem:** Queries on large tables are slow without proper indexes.

**Implementation:**
```python
# dbs.py - in create_table_if_not_exists
def create_indexes(db, table):
    if table == 'proxylist':
        db.execute('CREATE INDEX IF NOT EXISTS idx_proxy_failed ON proxylist(failed)')
        db.execute('CREATE INDEX IF NOT EXISTS idx_proxy_proto ON proxylist(proto)')
    elif table == 'uris':
        db.execute('CREATE INDEX IF NOT EXISTS idx_uri_error ON uris(error)')
        db.execute('CREATE INDEX IF NOT EXISTS idx_uri_checktime ON uris(check_time)')
    db.commit()
```

**Files:** dbs.py
**Effort:** Low
**Risk:** Low

**Completed.** dbs.py creates indexes on failed, tested, proto, error, check_time.

---

## Short Term (This Month)

### [x] 6. Log Level Filtering

**Problem:** Inconsistent logging across files. Some use print(), some use _log(),
some silently swallow errors.

**Implementation:**
```python
# misc.py - enhanced logging
import sys

LOG_LEVELS = {'debug': 0, 'info': 1, 'warn': 2, 'error': 3}
_log_level = 1  # Default: info

def set_log_level(level):
    global _log_level
    _log_level = LOG_LEVELS.get(level, 1)

def _log(msg, level='info', module=None):
    if LOG_LEVELS.get(level, 1) < _log_level:
        return
    prefix = '%s/%s' % (timestamp(), level)
    if module:
        prefix = '%s/%s' % (prefix, module)
    output = sys.stderr if level in ('warn', 'error') else sys.stdout
    print >> output, '\r%s\t%s' % (prefix, msg)
```

**Files:** misc.py, all other files
**Effort:** Medium
**Risk:** Low

**Completed.** Added log level filtering with -q/--quiet and -v/--verbose CLI flags.
- misc.py: LOG_LEVELS dict, set_log_level(), get_log_level()
- config.py: Added -q/--quiet and -v/--verbose arguments (wiring sketched below)
- Log levels: debug=0, info=1, warn=2, error=3
- --quiet: only show warn/error
- --verbose: show debug messages
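
The flag wiring could look roughly like this (argparse is assumed; the actual config.py option handling may differ):

```python
# Illustrative only: map the CLI flags onto misc.py's log level.
import argparse
from misc import set_log_level

parser = argparse.ArgumentParser()
parser.add_argument('-q', '--quiet', action='store_true',
                    help='only show warn/error messages')
parser.add_argument('-v', '--verbose', action='store_true',
                    help='also show debug messages')
args = parser.parse_args()

if args.quiet:
    set_log_level('warn')
elif args.verbose:
    set_log_level('debug')
```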

---

### [x] 7. Connection Timeout Standardization

**Problem:** Timeout values are scattered: config.ppf.timeout, config.watchd.timeout,
and hardcoded values in rocksock.py.

**Implementation:**
- Add to config.py: `[network]` section with timeout_connect, timeout_read, timeout_total
- Pass timeout explicitly to all network functions (see the sketch below)
- Remove hardcoded timeout values

**Files:** config.py, fetch.py, proxywatchd.py, rocksock.py
**Effort:** Medium
**Risk:** Low

**Completed.** Added timeout_connect and timeout_read to [common] section in config.py.
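
For illustration, reading the two options and passing them explicitly to a connect call (the option names match the config; the ppf.conf filename and the helper function are assumptions):

```python
import socket
import ConfigParser  # configparser on Python 3

cfg = ConfigParser.ConfigParser()
cfg.read('ppf.conf')  # hypothetical config file name
timeout_connect = cfg.getfloat('common', 'timeout_connect')
timeout_read = cfg.getfloat('common', 'timeout_read')

def open_proxy_socket(host, port):
    # Connect with the connect timeout, then switch to the read timeout.
    s = socket.create_connection((host, port), timeout=timeout_connect)
    s.settimeout(timeout_read)
    return s
```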

---

### [x] 8. Failure Categorization

**Problem:** All failures treated equally. Timeout vs connection-refused vs
auth-failure have different implications.

**Implementation:**
```python
# proxywatchd.py
class FailureType:
    TIMEOUT = 'timeout'          # Retry later
    REFUSED = 'refused'          # Proxy down, lower priority
    AUTH_FAIL = 'auth_fail'      # Wrong protocol, try others
    TARGET_DOWN = 'target_down'  # Not proxy's fault
    UNKNOWN = 'unknown'

def categorize_failure(exception):
    """Categorize failure type from exception."""
    msg = str(exception).lower()
    if 'timeout' in msg or 'timed out' in msg:
        return FailureType.TIMEOUT
    if 'refused' in msg:
        return FailureType.REFUSED
    if 'auth' in msg or 'handshake' in msg:
        return FailureType.AUTH_FAIL
    return FailureType.UNKNOWN
```

**Files:** proxywatchd.py
**Effort:** Medium
**Risk:** Low

**Completed.** Added failure categorization for proxy errors.
- misc.py: categorize_error() function, FAIL_* constants
- Categories: timeout, refused, auth, unreachable, dns, ssl, closed, proxy, other
- proxywatchd.py: Stats.record() now accepts category parameter (see the sketch below)
- Stats.report() shows failure breakdown by category
- ProxyTestState.evaluate() returns (success, category) tuple
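
A rough sketch of what the category-aware counters could look like (record() and report() are named in the notes above; the exact bookkeeping in proxywatchd.py may differ):

```python
import threading
from collections import defaultdict

class Stats(object):
    def __init__(self):
        self.lock = threading.Lock()
        self.tested = 0
        self.passed = 0
        self.by_category = defaultdict(int)  # failure category -> count

    def record(self, success, category=None):
        with self.lock:
            self.tested += 1
            if success:
                self.passed += 1
            else:
                self.by_category[category or 'other'] += 1

    def report(self):
        with self.lock:
            breakdown = ' '.join('%s=%d' % kv for kv in sorted(self.by_category.items()))
            return 'tested=%d passed=%d %s' % (self.tested, self.passed, breakdown)
```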

---

@@ -272,46 +115,10 @@ Priority calculation:

---

### [x] 10. Periodic Statistics Output

**Problem:** No visibility into system performance during operation.

**Implementation:**
```python
# proxywatchd.py
import threading
import time

class Stats:
    def __init__(self):
        self.lock = threading.Lock()
        self.tested = 0
        self.passed = 0
        self.failed = 0
        self.start_time = time.time()

    def record(self, success):
        with self.lock:
            self.tested += 1
            if success:
                self.passed += 1
            else:
                self.failed += 1

    def report(self):
        with self.lock:
            elapsed = time.time() - self.start_time
            rate = self.tested / elapsed if elapsed > 0 else 0
            pct = (self.passed * 100.0 / self.tested) if self.tested > 0 else 0
            return 'tested=%d passed=%d (%.1f%%) rate=%.1f/s' % (
                self.tested, self.passed, pct, rate)

# In main loop, every 5 minutes:
if time.time() - last_stats > 300:
    _log(stats.report(), 'stats', 'watchd')
    last_stats = time.time()
```

**Files:** proxywatchd.py
**Effort:** Low
**Risk:** Low

**Completed.** Added Stats class to proxywatchd.py with record(), should_report(),
and report() methods. Integrated into main loop with configurable stats_interval.

---

@@ -525,11 +332,24 @@ Verify proxy actually routes through claimed location using IP geolocation.

### [ ] 17. HTTPS/SSL Proxy Testing
Add capability to test HTTPS CONNECT proxies.

### [x] 18. Additional Search Engines

**Completed.** Added modular search engine architecture.
- engines.py: SearchEngine base class with build_url(), extract_urls(), is_rate_limited() (see the sketch below)
- Engines: DuckDuckGo, Startpage, Mojeek (UK), Qwant (FR), Yandex (RU), Ecosia, Brave
- Git hosters: GitHub, GitLab, Codeberg, Gitea
- scraper.py: EngineTracker class for multi-engine rate limiting
- Config: [scraper] engines, max_pages settings
- searx.instances: Updated with 51 active SearXNG instances
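
A condensed sketch of such a base class (build_url(), extract_urls(), and is_rate_limited() are named above; the method bodies and the DuckDuckGo URL pattern are illustrative, not the actual engines.py code):

```python
import re
import urllib

class SearchEngine(object):
    name = 'generic'
    search_url = None  # per-engine template, e.g. '...?q=%s&page=%d'

    def build_url(self, query, page=1):
        """Return the search URL for a query and result page."""
        return self.search_url % (urllib.quote_plus(query), page)

    def extract_urls(self, html):
        """Pull candidate result URLs out of a results page."""
        return re.findall(r'https?://[^\s"\'<>]+', html)

    def is_rate_limited(self, status, html):
        """Heuristic: engines often signal limiting via 429 or a captcha page."""
        return status == 429 or 'captcha' in html.lower()

class DuckDuckGo(SearchEngine):
    name = 'duckduckgo'
    search_url = 'https://html.duckduckgo.com/html/?q=%s&s=%d'
```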

### [x] 19. REST API

**Completed.** Added HTTP API server for querying working proxies.
- httpd.py: ProxyAPIServer class with BaseHTTPServer (a minimal handler is sketched below)
- Endpoints: /proxies, /proxies/count, /health
- Params: limit, proto, country, format (json/plain)
- Integrated into proxywatchd.py (starts when httpd.enabled=True)
- Config: [httpd] section with listenip, port, enabled
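
For illustration, a minimal BaseHTTPServer handler covering /health and /proxies/count (the real ProxyAPIServer also queries the proxy database and handles the parameters listed above):

```python
import json
from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler

class ProxyAPIHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/health':
            self._send(200, {'status': 'ok'})
        elif self.path == '/proxies/count':
            self._send(200, {'count': 0})  # stub: count working proxies here
        else:
            self._send(404, {'error': 'not found'})

    def _send(self, code, payload):
        body = json.dumps(payload)
        self.send_response(code)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == '__main__':
    # Address and port would come from the [httpd] config section.
    HTTPServer(('127.0.0.1', 8080), ProxyAPIHandler).serve_forever()
```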

### [ ] 20. Web Dashboard
Status page showing live statistics.

@@ -560,3 +380,33 @@ Status page showing live statistics.

- Removed unused functions (misc.py)
- Fixed IP/port cleansing (ppf.py)
- Updated .gitignore

### [x] Rate Limiting & Instance Tracking (scraper.py)
- InstanceTracker class with exponential backoff
- Configurable backoff_base, backoff_max, fail_threshold
- Instance cycling when rate limited

### [x] Exception Logging with Context
- Replaced bare `except:` with typed exceptions across all files
- Added context logging to exception handlers (e.g., URL, error message), as in the sketch below
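
An illustrative before/after of the pattern (fetch_url here is a stand-in stub for whatever network call is being wrapped; actual call sites vary):

```python
from misc import _log

url = 'http://example.com/proxies.txt'

def fetch_url(url):  # stand-in for the real network call
    raise IOError('connection refused')

# Before: the failure is silently swallowed
try:
    html = fetch_url(url)
except:
    pass

# After: typed exception, logged with the failing URL for context
try:
    html = fetch_url(url)
except (IOError, ValueError) as e:
    _log('fetch failed for %s: %s' % (url, e), 'warn', 'fetch')
```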

### [x] Timeout Standardization
- Added timeout_connect, timeout_read to [common] config section
- Added stale_days, stats_interval to [watchd] config section

### [x] Periodic Stats & Stale Cleanup (proxywatchd.py)
- Stats class tracks tested/passed/failed with thread-safe counters
- Configurable stats_interval (default: 300s)
- cleanup_stale() removes dead proxies older than stale_days (default: 30), roughly as sketched below
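
A plausible shape for the cleanup (this assumes a last_seen timestamp column and a failed counter on proxylist; the real schema and query may differ):

```python
import time

def cleanup_stale(proxydb, stale_days=30):
    """Delete proxies that have been failing for longer than stale_days."""
    cutoff = int(time.time()) - stale_days * 86400
    proxydb.execute(
        'DELETE FROM proxylist WHERE failed > 0 AND last_seen < ?',
        (cutoff,)
    )
    proxydb.commit()
```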

### [x] Unified Proxy Cache
- Moved _known_proxies to fetch.py with helper functions
- init_known_proxies(), add_known_proxies(), is_known_proxy()
- ppf.py now uses shared cache via fetch module (usage sketched below)
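
Call-site usage might look like this in ppf.py (the three function names match the list above; the argument shapes and the seeding behavior are assumptions):

```python
import fetch

def collect_new_proxies(proxydb, scraped):
    """Filter scraped proxies against the shared cache in fetch.py."""
    fetch.init_known_proxies(proxydb)  # seed from the DB; assumed no-op if already seeded
    new = [p for p in scraped if not fetch.is_known_proxy(p)]
    fetch.add_known_proxies(new)  # other modules will now skip these
    return new
```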

### [x] Config Validation
- config.py: validate() method checks config values on startup
- Validates: port ranges, timeout values, thread counts, engine names
- Warns on missing source_file, unknown engines
- Errors on unwritable database directories
- Integrated into ppf.py, proxywatchd.py, scraper.py main entry points (see the sketch below)
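
A reduced sketch of the kind of checks involved (validate() is named above; the specific config attributes and messages here are illustrative):

```python
import os

def validate(config):
    """Return a list of fatal errors; warnings would just be logged."""
    errors = []
    if not (1 <= config.httpd.port <= 65535):
        errors.append('httpd.port out of range: %d' % config.httpd.port)
    if config.common.timeout_connect <= 0:
        errors.append('timeout_connect must be positive')
    dbdir = os.path.dirname(os.path.abspath(config.common.database))
    if not os.access(dbdir, os.W_OK):
        errors.append('database directory not writable: %s' % dbdir)
    return errors
```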