# PPF Implementation Tasks

## Legend

```
[ ] Not started
[~] In progress
[x] Completed
[!] Blocked/needs discussion
```

---

## Immediate Priority (Next Sprint)

### [x] 1. Unify _known_proxies Cache

**Completed.** Added `init_known_proxies()`, `add_known_proxies()`, and `is_known_proxy()`
to fetch.py. Updated ppf.py to use these functions instead of its local cache.

---

### [x] 2. Graceful SQLite Error Handling

**Completed.** mysqlite.py now retries on "locked" errors with exponential backoff.

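The retry loop can be sketched roughly as follows. This is a minimal illustration, not mysqlite.py's actual code; the helper name, retry count, and base delay are assumptions.

```python
import sqlite3
import time

def execute_with_retry(conn, sql, params=(), retries=5, base_delay=0.1):
    """Retry a statement when SQLite reports the database is locked."""
    for attempt in range(retries):
        try:
            return conn.execute(sql, params)
        except sqlite3.OperationalError as e:
            if 'locked' not in str(e) or attempt == retries - 1:
                raise
            # Exponential backoff: 0.1s, 0.2s, 0.4s, ...
            time.sleep(base_delay * (2 ** attempt))
```
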
---

### [x] 3. Enable SQLite WAL Mode

**Completed.** mysqlite.py enables WAL mode and NORMAL synchronous on init.

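The init-time pragmas look roughly like this (a sketch; a temporary path stands in for the real database file):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'proxies.db')  # stand-in path
conn = sqlite3.connect(path)
# WAL allows concurrent readers while a single writer proceeds;
# synchronous=NORMAL trades a little durability for speed.
mode = conn.execute('PRAGMA journal_mode=WAL').fetchone()[0]
conn.execute('PRAGMA synchronous=NORMAL')
```
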
---

### [x] 4. Batch Database Inserts

**Completed.** dbs.py uses executemany() for batch inserts.

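The pattern, on a simplified stand-in schema (the real proxylist table has more columns):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE proxylist (proto TEXT, proxy TEXT)')  # simplified
rows = [('socks5', '1.2.3.4:1080'), ('http', '5.6.7.8:8080')]
# One executemany() call replaces a loop of single-row INSERTs,
# cutting per-statement overhead.
conn.executemany('INSERT INTO proxylist VALUES (?, ?)', rows)
```
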
---

### [x] 5. Add Database Indexes

**Completed.** dbs.py creates indexes on failed, tested, proto, error, check_time.

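Index creation can be sketched like this (simplified schema and index names are assumptions; `IF NOT EXISTS` makes it safe to run on every startup):

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # stand-in for the real DB
conn.execute('CREATE TABLE proxylist (proxy TEXT, failed INTEGER, proto TEXT)')
# Index the columns most queries filter by.
for col in ('failed', 'proto'):
    conn.execute('CREATE INDEX IF NOT EXISTS idx_%s ON proxylist(%s)' % (col, col))
```
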
---

## Short Term (This Month)

### [x] 6. Log Level Filtering

**Completed.** Added log level filtering with -q/--quiet and -v/--verbose CLI flags.
- misc.py: LOG_LEVELS dict, set_log_level(), get_log_level()
- config.py: Added -q/--quiet and -v/--verbose arguments
- Log levels: debug=0, info=1, warn=2, error=3
- --quiet: only show warn/error
- --verbose: show debug messages

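The level gate works roughly like this. The dict and setter names match the task notes above; the `log()` signature and return value are illustrative assumptions, not misc.py's actual API.

```python
LOG_LEVELS = {'debug': 0, 'info': 1, 'warn': 2, 'error': 3}
_current_level = LOG_LEVELS['info']

def set_log_level(name):
    global _current_level
    _current_level = LOG_LEVELS[name]

def log(level, msg):
    # Drop messages below the configured threshold.
    if LOG_LEVELS[level] < _current_level:
        return None
    line = '[%s] %s' % (level, msg)
    print(line)
    return line
```

With `--quiet` mapping to `set_log_level('warn')` and `--verbose` to `set_log_level('debug')`.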
---

### [x] 7. Connection Timeout Standardization

**Completed.** Added timeout_connect and timeout_read to [common] section in config.py.

---

### [x] 8. Failure Categorization

**Completed.** Added failure categorization for proxy errors.
- misc.py: categorize_error() function, FAIL_* constants
- Categories: timeout, refused, auth, unreachable, dns, ssl, closed, proxy, other
- proxywatchd.py: Stats.record() now accepts category parameter
- Stats.report() shows failure breakdown by category
- ProxyTestState.evaluate() returns (success, category) tuple

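A reduced sketch of the substring-matching approach (only four of the nine categories shown; the matched substrings and their order are assumptions, not misc.py's actual rules):

```python
FAIL_TIMEOUT = 'timeout'
FAIL_REFUSED = 'refused'
FAIL_DNS = 'dns'
FAIL_OTHER = 'other'

def categorize_error(errmsg):
    """Map an error message to a failure category constant."""
    msg = errmsg.lower()
    if 'timed out' in msg or 'timeout' in msg:
        return FAIL_TIMEOUT
    if 'refused' in msg:
        return FAIL_REFUSED
    if 'getaddrinfo' in msg or 'name or service not known' in msg:
        return FAIL_DNS
    return FAIL_OTHER
```
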
---

### [x] 9. Priority Queue for Proxy Testing

**Completed.** Added priority-based job scheduling for proxy tests.
- PriorityJobQueue class with heap-based ordering
- calculate_priority() assigns priority 0-4 based on proxy state
- Priority 0: New proxies (never tested)
- Priority 1: Working proxies (no failures)
- Priority 2: Low fail count (< 3)
- Priority 3-4: Medium/high fail count
- Integrated into prepare_jobs() for automatic prioritization

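The heap-based ordering reduces to something like this minimal sketch (the real PriorityJobQueue in proxywatchd.py may differ):

```python
import heapq
import itertools

class PriorityJobQueue:
    """Heap-backed queue: lowest priority number is served first."""

    def __init__(self):
        self._heap = []
        # Tie-breaking counter keeps insertion order within a priority
        # and avoids ever comparing job objects directly.
        self._counter = itertools.count()

    def put(self, priority, job):
        heapq.heappush(self._heap, (priority, next(self._counter), job))

    def get(self):
        return heapq.heappop(self._heap)[2]
```
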
---

### [x] 10. Periodic Statistics Output

**Completed.** Added Stats class to proxywatchd.py with record(), should_report(),
and report() methods. Integrated into main loop with configurable stats_interval.

---

## Medium Term (Next Quarter)

### [x] 11. Tor Connection Pooling

**Completed.** Added connection pooling with worker-Tor affinity and health monitoring.
- connection_pool.py: TorHostState class tracks per-host health, latency, backoff
- connection_pool.py: TorConnectionPool with worker affinity, warmup, statistics
- proxywatchd.py: Workers get consistent Tor host assignment for circuit reuse
- Success/failure tracking with exponential backoff (5s, 10s, 20s, 40s, max 60s)
- Latency tracking with rolling averages
- Pool status reported alongside periodic stats

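The backoff schedule (5s doubling to a 60s cap) can be sketched as below; method names are assumptions, not connection_pool.py's actual API.

```python
class TorHostState:
    """Tracks consecutive failures and derives a backoff delay."""

    def __init__(self):
        self.fail_count = 0

    def record_failure(self):
        self.fail_count += 1

    def record_success(self):
        # A success resets the backoff entirely.
        self.fail_count = 0

    def backoff_seconds(self):
        if self.fail_count == 0:
            return 0
        # 5, 10, 20, 40, then capped at 60 seconds.
        return min(5 * 2 ** (self.fail_count - 1), 60)
```
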
---

### [ ] 12. Dynamic Thread Scaling

**Problem:** Fixed thread count regardless of success rate or system load.

**Implementation:**
```python
# proxywatchd.py
from collections import deque

class ThreadScaler:
    """Dynamically adjust thread count based on performance."""

    def __init__(self, min_threads=5, max_threads=50):
        self.min = min_threads
        self.max = max_threads
        self.current = min_threads
        # Rolling window of the last 100 results (True/False)
        self.success_rate_window = deque(maxlen=100)

    def record_result(self, success):
        self.success_rate_window.append(success)

    def recommended_threads(self):
        # Not enough data yet to make a call
        if len(self.success_rate_window) < 20:
            return self.current

        success_rate = sum(self.success_rate_window) / float(len(self.success_rate_window))

        # High success rate -> can handle more threads
        if success_rate > 0.7 and self.current < self.max:
            return min(self.current + 5, self.max)
        # Low success rate -> reduce load
        elif success_rate < 0.3 and self.current > self.min:
            return max(self.current - 5, self.min)

        return self.current
```

**Files:** proxywatchd.py
**Effort:** Medium
**Risk:** Medium

---

### [ ] 13. Latency Tracking

**Problem:** No visibility into proxy speed. A working but slow proxy may be
less useful than a fast one.

**Implementation:**
```python
# dbs.py - add columns
# ALTER TABLE proxylist ADD COLUMN avg_latency REAL DEFAULT 0
# ALTER TABLE proxylist ADD COLUMN latency_samples INTEGER DEFAULT 0

def update_proxy_latency(proxydb, proxy, latency):
    """Update rolling average latency for proxy."""
    row = proxydb.execute(
        'SELECT avg_latency, latency_samples FROM proxylist WHERE proxy=?',
        (proxy,)
    ).fetchone()

    if row:
        old_avg, samples = row
        # Weighted moving average; capping samples at 100 makes this
        # behave like an exponential moving average with alpha ~= 1/100
        new_avg = (old_avg * samples + latency) / (samples + 1)
        new_samples = min(samples + 1, 100)

        proxydb.execute(
            'UPDATE proxylist SET avg_latency=?, latency_samples=? WHERE proxy=?',
            (new_avg, new_samples, proxy)
        )
```

**Files:** dbs.py, proxywatchd.py
**Effort:** Medium
**Risk:** Low

---

### [ ] 14. Export Functionality

**Problem:** No easy way to export working proxies for use elsewhere.

**Implementation:**
```python
# new file: export.py
def export_proxies(proxydb, format='txt', filters=None):
    """Export working proxies to various formats."""

    query = 'SELECT proto, proxy FROM proxylist WHERE failed=0'
    params = []
    if filters and 'proto' in filters:
        query += ' AND proto=?'
        params.append(filters['proto'])

    rows = proxydb.execute(query, params).fetchall()

    if format == 'txt':
        return '\n'.join('%s://%s' % (r[0], r[1]) for r in rows)
    elif format == 'json':
        import json
        return json.dumps([{'proto': r[0], 'address': r[1]} for r in rows])
    elif format == 'csv':
        return 'proto,address\n' + '\n'.join('%s,%s' % (r[0], r[1]) for r in rows)

# CLI: python export.py --format json --proto socks5 > proxies.json
```

**Files:** new export.py
**Effort:** Low
**Risk:** Low

---

### [ ] 15. Unit Test Infrastructure

**Problem:** No automated tests. Changes can break existing functionality silently.

**Implementation:**
```
tests/
├── __init__.py
├── test_proxy_utils.py   # Test IP validation, cleansing
├── test_extract.py       # Test proxy/URL extraction
├── test_database.py      # Test DB operations with temp DB
└── mock_network.py       # Mock rocksock for offline testing
```

```python
# tests/test_proxy_utils.py
import os
import sys
import unittest

# Make the project root importable regardless of the working directory
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..'))
import fetch

class TestProxyValidation(unittest.TestCase):
    def test_valid_proxy(self):
        self.assertTrue(fetch.is_usable_proxy('8.8.8.8:8080'))

    def test_private_ip_rejected(self):
        self.assertFalse(fetch.is_usable_proxy('192.168.1.1:8080'))
        self.assertFalse(fetch.is_usable_proxy('10.0.0.1:8080'))
        self.assertFalse(fetch.is_usable_proxy('172.16.0.1:8080'))

    def test_invalid_port_rejected(self):
        self.assertFalse(fetch.is_usable_proxy('8.8.8.8:0'))
        self.assertFalse(fetch.is_usable_proxy('8.8.8.8:99999'))

if __name__ == '__main__':
    unittest.main()
```

**Files:** tests/ directory
**Effort:** High (initial), Low (ongoing)
**Risk:** Low

---

## Long Term (Future)

### [ ] 16. Geographic Validation
Verify proxy actually routes through claimed location using IP geolocation.

### [ ] 17. HTTPS/SSL Proxy Testing
Add capability to test HTTPS CONNECT proxies.

### [x] 18. Additional Search Engines

**Completed.** Added modular search engine architecture.
- engines.py: SearchEngine base class with build_url(), extract_urls(), is_rate_limited()
- Engines: DuckDuckGo, Startpage, Mojeek (UK), Qwant (FR), Yandex (RU), Ecosia, Brave
- Git hosters: GitHub, GitLab, Codeberg, Gitea
- scraper.py: EngineTracker class for multi-engine rate limiting
- Config: [scraper] engines, max_pages settings
- searx.instances: Updated with 51 active SearXNG instances

### [x] 19. REST API

**Completed.** Added HTTP API server for querying working proxies.
- httpd.py: ProxyAPIServer class with BaseHTTPServer
- Endpoints: /proxies, /proxies/count, /health
- Params: limit, proto, country, format (json/plain)
- Integrated into proxywatchd.py (starts when httpd.enabled=True)
- Config: [httpd] section with listenip, port, enabled

### [ ] 20. Web Dashboard
Status page showing live statistics.

---

## Completed

### [x] Work-Stealing Queue
- Implemented shared Queue.Queue() for job distribution
- Workers pull from shared queue instead of pre-assigned lists
- Better utilization across threads

### [x] Multi-Target Validation
- Test each proxy against 3 random targets
- 2/3 majority required for success
- Reduces false negatives from single-target failures

### [x] Interleaved Testing
- Jobs shuffled across all proxies before queueing
- Prevents a burst of 3 connections to the same proxy
- ProxyTestState accumulates results from TargetTestJobs

### [x] Code Cleanup
- Removed 93 lines of dead HTTP server code (ppf.py)
- Removed dead gumbo parser (soup_parser.py)
- Removed test code (comboparse.py)
- Removed unused functions (misc.py)
- Fixed IP/port cleansing (ppf.py)
- Updated .gitignore

### [x] Rate Limiting & Instance Tracking (scraper.py)
- InstanceTracker class with exponential backoff
- Configurable backoff_base, backoff_max, fail_threshold
- Instance cycling when rate limited

### [x] Exception Logging with Context
- Replaced bare `except:` with typed exceptions across all files
- Added context logging to exception handlers (e.g., URL, error message)

### [x] Timeout Standardization
- Added timeout_connect, timeout_read to [common] config section
- Added stale_days, stats_interval to [watchd] config section

### [x] Periodic Stats & Stale Cleanup (proxywatchd.py)
- Stats class tracks tested/passed/failed with thread-safe counters
- Configurable stats_interval (default: 300s)
- cleanup_stale() removes dead proxies older than stale_days (default: 30)

### [x] Unified Proxy Cache
- Moved _known_proxies to fetch.py with helper functions
- init_known_proxies(), add_known_proxies(), is_known_proxy()
- ppf.py now uses the shared cache via the fetch module

### [x] Config Validation
- config.py: validate() method checks config values on startup
- Validates: port ranges, timeout values, thread counts, engine names
- Warns on missing source_file, unknown engines
- Errors on unwritable database directories
- Integrated into ppf.py, proxywatchd.py, scraper.py main entry points