diff --git a/ROADMAP.md b/ROADMAP.md new file mode 100644 index 0000000..2ebeab4 --- /dev/null +++ b/ROADMAP.md @@ -0,0 +1,242 @@ +# PPF Project Roadmap + +## Project Purpose + +PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework designed to: + +1. **Discover** proxy addresses by crawling websites and search engines +2. **Validate** proxies through multi-target testing via Tor +3. **Maintain** a database of working proxies with protocol detection (SOCKS4/SOCKS5/HTTP) + +## Architecture Overview + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ PPF Architecture │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ +│ │ scraper.py │ │ ppf.py │ │proxywatchd │ │ +│ │ │ │ │ │ │ │ +│ │ Searx query │───>│ URL harvest │───>│ Proxy test │ │ +│ │ URL finding │ │ Proxy extract│ │ Validation │ │ +│ └─────────────┘ └─────────────┘ └─────────────┘ │ +│ │ │ │ │ +│ v v v │ +│ ┌─────────────────────────────────────────────────────────────────┐ │ +│ │ SQLite Databases │ │ +│ │ uris.db (URLs) proxies.db (proxy list) │ │ +│ └─────────────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌─────────────────────────────────────────────────────────────────┐ │ +│ │ Network Layer │ │ +│ │ rocksock.py ─── Tor SOCKS ─── Test Proxy ─── Target Server │ │ +│ └─────────────────────────────────────────────────────────────────┘ │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +## Constraints + +- **Python 2.7** compatibility required +- **Minimal external dependencies** (avoid adding new modules) +- Current dependencies: beautifulsoup4, lxml, IP2Location + +--- + +## Phase 1: Stability & Code Quality + +**Objective:** Establish a solid, maintainable codebase + +### 1.1 Error Handling Improvements + +| Task | Description | File(s) | +|------|-------------|---------| +| Add connection retry 
logic | Implement exponential backoff for failed connections | rocksock.py, fetch.py | +| Graceful database errors | Handle SQLite lock/busy errors with retry | mysqlite.py | +| Timeout standardization | Consistent timeout handling across all network ops | proxywatchd.py, fetch.py | +| Exception logging | Log exceptions with context, not just silently catch | all files | + +### 1.2 Code Consolidation + +| Task | Description | File(s) | +|------|-------------|---------| +| Unify _known_proxies | Single source of truth for known proxy cache | ppf.py, fetch.py | +| Extract proxy utils | Create proxy_utils.py with cleanse/validate functions | new file | +| Remove global config pattern | Pass config explicitly instead of set_config() | fetch.py | +| Standardize logging | Consistent _log() usage with levels across all modules | all files | + +### 1.3 Testing Infrastructure + +| Task | Description | File(s) | +|------|-------------|---------| +| Add unit tests | Test proxy parsing, URL extraction, IP validation | tests/ | +| Mock network layer | Allow testing without live network/Tor | tests/ | +| Validation test suite | Verify multi-target voting logic | tests/ | + +--- + +## Phase 2: Performance Optimization + +**Objective:** Improve throughput and resource efficiency + +### 2.1 Connection Pooling + +| Task | Description | File(s) | +|------|-------------|---------| +| Tor connection reuse | Pool Tor SOCKS connections instead of reconnecting | proxywatchd.py | +| HTTP keep-alive | Reuse connections to same target servers | http2.py | +| Connection warm-up | Pre-establish connections before job assignment | proxywatchd.py | + +### 2.2 Database Optimization + +| Task | Description | File(s) | +|------|-------------|---------| +| Batch inserts | Group INSERT operations (already partial) | dbs.py | +| Index optimization | Add indexes for frequent query patterns | dbs.py | +| WAL mode | Enable Write-Ahead Logging for better concurrency | mysqlite.py | +| Prepared statements 
| Cache compiled SQL statements | mysqlite.py | + +### 2.3 Threading Improvements + +| Task | Description | File(s) | +|------|-------------|---------| +| Dynamic thread scaling | Adjust thread count based on success rate | proxywatchd.py | +| Priority queue | Test high-value proxies (low fail count) first | proxywatchd.py | +| Stale proxy cleanup | Background thread to remove long-dead proxies | proxywatchd.py | + +--- + +## Phase 3: Reliability & Accuracy + +**Objective:** Improve proxy validation accuracy and system reliability + +### 3.1 Enhanced Validation + +| Task | Description | File(s) | +|------|-------------|---------| +| Latency tracking | Store and use connection latency metrics | proxywatchd.py, dbs.py | +| Geographic validation | Verify proxy actually routes through claimed location | proxywatchd.py | +| Protocol fingerprinting | Better SOCKS4/SOCKS5/HTTP detection | rocksock.py | +| HTTPS/SSL testing | Validate SSL proxy capabilities | proxywatchd.py | + +### 3.2 Target Management + +| Task | Description | File(s) | +|------|-------------|---------| +| Dynamic target pool | Auto-discover and rotate validation targets | proxywatchd.py | +| Target health tracking | Remove unresponsive targets from pool | proxywatchd.py | +| Geographic target spread | Ensure targets span multiple regions | config.py | + +### 3.3 Failure Analysis + +| Task | Description | File(s) | +|------|-------------|---------| +| Failure categorization | Distinguish timeout vs refused vs auth-fail | proxywatchd.py | +| Retry strategies | Different retry logic per failure type | proxywatchd.py | +| Dead proxy quarantine | Separate storage for likely-dead proxies | dbs.py | + +--- + +## Phase 4: Features & Usability + +**Objective:** Add useful features while maintaining simplicity + +### 4.1 Reporting & Monitoring + +| Task | Description | File(s) | +|------|-------------|---------| +| Statistics collection | Track success rates, throughput, latency | proxywatchd.py | +| Periodic 
status output | Log summary stats every N minutes | ppf.py, proxywatchd.py | +| Export functionality | Export working proxies to file (txt, json) | new: export.py | + +### 4.2 Configuration + +| Task | Description | File(s) | +|------|-------------|---------| +| Config validation | Validate config.ini on startup | config.py | +| Runtime reconfiguration | Reload config without restart (SIGHUP) | proxywatchd.py | +| Sensible defaults | Document and improve default values | config.py | + +### 4.3 Proxy Source Expansion + +| Task | Description | File(s) | +|------|-------------|---------| +| Additional scrapers | Support more search engines beyond Searx | scraper.py | +| API sources | Integrate free proxy API endpoints | new: api_sources.py | +| Import formats | Support various proxy list formats | ppf.py | + +--- + +## Implementation Priority + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ Priority Matrix │ +├──────────────────────────┬──────────────────────────────────────────────────┤ +│ HIGH IMPACT / LOW EFFORT │ HIGH IMPACT / HIGH EFFORT │ +│ │ │ +│ ● Unify _known_proxies │ ● Connection pooling │ +│ ● Graceful DB errors │ ● Dynamic thread scaling │ +│ ● Batch inserts │ ● Unit test infrastructure │ +│ ● WAL mode for SQLite │ ● Latency tracking │ +│ │ │ +├──────────────────────────┼──────────────────────────────────────────────────┤ +│ LOW IMPACT / LOW EFFORT │ LOW IMPACT / HIGH EFFORT │ +│ │ │ +│ ● Standardize logging │ ● Geographic validation │ +│ ● Config validation │ ● Additional scrapers │ +│ ● Export functionality │ ● API sources │ +│ ● Status output │ ● Protocol fingerprinting │ +│ │ │ +└──────────────────────────┴──────────────────────────────────────────────────┘ +``` + +--- + +## Completed Work + +### Multi-Target Validation (Done) +- [x] Work-stealing queue with shared Queue.Queue() +- [x] Multi-target validation (2/3 majority voting) +- [x] Interleaved testing (jobs shuffled across proxies) +- [x] ProxyTestState 
and TargetTestJob classes + +### Code Cleanup (Done) +- [x] Removed dead HTTP server code from ppf.py +- [x] Removed dead gumbo code from soup_parser.py +- [x] Removed test code from comboparse.py +- [x] Removed unused functions from misc.py +- [x] Fixed IP/port cleansing in ppf.py extract_proxies() +- [x] Updated .gitignore, removed .pyc files + +--- + +## Technical Debt + +| Item | Description | Risk | +|------|-------------|------| +| Dual _known_proxies | ppf.py and fetch.py maintain separate caches | Medium - duplicates possible | +| Global config in fetch.py | set_config() pattern is fragile | Low - works but not clean | +| No input validation | Proxy strings parsed without validation | Medium - could crash on bad data | +| Silent exception catching | Some except: pass patterns hide errors | High - hard to debug | +| Hardcoded timeouts | Various timeout values scattered in code | Low - works but not configurable | + +--- + +## File Reference + +| File | Purpose | Status | +|------|---------|--------| +| ppf.py | Main URL harvester daemon | Active, cleaned | +| proxywatchd.py | Proxy validation daemon | Active, enhanced | +| scraper.py | Searx search integration | Active, cleaned | +| fetch.py | HTTP fetching with proxy support | Active | +| dbs.py | Database schema and inserts | Active | +| mysqlite.py | SQLite wrapper | Active | +| rocksock.py | Socket/proxy abstraction (3rd party) | Stable | +| http2.py | HTTP client implementation | Stable | +| config.py | Configuration management | Active | +| comboparse.py | Config/arg parser framework | Stable, cleaned | +| soup_parser.py | BeautifulSoup wrapper | Stable, cleaned | +| misc.py | Utilities (timestamp, logging) | Stable, cleaned | diff --git a/TODO.md b/TODO.md new file mode 100644 index 0000000..fe08da0 --- /dev/null +++ b/TODO.md @@ -0,0 +1,562 @@ +# PPF Implementation Tasks + +## Legend + +``` +[ ] Not started +[~] In progress +[x] Completed +[!] 
Blocked/needs discussion
+```
+
+---
+
+## Immediate Priority (Next Sprint)
+
+### [ ] 1. Unify _known_proxies Cache
+
+**Problem:** Both ppf.py and fetch.py maintain separate `_known_proxies` dictionaries.
+Updates to one don't reflect in the other, causing potential duplicate processing.
+
+**Implementation:**
+```python
+# fetch.py - becomes the single source of truth
+_known_proxies = {}
+_known_proxies_lock = threading.Lock()
+
+def get_known_proxies():
+    """Return reference to shared known proxies dict."""
+    return _known_proxies
+
+def is_known_proxy(proxy):
+    """Thread-safe check if proxy is known."""
+    with _known_proxies_lock:
+        return proxy in _known_proxies
+
+def mark_proxy_known(proxy):
+    """Thread-safe mark proxy as known."""
+    with _known_proxies_lock:
+        _known_proxies[proxy] = True
+```
+
+**Files:** fetch.py, ppf.py
+**Effort:** Low
+**Risk:** Low
+
+---
+
+### [ ] 2. Graceful SQLite Error Handling
+
+**Problem:** SQLite can throw "database is locked" errors under concurrent access.
+Currently these bubble up and crash the application.
+
+**Implementation:**
+```python
+# mysqlite.py
+import sqlite3
+import time
+
+class mysqlite():
+    def execute(self, query, params=None, retries=5):
+        for attempt in range(retries):
+            try:
+                if params:
+                    return self.cur.execute(query, params)
+                return self.cur.execute(query)
+            except sqlite3.OperationalError as e:
+                if 'locked' in str(e) and attempt < retries - 1:
+                    time.sleep(0.1 * (2 ** attempt))  # Exponential backoff: 0.1s, 0.2s, 0.4s, ...
+                    continue
+                raise
+```
+
+**Files:** mysqlite.py
+**Effort:** Low
+**Risk:** Low
+
+---
+
+### [ ] 3. Enable SQLite WAL Mode
+
+**Problem:** Default SQLite journaling mode blocks concurrent readers during writes.
+
+**Implementation:**
+```python
+# mysqlite.py - in __init__
+def __init__(self, database, rowtype):
+    self.conn = sqlite3.connect(database, check_same_thread=False)
+    self.conn.execute('PRAGMA journal_mode=WAL')
+    self.conn.execute('PRAGMA synchronous=NORMAL')
+    # ... 
+``` + +**Files:** mysqlite.py +**Effort:** Low +**Risk:** Low (WAL is well-tested) + +--- + +### [ ] 4. Batch Database Inserts + +**Problem:** insert_proxies() and insert_urls() do individual INSERTs, causing +excessive disk I/O and lock contention. + +**Implementation:** +```python +# dbs.py +def insert_proxies(proxydb, proxies, source): + """Batch insert proxies.""" + if not proxies: + return + + mytime = int(time.time()) + values = [(p, source, mytime) for p in proxies] + + proxydb.executemany( + 'INSERT OR IGNORE INTO proxylist (proxy, source, first_seen) VALUES (?, ?, ?)', + values + ) + proxydb.commit() +``` + +**Files:** dbs.py +**Effort:** Low +**Risk:** Low + +--- + +### [ ] 5. Add Database Indexes + +**Problem:** Queries on large tables are slow without proper indexes. + +**Implementation:** +```python +# dbs.py - in create_table_if_not_exists +def create_indexes(db, table): + if table == 'proxylist': + db.execute('CREATE INDEX IF NOT EXISTS idx_proxy_failed ON proxylist(failed)') + db.execute('CREATE INDEX IF NOT EXISTS idx_proxy_proto ON proxylist(proto)') + elif table == 'uris': + db.execute('CREATE INDEX IF NOT EXISTS idx_uri_error ON uris(error)') + db.execute('CREATE INDEX IF NOT EXISTS idx_uri_checktime ON uris(check_time)') + db.commit() +``` + +**Files:** dbs.py +**Effort:** Low +**Risk:** Low + +--- + +## Short Term (This Month) + +### [ ] 6. Standardize Logging + +**Problem:** Inconsistent logging across files. Some use print(), some use _log(), +some silently swallow errors. 
+ +**Implementation:** +```python +# misc.py - enhanced logging +import sys + +LOG_LEVELS = {'debug': 0, 'info': 1, 'warn': 2, 'error': 3} +_log_level = 1 # Default: info + +def set_log_level(level): + global _log_level + _log_level = LOG_LEVELS.get(level, 1) + +def _log(msg, level='info', module=None): + if LOG_LEVELS.get(level, 1) < _log_level: + return + prefix = '%s/%s' % (timestamp(), level) + if module: + prefix = '%s/%s' % (prefix, module) + output = sys.stderr if level in ('warn', 'error') else sys.stdout + print >> output, '\r%s\t%s' % (prefix, msg) +``` + +**Files:** misc.py, all other files +**Effort:** Medium +**Risk:** Low + +--- + +### [ ] 7. Connection Timeout Standardization + +**Problem:** Timeout values are scattered: config.ppf.timeout, config.watchd.timeout, +hardcoded values in rocksock.py. + +**Implementation:** +- Add to config.py: `[network]` section with timeout_connect, timeout_read, timeout_total +- Pass timeout explicitly to all network functions +- Remove hardcoded timeout values + +**Files:** config.py, fetch.py, proxywatchd.py, rocksock.py +**Effort:** Medium +**Risk:** Low + +--- + +### [ ] 8. Failure Categorization + +**Problem:** All failures treated equally. Timeout vs connection-refused vs +auth-failure have different implications. 
+ +**Implementation:** +```python +# proxywatchd.py +class FailureType: + TIMEOUT = 'timeout' # Retry later + REFUSED = 'refused' # Proxy down, lower priority + AUTH_FAIL = 'auth_fail' # Wrong protocol, try others + TARGET_DOWN = 'target_down' # Not proxy's fault + UNKNOWN = 'unknown' + +def categorize_failure(exception): + """Categorize failure type from exception.""" + msg = str(exception).lower() + if 'timeout' in msg or 'timed out' in msg: + return FailureType.TIMEOUT + if 'refused' in msg: + return FailureType.REFUSED + if 'auth' in msg or 'handshake' in msg: + return FailureType.AUTH_FAIL + return FailureType.UNKNOWN +``` + +**Files:** proxywatchd.py +**Effort:** Medium +**Risk:** Low + +--- + +### [ ] 9. Priority Queue for Proxy Testing + +**Problem:** All proxies tested with equal priority. Should prioritize: +- Recently successful proxies (verify still working) +- New proxies (determine if usable) +- Low fail-count proxies + +**Implementation:** +```python +# proxywatchd.py +import heapq + +class PriorityJobQueue: + """Priority queue wrapper for proxy test jobs.""" + def __init__(self): + self.heap = [] + self.lock = threading.Lock() + + def put(self, job, priority): + """Lower priority number = higher priority.""" + with self.lock: + heapq.heappush(self.heap, (priority, id(job), job)) + + def get(self, timeout=None): + """Get highest priority job.""" + # ... implementation with timeout +``` + +Priority calculation: +- New proxy (retrievals=0): priority 0 +- Recent success (last_success < 1hr): priority 1 +- Low fail count (failed < 3): priority 2 +- Medium fail count: priority 3 +- High fail count: priority 4 + +**Files:** proxywatchd.py +**Effort:** Medium +**Risk:** Medium + +--- + +### [ ] 10. Periodic Statistics Output + +**Problem:** No visibility into system performance during operation. 
+ +**Implementation:** +```python +# proxywatchd.py +class Stats: + def __init__(self): + self.lock = threading.Lock() + self.tested = 0 + self.passed = 0 + self.failed = 0 + self.start_time = time.time() + + def record(self, success): + with self.lock: + self.tested += 1 + if success: + self.passed += 1 + else: + self.failed += 1 + + def report(self): + with self.lock: + elapsed = time.time() - self.start_time + rate = self.tested / elapsed if elapsed > 0 else 0 + pct = (self.passed * 100.0 / self.tested) if self.tested > 0 else 0 + return 'tested=%d passed=%d (%.1f%%) rate=%.1f/s' % ( + self.tested, self.passed, pct, rate) + +# In main loop, every 5 minutes: +if time.time() - last_stats > 300: + _log(stats.report(), 'stats', 'watchd') + last_stats = time.time() +``` + +**Files:** proxywatchd.py +**Effort:** Low +**Risk:** Low + +--- + +## Medium Term (Next Quarter) + +### [ ] 11. Tor Connection Pooling + +**Problem:** Each proxy test creates a new Tor connection. Tor circuit establishment +is slow (~2-3 seconds). + +**Implementation:** +```python +# new file: connection_pool.py +class TorConnectionPool: + """Pool of reusable Tor SOCKS connections.""" + + def __init__(self, tor_hosts, pool_size=10): + self.tor_hosts = tor_hosts + self.pool_size = pool_size + self.connections = Queue.Queue(pool_size) + self.lock = threading.Lock() + + def get(self, timeout=5): + """Get a Tor connection from pool, or create new.""" + try: + return self.connections.get(timeout=0.1) + except Queue.Empty: + return self._create_connection() + + def release(self, conn): + """Return connection to pool.""" + try: + self.connections.put_nowait(conn) + except Queue.Full: + conn.close() + + def _create_connection(self): + """Create new Tor SOCKS connection.""" + host = random.choice(self.tor_hosts) + # ... establish connection +``` + +**Files:** new connection_pool.py, proxywatchd.py +**Effort:** High +**Risk:** Medium + +--- + +### [ ] 12. 
Dynamic Thread Scaling
+
+**Problem:** Fixed thread count regardless of success rate or system load.
+
+**Implementation:**
+```python
+# proxywatchd.py
+class ThreadScaler:
+    """Dynamically adjust thread count based on performance."""
+
+    def __init__(self, min_threads=5, max_threads=50):
+        self.min = min_threads
+        self.max = max_threads
+        self.current = min_threads
+        self.success_rate_window = []
+
+    def record_result(self, success):
+        self.success_rate_window.append(success)
+        if len(self.success_rate_window) > 100:
+            self.success_rate_window.pop(0)
+
+    def recommended_threads(self):
+        if len(self.success_rate_window) < 20:
+            return self.current
+
+        # float() avoids Python 2 integer division (sum of bools / int)
+        success_rate = float(sum(self.success_rate_window)) / len(self.success_rate_window)
+
+        # High success rate -> can handle more threads
+        if success_rate > 0.7 and self.current < self.max:
+            return self.current + 5
+        # Low success rate -> reduce load
+        elif success_rate < 0.3 and self.current > self.min:
+            return self.current - 5
+
+        return self.current
+```
+
+**Files:** proxywatchd.py
+**Effort:** Medium
+**Risk:** Medium
+
+---
+
+### [ ] 13. Latency Tracking
+
+**Problem:** No visibility into proxy speed. A working but slow proxy may be
+less useful than a fast one.
+
+**Implementation:**
+```python
+# dbs.py - add columns
+# ALTER TABLE proxylist ADD COLUMN avg_latency REAL DEFAULT 0
+# ALTER TABLE proxylist ADD COLUMN latency_samples INTEGER DEFAULT 0
+
+def update_proxy_latency(proxydb, proxy, latency):
+    """Update rolling average latency for proxy."""
+    row = proxydb.execute(
+        'SELECT avg_latency, latency_samples FROM proxylist WHERE proxy=?',
+        (proxy,)
+    ).fetchone()
+
+    if row:
+        old_avg, samples = row
+        # Running average; the sample cap below makes it behave like an
+        # exponential moving average once saturated
+        new_avg = (old_avg * samples + latency) / (samples + 1)
+        new_samples = min(samples + 1, 100)  # Cap at 100 samples
+
+        proxydb.execute(
+            'UPDATE proxylist SET avg_latency=?, latency_samples=? WHERE proxy=?',
+            (new_avg, new_samples, proxy)
+        )
+```
+
+**Files:** dbs.py, proxywatchd.py
+**Effort:** Medium
+**Risk:** Low
+
+---
+
+### [ ] 14. Export Functionality
+
+**Problem:** No easy way to export working proxies for use elsewhere.
+
+**Implementation:**
+```python
+# new file: export.py
+def export_proxies(proxydb, format='txt', filters=None):
+    """Export working proxies to various formats."""
+
+    query = 'SELECT proto, proxy FROM proxylist WHERE failed=0'
+    params = []
+    if filters and 'proto' in filters:
+        query += ' AND proto=?'
+        params.append(filters['proto'])
+
+    rows = proxydb.execute(query, params).fetchall()
+
+    if format == 'txt':
+        return '\n'.join('%s://%s' % (r[0], r[1]) for r in rows)
+    elif format == 'json':
+        import json
+        return json.dumps([{'proto': r[0], 'address': r[1]} for r in rows])
+    elif format == 'csv':
+        return 'proto,address\n' + '\n'.join('%s,%s' % r for r in rows)
+
+# CLI: python export.py --format json --proto socks5 > proxies.json
+```
+
+**Files:** new export.py
+**Effort:** Low
+**Risk:** Low
+
+---
+
+### [ ] 15. Unit Test Infrastructure
+
+**Problem:** No automated tests. Changes can break existing functionality silently.
+ +**Implementation:** +``` +tests/ +├── __init__.py +├── test_proxy_utils.py # Test IP validation, cleansing +├── test_extract.py # Test proxy/URL extraction +├── test_database.py # Test DB operations with temp DB +└── mock_network.py # Mock rocksock for offline testing +``` + +```python +# tests/test_proxy_utils.py +import unittest +import sys +sys.path.insert(0, '..') +import fetch + +class TestProxyValidation(unittest.TestCase): + def test_valid_proxy(self): + self.assertTrue(fetch.is_usable_proxy('8.8.8.8:8080')) + + def test_private_ip_rejected(self): + self.assertFalse(fetch.is_usable_proxy('192.168.1.1:8080')) + self.assertFalse(fetch.is_usable_proxy('10.0.0.1:8080')) + self.assertFalse(fetch.is_usable_proxy('172.16.0.1:8080')) + + def test_invalid_port_rejected(self): + self.assertFalse(fetch.is_usable_proxy('8.8.8.8:0')) + self.assertFalse(fetch.is_usable_proxy('8.8.8.8:99999')) + +if __name__ == '__main__': + unittest.main() +``` + +**Files:** tests/ directory +**Effort:** High (initial), Low (ongoing) +**Risk:** Low + +--- + +## Long Term (Future) + +### [ ] 16. Geographic Validation +Verify proxy actually routes through claimed location using IP geolocation. + +### [ ] 17. HTTPS/SSL Proxy Testing +Add capability to test HTTPS CONNECT proxies. + +### [ ] 18. Additional Search Engines +Support Google, Bing, DuckDuckGo beyond Searx. + +### [ ] 19. REST API +Simple HTTP API to query proxy database. + +### [ ] 20. Web Dashboard +Status page showing live statistics. 
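Of the long-term items, 17 (HTTPS/SSL proxy testing) is the most self-contained to prototype, since the CONNECT handshake is pure string handling. A rough, Python 2/3-compatible sketch — the helper names `build_connect_request` and `connect_succeeded` are illustrative and do not exist in the codebase yet:

```python
# Sketch only: tests whether a proxy's reply to HTTP CONNECT indicates
# a usable tunnel. Helper names are hypothetical, not current PPF API.

def build_connect_request(host, port):
    """Build the CONNECT request an HTTPS-capable proxy must accept."""
    return ('CONNECT %s:%d HTTP/1.1\r\n'
            'Host: %s:%d\r\n'
            '\r\n') % (host, port, host, port)

def connect_succeeded(response_bytes):
    """True if the proxy answered CONNECT with a 2xx status line."""
    try:
        status_line = response_bytes.decode('ascii', 'replace').split('\r\n', 1)[0]
        parts = status_line.split()
        # A tunnel is established only on "HTTP/x.y 2xx ..."
        return parts[0].startswith('HTTP/') and parts[1].startswith('2')
    except (IndexError, UnicodeError):
        return False
```

In proxywatchd.py this request would travel through the Tor circuit to the proxy under test (via rocksock), and only a 2xx reply would mark the proxy as HTTPS-capable.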
+ +--- + +## Completed + +### [x] Work-Stealing Queue +- Implemented shared Queue.Queue() for job distribution +- Workers pull from shared queue instead of pre-assigned lists +- Better utilization across threads + +### [x] Multi-Target Validation +- Test each proxy against 3 random targets +- 2/3 majority required for success +- Reduces false negatives from single target failures + +### [x] Interleaved Testing +- Jobs shuffled across all proxies before queueing +- Prevents burst of 3 connections to same proxy +- ProxyTestState accumulates results from TargetTestJobs + +### [x] Code Cleanup +- Removed 93 lines dead HTTP server code (ppf.py) +- Removed dead gumbo parser (soup_parser.py) +- Removed test code (comboparse.py) +- Removed unused functions (misc.py) +- Fixed IP/port cleansing (ppf.py) +- Updated .gitignore
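For reference, the 2/3 majority rule described above reduces to a small predicate. A sketch, assuming ProxyTestState hands over one boolean per completed TargetTestJob (the function name is illustrative, not the actual method in proxywatchd.py):

```python
def majority_passed(target_results, required=2):
    """k-of-n voting over per-target test outcomes.

    With 3 targets and required=2, a single flaky or unreachable target
    can no longer fail an otherwise working proxy.
    """
    return sum(1 for ok in target_results if ok) >= required
```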