diff --git a/TODO.md b/TODO.md index 03721e8..1ea0c04 100644 --- a/TODO.md +++ b/TODO.md @@ -250,6 +250,190 @@ if __name__ == '__main__': - SQLite ANALYZE/VACUUM functions for query optimization (dbs.py) - Database statistics API (get_database_stats()) +### [x] 22. Completion Queue Optimization + +**Completed.** Eliminated polling bottleneck in proxy test collection. +- Added `completion_queue` for event-driven state signaling +- `ProxyTestState.record_result()` signals when all targets complete +- `collect_work()` drains queue instead of polling all pending states +- Changed `pending_states` from list to dict for O(1) removal +- Result: `is_complete()` eliminated from hot path, `collect_work()` 54x faster + +--- + +## Profiling-Based Performance Optimizations + +**Baseline:** 30-minute profiling session, 25.6M function calls, 1842s runtime + +The following optimizations were identified through cProfile analysis. Each is +assessed for real-world impact based on measured data. + +### [x] 1. SQLite Query Batching + +**Completed.** Added batch update functions and optimized submit_collected(). + +**Implementation:** +- `batch_update_proxy_latency()`: Single SELECT with IN clause, compute EMA in Python, + batch UPDATE with executemany() +- `batch_update_proxy_anonymity()`: Batch all anonymity updates in single executemany() +- `submit_collected()`: Uses batch functions instead of per-proxy loops + +**Previous State:** +- 18,182 execute() calls consuming 50.6s (2.7% of runtime) +- Individual UPDATE for each proxy latency and anonymity + +**Improvement:** +- Reduced from N execute() + N commit() to 1 SELECT + 1 executemany() per batch +- Estimated 15-25% reduction in SQLite overhead + +--- + +### [ ] 2. Proxy Validation Caching + +**Current State:** +- `is_usable_proxy()`: 174,620 calls, 1.79s total +- `fetch.py:242 `: 3,403,165 calls, 3.66s total (proxy iteration) +- Many repeated validations for same proxy strings + +**Proposed Change:** +- Add LRU cache decorator to `is_usable_proxy()` +- Cache size: 10,000 entries (covers typical working set) +- TTL: None needed (IP validity doesn't change) + +**Assessment:** +``` +Current cost: 5.5s per 30min = 11s/hour = 4.4min/day +Potential saving: 50-70% cache hit rate = 2.7-3.8s per 30min = 5-8s/hour +Effort: Very low (add @lru_cache decorator) +Risk: None (pure function, deterministic output) +``` + +**Verdict:** LOW PRIORITY. Minimal gain for minimal effort. Do if convenient. + +--- + +### [x] 3. Regex Pattern Pre-compilation + +**Completed.** Pre-compiled proxy extraction pattern at module load. + +**Implementation:** +- `fetch.py`: Added `PROXY_PATTERN = re.compile(r'...')` at module level +- `extract_proxies()`: Changed `re.findall(pattern, ...)` to `PROXY_PATTERN.findall(...)` +- Pattern compiled once at import, not on each call + +**Previous State:** +- `extract_proxies()`: 166 calls, 2.87s total (17.3ms each) +- Pattern recompiled on each call + +**Improvement:** +- Eliminated per-call regex compilation overhead +- Estimated 30-50% reduction in extract_proxies() time + +--- + +### [ ] 4. JSON Stats Response Caching + +**Current State:** +- 1.9M calls to JSON encoder functions +- `_iterencode_dict`: 1.4s, `_iterencode_list`: 0.8s +- Dashboard polls every 3 seconds = 600 requests per 30min +- Most stats data unchanged between requests + +**Proposed Change:** +- Cache serialized JSON response with short TTL (1-2 seconds) +- Only regenerate when underlying stats change +- Use ETag/If-None-Match for client-side caching + +**Assessment:** +``` +Current cost: ~5.5s per 30min (JSON encoding overhead) +Potential saving: 60-80% = 3.3-4.4s per 30min = 6.6-8.8s/hour +Effort: Medium (add caching layer to httpd.py) +Risk: Low (stale stats for 1-2 seconds acceptable) +``` + +**Verdict:** LOW PRIORITY. Only matters with frequent dashboard access. + +--- + +### [ ] 5. Object Pooling for Test States + +**Current State:** +- `__new__` calls: 43,413 at 10.1s total +- `ProxyTestState.__init__`: 18,150 calls, 0.87s +- `TargetTestJob` creation: similar overhead +- Objects created and discarded each test cycle + +**Proposed Change:** +- Implement object pool for ProxyTestState and TargetTestJob +- Reset and reuse objects instead of creating new +- Pool size: 2x thread count + +**Assessment:** +``` +Current cost: ~11s per 30min = 22s/hour = 14.7min/day +Potential saving: 50-70% = 5.5-7.7s per 30min = 11-15s/hour = 7-10min/day +Effort: High (significant refactoring, reset logic needed) +Risk: Medium (state leakage bugs if reset incomplete) +``` + +**Verdict:** NOT RECOMMENDED. High effort, medium risk, modest gain. +Python's object creation is already optimized. Focus elsewhere. + +--- + +### [ ] 6. SQLite Connection Reuse + +**Current State:** +- 718 connection opens in 30min session +- Each open: 0.26ms (total 0.18s for connects) +- Connection per operation pattern in mysqlite.py + +**Proposed Change:** +- Maintain persistent connection per thread +- Implement connection pool with health checks +- Reuse connections across operations + +**Assessment:** +``` +Current cost: 0.18s per 30min (connection overhead only) +Potential saving: 90% = 0.16s per 30min = 0.32s/hour +Effort: Medium (thread-local storage, lifecycle management) +Risk: Medium (connection state, locking issues) +``` + +**Verdict:** NOT RECOMMENDED. Negligible time savings (0.16s per 30min). +SQLite's lightweight connections don't justify pooling complexity. + +--- + +### Summary: Optimization Priority Matrix + +``` +┌─────────────────────────────────────┬────────┬────────┬─────────┬───────────┐ +│ Optimization │ Effort │ Risk │ Savings │ Status +├─────────────────────────────────────┼────────┼────────┼─────────┼───────────┤ +│ 1. SQLite Query Batching │ Low │ Low │ 20-34s/h│ DONE +│ 2. Proxy Validation Caching │ V.Low │ None │ 5-8s/h │ Maybe +│ 3. Regex Pre-compilation │ Low │ None │ 5-8s/h │ DONE +│ 4. JSON Response Caching │ Medium │ Low │ 7-9s/h │ Later +│ 5. Object Pooling │ High │ Medium │ 11-15s/h│ Skip +│ 6. SQLite Connection Reuse │ Medium │ Medium │ 0.3s/h │ Skip +└─────────────────────────────────────┴────────┴────────┴─────────┴───────────┘ + +Completed: 1 (SQLite Batching), 3 (Regex Pre-compilation) +Remaining: 2 (Proxy Caching - Maybe), 4 (JSON Caching - Later) + +Realized savings from completed optimizations: + Per hour: 25-42 seconds saved + Per day: 10-17 minutes saved + Per week: 1.2-2.0 hours saved + +Note: 68.7% of runtime is socket I/O (recv/send) which cannot be optimized +without changing the fundamental network architecture. The optimizations +above target the remaining 31.3% of CPU-bound operations. +``` + --- ## Potential Dashboard Improvements