docs: remove completed items from TODO and ROADMAP
362
ROADMAP.md
@@ -45,83 +45,15 @@ PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework design
---

## Phase 1: Stability & Code Quality
## Open Work

**Objective:** Establish a solid, maintainable codebase

### 1.1 Error Handling Improvements
### Validation

| Task | Description | File(s) |
|------|-------------|---------|
| Add connection retry logic | Implement exponential backoff for failed connections | rocksock.py, fetch.py |
| Graceful database errors | Handle SQLite lock/busy errors with retry | mysqlite.py |
| Timeout standardization | Consistent timeout handling across all network ops | proxywatchd.py, fetch.py |
| Exception logging | Log exceptions with context, not just silently catch | all files |
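The retry task above amounts to wrapping a connection attempt in exponential backoff with jitter. A minimal sketch; `retry_with_backoff` and its parameters are illustrative, not the actual rocksock.py/fetch.py API:

```python
import random
import time

def retry_with_backoff(func, attempts=5, base=1.0, cap=30.0):
    """Call func(), retrying on network errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except (OSError, IOError):
            if attempt == attempts - 1:
                raise  # out of retries, propagate the last error
            # delay doubles each attempt: base, 2*base, 4*base, ... capped
            delay = min(cap, base * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay * 0.1))  # add jitter
```

The jitter term spreads out retries so many workers hitting the same dead host do not hammer it in lockstep.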
### 1.2 Code Consolidation

| Task | Description | File(s) |
|------|-------------|---------|
| Unify _known_proxies | Single source of truth for known proxy cache | ppf.py, fetch.py |
| Extract proxy utils | Create proxy_utils.py with cleanse/validate functions | new file |
| Remove global config pattern | Pass config explicitly instead of set_config() | fetch.py |
| Standardize logging | Consistent _log() usage with levels across all modules | all files |

### 1.3 Testing Infrastructure

| Task | Description | File(s) |
|------|-------------|---------|
| Add unit tests | Test proxy parsing, URL extraction, IP validation | tests/ |
| Mock network layer | Allow testing without live network/Tor | tests/ |
| Validation test suite | Verify multi-target voting logic | tests/ |

---

## Phase 2: Performance Optimization

**Objective:** Improve throughput and resource efficiency

### 2.1 Connection Pooling

| Task | Description | File(s) |
|------|-------------|---------|
| Tor connection reuse | Pool Tor SOCKS connections instead of reconnecting | proxywatchd.py |
| HTTP keep-alive | Reuse connections to same target servers | http2.py |
| Connection warm-up | Pre-establish connections before job assignment | proxywatchd.py |

### 2.2 Database Optimization

| Task | Description | File(s) |
|------|-------------|---------|
| Batch inserts | Group INSERT operations (already partial) | dbs.py |
| Index optimization | Add indexes for frequent query patterns | dbs.py |
| WAL mode | Enable Write-Ahead Logging for better concurrency | mysqlite.py |
| Prepared statements | Cache compiled SQL statements | mysqlite.py |

### 2.3 Threading Improvements

| Task | Description | File(s) |
|------|-------------|---------|
| Dynamic thread scaling | Adjust thread count based on success rate | proxywatchd.py |
| Priority queue | Test high-value proxies (low fail count) first | proxywatchd.py |
| Stale proxy cleanup | Background thread to remove long-dead proxies | proxywatchd.py |

---

## Phase 3: Reliability & Accuracy

**Objective:** Improve proxy validation accuracy and system reliability

### 3.1 Enhanced Validation

| Task | Description | File(s) |
|------|-------------|---------|
| Latency tracking | Store and use connection latency metrics | proxywatchd.py, dbs.py |
| Geographic validation | Verify proxy actually routes through claimed location | proxywatchd.py |
| Protocol fingerprinting | Better SOCKS4/SOCKS5/HTTP detection | rocksock.py |
| HTTPS/SSL testing | Validate SSL proxy capabilities | proxywatchd.py |

### 3.2 Target Management

| Task | Description | File(s) |
|------|-------------|---------|
@@ -129,245 +61,11 @@ PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework design
| Target health tracking | Remove unresponsive targets from pool | proxywatchd.py |
| Geographic target spread | Ensure targets span multiple regions | config.py |

### 3.3 Failure Analysis
### Proxy Source Expansion

| Task | Description | File(s) |
|------|-------------|---------|
| Failure categorization | Distinguish timeout vs refused vs auth-fail | proxywatchd.py |
| Retry strategies | Different retry logic per failure type | proxywatchd.py |
| Dead proxy quarantine | Separate storage for likely-dead proxies | dbs.py |

---

## Phase 4: Features & Usability

**Objective:** Add useful features while maintaining simplicity

### 4.1 Reporting & Monitoring

| Task | Description | File(s) |
|------|-------------|---------|
| Statistics collection | Track success rates, throughput, latency | proxywatchd.py |
| Periodic status output | Log summary stats every N minutes | ppf.py, proxywatchd.py |
| Export functionality | Export working proxies to file (txt, json) | new: export.py |

### 4.2 Configuration

| Task | Description | File(s) |
|------|-------------|---------|
| Config validation | Validate config.ini on startup | config.py |
| Runtime reconfiguration | Reload config without restart (SIGHUP) | proxywatchd.py |
| Sensible defaults | Document and improve default values | config.py |

### 4.3 Proxy Source Expansion

| Task | Description | File(s) |
|------|-------------|---------|
| Additional scrapers | Support more search engines beyond Searx | scraper.py |
| API sources | Integrate free proxy API endpoints | new: api_sources.py |
| Import formats | Support various proxy list formats | ppf.py |

---

## Implementation Priority

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                               Priority Matrix                               │
├──────────────────────────┬──────────────────────────────────────────────────┤
│ HIGH IMPACT / LOW EFFORT │ HIGH IMPACT / HIGH EFFORT                        │
│                          │                                                  │
│ [x] Unify _known_proxies │ [x] Connection pooling                           │
│ [x] Graceful DB errors   │ [x] Dynamic thread scaling                       │
│ [x] Batch inserts        │ [x] Unit test infrastructure                     │
│ [x] WAL mode for SQLite  │ [x] Latency tracking                             │
│                          │                                                  │
├──────────────────────────┼──────────────────────────────────────────────────┤
│ LOW IMPACT / LOW EFFORT  │ LOW IMPACT / HIGH EFFORT                         │
│                          │                                                  │
│ [x] Standardize logging  │ [x] Geographic validation                        │
│ [x] Config validation    │ [x] Additional scrapers (Bing, Yahoo, Mojeek)    │
│ [x] Export functionality │ [ ] API sources                                  │
│ [x] Status output        │ [ ] Protocol fingerprinting                      │
│                          │                                                  │
└──────────────────────────┴──────────────────────────────────────────────────┘
```
---

## Completed Work

### Multi-Target Validation (Done)

- [x] Work-stealing queue with shared Queue.Queue()
- [x] Multi-target validation (2/3 majority voting)
- [x] Interleaved testing (jobs shuffled across proxies)
- [x] ProxyTestState and TargetTestJob classes
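The 2/3 majority vote reduces to a simple threshold over per-target results. A sketch; `majority_vote` is a hypothetical helper, not the actual ProxyTestState code:

```python
def majority_vote(results, threshold=2.0 / 3.0):
    """Return True if at least `threshold` of target test results passed.

    `results` is a list of booleans, one per validation target, so a
    single flaky target cannot mark a dead proxy as working.
    """
    if not results:
        return False
    return sum(1 for r in results if r) / float(len(results)) >= threshold
```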
### Code Cleanup (Done)

- [x] Removed dead HTTP server code from ppf.py
- [x] Removed dead gumbo code from soup_parser.py
- [x] Removed test code from comboparse.py
- [x] Removed unused functions from misc.py
- [x] Fixed IP/port cleansing in ppf.py extract_proxies()
- [x] Updated .gitignore, removed .pyc files

### Database Optimization (Done)

- [x] Enable SQLite WAL mode for better concurrency
- [x] Add indexes for common query patterns (failed, tested, proto, error, check_time)
- [x] Optimize batch inserts (remove redundant SELECT before INSERT OR IGNORE)

### Dependency Reduction (Done)

- [x] Make lxml optional (removed from requirements)
- [x] Make IP2Location optional (graceful fallback)
- [x] Add --nobs flag for stdlib HTMLParser fallback (bs4 optional)

### Rate Limiting & Stability (Done)

- [x] InstanceTracker class in scraper.py with exponential backoff
- [x] Configurable backoff_base, backoff_max, fail_threshold
- [x] Exception logging with context (replaced bare except blocks)
- [x] Unified _known_proxies cache in fetch.py

### Monitoring & Maintenance (Done)

- [x] Stats class in proxywatchd.py (tested/passed/failed tracking)
- [x] Periodic stats reporting (configurable stats_interval)
- [x] Stale proxy cleanup (cleanup_stale() with configurable stale_days)
- [x] Timeout config options (timeout_connect, timeout_read)

### Connection Pooling (Done)

- [x] TorHostState class tracking per-host health and latency
- [x] TorConnectionPool with worker affinity for circuit reuse
- [x] Exponential backoff (5s, 10s, 20s, 40s, max 60s) on failures
- [x] Pool warmup and health status reporting

### Priority Queue (Done)

- [x] PriorityJobQueue class with heap-based ordering
- [x] calculate_priority() assigns priority 0-4 by proxy state
- [x] New proxies tested first, high-fail proxies last

### Dynamic Thread Scaling (Done)

- [x] ThreadScaler class adjusts thread count dynamically
- [x] Scales up when queue deep and success rate acceptable
- [x] Scales down when queue shallow or success rate drops
- [x] Respects min/max bounds with cooldown period

### Latency Tracking (Done)

- [x] avg_latency, latency_samples columns in proxylist
- [x] Exponential moving average calculation
- [x] Migration function for existing databases
- [x] Latency recorded for successful proxy tests

### Container Support (Done)

- [x] Dockerfile with Python 2.7-slim base
- [x] docker-compose.yml for local development
- [x] Rootless podman deployment documentation
- [x] Volume mounts for persistent data
### Code Style (Done)

- [x] Normalized indentation (4-space, no tabs)
- [x] Removed dead code and unused imports
- [x] Added docstrings to classes and functions
- [x] Python 2/3 compatible imports (Queue/queue)

### Geographic Validation (Done)

- [x] IP2Location integration for country lookup
- [x] pyasn integration for ASN lookup
- [x] Graceful fallback when database files missing
- [x] Country codes displayed in test output: `(US)`, `(IN)`, etc.
- [x] Data files: IP2LOCATION-LITE-DB1.BIN, ipasn.dat

### SSL Proxy Testing (Done)

- [x] Default checktype changed to 'ssl'
- [x] ssl_targets list with major HTTPS sites
- [x] TLS handshake validation with certificate verification
- [x] Detects MITM proxies that intercept SSL connections

### Export Functionality (Done)

- [x] export.py CLI tool for exporting working proxies
- [x] Multiple formats: txt, json, csv, len (length-prefixed)
- [x] Filters: proto, country, anonymity, max_latency
- [x] Sort options: latency, added, tested, success
- [x] Output to stdout or file

### Web Dashboard (Done)

- [x] /dashboard endpoint with dark theme HTML UI
- [x] /api/stats endpoint for JSON runtime statistics
- [x] Auto-refresh with JavaScript fetch every 3 seconds
- [x] Stats provider callback from proxywatchd.py to httpd.py
- [x] Displays: tested/passed/success rate, thread count, uptime
- [x] Tor pool health: per-host latency, success rate, availability
- [x] Failure categories breakdown: timeout, proxy, ssl, closed

### Dashboard Enhancements v2 (Done)

- [x] Prominent check type badge in header (SSL/JUDGES/HTTP/IRC)
- [x] System monitor bar: load, memory, disk, process RSS
- [x] Anonymity breakdown: elite/anonymous/transparent counts
- [x] Database health: size, tested/hour, added/day, dead count
- [x] Enhanced Tor pool stats: requests, success rate, healthy nodes, latency
- [x] SQLite ANALYZE/VACUUM functions for query optimization
- [x] Lightweight design: client-side polling, minimal DOM updates

### Dashboard Enhancements v3 (Done)

- [x] Electric cyan theme with translucent glass-morphism effects
- [x] Unified wrapper styling (.chart-wrap, .histo-wrap, .stats-wrap, .lb-wrap, .pie-wrap)
- [x] Consistent backdrop-filter blur and electric glow borders
- [x] Tor Exit Nodes cards with hover effects (.tor-card)
- [x] Lighter background/tile color scheme (#1e2738 bg, #181f2a card)
- [x] Map endpoint restyled to match dashboard (electric cyan theme)
- [x] Map markers updated from gold to cyan for approximate locations

### Memory Profiling & Analysis (Done)

- [x] /api/memory endpoint with comprehensive memory stats
- [x] objgraph integration for object type counting
- [x] pympler integration for memory summaries
- [x] Memory sample history tracking (RSS over time)
- [x] Process memory from /proc/self/status (VmRSS, VmPeak, VmData, etc.)
- [x] GC statistics (collections, objects, thresholds)
### MITM Detection Optimization (Done)

- [x] MITM re-test skip optimization - avoid redundant SSL checks for known MITM proxies
- [x] mitm_retest_skipped stats counter for tracking optimization effectiveness
- [x] Content hash deduplication for stale proxy list detection
- [x] stale_count reset when content hash changes

### Distributed Workers (Done)

- [x] Worker registration and heartbeat system
- [x] /api/workers endpoint for worker status monitoring
- [x] Tor connectivity check before workers claim work
- [x] Worker test rate tracking with sliding window calculation
- [x] Combined rate aggregation across all workers
- [x] Dashboard worker cards showing per-worker stats

### Dashboard Performance (Done)

- [x] Keyboard shortcuts: r=refresh, 1-9=tabs, t=theme, p=pause
- [x] Tab-aware chart rendering - skip expensive renders for hidden tabs
- [x] Visibility API - pause polling when browser tab hidden
- [x] Dark/muted-dark/light theme cycling
- [x] Stats export endpoint (/api/stats/export?format=json|csv)

### Proxy Validation Cache (Done)

- [x] LRU cache for is_usable_proxy() using OrderedDict
- [x] Thread-safe with lock for concurrent access
- [x] Proper LRU eviction (move_to_end on hits, popitem oldest when full)

### Database Context Manager (Done)

- [x] Refactored all DB operations to use `_db_context()` context manager
- [x] Connections guaranteed to close even on exceptions
- [x] Removed deprecated `_prep_db()` and `_close_db()` methods
- [x] `fetch_rows()` now accepts db parameter for cleaner dependency injection

### Additional Search Engines (Done)

- [x] Bing and Yahoo engine implementations in scraper.py
- [x] Engine rotation for rate limit avoidance
- [x] engines.py module with SearchEngine base class

### Worker Health Improvements (Done)

- [x] Tor connectivity check before workers claim work
- [x] Fixed interval Tor check (30s) instead of exponential backoff
- [x] Graceful handling when Tor unavailable

### Memory Optimization (Done)

- [x] `__slots__` on ProxyTestState (27 attrs) and TargetTestJob (4 attrs)
- [x] Reduced per-object memory overhead for hot path objects

---
@@ -375,37 +73,33 @@ PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework design
| Item | Description | Risk |
|------|-------------|------|
| ~~Dual _known_proxies~~ | ~~ppf.py and fetch.py maintain separate caches~~ | **Resolved** |
| Global config in fetch.py | set_config() pattern is fragile | Low - works but not clean |
| ~~No input validation~~ | ~~Proxy strings parsed without validation~~ | **Resolved** |
| ~~Silent exception catching~~ | ~~Some except: pass patterns hide errors~~ | **Resolved** |
| ~~Hardcoded timeouts~~ | ~~Various timeout values scattered in code~~ | **Resolved** |

---

## File Reference

| File | Purpose | Status |
|------|---------|--------|
| ppf.py | Main URL harvester daemon | Active, cleaned |
| proxywatchd.py | Proxy validation daemon | Active, enhanced |
| scraper.py | Searx search integration | Active, cleaned |
| fetch.py | HTTP fetching with proxy support | Active, LRU cache |
| dbs.py | Database schema and inserts | Active |
| mysqlite.py | SQLite wrapper | Active |
| rocksock.py | Socket/proxy abstraction (3rd party) | Stable |
| http2.py | HTTP client implementation | Stable |
| httpd.py | Web dashboard and REST API server | Active, enhanced |
| config.py | Configuration management | Active |
| comboparse.py | Config/arg parser framework | Stable, cleaned |
| soup_parser.py | BeautifulSoup wrapper | Stable, cleaned |
| misc.py | Utilities (timestamp, logging) | Stable, cleaned |
| export.py | Proxy export CLI tool | Active |
| engines.py | Search engine implementations | Active |
| connection_pool.py | Tor connection pooling | Active |
| network_stats.py | Network statistics tracking | Active |
| dns.py | DNS resolution with caching | Active |
| mitm.py | MITM certificate detection | Active |
| job.py | Priority job queue | Active |
| static/dashboard.js | Dashboard frontend logic | Active, enhanced |
| static/dashboard.html | Dashboard HTML template | Active |
875
TODO.md
@@ -1,866 +1,73 @@
# PPF Implementation Tasks
# PPF TODO

## Legend
## Optimization

```
[ ] Not started
[~] In progress
[x] Completed
[!] Blocked/needs discussion
```

### [ ] JSON Stats Response Caching

---

## Immediate Priority (Next Sprint)

### [x] 1. Unify _known_proxies Cache

**Completed.** Added `init_known_proxies()`, `add_known_proxies()`, `is_known_proxy()`
to fetch.py. Updated ppf.py to use these functions instead of local cache.

---

### [x] 2. Graceful SQLite Error Handling

**Completed.** mysqlite.py now retries on "locked" errors with exponential backoff.

---

### [x] 3. Enable SQLite WAL Mode

**Completed.** mysqlite.py enables WAL mode and NORMAL synchronous on init.
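The WAL setup amounts to two PRAGMAs at connection time. A minimal sketch with the stdlib sqlite3 module; `open_db` is illustrative, not the mysqlite.py wrapper itself:

```python
import sqlite3

def open_db(path):
    """Open a SQLite database with WAL journaling and NORMAL synchronous.

    WAL lets readers proceed concurrently with a single writer; NORMAL
    synchronous trades a little durability for far fewer fsyncs.
    """
    db = sqlite3.connect(path)
    db.execute("PRAGMA journal_mode=WAL")   # persists in the database file
    db.execute("PRAGMA synchronous=NORMAL")  # per-connection setting
    return db
```

Note that `journal_mode=WAL` is sticky (stored in the file), while `synchronous` must be set on every new connection.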
---

### [x] 4. Batch Database Inserts

**Completed.** dbs.py uses executemany() for batch inserts.
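The batching pattern replaces a Python-level loop of `execute()` calls with one `executemany()`. A sketch with an illustrative schema; the real proxylist table in dbs.py has more columns:

```python
import sqlite3

# Illustrative two-column schema; the real table is wider.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE proxylist (ip TEXT, port INTEGER, UNIQUE(ip, port))")

rows = [("10.0.0.1", 8080), ("10.0.0.2", 3128), ("10.0.0.1", 8080)]
# One executemany() instead of N execute() calls; INSERT OR IGNORE
# silently skips rows that violate the UNIQUE constraint.
db.executemany("INSERT OR IGNORE INTO proxylist VALUES (?, ?)", rows)
db.commit()
```

Committing once per batch, rather than per row, is where most of the win comes from: each commit is an fsync boundary.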
---

### [x] 5. Add Database Indexes

**Completed.** dbs.py creates indexes on failed, tested, proto, error, check_time.

---

## Short Term (This Month)

### [x] 6. Log Level Filtering

**Completed.** Added log level filtering with -q/--quiet and -v/--verbose CLI flags.

- misc.py: LOG_LEVELS dict, set_log_level(), get_log_level()
- config.py: Added -q/--quiet and -v/--verbose arguments
- Log levels: debug=0, info=1, warn=2, error=3
- --quiet: only show warn/error
- --verbose: show debug messages

---

### [x] 7. Connection Timeout Standardization

**Completed.** Added timeout_connect and timeout_read to [common] section in config.py.

---

### [x] 8. Failure Categorization

**Completed.** Added failure categorization for proxy errors.

- misc.py: categorize_error() function, FAIL_* constants
- Categories: timeout, refused, auth, unreachable, dns, ssl, closed, proxy, other
- proxywatchd.py: Stats.record() now accepts category parameter
- Stats.report() shows failure breakdown by category
- ProxyTestState.evaluate() returns (success, category) tuple
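Error categorization of this kind is typically a first-match substring table over the error message. A sketch of the idea; the match rules below are assumptions, not the actual categorize_error() in misc.py:

```python
# Substring -> category table; first match wins. Hypothetical rules.
_CATEGORIES = [
    ("timed out", "timeout"),
    ("refused", "refused"),
    ("authentication", "auth"),
    ("unreachable", "unreachable"),
    ("name or service", "dns"),
    ("ssl", "ssl"),
    ("closed", "closed"),
]

def categorize_error(message):
    """Map a raw error message to a coarse failure category."""
    msg = message.lower()
    for needle, category in _CATEGORIES:
        if needle in msg:
            return category
    return "other"
```

Keeping the categories coarse is deliberate: retry policy only needs to distinguish "worth retrying soon" (timeout) from "probably dead" (refused, unreachable).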
---

### [x] 9. Priority Queue for Proxy Testing

**Completed.** Added priority-based job scheduling for proxy tests.

- PriorityJobQueue class with heap-based ordering
- calculate_priority() assigns priority 0-4 based on proxy state
- Priority 0: New proxies (never tested)
- Priority 1: Working proxies (no failures)
- Priority 2: Low fail count (< 3)
- Priority 3-4: Medium/high fail count
- Integrated into prepare_jobs() for automatic prioritization
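The heap-based scheme above can be sketched with stdlib `heapq`. This is an illustration of the technique, not the proxywatchd.py code; in particular the fail-count cutoff for priority 4 is an assumption:

```python
import heapq
import itertools

class PriorityJobQueue(object):
    """Heap-backed queue: the lowest priority number is popped first.

    A monotonic counter breaks ties so equal-priority jobs stay FIFO
    and the heap never compares job objects directly.
    """
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def put(self, priority, job):
        heapq.heappush(self._heap, (priority, next(self._counter), job))

    def get(self):
        return heapq.heappop(self._heap)[2]

def calculate_priority(tested, failed):
    """0 = never tested, 1 = working, 2 = few failures, 3-4 = many.

    The threshold of 10 failures for priority 4 is illustrative.
    """
    if not tested:
        return 0
    if failed == 0:
        return 1
    if failed < 3:
        return 2
    return 3 if failed < 10 else 4
```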
---

### [x] 10. Periodic Statistics Output

**Completed.** Added Stats class to proxywatchd.py with record(), should_report(),
and report() methods. Integrated into main loop with configurable stats_interval.

---

## Medium Term (Next Quarter)

### [x] 11. Tor Connection Pooling

**Completed.** Added connection pooling with worker-Tor affinity and health monitoring.

- connection_pool.py: TorHostState class tracks per-host health, latency, backoff
- connection_pool.py: TorConnectionPool with worker affinity, warmup, statistics
- proxywatchd.py: Workers get consistent Tor host assignment for circuit reuse
- Success/failure tracking with exponential backoff (5s, 10s, 20s, 40s, max 60s)
- Latency tracking with rolling averages
- Pool status reported alongside periodic stats

---

### [x] 12. Dynamic Thread Scaling

**Completed.** Added dynamic thread scaling based on queue depth and success rate.

- ThreadScaler class in proxywatchd.py with should_scale(), status_line()
- Scales up when queue is deep (2x target) and success rate > 10%
- Scales down when queue is shallow or success rate drops
- Min/max threads derived from config.watchd.threads (1/4x to 2x)
- 30-second cooldown between scaling decisions
- _spawn_thread(), _remove_thread(), _adjust_threads() helper methods
- Scaler status reported alongside periodic stats

---

### [x] 13. Latency Tracking

**Completed.** Added per-proxy latency tracking with exponential moving average.

- dbs.py: avg_latency, latency_samples columns added to proxylist schema
- dbs.py: _migrate_latency_columns() for backward-compatible migration
- dbs.py: update_proxy_latency() with EMA (alpha = 2/(samples+1))
- proxywatchd.py: ProxyTestState.last_latency_ms field
- proxywatchd.py: evaluate() calculates average latency from successful tests
- proxywatchd.py: submit_collected() records latency for passing proxies
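The EMA update with alpha = 2/(samples+1) folds each new sample into the running average. A self-contained sketch; the helper name and (avg, samples) return shape are illustrative:

```python
def update_latency_ema(avg, samples, new_latency):
    """Fold one latency sample into an exponential moving average.

    alpha = 2/(samples+1) weights early samples heavily, then settles
    toward a smooth long-run average. Returns the new (avg, samples).
    """
    samples += 1
    if samples == 1:
        return new_latency, samples  # first sample seeds the average
    alpha = 2.0 / (samples + 1)
    return avg + alpha * (new_latency - avg), samples
```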
---

### [x] 14. Export Functionality

**Completed.** Added export.py CLI tool for exporting working proxies.

- Formats: txt (default), json, csv, len (length-prefixed)
- Filters: --proto, --country, --anonymity, --max-latency
- Options: --sort (latency, added, tested, success), --limit, --pretty
- Output: stdout or --output file
- Usage: `python export.py --proto http --country US --sort latency --limit 100`

---

### [x] 15. Unit Test Infrastructure

**Completed.** Added pytest-based test suite with comprehensive coverage.

- tests/conftest.py: Shared fixtures (temp_db, proxy_db, sample_proxies, etc.)
- tests/test_dbs.py: 40 tests for database operations (CDN filtering, latency, stats)
- tests/test_fetch.py: 60 tests for proxy validation (skipped in Python 3)
- tests/test_misc.py: 39 tests for utilities (timestamp, log levels, SSL errors)
- tests/mock_network.py: Network mocking infrastructure

```
Test Results: 79 passed, 60 skipped (Python 2 only)
Run with: python3 -m pytest tests/ -v
```

---

## Long Term (Future)

### [x] 16. Geographic Validation

**Completed.** Added IP2Location and pyasn for proxy geolocation.

- requirements.txt: Added IP2Location package
- proxywatchd.py: IP2Location for country lookup, pyasn for ASN lookup
- proxywatchd.py: Fixed ValueError handling when database files missing
- data/: IP2LOCATION-LITE-DB1.BIN (2.7M), ipasn.dat (23M)
- Output shows country codes: `http://1.2.3.4:8080 (US)` or `(IN)`, `(DE)`, etc.

---

### [x] 17. SSL Proxy Testing

**Completed.** Added SSL checktype for TLS handshake validation.

- config.py: Default checktype changed to 'ssl'
- proxywatchd.py: ssl_targets list with major HTTPS sites
- Validates TLS handshake with certificate verification
- Detects MITM proxies that intercept SSL connections

### [x] 18. Additional Search Engines

**Completed.** Added modular search engine architecture.

- engines.py: SearchEngine base class with build_url(), extract_urls(), is_rate_limited()
- Engines: DuckDuckGo, Startpage, Mojeek (UK), Qwant (FR), Yandex (RU), Ecosia, Brave
- Git hosters: GitHub, GitLab, Codeberg, Gitea
- scraper.py: EngineTracker class for multi-engine rate limiting
- Config: [scraper] engines, max_pages settings
- searx.instances: Updated with 51 active SearXNG instances

### [x] 19. REST API

**Completed.** Added HTTP API server for querying working proxies.

- httpd.py: ProxyAPIServer class with BaseHTTPServer
- Endpoints: /proxies, /proxies/count, /health
- Params: limit, proto, country, format (json/plain)
- Integrated into proxywatchd.py (starts when httpd.enabled=True)
- Config: [httpd] section with listenip, port, enabled

### [x] 20. Web Dashboard

**Completed.** Added web dashboard with live statistics.

- httpd.py: DASHBOARD_HTML template with dark theme UI
- Endpoint: /dashboard (HTML page with auto-refresh)
- Endpoint: /api/stats (JSON runtime statistics)
- Stats include: tested/passed counts, success rate, thread count, uptime
- Tor pool health: per-host latency, success rate, availability
- Failure categories: timeout, proxy, ssl, closed, etc.
- proxywatchd.py: get_runtime_stats() method provides stats callback

### [x] 21. Dashboard Enhancements (v2)

**Completed.** Major dashboard improvements for better visibility.

- Prominent check type badge in header (SSL/JUDGES/HTTP/IRC with color coding)
- System monitor bar: load average, memory usage, disk usage, process RSS
- Anonymity breakdown: elite/anonymous/transparent proxy counts
- Database health indicators: size, tested/hour, added/day, dead count
- Enhanced Tor pool: total requests, success rate, healthy nodes, avg latency
- SQLite ANALYZE/VACUUM functions for query optimization (dbs.py)
- Database statistics API (get_database_stats())

### [x] 22. Completion Queue Optimization

**Completed.** Eliminated polling bottleneck in proxy test collection.

- Added `completion_queue` for event-driven state signaling
- `ProxyTestState.record_result()` signals when all targets complete
- `collect_work()` drains queue instead of polling all pending states
- Changed `pending_states` from list to dict for O(1) removal
- Result: `is_complete()` eliminated from hot path, `collect_work()` 54x faster
---

### [x] 23. Batch API Endpoint

**Completed.** Added `/api/dashboard` batch endpoint combining stats, workers, and countries.

**Implementation:**

- `httpd.py`: New `/api/dashboard` endpoint returns combined data in single response
- `httpd.py`: Refactored `/api/workers` to use `_get_workers_data()` helper method
- `dashboard.js`: Updated `fetchStats()` to use batch endpoint instead of multiple calls

**Response Structure:**

```json
{
  "stats": { /* same as /api/stats */ },
  "workers": { /* same as /api/workers */ },
  "countries": { /* same as /api/countries */ }
}
```

**Benefit:**

- Reduces dashboard polling from 2 HTTP requests to 1 per poll cycle
- Lower RTT impact over SSH tunnels and high-latency connections
- Single database connection serves all data

---

## Profiling-Based Performance Optimizations

**Baseline:** 30-minute profiling session, 25.6M function calls, 1842s runtime

The following optimizations were identified through cProfile analysis. Each is
assessed for real-world impact based on measured data.
### [x] 1. SQLite Query Batching
|
||||
|
||||
**Completed.** Added batch update functions and optimized submit_collected().
|
||||
|
||||
**Implementation:**
|
||||
- `batch_update_proxy_latency()`: Single SELECT with IN clause, compute EMA in Python,
|
||||
batch UPDATE with executemany()
|
||||
- `batch_update_proxy_anonymity()`: Batch all anonymity updates in single executemany()
|
||||
- `submit_collected()`: Uses batch functions instead of per-proxy loops
|
||||
|
||||
**Previous State:**
|
||||
- 18,182 execute() calls consuming 50.6s (2.7% of runtime)
|
||||
- Individual UPDATE for each proxy latency and anonymity
|
||||
|
||||
**Improvement:**
|
||||
- Reduced from N execute() + N commit() to 1 SELECT + 1 executemany() per batch
|
||||
- Estimated 15-25% reduction in SQLite overhead

---

### [x] 2. Proxy Validation Caching

**Completed.** Converted is_usable_proxy() cache to a proper LRU with OrderedDict.

**Implementation:**
- fetch.py: Changed _proxy_valid_cache from dict to OrderedDict
- Added thread-safe _proxy_valid_cache_lock
- move_to_end() on cache hits to maintain LRU order
- Evict oldest entries when cache reaches max size (10,000)
- Proper LRU eviction instead of stopping inserts when full

---

### [x] 3. Regex Pattern Pre-compilation

**Completed.** Pre-compiled proxy extraction pattern at module load.

**Implementation:**
- `fetch.py`: Added `PROXY_PATTERN = re.compile(r'...')` at module level
- `extract_proxies()`: Changed `re.findall(pattern, ...)` to `PROXY_PATTERN.findall(...)`
- Pattern compiled once at import, not on each call

**Previous State:**
- `extract_proxies()`: 166 calls, 2.87s total (17.3ms each)
- Pattern recompiled on each call

**Improvement:**
- Eliminated per-call regex compilation overhead
- Estimated 30-50% reduction in extract_proxies() time
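
The change boils down to the following pattern. The regex here is a simplified ip:port matcher for illustration only; the real PROXY_PATTERN in fetch.py is not reproduced.

```python
import re

# Compiled once at import time rather than on every call.
PROXY_PATTERN = re.compile(r'(\d{1,3}(?:\.\d{1,3}){3}):(\d{2,5})')

def extract_proxies(text):
    return ["%s:%s" % (ip, port) for ip, port in PROXY_PATTERN.findall(text)]

print(extract_proxies("found 10.0.0.1:8080 and 192.168.1.5:3128 today"))
# → ['10.0.0.1:8080', '192.168.1.5:3128']
```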

---

### [ ] 4. JSON Stats Response Caching

**Current State:**
- 1.9M calls to JSON encoder functions
- `_iterencode_dict`: 1.4s, `_iterencode_list`: 0.8s
- Dashboard polls every 3 seconds = 600 requests per 30min
- Most stats data unchanged between requests

**Proposed Change:**
- Cache serialized JSON response with short TTL (1-2 seconds)
- Only regenerate when underlying stats change
- Use ETag/If-None-Match for client-side caching

**Assessment:**

```
Current cost: ~5.5s per 30min (JSON encoding overhead)
Potential saving: 60-80% = 3.3-4.4s per 30min = 6.6-8.8s/hour
Effort: Medium (add caching layer to httpd.py)
Risk: Low (stale stats for 1-2 seconds acceptable)
```

**Verdict:** LOW PRIORITY. Only matters with frequent dashboard access.
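
The proposed TTL cache could look like the sketch below. This is a hedged illustration of the proposal, not an implementation in httpd.py; `collect_stats` is a hypothetical stand-in for the real stats gathering.

```python
import json
import time

_TTL = 2.0  # matches the proposed 1-2s TTL
_cached = {"body": None, "expires": 0.0}

def collect_stats():
    # Stand-in for the real stats gathering.
    return {"tested": 123, "passed": 45}

def stats_response():
    now = time.time()
    if _cached["body"] is None or now >= _cached["expires"]:
        # Re-serialize only after the TTL lapses.
        _cached["body"] = json.dumps(collect_stats())
        _cached["expires"] = now + _TTL
    return _cached["body"]

a = stats_response()
b = stats_response()  # served from cache: the same serialized string object
print(a is b)  # → True
```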

---

### [ ] 5. Object Pooling for Test States

**Current State:**
- `__new__` calls: 43,413 at 10.1s total
- `ProxyTestState.__init__`: 18,150 calls, 0.87s
- `TargetTestJob` creation: similar overhead
- Objects created and discarded each test cycle

**Proposed Change:**
- Implement object pool for ProxyTestState and TargetTestJob
- Reset and reuse objects instead of creating new
- Pool size: 2x thread count

**Assessment:**

```
Current cost: ~11s per 30min = 22s/hour = 14.7min/day
Potential saving: 50-70% = 5.5-7.7s per 30min = 11-15s/hour = 7-10min/day
Effort: High (significant refactoring, reset logic needed)
Risk: Medium (state leakage bugs if reset incomplete)
```

**Verdict:** NOT RECOMMENDED. High effort, medium risk, modest gain.
Python's object creation is already optimized. Focus elsewhere.

---

### [ ] 6. SQLite Connection Reuse

**Current State:**
- 718 connection opens in 30min session
- Each open: 0.26ms (total 0.18s for connects)
- Connection-per-operation pattern in mysqlite.py

**Proposed Change:**
- Maintain persistent connection per thread
- Implement connection pool with health checks
- Reuse connections across operations

**Assessment:**

```
Current cost: 0.18s per 30min (connection overhead only)
Potential saving: 90% = 0.16s per 30min = 0.32s/hour
Effort: Medium (thread-local storage, lifecycle management)
Risk: Medium (connection state, locking issues)
```

**Verdict:** NOT RECOMMENDED. Negligible time savings (0.16s per 30min).
SQLite's lightweight connections don't justify pooling complexity.

---

### Summary: Optimization Priority Matrix

```
┌─────────────────────────────────────┬────────┬────────┬─────────┬───────────┐
│ Optimization                        │ Effort │ Risk   │ Savings │ Status    │
├─────────────────────────────────────┼────────┼────────┼─────────┼───────────┤
│ 1. SQLite Query Batching            │ Low    │ Low    │ 20-34s/h│ DONE      │
│ 2. Proxy Validation Caching         │ V.Low  │ None   │ 5-8s/h  │ DONE      │
│ 3. Regex Pre-compilation            │ Low    │ None   │ 5-8s/h  │ DONE      │
│ 4. JSON Response Caching            │ Medium │ Low    │ 7-9s/h  │ Later     │
│ 5. Object Pooling                   │ High   │ Medium │ 11-15s/h│ Skip      │
│ 6. SQLite Connection Reuse          │ Medium │ Medium │ 0.3s/h  │ Skip      │
└─────────────────────────────────────┴────────┴────────┴─────────┴───────────┘

Completed: 1 (SQLite Batching), 2 (Proxy Caching), 3 (Regex Pre-compilation)
Remaining: 4 (JSON Caching - Later)

Realized savings from completed optimizations:
  Per hour: 25-42 seconds saved
  Per day: 10-17 minutes saved
  Per week: 1.2-2.0 hours saved

Note: 68.7% of runtime is socket I/O (recv/send) which cannot be optimized
without changing the fundamental network architecture. The optimizations
above target the remaining 31.3% of CPU-bound operations.
```

---

## Potential Dashboard Improvements

### [ ] Dashboard Performance Optimizations

**Goal:** Ensure the dashboard remains lightweight and doesn't impact system performance.

**Current safeguards:**
- No polling on server side (client-initiated via fetch)
- 3-second refresh interval (configurable)
- Minimal DOM updates (targeted element updates, not full re-render)
- Static CSS/JS (no server-side templating per request)
- No persistent connections (stateless HTTP)

**Future considerations:**
- [x] Add rate limiting on /api/stats endpoint (300 req/60s sliding window)
- [ ] Cache expensive DB queries (top countries, protocol breakdown)
- [ ] Lazy-load historical data (only when scrolled into view)
- [ ] WebSocket option for push updates (reduce polling overhead)
- [ ] Configurable refresh interval via URL param or localStorage
- [x] Pause polling when browser tab not visible (Page Visibility API)
- [x] Skip chart rendering for inactive dashboard tabs (reduces CPU)
- [x] Batch API endpoints - combine /api/stats, /api/workers, /api/countries into
      single /api/dashboard call to reduce round-trips (helps SSH tunnel latency)
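
The implemented 300-req/60s sliding-window limiter can be sketched with a deque of request timestamps, as below. This is an illustration of the technique, not the project's actual code, and the limits are shrunk for demonstration.

```python
import time
from collections import deque

class SlidingWindowLimiter(object):
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = deque()  # timestamps of allowed requests

    def allow(self, now=None):
        now = time.time() if now is None else now
        # Drop timestamps that have slid out of the window.
        while self.hits and self.hits[0] <= now - self.window:
            self.hits.popleft()
        if len(self.hits) >= self.max_requests:
            return False
        self.hits.append(now)
        return True

limiter = SlidingWindowLimiter(max_requests=3, window_seconds=60)
print([limiter.allow(now=t) for t in (0, 1, 2, 3)])  # → [True, True, True, False]
print(limiter.allow(now=61))  # → True (earliest hits expired as the window slid)
```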

### [ ] Dashboard Feature Ideas

**Low priority - consider when time permits:**
- [x] Geographic map visualization - /map endpoint with Leaflet.js
- [x] Dark/light/muted theme toggle - t key cycles themes
- [x] Export stats as CSV/JSON from dashboard (/api/stats/export?format=json|csv)
- [ ] Historical graphs (24h, 7d) using stats_history table
- [ ] Per-ASN performance analysis
- [ ] Alert thresholds (success rate < X%, MITM detected)
- [ ] Mobile-responsive improvements
- [x] Keyboard shortcuts (r=refresh, 1-9=tabs, t=theme, p=pause)

### [x] Local JS Library Serving

**Completed.** All JavaScript libraries now served locally from the /static/lib/ endpoint.

**Bundled libraries (static/lib/):**
- Leaflet.js 1.9.4 (leaflet.js, leaflet.css)
- Leaflet.markercluster (MarkerCluster.Default.css)
- Chart.js 4.x (chart.umd.min.js)
- uPlot (uPlot.iife.min.js, uPlot.min.css)

**Candidate libraries for future enhancements:**

```
┌─────────────────┬─────────┬────────────────────────────────────────────────┐
│ Library         │ Size    │ Use Case                                       │
├─────────────────┼─────────┼────────────────────────────────────────────────┤
│ Chart.js        │ 65 KB   │ Line/bar/pie charts (simpler API than D3)      │
│ uPlot           │ 15 KB   │ Fast time-series charts (minimal, performant)  │
│ ApexCharts      │ 125 KB  │ Modern charts with animations                  │
│ Frappe Charts   │ 25 KB   │ Simple, modern SVG charts                      │
│ Sparkline       │ 2 KB    │ Tiny inline charts (already have custom impl)  │
├─────────────────┼─────────┼────────────────────────────────────────────────┤
│ D3.js           │ 85 KB   │ Full control, complex visualizations           │
│ D3-geo          │ 30 KB   │ Geographic projections (alternative to Leaflet)│
├─────────────────┼─────────┼────────────────────────────────────────────────┤
│ Leaflet         │ 40 KB   │ Interactive maps (already using)               │
│ Leaflet.heat    │ 5 KB    │ Heatmap layer for proxy density                │
│ Leaflet.cluster │ 10 KB   │ Marker clustering for many points              │
└─────────────────┴─────────┴────────────────────────────────────────────────┘

Recommendations:
  ● uPlot - Best for time-series (rate history, success rate history)
  ● Chart.js - Best for pie/bar charts (failure breakdown, protocol stats)
  ● Leaflet - Keep for maps, add heatmap plugin for density viz
```

**Current custom implementations (no library):**
- Sparkline charts (Test Rate History, Success Rate History) - inline SVG
- Histogram bars (Response Time Distribution) - CSS divs
- Pie charts (Failure Breakdown, Protocol Stats) - CSS conic-gradient

**Decision:** Current custom implementations are lightweight and sufficient.
Add libraries only when custom becomes unmaintainable or new features are needed.

### [ ] Memory Optimization Candidates

**Based on memory analysis (production metrics):**

```
Current State (260k queue):
  Start RSS: 442 MB
  Current RSS: 1,615 MB
  Per-job: ~4.5 KB overhead

Object Distribution:
  259,863 TargetTestJob (1 per job)
  259,863 ProxyTestState (1 per job)
  259,950 LockType (1 per job - threading locks)
  523,395 dict (2 per job - state + metadata)
  522,807 list (2 per job - results + targets)
```

**Potential optimizations:**
- [ ] Lock consolidation - reduce per-proxy locks (260k LockType objects)
- [ ] Leaner state objects - reduce dict/list count per job
- [x] Slot-based classes - use `__slots__` on ProxyTestState (27 attrs), TargetTestJob (4 attrs)
- [ ] Object pooling - reuse ProxyTestState/TargetTestJob objects (not recommended)
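
The `__slots__` change works because a slotted class drops the per-instance `__dict__`, which is where much of the per-job dict overhead above comes from. A minimal illustration (the attribute names are hypothetical stand-ins for the real 27/4 attributes):

```python
class JobPlain(object):
    def __init__(self, proxy, target):
        self.proxy = proxy
        self.target = target

class JobSlotted(object):
    # Fixed attribute set: no per-instance __dict__ is allocated.
    __slots__ = ("proxy", "target")

    def __init__(self, proxy, target):
        self.proxy = proxy
        self.target = target

plain = JobPlain("1.2.3.4:8080", "example.com")
slotted = JobSlotted("1.2.3.4:8080", "example.com")
print(hasattr(plain, "__dict__"), hasattr(slotted, "__dict__"))  # → True False
```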

**Verdict:** Memory scales linearly with queue (~4.5 KB/job). No leaks detected.
Current usage acceptable for production workloads. Optimize only if memory
becomes a constraint.

---

## Completed

### [x] Work-Stealing Queue
- Implemented shared Queue.Queue() for job distribution
- Workers pull from shared queue instead of pre-assigned lists
- Better utilization across threads
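
The shared-queue pattern can be sketched as below: all workers pull from one queue, so fast workers naturally pick up the slack. Python 3's `queue` module is used here for the sketch; the Python 2 project imports `Queue` instead.

```python
import queue
import threading

jobs = queue.Queue()
results = queue.Queue()

def worker():
    while True:
        job = jobs.get()
        if job is None:  # sentinel: shut this worker down
            break
        results.put(job * 2)  # stand-in for actually testing a proxy

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for n in range(10):
    jobs.put(n)
for _ in threads:
    jobs.put(None)  # one sentinel per worker
for t in threads:
    t.join()

print(sorted(results.queue))  # → [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```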

### [x] Multi-Target Validation
- Test each proxy against 3 random targets
- 2/3 majority required for success
- Reduces false negatives from single-target failures
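
The 2-of-3 vote reduces to a simple threshold check over the per-target results, sketched here (illustrative, not the project's code):

```python
def proxy_passes(results, required=2):
    """results: per-target booleans accumulated for one proxy."""
    return sum(1 for ok in results if ok) >= required

print(proxy_passes([True, False, True]))   # → True  (one flaky target tolerated)
print(proxy_passes([True, False, False]))  # → False (majority failed)
```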

### [x] Interleaved Testing
- Jobs shuffled across all proxies before queueing
- Prevents burst of 3 connections to same proxy
- ProxyTestState accumulates results from TargetTestJobs

### [x] Code Cleanup
- Removed 93 lines of dead HTTP server code (ppf.py)
- Removed dead gumbo parser (soup_parser.py)
- Removed test code (comboparse.py)
- Removed unused functions (misc.py)
- Fixed IP/port cleansing (ppf.py)
- Updated .gitignore

### [x] Rate Limiting & Instance Tracking (scraper.py)
- InstanceTracker class with exponential backoff
- Configurable backoff_base, backoff_max, fail_threshold
- Instance cycling when rate limited

### [x] Exception Logging with Context
- Replaced bare `except:` with typed exceptions across all files
- Added context logging to exception handlers (e.g., URL, error message)

### [x] Timeout Standardization
- Added timeout_connect, timeout_read to [common] config section
- Added stale_days, stats_interval to [watchd] config section

### [x] Periodic Stats & Stale Cleanup (proxywatchd.py)
- Stats class tracks tested/passed/failed with thread-safe counters
- Configurable stats_interval (default: 300s)
- cleanup_stale() removes dead proxies older than stale_days (default: 30)

### [x] Unified Proxy Cache
- Moved _known_proxies to fetch.py with helper functions
- init_known_proxies(), add_known_proxies(), is_known_proxy()
- ppf.py now uses shared cache via fetch module

### [x] Config Validation
- config.py: validate() method checks config values on startup
- Validates: port ranges, timeout values, thread counts, engine names
- Warns on missing source_file, unknown engines
- Errors on unwritable database directories
- Integrated into ppf.py, proxywatchd.py, scraper.py main entry points

### [x] Profiling Support
- config.py: Added --profile CLI argument
- ppf.py: Refactored main logic into main() function
- ppf.py: cProfile wrapper with stats output to profile.stats
- Prints top 20 functions by cumulative time on exit
- Usage: `python2 ppf.py --profile`

### [x] SIGTERM Graceful Shutdown
- ppf.py: Added signal handler converting SIGTERM to KeyboardInterrupt
- Ensures profile stats are written before container exit
- Allows clean thread shutdown in containerized environments
- Podman stop now triggers proper cleanup instead of SIGKILL
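
The handler pattern is small: raise KeyboardInterrupt from the SIGTERM handler so the existing Ctrl-C cleanup path (profile dump, thread shutdown) also runs when the container manager sends SIGTERM. A self-contained sketch (Unix-only, and illustrative rather than ppf.py's exact code):

```python
import os
import signal
import time

reached = False

def _sigterm_handler(signum, frame):
    # Re-raise as KeyboardInterrupt so the existing cleanup path runs.
    raise KeyboardInterrupt

signal.signal(signal.SIGTERM, _sigterm_handler)

try:
    os.kill(os.getpid(), signal.SIGTERM)  # simulate `podman stop`
    time.sleep(0.1)  # give the handler a chance to fire
except KeyboardInterrupt:
    reached = True
    print("clean shutdown path reached")
```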

### [x] Unicode Exception Handling (Python 2)
- Problem: `repr(e)` on exceptions with unicode content caused encoding errors
- Files affected: ppf.py, scraper.py (3 exception handlers)
- Solution: Check `isinstance(err_msg, unicode)` then encode with 'backslashreplace'
- Pattern applied:

```python
try:
    err_msg = repr(e)
    if isinstance(err_msg, unicode):
        err_msg = err_msg.encode('ascii', 'backslashreplace')
except:
    err_msg = type(e).__name__
```

- Handles Korean/CJK characters in search queries without crashing
### [x] Interactive World Map (/map endpoint)
- Added Leaflet.js interactive map showing proxy distribution by country
- Modern glassmorphism UI with `backdrop-filter: blur(12px)`
- CartoDB dark tiles for dark theme
- Circle markers sized proportionally to proxy count per country
- Hover effects with smooth transitions
- Stats overlay showing total countries/proxies
- Legend with proxy count scale
- Country coordinates and names lookup tables

### [x] Dashboard v3 - Electric Cyan Theme
- Translucent glass-morphism effects with `backdrop-filter: blur()`
- Electric cyan glow borders `rgba(56,189,248,...)` on all graph wrappers
- Gradient overlays using `::before` pseudo-elements
- Unified styling across: .chart-wrap, .histo-wrap, .stats-wrap, .lb-wrap, .pie-wrap
- New .tor-card wrapper for Tor Exit Nodes with hover effects
- Lighter background color scheme (#1e2738 bg, #181f2a card)

### [x] Map Endpoint Styling Update
- Converted from gold/bronze theme (#c8b48c) to electric cyan (#38bdf8)
- Glass panels with electric glow matching dashboard
- Map markers for approximate locations now cyan instead of gold
- Unified map_bg color with dashboard background (#1e2738)
- Updated Leaflet controls, popups, and legend to cyan theme

### [x] MITM Re-test Optimization
- Skip redundant SSL checks for proxies already known to be MITM
- Added `mitm_retest_skipped` counter to Stats class
- Optimization in `_try_ssl_check()` checks existing MITM flag before testing
- Avoids 6k+ unnecessary re-tests per session (based on production metrics)

### [x] Memory Profiling Endpoint
- /api/memory endpoint with comprehensive memory analysis
- objgraph integration for object type distribution
- pympler integration for memory summaries
- Memory sample history tracking (RSS over time)
- Process memory from /proc/self/status
- GC statistics and collection counts

### [x] Database Context Manager Refactoring
- Refactored all DB operations to use `_db_context()` context manager
- `prepare_jobs()`, `submit_collected()`, `_run()` now use `with self._db_context() as db:`
- `fetch_rows()` accepts db parameter for dependency injection
- Removed deprecated `_prep_db()` and `_close_db()` methods
- Connections guaranteed to close even on exceptions
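
The `_db_context()` pattern can be sketched as below: a `contextmanager` that opens the connection, yields it, and guarantees close (with rollback on error) even when the body raises. The class and table are illustrative stand-ins, not the project's mysqlite.py.

```python
import sqlite3
from contextlib import contextmanager

class ProxyDB(object):
    """Illustrative stand-in for the project's DB wrapper."""

    def __init__(self, path):
        self.path = path

    @contextmanager
    def _db_context(self):
        conn = sqlite3.connect(self.path)
        try:
            yield conn
            conn.commit()
        except Exception:
            conn.rollback()
            raise
        finally:
            conn.close()  # runs whether the body succeeded or raised

db = ProxyDB(":memory:")
with db._db_context() as conn:
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.execute("INSERT INTO t VALUES (1)")
    rows = conn.execute("SELECT x FROM t").fetchall()

# An exception inside the block still closes the connection.
try:
    with db._db_context() as conn:
        raise RuntimeError("boom")
except RuntimeError:
    pass

print(rows)  # → [(1,)]
```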

---

## Deployment Troubleshooting Log

### [x] Container Crash on Startup (2024-12-24)

**Symptoms:**
- Container starts then immediately disappears
- `podman ps` shows no running containers
- `podman logs ppf` returns "no such container"
- Port 8081 not listening

**Debugging Process:**

1. **Initial diagnosis** - SSH to odin, checked container state:
   ```bash
   sudo -u podman podman ps -a   # Empty
   sudo ss -tlnp | grep 8081     # Nothing listening
   ```

2. **Ran container in foreground** to capture output:
   ```bash
   sudo -u podman bash -c 'cd /home/podman/ppf && \
     timeout 25 podman run --rm --name ppf --network=host \
     -v ./src:/app:ro -v ./data:/app/data \
     -v ./config.ini:/app/config.ini:ro \
     localhost/ppf python2 -u proxywatchd.py 2>&1'
   ```

3. **Found the error** in httpd thread startup:
   ```
   error: [Errno 98] Address already in use: ('0.0.0.0', 8081)
   ```
   The container started, httpd failed to bind, and the process continued but HTTP was unavailable.

4. **Identified root cause** - orphaned processes from previous debug attempts:
   ```bash
   ps aux | grep -E "[p]pf|[p]roxy"
   # Found: python2 ppf.py (PID 6421) still running, holding port 8081
   # Found: conmon, timeout, bash processes from stale container
   ```

5. **Why orphans existed:**
   - Previous `timeout 15 podman run` commands timed out
   - `podman rm -f` doesn't kill processes when container metadata is corrupted
   - The orphaned python2 process kept running with the port bound

**Root Cause:**
Stale container processes from interrupted debug sessions held port 8081.
The container started successfully but the httpd thread failed to bind,
causing a silent failure (no HTTP endpoints) while proxy testing continued.

**Fix Applied:**
```bash
# Force kill all orphaned processes
sudo pkill -9 -f "ppf.py"
sudo pkill -9 -f "proxywatchd.py"
sudo pkill -9 -f "conmon.*ppf"
sleep 2

# Verify port is free
sudo ss -tlnp | grep 8081   # Should show nothing

# Clean podman state
sudo -u podman podman rm -f -a
sudo -u podman podman container prune -f

# Start fresh
sudo -u podman bash -c 'cd /home/podman/ppf && \
  podman run -d --rm --name ppf --network=host \
  -v ./src:/app:ro -v ./data:/app/data \
  -v ./config.ini:/app/config.ini:ro \
  localhost/ppf python2 -u proxywatchd.py'
```

**Verification:**
```bash
curl -sf http://localhost:8081/health
# {"status": "ok", "timestamp": 1766573885}
```

**Prevention:**
- Use `podman-compose` for reliable container management
- Use `pkill -9 -f` to kill orphaned processes before restart
- Check port availability before starting: `ss -tlnp | grep 8081`
- Run the container in the foreground first to capture startup errors

**Correct Deployment Procedure:**
```bash
# As root or with sudo
sudo -i -u podman bash
cd /home/podman/ppf
podman-compose down
podman-compose up -d
podman ps
podman logs -f ppf
```

**docker-compose.yml (updated):**
```yaml
version: '3.8'

services:
  ppf:
    image: localhost/ppf:latest
    container_name: ppf
    network_mode: host
    volumes:
      - ./src:/app:ro
      - ./data:/app/data
      - ./config.ini:/app/config.ini:ro
    command: python2 -u proxywatchd.py
    restart: unless-stopped
    environment:
      - PYTHONUNBUFFERED=1
```

---

### [x] SSH Connection Flooding / fail2ban (2024-12-24)

**Symptoms:**
- SSH connections timing out or being reset
- "Connection refused" errors
- Intermittent access to odin

**Root Cause:**
Multiple individual SSH commands triggered fail2ban rate limiting.

**Fix Applied:**
Created `~/.claude/rules/ssh-usage.md` with batching best practices.

**Key Pattern:**
```bash
# BAD: 3 separate connections
ssh host 'cmd1'
ssh host 'cmd2'
ssh host 'cmd3'

# GOOD: 1 connection, all commands
ssh host bash <<'EOF'
cmd1
cmd2
cmd3
EOF
```

---

### [!] Podman Container Metadata Disappears (2024-12-24)

**Symptoms:**
- `podman ps -a` shows empty even though the process is running
- `podman logs ppf` returns "no such container"
- Port is listening and service responds to health checks

**Observed Behavior:**
```
# Container starts
podman run -d --name ppf ...
# Returns container ID: dc55f0a218b7...

# Immediately after
podman ps -a                # Empty!
ss -tlnp | grep 8081        # Shows python2 listening
curl localhost:8081/health  # {"status": "ok"}
```

**Analysis:**
- The process runs correctly inside the container namespace
- Container metadata in podman's database is lost/corrupted
- May be related to `--rm` flag interaction with detached mode
- Rootless podman with overlayfs can have state sync issues

**Workaround:**
Service works despite missing metadata. Monitor via:
- `ss -tlnp | grep 8081` - port listening
- `ps aux | grep proxywatchd` - process running
- `curl localhost:8081/health` - service responding

**Impact:** Low. Service functions correctly. Only `podman logs` is unavailable.

---

### Container Debugging Checklist

When a container fails to start or crashes:

```
┌───┬────────────────────────────────────────────────────────┐
│ 1 │ Check for orphans: ps aux | grep -E "[p]rocess_name"   │
│ 2 │ Check port conflicts: ss -tlnp | grep PORT             │
│ 3 │ Run foreground: podman run --rm (no -d) to see output  │
│ 4 │ Check podman state: podman ps -a                       │
│ 5 │ Clean stale: pkill -9 -f "pattern" && podman rm -f -a  │
│ 6 │ Verify deps: config files, data dirs, volumes exist    │
│ 7 │ Check logs: podman logs container_name 2>&1 | tail -50 │
│ 8 │ Health check: curl -sf http://localhost:PORT/health    │
└───┴────────────────────────────────────────────────────────┘

Note: If podman ps shows empty but the port is listening and the health check
passes, the service is running correctly despite metadata issues. See "Podman
Container Metadata Disappears" above.
```