diff --git a/ROADMAP.md b/ROADMAP.md
index 758e025..b0b49ee 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -1,69 +1,92 @@
-# PPF Project Roadmap
+# PPF Roadmap
 
-## Project Purpose
-
-PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework designed to:
-
-1. **Discover** proxy addresses by crawling websites and search engines
-2. **Validate** proxies through multi-target testing via Tor
-3. **Maintain** a database of working proxies with protocol detection (SOCKS4/SOCKS5/HTTP)
-
-## Architecture Overview
+## Architecture
 
 ```
-┌─────────────────────────────────────────────────────────────────────────────┐
-│                              PPF Architecture                               │
-├─────────────────────────────────────────────────────────────────────────────┤
-│                                                                             │
-│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                      │
-│  │ scraper.py  │    │   ppf.py    │    │proxywatchd  │                      │
-│  │             │    │             │    │             │                      │
-│  │ Searx query │───>│ URL harvest │───>│ Proxy test  │                      │
-│  │ URL finding │    │ Proxy extract│   │ Validation  │                      │
-│  └─────────────┘    └─────────────┘    └─────────────┘                      │
-│         │                  │                  │                             │
-│         v                  v                  v                             │
-│  ┌─────────────────────────────────────────────────────────────────┐        │
-│  │                        SQLite Databases                         │        │
-│  │  uris.db (URLs)              proxies.db (proxy list)            │        │
-│  └─────────────────────────────────────────────────────────────────┘        │
-│                                                                             │
-│  ┌─────────────────────────────────────────────────────────────────┐        │
-│  │                          Network Layer                          │        │
-│  │  rocksock.py ─── Tor SOCKS ─── Test Proxy ─── Target Server     │        │
-│  └─────────────────────────────────────────────────────────────────┘        │
-│                                                                             │
-└─────────────────────────────────────────────────────────────────────────────┘
+              ┌──────────────────────────────────────────┐
+              │  Odin (Master)                           │
+              │  httpd.py ─ API + SSL-only verification  │
+              │  proxywatchd.py ─ proxy recheck daemon   │
+              │  SQLite: proxies.db, websites.db         │
+              └──────────┬───────────────────────────────┘
+                         │ WireGuard (10.200.1.0/24)
+        ┌────────────────┼────────────────┐
+        v                v                v
+  ┌───────────┐    ┌───────────┐    ┌───────────┐
+  │  cassius  │    │  edge     │    │  sentinel │
+  │  Worker   │    │  Worker   │    │  Worker   │
+  │  ppf.py   │    │  ppf.py   │    │  ppf.py   │
+  └───────────┘    └───────────┘    └───────────┘
 ```
 
+Workers claim URLs, extract proxies, test them, report back.
+Master verifies (SSL-only), serves API, coordinates distribution.
+
 ## Constraints
 
-- **Python 2.7** compatibility required
-- **Minimal external dependencies** (avoid adding new modules)
-- Current dependencies: beautifulsoup4, pyasn, IP2Location
-- Data files: IP2LOCATION-LITE-DB1.BIN (country), ipasn.dat (ASN)
+- Python 2.7 runtime (container-based)
+- Minimal external dependencies
+- All traffic via Tor
+
+---
+
+## Phase 1: Performance and Quality (current)
+
+Profiling-driven optimizations and source pipeline hardening.
+
+| Item | Status | Description |
+|------|--------|-------------|
+| Extraction short-circuits | done | Guard clauses in fetch.py extractors |
+| Skip shutdown on failed sockets | pending | Avoid 39s/session wasted on dead connections |
+| SQLite connection reuse (odin) | pending | Cache per-greenlet, eliminate 2.7k opens/session |
+| Lazy-load ASN database | pending | Defer 3.6s startup cost to first lookup |
+| Add more seed sources (100+) | pending | Expand beyond 37 hardcoded URLs |
+| Protocol-aware source weighting | pending | Prioritize SOCKS5-yielding sources |
+
+## Phase 2: Proxy Diversity and Consumer API
+
+Address customer-reported quality gaps.
+
+| Item | Status | Description |
+|------|--------|-------------|
+| ASN diversity scoring | pending | Deprioritize over-represented ASNs in testing |
+| Graduated recheck intervals | pending | Fresh proxies rechecked more often than stale |
+| API filters (proto/country/ASN/latency) | pending | Consumer-facing query parameters on /proxies |
+| Latency-based ranking | pending | Expose latency percentiles per proxy |
+
+## Phase 3: Self-Expanding Source Pool
+
+Worker-driven link discovery from productive pages.
+
+| Item | Status | Description |
+|------|--------|-------------|
+| Link extraction from productive pages | pending | Parse HTML for links when page yields proxies |
+| Report discovered URLs to master | pending | New endpoint for worker URL submissions |
+| Conditional discovery | pending | Only extract links from confirmed-productive pages |
+
+## Phase 4: Long-Term
+
+| Item | Status | Description |
+|------|--------|-------------|
+| Python 3 migration | deferred | Unblocks modern deps, security patches, pyasn native |
+| Worker trust scoring | pending | Activate spot-check verification framework |
+| Dynamic target pool | pending | Auto-discover and rotate validation targets |
+| Geographic target spread | pending | Ensure targets span multiple regions |
 
 ---
 
 ## Completed
 
-### Target Management
-
-| Task | Description | File(s) |
-|------|-------------|---------|
-| Target health tracking | Cooldown-based health tracking for all target pools (head, SSL, IRC, judges) | stats.py, proxywatchd.py |
-| MITM field in proxy list | Expose mitm boolean in JSON proxy list endpoints | httpd.py |
-
----
-
-## Open Work
-
-### Target Management
-
-| Task | Description | File(s) |
-|------|-------------|---------|
-| Dynamic target pool | Auto-discover and rotate validation targets | proxywatchd.py |
-| Geographic target spread | Ensure targets span multiple regions | config.py |
+| Item | Date | Description |
+|------|------|-------------|
+| last_seen freshness fix | 2026-02-22 | Watchd updates last_seen on verification |
+| Periodic re-seeding | 2026-02-22 | Reset errored sources every 6h |
+| ASN enrichment | 2026-02-22 | Pure-Python ipasn.dat reader + backfill |
+| URL pipeline stats | 2026-02-22 | /api/stats exposes source health metrics |
+| Extraction short-circuits | 2026-02-22 | Guard clauses + precompiled table regexes |
+| Target health tracking | prior | Cooldown-based health for all target pools |
+| MITM field in proxy list | prior | Expose mitm boolean in JSON endpoints |
+| V1 worker protocol removal | prior | Cleaned up legacy --worker code path |
 
 ---
 
@@ -71,31 +94,12 @@ PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework design
 
 | File | Purpose |
 |------|---------|
-| ppf.py | Main URL harvester daemon |
+| ppf.py | URL harvester, worker main loop |
 | proxywatchd.py | Proxy validation daemon |
-| scraper.py | Searx search integration |
-| fetch.py | HTTP fetching with proxy support |
-| dbs.py | Database schema and inserts |
-| mysqlite.py | SQLite wrapper |
-| rocksock.py | Socket/proxy abstraction (3rd party) |
-| http2.py | HTTP client implementation |
-| httpd.py | Web dashboard and REST API server |
+| fetch.py | HTTP fetching, proxy extraction |
+| httpd.py | API server, worker coordination |
+| dbs.py | Database schema, seed sources |
 | config.py | Configuration management |
-| comboparse.py | Config/arg parser framework |
-| soup_parser.py | BeautifulSoup wrapper |
-| misc.py | Utilities (timestamp, logging) |
-| export.py | Proxy export CLI tool |
-| engines.py | Search engine implementations |
-| connection_pool.py | Tor connection pooling |
-| network_stats.py | Network statistics tracking |
-| dns.py | DNS resolution with caching |
-| mitm.py | MITM certificate detection |
-| job.py | Priority job queue |
-| static/dashboard.js | Dashboard frontend logic |
-| static/dashboard.html | Dashboard HTML template |
-| tools/lib/ppf-common.sh | Shared ops library (hosts, wrappers, colors) |
-| tools/ppf-deploy | Deploy wrapper (validation + playbook) |
-| tools/ppf-logs | View container logs |
-| tools/ppf-service | Container lifecycle management |
-| tools/playbooks/deploy.yml | Ansible deploy playbook |
-| tools/playbooks/inventory.ini | Host inventory (WireGuard IPs) |
+| rocksock.py | Socket/proxy abstraction |
+| http2.py | HTTP client implementation |
+| tools/ppf-deploy | Deployment wrapper |
diff --git a/TASKLIST.md b/TASKLIST.md
new file mode 100644
index 0000000..8a6c0c8
--- /dev/null
+++ b/TASKLIST.md
@@ -0,0 +1,32 @@
+# PPF Tasklist
+
+Active execution queue. Ordered by priority.
+
+---
+
+## In Progress
+
+| # | Task | File(s) | Notes |
+|---|------|---------|-------|
+| 1 | Skip socket.shutdown on failed connections | rocksock.py | ~39s/session saved on workers |
+| 4 | Add more seed sources (100+) | dbs.py | Expand PROXY_SOURCES list |
+| 6 | Protocol-aware source weighting | httpd.py, ppf.py | Prioritize SOCKS5-yielding sources |
+
+## Queued
+
+| # | Task | File(s) | Notes |
+|---|------|---------|-------|
+| 2 | SQLite connection reuse on odin | httpd.py | Cache per-greenlet handle |
+| 3 | Lazy-load ASN database | httpd.py | Defer to first lookup |
+| 12 | API filters on /proxies (proto/country/ASN) | httpd.py | Consumer query params |
+| 8 | Graduated recheck intervals | proxywatchd.py | Fresh proxies checked more often |
+
+## Done
+
+| # | Task | Date |
+|---|------|------|
+| - | Extraction short-circuits | 2026-02-22 |
+| - | last_seen freshness fix | 2026-02-22 |
+| - | Periodic re-seeding | 2026-02-22 |
+| - | ASN enrichment | 2026-02-22 |
+| - | URL pipeline stats | 2026-02-22 |
diff --git a/TODO.md b/TODO.md
index 2a10e81..f9feab5 100644
--- a/TODO.md
+++ b/TODO.md
@@ -1,83 +1,35 @@
 # PPF TODO
 
-## Optimization
-
-### [ ] JSON Stats Response Caching
-
-- Cache serialized JSON response with short TTL (1-2s)
-- Only regenerate when underlying stats change
-- Use ETag/If-None-Match for client-side caching
-- Savings: ~7-9s/hour. Low priority, only matters with frequent dashboard access.
-
-### [ ] Object Pooling for Test States
-
-- Pool ProxyTestState and TargetTestJob, reset and reuse
-- Savings: ~11-15s/hour. **Not recommended** - high effort, medium risk, modest gain.
-
-### [ ] SQLite Connection Reuse
-
-- Persistent connection per thread with health checks
-- Savings: ~0.3s/hour. **Not recommended** - negligible benefit.
+Intake buffer. Items refined here move to TASKLIST.md.
 
 ---
 
 ## Dashboard
 
-### [ ] Performance
-
-- Cache expensive DB queries (top countries, protocol breakdown)
-- Lazy-load historical data (only when scrolled into view)
-- WebSocket option for push updates (reduce polling overhead)
-- Configurable refresh interval via URL param or localStorage
-
-### [ ] Features
-
-- Historical graphs (24h, 7d) using stats_history table
-- Per-ASN performance analysis
-- Alert thresholds (success rate < X%, MITM detected)
-- Mobile-responsive improvements
-
----
+- [ ] Cache expensive DB queries (top countries, protocol breakdown)
+- [ ] Historical graphs (24h, 7d) using stats_history table
+- [ ] Per-ASN performance analysis
+- [ ] Alert thresholds (success rate < X%, MITM detected)
+- [ ] WebSocket push updates (reduce polling overhead)
+- [ ] Mobile-responsive improvements
 
 ## Memory
 
-- [ ] Lock consolidation - reduce per-proxy locks (260k LockType objects)
-- [ ] Leaner state objects - reduce dict/list count per job
+- [ ] Lock consolidation (260k LockType objects at scale)
+- [ ] Leaner state objects per job
 
-Memory scales linearly with queue (~4.5 KB/job). No leaks detected.
-Optimize only if memory becomes a constraint.
+Memory scales ~4.5 KB/job. No leaks detected. Optimize only if constrained.
 
----
+## Source Pipeline
 
-## Deprecation
-
-### [x] Remove V1 worker protocol
-
-Completed. Removed `--worker` flag, `worker_main()`, `claim_work()`,
-`submit_results()`, `/api/work`, `/api/results`, and related config
-options. `--worker` now routes to the URL-driven protocol.
-
----
+- [ ] PasteBin/GitHub API scrapers for proxy lists
+- [ ] Telegram channel scrapers (beyond t.me/s/ HTML)
+- [ ] Source quality decay tracking (flag sources going stale)
+- [ ] Deduplication of sources across different URL forms
 
 ## Known Issues
 
 ### [!] Podman Container Metadata Disappears
 
-`podman ps -a` shows empty even though process is running. Service functions
-correctly despite missing metadata. Monitor via `ss -tlnp`, `ps aux`, or
-`curl localhost:8081/health`. Low impact.
-
----
-
-## Container Debugging Checklist
-
-```
-1. Check for orphans: ps aux | grep -E "[p]rocess_name"
-2. Check port conflicts: ss -tlnp | grep PORT
-3. Run foreground: podman run --rm (no -d) to see output
-4. Check podman state: podman ps -a
-5. Clean stale: pkill -9 -f "pattern" && podman rm -f -a
-6. Verify deps: config files, data dirs, volumes exist
-7. Check logs: podman logs container_name 2>&1 | tail -50
-8. Health check: curl -sf http://localhost:PORT/health
-```
+`podman ps -a` shows empty even though process is running.
+Monitor via `ss -tlnp`, `ps aux`, or `curl localhost:8081/health`.