docs: update roadmap, todo, and add tasklist
All checks were successful
CI / validate (push) Successful in 21s
Restructure roadmap into phases. Clean up todo as intake buffer. Add execution tasklist with prioritized items.
164
ROADMAP.md
@@ -1,69 +1,92 @@
# PPF Project Roadmap
# PPF Roadmap

## Project Purpose

PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework designed to:

1. **Discover** proxy addresses by crawling websites and search engines
2. **Validate** proxies through multi-target testing via Tor
3. **Maintain** a database of working proxies with protocol detection (SOCKS4/SOCKS5/HTTP)

## Architecture Overview
## Architecture

```
┌─────────────────────────────────────────────────────────────────────────────┐
│ PPF Architecture │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ scraper.py │ │ ppf.py │ │proxywatchd │ │
│ │ │ │ │ │ │ │
│ │ Searx query │───>│ URL harvest │───>│ Proxy test │ │
│ │ URL finding │ │ Proxy extract│ │ Validation │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │
│ v v v │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ SQLite Databases │ │
│ │ uris.db (URLs) proxies.db (proxy list) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Network Layer │ │
│ │ rocksock.py ─── Tor SOCKS ─── Test Proxy ─── Target Server │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────┐
│ Odin (Master) │
│ httpd.py ─ API + SSL-only verification │
│ proxywatchd.py ─ proxy recheck daemon │
│ SQLite: proxies.db, websites.db │
└──────────┬───────────────────────────────┘
│ WireGuard (10.200.1.0/24)
┌────────────────┼────────────────┐
v v v
┌───────────┐ ┌───────────┐ ┌───────────┐
│ cassius │ │ edge │ │ sentinel │
│ Worker │ │ Worker │ │ Worker │
│ ppf.py │ │ ppf.py │ │ ppf.py │
└───────────┘ └───────────┘ └───────────┘
```

Workers claim URLs, extract proxies, test them, report back.
Master verifies (SSL-only), serves API, coordinates distribution.

## Constraints

- **Python 2.7** compatibility required
- **Minimal external dependencies** (avoid adding new modules)
- Current dependencies: beautifulsoup4, pyasn, IP2Location
- Data files: IP2LOCATION-LITE-DB1.BIN (country), ipasn.dat (ASN)
- Python 2.7 runtime (container-based)
- Minimal external dependencies
- All traffic via Tor

---

## Phase 1: Performance and Quality (current)

Profiling-driven optimizations and source pipeline hardening.

| Item | Status | Description |
|------|--------|-------------|
| Extraction short-circuits | done | Guard clauses in fetch.py extractors |
| Skip shutdown on failed sockets | pending | Avoid 39s/session wasted on dead connections |
| SQLite connection reuse (odin) | pending | Cache per-greenlet, eliminate 2.7k opens/session |
| Lazy-load ASN database | pending | Defer 3.6s startup cost to first lookup |
| Add more seed sources (100+) | pending | Expand beyond 37 hardcoded URLs |
| Protocol-aware source weighting | pending | Prioritize SOCKS5-yielding sources |

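The "Lazy-load ASN database" item could be approached along the following lines. This is a minimal sketch rather than the project's code: `LazyLoader` and `load_asn_table` are hypothetical names, and the stub loader only stands in for whatever actually parses `ipasn.dat`.

```python
import threading


class LazyLoader(object):
    """Defer an expensive load until the first lookup, then cache the result."""

    def __init__(self, loader):
        self._loader = loader          # callable that performs the slow load
        self._lock = threading.Lock()  # guards the one-time initialization
        self._value = None
        self._loaded = False

    def get(self):
        # Fast path: already loaded, no locking needed.
        if self._loaded:
            return self._value
        with self._lock:
            # Re-check inside the lock so only one caller runs the loader.
            if not self._loaded:
                self._value = self._loader()
                self._loaded = True
        return self._value


def load_asn_table():
    # Hypothetical stand-in for parsing ipasn.dat (prefix -> ASN).
    return {"192.0.2.0/24": 64496}


# Nothing is loaded at import time; the cost is paid on the first get().
asn_db = LazyLoader(load_asn_table)
```

Callers would use `asn_db.get()` wherever the table is needed, so a process that never performs a lookup never pays the load cost.
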
## Phase 2: Proxy Diversity and Consumer API

Address customer-reported quality gaps.

| Item | Status | Description |
|------|--------|-------------|
| ASN diversity scoring | pending | Deprioritize over-represented ASNs in testing |
| Graduated recheck intervals | pending | Fresh proxies rechecked more often than stale |
| API filters (proto/country/ASN/latency) | pending | Consumer-facing query parameters on /proxies |
| Latency-based ranking | pending | Expose latency percentiles per proxy |

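One possible reading of "Graduated recheck intervals" is a tiered backoff keyed on how long ago a proxy was last verified. The sketch below assumes that interpretation; `recheck_interval` and every threshold in it (15-minute base, 1-hour first tier, 24-hour cap) are illustrative values, not taken from proxywatchd.py.

```python
def recheck_interval(age_seconds, base=900, factor=2.0, cap=86400):
    """Return how many seconds to wait before rechecking a proxy.

    age_seconds is the time since the proxy was last verified working.
    Fresh proxies get the base interval; each age tier doubles it, up to cap.
    """
    interval = base
    threshold = 3600  # first tier boundary: verified within the last hour
    while age_seconds >= threshold and interval < cap:
        interval = min(cap, interval * factor)
        threshold *= 4  # tiers widen geometrically: 1h, 4h, 16h, ...
    return int(interval)
```

With these assumed defaults, a proxy verified moments ago is rechecked after 15 minutes, while one last seen a day ago waits two hours.
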
## Phase 3: Self-Expanding Source Pool

Worker-driven link discovery from productive pages.

| Item | Status | Description |
|------|--------|-------------|
| Link extraction from productive pages | pending | Parse HTML for links when page yields proxies |
| Report discovered URLs to master | pending | New endpoint for worker URL submissions |
| Conditional discovery | pending | Only extract links from confirmed-productive pages |

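The conditional discovery described in this phase might look roughly like the sketch below. It uses only the standard library to stay self-contained (the project itself lists beautifulsoup4 as a dependency); `LinkCollector`, `discover_links`, and the `min_proxies` threshold are hypothetical names chosen for illustration.

```python
try:  # Python 3
    from html.parser import HTMLParser
    from urllib.parse import urljoin
except ImportError:  # Python 2.7
    from HTMLParser import HTMLParser
    from urlparse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from anchor tags, without duplicates."""

    def __init__(self, base_url):
        HTMLParser.__init__(self)
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        url = urljoin(self.base_url, href)  # resolve relative links
        if url.startswith(("http://", "https://")) and url not in self.links:
            self.links.append(url)


def discover_links(page_url, html, proxies_found, min_proxies=1):
    """Extract links only when the page actually yielded proxies."""
    if proxies_found < min_proxies:
        return []  # unproductive page: skip parsing entirely
    collector = LinkCollector(page_url)
    collector.feed(html)
    return collector.links
```

Gating on `proxies_found` before parsing keeps unproductive pages cheap and limits the new-URL volume reported back to the master.
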
## Phase 4: Long-Term

| Item | Status | Description |
|------|--------|-------------|
| Python 3 migration | deferred | Unblocks modern deps, security patches, pyasn native |
| Worker trust scoring | pending | Activate spot-check verification framework |
| Dynamic target pool | pending | Auto-discover and rotate validation targets |
| Geographic target spread | pending | Ensure targets span multiple regions |

---

## Completed

### Target Management

| Task | Description | File(s) |
|------|-------------|---------|
| Target health tracking | Cooldown-based health tracking for all target pools (head, SSL, IRC, judges) | stats.py, proxywatchd.py |
| MITM field in proxy list | Expose mitm boolean in JSON proxy list endpoints | httpd.py |

---

## Open Work

### Target Management

| Task | Description | File(s) |
|------|-------------|---------|
| Dynamic target pool | Auto-discover and rotate validation targets | proxywatchd.py |
| Geographic target spread | Ensure targets span multiple regions | config.py |
| Item | Date | Description |
|------|------|-------------|
| last_seen freshness fix | 2026-02-22 | Watchd updates last_seen on verification |
| Periodic re-seeding | 2026-02-22 | Reset errored sources every 6h |
| ASN enrichment | 2026-02-22 | Pure-Python ipasn.dat reader + backfill |
| URL pipeline stats | 2026-02-22 | /api/stats exposes source health metrics |
| Extraction short-circuits | 2026-02-22 | Guard clauses + precompiled table regexes |
| Target health tracking | prior | Cooldown-based health for all target pools |
| MITM field in proxy list | prior | Expose mitm boolean in JSON endpoints |
| V1 worker protocol removal | prior | Cleaned up legacy --worker code path |

---

@@ -71,31 +94,12 @@ PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework design

| File | Purpose |
|------|---------|
| ppf.py | Main URL harvester daemon |
| ppf.py | URL harvester, worker main loop |
| proxywatchd.py | Proxy validation daemon |
| scraper.py | Searx search integration |
| fetch.py | HTTP fetching with proxy support |
| dbs.py | Database schema and inserts |
| mysqlite.py | SQLite wrapper |
| rocksock.py | Socket/proxy abstraction (3rd party) |
| http2.py | HTTP client implementation |
| httpd.py | Web dashboard and REST API server |
| fetch.py | HTTP fetching, proxy extraction |
| httpd.py | API server, worker coordination |
| dbs.py | Database schema, seed sources |
| config.py | Configuration management |
| comboparse.py | Config/arg parser framework |
| soup_parser.py | BeautifulSoup wrapper |
| misc.py | Utilities (timestamp, logging) |
| export.py | Proxy export CLI tool |
| engines.py | Search engine implementations |
| connection_pool.py | Tor connection pooling |
| network_stats.py | Network statistics tracking |
| dns.py | DNS resolution with caching |
| mitm.py | MITM certificate detection |
| job.py | Priority job queue |
| static/dashboard.js | Dashboard frontend logic |
| static/dashboard.html | Dashboard HTML template |
| tools/lib/ppf-common.sh | Shared ops library (hosts, wrappers, colors) |
| tools/ppf-deploy | Deploy wrapper (validation + playbook) |
| tools/ppf-logs | View container logs |
| tools/ppf-service | Container lifecycle management |
| tools/playbooks/deploy.yml | Ansible deploy playbook |
| tools/playbooks/inventory.ini | Host inventory (WireGuard IPs) |
| rocksock.py | Socket/proxy abstraction |
| http2.py | HTTP client implementation |
| tools/ppf-deploy | Deployment wrapper |

32
TASKLIST.md
Normal file
@@ -0,0 +1,32 @@
# PPF Tasklist

Active execution queue. Ordered by priority.

---

## In Progress

| # | Task | File(s) | Notes |
|---|------|---------|-------|
| 1 | Skip socket.shutdown on failed connections | rocksock.py | ~39s/session saved on workers |
| 4 | Add more seed sources (100+) | dbs.py | Expand PROXY_SOURCES list |
| 6 | Protocol-aware source weighting | httpd.py, ppf.py | Prioritize SOCKS5-yielding sources |

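Task 1 amounts to remembering whether a connection ever succeeded and skipping the shutdown step when it did not. The wrapper below is a minimal sketch of that idea under stated assumptions: `TrackedSocket` is a hypothetical class and does not reflect how rocksock.py is actually structured.

```python
import socket


class TrackedSocket(object):
    """Wrap a socket and only call shutdown() if connect() succeeded."""

    def __init__(self, sock=None):
        if sock is None:
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.sock = sock
        self.connected = False

    def connect(self, addr, timeout=5.0):
        self.sock.settimeout(timeout)
        try:
            self.sock.connect(addr)
            self.connected = True
        except (socket.error, socket.timeout):
            self.connected = False  # remember the failure for close()
        return self.connected

    def close(self):
        if self.connected:
            try:
                # Orderly shutdown only makes sense on a live connection.
                self.sock.shutdown(socket.SHUT_RDWR)
            except socket.error:
                pass  # peer may already be gone
        self.sock.close()  # always release the file descriptor
        self.connected = False
```

The file descriptor is still released on every path; only the shutdown call is conditional.
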
## Queued

| # | Task | File(s) | Notes |
|---|------|---------|-------|
| 2 | SQLite connection reuse on odin | httpd.py | Cache per-greenlet handle |
| 3 | Lazy-load ASN database | httpd.py | Defer to first lookup |
| 12 | API filters on /proxies (proto/country/ASN) | httpd.py | Consumer query params |
| 8 | Graduated recheck intervals | proxywatchd.py | Fresh proxies checked more often |

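For task 2, connection reuse can be sketched as a small per-context cache. The example below caches per thread with `threading.local`; a greenlet-based server would need the greenlet-local equivalent, and `get_conn` is a hypothetical helper rather than anything in httpd.py.

```python
import sqlite3
import threading

# Each thread sees its own attribute namespace on this object.
_local = threading.local()


def get_conn(path):
    """Return a cached connection for this thread, opening it on first use."""
    cache = getattr(_local, "conns", None)
    if cache is None:
        cache = _local.conns = {}
    conn = cache.get(path)
    if conn is None:
        # First request for this database in this thread: open and remember.
        conn = cache[path] = sqlite3.connect(path)
    return conn
```

Keeping the cache keyed by path allows the same helper to serve more than one database file without sharing a handle across threads.
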
## Done

| # | Task | Date |
|---|------|------|
| - | Extraction short-circuits | 2026-02-22 |
| - | last_seen freshness fix | 2026-02-22 |
| - | Periodic re-seeding | 2026-02-22 |
| - | ASN enrichment | 2026-02-22 |
| - | URL pipeline stats | 2026-02-22 |

82
TODO.md
@@ -1,83 +1,35 @@
# PPF TODO

## Optimization

### [ ] JSON Stats Response Caching

- Cache serialized JSON response with short TTL (1-2s)
- Only regenerate when underlying stats change
- Use ETag/If-None-Match for client-side caching
- Savings: ~7-9s/hour. Low priority, only matters with frequent dashboard access.

### [ ] Object Pooling for Test States

- Pool ProxyTestState and TargetTestJob, reset and reuse
- Savings: ~11-15s/hour. **Not recommended** - high effort, medium risk, modest gain.

### [ ] SQLite Connection Reuse

- Persistent connection per thread with health checks
- Savings: ~0.3s/hour. **Not recommended** - negligible benefit.
Intake buffer. Items refined here move to TASKLIST.md.

---

## Dashboard

### [ ] Performance

- Cache expensive DB queries (top countries, protocol breakdown)
- Lazy-load historical data (only when scrolled into view)
- WebSocket option for push updates (reduce polling overhead)
- Configurable refresh interval via URL param or localStorage

### [ ] Features

- Historical graphs (24h, 7d) using stats_history table
- Per-ASN performance analysis
- Alert thresholds (success rate < X%, MITM detected)
- Mobile-responsive improvements

---
- [ ] Cache expensive DB queries (top countries, protocol breakdown)
- [ ] Historical graphs (24h, 7d) using stats_history table
- [ ] Per-ASN performance analysis
- [ ] Alert thresholds (success rate < X%, MITM detected)
- [ ] WebSocket push updates (reduce polling overhead)
- [ ] Mobile-responsive improvements

## Memory

- [ ] Lock consolidation - reduce per-proxy locks (260k LockType objects)
- [ ] Leaner state objects - reduce dict/list count per job
- [ ] Lock consolidation (260k LockType objects at scale)
- [ ] Leaner state objects per job

Memory scales linearly with queue (~4.5 KB/job). No leaks detected.
Optimize only if memory becomes a constraint.
Memory scales ~4.5 KB/job. No leaks detected. Optimize only if constrained.

---
## Source Pipeline

## Deprecation

### [x] Remove V1 worker protocol

Completed. Removed `--worker` flag, `worker_main()`, `claim_work()`,
`submit_results()`, `/api/work`, `/api/results`, and related config
options. `--worker` now routes to the URL-driven protocol.

---
- [ ] PasteBin/GitHub API scrapers for proxy lists
- [ ] Telegram channel scrapers (beyond t.me/s/ HTML)
- [ ] Source quality decay tracking (flag sources going stale)
- [ ] Deduplication of sources across different URL forms

## Known Issues

### [!] Podman Container Metadata Disappears

`podman ps -a` shows empty even though process is running. Service functions
correctly despite missing metadata. Monitor via `ss -tlnp`, `ps aux`, or
`curl localhost:8081/health`. Low impact.

---

## Container Debugging Checklist

```
1. Check for orphans: ps aux | grep -E "[p]rocess_name"
2. Check port conflicts: ss -tlnp | grep PORT
3. Run foreground: podman run --rm (no -d) to see output
4. Check podman state: podman ps -a
5. Clean stale: pkill -9 -f "pattern" && podman rm -f -a
6. Verify deps: config files, data dirs, volumes exist
7. Check logs: podman logs container_name 2>&1 | tail -50
8. Health check: curl -sf http://localhost:PORT/health
```
`podman ps -a` shows empty even though process is running.
Monitor via `ss -tlnp`, `ps aux`, or `curl localhost:8081/health`.
