docs: update roadmap, todo, and add tasklist
All checks were successful
CI / validate (push) Successful in 21s

Restructure roadmap into phases. Clean up todo as intake buffer.
Add execution tasklist with prioritized items.
This commit is contained in:
Username
2026-02-22 13:58:37 +01:00
parent f9d237fe0d
commit 93eb395727
3 changed files with 133 additions and 145 deletions

View File

@@ -1,69 +1,92 @@
# PPF Project Roadmap
# PPF Roadmap
## Project Purpose
PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework designed to:
1. **Discover** proxy addresses by crawling websites and search engines
2. **Validate** proxies through multi-target testing via Tor
3. **Maintain** a database of working proxies with protocol detection (SOCKS4/SOCKS5/HTTP)
## Architecture Overview
## Architecture
```
┌─────────────────────────────────────────────────────────────────────────────┐
PPF Architecture
├─────────────────────────────────────────────────────────────────────────────┤
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ scraper.py │ │ ppf.py │ │proxywatchd │ │
│ │ │ │ │
│ Searx query │───>│ URL harvest │───>│ Proxy test
│ URL finding │ │ Proxy extract│ │ Validation │
─────────────┘ └───────────── ─────────────┘ │
v v v
│ ┌─────────────────────────────────────────────────────────────────┐
│ │ SQLite Databases │ │
│ │ uris.db (URLs) proxies.db (proxy list) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Network Layer │ │
│ │ rocksock.py ─── Tor SOCKS ─── Test Proxy ─── Target Server │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
──────────────────────────────────────────┐
Odin (Master)
│ httpd.py ─ API + SSL-only verification │
proxywatchd.py ─ proxy recheck daemon
SQLite: proxies.db, websites.db
└──────────┬───────────────────────────────┘
WireGuard (10.200.1.0/24)
┌────────────────┼────────────────┐
v v v
───────────┐ ┌─────────── ───────────
cassius edge │ │ sentinel
│ Worker Worker Worker
│ ppf.py │ │ ppf.py │ ppf.py
└───────────┘ └───────────┘ └───────────┘
```
Workers claim URLs, extract proxies, test them, report back.
Master verifies (SSL-only), serves API, coordinates distribution.
## Constraints
- **Python 2.7** compatibility required
- **Minimal external dependencies** (avoid adding new modules)
- Current dependencies: beautifulsoup4, pyasn, IP2Location
- Data files: IP2LOCATION-LITE-DB1.BIN (country), ipasn.dat (ASN)
- Python 2.7 runtime (container-based)
- Minimal external dependencies
- All traffic via Tor
---
## Phase 1: Performance and Quality (current)
Profiling-driven optimizations and source pipeline hardening.
| Item | Status | Description |
|------|--------|-------------|
| Extraction short-circuits | done | Guard clauses in fetch.py extractors |
| Skip shutdown on failed sockets | pending | Avoid 39s/session wasted on dead connections |
| SQLite connection reuse (odin) | pending | Cache per-greenlet, eliminate 2.7k opens/session |
| Lazy-load ASN database | pending | Defer 3.6s startup cost to first lookup |
| Add more seed sources (100+) | pending | Expand beyond 37 hardcoded URLs |
| Protocol-aware source weighting | pending | Prioritize SOCKS5-yielding sources |
## Phase 2: Proxy Diversity and Consumer API
Address customer-reported quality gaps.
| Item | Status | Description |
|------|--------|-------------|
| ASN diversity scoring | pending | Deprioritize over-represented ASNs in testing |
| Graduated recheck intervals | pending | Fresh proxies rechecked more often than stale |
| API filters (proto/country/ASN/latency) | pending | Consumer-facing query parameters on /proxies |
| Latency-based ranking | pending | Expose latency percentiles per proxy |
## Phase 3: Self-Expanding Source Pool
Worker-driven link discovery from productive pages.
| Item | Status | Description |
|------|--------|-------------|
| Link extraction from productive pages | pending | Parse HTML for links when page yields proxies |
| Report discovered URLs to master | pending | New endpoint for worker URL submissions |
| Conditional discovery | pending | Only extract links from confirmed-productive pages |
## Phase 4: Long-Term
| Item | Status | Description |
|------|--------|-------------|
| Python 3 migration | deferred | Unblocks modern deps, security patches, pyasn native |
| Worker trust scoring | pending | Activate spot-check verification framework |
| Dynamic target pool | pending | Auto-discover and rotate validation targets |
| Geographic target spread | pending | Ensure targets span multiple regions |
---
## Completed
### Target Management
| Task | Description | File(s) |
|------|-------------|---------|
| Target health tracking | Cooldown-based health tracking for all target pools (head, SSL, IRC, judges) | stats.py, proxywatchd.py |
| MITM field in proxy list | Expose mitm boolean in JSON proxy list endpoints | httpd.py |
---
## Open Work
### Target Management
| Task | Description | File(s) |
|------|-------------|---------|
| Dynamic target pool | Auto-discover and rotate validation targets | proxywatchd.py |
| Geographic target spread | Ensure targets span multiple regions | config.py |
| Item | Date | Description |
|------|------|-------------|
| last_seen freshness fix | 2026-02-22 | Watchd updates last_seen on verification |
| Periodic re-seeding | 2026-02-22 | Reset errored sources every 6h |
| ASN enrichment | 2026-02-22 | Pure-Python ipasn.dat reader + backfill |
| URL pipeline stats | 2026-02-22 | /api/stats exposes source health metrics |
| Extraction short-circuits | 2026-02-22 | Guard clauses + precompiled table regexes |
| Target health tracking | prior | Cooldown-based health for all target pools |
| MITM field in proxy list | prior | Expose mitm boolean in JSON endpoints |
| V1 worker protocol removal | prior | Cleaned up legacy --worker code path |
---
@@ -71,31 +94,12 @@ PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework design
| File | Purpose |
|------|---------|
| ppf.py | Main URL harvester daemon |
| ppf.py | URL harvester, worker main loop |
| proxywatchd.py | Proxy validation daemon |
| scraper.py | Searx search integration |
| fetch.py | HTTP fetching with proxy support |
| dbs.py | Database schema and inserts |
| mysqlite.py | SQLite wrapper |
| rocksock.py | Socket/proxy abstraction (3rd party) |
| http2.py | HTTP client implementation |
| httpd.py | Web dashboard and REST API server |
| fetch.py | HTTP fetching, proxy extraction |
| httpd.py | API server, worker coordination |
| dbs.py | Database schema, seed sources |
| config.py | Configuration management |
| comboparse.py | Config/arg parser framework |
| soup_parser.py | BeautifulSoup wrapper |
| misc.py | Utilities (timestamp, logging) |
| export.py | Proxy export CLI tool |
| engines.py | Search engine implementations |
| connection_pool.py | Tor connection pooling |
| network_stats.py | Network statistics tracking |
| dns.py | DNS resolution with caching |
| mitm.py | MITM certificate detection |
| job.py | Priority job queue |
| static/dashboard.js | Dashboard frontend logic |
| static/dashboard.html | Dashboard HTML template |
| tools/lib/ppf-common.sh | Shared ops library (hosts, wrappers, colors) |
| tools/ppf-deploy | Deploy wrapper (validation + playbook) |
| tools/ppf-logs | View container logs |
| tools/ppf-service | Container lifecycle management |
| tools/playbooks/deploy.yml | Ansible deploy playbook |
| tools/playbooks/inventory.ini | Host inventory (WireGuard IPs) |
| rocksock.py | Socket/proxy abstraction |
| http2.py | HTTP client implementation |
| tools/ppf-deploy | Deployment wrapper |