docs: update roadmap, todo, and add tasklist

Restructure roadmap into phases. Clean up todo as intake buffer.
Add execution tasklist with prioritized items.
Username
2026-02-22 13:58:37 +01:00
parent f9d237fe0d
commit 93eb395727
3 changed files with 133 additions and 145 deletions


@@ -1,69 +1,92 @@
# PPF Project Roadmap
# PPF Roadmap
## Project Purpose
PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework designed to:
1. **Discover** proxy addresses by crawling websites and search engines
2. **Validate** proxies through multi-target testing via Tor
3. **Maintain** a database of working proxies with protocol detection (SOCKS4/SOCKS5/HTTP)
## Architecture Overview
## Architecture
```
┌───────────────────────────────────────────────────────────────────────┐
│                           PPF Architecture                            │
├───────────────────────────────────────────────────────────────────────┤
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐             │
│  │ scraper.py   │    │ ppf.py       │    │ proxywatchd  │             │
│  │              │    │              │    │              │             │
│  │ Searx query  │───>│ URL harvest  │───>│ Proxy test   │             │
│  │ URL finding  │    │ Proxy extract│    │ Validation   │             │
│  └──────────────┘    └──────────────┘    └──────────────┘             │
│          v                   v                   v                    │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │ SQLite Databases                                                │  │
│  │ uris.db (URLs)          proxies.db (proxy list)                 │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                                                       │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │ Network Layer                                                   │  │
│  │ rocksock.py ─── Tor SOCKS ─── Test Proxy ─── Target Server      │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘
            ┌──────────────────────────────────────────┐
            │ Odin (Master)                            │
            │ httpd.py ─ API + SSL-only verification   │
            │ proxywatchd.py ─ proxy recheck daemon    │
            │ SQLite: proxies.db, websites.db          │
            └──────────┬───────────────────────────────┘
                       │ WireGuard (10.200.1.0/24)
      ┌────────────────┼────────────────┐
      v                v                v
┌───────────┐    ┌───────────┐    ┌───────────┐
│ cassius   │    │ edge      │    │ sentinel  │
│ Worker    │    │ Worker    │    │ Worker    │
│ ppf.py    │    │ ppf.py    │    │ ppf.py    │
└───────────┘    └───────────┘    └───────────┘
```
Workers claim URLs, extract proxies, test them, report back.
Master verifies (SSL-only), serves API, coordinates distribution.
## Constraints
- **Python 2.7** compatibility required
- **Minimal external dependencies** (avoid adding new modules)
- Current dependencies: beautifulsoup4, pyasn, IP2Location
- Data files: IP2LOCATION-LITE-DB1.BIN (country), ipasn.dat (ASN)
- Python 2.7 runtime (container-based)
- Minimal external dependencies
- All traffic via Tor
---
## Phase 1: Performance and Quality (current)
Profiling-driven optimizations and source pipeline hardening.
| Item | Status | Description |
|------|--------|-------------|
| Extraction short-circuits | done | Guard clauses in fetch.py extractors |
| Skip shutdown on failed sockets | pending | Avoid 39s/session wasted on dead connections |
| SQLite connection reuse (odin) | pending | Cache per-greenlet, eliminate 2.7k opens/session |
| Lazy-load ASN database | pending | Defer 3.6s startup cost to first lookup |
| Add more seed sources (100+) | pending | Expand beyond 37 hardcoded URLs |
| Protocol-aware source weighting | pending | Prioritize SOCKS5-yielding sources |
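The "Lazy-load ASN database" item could be approached along the lines below. This is a minimal sketch, not the project's code: `LazyResource` and `load_asn_table` are hypothetical names, and the loader is a stand-in for whatever actually parses `ipasn.dat`. The idea is only that the slow load runs on the first lookup instead of at startup.

```python
import threading


class LazyResource(object):
    """Defer an expensive load until the first lookup (hypothetical helper)."""

    def __init__(self, loader):
        self._loader = loader          # callable that performs the slow load
        self._value = None
        self._loaded = False
        self._lock = threading.Lock()

    def get(self):
        # Fast path: already loaded, no lock needed.
        if self._loaded:
            return self._value
        with self._lock:
            # Re-check inside the lock so only one caller runs the loader.
            if not self._loaded:
                self._value = self._loader()
                self._loaded = True
        return self._value


def load_asn_table():
    # Stand-in for the real ipasn.dat parse (assumption): a tiny
    # prefix -> ASN map using documentation-reserved values.
    return {"192.0.2.0/24": 64500}


asn_db = LazyResource(load_asn_table)
```

Callers would use `asn_db.get()` wherever they previously used the eagerly loaded table; the startup cost then moves to whichever request performs the first lookup.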
## Phase 2: Proxy Diversity and Consumer API
Address customer-reported quality gaps.
| Item | Status | Description |
|------|--------|-------------|
| ASN diversity scoring | pending | Deprioritize over-represented ASNs in testing |
| Graduated recheck intervals | pending | Fresh proxies rechecked more often than stale |
| API filters (proto/country/ASN/latency) | pending | Consumer-facing query parameters on /proxies |
| Latency-based ranking | pending | Expose latency percentiles per proxy |
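One way to read "Graduated recheck intervals" is a tiered schedule keyed on how recently a proxy was last seen working. The sketch below assumes that reading; the tier boundaries, the intervals, and the names `RECHECK_TIERS`, `recheck_interval`, and `is_due` are all illustrative and do not come from the codebase.

```python
# (max seconds since last_seen, recheck interval in seconds): assumed values.
RECHECK_TIERS = [
    (3600, 300),              # seen working in the last hour: every 5 min
    (86400, 1800),            # seen in the last day: every 30 min
    (7 * 86400, 6 * 3600),    # seen in the last week: every 6 h
]
STALE_INTERVAL = 24 * 3600    # anything older: once a day


def recheck_interval(seconds_since_last_seen):
    """Return the recheck interval for a proxy of the given staleness."""
    for max_age, interval in RECHECK_TIERS:
        if seconds_since_last_seen < max_age:
            return interval
    return STALE_INTERVAL


def is_due(now, last_seen, last_checked):
    """True when the proxy has waited at least its tier's interval."""
    return now - last_checked >= recheck_interval(now - last_seen)
```

A table-driven schedule keeps the policy in one place, so the tiers can be tuned without touching the scheduling loop.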
## Phase 3: Self-Expanding Source Pool
Worker-driven link discovery from productive pages.
| Item | Status | Description |
|------|--------|-------------|
| Link extraction from productive pages | pending | Parse HTML for links when page yields proxies |
| Report discovered URLs to master | pending | New endpoint for worker URL submissions |
| Conditional discovery | pending | Only extract links from confirmed-productive pages |
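The Phase 3 items can be sketched as follows: links are extracted only when the page has already yielded proxies. This is an illustration under assumptions. It uses the standard-library HTML parser rather than the project's BeautifulSoup wrapper so the block stays self-contained, and `LinkCollector`, `discover_links`, and the `min_proxies` threshold are hypothetical.

```python
try:
    from html.parser import HTMLParser      # Python 3
    from urllib.parse import urljoin
except ImportError:
    from HTMLParser import HTMLParser       # Python 2.7
    from urlparse import urljoin


class LinkCollector(HTMLParser):
    """Collect unique absolute http(s) links from anchor tags."""

    def __init__(self, base_url):
        HTMLParser.__init__(self)
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        url = urljoin(self.base_url, href)   # resolve relative links
        if url.startswith(("http://", "https://")) and url not in self.links:
            self.links.append(url)


def discover_links(page_url, html, proxies_found, min_proxies=1):
    # Conditional discovery: skip parsing entirely for unproductive pages.
    if proxies_found < min_proxies:
        return []
    collector = LinkCollector(page_url)
    collector.feed(html)
    return collector.links
```

The returned list is what a worker would report to the master through the proposed URL submission endpoint.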
## Phase 4: Long-Term
| Item | Status | Description |
|------|--------|-------------|
| Python 3 migration | deferred | Unblocks modern deps, security patches, pyasn native |
| Worker trust scoring | pending | Activate spot-check verification framework |
| Dynamic target pool | pending | Auto-discover and rotate validation targets |
| Geographic target spread | pending | Ensure targets span multiple regions |
---
## Completed
### Target Management
| Task | Description | File(s) |
|------|-------------|---------|
| Target health tracking | Cooldown-based health tracking for all target pools (head, SSL, IRC, judges) | stats.py, proxywatchd.py |
| MITM field in proxy list | Expose mitm boolean in JSON proxy list endpoints | httpd.py |
---
## Open Work
### Target Management
| Task | Description | File(s) |
|------|-------------|---------|
| Dynamic target pool | Auto-discover and rotate validation targets | proxywatchd.py |
| Geographic target spread | Ensure targets span multiple regions | config.py |
| Item | Date | Description |
|------|------|-------------|
| last_seen freshness fix | 2026-02-22 | Watchd updates last_seen on verification |
| Periodic re-seeding | 2026-02-22 | Reset errored sources every 6h |
| ASN enrichment | 2026-02-22 | Pure-Python ipasn.dat reader + backfill |
| URL pipeline stats | 2026-02-22 | /api/stats exposes source health metrics |
| Extraction short-circuits | 2026-02-22 | Guard clauses + precompiled table regexes |
| Target health tracking | prior | Cooldown-based health for all target pools |
| MITM field in proxy list | prior | Expose mitm boolean in JSON endpoints |
| V1 worker protocol removal | prior | Cleaned up legacy --worker code path |
---
@@ -71,31 +94,12 @@ PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework design
| File | Purpose |
|------|---------|
| ppf.py | Main URL harvester daemon |
| ppf.py | URL harvester, worker main loop |
| proxywatchd.py | Proxy validation daemon |
| scraper.py | Searx search integration |
| fetch.py | HTTP fetching with proxy support |
| dbs.py | Database schema and inserts |
| mysqlite.py | SQLite wrapper |
| rocksock.py | Socket/proxy abstraction (3rd party) |
| http2.py | HTTP client implementation |
| httpd.py | Web dashboard and REST API server |
| fetch.py | HTTP fetching, proxy extraction |
| httpd.py | API server, worker coordination |
| dbs.py | Database schema, seed sources |
| config.py | Configuration management |
| comboparse.py | Config/arg parser framework |
| soup_parser.py | BeautifulSoup wrapper |
| misc.py | Utilities (timestamp, logging) |
| export.py | Proxy export CLI tool |
| engines.py | Search engine implementations |
| connection_pool.py | Tor connection pooling |
| network_stats.py | Network statistics tracking |
| dns.py | DNS resolution with caching |
| mitm.py | MITM certificate detection |
| job.py | Priority job queue |
| static/dashboard.js | Dashboard frontend logic |
| static/dashboard.html | Dashboard HTML template |
| tools/lib/ppf-common.sh | Shared ops library (hosts, wrappers, colors) |
| tools/ppf-deploy | Deploy wrapper (validation + playbook) |
| tools/ppf-logs | View container logs |
| tools/ppf-service | Container lifecycle management |
| tools/playbooks/deploy.yml | Ansible deploy playbook |
| tools/playbooks/inventory.ini | Host inventory (WireGuard IPs) |
| rocksock.py | Socket/proxy abstraction |
| http2.py | HTTP client implementation |
| tools/ppf-deploy | Deployment wrapper |

TASKLIST.md (new file)

@@ -0,0 +1,32 @@
# PPF Tasklist
Active execution queue. Ordered by priority.
---
## In Progress
| # | Task | File(s) | Notes |
|---|------|---------|-------|
| 1 | Skip socket.shutdown on failed connections | rocksock.py | ~39s/session saved on workers |
| 4 | Add more seed sources (100+) | dbs.py | Expand PROXY_SOURCES list |
| 6 | Protocol-aware source weighting | httpd.py, ppf.py | Prioritize SOCKS5-yielding sources |
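Task 1 can be illustrated with a plain-socket sketch. It does not reflect `rocksock.py` internals, which are not shown here; `TrackedSocket` is a hypothetical wrapper that records whether the connection was ever established and calls `shutdown()` only in that case.

```python
import socket


class TrackedSocket(object):
    """Wrap a socket and remember whether connect() succeeded."""

    def __init__(self, sock=None):
        if sock is None:
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.sock = sock
        self.connected = False

    def connect(self, addr, timeout=5.0):
        self.sock.settimeout(timeout)
        try:
            self.sock.connect(addr)
            self.connected = True
        except (socket.error, socket.timeout):
            self.connected = False
        return self.connected

    def close(self):
        # shutdown() only makes sense on an established connection;
        # a socket that never connected goes straight to close().
        if self.connected:
            try:
                self.sock.shutdown(socket.SHUT_RDWR)
            except socket.error:
                pass
        self.sock.close()
        self.connected = False
```

The optional `sock` argument exists so the wrapper can be exercised with a stub in tests without opening real connections.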
## Queued
| # | Task | File(s) | Notes |
|---|------|---------|-------|
| 2 | SQLite connection reuse on odin | httpd.py | Cache per-greenlet handle |
| 3 | Lazy-load ASN database | httpd.py | Defer to first lookup |
| 12 | API filters on /proxies (proto/country/ASN) | httpd.py | Consumer query params |
| 8 | Graduated recheck intervals | proxywatchd.py | Fresh proxies checked more often |
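For task 2, a cached handle per execution context might look like the sketch below. It uses `threading.local` as a stand-in and assumes that, with gevent's default monkey-patching, thread-local storage becomes greenlet-local; that assumption should be verified against the deployed gevent version. `get_conn` is a hypothetical helper name.

```python
import sqlite3
import threading

_local = threading.local()


def get_conn(path):
    """Return this context's cached connection to `path`, opening it once."""
    cache = getattr(_local, "conns", None)
    if cache is None:
        cache = _local.conns = {}
    conn = cache.get(path)
    if conn is None:
        # First use in this thread/greenlet: open and remember the handle.
        conn = cache[path] = sqlite3.connect(path)
    return conn
```

Each context keeps its own handle, which avoids sharing one `sqlite3` connection across threads while still eliminating the repeated opens.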
## Done
| # | Task | Date |
|---|------|------|
| - | Extraction short-circuits | 2026-02-22 |
| - | last_seen freshness fix | 2026-02-22 |
| - | Periodic re-seeding | 2026-02-22 |
| - | ASN enrichment | 2026-02-22 |
| - | URL pipeline stats | 2026-02-22 |

TODO.md

@@ -1,83 +1,35 @@
# PPF TODO
## Optimization
### [ ] JSON Stats Response Caching
- Cache serialized JSON response with short TTL (1-2s)
- Only regenerate when underlying stats change
- Use ETag/If-None-Match for client-side caching
- Savings: ~7-9s/hour. Low priority, only matters with frequent dashboard access.
### [ ] Object Pooling for Test States
- Pool ProxyTestState and TargetTestJob, reset and reuse
- Savings: ~11-15s/hour. **Not recommended** - high effort, medium risk, modest gain.
### [ ] SQLite Connection Reuse
- Persistent connection per thread with health checks
- Savings: ~0.3s/hour. **Not recommended** - negligible benefit.
Intake buffer. Items refined here move to TASKLIST.md.
---
## Dashboard
### [ ] Performance
- Cache expensive DB queries (top countries, protocol breakdown)
- Lazy-load historical data (only when scrolled into view)
- WebSocket option for push updates (reduce polling overhead)
- Configurable refresh interval via URL param or localStorage
### [ ] Features
- Historical graphs (24h, 7d) using stats_history table
- Per-ASN performance analysis
- Alert thresholds (success rate < X%, MITM detected)
- Mobile-responsive improvements
---
- [ ] Cache expensive DB queries (top countries, protocol breakdown)
- [ ] Historical graphs (24h, 7d) using stats_history table
- [ ] Per-ASN performance analysis
- [ ] Alert thresholds (success rate < X%, MITM detected)
- [ ] WebSocket push updates (reduce polling overhead)
- [ ] Mobile-responsive improvements
## Memory
- [ ] Lock consolidation - reduce per-proxy locks (260k LockType objects)
- [ ] Leaner state objects - reduce dict/list count per job
- [ ] Lock consolidation (260k LockType objects at scale)
- [ ] Leaner state objects per job
Memory scales linearly with queue (~4.5 KB/job). No leaks detected.
Optimize only if memory becomes a constraint.
Memory scales ~4.5 KB/job. No leaks detected. Optimize only if constrained.
---
## Source Pipeline
## Deprecation
### [x] Remove V1 worker protocol
Completed. Removed `--worker` flag, `worker_main()`, `claim_work()`,
`submit_results()`, `/api/work`, `/api/results`, and related config
options. `--worker` now routes to the URL-driven protocol.
---
- [ ] PasteBin/GitHub API scrapers for proxy lists
- [ ] Telegram channel scrapers (beyond t.me/s/ HTML)
- [ ] Source quality decay tracking (flag sources going stale)
- [ ] Deduplication of sources across different URL forms
## Known Issues
### [!] Podman Container Metadata Disappears
`podman ps -a` shows no containers even though the process is running. The
service functions correctly despite the missing metadata. Monitor via
`ss -tlnp`, `ps aux`, or `curl localhost:8081/health`. Low impact.
---
## Container Debugging Checklist
```
1. Check for orphans: ps aux | grep -E "[p]rocess_name"
2. Check port conflicts: ss -tlnp | grep PORT
3. Run foreground: podman run --rm (no -d) to see output
4. Check podman state: podman ps -a
5. Clean stale: pkill -9 -f "pattern" && podman rm -f -a
6. Verify deps: config files, data dirs, volumes exist
7. Check logs: podman logs container_name 2>&1 | tail -50
8. Health check: curl -sf http://localhost:PORT/health
```
`podman ps -a` shows no containers even though the process is running.
Monitor via `ss -tlnp`, `ps aux`, or `curl localhost:8081/health`.