ppf/documentation/design-worker-driven-discovery.md
Username 96e6f06e0d docs: add worker-driven discovery design doc
Architecture proposal to move proxy list fetching from master to
workers. Workers claim URLs, fetch lists, extract and test proxies,
report working proxies and URL health back to master. Trust-based
model: workers report working proxies only, no consensus needed.
2026-02-17 13:32:42 +01:00


# Design: Worker-Driven Discovery
## Status
**Proposal** -- Not yet implemented.
## Problem
The current architecture centralizes all proxy list fetching on the master
node (odin). Workers only test proxies handed to them. This creates several
issues:
1. **Single point of fetch** -- If odin can't reach a source (blocked IP,
transient failure), that source is dead for everyone.
2. **Bandwidth concentration** -- Odin fetches 40 proxy lists every cycle,
extracts proxies, deduplicates, and stores them before workers ever see
them.
3. **Wasted vantage points** -- Workers sit behind different Tor exits and
IPs, but never use that diversity for fetching.
4. **Tight coupling** -- Workers can't operate at all without the master's
claim queue. If odin restarts, all workers stall.
## Proposed Architecture
Move proxy list fetching to workers. Master becomes a coordinator and
aggregator rather than a fetcher.
```
Current:
  Master
    Fetch URLs --------+
    Extract proxies    |
    Store proxylist    |
    Serve /api/work ---+--> Workers
      <- /api/results

Proposed:
  Master
    Manage URL database
    Score URLs from feedback
    Aggregate working proxies
    Serve /api/claim-urls ----> Workers
      <- /api/report-urls
      <- /api/report-proxies
```
### Role Changes
```
+--------+--------------------------+-----------------------------------+
| Host   | Current Role             | New Role                          |
+--------+--------------------------+-----------------------------------+
| odin   | Fetch URLs               | Maintain URL database             |
|        | Extract proxies          | Score URLs from worker feedback   |
|        | Store proxylist          | Aggregate reported proxies        |
|        | Distribute proxy batches | Distribute URL batches            |
|        | Collect test results     | Collect URL + proxy reports       |
+--------+--------------------------+-----------------------------------+
| worker | Claim proxy batch        | Claim URL batch                   |
|        | Test each proxy          | Fetch URL, extract proxies        |
|        | Report pass/fail         | Test extracted proxies            |
|        |                          | Report URL health + proxy results |
```
## Data Flow
### Phase 1: URL Claiming
Worker requests a batch of URLs to process.
```
Worker                              Master
  |                                   |
  | GET /api/claim-urls               |
  |   ?key=...&count=5                |
  |---------------------------------->|
  |                                   | Select due URLs from uris table
  |                                   | Mark as claimed (in-memory)
  | [{url, last_hash, proto_hint}, ...]
  |<----------------------------------|
```
**Claim response:**
```json
{
  "worker_id": "abc123",
  "urls": [
    {
      "url": "https://raw.githubusercontent.com/.../http.txt",
      "last_hash": "a1b2c3d4...",
      "proto_hint": "http",
      "priority": 1
    }
  ]
}
```
Fields:
- `last_hash` -- MD5 of last extracted proxy list. Worker can skip
extraction and report "unchanged" if hash matches, saving CPU.
- `proto_hint` -- Protocol inferred from URL path. Worker uses this for
extraction confidence scoring.
- `priority` -- Higher = fetch sooner. Based on URL score.
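A worker-side client for this endpoint can be a thin HTTP wrapper. The sketch below assumes the query parameters and response shape described above; the urllib-based implementation itself is illustrative, not existing ppf code.

```python
import json
import urllib.parse
import urllib.request


def build_claim_url(server, key, count=5):
    """Build the claim-urls request URL (endpoint shape taken from this doc)."""
    query = urllib.parse.urlencode({'key': key, 'count': count})
    return '%s/api/claim-urls?%s' % (server.rstrip('/'), query)


def claim_urls(server, key, count=5, timeout=30):
    """Claim a batch of URLs; returns the list under the 'urls' key."""
    with urllib.request.urlopen(build_claim_url(server, key, count),
                                timeout=timeout) as resp:
        return json.load(resp).get('urls', [])
```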
### Phase 2: Fetch and Extract
Worker fetches each URL through Tor, extracts proxies using the existing
`fetch.extract_proxies()` pipeline.
```
Worker
  |
  | For each claimed URL:
  |   1. Fetch through Tor (fetch_contents)
  |   2. Compute content hash (MD5)
  |   3. If hash == last_hash: skip extraction, report unchanged
  |   4. Else: extract_proxies() -> list of (addr, proto, confidence)
  |   5. Queue extracted proxies for testing
  |
```
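Steps 2-3 above hinge on a cheap hash comparison. A minimal sketch (the function name is illustrative; the str-to-bytes handling is an assumption about what the fetch layer returns):

```python
import hashlib


def content_unchanged(content, last_hash):
    """Compare the fetched body's MD5 against the hash from the claim response.

    Returns (unchanged, current_hash) so the caller can report the new hash
    either way.
    """
    if isinstance(content, str):
        # md5 operates on bytes; encode if the fetch layer returned text
        content = content.encode('utf-8', 'replace')
    current = hashlib.md5(content).hexdigest()
    return current == last_hash, current
```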
### Phase 3: URL Feedback
Worker reports fetch results for each URL back to master.
```
Worker                              Master
  |                                   |
  | POST /api/report-urls             |
  |   {reports: [...]}                |
  |---------------------------------->|
  |                                   | Update uris table:
  |                                   |   check_time, error, stale_count,
  |                                   |   retrievals, proxies_added,
  |                                   |   content_hash, worker_scores
  | {ok: true}                        |
  |<----------------------------------|
```
**URL report payload:**
```json
{
  "reports": [
    {
      "url": "https://...",
      "success": true,
      "content_hash": "a1b2c3d4...",
      "proxy_count": 1523,
      "fetch_time_ms": 2340,
      "changed": true,
      "error": null
    },
    {
      "url": "https://...",
      "success": false,
      "content_hash": null,
      "proxy_count": 0,
      "fetch_time_ms": 0,
      "changed": false,
      "error": "timeout"
    }
  ]
}
```
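On the master side, folding these reports into the `uris` table might look like the following sketch. The column names come from the Phase 3 diagram; the `uri` key column name and the exact stale/error semantics (reset on change, increment otherwise) are assumptions.

```python
import time


def apply_url_reports(db, reports):
    """Fold worker URL reports into the uris table (schema assumed)."""
    now = int(time.time())
    for r in reports:
        if r['success']:
            db.execute('''
                UPDATE uris SET
                    check_time = ?, error = 0,
                    stale_count = CASE WHEN ? THEN 0 ELSE stale_count + 1 END,
                    retrievals = retrievals + 1,
                    proxies_added = proxies_added + ?,
                    content_hash = COALESCE(?, content_hash)
                WHERE uri = ?
            ''', (now, r['changed'], r['proxy_count'], r['content_hash'], r['url']))
        else:
            # Failed fetch: bump the consecutive-error counter
            db.execute('UPDATE uris SET check_time = ?, error = error + 1 WHERE uri = ?',
                       (now, r['url']))
    db.commit()
```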
### Phase 4: Proxy Testing and Reporting
Worker tests extracted proxies locally using the existing `TargetTestJob`
pipeline. **Only working proxies are reported to master.** Failed proxies
are discarded silently -- no point wasting bandwidth on negatives.
Workers are trusted. If a worker says a proxy works, master accepts it.
```
Worker                              Master
  |                                   |
  | Test proxies locally              |
  | (same TargetTestJob flow)         |
  | Discard failures                  |
  |                                   |
  | POST /api/report-proxies          |
  |   {proxies: [...]}                |  (working only)
  |---------------------------------->|
  |                                   | Upsert into proxylist:
  |                                   |   INSERT OR REPLACE
  |                                   |   Set failed=0, update last_seen
  | {ok: true}                        |
  |<----------------------------------|
```
**Proxy report payload (working only):**
```json
{
  "proxies": [
    {
      "ip": "1.2.3.4",
      "port": 8080,
      "proto": "socks5",
      "source_proto": "socks5",
      "latency": 1.234,
      "exit_ip": "5.6.7.8",
      "anonymity": "elite",
      "source_url": "https://..."
    }
  ]
}
```
No `working` field needed -- everything in the report is working by
definition. The `source_url` links proxy provenance to the URL that
yielded it, enabling URL quality scoring.
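The master's upsert can be sketched as below. It assumes a unique key over (ip, port, proto) so `INSERT OR REPLACE` deduplicates, and it stores the latest reported latency rather than the latency EMA mentioned later; extra payload fields are omitted for brevity.

```python
import time


def apply_proxy_reports(db, proxies):
    """Upsert working proxies: failed=0, last_seen=now, latest report wins.

    Assumes proxylist has a unique constraint on (ip, port, proto).
    """
    now = int(time.time())
    for p in proxies:
        db.execute('''
            INSERT OR REPLACE INTO proxylist
                (ip, port, proto, latency, failed, last_seen)
            VALUES (?, ?, ?, ?, 0, ?)
        ''', (p['ip'], p['port'], p['proto'], p['latency'], now))
    db.commit()
```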
### Complete Cycle
```
Worker main loop:
  1. GET /api/claim-urls        Claim batch of URLs
  2. For each URL:
     a. Fetch through Tor
     b. Extract proxies (or skip if unchanged)
     c. Test extracted proxies
  3. POST /api/report-urls      Report URL health
  4. POST /api/report-proxies   Report proxy results
  5. POST /api/heartbeat        Health check
  6. Sleep, repeat
```
## Master-Side Changes
### URL Scheduling
Current `Leechered` threads fetch URLs on a timer based on error/stale
count. Replace with a scoring system that workers consume.
**URL score** (higher = fetch sooner):
```
score = base_score
        + freshness_bonus   # High-frequency sources score higher
        - error_penalty     # Consecutive errors reduce score
        - stale_penalty     # Unchanged content reduces score
        + yield_bonus       # URLs that produce many proxies score higher
        + quality_bonus     # URLs whose proxies actually work score higher
```
Concrete formula:
```python
import time

def url_score(url_row):
    now = time.time()
    age = now - url_row.check_time
    base = age / max(url_row.check_interval, 1)    # 1.0 when due
    # Yield: proxies found per fetch (rolling average)
    yield_rate = url_row.proxies_added / max(url_row.retrievals, 1)
    yield_bonus = min(yield_rate / 100.0, 1.0)     # Cap at 1.0
    # Quality: what fraction of extracted proxies actually worked
    quality_bonus = url_row.working_ratio * 0.5    # 0.0 to 0.5
    # Penalties
    error_penalty = min(url_row.error * 0.3, 2.0)
    stale_penalty = min(url_row.stale_count * 0.1, 1.0)
    return base + yield_bonus + quality_bonus - error_penalty - stale_penalty
```
URLs with `score >= 1.0` are due for fetching. Claimed URLs are locked
in memory for `claim_timeout` seconds (existing pattern).
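The in-memory claim lock can be sketched as a small timeout map. The class name and the `now` parameter (a hook for deterministic testing) are illustrative; this is not the actual ppf implementation.

```python
import time


class ClaimLocks:
    """In-memory URL claim locks with a claim_timeout expiry."""

    def __init__(self, claim_timeout=300):
        self.claim_timeout = claim_timeout
        self._claims = {}                      # url -> (worker_id, claim_time)

    def try_claim(self, url, worker_id, now=None):
        now = time.time() if now is None else now
        holder = self._claims.get(url)
        if holder and now - holder[1] < self.claim_timeout:
            return holder[0] == worker_id      # still locked; only holder may re-claim
        self._claims[url] = (worker_id, now)   # free or expired: take it
        return True

    def release(self, url):
        self._claims.pop(url, None)
```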
### New uris Columns
```sql
ALTER TABLE uris ADD COLUMN check_interval INT DEFAULT 3600;
ALTER TABLE uris ADD COLUMN working_ratio REAL DEFAULT 0.0;
ALTER TABLE uris ADD COLUMN avg_fetch_time INT DEFAULT 0;
ALTER TABLE uris ADD COLUMN last_worker TEXT;
ALTER TABLE uris ADD COLUMN yield_rate REAL DEFAULT 0.0;
```
- `check_interval` -- Adaptive: decreases for high-yield URLs, increases
for stale/erroring ones. Replaces the `checktime + error * perfail`
formula with a persisted value.
- `working_ratio` -- EMA of (working_proxies / total_proxies) from worker
feedback. URLs that yield dead proxies get deprioritized.
- `avg_fetch_time` -- EMA of fetch duration in ms. Helps identify slow
sources.
- `last_worker` -- Which worker last fetched this URL. Useful for
debugging, and to distribute URLs across workers evenly.
- `yield_rate` -- EMA of proxies extracted per fetch.
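The EMA updates and the adaptive `check_interval` could be maintained as below. The smoothing factor, interval bounds, and growth/shrink factors are illustrative choices, not specified by this design.

```python
def ema(old, new, alpha=0.2):
    """Exponential moving average; alpha=0.2 is an illustrative smoothing factor."""
    return alpha * new + (1.0 - alpha) * old


def update_url_stats(row, working, total, fetch_ms,
                     min_interval=900, max_interval=86400):
    """Fold one worker report into the EMA columns and adapt check_interval.

    row is a dict mirroring the new uris columns; bounds/factors are assumptions.
    """
    if total > 0:
        row['working_ratio'] = ema(row['working_ratio'], working / total)
    row['avg_fetch_time'] = int(ema(row['avg_fetch_time'], fetch_ms))
    row['yield_rate'] = ema(row['yield_rate'], total)
    # High-yield fetches shorten the interval; barren fetches lengthen it
    factor = 0.8 if total > 0 else 1.5
    row['check_interval'] = int(min(max(row['check_interval'] * factor,
                                        min_interval), max_interval))
    return row
```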
### Proxy Aggregation
Trust model: **workers are trusted.** If any worker reports a proxy as
working, master accepts it. Failed proxies are never reported -- workers
discard them locally.
```
Worker A reports: 1.2.3.4:8080 working, latency 1.2s
Worker B reports: 1.2.3.4:8080 working, latency 1.5s

Master action:
  - INSERT OR REPLACE with latest report
  - Update last_seen, latency EMA
  - Set failed = 0
```
No consensus, no voting, no trust scoring. A proxy lives as long as at
least one worker keeps confirming it. It dies when nobody reports it for
`proxy_ttl` seconds.
New `proxylist` column:
```sql
ALTER TABLE proxylist ADD COLUMN last_seen INT DEFAULT 0;
```
- `last_seen` -- Unix timestamp of most recent "working" report. Proxies
not seen in N hours are expired by the master's periodic cleanup.
### Proxy Expiry
Working proxies that haven't been reported by any worker within
`proxy_ttl` (default: 4 hours) are marked stale and re-queued for
testing. After `proxy_ttl * 3` with no reports, they're marked failed.
```python
import time

def expire_stale_proxies(db, proxy_ttl):
    now = int(time.time())
    cutoff_stale = now - proxy_ttl
    cutoff_dead = now - (proxy_ttl * 3)
    # Mark stale proxies for retesting
    db.execute('''
        UPDATE proxylist SET failed = 1
        WHERE failed = 0 AND last_seen < ? AND last_seen > 0
    ''', (cutoff_stale,))
    # Kill proxies not seen in a long time
    db.execute('''
        UPDATE proxylist SET failed = -1
        WHERE failed > 0 AND last_seen < ? AND last_seen > 0
    ''', (cutoff_dead,))
    db.commit()
```
## Worker-Side Changes
### New Worker Loop
Replace the current claim-test-report loop with a two-phase loop:
```python
def worker_main_v2(config):
    register()
    verify_tor()
    while True:
        # Phase 1: Fetch URLs and extract proxies
        urls = claim_urls(server, key, count=5)
        url_reports = []
        proxy_batch = []
        for url_info in urls:
            report, proxies = fetch_and_extract(url_info)
            url_reports.append(report)
            proxy_batch.extend(proxies)
        report_urls(server, key, url_reports)
        # Phase 2: Test extracted proxies, report working only
        if proxy_batch:
            working = test_proxies(proxy_batch)
            if working:
                report_proxies(server, key, working)
        heartbeat(server, key)
        sleep(1)
```
### fetch_and_extract()
New function that combines fetching + extraction on the worker side:
```python
def fetch_and_extract(url_info):
    url = url_info['url']
    last_hash = url_info.get('last_hash')
    proto_hint = url_info.get('proto_hint')
    start = time.time()
    try:
        content = fetch_contents(url, head=False, proxy=tor_proxy)
    except Exception as e:
        # Failure report carries the same fields as the Phase 3 payload
        return {
            'url': url, 'success': False, 'content_hash': None,
            'proxy_count': 0, 'fetch_time_ms': 0,
            'changed': False, 'error': str(e)
        }, []
    elapsed = int((time.time() - start) * 1000)
    if isinstance(content, str):
        content = content.encode('utf-8', 'replace')  # md5 needs bytes
    content_hash = hashlib.md5(content).hexdigest()
    if content_hash == last_hash:
        return {
            'url': url, 'success': True, 'content_hash': content_hash,
            'proxy_count': 0, 'fetch_time_ms': elapsed,
            'changed': False, 'error': None
        }, []
    proxies = extract_proxies(content, url)
    return {
        'url': url, 'success': True, 'content_hash': content_hash,
        'proxy_count': len(proxies), 'fetch_time_ms': elapsed,
        'changed': True, 'error': None
    }, proxies
```
### Deduplication
Workers may extract the same proxies from different URLs. Local
deduplication before testing:
```python
seen = set()
unique = []
for addr, proto, confidence in proxy_batch:
    if addr not in seen:
        seen.add(addr)
        unique.append((addr, proto, confidence))
proxy_batch = unique
```
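The sketch above keeps the first occurrence of each address. An alternative worth considering (a design option, not existing code) keeps the highest-confidence entry when the same address appears under different protocols:

```python
def dedup_best(proxy_batch):
    """Keep one entry per address, preferring the highest extraction confidence."""
    best = {}
    for addr, proto, confidence in proxy_batch:
        if addr not in best or confidence > best[addr][2]:
            best[addr] = (addr, proto, confidence)
    return list(best.values())
```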
### Proxy Testing
Reuse the existing `TargetTestJob` / `WorkerThread` pipeline. The only
change: proxies come from local extraction instead of master's claim
response. The test loop, result collection, and evaluation logic remain
identical.
## API Changes Summary
### New Endpoints
| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/api/claim-urls` | GET | Worker claims batch of due URLs |
| `/api/report-urls` | POST | Worker reports URL fetch results |
| `/api/report-proxies` | POST | Worker reports proxy test results |
### Modified Endpoints
| Endpoint | Change |
|----------|--------|
| `/api/work` | Deprecated but kept for backward compatibility |
| `/api/results` | Deprecated but kept for backward compatibility |
### Unchanged Endpoints
| Endpoint | Reason |
|----------|--------|
| `/api/register` | Same registration flow |
| `/api/heartbeat` | Same health reporting |
| `/dashboard` | Still reads from same DB |
| `/proxies` | Still reads from proxylist |
## Schema Changes
### uris table additions
```sql
ALTER TABLE uris ADD COLUMN check_interval INT DEFAULT 3600;
ALTER TABLE uris ADD COLUMN working_ratio REAL DEFAULT 0.0;
ALTER TABLE uris ADD COLUMN avg_fetch_time INT DEFAULT 0;
ALTER TABLE uris ADD COLUMN last_worker TEXT;
ALTER TABLE uris ADD COLUMN yield_rate REAL DEFAULT 0.0;
```
### proxylist table additions
```sql
ALTER TABLE proxylist ADD COLUMN last_seen INT DEFAULT 0;
```
## Migration Strategy
### Phase 1: Add New Endpoints (non-breaking)
Add `/api/claim-urls`, `/api/report-urls`, `/api/report-proxies` to
`httpd.py`. Keep all existing endpoints working. Master still runs its
own `Leechered` threads.
Files: `httpd.py`, `dbs.py` (migrations)
### Phase 2: Worker V2 Mode
Add `--worker-v2` flag to `ppf.py`. When set, worker uses the new
URL-claiming loop instead of the proxy-claiming loop. Both modes coexist.
Old workers (`--worker`) continue working against `/api/work` and
`/api/results`. New workers (`--worker-v2`) use the new endpoints.
Files: `ppf.py`, `config.py`
### Phase 3: URL Scoring
Implement URL scoring in master based on worker feedback. Replace
`Leechered` timer-based scheduling with score-based scheduling. Master's
own fetching becomes a fallback for URLs no worker has claimed recently.
Files: `httpd.py`, `dbs.py`
### Phase 4: Remove Legacy
Once all workers run V2, remove `/api/work`, `/api/results`, and
master-side `Leechered` threads. Master no longer fetches proxy lists
directly.
Files: `ppf.py`, `httpd.py`
## Configuration
### New config.ini Options
```ini
[worker]
# V2 mode: worker fetches URLs instead of proxy batches
mode = v2                    # v1 (legacy) or v2 (url-driven)
url_batch_size = 5           # URLs per claim cycle
max_proxies_per_cycle = 500  # Cap on proxies tested per cycle
fetch_timeout = 30           # Timeout for URL fetching (seconds)

[ppf]
# URL scoring weights
score_yield_weight = 1.0
score_quality_weight = 0.5
score_error_penalty = 0.3
score_stale_penalty = 0.1

# Proxy expiry
proxy_ttl = 14400            # Seconds before unseen proxy goes stale (4h)
proxy_ttl_dead = 43200       # Seconds before unseen proxy is killed (12h)

# Fallback: master fetches URLs not claimed by any worker
fallback_fetch = true
fallback_interval = 7200     # Seconds before master fetches unclaimed URL
```
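One practical note: Python's stock `configparser` ignores `#` inline comments unless given an explicit prefix. A minimal reader for the options above (function name and fallback defaults are illustrative):

```python
import configparser


def load_ppf_config(path):
    """Read the v2 options; inline '#' comments require an explicit prefix."""
    cp = configparser.ConfigParser(inline_comment_prefixes=('#', ';'))
    cp.read(path)
    return {
        'mode': cp.get('worker', 'mode', fallback='v1'),
        'url_batch_size': cp.getint('worker', 'url_batch_size', fallback=5),
        'proxy_ttl': cp.getint('ppf', 'proxy_ttl', fallback=14400),
        'fallback_fetch': cp.getboolean('ppf', 'fallback_fetch', fallback=True),
    }
```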
## Risks and Mitigations
| Risk | Impact | Mitigation |
|------|--------|------------|
| Workers extract different proxy counts from same URL | Inconsistent proxy_count in reports | Use content_hash for dedup; only update yield_rate when hash changes |
| Tor exit blocks a source for one worker | Worker reports error for a working URL | Require 2+ consecutive errors before incrementing URL error count |
| Workers test same proxies redundantly | Wasted CPU | Master tracks which URLs are assigned to which workers; avoid assigning same URL to multiple workers in same cycle |
| Large proxy lists overwhelm worker memory | OOM on worker | Cap `max_proxies_per_cycle`; worker discards excess after dedup |
| Master restart loses claim state | Workers refetch recently-fetched URLs | Harmless -- just a redundant fetch. content_hash prevents duplicate work |
| `fetch.py` imports unavailable on worker image | ImportError | Verify worker Dockerfile includes fetch.py and dependencies |
## What Stays the Same
- `rocksock.py` -- No changes to proxy chain logic
- `connection_pool.py` -- Tor host selection unchanged
- `proxywatchd.py` core -- `TargetTestJob`, `WorkerThread`, `ProxyTestState`
remain identical. Only the job source changes.
- `fetch.py` -- Used on workers now, but the code itself doesn't change
- `httpd.py` dashboard/proxies -- Still reads from same `proxylist` table
- SQLite as storage -- No database engine change
## Open Questions
1. **Should workers share extracted proxy lists with each other?** Peer
exchange would reduce redundant fetching but adds protocol complexity.
Recommendation: no, keep it simple. Master deduplicates via
`INSERT OR REPLACE`.
2. **Should URL claiming be weighted by worker geography?** Some sources
may be accessible from certain Tor exits but not others.
Recommendation: defer. Let natural retries handle this; track
per-worker URL success rates for future optimization.
3. **What's the right `proxy_ttl`?** Too short and we churn proxies
needlessly. Too long and we serve stale data. Start with 4 hours,
tune based on observed proxy lifetime distribution.