# Design: Worker-Driven Discovery

## Status

**Proposal** -- Not yet implemented.

## Problem

The current architecture centralizes all proxy list fetching on the master
node (odin). Workers only test proxies handed to them. This creates several
issues:

1. **Single point of fetch** -- If odin can't reach a source (blocked IP,
   transient failure), that source is dead for everyone.
2. **Bandwidth concentration** -- Odin fetches 40 proxy lists every cycle,
   extracts proxies, deduplicates, and stores them before workers ever see
   them.
3. **Wasted vantage points** -- Workers sit behind different Tor exits and
   IPs, but never use that diversity for fetching.
4. **Tight coupling** -- Workers can't operate at all without the master's
   claim queue. If odin restarts, all workers stall.

## Proposed Architecture

Move proxy list fetching to workers. Master becomes a coordinator and
aggregator rather than a fetcher.

```
Current:                            Proposed:

Master                              Master
  Fetch UR8s --------+                Manage URL database
  Extract proxies    |                Score URLs from feedback
  Store proxylist    |                Aggregate working proxies
  Serve /api/work ---+-> Workers      Serve /api/claim-urls ----> Workers
  <- /api/results                     <- /api/report-urls ------+
                                      <- /api/report-proxies ---+
```

### Role Changes

```
+--------+---------------------------+----------------------------------+
| Host   | Current Role              | New Role                         |
+--------+---------------------------+----------------------------------+
| odin   | Fetch URLs                | Maintain URL database            |
|        | Extract proxies           | Score URLs from worker feedback  |
|        | Store proxylist           | Aggregate reported proxies       |
|        | Distribute proxy batches  | Distribute URL batches           |
|        | Collect test results      | Collect URL + proxy reports      |
+--------+---------------------------+----------------------------------+
| worker | Claim proxy batch         | Claim URL batch                  |
|        | Test each proxy           | Fetch URL, extract proxies       |
|        | Report pass/fail          | Test extracted proxies           |
|        |                           | Report URL health + proxy results|
+--------+---------------------------+----------------------------------+
```

## Data Flow

### Phase 1: URL Claiming

Worker requests a batch of URLs to process.

```
Worker                          Master
  |                               |
  |  GET /api/claim-urls          |
  |  ?key=...&count=5             |
  |------------------------------>|
  |                               |  Select due URLs from uris table
  |                               |  Mark as claimed (in-memory)
  |  [{url, last_hash, proto_hint}, ...]
  |<------------------------------|
```

**Claim response:**

```json
{
  "worker_id": "abc123",
  "urls": [
    {
      "url": "https://raw.githubusercontent.com/.../http.txt",
      "last_hash": "a1b2c3d4...",
      "proto_hint": "http",
      "priority": 1
    }
  ]
}
```


Fields:

- `last_hash` -- MD5 of last extracted proxy list. Worker can skip
  extraction and report "unchanged" if hash matches, saving CPU.
- `proto_hint` -- Protocol inferred from URL path. Worker uses this for
  extraction confidence scoring.
- `priority` -- Higher = fetch sooner. Based on URL score.

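The worker side of this exchange could be a thin HTTP helper; a minimal sketch, assuming stdlib `urllib` (the endpoint and the `key`/`count` query parameters come from the design above, while the function names and timeout are illustrative):

```python
# Hypothetical worker-side claim helper. Only the /api/claim-urls endpoint
# and its key/count query parameters come from the design; names and the
# timeout default are assumptions.
import json
import urllib.parse
import urllib.request

def build_claim_url(server, key, count):
    """Assemble the claim request URL from the documented parameters."""
    query = urllib.parse.urlencode({'key': key, 'count': count})
    return '%s/api/claim-urls?%s' % (server, query)

def claim_urls(server, key, count=5, timeout=30):
    """Claim a batch of due URLs; returns the parsed `urls` list."""
    with urllib.request.urlopen(build_claim_url(server, key, count),
                                timeout=timeout) as resp:
        return json.load(resp).get('urls', [])
```
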
### Phase 2: Fetch and Extract

Worker fetches each URL through Tor and extracts proxies using the existing
`fetch.extract_proxies()` pipeline.

```
Worker
  |
  |  For each claimed URL:
  |    1. Fetch through Tor (fetch_contents)
  |    2. Compute content hash (MD5)
  |    3. If hash == last_hash: skip extraction, report unchanged
  |    4. Else: extract_proxies() -> list of (addr, proto, confidence)
  |    5. Queue extracted proxies for testing
  |
```

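The real extraction logic lives in `fetch.extract_proxies()`; as a rough illustration of what step 4 produces, a minimal stand-in might look like this (the regex and the fixed 1.0 confidence are assumptions, not the actual pipeline):

```python
# Minimal stand-in for the extraction step -- the real logic is
# fetch.extract_proxies(). The regex and fixed confidence are illustrative.
import re

PROXY_RE = re.compile(rb'(\d{1,3}(?:\.\d{1,3}){3}):(\d{2,5})')

def extract_proxies_minimal(content, proto_hint='http'):
    """Scan raw bytes for ip:port pairs, tagged with the protocol hint."""
    return [('%s:%s' % (ip.decode(), port.decode()), proto_hint, 1.0)
            for ip, port in PROXY_RE.findall(content)]
```
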
### Phase 3: URL Feedback

Worker reports fetch results for each URL back to master.

```
Worker                          Master
  |                               |
  |  POST /api/report-urls        |
  |  {reports: [...]}             |
  |------------------------------>|
  |                               |  Update uris table:
  |                               |    check_time, error, stale_count,
  |                               |    retrievals, proxies_added,
  |                               |    content_hash, worker_scores
  |  {ok: true}                   |
  |<------------------------------|
```

**URL report payload:**

```json
{
  "reports": [
    {
      "url": "https://...",
      "success": true,
      "content_hash": "a1b2c3d4...",
      "proxy_count": 1523,
      "fetch_time_ms": 2340,
      "changed": true,
      "error": null
    },
    {
      "url": "https://...",
      "success": false,
      "content_hash": null,
      "proxy_count": 0,
      "fetch_time_ms": 0,
      "changed": false,
      "error": "timeout"
    }
  ]
}
```

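A sketch of the reporting helper, assuming the same stdlib-HTTP style as the rest of the worker (the endpoint and report fields follow the payload above; the function names and the `key` query parameter placement are assumptions):

```python
# Hypothetical reporter for Phase 3. The endpoint and report fields mirror
# the documented payload; names and key placement are assumptions.
import json
import urllib.request

def build_url_report(url, success, content_hash=None, proxy_count=0,
                     fetch_time_ms=0, changed=False, error=None):
    """One entry of the `reports` array, with the documented fields."""
    return {'url': url, 'success': success, 'content_hash': content_hash,
            'proxy_count': proxy_count, 'fetch_time_ms': fetch_time_ms,
            'changed': changed, 'error': error}

def report_urls(server, key, reports, timeout=30):
    """POST the batch of URL reports; returns the server's ok flag."""
    req = urllib.request.Request(
        '%s/api/report-urls?key=%s' % (server, key),
        data=json.dumps({'reports': reports}).encode(),
        headers={'Content-Type': 'application/json'})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp).get('ok', False)
```
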
### Phase 4: Proxy Testing and Reporting

Worker tests extracted proxies locally using the existing `TargetTestJob`
pipeline. **Only working proxies are reported to master.** Failed proxies
are discarded silently -- no point wasting bandwidth on negatives.

Workers are trusted. If a worker says a proxy works, master accepts it.

```
Worker                          Master
  |                               |
  |  Test proxies locally         |
  |  (same TargetTestJob flow)    |
  |  Discard failures             |
  |                               |
  |  POST /api/report-proxies     |
  |  {proxies: [...]}             |  (working only)
  |------------------------------>|
  |                               |  Upsert into proxylist:
  |                               |    INSERT OR REPLACE
  |                               |    Set failed=0, update last_seen
  |  {ok: true}                   |
  |<------------------------------|
```

**Proxy report payload (working only):**

```json
{
  "proxies": [
    {
      "ip": "1.2.3.4",
      "port": 8080,
      "proto": "socks5",
      "source_proto": "socks5",
      "latency": 1.234,
      "exit_ip": "5.6.7.8",
      "anonymity": "elite",
      "source_url": "https://..."
    }
  ]
}
```

No `working` field is needed -- everything in the report is working by
definition. The `source_url` links proxy provenance to the URL that
yielded it, enabling URL quality scoring.

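Because every reported proxy carries its `source_url`, the master can fold a report batch into per-URL working counts, the raw input for URL quality scoring. A minimal sketch (the helper name is hypothetical):

```python
# Fold a /api/report-proxies batch into per-URL working-proxy counts.
# Helper name is hypothetical; the source_url field is from the payload.
from collections import Counter

def working_by_source(proxies):
    """Count reported working proxies per source_url."""
    return Counter(p['source_url'] for p in proxies)
```
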
### Complete Cycle

```
Worker main loop:
  1. GET /api/claim-urls         Claim batch of URLs
  2. For each URL:
       a. Fetch through Tor
       b. Extract proxies (or skip if unchanged)
       c. Test extracted proxies
  3. POST /api/report-urls       Report URL health
  4. POST /api/report-proxies    Report proxy results
  5. POST /api/heartbeat         Health check
  6. Sleep, repeat
```

## Master-Side Changes

### URL Scheduling

Current `Leechered` threads fetch URLs on a timer based on error/stale
count. Replace with a scoring system that workers consume.

**URL score** (higher = fetch sooner):

```
score = base_score
      + freshness_bonus   # High-frequency sources score higher
      - error_penalty     # Consecutive errors reduce score
      - stale_penalty     # Unchanged content reduces score
      + yield_bonus       # URLs that produce many proxies score higher
      + quality_bonus     # URLs whose proxies actually work score higher
```

Concrete formula:

```python
def url_score(url_row):
    age = now - url_row.check_time
    base = age / url_row.check_interval  # 1.0 when due

    # Yield: proxies found per fetch (rolling average)
    yield_rate = url_row.proxies_added / max(url_row.retrievals, 1)
    yield_bonus = min(yield_rate / 100.0, 1.0)  # Cap at 1.0

    # Quality: what % of extracted proxies actually worked
    quality_bonus = url_row.working_ratio * 0.5  # 0.0 to 0.5

    # Penalties
    error_penalty = min(url_row.error * 0.3, 2.0)
    stale_penalty = min(url_row.stale_count * 0.1, 1.0)

    return base + yield_bonus + quality_bonus - error_penalty - stale_penalty
```

URLs with `score >= 1.0` are due for fetching. Claimed URLs are locked
in memory for `claim_timeout` seconds (existing pattern).

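A worked example of the formula, with illustrative numbers rather than real data: a URL last fetched two hours ago on a one-hour interval, averaging ~120 proxies per fetch, 40% of them working, no errors, two consecutive unchanged fetches.

```python
# Worked example of url_score() with illustrative numbers.
import time

now = time.time()
check_time, check_interval = now - 7200, 3600   # 2h overdue on a 1h interval
proxies_added, retrievals = 12000, 100           # ~120 proxies per fetch
working_ratio, error, stale_count = 0.4, 0, 2

base = (now - check_time) / check_interval                    # 2.0
yield_bonus = min((proxies_added / retrievals) / 100.0, 1.0)  # 1.2 capped to 1.0
quality_bonus = working_ratio * 0.5                           # 0.2
error_penalty = min(error * 0.3, 2.0)                         # 0.0
stale_penalty = min(stale_count * 0.1, 1.0)                   # 0.2
score = base + yield_bonus + quality_bonus - error_penalty - stale_penalty
# score ~ 3.0, well past the 1.0 "due" threshold
```
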
### New uris Columns

```sql
ALTER TABLE uris ADD COLUMN check_interval INT DEFAULT 3600;
ALTER TABLE uris ADD COLUMN working_ratio REAL DEFAULT 0.0;
ALTER TABLE uris ADD COLUMN avg_fetch_time INT DEFAULT 0;
ALTER TABLE uris ADD COLUMN last_worker TEXT;
ALTER TABLE uris ADD COLUMN yield_rate REAL DEFAULT 0.0;
```

- `check_interval` -- Adaptive: decreases for high-yield URLs, increases
  for stale/erroring ones. Replaces the `checktime + error * perfail`
  formula with a persisted value.
- `working_ratio` -- EMA of (working_proxies / total_proxies) from worker
  feedback. URLs that yield dead proxies get deprioritized.
- `avg_fetch_time` -- EMA of fetch duration in ms. Helps identify slow
  sources.
- `last_worker` -- Which worker last fetched this URL. Useful for
  debugging, and to distribute URLs across workers evenly.
- `yield_rate` -- EMA of proxies extracted per fetch.

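The three EMA columns could share one update helper; a minimal sketch, assuming a smoothing factor of 0.2 (the alpha value is a guess, and seeding from the first sample matches the columns' 0 defaults):

```python
# Shared EMA step for working_ratio / avg_fetch_time / yield_rate.
# alpha=0.2 is an assumed smoothing factor; a zero previous value is
# treated as unseeded, since the columns default to 0.
def ema(prev, sample, alpha=0.2):
    """One exponential-moving-average update step."""
    if prev == 0:
        return sample
    return alpha * sample + (1 - alpha) * prev
```
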
### Proxy Aggregation

Trust model: **workers are trusted.** If any worker reports a proxy as
working, master accepts it. Failed proxies are never reported -- workers
discard them locally.

```
Worker A reports: 1.2.3.4:8080 working, latency 1.2s
Worker B reports: 1.2.3.4:8080 working, latency 1.5s

Master action:
  - INSERT OR REPLACE with latest report
  - Update last_seen, latency EMA
  - Set failed = 0
```

No consensus, no voting, no trust scoring. A proxy lives as long as at
least one worker keeps confirming it. It dies when nobody reports it for
`proxy_ttl` seconds.

New `proxylist` column:

```sql
ALTER TABLE proxylist ADD COLUMN last_seen INT DEFAULT 0;
```

- `last_seen` -- Unix timestamp of most recent "working" report. Proxies
  not seen in N hours are expired by the master's periodic cleanup.

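The upsert itself might be sketched like this, assuming SQLite with a primary key on (ip, port); the latency EMA step is omitted for brevity, and column names beyond those in the payload are assumptions about the `proxylist` schema:

```python
# Master-side upsert sketch for /api/report-proxies: INSERT OR REPLACE,
# failed reset to 0, last_seen stamped. Column names beyond the payload
# fields are schema assumptions; the latency EMA update is omitted.
import sqlite3
import time

def upsert_working(db, proxies):
    """Apply a batch of working-proxy reports to the proxylist table."""
    now = int(time.time())
    for p in proxies:
        db.execute('''
            INSERT OR REPLACE INTO proxylist
                (ip, port, proto, latency, failed, last_seen)
            VALUES (?, ?, ?, ?, 0, ?)
        ''', (p['ip'], p['port'], p['proto'], p['latency'], now))
    db.commit()
```
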
### Proxy Expiry

Working proxies that haven't been reported by any worker within
`proxy_ttl` (default: 4 hours) are marked stale and re-queued for
testing. After `proxy_ttl * 3` with no reports, they're marked failed.

```python
def expire_stale_proxies(db, proxy_ttl):
    cutoff_stale = now - proxy_ttl
    cutoff_dead = now - (proxy_ttl * 3)

    # Mark stale proxies for retesting
    db.execute('''
        UPDATE proxylist SET failed = 1
        WHERE failed = 0 AND last_seen < ? AND last_seen > 0
    ''', (cutoff_stale,))

    # Kill proxies not seen in a long time
    db.execute('''
        UPDATE proxylist SET failed = -1
        WHERE failed > 0 AND last_seen < ? AND last_seen > 0
    ''', (cutoff_dead,))
```

## Worker-Side Changes

### New Worker Loop

Replace the current claim-test-report loop with a two-phase loop:

```python
def worker_main_v2(config):
    register()
    verify_tor()

    while True:
        # Phase 1: Fetch URLs and extract proxies
        urls = claim_urls(server, key, count=5)
        url_reports = []
        proxy_batch = []

        for url_info in urls:
            report, proxies = fetch_and_extract(url_info)
            url_reports.append(report)
            proxy_batch.extend(proxies)

        report_urls(server, key, url_reports)

        # Phase 2: Test extracted proxies, report working only
        if proxy_batch:
            working = test_proxies(proxy_batch)
            if working:
                report_proxies(server, key, working)

        heartbeat(server, key)
        sleep(1)
```

### fetch_and_extract()

New function that combines fetching and extraction on the worker side.
Note the error path returns the same report shape as the success path, so
it matches the `/api/report-urls` payload:

```python
def fetch_and_extract(url_info):
    url = url_info['url']
    last_hash = url_info.get('last_hash')
    proto_hint = url_info.get('proto_hint')

    start = time.time()
    try:
        content = fetch_contents(url, head=False, proxy=tor_proxy)
    except Exception as e:
        return {
            'url': url, 'success': False, 'content_hash': None,
            'proxy_count': 0, 'fetch_time_ms': 0,
            'changed': False, 'error': str(e)
        }, []

    elapsed = int((time.time() - start) * 1000)
    content_hash = hashlib.md5(content).hexdigest()

    if content_hash == last_hash:
        return {
            'url': url, 'success': True, 'content_hash': content_hash,
            'proxy_count': 0, 'fetch_time_ms': elapsed,
            'changed': False, 'error': None
        }, []

    proxies = extract_proxies(content, url)
    return {
        'url': url, 'success': True, 'content_hash': content_hash,
        'proxy_count': len(proxies), 'fetch_time_ms': elapsed,
        'changed': True, 'error': None
    }, proxies
```

### Deduplication

Workers may extract the same proxies from different URLs. Deduplicate
locally before testing:

```python
seen = set()
unique = []
for addr, proto, confidence in proxy_batch:
    if addr not in seen:
        seen.add(addr)
        unique.append((addr, proto, confidence))
proxy_batch = unique
```

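If the same address shows up with different confidence scores from different sources, a dict-based variant of the loop above can keep the best-scoring entry instead of the first one seen:

```python
# Variant of the dedup above: keep the highest-confidence entry per
# address rather than the first one encountered.
def dedup_best(proxy_batch):
    best = {}
    for addr, proto, confidence in proxy_batch:
        if addr not in best or confidence > best[addr][2]:
            best[addr] = (addr, proto, confidence)
    return list(best.values())
```
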
### Proxy Testing

Reuse the existing `TargetTestJob` / `WorkerThread` pipeline. The only
change: proxies come from local extraction instead of master's claim
response. The test loop, result collection, and evaluation logic remain
identical.

## API Changes Summary

### New Endpoints

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/api/claim-urls` | GET | Worker claims batch of due URLs |
| `/api/report-urls` | POST | Worker reports URL fetch results |
| `/api/report-proxies` | POST | Worker reports proxy test results |

### Modified Endpoints

| Endpoint | Change |
|----------|--------|
| `/api/work` | Deprecated but kept for backward compatibility |
| `/api/results` | Deprecated but kept for backward compatibility |

### Unchanged Endpoints

| Endpoint | Reason |
|----------|--------|
| `/api/register` | Same registration flow |
| `/api/heartbeat` | Same health reporting |
| `/dashboard` | Still reads from same DB |
| `/proxies` | Still reads from proxylist |

## Schema Changes

### uris table additions

```sql
ALTER TABLE uris ADD COLUMN check_interval INT DEFAULT 3600;
ALTER TABLE uris ADD COLUMN working_ratio REAL DEFAULT 0.0;
ALTER TABLE uris ADD COLUMN avg_fetch_time INT DEFAULT 0;
ALTER TABLE uris ADD COLUMN last_worker TEXT;
ALTER TABLE uris ADD COLUMN yield_rate REAL DEFAULT 0.0;
```

### proxylist table additions

```sql
ALTER TABLE proxylist ADD COLUMN last_seen INT DEFAULT 0;
```

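Since these ALTERs land via `dbs.py` migrations and may run more than once, a guard that checks `PRAGMA table_info` keeps them idempotent; a sketch (the helper names are hypothetical):

```python
# Idempotent ALTER helper in the dbs.py-migration style: add a column only
# when it is missing, so re-running migrations is harmless. Helper names
# are hypothetical.
import sqlite3

def add_column_if_missing(db, table, column, decl):
    cols = [row[1] for row in db.execute('PRAGMA table_info(%s)' % table)]
    if column not in cols:
        db.execute('ALTER TABLE %s ADD COLUMN %s %s' % (table, column, decl))

def migrate(db):
    add_column_if_missing(db, 'uris', 'check_interval', 'INT DEFAULT 3600')
    add_column_if_missing(db, 'uris', 'working_ratio', 'REAL DEFAULT 0.0')
    add_column_if_missing(db, 'uris', 'avg_fetch_time', 'INT DEFAULT 0')
    add_column_if_missing(db, 'uris', 'last_worker', 'TEXT')
    add_column_if_missing(db, 'uris', 'yield_rate', 'REAL DEFAULT 0.0')
    add_column_if_missing(db, 'proxylist', 'last_seen', 'INT DEFAULT 0')
```
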
## Migration Strategy

### Phase 1: Add New Endpoints (non-breaking)

Add `/api/claim-urls`, `/api/report-urls`, and `/api/report-proxies` to
`httpd.py`. Keep all existing endpoints working. Master still runs its
own `Leechered` threads.

Files: `httpd.py`, `dbs.py` (migrations)

### Phase 2: Worker V2 Mode

Add a `--worker-v2` flag to `ppf.py`. When set, the worker uses the new
URL-claiming loop instead of the proxy-claiming loop. Both modes coexist.

Old workers (`--worker`) continue working against `/api/work` and
`/api/results`. New workers (`--worker-v2`) use the new endpoints.

Files: `ppf.py`, `config.py`

### Phase 3: URL Scoring

Implement URL scoring in master based on worker feedback. Replace
`Leechered` timer-based scheduling with score-based scheduling. Master's
own fetching becomes a fallback for URLs no worker has claimed recently.

Files: `httpd.py`, `dbs.py`

### Phase 4: Remove Legacy

Once all workers run V2, remove `/api/work`, `/api/results`, and the
master-side `Leechered` threads. Master no longer fetches proxy lists
directly.

Files: `ppf.py`, `httpd.py`

## Configuration

### New config.ini Options

```ini
[worker]
# V2 mode: worker fetches URLs instead of proxy batches
mode = v2                    # v1 (legacy) or v2 (url-driven)
url_batch_size = 5           # URLs per claim cycle
max_proxies_per_cycle = 500  # Cap on proxies tested per cycle
fetch_timeout = 30           # Timeout for URL fetching (seconds)

[ppf]
# URL scoring weights
score_yield_weight = 1.0
score_quality_weight = 0.5
score_error_penalty = 0.3
score_stale_penalty = 0.1

# Proxy expiry
proxy_ttl = 14400            # Seconds before unseen proxy goes stale (4h)
proxy_ttl_dead = 43200       # Seconds before unseen proxy is killed (12h)

# Fallback: master fetches URLs not claimed by any worker
fallback_fetch = true
fallback_interval = 7200     # Seconds before master fetches unclaimed URL
```

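Read with stdlib `configparser`, these options might be loaded as follows; note `inline_comment_prefixes`, needed because the sample puts `#` comments after values, and fallbacks mirroring the proposed defaults (the loading code itself is a sketch, not existing `config.py`):

```python
# Sketch of loading the proposed options. inline_comment_prefixes is
# required because the sample config comments after values; fallbacks
# mirror the documented defaults.
import configparser

SAMPLE = """
[worker]
mode = v2                    # v1 (legacy) or v2 (url-driven)
url_batch_size = 5           # URLs per claim cycle

[ppf]
proxy_ttl = 14400            # Seconds before unseen proxy goes stale (4h)
fallback_fetch = true
"""

cfg = configparser.ConfigParser(inline_comment_prefixes=('#',))
cfg.read_string(SAMPLE)      # real code would use cfg.read('config.ini')

mode = cfg.get('worker', 'mode', fallback='v1')
url_batch_size = cfg.getint('worker', 'url_batch_size', fallback=5)
proxy_ttl = cfg.getint('ppf', 'proxy_ttl', fallback=14400)
fallback_fetch = cfg.getboolean('ppf', 'fallback_fetch', fallback=True)
```
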
## Risks and Mitigations

| Risk | Impact | Mitigation |
|------|--------|------------|
| Workers extract different proxy counts from same URL | Inconsistent proxy_count in reports | Use content_hash for dedup; only update yield_rate when hash changes |
| Tor exit blocks a source for one worker | Worker reports error for a working URL | Require 2+ consecutive errors before incrementing URL error count |
| Workers test same proxies redundantly | Wasted CPU | Master tracks which URLs are assigned to which workers; avoid assigning same URL to multiple workers in same cycle |
| Large proxy lists overwhelm worker memory | OOM on worker | Cap `max_proxies_per_cycle`; worker discards excess after dedup |
| Master restart loses claim state | Workers refetch recently-fetched URLs | Harmless -- just a redundant fetch; content_hash prevents duplicate work |
| `fetch.py` imports unavailable on worker image | ImportError | Verify worker Dockerfile includes fetch.py and dependencies |

## What Stays the Same

- `rocksock.py` -- No changes to proxy chain logic
- `connection_pool.py` -- Tor host selection unchanged
- `proxywatchd.py` core -- `TargetTestJob`, `WorkerThread`, and
  `ProxyTestState` remain identical. Only the job source changes.
- `fetch.py` -- Used on workers now, but the code itself doesn't change
- `httpd.py` dashboard/proxies -- Still reads from the same `proxylist` table
- SQLite as storage -- No database engine change

## Open Questions

1. **Should workers share extracted proxy lists with each other?** Peer
   exchange would reduce redundant fetching but adds protocol complexity.
   Recommendation: no, keep it simple. Master deduplicates via
   `INSERT OR REPLACE`.

2. **Should URL claiming be weighted by worker geography?** Some sources
   may be accessible from certain Tor exits but not others.
   Recommendation: defer. Let natural retries handle this; track
   per-worker URL success rates for future optimization.

3. **What's the right `proxy_ttl`?** Too short and we churn proxies
   needlessly. Too long and we serve stale data. Start with 4 hours and
   tune based on the observed proxy lifetime distribution.