ppf/documentation/design-worker-driven-discovery.md
96e6f06e0d docs: add worker-driven discovery design doc
Architecture proposal to move proxy list fetching from master to
workers. Workers claim URLs, fetch lists, extract and test proxies,
report working proxies and URL health back to master. Trust-based
model: workers report working proxies only, no consensus needed.
2026-02-17 13:32:42 +01:00


Design: Worker-Driven Discovery

Status

Proposal -- Not yet implemented.

Problem

The current architecture centralizes all proxy list fetching on the master node (odin). Workers only test proxies handed to them. This creates several issues:

  1. Single point of fetch -- If odin can't reach a source (blocked IP, transient failure), that source is dead for everyone.
  2. Bandwidth concentration -- Odin fetches 40 proxy lists every cycle, extracts proxies, deduplicates, and stores them before workers ever see them.
  3. Wasted vantage points -- Workers sit behind different Tor exits and IPs, but never use that diversity for fetching.
  4. Tight coupling -- Workers can't operate at all without the master's claim queue. If odin restarts, all workers stall.

Proposed Architecture

Move proxy list fetching to workers. Master becomes a coordinator and aggregator rather than a fetcher.

Current:                             Proposed:

Master                               Master
  Fetch URLs --------+                  Manage URL database
  Extract proxies    |                  Score URLs from feedback
  Store proxylist    |                  Aggregate working proxies
  Serve /api/work ---+-> Workers        Serve /api/claim-urls ----> Workers
                     <- /api/results                              |
                                        <- /api/report-urls ------+
                                        <- /api/report-proxies ---+

Role Changes

+--------+---------------------------+----------------------------------+
| Host   | Current Role              | New Role                         |
+--------+---------------------------+----------------------------------+
| odin   | Fetch URLs                | Maintain URL database            |
|        | Extract proxies           | Score URLs from worker feedback  |
|        | Store proxylist           | Aggregate reported proxies       |
|        | Distribute proxy batches  | Distribute URL batches           |
|        | Collect test results      | Collect URL + proxy reports      |
+--------+---------------------------+----------------------------------+
| worker | Claim proxy batch         | Claim URL batch                  |
|        | Test each proxy           | Fetch URL, extract proxies       |
|        | Report pass/fail          | Test extracted proxies           |
|        |                           | Report URL health + proxy results|
+--------+---------------------------+----------------------------------+

Data Flow

Phase 1: URL Claiming

Worker requests a batch of URLs to process.

Worker                          Master
  |                               |
  |  GET /api/claim-urls          |
  |  ?key=...&count=5             |
  |------------------------------>|
  |                               |  Select due URLs from uris table
  |                               |  Mark as claimed (in-memory)
  |  [{url, last_hash, proto_hint}, ...]
  |<------------------------------|

Claim response:

{
  "worker_id": "abc123",
  "urls": [
    {
      "url": "https://raw.githubusercontent.com/.../http.txt",
      "last_hash": "a1b2c3d4...",
      "proto_hint": "http",
      "priority": 1
    }
  ]
}

Fields:

  • last_hash -- MD5 of the content from the last successful fetch. If freshly fetched content hashes to the same value, the worker skips extraction and reports "unchanged", saving CPU.
  • proto_hint -- Protocol inferred from URL path. Worker uses this for extraction confidence scoring.
  • priority -- Higher = fetch sooner. Based on URL score.
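
A worker-side sketch of consuming this response, sorting claimed URLs so higher-priority entries are fetched first (the helper name and example URLs are hypothetical, not part of the current codebase):

```python
import json

def parse_claim_response(raw):
    """Parse a /api/claim-urls response body and order URLs by priority.

    Hypothetical helper: field names follow the example payload above.
    """
    data = json.loads(raw)
    urls = data.get("urls", [])
    # Higher priority first; ties keep server order (sort is stable).
    return sorted(urls, key=lambda u: u.get("priority", 0), reverse=True)

raw = '''{"worker_id": "abc123", "urls": [
  {"url": "https://a.example/http.txt", "priority": 1},
  {"url": "https://b.example/socks.txt", "priority": 3}
]}'''
batch = parse_claim_response(raw)
```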

Phase 2: Fetch and Extract

Worker fetches each URL through Tor, extracts proxies using the existing fetch.extract_proxies() pipeline.

Worker
  |
  |  For each claimed URL:
  |    1. Fetch through Tor (fetch_contents)
  |    2. Compute content hash (MD5)
  |    3. If hash == last_hash: skip extraction, report unchanged
  |    4. Else: extract_proxies() -> list of (addr, proto, confidence)
  |    5. Queue extracted proxies for testing
  |

Phase 3: URL Feedback

Worker reports fetch results for each URL back to master.

Worker                          Master
  |                               |
  |  POST /api/report-urls        |
  |  {reports: [...]}             |
  |------------------------------>|
  |                               |  Update uris table:
  |                               |    check_time, error, stale_count,
  |                               |    retrievals, proxies_added,
  |                               |    content_hash, worker_scores
  |  {ok: true}                   |
  |<------------------------------|

URL report payload:

{
  "reports": [
    {
      "url": "https://...",
      "success": true,
      "content_hash": "a1b2c3d4...",
      "proxy_count": 1523,
      "fetch_time_ms": 2340,
      "changed": true,
      "error": null
    },
    {
      "url": "https://...",
      "success": false,
      "content_hash": null,
      "proxy_count": 0,
      "fetch_time_ms": 0,
      "changed": false,
      "error": "timeout"
    }
  ]
}
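
On the master side, applying one report to a uris row might look like the following sketch (the row is modelled as a plain dict; the real implementation would update SQLite, and the helper name is hypothetical):

```python
import time

def apply_url_report(row, report):
    """Fold one worker URL report into a uris row (modelled as a dict).

    Sketch only: column names mirror the uris table described in this
    document.
    """
    row["check_time"] = int(time.time())
    if report["success"]:
        row["error"] = 0
        row["retrievals"] = row.get("retrievals", 0) + 1
        if report["changed"]:
            row["stale_count"] = 0
            row["content_hash"] = report["content_hash"]
            row["proxies_added"] = row.get("proxies_added", 0) + report["proxy_count"]
        else:
            # Same content as last time: bump staleness, keep old hash.
            row["stale_count"] = row.get("stale_count", 0) + 1
    else:
        # Consecutive failures accumulate and feed the error penalty.
        row["error"] = row.get("error", 0) + 1
    return row

row = apply_url_report(
    {"retrievals": 3, "proxies_added": 100, "stale_count": 2},
    {"success": True, "changed": True, "content_hash": "a1b2c3d4",
     "proxy_count": 1523, "fetch_time_ms": 2340, "error": None},
)
```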

Phase 4: Proxy Testing and Reporting

Worker tests extracted proxies locally using the existing TargetTestJob pipeline. Only working proxies are reported to master. Failed proxies are discarded silently -- no point wasting bandwidth on negatives.

Workers are trusted. If a worker says a proxy works, master accepts it.

Worker                          Master
  |                               |
  |  Test proxies locally         |
  |  (same TargetTestJob flow)    |
  |  Discard failures             |
  |                               |
  |  POST /api/report-proxies     |
  |  {proxies: [...]}             |  (working only)
  |------------------------------>|
  |                               |  Upsert into proxylist:
  |                               |    INSERT OR REPLACE
  |                               |    Set failed=0, update last_seen
  |  {ok: true}                   |
  |<------------------------------|

Proxy report payload (working only):

{
  "proxies": [
    {
      "ip": "1.2.3.4",
      "port": 8080,
      "proto": "socks5",
      "source_proto": "socks5",
      "latency": 1.234,
      "exit_ip": "5.6.7.8",
      "anonymity": "elite",
      "source_url": "https://..."
    }
  ]
}

No working field needed -- everything in the report is working by definition. The source_url links proxy provenance to the URL that yielded it, enabling URL quality scoring.
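
Since every reported proxy carries its source_url, the master can tally per-URL yield directly from the payload. A minimal sketch (function name hypothetical):

```python
from collections import Counter

def tally_source_yield(report):
    """Count working proxies per source_url from a /api/report-proxies
    payload. Feeds URL quality scoring; sketch only."""
    counts = Counter()
    for p in report["proxies"]:
        counts[p["source_url"]] += 1
    return counts

payload = {"proxies": [
    {"ip": "1.2.3.4", "port": 8080, "source_url": "https://a.example/l.txt"},
    {"ip": "5.6.7.8", "port": 1080, "source_url": "https://a.example/l.txt"},
]}
```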

Complete Cycle

Worker main loop:
  1. GET  /api/claim-urls          Claim batch of URLs
  2. For each URL:
     a. Fetch through Tor
     b. Extract proxies (or skip if unchanged)
     c. Test extracted proxies
  3. POST /api/report-urls         Report URL health
  4. POST /api/report-proxies      Report proxy results
  5. POST /api/heartbeat           Health check
  6. Sleep, repeat

Master-Side Changes

URL Scheduling

Currently, the master's Leechered threads fetch URLs on a timer based on error/stale counts. Replace this with a scoring system that workers consume.

URL score (higher = fetch sooner):

score = base_score
      + freshness_bonus          # High-frequency sources score higher
      - error_penalty            # Consecutive errors reduce score
      - stale_penalty            # Unchanged content reduces score
      + yield_bonus              # URLs that produce many proxies score higher
      + quality_bonus            # URLs whose proxies actually work score higher

Concrete formula:

import time

def url_score(url_row):
    now = time.time()
    age = now - url_row.check_time
    base = age / url_row.check_interval       # 1.0 exactly when the URL is due

    # Yield: proxies found per fetch (rolling average)
    yield_rate = url_row.proxies_added / max(url_row.retrievals, 1)
    yield_bonus = min(yield_rate / 100.0, 1.0)  # Cap at 1.0

    # Quality: fraction of extracted proxies that actually worked
    quality_bonus = url_row.working_ratio * 0.5  # 0.0 to 0.5

    # Penalties
    error_penalty = min(url_row.error * 0.3, 2.0)
    stale_penalty = min(url_row.stale_count * 0.1, 1.0)

    return base + yield_bonus + quality_bonus - error_penalty - stale_penalty

URLs with score >= 1.0 are due for fetching. Claimed URLs are locked in memory for claim_timeout seconds (existing pattern).
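
The in-memory claim lock could follow the existing claim_timeout pattern; a sketch (class name hypothetical, injected timestamps for clarity):

```python
import time

class ClaimLock:
    """In-memory URL claim tracking with expiry, mirroring the
    claim_timeout pattern described above. Sketch only."""

    def __init__(self, claim_timeout):
        self.timeout = claim_timeout
        self.claims = {}  # url -> expiry timestamp

    def try_claim(self, url, now=None):
        now = time.time() if now is None else now
        if self.claims.get(url, 0) > now:
            return False          # Still locked by another worker
        self.claims[url] = now + self.timeout
        return True
```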

New uris Columns

ALTER TABLE uris ADD COLUMN check_interval INT DEFAULT 3600;
ALTER TABLE uris ADD COLUMN working_ratio REAL DEFAULT 0.0;
ALTER TABLE uris ADD COLUMN avg_fetch_time INT DEFAULT 0;
ALTER TABLE uris ADD COLUMN last_worker TEXT;
ALTER TABLE uris ADD COLUMN yield_rate REAL DEFAULT 0.0;
  • check_interval -- Adaptive: decreases for high-yield URLs, increases for stale/erroring ones. Replaces the checktime + error * perfail formula with a persisted value.
  • working_ratio -- EMA of (working_proxies / total_proxies) from worker feedback. URLs that yield dead proxies get deprioritized.
  • avg_fetch_time -- EMA of fetch duration in ms. Helps identify slow sources.
  • last_worker -- Which worker last fetched this URL. Useful for debugging, and to distribute URLs across workers evenly.
  • yield_rate -- EMA of proxies extracted per fetch.
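
The EMA updates and the adaptive check_interval described above can be sketched as follows (the smoothing factor, multipliers, and interval bounds are illustrative assumptions, not tuned values):

```python
def ema(prev, sample, alpha=0.2):
    """Exponential moving average; alpha is an illustrative smoothing factor."""
    return (1 - alpha) * prev + alpha * sample

def update_url_stats(row, report, working, extracted):
    """Fold one fetch cycle into the new uris columns. Sketch only;
    the row is modelled as a dict."""
    row["avg_fetch_time"] = int(ema(row["avg_fetch_time"], report["fetch_time_ms"]))
    row["yield_rate"] = ema(row["yield_rate"], report["proxy_count"])
    if extracted:
        row["working_ratio"] = ema(row["working_ratio"], working / extracted)
    # Adaptive interval: shrink for productive URLs, grow for stale ones.
    if report["changed"] and report["proxy_count"] > 0:
        row["check_interval"] = max(600, int(row["check_interval"] * 0.9))
    else:
        row["check_interval"] = min(86400, int(row["check_interval"] * 1.5))
    return row

row = update_url_stats(
    {"avg_fetch_time": 0, "yield_rate": 0.0, "working_ratio": 0.0,
     "check_interval": 3600},
    {"fetch_time_ms": 2000, "proxy_count": 100, "changed": True},
    working=50, extracted=100,
)
```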

Proxy Aggregation

Trust model: workers are trusted. If any worker reports a proxy as working, master accepts it. Failed proxies are never reported -- workers discard them locally.

Worker A reports: 1.2.3.4:8080 working, latency 1.2s
Worker B reports: 1.2.3.4:8080 working, latency 1.5s

Master action:
  - INSERT OR REPLACE with latest report
  - Update last_seen, latency EMA
  - Set failed = 0

No consensus, no voting, no trust scoring. A proxy lives as long as at least one worker keeps confirming it. It dies when nobody reports it for proxy_ttl seconds.

New proxylist column:

ALTER TABLE proxylist ADD COLUMN last_seen INT DEFAULT 0;
  • last_seen -- Unix timestamp of most recent "working" report. Proxies not seen in N hours are expired by the master's periodic cleanup.
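
The upsert described above can be sketched with sqlite3 (the schema here is a cut-down stand-in for the real proxylist table):

```python
import sqlite3
import time

def upsert_working(db, proxy):
    """INSERT OR REPLACE a reported working proxy, resetting failed and
    stamping last_seen. Sketch only."""
    db.execute(
        "INSERT OR REPLACE INTO proxylist (ip, port, proto, latency, failed, last_seen) "
        "VALUES (?, ?, ?, ?, 0, ?)",
        (proxy["ip"], proxy["port"], proxy["proto"], proxy["latency"],
         int(time.time())),
    )

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE proxylist (ip TEXT, port INT, proto TEXT, "
           "latency REAL, failed INT, last_seen INT, PRIMARY KEY (ip, port))")
# Two workers report the same proxy; the second report wins.
upsert_working(db, {"ip": "1.2.3.4", "port": 8080, "proto": "socks5", "latency": 1.2})
upsert_working(db, {"ip": "1.2.3.4", "port": 8080, "proto": "socks5", "latency": 1.5})
```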

Proxy Expiry

Working proxies that haven't been reported by any worker within proxy_ttl (default: 4 hours) are marked stale and re-queued for testing. After proxy_ttl * 3 with no reports, they're marked failed.

import time

def expire_stale_proxies(db, proxy_ttl):
    now = int(time.time())
    cutoff_stale = now - proxy_ttl
    cutoff_dead = now - (proxy_ttl * 3)

    # Mark stale proxies for retesting
    db.execute('''
        UPDATE proxylist SET failed = 1
        WHERE failed = 0 AND last_seen < ? AND last_seen > 0
    ''', (cutoff_stale,))

    # Kill proxies not seen in a long time
    db.execute('''
        UPDATE proxylist SET failed = -1
        WHERE failed > 0 AND last_seen < ? AND last_seen > 0
    ''', (cutoff_dead,))

Worker-Side Changes

New Worker Loop

Replace the current claim-test-report loop with a two-phase loop:

def worker_main_v2(config):
    register()
    verify_tor()

    while True:
        # Phase 1: Fetch URLs and extract proxies
        urls = claim_urls(server, key, count=5)
        url_reports = []
        proxy_batch = []

        for url_info in urls:
            report, proxies = fetch_and_extract(url_info)
            url_reports.append(report)
            proxy_batch.extend(proxies)

        report_urls(server, key, url_reports)

        # Phase 2: Test extracted proxies, report working only
        if proxy_batch:
            working = test_proxies(proxy_batch)
            if working:
                report_proxies(server, key, working)

        heartbeat(server, key)
        sleep(1)

fetch_and_extract()

New function that combines fetching + extraction on the worker side:

import hashlib
import time

def fetch_and_extract(url_info):
    url = url_info['url']
    last_hash = url_info.get('last_hash')
    proto_hint = url_info.get('proto_hint')  # For extraction confidence scoring

    start = time.time()
    try:
        content = fetch_contents(url, head=False, proxy=tor_proxy)
    except Exception as e:
        # Report all payload fields even on failure, matching /api/report-urls
        return {
            'url': url, 'success': False, 'content_hash': None,
            'proxy_count': 0, 'fetch_time_ms': 0,
            'changed': False, 'error': str(e)
        }, []

    elapsed = int((time.time() - start) * 1000)
    content_hash = hashlib.md5(content).hexdigest()

    if content_hash == last_hash:
        return {
            'url': url, 'success': True, 'content_hash': content_hash,
            'proxy_count': 0, 'fetch_time_ms': elapsed,
            'changed': False, 'error': None
        }, []

    proxies = extract_proxies(content, url)
    return {
        'url': url, 'success': True, 'content_hash': content_hash,
        'proxy_count': len(proxies), 'fetch_time_ms': elapsed,
        'changed': True, 'error': None
    }, proxies

Deduplication

Workers may extract the same proxies from different URLs. Local deduplication before testing:

seen = set()
unique = []
for addr, proto, confidence in proxy_batch:
    if addr not in seen:
        seen.add(addr)
        unique.append((addr, proto, confidence))
proxy_batch = unique

Proxy Testing

Reuse the existing TargetTestJob / WorkerThread pipeline. The only change: proxies come from local extraction instead of master's claim response. The test loop, result collection, and evaluation logic remain identical.

API Changes Summary

New Endpoints

+---------------------+--------+-----------------------------------+
| Endpoint            | Method | Purpose                           |
+---------------------+--------+-----------------------------------+
| /api/claim-urls     | GET    | Worker claims batch of due URLs   |
| /api/report-urls    | POST   | Worker reports URL fetch results  |
| /api/report-proxies | POST   | Worker reports proxy test results |
+---------------------+--------+-----------------------------------+

Modified Endpoints

+--------------+------------------------------------------------+
| Endpoint     | Change                                         |
+--------------+------------------------------------------------+
| /api/work    | Deprecated but kept for backward compatibility |
| /api/results | Deprecated but kept for backward compatibility |
+--------------+------------------------------------------------+

Unchanged Endpoints

+----------------+----------------------------+
| Endpoint       | Reason                     |
+----------------+----------------------------+
| /api/register  | Same registration flow     |
| /api/heartbeat | Same health reporting      |
| /dashboard     | Still reads from same DB   |
| /proxies       | Still reads from proxylist |
+----------------+----------------------------+

Schema Changes

uris table additions

ALTER TABLE uris ADD COLUMN check_interval INT DEFAULT 3600;
ALTER TABLE uris ADD COLUMN working_ratio REAL DEFAULT 0.0;
ALTER TABLE uris ADD COLUMN avg_fetch_time INT DEFAULT 0;
ALTER TABLE uris ADD COLUMN last_worker TEXT;
ALTER TABLE uris ADD COLUMN yield_rate REAL DEFAULT 0.0;

proxylist table additions

ALTER TABLE proxylist ADD COLUMN last_seen INT DEFAULT 0;

Migration Strategy

Phase 1: Add New Endpoints (non-breaking)

Add /api/claim-urls, /api/report-urls, /api/report-proxies to httpd.py. Keep all existing endpoints working. Master still runs its own Leechered threads.

Files: httpd.py, dbs.py (migrations)

Phase 2: Worker V2 Mode

Add --worker-v2 flag to ppf.py. When set, worker uses the new URL-claiming loop instead of the proxy-claiming loop. Both modes coexist.

Old workers (--worker) continue working against /api/work and /api/results. New workers (--worker-v2) use the new endpoints.

Files: ppf.py, config.py

Phase 3: URL Scoring

Implement URL scoring in master based on worker feedback. Replace Leechered timer-based scheduling with score-based scheduling. Master's own fetching becomes a fallback for URLs no worker has claimed recently.

Files: httpd.py, dbs.py

Phase 4: Remove Legacy

Once all workers run V2, remove /api/work, /api/results, and master-side Leechered threads. Master no longer fetches proxy lists directly.

Files: ppf.py, httpd.py

Configuration

New config.ini Options

[worker]
# V2 mode: worker fetches URLs instead of proxy batches
mode = v2                    # v1 (legacy) or v2 (url-driven)
url_batch_size = 5           # URLs per claim cycle
max_proxies_per_cycle = 500  # Cap on proxies tested per cycle
fetch_timeout = 30           # Timeout for URL fetching (seconds)

[ppf]
# URL scoring weights
score_yield_weight = 1.0
score_quality_weight = 0.5
score_error_penalty = 0.3
score_stale_penalty = 0.1

# Proxy expiry
proxy_ttl = 14400            # Seconds before unseen proxy goes stale (4h)
proxy_ttl_dead = 43200       # Seconds before unseen proxy is killed (12h)

# Fallback: master fetches URLs not claimed by any worker
fallback_fetch = true
fallback_interval = 7200     # Seconds before master fetches unclaimed URL
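
Reading these options with Python's configparser might look like the following sketch (the fallback defaults mirror the example values above and are assumptions, not shipped code):

```python
import configparser

# Inline comments like the ones in the example config.ini are stripped.
cfg = configparser.ConfigParser(inline_comment_prefixes=("#",))
cfg.read_string("""
[worker]
mode = v2
url_batch_size = 5

[ppf]
proxy_ttl = 14400            # Seconds before unseen proxy goes stale (4h)
fallback_fetch = true
""")

batch = cfg.getint("worker", "url_batch_size", fallback=5)
ttl = cfg.getint("ppf", "proxy_ttl", fallback=14400)
fallback = cfg.getboolean("ppf", "fallback_fetch", fallback=True)
```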

Risks and Mitigations

  • Workers extract different proxy counts from the same URL. Impact: inconsistent proxy_count in reports. Mitigation: use content_hash for dedup; only update yield_rate when the hash changes.
  • A Tor exit blocks a source for one worker. Impact: worker reports an error for a working URL. Mitigation: require 2+ consecutive errors before incrementing the URL error count.
  • Workers test the same proxies redundantly. Impact: wasted CPU. Mitigation: master tracks which URLs are assigned to which workers and avoids assigning the same URL to multiple workers in the same cycle.
  • Large proxy lists overwhelm worker memory. Impact: OOM on the worker. Mitigation: cap max_proxies_per_cycle; worker discards excess after dedup.
  • Master restart loses claim state. Impact: workers refetch recently fetched URLs. Mitigation: harmless -- just a redundant fetch; content_hash prevents duplicate work.
  • fetch.py imports unavailable on the worker image. Impact: ImportError. Mitigation: verify the worker Dockerfile includes fetch.py and its dependencies.

What Stays the Same

  • rocksock.py -- No changes to proxy chain logic
  • connection_pool.py -- Tor host selection unchanged
  • proxywatchd.py core -- TargetTestJob, WorkerThread, ProxyTestState remain identical. Only the job source changes.
  • fetch.py -- Used on workers now, but the code itself doesn't change
  • httpd.py dashboard/proxies -- Still reads from same proxylist table
  • SQLite as storage -- No database engine change

Open Questions

  1. Should workers share extracted proxy lists with each other? Peer exchange would reduce redundant fetching but adds protocol complexity. Recommendation: no, keep it simple. Master deduplicates via INSERT OR REPLACE.

  2. Should URL claiming be weighted by worker geography? Some sources may be accessible from certain Tor exits but not others. Recommendation: defer. Let natural retries handle this; track per-worker URL success rates for future optimization.

  3. What's the right proxy_ttl? Too short and we churn proxies needlessly. Too long and we serve stale data. Start with 4 hours, tune based on observed proxy lifetime distribution.