Compare commits

1 Commits

Author SHA1 Message Date
Username
d1e22a388c httpd: add ASN enrichment for worker-reported proxies
All checks were successful
CI / validate (push) Successful in 21s
Load pyasn database in httpd and look up ASN when workers report
working proxies. Previously ASN was only populated by proxywatchd
which doesn't run independently on the master node, leaving all
worker-reported proxies with asn=null.
2026-02-22 11:18:51 +01:00
10 changed files with 227 additions and 809 deletions
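
The enrichment described in the commit message can be sketched roughly as below. This is an illustration, not the code from this commit: `StubAsnDb`, `enrich_report`, and the report dict layout are assumptions made for the example. The only thing borrowed from pyasn is the shape of `lookup(ip)`, which returns an `(asn, prefix)` tuple.

```python
class StubAsnDb(object):
    """Hypothetical stand-in for a loaded pyasn database.

    Mirrors only the (asn, prefix) return shape of pyasn's lookup().
    """

    def __init__(self, table):
        self._table = table  # maps exact IP string -> (asn, prefix)

    def lookup(self, ip):
        return self._table.get(ip, (None, None))


def enrich_report(proxies, asndb):
    """Return copies of worker-reported proxy dicts with an 'asn' key added.

    asndb may be None when no database could be loaded; in that case
    every proxy keeps asn=None, which is the situation the commit addresses.
    """
    enriched = []
    for proxy in proxies:
        record = dict(proxy)
        asn = None
        if asndb is not None:
            try:
                asn = asndb.lookup(record['ip'])[0]
            except (ValueError, KeyError):
                asn = None  # malformed or missing address: leave unset
        record['asn'] = asn
        enriched.append(record)
    return enriched


# Illustrative data only; the table entry is part of the stub.
db = StubAsnDb({'8.8.8.8': (15169, '8.8.8.0/24')})
reported = [{'ip': '8.8.8.8', 'port': 1080}, {'ip': '10.0.0.1', 'port': 3128}]
result = enrich_report(reported, db)
```

Treating a missing database as "leave `asn` unset" keeps the report path working on nodes where `ipasn.dat` is absent.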

@@ -1,29 +0,0 @@
FROM python:2.7-slim
WORKDIR /app
RUN sed -i 's/deb.debian.org/archive.debian.org/g' /etc/apt/sources.list && \
sed -i 's/security.debian.org/archive.debian.org/g' /etc/apt/sources.list && \
sed -i '/buster-updates/d' /etc/apt/sources.list && \
echo 'deb http://archive.debian.org/debian-security buster/updates main' >> /etc/apt/sources.list && \
apt-get update && \
apt-get upgrade -y && \
apt-get install -y --no-install-recommends gcc libc-dev && \
rm -rf /var/lib/apt/lists/*
RUN pip install --upgrade "pip<21" "setuptools<45" "wheel<0.38"
COPY requirements.txt .
RUN pip install -r requirements.txt || true
RUN pip install pytest
RUN mkdir -p /app/data && \
python -c "import pyasn" 2>/dev/null && \
pyasn_util_download.py --latest && \
pyasn_util_convert.py --single rib.*.bz2 /app/data/ipasn.dat && \
rm -f rib.*.bz2 || \
echo "pyasn database setup skipped"
RUN apt-get purge -y gcc libc-dev && apt-get autoremove -y || true
CMD ["python", "-m", "pytest", "tests/", "-v", "--tb=short"]

@@ -1,100 +1,69 @@
# PPF Roadmap
# PPF Project Roadmap
## Architecture
## Project Purpose
PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework designed to:
1. **Discover** proxy addresses by crawling websites and search engines
2. **Validate** proxies through multi-target testing via Tor
3. **Maintain** a database of working proxies with protocol detection (SOCKS4/SOCKS5/HTTP)
## Architecture Overview
```
            ┌──────────────────────────────────────────┐
            │ Odin (Master)                            │
            │ httpd.py ─ API + SSL-only verification   │
            │ proxywatchd.py ─ proxy recheck daemon    │
            │ SQLite: proxies.db, websites.db          │
            └──────────┬───────────────────────────────┘
                       │ WireGuard (10.200.1.0/24)
      ┌────────────────┼────────────────┐
      v                v                v
┌───────────┐    ┌───────────┐    ┌───────────┐
│ cassius   │    │ edge      │    │ sentinel  │
│ Worker    │    │ Worker    │    │ Worker    │
│ ppf.py    │    │ ppf.py    │    │ ppf.py    │
└───────────┘    └───────────┘    └───────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│                              PPF Architecture                               │
├─────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                      │
│  │ scraper.py  │    │   ppf.py    │    │proxywatchd  │                      │
│  │             │    │             │    │             │                      │
│  │ Searx query │───>│ URL harvest │───>│ Proxy test  │                      │
│  │ URL finding │    │Proxy extract│    │ Validation  │                      │
│  └─────────────┘    └─────────────┘    └─────────────┘                      │
│                                                                             │
│         v                  v                  v                             │
│  ┌─────────────────────────────────────────────────────────────────┐        │
│  │                        SQLite Databases                         │        │
│  │  uris.db (URLs)              proxies.db (proxy list)            │        │
│  └─────────────────────────────────────────────────────────────────┘        │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────┐        │
│  │                          Network Layer                          │        │
│  │  rocksock.py ─── Tor SOCKS ─── Test Proxy ─── Target Server     │        │
│  └─────────────────────────────────────────────────────────────────┘        │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
Workers claim URLs, extract proxies, test them, report back.
Master verifies (SSL-only), serves API, coordinates distribution.
## Constraints
- Python 2.7 runtime (container-based)
- Minimal external dependencies
- All traffic via Tor
---
## Phase 1: Performance and Quality (current)
Profiling-driven optimizations and source pipeline hardening.
| Item | Status | Description |
|------|--------|-------------|
| Extraction short-circuits | done | Guard clauses in fetch.py extractors |
| Skip shutdown on failed sockets | done | Track _connected flag, skip shutdown on dead sockets |
| SQLite connection reuse (odin) | done | Per-greenlet cached handles via threading.local |
| Lazy-load ASN database | done | Defer ipasn.dat parsing to first lookup |
| Add more seed sources (100+) | done | Expanded to 120+ URLs with SOCKS5-specific sources |
| Protocol-aware source weighting | done | Dynamic SOCKS boost in claim_urls scoring |
| Sharpen error penalty in URL scoring | done | Reduce erroring URL claim frequency |
## Phase 2: Proxy Diversity and Consumer API
Address customer-reported quality gaps.
| Item | Status | Description |
|------|--------|-------------|
| ASN diversity scoring | pending | Deprioritize over-represented ASNs in testing |
| Graduated recheck intervals | pending | Fresh proxies rechecked more often than stale |
| API filters (proto/country/ASN/latency) | pending | Consumer-facing query parameters on /proxies |
| Latency-based ranking | pending | Expose latency percentiles per proxy |
## Phase 3: Self-Expanding Source Pool
Worker-driven link discovery from productive pages.
| Item | Status | Description |
|------|--------|-------------|
| Link extraction from productive pages | pending | Parse HTML for links when page yields proxies |
| Report discovered URLs to master | pending | New endpoint for worker URL submissions |
| Conditional discovery | pending | Only extract links from confirmed-productive pages |
## Phase 4: Long-Term
| Item | Status | Description |
|------|--------|-------------|
| Python 3 migration | deferred | Unblocks modern deps, security patches, pyasn native |
| Worker trust scoring | pending | Activate spot-check verification framework |
| Dynamic target pool | pending | Auto-discover and rotate validation targets |
| Geographic target spread | pending | Ensure targets span multiple regions |
- **Python 2.7** compatibility required
- **Minimal external dependencies** (avoid adding new modules)
- Current dependencies: beautifulsoup4, pyasn, IP2Location
- Data files: IP2LOCATION-LITE-DB1.BIN (country), ipasn.dat (ASN)
---
## Completed
| Item | Date | Description |
|------|------|-------------|
| Sharpen URL error penalty | 2026-02-22 | error*0.5 cap 4.0 + stale*0.2 cap 1.5 |
| SOCKS5 source expansion | 2026-02-22 | Added 10 new SOCKS5-specific sources |
| SQLite connection reuse | 2026-02-22 | Per-greenlet cached handles via threading.local |
| Lazy-load ASN database | 2026-02-22 | Deferred ipasn.dat to first lookup |
| Socket shutdown skip | 2026-02-22 | _connected flag, skip shutdown on dead sockets |
| Protocol-aware weighting | 2026-02-22 | Dynamic SOCKS boost in claim_urls scoring |
| Seed sources expanded | 2026-02-22 | 37 -> 120+ URLs |
| last_seen freshness fix | 2026-02-22 | Watchd updates last_seen on verification |
| Periodic re-seeding | 2026-02-22 | Reset errored sources every 6h |
| ASN enrichment | 2026-02-22 | Pure-Python ipasn.dat reader + backfill |
| URL pipeline stats | 2026-02-22 | /api/stats exposes source health metrics |
| Extraction short-circuits | 2026-02-22 | Guard clauses + precompiled table regexes |
| Target health tracking | prior | Cooldown-based health for all target pools |
| MITM field in proxy list | prior | Expose mitm boolean in JSON endpoints |
| V1 worker protocol removal | prior | Cleaned up legacy --worker code path |
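
The "Sharpen URL error penalty" row gives its formula only in shorthand (`error*0.5 cap 4.0 + stale*0.2 cap 1.5`). One plausible reading, sketched below, caps each term separately and then sums them. The function name and the meaning of the two counters are assumptions for illustration, not taken from the repository.

```python
def url_penalty(error_count, stale_count):
    """Score penalty for a source URL under the assumed reading.

    error_count: number of failed fetches (assumed meaning of "error")
    stale_count: number of fetches that yielded nothing new (assumed
                 meaning of "stale")
    """
    error_term = min(error_count * 0.5, 4.0)   # saturates at 8 errors
    stale_term = min(stale_count * 0.2, 1.5)   # saturates around 8 stale fetches
    return error_term + stale_term
```

Under this reading the total penalty can never exceed 5.5, so a source that keeps failing is claimed less often but is never pushed arbitrarily far down the queue.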
### Target Management
| Task | Description | File(s) |
|------|-------------|---------|
| Target health tracking | Cooldown-based health tracking for all target pools (head, SSL, IRC, judges) | stats.py, proxywatchd.py |
| MITM field in proxy list | Expose mitm boolean in JSON proxy list endpoints | httpd.py |
---
## Open Work
### Target Management
| Task | Description | File(s) |
|------|-------------|---------|
| Dynamic target pool | Auto-discover and rotate validation targets | proxywatchd.py |
| Geographic target spread | Ensure targets span multiple regions | config.py |
---
@@ -102,12 +71,31 @@ Worker-driven link discovery from productive pages.
| File | Purpose |
|------|---------|
| ppf.py | URL harvester, worker main loop |
| ppf.py | Main URL harvester daemon |
| proxywatchd.py | Proxy validation daemon |
| fetch.py | HTTP fetching, proxy extraction |
| httpd.py | API server, worker coordination |
| dbs.py | Database schema, seed sources |
| config.py | Configuration management |
| rocksock.py | Socket/proxy abstraction |
| scraper.py | Searx search integration |
| fetch.py | HTTP fetching with proxy support |
| dbs.py | Database schema and inserts |
| mysqlite.py | SQLite wrapper |
| rocksock.py | Socket/proxy abstraction (3rd party) |
| http2.py | HTTP client implementation |
| tools/ppf-deploy | Deployment wrapper |
| httpd.py | Web dashboard and REST API server |
| config.py | Configuration management |
| comboparse.py | Config/arg parser framework |
| soup_parser.py | BeautifulSoup wrapper |
| misc.py | Utilities (timestamp, logging) |
| export.py | Proxy export CLI tool |
| engines.py | Search engine implementations |
| connection_pool.py | Tor connection pooling |
| network_stats.py | Network statistics tracking |
| dns.py | DNS resolution with caching |
| mitm.py | MITM certificate detection |
| job.py | Priority job queue |
| static/dashboard.js | Dashboard frontend logic |
| static/dashboard.html | Dashboard HTML template |
| tools/lib/ppf-common.sh | Shared ops library (hosts, wrappers, colors) |
| tools/ppf-deploy | Deploy wrapper (validation + playbook) |
| tools/ppf-logs | View container logs |
| tools/ppf-service | Container lifecycle management |
| tools/playbooks/deploy.yml | Ansible deploy playbook |
| tools/playbooks/inventory.ini | Host inventory (WireGuard IPs) |

@@ -1,34 +0,0 @@
# PPF Tasklist
Active execution queue. Ordered by priority.
---
## In Progress
| # | Task | File(s) | Notes |
|---|------|---------|-------|
## Queued
| # | Task | File(s) | Notes |
|---|------|---------|-------|
| 12 | API filters on /proxies (proto/country/ASN) | httpd.py | Consumer query params |
| 8 | Graduated recheck intervals | proxywatchd.py | Fresh proxies checked more often |
## Done
| # | Task | Date |
|---|------|------|
| - | Sharpen URL error penalty scoring | 2026-02-22 |
| - | Add SOCKS5-specific sources (10 new) | 2026-02-22 |
| 3 | Lazy-load ASN database | 2026-02-22 |
| 2 | SQLite connection reuse on odin | 2026-02-22 |
| 1 | Skip socket.shutdown on failed connections | 2026-02-22 |
| 4 | Add more seed sources (100+) | 2026-02-22 |
| 6 | Protocol-aware source weighting | 2026-02-22 |
| - | Extraction short-circuits | 2026-02-22 |
| - | last_seen freshness fix | 2026-02-22 |
| - | Periodic re-seeding | 2026-02-22 |
| - | ASN enrichment | 2026-02-22 |
| - | URL pipeline stats | 2026-02-22 |

82
TODO.md

@@ -1,35 +1,83 @@
# PPF TODO
Intake buffer. Items refined here move to TASKLIST.md.
## Optimization
### [ ] JSON Stats Response Caching
- Cache serialized JSON response with short TTL (1-2s)
- Only regenerate when underlying stats change
- Use ETag/If-None-Match for client-side caching
- Savings: ~7-9s/hour. Low priority, only matters with frequent dashboard access.
### [ ] Object Pooling for Test States
- Pool ProxyTestState and TargetTestJob, reset and reuse
- Savings: ~11-15s/hour. **Not recommended** - high effort, medium risk, modest gain.
### [ ] SQLite Connection Reuse
- Persistent connection per thread with health checks
- Savings: ~0.3s/hour. **Not recommended** - negligible benefit.
---
## Dashboard
- [ ] Cache expensive DB queries (top countries, protocol breakdown)
- [ ] Historical graphs (24h, 7d) using stats_history table
- [ ] Per-ASN performance analysis
- [ ] Alert thresholds (success rate < X%, MITM detected)
- [ ] WebSocket push updates (reduce polling overhead)
- [ ] Mobile-responsive improvements
### [ ] Performance
- Cache expensive DB queries (top countries, protocol breakdown)
- Lazy-load historical data (only when scrolled into view)
- WebSocket option for push updates (reduce polling overhead)
- Configurable refresh interval via URL param or localStorage
### [ ] Features
- Historical graphs (24h, 7d) using stats_history table
- Per-ASN performance analysis
- Alert thresholds (success rate < X%, MITM detected)
- Mobile-responsive improvements
---
## Memory
- [ ] Lock consolidation (260k LockType objects at scale)
- [ ] Leaner state objects per job
- [ ] Lock consolidation - reduce per-proxy locks (260k LockType objects)
- [ ] Leaner state objects - reduce dict/list count per job
Memory scales ~4.5 KB/job. No leaks detected. Optimize only if constrained.
Memory scales linearly with queue (~4.5 KB/job). No leaks detected.
Optimize only if memory becomes a constraint.
## Source Pipeline
---
- [ ] PasteBin/GitHub API scrapers for proxy lists
- [ ] Telegram channel scrapers (beyond t.me/s/ HTML)
- [ ] Source quality decay tracking (flag sources going stale)
- [ ] Deduplication of sources across different URL forms
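
A minimal sketch of how source deduplication across URL forms might work is shown below. The normalization rules (lowercase host, drop default ports, ignore trailing slashes and fragments) and both function names are assumptions chosen for illustration; the project may want stricter or looser rules.

```python
try:
    from urllib.parse import urlsplit, urlunsplit  # Python 3
except ImportError:
    from urlparse import urlsplit, urlunsplit  # Python 2.7


def canonical_url(url):
    """Reduce a URL to a comparison key (hypothetical normalization rules).

    Userinfo and fragments are dropped; query strings are kept verbatim.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or '').lower()
    port = parts.port
    default_port = {'http': 80, 'https': 443}.get(scheme)
    if port in (None, default_port):
        netloc = host
    else:
        netloc = '%s:%d' % (host, port)
    path = parts.path.rstrip('/') or '/'
    return urlunsplit((scheme, netloc, path, parts.query, ''))


def dedupe_sources(urls):
    """Keep the first URL seen for each canonical form, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = canonical_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

The scheme is deliberately part of the key, so `http://` and `https://` variants of the same page stay distinct; whether that is desirable depends on how the sources behave.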
## Deprecation
### [x] Remove V1 worker protocol
Completed. Removed `--worker` flag, `worker_main()`, `claim_work()`,
`submit_results()`, `/api/work`, `/api/results`, and related config
options. `--worker` now routes to the URL-driven protocol.
---
## Known Issues
### [!] Podman Container Metadata Disappears
`podman ps -a` shows empty even though process is running.
Monitor via `ss -tlnp`, `ps aux`, or `curl localhost:8081/health`.
`podman ps -a` shows empty even though process is running. Service functions
correctly despite missing metadata. Monitor via `ss -tlnp`, `ps aux`, or
`curl localhost:8081/health`. Low impact.
---
## Container Debugging Checklist
```
1. Check for orphans: ps aux | grep -E "[p]rocess_name"
2. Check port conflicts: ss -tlnp | grep PORT
3. Run foreground: podman run --rm (no -d) to see output
4. Check podman state: podman ps -a
5. Clean stale: pkill -9 -f "pattern" && podman rm -f -a
6. Verify deps: config files, data dirs, volumes exist
7. Check logs: podman logs container_name 2>&1 | tail -50
8. Health check: curl -sf http://localhost:PORT/health
```
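
Step 8 of the checklist can also be scripted. The sketch below is a rough Python counterpart to `curl -sf`, assuming only that the service answers on a `/health` path as the checklist states; `check_health` is a hypothetical helper, and unlike `curl -sf` it follows redirects.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2.7


def check_health(url, timeout=3.0):
    """Return True if the URL answers with a 2xx status, False otherwise."""
    try:
        response = urlopen(url, timeout=timeout)
        try:
            return 200 <= response.getcode() < 300
        finally:
            response.close()
    except (IOError, ValueError):
        # IOError covers refused connections, timeouts, and HTTP error
        # statuses; ValueError covers malformed URLs.
        return False
```

For example, `check_health('http://localhost:8081/health')` would mirror the `curl localhost:8081/health` command mentioned under Known Issues.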

@@ -1,18 +0,0 @@
# PPF test runner (Python 2.7, production deps + pytest)
#
# Mounts source and tests as volumes so no rebuild needed between runs.
#
# Usage:
# podman-compose -f compose.test.yml run --rm test
# podman-compose -f compose.test.yml run --rm test python -m pytest tests/test_fetch.py -v
services:
test:
container_name: ppf-test
build:
context: .
dockerfile: Dockerfile.test
volumes:
- .:/app:ro,Z
working_dir: /app
command: python -m pytest tests/ -v --tb=short

176
dbs.py

@@ -582,107 +582,34 @@ def insert_urls(urls, search, sqlite):
# Known proxy list sources (GitHub raw lists, APIs)
PROXY_SOURCES = [
# --- GitHub raw lists (sorted by update frequency) ---
# TheSpeedX/PROXY-List - large, hourly updates
'https://raw.githubusercontent.com/TheSpeedX/PROXY-List/master/http.txt',
'https://raw.githubusercontent.com/TheSpeedX/PROXY-List/master/socks4.txt',
'https://raw.githubusercontent.com/TheSpeedX/PROXY-List/master/socks5.txt',
# clarketm/proxy-list - curated, daily
'https://raw.githubusercontent.com/clarketm/proxy-list/master/proxy-list-raw.txt',
# monosans/proxy-list - hourly updates
'https://raw.githubusercontent.com/monosans/proxy-list/main/proxies/http.txt',
'https://raw.githubusercontent.com/monosans/proxy-list/main/proxies/socks4.txt',
'https://raw.githubusercontent.com/monosans/proxy-list/main/proxies/socks5.txt',
# prxchk/proxy-list - 10 min updates
'https://raw.githubusercontent.com/prxchk/proxy-list/main/http.txt',
'https://raw.githubusercontent.com/prxchk/proxy-list/main/socks4.txt',
'https://raw.githubusercontent.com/prxchk/proxy-list/main/socks5.txt',
# jetkai/proxy-list - 10 min updates
'https://raw.githubusercontent.com/jetkai/proxy-list/main/online-proxies/txt/proxies.txt',
# hookzof/socks5_list - hourly, SOCKS5 focused
'https://raw.githubusercontent.com/hookzof/socks5_list/master/proxy.txt',
# mmpx12/proxy-list
'https://raw.githubusercontent.com/mmpx12/proxy-list/master/http.txt',
'https://raw.githubusercontent.com/mmpx12/proxy-list/master/socks4.txt',
'https://raw.githubusercontent.com/mmpx12/proxy-list/master/socks5.txt',
# ShiftyTR/Proxy-List
'https://raw.githubusercontent.com/ShiftyTR/Proxy-List/master/http.txt',
'https://raw.githubusercontent.com/ShiftyTR/Proxy-List/master/socks4.txt',
'https://raw.githubusercontent.com/ShiftyTR/Proxy-List/master/socks5.txt',
# roosterkid/openproxylist
'https://raw.githubusercontent.com/roosterkid/openproxylist/main/HTTPS_RAW.txt',
'https://raw.githubusercontent.com/roosterkid/openproxylist/main/SOCKS4_RAW.txt',
'https://raw.githubusercontent.com/roosterkid/openproxylist/main/SOCKS5_RAW.txt',
# clarketm/proxy-list - curated, daily
'https://raw.githubusercontent.com/clarketm/proxy-list/master/proxy-list-raw.txt',
# officialputuid/KangProxy - 4-6 hour updates
'https://raw.githubusercontent.com/officialputuid/KangProxy/KangProxy/http/http.txt',
'https://raw.githubusercontent.com/officialputuid/KangProxy/KangProxy/https/https.txt',
'https://raw.githubusercontent.com/officialputuid/KangProxy/KangProxy/socks4/socks4.txt',
'https://raw.githubusercontent.com/officialputuid/KangProxy/KangProxy/socks5/socks5.txt',
# iplocate/free-proxy-list - 30 min updates
'https://raw.githubusercontent.com/iplocate/free-proxy-list/main/protocols/http.txt',
'https://raw.githubusercontent.com/iplocate/free-proxy-list/main/protocols/socks4.txt',
'https://raw.githubusercontent.com/iplocate/free-proxy-list/main/protocols/socks5.txt',
# ErcinDedeworken/proxy-list - hourly
'https://raw.githubusercontent.com/ErcinDedeworken/proxy-list/main/proxy-list/data.txt',
# MuRongPIG/Proxy-Master - 10 min updates
'https://raw.githubusercontent.com/MuRongPIG/Proxy-Master/main/http.txt',
'https://raw.githubusercontent.com/MuRongPIG/Proxy-Master/main/socks4.txt',
'https://raw.githubusercontent.com/MuRongPIG/Proxy-Master/main/socks5.txt',
# zloi-user/hideip.me - hourly
'https://raw.githubusercontent.com/zloi-user/hideip.me/main/http.txt',
'https://raw.githubusercontent.com/zloi-user/hideip.me/main/socks4.txt',
'https://raw.githubusercontent.com/zloi-user/hideip.me/main/socks5.txt',
# FLAVIEN-music/proxy-list - 30 min updates
'https://raw.githubusercontent.com/FLAVIEN-music/proxy-list/main/proxies/http.txt',
'https://raw.githubusercontent.com/FLAVIEN-music/proxy-list/main/proxies/socks4.txt',
'https://raw.githubusercontent.com/FLAVIEN-music/proxy-list/main/proxies/socks5.txt',
# Zaeem20/FREE_PROXIES_LIST - 30 min updates
'https://raw.githubusercontent.com/Zaeem20/FREE_PROXIES_LIST/master/http.txt',
'https://raw.githubusercontent.com/Zaeem20/FREE_PROXIES_LIST/master/https.txt',
'https://raw.githubusercontent.com/Zaeem20/FREE_PROXIES_LIST/master/socks4.txt',
'https://raw.githubusercontent.com/Zaeem20/FREE_PROXIES_LIST/master/socks5.txt',
# r00tee/Proxy-List - hourly
'https://raw.githubusercontent.com/r00tee/Proxy-List/main/Https.txt',
'https://raw.githubusercontent.com/r00tee/Proxy-List/main/Socks4.txt',
'https://raw.githubusercontent.com/r00tee/Proxy-List/main/Socks5.txt',
# casals-ar/proxy-list
'https://raw.githubusercontent.com/casals-ar/proxy-list/main/http',
'https://raw.githubusercontent.com/casals-ar/proxy-list/main/socks4',
'https://raw.githubusercontent.com/casals-ar/proxy-list/main/socks5',
# yemixzy/proxy-list
'https://raw.githubusercontent.com/yemixzy/proxy-list/main/proxies/http.txt',
'https://raw.githubusercontent.com/yemixzy/proxy-list/main/proxies/socks4.txt',
'https://raw.githubusercontent.com/yemixzy/proxy-list/main/proxies/socks5.txt',
# opsxcq/proxy-list
'https://raw.githubusercontent.com/opsxcq/proxy-list/master/list.txt',
# im-razvan/proxy_list - 10 min updates
'https://raw.githubusercontent.com/im-razvan/proxy_list/main/http.txt',
'https://raw.githubusercontent.com/im-razvan/proxy_list/main/socks4.txt',
'https://raw.githubusercontent.com/im-razvan/proxy_list/main/socks5.txt',
# zevtyardt/proxy-list - daily SOCKS5
'https://raw.githubusercontent.com/zevtyardt/proxy-list/main/socks5.txt',
# UptimerBot/proxy-list - 15 min updates
'https://raw.githubusercontent.com/UptimerBot/proxy-list/main/proxies/socks5.txt',
# Anonym0usWork1221/Free-Proxies
'https://raw.githubusercontent.com/Anonym0usWork1221/Free-Proxies/main/proxy_files/https_proxies.txt',
'https://raw.githubusercontent.com/Anonym0usWork1221/Free-Proxies/main/proxy_files/socks4_proxies.txt',
'https://raw.githubusercontent.com/Anonym0usWork1221/Free-Proxies/main/proxy_files/socks5_proxies.txt',
# ErcinDedeoglu/proxies - hourly
'https://raw.githubusercontent.com/ErcinDedeoglu/proxies/main/proxies/http.txt',
'https://raw.githubusercontent.com/ErcinDedeoglu/proxies/main/proxies/socks4.txt',
'https://raw.githubusercontent.com/ErcinDedeoglu/proxies/main/proxies/socks5.txt',
# dinoz0rg/proxy-list - daily, all protocols
'https://raw.githubusercontent.com/dinoz0rg/proxy-list/main/all.txt',
# elliottophellia/proxylist - SOCKS5
'https://raw.githubusercontent.com/elliottophellia/proxylist/master/results/socks5/global/socks5_len.txt',
# gfpcom/free-proxy-list - SOCKS5
'https://raw.githubusercontent.com/gfpcom/free-proxy-list/main/socks5.txt',
# databay-labs/free-proxy-list - SOCKS5
'https://raw.githubusercontent.com/databay-labs/free-proxy-list/master/socks5.txt',
# --- GitHub Pages / CDN hosted ---
# ShiftyTR/Proxy-List
'https://raw.githubusercontent.com/ShiftyTR/Proxy-List/master/http.txt',
'https://raw.githubusercontent.com/ShiftyTR/Proxy-List/master/socks4.txt',
'https://raw.githubusercontent.com/ShiftyTR/Proxy-List/master/socks5.txt',
# mmpx12/proxy-list
'https://raw.githubusercontent.com/mmpx12/proxy-list/master/http.txt',
'https://raw.githubusercontent.com/mmpx12/proxy-list/master/socks4.txt',
'https://raw.githubusercontent.com/mmpx12/proxy-list/master/socks5.txt',
# proxyscrape API
'https://api.proxyscrape.com/v2/?request=displayproxies&protocol=http&timeout=10000&country=all',
'https://api.proxyscrape.com/v2/?request=displayproxies&protocol=socks4&timeout=10000&country=all',
'https://api.proxyscrape.com/v2/?request=displayproxies&protocol=socks5&timeout=10000&country=all',
# proxifly/free-proxy-list - 5 min updates (jsDelivr CDN)
'https://cdn.jsdelivr.net/gh/proxifly/free-proxy-list@main/proxies/protocols/http/data.txt',
'https://cdn.jsdelivr.net/gh/proxifly/free-proxy-list@main/proxies/protocols/socks4/data.txt',
@@ -691,71 +618,24 @@ PROXY_SOURCES = [
'https://vakhov.github.io/fresh-proxy-list/http.txt',
'https://vakhov.github.io/fresh-proxy-list/socks4.txt',
'https://vakhov.github.io/fresh-proxy-list/socks5.txt',
# prxchk/proxy-list - 10 min updates
'https://raw.githubusercontent.com/prxchk/proxy-list/main/http.txt',
'https://raw.githubusercontent.com/prxchk/proxy-list/main/socks4.txt',
'https://raw.githubusercontent.com/prxchk/proxy-list/main/socks5.txt',
# sunny9577/proxy-scraper - 3 hour updates (GitHub Pages)
'https://sunny9577.github.io/proxy-scraper/generated/http_proxies.txt',
'https://sunny9577.github.io/proxy-scraper/generated/socks4_proxies.txt',
'https://sunny9577.github.io/proxy-scraper/generated/socks5_proxies.txt',
# --- API endpoints ---
# proxyscrape
'https://api.proxyscrape.com/v2/?request=displayproxies&protocol=http&timeout=10000&country=all',
'https://api.proxyscrape.com/v2/?request=displayproxies&protocol=socks4&timeout=10000&country=all',
'https://api.proxyscrape.com/v2/?request=displayproxies&protocol=socks5&timeout=10000&country=all',
# proxy-list.download - SOCKS5 API
'https://www.proxy-list.download/api/v1/get?type=socks5',
'https://www.proxy-list.download/api/v1/get?type=socks4',
# openproxylist.xyz - plain text
'https://api.openproxylist.xyz/http.txt',
'https://api.openproxylist.xyz/socks4.txt',
'https://api.openproxylist.xyz/socks5.txt',
# spys.me - plain text, 30 min updates
'http://spys.me/proxy.txt',
'http://spys.me/socks.txt',
# --- Web scrapers (HTML pages) ---
# spys.one - mixed protocols, requires parsing
'https://spys.one/en/free-proxy-list/',
'https://spys.one/en/socks-proxy-list/',
'https://spys.one/en/https-ssl-proxy/',
# free-proxy-list.net
'https://free-proxy-list.net/',
'https://www.sslproxies.org/',
'https://www.socks-proxy.net/',
# sockslist.us - SOCKS5 focused
'https://sockslist.us/',
# mtpro.xyz - SOCKS5, updated every 5 min
'https://mtpro.xyz/socks5',
# proxy-tools.com - SOCKS5 filtered
'https://proxy-tools.com/proxy/socks5',
# hidemy.name - all protocols, paginated
'https://hide.mn/en/proxy-list/',
# advanced.name - SOCKS5 filtered
'https://advanced.name/freeproxy?type=socks5',
# proxynova.com - by country
'https://www.proxynova.com/proxy-server-list/',
# freeproxy.world - SOCKS5 filtered
'https://www.freeproxy.world/?type=socks5',
# proxydb.net - all protocols
'http://proxydb.net/',
# geonode
'https://proxylist.geonode.com/api/proxy-list?limit=500&page=1&sort_by=lastChecked&sort_type=desc&protocols=http',
'https://proxylist.geonode.com/api/proxy-list?limit=500&page=1&sort_by=lastChecked&sort_type=desc&protocols=socks4',
'https://proxylist.geonode.com/api/proxy-list?limit=500&page=1&sort_by=lastChecked&sort_type=desc&protocols=socks5',
# openproxy.space
'https://openproxy.space/list/http',
'https://openproxy.space/list/socks4',
'https://openproxy.space/list/socks5',
# --- Telegram channels (public HTML view) ---
'https://t.me/s/spys_one',
'https://t.me/s/proxyfree1',
'https://t.me/s/proxylist4free',
'https://t.me/s/proxy_lists',
'https://t.me/s/Proxies4ForYou',
# officialputuid/KangProxy - 4-6 hour updates
'https://raw.githubusercontent.com/officialputuid/KangProxy/KangProxy/http/http.txt',
'https://raw.githubusercontent.com/officialputuid/KangProxy/KangProxy/socks4/socks4.txt',
'https://raw.githubusercontent.com/officialputuid/KangProxy/KangProxy/socks5/socks5.txt',
# hookzof/socks5_list - hourly updates
'https://raw.githubusercontent.com/hookzof/socks5_list/master/proxy.txt',
# iplocate/free-proxy-list - 30 min updates
'https://raw.githubusercontent.com/iplocate/free-proxy-list/main/protocols/http.txt',
'https://raw.githubusercontent.com/iplocate/free-proxy-list/main/protocols/socks4.txt',
'https://raw.githubusercontent.com/iplocate/free-proxy-list/main/protocols/socks5.txt',
]

@@ -221,10 +221,6 @@ def extract_auth_proxies(content):
"""
proxies = []
# Short-circuit: auth proxies always contain @
if '@' not in content:
return proxies
# IPv4 auth proxies
for match in AUTH_PROXY_PATTERN.finditer(content):
proto_str, user, passwd, ip, port = match.groups()
@@ -260,12 +256,6 @@ TABLE_PORT_HEADERS = ('port',)
TABLE_PROTO_HEADERS = ('type', 'protocol', 'proto', 'scheme')
_TABLE_PATTERN = re.compile(r'<table[^>]*>(.*?)</table>', re.IGNORECASE | re.DOTALL)
_ROW_PATTERN = re.compile(r'<tr[^>]*>(.*?)</tr>', re.IGNORECASE | re.DOTALL)
_CELL_PATTERN = re.compile(r'<t[hd][^>]*>(.*?)</t[hd]>', re.IGNORECASE | re.DOTALL)
_TAG_STRIP = re.compile(r'<[^>]+>')
def extract_proxies_from_table(content):
"""Extract proxies from HTML tables with IP/Port/Protocol columns.
@@ -279,23 +269,26 @@ def extract_proxies_from_table(content):
"""
proxies = []
# Short-circuit: no HTML tables in plain text content
if '<table' not in content and '<TABLE' not in content:
return proxies
# Simple regex-based table parsing (works without BeautifulSoup)
# Find all tables
table_pattern = re.compile(r'<table[^>]*>(.*?)</table>', re.IGNORECASE | re.DOTALL)
row_pattern = re.compile(r'<tr[^>]*>(.*?)</tr>', re.IGNORECASE | re.DOTALL)
cell_pattern = re.compile(r'<t[hd][^>]*>(.*?)</t[hd]>', re.IGNORECASE | re.DOTALL)
tag_strip = re.compile(r'<[^>]+>')
for table_match in _TABLE_PATTERN.finditer(content):
for table_match in table_pattern.finditer(content):
table_html = table_match.group(1)
rows = _ROW_PATTERN.findall(table_html)
rows = row_pattern.findall(table_html)
if not rows:
continue
# Parse header row to find column indices
ip_col = port_col = proto_col = -1
header_row = rows[0]
headers = _CELL_PATTERN.findall(header_row)
headers = cell_pattern.findall(header_row)
for i, cell in enumerate(headers):
cell_text = _TAG_STRIP.sub('', cell).strip().lower()
cell_text = tag_strip.sub('', cell).strip().lower()
if ip_col < 0 and any(h in cell_text for h in TABLE_IP_HEADERS):
ip_col = i
elif port_col < 0 and any(h in cell_text for h in TABLE_PORT_HEADERS):
@@ -309,11 +302,11 @@ def extract_proxies_from_table(content):
# Parse data rows
for row in rows[1:]:
cells = _CELL_PATTERN.findall(row)
cells = cell_pattern.findall(row)
if len(cells) <= ip_col:
continue
ip_cell = _TAG_STRIP.sub('', cells[ip_col]).strip()
ip_cell = tag_strip.sub('', cells[ip_col]).strip()
# Check if IP cell contains port (ip:port format)
if ':' in ip_cell and port_col < 0:
@@ -322,7 +315,7 @@ def extract_proxies_from_table(content):
ip, port = match.groups()
proto = None
if proto_col >= 0 and len(cells) > proto_col:
proto = _normalize_proto(_TAG_STRIP.sub('', cells[proto_col]).strip())
proto = _normalize_proto(tag_strip.sub('', cells[proto_col]).strip())
addr = '%s:%s' % (ip, port)
if is_usable_proxy(addr):
proxies.append((addr, proto))
@@ -330,7 +323,7 @@ def extract_proxies_from_table(content):
# Separate IP and Port columns
if port_col >= 0 and len(cells) > port_col:
port_cell = _TAG_STRIP.sub('', cells[port_col]).strip()
port_cell = tag_strip.sub('', cells[port_col]).strip()
try:
port = int(port_cell)
except ValueError:
@@ -342,7 +335,7 @@ def extract_proxies_from_table(content):
proto = None
if proto_col >= 0 and len(cells) > proto_col:
proto = _normalize_proto(_TAG_STRIP.sub('', cells[proto_col]).strip())
proto = _normalize_proto(tag_strip.sub('', cells[proto_col]).strip())
addr = '%s:%d' % (ip_cell, port)
if is_usable_proxy(addr):
@@ -365,10 +358,6 @@ def extract_proxies_from_json(content):
"""
proxies = []
# Short-circuit: content must contain JSON delimiters
if '{' not in content and '[' not in content:
return proxies
# Try to find JSON in content (may be embedded in HTML)
json_matches = []

280
httpd.py

@@ -31,67 +31,12 @@ except (ImportError, IOError, ValueError):
_geodb = None
_geolite = False
# ASN lookup (optional, lazy-loaded on first use)
# Defers ~3.6s startup cost of parsing ipasn.dat until first ASN lookup.
_asndb = None
_asndb_loaded = False
_asn_dat_path = os.path.join("data", "ipasn.dat")
import socket
import struct
import bisect
class _AsnLookup(object):
"""Pure-Python ASN lookup using ipasn.dat (CIDR/ASN text format)."""
def __init__(self, path):
self._entries = []
with open(path) as f:
for line in f:
line = line.strip()
if not line or line.startswith(';'):
continue
parts = line.split('\t')
if len(parts) != 2:
continue
cidr, asn = parts
ip, prefix = cidr.split('/')
start = struct.unpack('!I', socket.inet_aton(ip))[0]
self._entries.append((start, int(prefix), int(asn)))
self._entries.sort()
_log('asn: loaded %d prefixes (pure-python)' % len(self._entries), 'info')
def lookup(self, ip):
ip_int = struct.unpack('!I', socket.inet_aton(ip))[0]
idx = bisect.bisect_right(self._entries, (ip_int, 33, 0)) - 1
if idx < 0:
return (None, None)
start, prefix_len, asn = self._entries[idx]
mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
if (ip_int & mask) == (start & mask):
return (asn, None)
return (None, None)
def _get_asndb():
"""Lazy-load ASN database on first call. Returns db instance or None."""
global _asndb, _asndb_loaded
if _asndb_loaded:
return _asndb
_asndb_loaded = True
try:
import pyasn
_asndb = pyasn.pyasn(_asn_dat_path)
return _asndb
except (ImportError, IOError):
pass
if os.path.exists(_asn_dat_path):
try:
_asndb = _AsnLookup(_asn_dat_path)
except Exception as e:
_log('asn: failed to load %s: %s' % (_asn_dat_path, e), 'warn')
return _asndb
# ASN lookup (optional)
try:
import pyasn
_asndb = pyasn.pyasn(os.path.join("data", "ipasn.dat"))
except (ImportError, IOError):
_asndb = None
# Rate limiting configuration
_rate_limits = defaultdict(list)
@@ -169,30 +114,6 @@ _fail_retry_interval = 60 # retry interval for failing proxies
_fail_retry_backoff = True # True=linear backoff (60,120,180...), False=fixed (60,60,60...)
_max_fail = 5 # failures before proxy considered dead
# Per-greenlet (or per-thread) SQLite connection cache
# Under gevent, threading.local() is monkey-patched to greenlet-local storage.
# Connections are reused across requests handled by the same greenlet, eliminating
# redundant sqlite3.connect() + PRAGMA calls (~0.5ms each, ~2.7k/session on odin).
_local = threading.local()
def _get_db(path):
"""Get a cached SQLite connection for the proxy database."""
db = getattr(_local, 'proxy_db', None)
if db is None:
db = mysqlite.mysqlite(path, str)
_local.proxy_db = db
return db
def _get_url_db(path):
"""Get a cached SQLite connection for the URL database."""
db = getattr(_local, 'url_db', None)
if db is None:
db = mysqlite.mysqlite(path, str)
_local.url_db = db
return db
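
The caching pattern used by `_get_db` and `_get_url_db` can be tried with the standard library alone. This is a sketch only: it assumes plain `sqlite3` in place of the project's `mysqlite` wrapper, and `get_cached_conn` is a hypothetical helper name.

```python
import sqlite3
import threading

_local = threading.local()


def get_cached_conn(path, attr='conn'):
    """Return this thread's cached connection, opening it on first use."""
    conn = getattr(_local, attr, None)
    if conn is None:
        conn = sqlite3.connect(path)
        setattr(_local, attr, conn)
    return conn


# Two calls on the same thread share one connection object.
first = get_cached_conn(':memory:')
second = get_cached_conn(':memory:')

# A different thread gets its own connection.
seen = []
worker = threading.Thread(
    target=lambda: seen.append(get_cached_conn(':memory:')))
worker.start()
worker.join()
```

As in the helpers above, the cache is keyed by attribute name rather than by path, so a second path passed under the same attribute would silently return the first connection.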
def configure_schedule(working_checktime, fail_retry_interval, fail_retry_backoff, max_fail):
"""Set testing schedule parameters from config."""
@@ -386,42 +307,6 @@ def get_worker_test_rate(worker_id):
return 0.0
return total_tests / elapsed
def _get_proto_boost():
"""Calculate protocol scarcity boost for URL scoring.
Returns a value 0.0-1.0 to boost SOCKS sources when SOCKS proxies
are underrepresented relative to HTTP. Returns 0.0 when balanced.
"""
try:
if not _proxy_database:
return 0.0
db = _get_db(_proxy_database)
if not db:
return 0.0
row = db.execute(
"SELECT "
" SUM(CASE WHEN proto='http' THEN 1 ELSE 0 END),"
" SUM(CASE WHEN proto IN ('socks4','socks5') THEN 1 ELSE 0 END)"
" FROM proxylist WHERE failed=0"
).fetchone()
if not row or (not row[0] and not row[1]):
return 0.5 # no data, default mild boost
http_count, socks_count = row[0] or 0, row[1] or 0
total = http_count + socks_count
if total == 0:
return 0.5
socks_ratio = float(socks_count) / total
# Boost SOCKS sources when socks_ratio < 40%
if socks_ratio >= 0.4:
return 0.0
return min((0.4 - socks_ratio) * 2.5, 1.0) # 0.0-1.0 scale
except Exception:
return 0.0
# Global reference to proxy database path (set by ProxyAPIServer.__init__)
_proxy_database = None
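
The boost arithmetic in `_get_proto_boost` is easier to check in isolation from the database. The sketch below uses the same 40% threshold and 2.5 scale factor; `proto_boost` and its keyword parameters are hypothetical, and the empty-table handling is reduced to a single total-is-zero check.

```python
def proto_boost(http_count, socks_count, target_ratio=0.4, scale=2.5):
    """Return a 0.0-1.0 boost for SOCKS sources given working-proxy counts."""
    total = http_count + socks_count
    if total == 0:
        return 0.5  # no data: mild default boost
    socks_ratio = float(socks_count) / total
    if socks_ratio >= target_ratio:
        return 0.0  # pool is balanced enough
    # Linear ramp: 0.0 at the target ratio, capped at 1.0 when no SOCKS exist.
    return min((target_ratio - socks_ratio) * scale, 1.0)


balanced = proto_boost(50, 50)
scarce = proto_boost(80, 20)
none_left = proto_boost(100, 0)
```

A pool with 20% SOCKS proxies lands halfway up the ramp, and a pool with none reaches the cap.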
def claim_urls(url_db, worker_id, count=5):
"""Claim a batch of URLs for worker-driven fetching. Returns list of URL dicts.
@@ -432,7 +317,6 @@ def claim_urls(url_db, worker_id, count=5):
- quality_bonus: 0-0.5 based on working_ratio
- error_penalty: 0-2.0 based on consecutive errors
- stale_penalty: 0-1.0 based on unchanged fetches
- proto_boost: 0-1.0 for SOCKS sources when SOCKS underrepresented
"""
now = time.time()
now_int = int(now)
@@ -458,19 +342,14 @@ def claim_urls(url_db, worker_id, count=5):
list_max_age_seconds = _url_list_max_age_days * 86400
min_added = now_int - list_max_age_seconds
# Boost SOCKS sources when protocol pool is imbalanced
proto_boost = _get_proto_boost()
try:
rows = url_db.execute(
'''SELECT url, content_hash,
(? - check_time) * 1.0 / MAX(COALESCE(check_interval, 3600), 1)
+ MIN(COALESCE(yield_rate, 0) / 100.0, 1.0)
+ COALESCE(working_ratio, 0) * 0.5
- MIN(error * 0.5, 4.0)
- MIN(stale_count * 0.2, 1.5)
+ CASE WHEN LOWER(url) LIKE '%socks5%' OR LOWER(url) LIKE '%socks4%'
THEN ? ELSE 0 END
- MIN(error * 0.3, 2.0)
- MIN(stale_count * 0.1, 1.0)
AS score
FROM uris
WHERE error < ?
@@ -478,7 +357,7 @@ def claim_urls(url_db, worker_id, count=5):
AND (added > ? OR proxies_added > 0)
ORDER BY score DESC
LIMIT ?''',
(now_int, proto_boost, _url_max_fail, now_int, min_added, count * 3)
(now_int, _url_max_fail, now_int, min_added, count * 3)
).fetchall()
except Exception as e:
_log('claim_urls query error: %s' % e, 'error')
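
The SQL score expression in `claim_urls` can be mirrored in Python for unit testing. This sketch follows the variant with the `error * 0.3` (cap 2.0) and `stale_count * 0.1` (cap 1.0) penalties and omits the SOCKS boost term. `url_score` is a hypothetical helper, and it assumes `error` and `stale_count` are never NULL.

```python
def url_score(now, check_time, check_interval, yield_rate,
              working_ratio, error, stale_count):
    """Recompute the claim_urls priority score for a single row."""
    # MAX(COALESCE(check_interval, 3600), 1)
    interval = max(check_interval if check_interval is not None else 3600, 1)
    age_term = (now - check_time) * 1.0 / interval
    yield_bonus = min((yield_rate or 0) / 100.0, 1.0)
    quality_bonus = (working_ratio or 0) * 0.5
    error_penalty = min(error * 0.3, 2.0)
    stale_penalty = min(stale_count * 0.1, 1.0)
    return (age_term + yield_bonus + quality_bonus
            - error_penalty - stale_penalty)


# Two intervals overdue, moderate yield, every extracted proxy working.
good = url_score(7200, 0, 3600, 50, 1.0, 0, 0)
# One interval overdue, no history, both penalties at their caps.
bad = url_score(3600, 0, None, None, None, 10, 20)
```

Because the age term is unbounded while every bonus and penalty is capped, a sufficiently overdue URL eventually outranks a fresher one regardless of its history.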
@@ -654,7 +533,7 @@ def _update_url_working_ratios(url_working_counts):
pending_snapshot = dict(_url_pending_counts)
try:
url_db = _get_url_db(_url_database_path)
url_db = mysqlite.mysqlite(_url_database_path, str)
for url, working_count in url_working_counts.items():
pending = pending_snapshot.get(url)
if not pending or pending['total'] <= 0:
@@ -675,6 +554,7 @@ def _update_url_working_ratios(url_working_counts):
settled.append(url)
url_db.commit()
url_db.close()
except Exception as e:
_log('_update_url_working_ratios error: %s' % e, 'error')
@@ -741,10 +621,9 @@ def submit_proxy_reports(db, worker_id, proxies):
(rec.country_short, proxy_key))
except Exception:
pass
asndb = _get_asndb()
if asndb:
if _asndb:
try:
asn_result = asndb.lookup(ip)
asn_result = _asndb.lookup(ip)
if asn_result and asn_result[0]:
db.execute(
'UPDATE proxylist SET asn=? WHERE proxy=?',
@@ -1187,7 +1066,7 @@ class ProxyAPIHandler(BaseHTTPServer.BaseHTTPRequestHandler):
def handle_countries(self):
"""Return all countries with proxy counts."""
try:
db = _get_db(self.database)
db = mysqlite.mysqlite(self.database, str)
rows = db.execute(
'SELECT country, COUNT(*) as c FROM proxylist WHERE failed=0 AND country IS NOT NULL '
'GROUP BY country ORDER BY c DESC'
@@ -1206,7 +1085,7 @@ class ProxyAPIHandler(BaseHTTPServer.BaseHTTPRequestHandler):
def get_db_stats(self):
"""Get statistics from database."""
try:
db = _get_db(self.database)
db = mysqlite.mysqlite(self.database, str)
stats = {}
# Total counts
@@ -1254,7 +1133,7 @@ class ProxyAPIHandler(BaseHTTPServer.BaseHTTPRequestHandler):
# Add database stats
try:
db = _get_db(self.database)
db = mysqlite.mysqlite(self.database, str)
stats['db'] = self.get_db_stats()
stats['db_health'] = get_db_health(db)
except Exception as e:
@@ -1341,7 +1220,7 @@ class ProxyAPIHandler(BaseHTTPServer.BaseHTTPRequestHandler):
args.append(limit)
try:
db = _get_db(self.database)
db = mysqlite.mysqlite(self.database, str)
rows = db.execute(sql, args).fetchall()
if fmt == 'plain':
@@ -1390,7 +1269,7 @@ class ProxyAPIHandler(BaseHTTPServer.BaseHTTPRequestHandler):
sql += ' ORDER BY avg_latency ASC, tested DESC'
try:
db = _get_db(self.database)
db = mysqlite.mysqlite(self.database, str)
rows = db.execute(sql, args).fetchall()
if fmt == 'plain':
@@ -1407,7 +1286,7 @@ class ProxyAPIHandler(BaseHTTPServer.BaseHTTPRequestHandler):
def handle_count(self):
try:
db = _get_db(self.database)
db = mysqlite.mysqlite(self.database, str)
row = db.execute('SELECT COUNT(*) FROM proxylist WHERE failed=0 AND proto IS NOT NULL').fetchone()
self.send_json({'count': row[0] if row else 0})
except Exception as e:
@@ -1433,9 +1312,8 @@ class ProxyAPIServer(threading.Thread):
self.stats_provider = stats_provider
self.profiling = profiling
self.daemon = True
global _url_database_path, _proxy_database
global _url_database_path
_url_database_path = url_database
_proxy_database = database
self.server = None
self._stop_event = threading.Event() if not GEVENT_PATCHED else None
# Load static library files into cache
@@ -1444,42 +1322,14 @@ class ProxyAPIServer(threading.Thread):
load_static_files(THEME)
# Load worker registry from disk
load_workers()
# Backfill ASN for existing proxies missing it (triggers lazy-load)
if _get_asndb():
self._backfill_asn()
# Create verification tables if they don't exist
try:
db = _get_db(self.database)
db = mysqlite.mysqlite(self.database, str)
create_verification_tables(db)
_log('verification tables initialized', 'debug')
except Exception as e:
_log('failed to create verification tables: %s' % e, 'warn')
def _backfill_asn(self):
"""One-time backfill of ASN for proxies that have ip but no ASN."""
try:
db = _get_db(self.database)
rows = db.execute(
'SELECT proxy, ip FROM proxylist WHERE asn IS NULL AND ip IS NOT NULL'
).fetchall()
if not rows:
return
updated = 0
for proxy_key, ip in rows:
try:
result = _get_asndb().lookup(ip)
if result and result[0]:
db.execute('UPDATE proxylist SET asn=? WHERE proxy=?',
(result[0], proxy_key))
updated += 1
except Exception:
pass
db.commit()
if updated:
_log('asn: backfilled %d/%d proxies' % (updated, len(rows)), 'info')
except Exception as e:
_log('asn backfill error: %s' % e, 'warn')
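
The backfill loop in `_backfill_asn` can be exercised against an in-memory database. This is a sketch under stated assumptions: the table schema contains only the columns the query above touches, the lookup is a stub callable instead of pyasn, and `backfill_asn` is a hypothetical free function.

```python
import sqlite3


def backfill_asn(db, lookup):
    """Fill asn for rows that have an ip but no asn. Returns (updated, seen)."""
    rows = db.execute(
        'SELECT proxy, ip FROM proxylist '
        'WHERE asn IS NULL AND ip IS NOT NULL').fetchall()
    updated = 0
    for proxy_key, ip in rows:
        result = lookup(ip)
        if result and result[0]:
            db.execute('UPDATE proxylist SET asn=? WHERE proxy=?',
                       (result[0], proxy_key))
            updated += 1
    db.commit()
    return updated, len(rows)


db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE proxylist (proxy TEXT PRIMARY KEY, ip TEXT, asn INTEGER)')
db.executemany('INSERT INTO proxylist VALUES (?, ?, ?)', [
    ('198.51.100.7:8080', '198.51.100.7', None),  # will be filled
    ('203.0.113.9:1080', '203.0.113.9', None),    # unknown to the stub
    ('192.0.2.1:3128', '192.0.2.1', 64511),       # already has an ASN
])

# Stub lookup with the same (asn, prefix) return shape as the code above.
fake_asns = {'198.51.100.7': 64500}
outcome = backfill_asn(db, lambda ip: (fake_asns.get(ip), None))
filled = db.execute(
    "SELECT asn FROM proxylist WHERE ip='198.51.100.7'").fetchone()[0]
```

Rows the stub cannot resolve keep `asn` as NULL, so a later run with a fuller database picks them up again.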
def _wsgi_app(self, environ, start_response):
"""WSGI application wrapper for gevent."""
path = environ.get('PATH_INFO', '/').split('?')[0]
@@ -1638,15 +1488,11 @@ class ProxyAPIServer(threading.Thread):
stats['system'] = get_system_stats()
# Add database stats
try:
db = _get_db(self.database)
db = mysqlite.mysqlite(self.database, str)
stats['db'] = self._get_db_stats(db)
stats['db_health'] = get_db_health(db)
except Exception as e:
_log('api/stats db error: %s' % e, 'warn')
# Add URL pipeline stats
url_stats = self._get_url_stats()
if url_stats is not None:
stats['urls'] = url_stats
# Add profiling flag (from constructor or stats_provider)
if 'profiling' not in stats:
stats['profiling'] = self.profiling
@@ -1671,7 +1517,7 @@ class ProxyAPIServer(threading.Thread):
# 2. Database stats and health
try:
db = _get_db(self.database)
db = mysqlite.mysqlite(self.database, str)
result['stats']['db'] = self._get_db_stats(db)
result['stats']['db_health'] = get_db_health(db)
@@ -1684,6 +1530,8 @@ class ProxyAPIServer(threading.Thread):
# 4. Workers (same as /api/workers)
result['workers'] = self._get_workers_data(db)
db.close()
except Exception as e:
_log('api/dashboard db error: %s' % e, 'warn')
result['countries'] = {}
@@ -1702,7 +1550,7 @@ class ProxyAPIServer(threading.Thread):
return json.dumps({'error': 'stats not available'}), 'application/json', 500
elif path == '/api/countries':
try:
db = _get_db(self.database)
db = mysqlite.mysqlite(self.database, str)
rows = db.execute(
'SELECT country, COUNT(*) as c FROM proxylist WHERE failed=0 AND country IS NOT NULL '
'GROUP BY country ORDER BY c DESC'
@@ -1714,7 +1562,7 @@ class ProxyAPIServer(threading.Thread):
elif path == '/api/locations':
# Return proxy locations aggregated by lat/lon grid (0.5 degree cells)
try:
db = _get_db(self.database)
db = mysqlite.mysqlite(self.database, str)
rows = db.execute(
'SELECT ROUND(latitude, 1) as lat, ROUND(longitude, 1) as lon, '
'country, anonymity, COUNT(*) as c FROM proxylist '
@@ -1752,7 +1600,7 @@ class ProxyAPIServer(threading.Thread):
sql += ' ORDER BY avg_latency ASC, tested DESC LIMIT ?'
args.append(limit)
db = _get_db(self.database)
db = mysqlite.mysqlite(self.database, str)
rows = db.execute(sql, args).fetchall()
if fmt == 'plain':
@@ -1791,7 +1639,7 @@ class ProxyAPIServer(threading.Thread):
sql += ' AND mitm=1'
sql += ' ORDER BY avg_latency ASC, tested DESC'
db = _get_db(self.database)
db = mysqlite.mysqlite(self.database, str)
rows = db.execute(sql, args).fetchall()
if fmt == 'plain':
@@ -1813,7 +1661,7 @@ class ProxyAPIServer(threading.Thread):
sql += ' AND mitm=0'
elif mitm_filter == '1':
sql += ' AND mitm=1'
db = _get_db(self.database)
db = mysqlite.mysqlite(self.database, str)
row = db.execute(sql).fetchone()
return json.dumps({'count': row[0] if row else 0}), 'application/json', 200
except Exception as e:
@@ -1869,8 +1717,9 @@ class ProxyAPIServer(threading.Thread):
elif path == '/api/workers':
# List connected workers
try:
db = _get_db(self.database)
db = mysqlite.mysqlite(self.database, str)
workers_data = self._get_workers_data(db)
db.close()
return json.dumps(workers_data, indent=2), 'application/json', 200
except Exception as e:
_log('api/workers error: %s' % e, 'warn')
@@ -1888,7 +1737,7 @@ class ProxyAPIServer(threading.Thread):
return json.dumps({'error': 'url database not configured'}), 'application/json', 500
count = min(int(query_params.get('count', 5)), 20)
try:
url_db = _get_url_db(self.url_database)
url_db = mysqlite.mysqlite(self.url_database, str)
urls = claim_urls(url_db, worker_id, count)
update_worker_heartbeat(worker_id)
return json.dumps({
@@ -1915,7 +1764,7 @@ class ProxyAPIServer(threading.Thread):
if not reports:
return json.dumps({'error': 'no reports provided'}), 'application/json', 400
try:
url_db = _get_url_db(self.url_database)
url_db = mysqlite.mysqlite(self.url_database, str)
processed = submit_url_reports(url_db, worker_id, reports)
update_worker_heartbeat(worker_id)
return json.dumps({
@@ -1939,7 +1788,7 @@ class ProxyAPIServer(threading.Thread):
if not proxies:
return json.dumps({'error': 'no proxies provided'}), 'application/json', 400
try:
db = _get_db(self.database)
db = mysqlite.mysqlite(self.database, str)
processed = submit_proxy_reports(db, worker_id, proxies)
update_worker_heartbeat(worker_id)
return json.dumps({
@@ -1988,65 +1837,6 @@ class ProxyAPIServer(threading.Thread):
_log('_get_db_stats error: %s' % e, 'warn')
return stats
def _get_url_stats(self):
"""Get URL pipeline statistics from the websites database."""
if not self.url_database:
return None
try:
db = _get_url_db(self.url_database)
stats = {}
now = int(time.time())
# Total URLs and health breakdown
row = db.execute('SELECT COUNT(*) FROM uris').fetchone()
stats['total'] = row[0] if row else 0
row = db.execute('SELECT COUNT(*) FROM uris WHERE error >= 10').fetchone()
stats['dead'] = row[0] if row else 0
row = db.execute('SELECT COUNT(*) FROM uris WHERE error > 0 AND error < 10').fetchone()
stats['erroring'] = row[0] if row else 0
row = db.execute('SELECT COUNT(*) FROM uris WHERE error = 0').fetchone()
stats['healthy'] = row[0] if row else 0
# Recently active (fetched in last hour)
row = db.execute(
'SELECT COUNT(*) FROM uris WHERE check_time >= ?',
(now - 3600,)).fetchone()
stats['fetched_last_hour'] = row[0] if row else 0
# Productive sources (have produced working proxies)
row = db.execute(
'SELECT COUNT(*) FROM uris WHERE working_ratio > 0'
).fetchone()
stats['productive'] = row[0] if row else 0
# Aggregate yield
row = db.execute(
'SELECT SUM(proxies_added), SUM(retrievals) FROM uris'
).fetchone()
stats['total_proxies_extracted'] = row[0] or 0 if row else 0
stats['total_fetches'] = row[1] or 0 if row else 0
# Currently claimed
with _url_claims_lock:
stats['claimed'] = len(_url_claims)
# Top sources by working_ratio (productive URLs only)
rows = db.execute(
'SELECT url, working_ratio, yield_rate, proxies_added, retrievals '
'FROM uris WHERE working_ratio > 0 AND retrievals > 0 '
'ORDER BY working_ratio DESC LIMIT 10'
).fetchall()
stats['top_sources'] = [{
'url': r[0], 'working_ratio': round(r[1], 3),
'yield_rate': round(r[2], 1), 'proxies_added': r[3],
'fetches': r[4],
} for r in rows]
return stats
except Exception as e:
_log('_get_url_stats error: %s' % e, 'warn')
return None
def _get_workers_data(self, db):
"""Get worker status data. Used by /api/workers and /api/dashboard."""
now = time.time()

View File

@@ -242,7 +242,6 @@ class Rocksock():
target = RocksockProxy(host, port, RS_PT_NONE)
self.proxychain.append(target)
self.sock = None
self._connected = False
self.timeout = timeout
def _translate_socket_error(self, e, pnum):
@@ -303,18 +302,15 @@ class Rocksock():
select.select([], [self.sock], [])
"""
self._connected = True
def disconnect(self):
if self.sock is None: return
if self._connected:
try:
self.sock.shutdown(socket.SHUT_RDWR)
except socket.error:
pass
try:
self.sock.shutdown(socket.SHUT_RDWR)
except socket.error:
pass
self.sock.close()
self.sock = None
self._connected = False
def canread(self):
return select.select([self.sock], [], [], 0)[0]
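
The `_connected` flag on one side of this hunk lets `disconnect()` skip `shutdown()` on a socket that never completed a connection. A standalone sketch of that guard follows; `GuardedSocket` and `mark_connected` are hypothetical names and not Rocksock's API.

```python
import socket


class GuardedSocket(object):
    """Wrap a socket and call shutdown() only if it was marked connected."""

    def __init__(self, sock):
        self.sock = sock
        self._connected = False

    def mark_connected(self):
        self._connected = True

    def disconnect(self):
        if self.sock is None:
            return
        if self._connected:
            try:
                self.sock.shutdown(socket.SHUT_RDWR)
            except socket.error:
                pass  # peer may already have gone away
        self.sock.close()
        self.sock = None
        self._connected = False


left, right = socket.socketpair()
wrapper = GuardedSocket(left)
wrapper.mark_connected()
wrapper.disconnect()
wrapper.disconnect()        # second call is a no-op
peer_data = right.recv(16)  # empty read: the peer sees end-of-stream
right.close()
```

The other side of the hunk calls `shutdown()` unconditionally inside the same `try`/`except socket.error`, which behaves the same for connected sockets and relies on the exception handler for unconnected ones.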

View File

@@ -359,198 +359,6 @@ class TestExtractAuthProxies:
assert fetch.extract_auth_proxies('just some text') == []
class TestExtractAuthProxiesShortCircuit:
"""Tests for extract_auth_proxies() short-circuit on missing @."""
def test_no_at_sign_returns_empty(self):
"""Content without @ skips regex entirely."""
content = '1.2.3.4:8080 socks5://5.6.7.8:1080 plain text'
assert fetch.extract_auth_proxies(content) == []
def test_at_sign_still_extracts(self):
"""Content with @ still finds auth proxies."""
content = 'user:pass@1.2.3.4:8080'
result = fetch.extract_auth_proxies(content)
assert len(result) == 1
assert result[0][0] == 'user:pass@1.2.3.4:8080'
def test_at_sign_no_match_returns_empty(self):
"""Content with @ but no auth proxy pattern returns empty."""
content = 'email@example.com has no proxy'
assert fetch.extract_auth_proxies(content) == []
class TestExtractProxiesFromTable:
"""Tests for extract_proxies_from_table() with precompiled regexes."""
def test_no_table_returns_empty(self):
"""Plain text without <table> returns empty."""
content = '1.2.3.4:8080\n5.6.7.8:3128\n'
assert fetch.extract_proxies_from_table(content) == []
def test_simple_table(self):
"""Basic HTML table with IP/Port columns is parsed."""
content = '''
<table>
<tr><th>IP</th><th>Port</th><th>Type</th></tr>
<tr><td>1.2.3.4</td><td>8080</td><td>HTTP</td></tr>
<tr><td>5.6.7.8</td><td>1080</td><td>SOCKS5</td></tr>
</table>
'''
result = fetch.extract_proxies_from_table(content)
assert len(result) == 2
addrs = [r[0] for r in result]
assert '1.2.3.4:8080' in addrs
assert '5.6.7.8:1080' in addrs
def test_uppercase_table_tag(self):
"""<TABLE> (uppercase) is also detected."""
content = '''
<TABLE>
<TR><TH>IP</TH><TH>Port</TH></TR>
<TR><TD>1.2.3.4</TD><TD>8080</TD></TR>
</TABLE>
'''
result = fetch.extract_proxies_from_table(content)
assert len(result) == 1
def test_empty_table(self):
"""Table with headers but no data rows returns empty."""
content = '''
<table>
<tr><th>IP</th><th>Port</th></tr>
</table>
'''
result = fetch.extract_proxies_from_table(content)
assert result == []
class TestExtractProxiesFromJson:
"""Tests for extract_proxies_from_json() short-circuit."""
def test_no_braces_returns_empty(self):
"""Content without { or [ skips JSON parsing."""
content = '1.2.3.4:8080\n5.6.7.8:3128\n'
assert fetch.extract_proxies_from_json(content) == []
def test_json_array_of_objects(self):
"""JSON array with ip/port objects is parsed."""
content = '[{"ip": "1.2.3.4", "port": 8080}]'
result = fetch.extract_proxies_from_json(content)
assert len(result) >= 1
addrs = [r[0] for r in result]
assert '1.2.3.4:8080' in addrs
def test_json_array_of_strings(self):
"""JSON array of ip:port strings is parsed."""
content = '["1.2.3.4:8080", "5.6.7.8:3128"]'
result = fetch.extract_proxies_from_json(content)
addrs = [r[0] for r in result]
assert '1.2.3.4:8080' in addrs
assert '5.6.7.8:3128' in addrs
def test_plain_html_skips_json(self):
"""HTML without JSON delimiters is handled without raising."""
content = '<html><body>1.2.3.4:8080</body></html>'
# The content contains no { or [, so JSON parsing should be skipped.
# Only the return type is asserted here; the list may be empty.
result = fetch.extract_proxies_from_json(content)
assert isinstance(result, list)
class TestExtractProxiesWithHints:
"""Tests for extract_proxies_with_hints()."""
def test_proto_before_ip(self):
"""Protocol keyword before IP:PORT is detected."""
content = 'socks5 1.2.3.4:8080'
result = fetch.extract_proxies_with_hints(content)
assert '1.2.3.4:8080' in result
assert result['1.2.3.4:8080'] == 'socks5'
def test_proto_after_ip(self):
"""Protocol keyword after IP:PORT is detected."""
content = '1.2.3.4:8080 socks5'
result = fetch.extract_proxies_with_hints(content)
assert '1.2.3.4:8080' in result
def test_no_hints_returns_empty(self):
"""Plain IP:PORT without protocol hints returns empty."""
content = '1.2.3.4:8080'
result = fetch.extract_proxies_with_hints(content)
assert result == {}
class TestExtractProxiesIntegration:
"""Integration tests for extract_proxies() combining all extractors."""
def test_plain_text_proxy_list(self):
"""Plain text IP:PORT list extracts correctly."""
content = '1.2.3.4:8080\n5.6.7.8:3128\n9.10.11.12:1080\n'
result = fetch.extract_proxies(content, filter_known=False)
addrs = [r[0] for r in result]
assert '1.2.3.4:8080' in addrs
assert '5.6.7.8:3128' in addrs
assert '9.10.11.12:1080' in addrs
def test_auth_proxies_extracted(self):
"""Auth proxies found in mixed content."""
content = 'user:pass@1.2.3.4:8080\n5.6.7.8:3128\n'
result = fetch.extract_proxies(content, filter_known=False)
addrs = [r[0] for r in result]
assert 'user:pass@1.2.3.4:8080' in addrs
assert '5.6.7.8:3128' in addrs
def test_html_table_extraction(self):
"""Proxies extracted from HTML table."""
content = '''
<table>
<tr><th>IP</th><th>Port</th></tr>
<tr><td>1.2.3.4</td><td>8080</td></tr>
</table>
'''
result = fetch.extract_proxies(content, filter_known=False)
addrs = [r[0] for r in result]
assert '1.2.3.4:8080' in addrs
def test_json_extraction(self):
"""Proxies extracted from JSON content."""
content = '[{"ip": "1.2.3.4", "port": 8080}]'
result = fetch.extract_proxies(content, filter_known=False)
addrs = [r[0] for r in result]
assert '1.2.3.4:8080' in addrs
def test_empty_content(self):
"""Empty content returns no proxies."""
result = fetch.extract_proxies('', filter_known=False)
assert result == []
def test_private_ips_filtered(self):
"""Private IPs are not returned."""
content = '10.0.0.1:8080\n192.168.1.1:3128\n1.2.3.4:8080\n'
result = fetch.extract_proxies(content, filter_known=False)
addrs = [r[0] for r in result]
assert '10.0.0.1:8080' not in addrs
assert '192.168.1.1:3128' not in addrs
assert '1.2.3.4:8080' in addrs
def test_proto_from_hints(self):
"""Protocol hints are picked up."""
content = 'socks5 1.2.3.4:8080\n'
result = fetch.extract_proxies(content, filter_known=False)
protos = {r[0]: r[1] for r in result}
assert protos.get('1.2.3.4:8080') == 'socks5'
def test_proto_from_arg(self):
"""Fallback proto from argument is used."""
content = '1.2.3.4:8080\n'
result = fetch.extract_proxies(content, filter_known=False, proto='socks4')
protos = {r[0]: r[1] for r in result}
assert protos.get('1.2.3.4:8080') == 'socks4'
class TestConfidenceScoring:
"""Tests for confidence score constants."""