Compare commits


13 Commits

Author SHA1 Message Date
Username
361b70ace9 dbs: expand seed sources to 111 URLs
All checks were successful
CI / validate (push) Successful in 20s
Add 21 new proxy source URLs: missing protocol variants from
existing repos, 4 new GitHub repos, openproxylist.xyz and
spys.me APIs, 5 web scraper targets, 2 Telegram channels.
2026-02-22 17:14:47 +01:00
Username
9c7b7ba070 add compose-based test runner for Python 2.7
All checks were successful
CI / validate (push) Successful in 20s
Dockerfile.test builds production image with pytest baked in.
compose.test.yml mounts source as volume for fast iteration.
Usage: podman-compose -f compose.test.yml run --rm test
2026-02-22 15:38:00 +01:00
Username
0669b38782 docs: update roadmap and tasklist with completed items 2026-02-22 15:37:54 +01:00
Username
6130b196b1 dbs: add SOCKS5-specific proxy sources
New sources: zevtyardt, UptimerBot, Anonym0usWork1221, ErcinDedeoglu,
proxy-list.download API, sockslist.us, mtpro.xyz, proxy-tools.com.
Addresses structural SOCKS5 coverage gap (78% HTTP in pool).
2026-02-22 15:37:50 +01:00
Username
ce2d28ab07 httpd: cache sqlite connections per-greenlet, lazy-load ASN, sharpen URL scoring
- threading.local() caches proxy_db and url_db per greenlet (eliminates
  ~2.7k redundant sqlite3.connect + PRAGMA calls per session on odin)
- ASN database now lazy-loaded on first lookup (defers ~3.6s startup cost)
- URL claim error penalty increased from 0.3*error(cap 2) to 0.5*error(cap 4)
  and stale penalty from 0.1*stale(cap 1) to 0.2*stale(cap 1.5) to reduce
  worker cycles wasted on erroring URLs (71% of 7,158 URLs erroring)
2026-02-22 15:37:43 +01:00
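The penalty change in this commit can be sketched as follows. This is only an illustration of the numbers quoted in the message: the function names, the use of plain counters as inputs, and the assumption that the two penalties are simply summed are all hypothetical and not taken from httpd.py.

```python
def old_claim_penalty(error_count, stale_count):
    """Penalty before the change: 0.3*error (cap 2) + 0.1*stale (cap 1)."""
    return min(0.3 * error_count, 2.0) + min(0.1 * stale_count, 1.0)


def claim_penalty(error_count, stale_count):
    """Penalty after the change: 0.5*error (cap 4) + 0.2*stale (cap 1.5).

    Assumption: both inputs are non-negative counters and the result is
    subtracted from a URL's claim score, so a larger value means the URL
    is claimed less often.
    """
    error_penalty = min(0.5 * error_count, 4.0)
    stale_penalty = min(0.2 * stale_count, 1.5)
    return error_penalty + stale_penalty
```

Under these assumptions a URL with many errors and many stale fetches tops out at a penalty of 5.5 instead of 3.0, which is what pushes persistently erroring URLs further down the claim order.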
Username
93eb395727 docs: update roadmap, todo, and add tasklist
All checks were successful
CI / validate (push) Successful in 21s
Restructure roadmap into phases. Clean up todo as intake buffer.
Add execution tasklist with prioritized items.
2026-02-22 13:58:37 +01:00
Username
f9d237fe0d httpd: add protocol-aware source weighting
Boost SOCKS sources in claim_urls scoring when SOCKS proxies
are underrepresented (<40% of pool). Dynamic 0-1.0 boost based
on current protocol distribution.
2026-02-22 13:58:32 +01:00
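One way to read "dynamic 0-1.0 boost" is a linear ramp that is zero at the 40% threshold and reaches 1.0 when the pool contains no SOCKS proxies at all. The sketch below assumes exactly that; the ramp shape, the function name, and the shape of the input dict are hypothetical and may differ from what claim_urls actually does.

```python
SOCKS_TARGET_SHARE = 0.40  # threshold quoted in the commit message


def socks_boost(protocol_counts):
    """Return a 0.0-1.0 boost for SOCKS sources.

    protocol_counts maps a protocol name ('http', 'socks4', 'socks5')
    to the number of working proxies of that protocol (assumed input).
    """
    total = sum(protocol_counts.values())
    if total == 0:
        return 0.0  # empty pool: no distribution to correct yet
    socks = sum(count for proto, count in protocol_counts.items()
                if proto.startswith('socks'))
    share = float(socks) / total
    if share >= SOCKS_TARGET_SHARE:
        return 0.0
    # Linear ramp (assumption): 0.0 at the threshold, 1.0 with no SOCKS at all.
    return (SOCKS_TARGET_SHARE - share) / SOCKS_TARGET_SHARE
```

With a pool that is 78% HTTP, as reported in the SOCKS5 source commit, this ramp would yield a boost of roughly 0.45.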
Username
0f1fe981ef dbs: expand seed sources from 37 to 100+
Add GitHub raw lists, API endpoints, web scrapers, and Telegram
channels. Extra SOCKS5 sources to address protocol imbalance.
2026-02-22 13:58:26 +01:00
Username
0a53e4457f rocksock: skip shutdown on never-connected sockets
Track connection state with _connected flag. Only call
socket.shutdown() on successfully connected sockets.
Saves ~39s/session on workers (974k disconnect calls).
2026-02-22 13:58:20 +01:00
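The idea behind the _connected flag can be sketched with a small wrapper around the standard socket module. The class and method names here are hypothetical and are not rocksock's API; only the flag name comes from the commit message.

```python
import socket


class TrackedSocket(object):
    """Minimal wrapper that remembers whether connect() ever succeeded."""

    def __init__(self):
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self._connected = False

    def connect(self, address, timeout=10.0):
        self._sock.settimeout(timeout)
        self._sock.connect(address)  # raises on failure, flag stays False
        self._connected = True

    def disconnect(self):
        # shutdown() on a socket that never connected only raises an
        # error, so skip the call entirely in that case.
        if self._connected:
            try:
                self._sock.shutdown(socket.SHUT_RDWR)
            except socket.error:
                pass  # peer may already have closed the connection
            self._connected = False
        self._sock.close()
```

Calling disconnect() on an instance whose connect() failed or was never called goes straight to close(), which is the saving the commit describes for workers that mostly hit dead proxies.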
Username
2ea7eb41b7 tests: add extraction short-circuit and integration tests
All checks were successful
CI / validate (push) Successful in 19s
Cover short-circuit guards, table/JSON/hint extraction,
and full extract_proxies() integration (82 tests, all passing).
2026-02-22 13:50:34 +01:00
Username
98b232f3d3 fetch: add short-circuit guards to extraction functions
Skip expensive regex scans when content lacks required markers:
- extract_auth_proxies: skip if no '@' in content
- extract_proxies_from_table: skip if no '<table' tag
- extract_proxies_from_json: skip if no '{' or '['
- Hoist table regexes to module-level precompiled constants
2026-02-22 13:50:29 +01:00
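The guard pattern amounts to a cheap substring test in front of an expensive regex scan. The sketch below shows it for the table case with a hypothetical helper, find_tables, that is not part of fetch.py. Note one deliberate difference: the guard in this compare tests only for '<table' and '<TABLE', while the table regex is compiled with re.IGNORECASE, so a mixed-case tag such as '<Table' would be skipped by the guard. Lowercasing the content first, as done here, closes that gap at the cost of one string copy; whether that trade-off is worthwhile for PPF is an open question.

```python
import re

# Precompiled once at module level, as in the commit.
_TABLE_PATTERN = re.compile(r'<table[^>]*>(.*?)</table>',
                            re.IGNORECASE | re.DOTALL)


def find_tables(content):
    """Return the inner HTML of every table in content.

    Hypothetical helper: skips the regex scan when no table tag can be
    present, using a case-insensitive substring check.
    """
    if '<table' not in content.lower():
        return []
    return _TABLE_PATTERN.findall(content)
```

Plain-text proxy lists, which make up most of the seed sources, never reach the regex with this guard in place.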
Username
b300afed6c httpd: expose URL pipeline stats in /api/stats
All checks were successful
CI / validate (push) Successful in 19s
Add urls section with total/healthy/dead/erroring counts, fetch
activity, productive source count, aggregate yield, and top sources
ranked by working_ratio.
2026-02-22 11:53:57 +01:00
Username
eeadf656f5 httpd: add ASN enrichment for worker-reported proxies
All checks were successful
CI / validate (push) Successful in 20s
Load pyasn database in httpd and look up ASN when workers report
working proxies. Falls back to a pure-Python ipasn.dat reader when
the pyasn C extension is unavailable (Python 2.7 containers).
Backfills existing proxies with null ASN on startup.
2026-02-22 11:30:01 +01:00
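A pure-Python fallback of this kind can be built from a sorted list of prefix ranges and a binary search. The sketch below assumes the ipasn.dat text layout of one "CIDR<TAB>ASN" pair per line with ';' comment lines, IPv4 only, and non-overlapping prefixes; with nested prefixes it would only consult the entry with the greatest start address. The class and function names are hypothetical and this is not the reader from httpd.py.

```python
import bisect
import socket
import struct


def ip_to_int(ip):
    """Convert a dotted-quad IPv4 address to an unsigned 32-bit integer."""
    return struct.unpack('!I', socket.inet_aton(ip))[0]


class AsnTable(object):
    """Sorted (start, end, asn) ranges with bisect-based lookup."""

    def __init__(self, lines):
        entries = []
        for line in lines:
            line = line.strip()
            if not line or line.startswith(';'):
                continue  # skip blank lines and header comments
            cidr, asn = line.split()[:2]
            net, bits = cidr.split('/')
            start = ip_to_int(net)
            end = start + (1 << (32 - int(bits))) - 1
            entries.append((start, end, int(asn)))
        self._entries = sorted(entries)
        self._starts = [entry[0] for entry in self._entries]

    def lookup(self, ip):
        """Return the ASN covering ip, or None if no prefix matches."""
        addr = ip_to_int(ip)
        i = bisect.bisect_right(self._starts, addr) - 1
        if i < 0:
            return None
        start, end, asn = self._entries[i]
        return asn if addr <= end else None
```

For example, AsnTable(open(path)) followed by table.lookup('8.8.8.8') would return the ASN recorded for the covering prefix, or None when the address falls in a gap.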
10 changed files with 818 additions and 220 deletions

29
Dockerfile.test Normal file

@@ -0,0 +1,29 @@
FROM python:2.7-slim
WORKDIR /app
RUN sed -i 's/deb.debian.org/archive.debian.org/g' /etc/apt/sources.list && \
    sed -i 's/security.debian.org/archive.debian.org/g' /etc/apt/sources.list && \
    sed -i '/buster-updates/d' /etc/apt/sources.list && \
    echo 'deb http://archive.debian.org/debian-security buster/updates main' >> /etc/apt/sources.list && \
    apt-get update && \
    apt-get upgrade -y && \
    apt-get install -y --no-install-recommends gcc libc-dev && \
    rm -rf /var/lib/apt/lists/*
RUN pip install --upgrade "pip<21" "setuptools<45" "wheel<0.38"
COPY requirements.txt .
RUN pip install -r requirements.txt || true
RUN pip install pytest
RUN mkdir -p /app/data && \
    python -c "import pyasn" 2>/dev/null && \
    pyasn_util_download.py --latest && \
    pyasn_util_convert.py --single rib.*.bz2 /app/data/ipasn.dat && \
    rm -f rib.*.bz2 || \
    echo "pyasn database setup skipped"
RUN apt-get purge -y gcc libc-dev && apt-get autoremove -y || true
CMD ["python", "-m", "pytest", "tests/", "-v", "--tb=short"]

View File

@@ -1,69 +1,100 @@
# PPF Project Roadmap
# PPF Roadmap
## Project Purpose
PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework designed to:
1. **Discover** proxy addresses by crawling websites and search engines
2. **Validate** proxies through multi-target testing via Tor
3. **Maintain** a database of working proxies with protocol detection (SOCKS4/SOCKS5/HTTP)
## Architecture Overview
## Architecture
```
┌───────────────────────────────────────────────────────────────────┐
│                         PPF Architecture                          │
├───────────────────────────────────────────────────────────────────┤
│  ┌───────────────┐    ┌───────────────┐    ┌───────────────┐      │
│  │ scraper.py    │    │ ppf.py        │    │ proxywatchd   │      │
│  │               │    │               │    │               │      │
│  │ Searx query   │───>│ URL harvest   │───>│ Proxy test    │      │
│  │ URL finding   │    │ Proxy extract │    │ Validation    │      │
│  └───────────────┘    └───────────────┘    └───────────────┘      │
│          v                    v                    v              │
│  ┌─────────────────────────────────────────────────────────────┐  │
│  │ SQLite Databases                                            │  │
│  │ uris.db (URLs)     proxies.db (proxy list)                  │  │
│  └─────────────────────────────────────────────────────────────┘  │
│                                                                   │
│  ┌─────────────────────────────────────────────────────────────┐  │
│  │ Network Layer                                               │  │
│  │ rocksock.py ─── Tor SOCKS ─── Test Proxy ─── Target Server  │  │
│  └─────────────────────────────────────────────────────────────┘  │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘

            ┌──────────────────────────────────────────┐
            │ Odin (Master)                            │
            │ httpd.py ─ API + SSL-only verification   │
            │ proxywatchd.py ─ proxy recheck daemon    │
            │ SQLite: proxies.db, websites.db          │
            └──────────┬───────────────────────────────┘
                       │ WireGuard (10.200.1.0/24)
      ┌────────────────┼────────────────┐
      v                v                v
┌───────────┐    ┌───────────┐    ┌───────────┐
│ cassius   │    │ edge      │    │ sentinel  │
│ Worker    │    │ Worker    │    │ Worker    │
│ ppf.py    │    │ ppf.py    │    │ ppf.py    │
└───────────┘    └───────────┘    └───────────┘
```
Workers claim URLs, extract proxies, test them, report back.
Master verifies (SSL-only), serves API, coordinates distribution.
## Constraints
- **Python 2.7** compatibility required
- **Minimal external dependencies** (avoid adding new modules)
- Current dependencies: beautifulsoup4, pyasn, IP2Location
- Data files: IP2LOCATION-LITE-DB1.BIN (country), ipasn.dat (ASN)
- Python 2.7 runtime (container-based)
- Minimal external dependencies
- All traffic via Tor
---
## Phase 1: Performance and Quality (current)
Profiling-driven optimizations and source pipeline hardening.
| Item | Status | Description |
|------|--------|-------------|
| Extraction short-circuits | done | Guard clauses in fetch.py extractors |
| Skip shutdown on failed sockets | done | Track _connected flag, skip shutdown on dead sockets |
| SQLite connection reuse (odin) | done | Per-greenlet cached handles via threading.local |
| Lazy-load ASN database | done | Defer ipasn.dat parsing to first lookup |
| Add more seed sources (100+) | done | Expanded to 120+ URLs with SOCKS5-specific sources |
| Protocol-aware source weighting | done | Dynamic SOCKS boost in claim_urls scoring |
| Sharpen error penalty in URL scoring | done | Reduce erroring URL claim frequency |
## Phase 2: Proxy Diversity and Consumer API
Address customer-reported quality gaps.
| Item | Status | Description |
|------|--------|-------------|
| ASN diversity scoring | pending | Deprioritize over-represented ASNs in testing |
| Graduated recheck intervals | pending | Fresh proxies rechecked more often than stale |
| API filters (proto/country/ASN/latency) | pending | Consumer-facing query parameters on /proxies |
| Latency-based ranking | pending | Expose latency percentiles per proxy |
## Phase 3: Self-Expanding Source Pool
Worker-driven link discovery from productive pages.
| Item | Status | Description |
|------|--------|-------------|
| Link extraction from productive pages | pending | Parse HTML for links when page yields proxies |
| Report discovered URLs to master | pending | New endpoint for worker URL submissions |
| Conditional discovery | pending | Only extract links from confirmed-productive pages |
## Phase 4: Long-Term
| Item | Status | Description |
|------|--------|-------------|
| Python 3 migration | deferred | Unblocks modern deps, security patches, pyasn native |
| Worker trust scoring | pending | Activate spot-check verification framework |
| Dynamic target pool | pending | Auto-discover and rotate validation targets |
| Geographic target spread | pending | Ensure targets span multiple regions |
---
## Completed
### Target Management
| Task | Description | File(s) |
|------|-------------|---------|
| Target health tracking | Cooldown-based health tracking for all target pools (head, SSL, IRC, judges) | stats.py, proxywatchd.py |
| MITM field in proxy list | Expose mitm boolean in JSON proxy list endpoints | httpd.py |
---
## Open Work
### Target Management
| Task | Description | File(s) |
|------|-------------|---------|
| Dynamic target pool | Auto-discover and rotate validation targets | proxywatchd.py |
| Geographic target spread | Ensure targets span multiple regions | config.py |
| Item | Date | Description |
|------|------|-------------|
| Sharpen URL error penalty | 2026-02-22 | error*0.5 cap 4.0 + stale*0.2 cap 1.5 |
| SOCKS5 source expansion | 2026-02-22 | Added 10 new SOCKS5-specific sources |
| SQLite connection reuse | 2026-02-22 | Per-greenlet cached handles via threading.local |
| Lazy-load ASN database | 2026-02-22 | Deferred ipasn.dat to first lookup |
| Socket shutdown skip | 2026-02-22 | _connected flag, skip shutdown on dead sockets |
| Protocol-aware weighting | 2026-02-22 | Dynamic SOCKS boost in claim_urls scoring |
| Seed sources expanded | 2026-02-22 | 37 -> 120+ URLs |
| last_seen freshness fix | 2026-02-22 | Watchd updates last_seen on verification |
| Periodic re-seeding | 2026-02-22 | Reset errored sources every 6h |
| ASN enrichment | 2026-02-22 | Pure-Python ipasn.dat reader + backfill |
| URL pipeline stats | 2026-02-22 | /api/stats exposes source health metrics |
| Extraction short-circuits | 2026-02-22 | Guard clauses + precompiled table regexes |
| Target health tracking | prior | Cooldown-based health for all target pools |
| MITM field in proxy list | prior | Expose mitm boolean in JSON endpoints |
| V1 worker protocol removal | prior | Cleaned up legacy --worker code path |
---
@@ -71,31 +102,12 @@ PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework design
| File | Purpose |
|------|---------|
| ppf.py | Main URL harvester daemon |
| ppf.py | URL harvester, worker main loop |
| proxywatchd.py | Proxy validation daemon |
| scraper.py | Searx search integration |
| fetch.py | HTTP fetching with proxy support |
| dbs.py | Database schema and inserts |
| mysqlite.py | SQLite wrapper |
| rocksock.py | Socket/proxy abstraction (3rd party) |
| http2.py | HTTP client implementation |
| httpd.py | Web dashboard and REST API server |
| fetch.py | HTTP fetching, proxy extraction |
| httpd.py | API server, worker coordination |
| dbs.py | Database schema, seed sources |
| config.py | Configuration management |
| comboparse.py | Config/arg parser framework |
| soup_parser.py | BeautifulSoup wrapper |
| misc.py | Utilities (timestamp, logging) |
| export.py | Proxy export CLI tool |
| engines.py | Search engine implementations |
| connection_pool.py | Tor connection pooling |
| network_stats.py | Network statistics tracking |
| dns.py | DNS resolution with caching |
| mitm.py | MITM certificate detection |
| job.py | Priority job queue |
| static/dashboard.js | Dashboard frontend logic |
| static/dashboard.html | Dashboard HTML template |
| tools/lib/ppf-common.sh | Shared ops library (hosts, wrappers, colors) |
| tools/ppf-deploy | Deploy wrapper (validation + playbook) |
| tools/ppf-logs | View container logs |
| tools/ppf-service | Container lifecycle management |
| tools/playbooks/deploy.yml | Ansible deploy playbook |
| tools/playbooks/inventory.ini | Host inventory (WireGuard IPs) |
| rocksock.py | Socket/proxy abstraction |
| http2.py | HTTP client implementation |
| tools/ppf-deploy | Deployment wrapper |

34
TASKLIST.md Normal file

@@ -0,0 +1,34 @@
# PPF Tasklist
Active execution queue. Ordered by priority.
---
## In Progress
| # | Task | File(s) | Notes |
|---|------|---------|-------|
## Queued
| # | Task | File(s) | Notes |
|---|------|---------|-------|
| 12 | API filters on /proxies (proto/country/ASN) | httpd.py | Consumer query params |
| 8 | Graduated recheck intervals | proxywatchd.py | Fresh proxies checked more often |
## Done
| # | Task | Date |
|---|------|------|
| - | Sharpen URL error penalty scoring | 2026-02-22 |
| - | Add SOCKS5-specific sources (10 new) | 2026-02-22 |
| 3 | Lazy-load ASN database | 2026-02-22 |
| 2 | SQLite connection reuse on odin | 2026-02-22 |
| 1 | Skip socket.shutdown on failed connections | 2026-02-22 |
| 4 | Add more seed sources (100+) | 2026-02-22 |
| 6 | Protocol-aware source weighting | 2026-02-22 |
| - | Extraction short-circuits | 2026-02-22 |
| - | last_seen freshness fix | 2026-02-22 |
| - | Periodic re-seeding | 2026-02-22 |
| - | ASN enrichment | 2026-02-22 |
| - | URL pipeline stats | 2026-02-22 |

82
TODO.md

@@ -1,83 +1,35 @@
# PPF TODO
## Optimization
### [ ] JSON Stats Response Caching
- Cache serialized JSON response with short TTL (1-2s)
- Only regenerate when underlying stats change
- Use ETag/If-None-Match for client-side caching
- Savings: ~7-9s/hour. Low priority, only matters with frequent dashboard access.
### [ ] Object Pooling for Test States
- Pool ProxyTestState and TargetTestJob, reset and reuse
- Savings: ~11-15s/hour. **Not recommended** - high effort, medium risk, modest gain.
### [ ] SQLite Connection Reuse
- Persistent connection per thread with health checks
- Savings: ~0.3s/hour. **Not recommended** - negligible benefit.
Intake buffer. Items refined here move to TASKLIST.md.
---
## Dashboard
### [ ] Performance
- Cache expensive DB queries (top countries, protocol breakdown)
- Lazy-load historical data (only when scrolled into view)
- WebSocket option for push updates (reduce polling overhead)
- Configurable refresh interval via URL param or localStorage
### [ ] Features
- Historical graphs (24h, 7d) using stats_history table
- Per-ASN performance analysis
- Alert thresholds (success rate < X%, MITM detected)
- Mobile-responsive improvements
---
- [ ] Cache expensive DB queries (top countries, protocol breakdown)
- [ ] Historical graphs (24h, 7d) using stats_history table
- [ ] Per-ASN performance analysis
- [ ] Alert thresholds (success rate < X%, MITM detected)
- [ ] WebSocket push updates (reduce polling overhead)
- [ ] Mobile-responsive improvements
## Memory
- [ ] Lock consolidation - reduce per-proxy locks (260k LockType objects)
- [ ] Leaner state objects - reduce dict/list count per job
- [ ] Lock consolidation (260k LockType objects at scale)
- [ ] Leaner state objects per job
Memory scales linearly with queue (~4.5 KB/job). No leaks detected.
Optimize only if memory becomes a constraint.
Memory scales ~4.5 KB/job. No leaks detected. Optimize only if constrained.
---
## Source Pipeline
## Deprecation
### [x] Remove V1 worker protocol
Completed. Removed `--worker` flag, `worker_main()`, `claim_work()`,
`submit_results()`, `/api/work`, `/api/results`, and related config
options. `--worker` now routes to the URL-driven protocol.
---
- [ ] PasteBin/GitHub API scrapers for proxy lists
- [ ] Telegram channel scrapers (beyond t.me/s/ HTML)
- [ ] Source quality decay tracking (flag sources going stale)
- [ ] Deduplication of sources across different URL forms
## Known Issues
### [!] Podman Container Metadata Disappears
`podman ps -a` shows empty even though process is running. Service functions
correctly despite missing metadata. Monitor via `ss -tlnp`, `ps aux`, or
`curl localhost:8081/health`. Low impact.
---
## Container Debugging Checklist
```
1. Check for orphans: ps aux | grep -E "[p]rocess_name"
2. Check port conflicts: ss -tlnp | grep PORT
3. Run foreground: podman run --rm (no -d) to see output
4. Check podman state: podman ps -a
5. Clean stale: pkill -9 -f "pattern" && podman rm -f -a
6. Verify deps: config files, data dirs, volumes exist
7. Check logs: podman logs container_name 2>&1 | tail -50
8. Health check: curl -sf http://localhost:PORT/health
```
`podman ps -a` shows empty even though process is running.
Monitor via `ss -tlnp`, `ps aux`, or `curl localhost:8081/health`.

18
compose.test.yml Normal file

@@ -0,0 +1,18 @@
# PPF test runner (Python 2.7, production deps + pytest)
#
# Mounts source and tests as volumes so no rebuild needed between runs.
#
# Usage:
# podman-compose -f compose.test.yml run --rm test
# podman-compose -f compose.test.yml run --rm test python -m pytest tests/test_fetch.py -v
services:
  test:
    container_name: ppf-test
    build:
      context: .
      dockerfile: Dockerfile.test
    volumes:
      - .:/app:ro,Z
    working_dir: /app
    command: python -m pytest tests/ -v --tb=short

176
dbs.py

@@ -582,34 +582,107 @@ def insert_urls(urls, search, sqlite):
# Known proxy list sources (GitHub raw lists, APIs)
PROXY_SOURCES = [
# --- GitHub raw lists (sorted by update frequency) ---
# TheSpeedX/PROXY-List - large, hourly updates
'https://raw.githubusercontent.com/TheSpeedX/PROXY-List/master/http.txt',
'https://raw.githubusercontent.com/TheSpeedX/PROXY-List/master/socks4.txt',
'https://raw.githubusercontent.com/TheSpeedX/PROXY-List/master/socks5.txt',
# clarketm/proxy-list - curated, daily
'https://raw.githubusercontent.com/clarketm/proxy-list/master/proxy-list-raw.txt',
# monosans/proxy-list - hourly updates
'https://raw.githubusercontent.com/monosans/proxy-list/main/proxies/http.txt',
'https://raw.githubusercontent.com/monosans/proxy-list/main/proxies/socks4.txt',
'https://raw.githubusercontent.com/monosans/proxy-list/main/proxies/socks5.txt',
# prxchk/proxy-list - 10 min updates
'https://raw.githubusercontent.com/prxchk/proxy-list/main/http.txt',
'https://raw.githubusercontent.com/prxchk/proxy-list/main/socks4.txt',
'https://raw.githubusercontent.com/prxchk/proxy-list/main/socks5.txt',
# jetkai/proxy-list - 10 min updates
'https://raw.githubusercontent.com/jetkai/proxy-list/main/online-proxies/txt/proxies.txt',
# roosterkid/openproxylist
'https://raw.githubusercontent.com/roosterkid/openproxylist/main/HTTPS_RAW.txt',
'https://raw.githubusercontent.com/roosterkid/openproxylist/main/SOCKS4_RAW.txt',
'https://raw.githubusercontent.com/roosterkid/openproxylist/main/SOCKS5_RAW.txt',
# ShiftyTR/Proxy-List
'https://raw.githubusercontent.com/ShiftyTR/Proxy-List/master/http.txt',
'https://raw.githubusercontent.com/ShiftyTR/Proxy-List/master/socks4.txt',
'https://raw.githubusercontent.com/ShiftyTR/Proxy-List/master/socks5.txt',
# hookzof/socks5_list - hourly, SOCKS5 focused
'https://raw.githubusercontent.com/hookzof/socks5_list/master/proxy.txt',
# mmpx12/proxy-list
'https://raw.githubusercontent.com/mmpx12/proxy-list/master/http.txt',
'https://raw.githubusercontent.com/mmpx12/proxy-list/master/socks4.txt',
'https://raw.githubusercontent.com/mmpx12/proxy-list/master/socks5.txt',
# proxyscrape API
'https://api.proxyscrape.com/v2/?request=displayproxies&protocol=http&timeout=10000&country=all',
'https://api.proxyscrape.com/v2/?request=displayproxies&protocol=socks4&timeout=10000&country=all',
'https://api.proxyscrape.com/v2/?request=displayproxies&protocol=socks5&timeout=10000&country=all',
# ShiftyTR/Proxy-List
'https://raw.githubusercontent.com/ShiftyTR/Proxy-List/master/http.txt',
'https://raw.githubusercontent.com/ShiftyTR/Proxy-List/master/socks4.txt',
'https://raw.githubusercontent.com/ShiftyTR/Proxy-List/master/socks5.txt',
# roosterkid/openproxylist
'https://raw.githubusercontent.com/roosterkid/openproxylist/main/HTTPS_RAW.txt',
'https://raw.githubusercontent.com/roosterkid/openproxylist/main/SOCKS4_RAW.txt',
'https://raw.githubusercontent.com/roosterkid/openproxylist/main/SOCKS5_RAW.txt',
# clarketm/proxy-list - curated, daily
'https://raw.githubusercontent.com/clarketm/proxy-list/master/proxy-list-raw.txt',
# officialputuid/KangProxy - 4-6 hour updates
'https://raw.githubusercontent.com/officialputuid/KangProxy/KangProxy/http/http.txt',
'https://raw.githubusercontent.com/officialputuid/KangProxy/KangProxy/https/https.txt',
'https://raw.githubusercontent.com/officialputuid/KangProxy/KangProxy/socks4/socks4.txt',
'https://raw.githubusercontent.com/officialputuid/KangProxy/KangProxy/socks5/socks5.txt',
# iplocate/free-proxy-list - 30 min updates
'https://raw.githubusercontent.com/iplocate/free-proxy-list/main/protocols/http.txt',
'https://raw.githubusercontent.com/iplocate/free-proxy-list/main/protocols/socks4.txt',
'https://raw.githubusercontent.com/iplocate/free-proxy-list/main/protocols/socks5.txt',
# ErcinDedeworken/proxy-list - hourly
'https://raw.githubusercontent.com/ErcinDedeworken/proxy-list/main/proxy-list/data.txt',
# MuRongPIG/Proxy-Master - 10 min updates
'https://raw.githubusercontent.com/MuRongPIG/Proxy-Master/main/http.txt',
'https://raw.githubusercontent.com/MuRongPIG/Proxy-Master/main/socks4.txt',
'https://raw.githubusercontent.com/MuRongPIG/Proxy-Master/main/socks5.txt',
# zloi-user/hideip.me - hourly
'https://raw.githubusercontent.com/zloi-user/hideip.me/main/http.txt',
'https://raw.githubusercontent.com/zloi-user/hideip.me/main/socks4.txt',
'https://raw.githubusercontent.com/zloi-user/hideip.me/main/socks5.txt',
# FLAVIEN-music/proxy-list - 30 min updates
'https://raw.githubusercontent.com/FLAVIEN-music/proxy-list/main/proxies/http.txt',
'https://raw.githubusercontent.com/FLAVIEN-music/proxy-list/main/proxies/socks4.txt',
'https://raw.githubusercontent.com/FLAVIEN-music/proxy-list/main/proxies/socks5.txt',
# Zaeem20/FREE_PROXIES_LIST - 30 min updates
'https://raw.githubusercontent.com/Zaeem20/FREE_PROXIES_LIST/master/http.txt',
'https://raw.githubusercontent.com/Zaeem20/FREE_PROXIES_LIST/master/https.txt',
'https://raw.githubusercontent.com/Zaeem20/FREE_PROXIES_LIST/master/socks4.txt',
'https://raw.githubusercontent.com/Zaeem20/FREE_PROXIES_LIST/master/socks5.txt',
# r00tee/Proxy-List - hourly
'https://raw.githubusercontent.com/r00tee/Proxy-List/main/Https.txt',
'https://raw.githubusercontent.com/r00tee/Proxy-List/main/Socks4.txt',
'https://raw.githubusercontent.com/r00tee/Proxy-List/main/Socks5.txt',
# casals-ar/proxy-list
'https://raw.githubusercontent.com/casals-ar/proxy-list/main/http',
'https://raw.githubusercontent.com/casals-ar/proxy-list/main/socks4',
'https://raw.githubusercontent.com/casals-ar/proxy-list/main/socks5',
# yemixzy/proxy-list
'https://raw.githubusercontent.com/yemixzy/proxy-list/main/proxies/http.txt',
'https://raw.githubusercontent.com/yemixzy/proxy-list/main/proxies/socks4.txt',
'https://raw.githubusercontent.com/yemixzy/proxy-list/main/proxies/socks5.txt',
# opsxcq/proxy-list
'https://raw.githubusercontent.com/opsxcq/proxy-list/master/list.txt',
# im-razvan/proxy_list - 10 min updates
'https://raw.githubusercontent.com/im-razvan/proxy_list/main/http.txt',
'https://raw.githubusercontent.com/im-razvan/proxy_list/main/socks4.txt',
'https://raw.githubusercontent.com/im-razvan/proxy_list/main/socks5.txt',
# zevtyardt/proxy-list - daily SOCKS5
'https://raw.githubusercontent.com/zevtyardt/proxy-list/main/socks5.txt',
# UptimerBot/proxy-list - 15 min updates
'https://raw.githubusercontent.com/UptimerBot/proxy-list/main/proxies/socks5.txt',
# Anonym0usWork1221/Free-Proxies
'https://raw.githubusercontent.com/Anonym0usWork1221/Free-Proxies/main/proxy_files/https_proxies.txt',
'https://raw.githubusercontent.com/Anonym0usWork1221/Free-Proxies/main/proxy_files/socks4_proxies.txt',
'https://raw.githubusercontent.com/Anonym0usWork1221/Free-Proxies/main/proxy_files/socks5_proxies.txt',
# ErcinDedeoglu/proxies - hourly
'https://raw.githubusercontent.com/ErcinDedeoglu/proxies/main/proxies/http.txt',
'https://raw.githubusercontent.com/ErcinDedeoglu/proxies/main/proxies/socks4.txt',
'https://raw.githubusercontent.com/ErcinDedeoglu/proxies/main/proxies/socks5.txt',
# dinoz0rg/proxy-list - daily, all protocols
'https://raw.githubusercontent.com/dinoz0rg/proxy-list/main/all.txt',
# elliottophellia/proxylist - SOCKS5
'https://raw.githubusercontent.com/elliottophellia/proxylist/master/results/socks5/global/socks5_len.txt',
# gfpcom/free-proxy-list - SOCKS5
'https://raw.githubusercontent.com/gfpcom/free-proxy-list/main/socks5.txt',
# databay-labs/free-proxy-list - SOCKS5
'https://raw.githubusercontent.com/databay-labs/free-proxy-list/master/socks5.txt',
# --- GitHub Pages / CDN hosted ---
# proxifly/free-proxy-list - 5 min updates (jsDelivr CDN)
'https://cdn.jsdelivr.net/gh/proxifly/free-proxy-list@main/proxies/protocols/http/data.txt',
'https://cdn.jsdelivr.net/gh/proxifly/free-proxy-list@main/proxies/protocols/socks4/data.txt',
@@ -618,24 +691,71 @@ PROXY_SOURCES = [
'https://vakhov.github.io/fresh-proxy-list/http.txt',
'https://vakhov.github.io/fresh-proxy-list/socks4.txt',
'https://vakhov.github.io/fresh-proxy-list/socks5.txt',
# prxchk/proxy-list - 10 min updates
'https://raw.githubusercontent.com/prxchk/proxy-list/main/http.txt',
'https://raw.githubusercontent.com/prxchk/proxy-list/main/socks4.txt',
'https://raw.githubusercontent.com/prxchk/proxy-list/main/socks5.txt',
# sunny9577/proxy-scraper - 3 hour updates (GitHub Pages)
'https://sunny9577.github.io/proxy-scraper/generated/http_proxies.txt',
'https://sunny9577.github.io/proxy-scraper/generated/socks4_proxies.txt',
'https://sunny9577.github.io/proxy-scraper/generated/socks5_proxies.txt',
# officialputuid/KangProxy - 4-6 hour updates
'https://raw.githubusercontent.com/officialputuid/KangProxy/KangProxy/http/http.txt',
'https://raw.githubusercontent.com/officialputuid/KangProxy/KangProxy/socks4/socks4.txt',
'https://raw.githubusercontent.com/officialputuid/KangProxy/KangProxy/socks5/socks5.txt',
# hookzof/socks5_list - hourly updates
'https://raw.githubusercontent.com/hookzof/socks5_list/master/proxy.txt',
# iplocate/free-proxy-list - 30 min updates
'https://raw.githubusercontent.com/iplocate/free-proxy-list/main/protocols/http.txt',
'https://raw.githubusercontent.com/iplocate/free-proxy-list/main/protocols/socks4.txt',
'https://raw.githubusercontent.com/iplocate/free-proxy-list/main/protocols/socks5.txt',
# --- API endpoints ---
# proxyscrape
'https://api.proxyscrape.com/v2/?request=displayproxies&protocol=http&timeout=10000&country=all',
'https://api.proxyscrape.com/v2/?request=displayproxies&protocol=socks4&timeout=10000&country=all',
'https://api.proxyscrape.com/v2/?request=displayproxies&protocol=socks5&timeout=10000&country=all',
# proxy-list.download - SOCKS4/SOCKS5 API
'https://www.proxy-list.download/api/v1/get?type=socks5',
'https://www.proxy-list.download/api/v1/get?type=socks4',
# openproxylist.xyz - plain text
'https://api.openproxylist.xyz/http.txt',
'https://api.openproxylist.xyz/socks4.txt',
'https://api.openproxylist.xyz/socks5.txt',
# spys.me - plain text, 30 min updates
'http://spys.me/proxy.txt',
'http://spys.me/socks.txt',
# --- Web scrapers (HTML pages) ---
# spys.one - mixed protocols, requires parsing
'https://spys.one/en/free-proxy-list/',
'https://spys.one/en/socks-proxy-list/',
'https://spys.one/en/https-ssl-proxy/',
# free-proxy-list.net
'https://free-proxy-list.net/',
'https://www.sslproxies.org/',
'https://www.socks-proxy.net/',
# sockslist.us - SOCKS5 focused
'https://sockslist.us/',
# mtpro.xyz - SOCKS5, updated every 5 min
'https://mtpro.xyz/socks5',
# proxy-tools.com - SOCKS5 filtered
'https://proxy-tools.com/proxy/socks5',
# hidemy.name - all protocols, paginated
'https://hide.mn/en/proxy-list/',
# advanced.name - SOCKS5 filtered
'https://advanced.name/freeproxy?type=socks5',
# proxynova.com - by country
'https://www.proxynova.com/proxy-server-list/',
# freeproxy.world - SOCKS5 filtered
'https://www.freeproxy.world/?type=socks5',
# proxydb.net - all protocols
'http://proxydb.net/',
# geonode
'https://proxylist.geonode.com/api/proxy-list?limit=500&page=1&sort_by=lastChecked&sort_type=desc&protocols=http',
'https://proxylist.geonode.com/api/proxy-list?limit=500&page=1&sort_by=lastChecked&sort_type=desc&protocols=socks4',
'https://proxylist.geonode.com/api/proxy-list?limit=500&page=1&sort_by=lastChecked&sort_type=desc&protocols=socks5',
# openproxy.space
'https://openproxy.space/list/http',
'https://openproxy.space/list/socks4',
'https://openproxy.space/list/socks5',
# --- Telegram channels (public HTML view) ---
'https://t.me/s/spys_one',
'https://t.me/s/proxyfree1',
'https://t.me/s/proxylist4free',
'https://t.me/s/proxy_lists',
'https://t.me/s/Proxies4ForYou',
]

View File

@@ -221,6 +221,10 @@ def extract_auth_proxies(content):
    """
    proxies = []
    # Short-circuit: auth proxies always contain @
    if '@' not in content:
        return proxies
    # IPv4 auth proxies
    for match in AUTH_PROXY_PATTERN.finditer(content):
        proto_str, user, passwd, ip, port = match.groups()
@@ -256,6 +260,12 @@ TABLE_PORT_HEADERS = ('port',)
TABLE_PROTO_HEADERS = ('type', 'protocol', 'proto', 'scheme')
_TABLE_PATTERN = re.compile(r'<table[^>]*>(.*?)</table>', re.IGNORECASE | re.DOTALL)
_ROW_PATTERN = re.compile(r'<tr[^>]*>(.*?)</tr>', re.IGNORECASE | re.DOTALL)
_CELL_PATTERN = re.compile(r'<t[hd][^>]*>(.*?)</t[hd]>', re.IGNORECASE | re.DOTALL)
_TAG_STRIP = re.compile(r'<[^>]+>')
def extract_proxies_from_table(content):
    """Extract proxies from HTML tables with IP/Port/Protocol columns.
@@ -269,26 +279,23 @@ def extract_proxies_from_table(content):
    """
    proxies = []
    # Simple regex-based table parsing (works without BeautifulSoup)
    # Find all tables
    table_pattern = re.compile(r'<table[^>]*>(.*?)</table>', re.IGNORECASE | re.DOTALL)
    row_pattern = re.compile(r'<tr[^>]*>(.*?)</tr>', re.IGNORECASE | re.DOTALL)
    cell_pattern = re.compile(r'<t[hd][^>]*>(.*?)</t[hd]>', re.IGNORECASE | re.DOTALL)
    tag_strip = re.compile(r'<[^>]+>')
    # Short-circuit: no HTML tables in plain text content
    if '<table' not in content and '<TABLE' not in content:
        return proxies
    for table_match in table_pattern.finditer(content):
    for table_match in _TABLE_PATTERN.finditer(content):
        table_html = table_match.group(1)
        rows = row_pattern.findall(table_html)
        rows = _ROW_PATTERN.findall(table_html)
        if not rows:
            continue
        # Parse header row to find column indices
        ip_col = port_col = proto_col = -1
        header_row = rows[0]
        headers = cell_pattern.findall(header_row)
        headers = _CELL_PATTERN.findall(header_row)
        for i, cell in enumerate(headers):
            cell_text = tag_strip.sub('', cell).strip().lower()
            cell_text = _TAG_STRIP.sub('', cell).strip().lower()
            if ip_col < 0 and any(h in cell_text for h in TABLE_IP_HEADERS):
                ip_col = i
            elif port_col < 0 and any(h in cell_text for h in TABLE_PORT_HEADERS):
@@ -302,11 +309,11 @@ def extract_proxies_from_table(content):
        # Parse data rows
        for row in rows[1:]:
            cells = cell_pattern.findall(row)
            cells = _CELL_PATTERN.findall(row)
            if len(cells) <= ip_col:
                continue
            ip_cell = tag_strip.sub('', cells[ip_col]).strip()
            ip_cell = _TAG_STRIP.sub('', cells[ip_col]).strip()
            # Check if IP cell contains port (ip:port format)
            if ':' in ip_cell and port_col < 0:
@@ -315,7 +322,7 @@ def extract_proxies_from_table(content):
                    ip, port = match.groups()
                    proto = None
                    if proto_col >= 0 and len(cells) > proto_col:
                        proto = _normalize_proto(tag_strip.sub('', cells[proto_col]).strip())
                        proto = _normalize_proto(_TAG_STRIP.sub('', cells[proto_col]).strip())
                    addr = '%s:%s' % (ip, port)
                    if is_usable_proxy(addr):
                        proxies.append((addr, proto))
@@ -323,7 +330,7 @@ def extract_proxies_from_table(content):
            # Separate IP and Port columns
            if port_col >= 0 and len(cells) > port_col:
                port_cell = tag_strip.sub('', cells[port_col]).strip()
                port_cell = _TAG_STRIP.sub('', cells[port_col]).strip()
                try:
                    port = int(port_cell)
                except ValueError:
@@ -335,7 +342,7 @@ def extract_proxies_from_table(content):
                proto = None
                if proto_col >= 0 and len(cells) > proto_col:
                    proto = _normalize_proto(tag_strip.sub('', cells[proto_col]).strip())
                    proto = _normalize_proto(_TAG_STRIP.sub('', cells[proto_col]).strip())
                addr = '%s:%d' % (ip_cell, port)
                if is_usable_proxy(addr):
@@ -358,6 +365,10 @@ def extract_proxies_from_json(content):
"""
proxies = []
# Short-circuit: content must contain JSON delimiters
if '{' not in content and '[' not in content:
return proxies
# Try to find JSON in content (may be embedded in HTML)
json_matches = []
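
The table-extraction hunks above move the four regexes to module level and add a substring check before any regex runs. A minimal, self-contained sketch of that pattern follows. It is not the project's code: `table_addresses` is a hypothetical helper, the header match is simplified to exact `ip` / `port` column names, and the `is_usable_proxy` filtering is omitted.

```python
import re

# Compiled once at import time instead of on every call.
_TABLE = re.compile(r'<table[^>]*>(.*?)</table>', re.IGNORECASE | re.DOTALL)
_ROW = re.compile(r'<tr[^>]*>(.*?)</tr>', re.IGNORECASE | re.DOTALL)
_CELL = re.compile(r'<t[hd][^>]*>(.*?)</t[hd]>', re.IGNORECASE | re.DOTALL)
_TAGS = re.compile(r'<[^>]+>')


def table_addresses(content):
    """Hypothetical helper: return 'ip:port' strings from IP/Port tables."""
    found = []
    # Cheap substring test first; plain-text lists never reach the regexes.
    if '<table' not in content and '<TABLE' not in content:
        return found
    for table in _TABLE.finditer(content):
        rows = _ROW.findall(table.group(1))
        if not rows:
            continue
        # Simplification: exact header names, not the substring match in the diff.
        headers = [_TAGS.sub('', c).strip().lower() for c in _CELL.findall(rows[0])]
        if 'ip' not in headers or 'port' not in headers:
            continue
        ip_col, port_col = headers.index('ip'), headers.index('port')
        for row in rows[1:]:
            cells = [_TAGS.sub('', c).strip() for c in _CELL.findall(row)]
            if len(cells) > max(ip_col, port_col) and cells[port_col].isdigit():
                found.append('%s:%s' % (cells[ip_col], cells[port_col]))
    return found
```

As in the diff, the substring check only looks for the all-lowercase and all-uppercase spellings, so a mixed-case `<Table>` tag would be skipped even though the regexes themselves are case-insensitive.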

httpd.py

@@ -31,6 +31,68 @@ except (ImportError, IOError, ValueError):
_geodb = None
_geolite = False
# ASN lookup (optional, lazy-loaded on first use)
# Defers ~3.6s startup cost of parsing ipasn.dat until first ASN lookup.
_asndb = None
_asndb_loaded = False
_asn_dat_path = os.path.join("data", "ipasn.dat")
import socket
import struct
import bisect
class _AsnLookup(object):
"""Pure-Python ASN lookup using ipasn.dat (CIDR/ASN text format)."""
def __init__(self, path):
self._entries = []
with open(path) as f:
for line in f:
line = line.strip()
if not line or line.startswith(';'):
continue
parts = line.split('\t')
if len(parts) != 2:
continue
cidr, asn = parts
ip, prefix = cidr.split('/')
start = struct.unpack('!I', socket.inet_aton(ip))[0]
self._entries.append((start, int(prefix), int(asn)))
self._entries.sort()
_log('asn: loaded %d prefixes (pure-python)' % len(self._entries), 'info')
def lookup(self, ip):
ip_int = struct.unpack('!I', socket.inet_aton(ip))[0]
idx = bisect.bisect_right(self._entries, (ip_int, 33, 0)) - 1
if idx < 0:
return (None, None)
start, prefix_len, asn = self._entries[idx]
mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
if (ip_int & mask) == (start & mask):
return (asn, None)
return (None, None)
def _get_asndb():
"""Lazy-load ASN database on first call. Returns db instance or None."""
global _asndb, _asndb_loaded
if _asndb_loaded:
return _asndb
_asndb_loaded = True
try:
import pyasn
_asndb = pyasn.pyasn(_asn_dat_path)
return _asndb
except (ImportError, IOError):
pass
if os.path.exists(_asn_dat_path):
try:
_asndb = _AsnLookup(_asn_dat_path)
except Exception as e:
_log('asn: failed to load %s: %s' % (_asn_dat_path, e), 'warn')
return _asndb
# Rate limiting configuration
_rate_limits = defaultdict(list)
_rate_lock = threading.Lock()
@@ -107,6 +169,30 @@ _fail_retry_interval = 60 # retry interval for failing proxies
_fail_retry_backoff = True # True=linear backoff (60,120,180...), False=fixed (60,60,60...)
_max_fail = 5 # failures before proxy considered dead
# Per-greenlet (or per-thread) SQLite connection cache
# Under gevent, threading.local() is monkey-patched to greenlet-local storage.
# Connections are reused across requests handled by the same greenlet, eliminating
# redundant sqlite3.connect() + PRAGMA calls (~0.5ms each, ~2.7k/session on odin).
_local = threading.local()
def _get_db(path):
"""Get a cached SQLite connection for the proxy database."""
db = getattr(_local, 'proxy_db', None)
if db is None:
db = mysqlite.mysqlite(path, str)
_local.proxy_db = db
return db
def _get_url_db(path):
"""Get a cached SQLite connection for the URL database."""
db = getattr(_local, 'url_db', None)
if db is None:
db = mysqlite.mysqlite(path, str)
_local.url_db = db
return db
def configure_schedule(working_checktime, fail_retry_interval, fail_retry_backoff, max_fail):
"""Set testing schedule parameters from config."""
@@ -300,6 +386,42 @@ def get_worker_test_rate(worker_id):
return 0.0
return total_tests / elapsed
def _get_proto_boost():
"""Calculate protocol scarcity boost for URL scoring.
Returns a value 0.0-1.0 to boost SOCKS sources when SOCKS proxies
are underrepresented relative to HTTP. Returns 0.0 when balanced.
"""
try:
if not _proxy_database:
return 0.0
db = _get_db(_proxy_database)
if not db:
return 0.0
row = db.execute(
"SELECT "
" SUM(CASE WHEN proto='http' THEN 1 ELSE 0 END),"
" SUM(CASE WHEN proto IN ('socks4','socks5') THEN 1 ELSE 0 END)"
" FROM proxylist WHERE failed=0"
).fetchone()
if not row or not row[0]:
return 0.5 # no data, default mild boost
http_count, socks_count = row[0] or 0, row[1] or 0
total = http_count + socks_count
if total == 0:
return 0.5
socks_ratio = float(socks_count) / total
# Boost SOCKS sources when socks_ratio < 40%
if socks_ratio >= 0.4:
return 0.0
return min((0.4 - socks_ratio) * 2.5, 1.0) # 0.0-1.0 scale
except Exception:
return 0.0
# Global reference to proxy database path (set by ProxyAPIServer.__init__)
_proxy_database = None
def claim_urls(url_db, worker_id, count=5):
"""Claim a batch of URLs for worker-driven fetching. Returns list of URL dicts.
@@ -310,6 +432,7 @@ def claim_urls(url_db, worker_id, count=5):
- quality_bonus: 0-0.5 based on working_ratio
- error_penalty: 0-4.0 based on consecutive errors
- stale_penalty: 0-1.5 based on unchanged fetches
- proto_boost: 0-1.0 for SOCKS sources when SOCKS underrepresented
"""
now = time.time()
now_int = int(now)
@@ -335,14 +458,19 @@ def claim_urls(url_db, worker_id, count=5):
list_max_age_seconds = _url_list_max_age_days * 86400
min_added = now_int - list_max_age_seconds
# Boost SOCKS sources when protocol pool is imbalanced
proto_boost = _get_proto_boost()
try:
rows = url_db.execute(
'''SELECT url, content_hash,
(? - check_time) * 1.0 / MAX(COALESCE(check_interval, 3600), 1)
+ MIN(COALESCE(yield_rate, 0) / 100.0, 1.0)
+ COALESCE(working_ratio, 0) * 0.5
- MIN(error * 0.3, 2.0)
- MIN(stale_count * 0.1, 1.0)
- MIN(error * 0.5, 4.0)
- MIN(stale_count * 0.2, 1.5)
+ CASE WHEN LOWER(url) LIKE '%socks5%' OR LOWER(url) LIKE '%socks4%'
THEN ? ELSE 0 END
AS score
FROM uris
WHERE error < ?
@@ -350,7 +478,7 @@ def claim_urls(url_db, worker_id, count=5):
AND (added > ? OR proxies_added > 0)
ORDER BY score DESC
LIMIT ?''',
(now_int, _url_max_fail, now_int, min_added, count * 3)
(now_int, proto_boost, _url_max_fail, now_int, min_added, count * 3)
).fetchall()
except Exception as e:
_log('claim_urls query error: %s' % e, 'error')
@@ -526,7 +654,7 @@ def _update_url_working_ratios(url_working_counts):
pending_snapshot = dict(_url_pending_counts)
try:
url_db = mysqlite.mysqlite(_url_database_path, str)
url_db = _get_url_db(_url_database_path)
for url, working_count in url_working_counts.items():
pending = pending_snapshot.get(url)
if not pending or pending['total'] <= 0:
@@ -547,7 +675,6 @@ def _update_url_working_ratios(url_working_counts):
settled.append(url)
url_db.commit()
url_db.close()
except Exception as e:
_log('_update_url_working_ratios error: %s' % e, 'error')
@@ -604,7 +731,7 @@ def submit_proxy_reports(db, worker_id, proxies):
''', (proxy_key, ip, port, proto, now_int, now_int, latency, now_int,
checktype, target))
# Geolocate if IP2Location available
# Geolocate and ASN lookup
if _geolite and _geodb:
try:
rec = _geodb.get_all(ip)
@@ -614,6 +741,16 @@ def submit_proxy_reports(db, worker_id, proxies):
(rec.country_short, proxy_key))
except Exception:
pass
asndb = _get_asndb()
if asndb:
try:
asn_result = asndb.lookup(ip)
if asn_result and asn_result[0]:
db.execute(
'UPDATE proxylist SET asn=? WHERE proxy=?',
(asn_result[0], proxy_key))
except Exception:
pass
# Track per-URL working count for working_ratio
if source_url:
@@ -1050,7 +1187,7 @@ class ProxyAPIHandler(BaseHTTPServer.BaseHTTPRequestHandler):
def handle_countries(self):
"""Return all countries with proxy counts."""
try:
db = mysqlite.mysqlite(self.database, str)
db = _get_db(self.database)
rows = db.execute(
'SELECT country, COUNT(*) as c FROM proxylist WHERE failed=0 AND country IS NOT NULL '
'GROUP BY country ORDER BY c DESC'
@@ -1069,7 +1206,7 @@ class ProxyAPIHandler(BaseHTTPServer.BaseHTTPRequestHandler):
def get_db_stats(self):
"""Get statistics from database."""
try:
db = mysqlite.mysqlite(self.database, str)
db = _get_db(self.database)
stats = {}
# Total counts
@@ -1117,7 +1254,7 @@ class ProxyAPIHandler(BaseHTTPServer.BaseHTTPRequestHandler):
# Add database stats
try:
db = mysqlite.mysqlite(self.database, str)
db = _get_db(self.database)
stats['db'] = self.get_db_stats()
stats['db_health'] = get_db_health(db)
except Exception as e:
@@ -1204,7 +1341,7 @@ class ProxyAPIHandler(BaseHTTPServer.BaseHTTPRequestHandler):
args.append(limit)
try:
db = mysqlite.mysqlite(self.database, str)
db = _get_db(self.database)
rows = db.execute(sql, args).fetchall()
if fmt == 'plain':
@@ -1253,7 +1390,7 @@ class ProxyAPIHandler(BaseHTTPServer.BaseHTTPRequestHandler):
sql += ' ORDER BY avg_latency ASC, tested DESC'
try:
db = mysqlite.mysqlite(self.database, str)
db = _get_db(self.database)
rows = db.execute(sql, args).fetchall()
if fmt == 'plain':
@@ -1270,7 +1407,7 @@ class ProxyAPIHandler(BaseHTTPServer.BaseHTTPRequestHandler):
def handle_count(self):
try:
db = mysqlite.mysqlite(self.database, str)
db = _get_db(self.database)
row = db.execute('SELECT COUNT(*) FROM proxylist WHERE failed=0 AND proto IS NOT NULL').fetchone()
self.send_json({'count': row[0] if row else 0})
except Exception as e:
@@ -1296,8 +1433,9 @@ class ProxyAPIServer(threading.Thread):
self.stats_provider = stats_provider
self.profiling = profiling
self.daemon = True
global _url_database_path
global _url_database_path, _proxy_database
_url_database_path = url_database
_proxy_database = database
self.server = None
self._stop_event = threading.Event() if not GEVENT_PATCHED else None
# Load static library files into cache
@@ -1306,14 +1444,42 @@ class ProxyAPIServer(threading.Thread):
load_static_files(THEME)
# Load worker registry from disk
load_workers()
# Backfill ASN for existing proxies missing it (triggers lazy-load)
if _get_asndb():
self._backfill_asn()
# Create verification tables if they don't exist
try:
db = mysqlite.mysqlite(self.database, str)
db = _get_db(self.database)
create_verification_tables(db)
_log('verification tables initialized', 'debug')
except Exception as e:
_log('failed to create verification tables: %s' % e, 'warn')
def _backfill_asn(self):
"""One-time backfill of ASN for proxies that have ip but no ASN."""
try:
db = _get_db(self.database)
rows = db.execute(
'SELECT proxy, ip FROM proxylist WHERE asn IS NULL AND ip IS NOT NULL'
).fetchall()
if not rows:
return
updated = 0
for proxy_key, ip in rows:
try:
result = _get_asndb().lookup(ip)
if result and result[0]:
db.execute('UPDATE proxylist SET asn=? WHERE proxy=?',
(result[0], proxy_key))
updated += 1
except Exception:
pass
db.commit()
if updated:
_log('asn: backfilled %d/%d proxies' % (updated, len(rows)), 'info')
except Exception as e:
_log('asn backfill error: %s' % e, 'warn')
def _wsgi_app(self, environ, start_response):
"""WSGI application wrapper for gevent."""
path = environ.get('PATH_INFO', '/').split('?')[0]
@@ -1472,11 +1638,15 @@ class ProxyAPIServer(threading.Thread):
stats['system'] = get_system_stats()
# Add database stats
try:
db = mysqlite.mysqlite(self.database, str)
db = _get_db(self.database)
stats['db'] = self._get_db_stats(db)
stats['db_health'] = get_db_health(db)
except Exception as e:
_log('api/stats db error: %s' % e, 'warn')
# Add URL pipeline stats
url_stats = self._get_url_stats()
if url_stats is not None:
stats['urls'] = url_stats
# Add profiling flag (from constructor or stats_provider)
if 'profiling' not in stats:
stats['profiling'] = self.profiling
@@ -1501,7 +1671,7 @@ class ProxyAPIServer(threading.Thread):
# 2. Database stats and health
try:
db = mysqlite.mysqlite(self.database, str)
db = _get_db(self.database)
result['stats']['db'] = self._get_db_stats(db)
result['stats']['db_health'] = get_db_health(db)
@@ -1514,8 +1684,6 @@ class ProxyAPIServer(threading.Thread):
# 4. Workers (same as /api/workers)
result['workers'] = self._get_workers_data(db)
db.close()
except Exception as e:
_log('api/dashboard db error: %s' % e, 'warn')
result['countries'] = {}
@@ -1534,7 +1702,7 @@ class ProxyAPIServer(threading.Thread):
return json.dumps({'error': 'stats not available'}), 'application/json', 500
elif path == '/api/countries':
try:
db = mysqlite.mysqlite(self.database, str)
db = _get_db(self.database)
rows = db.execute(
'SELECT country, COUNT(*) as c FROM proxylist WHERE failed=0 AND country IS NOT NULL '
'GROUP BY country ORDER BY c DESC'
@@ -1546,7 +1714,7 @@ class ProxyAPIServer(threading.Thread):
elif path == '/api/locations':
# Return proxy locations aggregated by lat/lon grid (0.1 degree cells, via ROUND(x, 1))
try:
db = mysqlite.mysqlite(self.database, str)
db = _get_db(self.database)
rows = db.execute(
'SELECT ROUND(latitude, 1) as lat, ROUND(longitude, 1) as lon, '
'country, anonymity, COUNT(*) as c FROM proxylist '
@@ -1584,7 +1752,7 @@ class ProxyAPIServer(threading.Thread):
sql += ' ORDER BY avg_latency ASC, tested DESC LIMIT ?'
args.append(limit)
db = mysqlite.mysqlite(self.database, str)
db = _get_db(self.database)
rows = db.execute(sql, args).fetchall()
if fmt == 'plain':
@@ -1623,7 +1791,7 @@ class ProxyAPIServer(threading.Thread):
sql += ' AND mitm=1'
sql += ' ORDER BY avg_latency ASC, tested DESC'
db = mysqlite.mysqlite(self.database, str)
db = _get_db(self.database)
rows = db.execute(sql, args).fetchall()
if fmt == 'plain':
@@ -1645,7 +1813,7 @@ class ProxyAPIServer(threading.Thread):
sql += ' AND mitm=0'
elif mitm_filter == '1':
sql += ' AND mitm=1'
db = mysqlite.mysqlite(self.database, str)
db = _get_db(self.database)
row = db.execute(sql).fetchone()
return json.dumps({'count': row[0] if row else 0}), 'application/json', 200
except Exception as e:
@@ -1701,9 +1869,8 @@ class ProxyAPIServer(threading.Thread):
elif path == '/api/workers':
# List connected workers
try:
db = mysqlite.mysqlite(self.database, str)
db = _get_db(self.database)
workers_data = self._get_workers_data(db)
db.close()
return json.dumps(workers_data, indent=2), 'application/json', 200
except Exception as e:
_log('api/workers error: %s' % e, 'warn')
@@ -1721,7 +1888,7 @@ class ProxyAPIServer(threading.Thread):
return json.dumps({'error': 'url database not configured'}), 'application/json', 500
count = min(int(query_params.get('count', 5)), 20)
try:
url_db = mysqlite.mysqlite(self.url_database, str)
url_db = _get_url_db(self.url_database)
urls = claim_urls(url_db, worker_id, count)
update_worker_heartbeat(worker_id)
return json.dumps({
@@ -1748,7 +1915,7 @@ class ProxyAPIServer(threading.Thread):
if not reports:
return json.dumps({'error': 'no reports provided'}), 'application/json', 400
try:
url_db = mysqlite.mysqlite(self.url_database, str)
url_db = _get_url_db(self.url_database)
processed = submit_url_reports(url_db, worker_id, reports)
update_worker_heartbeat(worker_id)
return json.dumps({
@@ -1772,7 +1939,7 @@ class ProxyAPIServer(threading.Thread):
if not proxies:
return json.dumps({'error': 'no proxies provided'}), 'application/json', 400
try:
db = mysqlite.mysqlite(self.database, str)
db = _get_db(self.database)
processed = submit_proxy_reports(db, worker_id, proxies)
update_worker_heartbeat(worker_id)
return json.dumps({
@@ -1821,6 +1988,65 @@ class ProxyAPIServer(threading.Thread):
_log('_get_db_stats error: %s' % e, 'warn')
return stats
def _get_url_stats(self):
"""Get URL pipeline statistics from the websites database."""
if not self.url_database:
return None
try:
db = _get_url_db(self.url_database)
stats = {}
now = int(time.time())
# Total URLs and health breakdown
row = db.execute('SELECT COUNT(*) FROM uris').fetchone()
stats['total'] = row[0] if row else 0
row = db.execute('SELECT COUNT(*) FROM uris WHERE error >= 10').fetchone()
stats['dead'] = row[0] if row else 0
row = db.execute('SELECT COUNT(*) FROM uris WHERE error > 0 AND error < 10').fetchone()
stats['erroring'] = row[0] if row else 0
row = db.execute('SELECT COUNT(*) FROM uris WHERE error = 0').fetchone()
stats['healthy'] = row[0] if row else 0
# Recently active (fetched in last hour)
row = db.execute(
'SELECT COUNT(*) FROM uris WHERE check_time >= ?',
(now - 3600,)).fetchone()
stats['fetched_last_hour'] = row[0] if row else 0
# Productive sources (have produced working proxies)
row = db.execute(
'SELECT COUNT(*) FROM uris WHERE working_ratio > 0'
).fetchone()
stats['productive'] = row[0] if row else 0
# Aggregate yield
row = db.execute(
'SELECT SUM(proxies_added), SUM(retrievals) FROM uris'
).fetchone()
stats['total_proxies_extracted'] = (row[0] or 0) if row else 0
stats['total_fetches'] = (row[1] or 0) if row else 0
# Currently claimed
with _url_claims_lock:
stats['claimed'] = len(_url_claims)
# Top sources by working_ratio (productive URLs only)
rows = db.execute(
'SELECT url, working_ratio, yield_rate, proxies_added, retrievals '
'FROM uris WHERE working_ratio > 0 AND retrievals > 0 '
'ORDER BY working_ratio DESC LIMIT 10'
).fetchall()
stats['top_sources'] = [{
'url': r[0], 'working_ratio': round(r[1], 3),
'yield_rate': round(r[2], 1), 'proxies_added': r[3],
'fetches': r[4],
} for r in rows]
return stats
except Exception as e:
_log('_get_url_stats error: %s' % e, 'warn')
return None
def _get_workers_data(self, db):
"""Get worker status data. Used by /api/workers and /api/dashboard."""
now = time.time()
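
The scarcity boost in `_get_proto_boost()` reduces to a small piece of arithmetic on the HTTP and SOCKS counts. The sketch below restates only that arithmetic, without the database query. The function name `proto_boost` and the `target` / `slope` parameters are illustrative, not part of the diff. One deliberate difference: the original also returns 0.5 whenever the HTTP count is zero, while this sketch does so only when both counts are zero.

```python
def proto_boost(http_count, socks_count, target=0.4, slope=2.5):
    """Illustrative restatement of the boost arithmetic (names are assumptions)."""
    total = http_count + socks_count
    if total == 0:
        # No data at all: fall back to a mild default boost.
        return 0.5
    socks_ratio = float(socks_count) / total
    if socks_ratio >= target:
        # SOCKS share is at or above the target, so no boost is applied.
        return 0.0
    # Linear ramp: 0.0 at the target ratio, capped at 1.0.
    return min((target - socks_ratio) * slope, 1.0)
```

With the 78% HTTP pool mentioned in the commit message (a SOCKS share of 22%), this gives a boost of roughly 0.45, which is then added to the score of any URL containing `socks4` or `socks5`.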


@@ -242,6 +242,7 @@ class Rocksock():
target = RocksockProxy(host, port, RS_PT_NONE)
self.proxychain.append(target)
self.sock = None
self._connected = False
self.timeout = timeout
def _translate_socket_error(self, e, pnum):
@@ -302,15 +303,18 @@ class Rocksock():
select.select([], [self.sock], [])
"""
self._connected = True
def disconnect(self):
if self.sock is None: return
try:
self.sock.shutdown(socket.SHUT_RDWR)
except socket.error:
pass
if self._connected:
try:
self.sock.shutdown(socket.SHUT_RDWR)
except socket.error:
pass
self.sock.close()
self.sock = None
self._connected = False
def canread(self):
return select.select([self.sock], [], [], 0)[0]
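
The `Rocksock` hunks above add a `_connected` flag so that `disconnect()` only calls `shutdown()` on a socket that completed `connect()`. The commit does not state the motivation; presumably it avoids a pointless `shutdown()` on a socket that was created but never connected. A minimal sketch of the same lifecycle, using a hypothetical `GuardedSocket` class rather than the real `Rocksock`:

```python
import socket


class GuardedSocket(object):
    """Hypothetical stand-in for the Rocksock connection lifecycle."""

    def __init__(self):
        self.sock = None
        self._connected = False

    def open(self):
        # Create the socket without connecting it.
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    def mark_connected(self):
        # In the diff this flag is set at the end of connect().
        self._connected = True

    def disconnect(self):
        if self.sock is None:
            return
        if self._connected:
            # Only a connected socket needs an orderly shutdown.
            try:
                self.sock.shutdown(socket.SHUT_RDWR)
            except socket.error:
                pass
        self.sock.close()
        self.sock = None
        self._connected = False
```

Calling `disconnect()` before `open()`, or after `open()` without a successful connect, closes whatever exists and resets both attributes without ever invoking `shutdown()`.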


@@ -359,6 +359,198 @@ class TestExtractAuthProxies:
assert fetch.extract_auth_proxies('just some text') == []
class TestExtractAuthProxiesShortCircuit:
"""Tests for extract_auth_proxies() short-circuit on missing @."""
def test_no_at_sign_returns_empty(self):
"""Content without @ skips regex entirely."""
content = '1.2.3.4:8080 socks5://5.6.7.8:1080 plain text'
assert fetch.extract_auth_proxies(content) == []
def test_at_sign_still_extracts(self):
"""Content with @ still finds auth proxies."""
content = 'user:pass@1.2.3.4:8080'
result = fetch.extract_auth_proxies(content)
assert len(result) == 1
assert result[0][0] == 'user:pass@1.2.3.4:8080'
def test_at_sign_no_match_returns_empty(self):
"""Content with @ but no auth proxy pattern returns empty."""
content = 'email@example.com has no proxy'
assert fetch.extract_auth_proxies(content) == []
class TestExtractProxiesFromTable:
"""Tests for extract_proxies_from_table() with precompiled regexes."""
def test_no_table_returns_empty(self):
"""Plain text without <table> returns empty."""
content = '1.2.3.4:8080\n5.6.7.8:3128\n'
assert fetch.extract_proxies_from_table(content) == []
def test_simple_table(self):
"""Basic HTML table with IP/Port columns is parsed."""
content = '''
<table>
<tr><th>IP</th><th>Port</th><th>Type</th></tr>
<tr><td>1.2.3.4</td><td>8080</td><td>HTTP</td></tr>
<tr><td>5.6.7.8</td><td>1080</td><td>SOCKS5</td></tr>
</table>
'''
result = fetch.extract_proxies_from_table(content)
assert len(result) == 2
addrs = [r[0] for r in result]
assert '1.2.3.4:8080' in addrs
assert '5.6.7.8:1080' in addrs
def test_uppercase_table_tag(self):
"""<TABLE> (uppercase) is also detected."""
content = '''
<TABLE>
<TR><TH>IP</TH><TH>Port</TH></TR>
<TR><TD>1.2.3.4</TD><TD>8080</TD></TR>
</TABLE>
'''
result = fetch.extract_proxies_from_table(content)
assert len(result) == 1
def test_empty_table(self):
"""Table with headers but no data rows returns empty."""
content = '''
<table>
<tr><th>IP</th><th>Port</th></tr>
</table>
'''
result = fetch.extract_proxies_from_table(content)
assert result == []
class TestExtractProxiesFromJson:
"""Tests for extract_proxies_from_json() short-circuit."""
def test_no_braces_returns_empty(self):
"""Content without { or [ skips JSON parsing."""
content = '1.2.3.4:8080\n5.6.7.8:3128\n'
assert fetch.extract_proxies_from_json(content) == []
def test_json_array_of_objects(self):
"""JSON array with ip/port objects is parsed."""
content = '[{"ip": "1.2.3.4", "port": 8080}]'
result = fetch.extract_proxies_from_json(content)
assert len(result) >= 1
addrs = [r[0] for r in result]
assert '1.2.3.4:8080' in addrs
def test_json_array_of_strings(self):
"""JSON array of ip:port strings is parsed."""
content = '["1.2.3.4:8080", "5.6.7.8:3128"]'
result = fetch.extract_proxies_from_json(content)
addrs = [r[0] for r in result]
assert '1.2.3.4:8080' in addrs
assert '5.6.7.8:3128' in addrs
def test_plain_html_skips_json(self):
"""HTML without JSON delimiters returns empty."""
content = '<html><body>1.2.3.4:8080</body></html>'
# The content has < and > but no { or [, so the short-circuit
# returns before any JSON parsing is attempted.
result = fetch.extract_proxies_from_json(content)
# Only the return type is asserted here; the call must not raise.
assert isinstance(result, list)
class TestExtractProxiesWithHints:
"""Tests for extract_proxies_with_hints()."""
def test_proto_before_ip(self):
"""Protocol keyword before IP:PORT is detected."""
content = 'socks5 1.2.3.4:8080'
result = fetch.extract_proxies_with_hints(content)
assert '1.2.3.4:8080' in result
assert result['1.2.3.4:8080'] == 'socks5'
def test_proto_after_ip(self):
"""Protocol keyword after IP:PORT is detected."""
content = '1.2.3.4:8080 socks5'
result = fetch.extract_proxies_with_hints(content)
assert '1.2.3.4:8080' in result
def test_no_hints_returns_empty(self):
"""Plain IP:PORT without protocol hints returns empty."""
content = '1.2.3.4:8080'
result = fetch.extract_proxies_with_hints(content)
assert result == {}
class TestExtractProxiesIntegration:
"""Integration tests for extract_proxies() combining all extractors."""
def test_plain_text_proxy_list(self):
"""Plain text IP:PORT list extracts correctly."""
content = '1.2.3.4:8080\n5.6.7.8:3128\n9.10.11.12:1080\n'
result = fetch.extract_proxies(content, filter_known=False)
addrs = [r[0] for r in result]
assert '1.2.3.4:8080' in addrs
assert '5.6.7.8:3128' in addrs
assert '9.10.11.12:1080' in addrs
def test_auth_proxies_extracted(self):
"""Auth proxies found in mixed content."""
content = 'user:pass@1.2.3.4:8080\n5.6.7.8:3128\n'
result = fetch.extract_proxies(content, filter_known=False)
addrs = [r[0] for r in result]
assert 'user:pass@1.2.3.4:8080' in addrs
assert '5.6.7.8:3128' in addrs
def test_html_table_extraction(self):
"""Proxies extracted from HTML table."""
content = '''
<table>
<tr><th>IP</th><th>Port</th></tr>
<tr><td>1.2.3.4</td><td>8080</td></tr>
</table>
'''
result = fetch.extract_proxies(content, filter_known=False)
addrs = [r[0] for r in result]
assert '1.2.3.4:8080' in addrs
def test_json_extraction(self):
"""Proxies extracted from JSON content."""
content = '[{"ip": "1.2.3.4", "port": 8080}]'
result = fetch.extract_proxies(content, filter_known=False)
addrs = [r[0] for r in result]
assert '1.2.3.4:8080' in addrs
def test_empty_content(self):
"""Empty content returns no proxies."""
result = fetch.extract_proxies('', filter_known=False)
assert result == []
def test_private_ips_filtered(self):
"""Private IPs are not returned."""
content = '10.0.0.1:8080\n192.168.1.1:3128\n1.2.3.4:8080\n'
result = fetch.extract_proxies(content, filter_known=False)
addrs = [r[0] for r in result]
assert '10.0.0.1:8080' not in addrs
assert '192.168.1.1:3128' not in addrs
assert '1.2.3.4:8080' in addrs
def test_proto_from_hints(self):
"""Protocol hints are picked up."""
content = 'socks5 1.2.3.4:8080\n'
result = fetch.extract_proxies(content, filter_known=False)
protos = {r[0]: r[1] for r in result}
assert protos.get('1.2.3.4:8080') == 'socks5'
def test_proto_from_arg(self):
"""Fallback proto from argument is used."""
content = '1.2.3.4:8080\n'
result = fetch.extract_proxies(content, filter_known=False, proto='socks4')
protos = {r[0]: r[1] for r in result}
assert protos.get('1.2.3.4:8080') == 'socks4'
class TestConfidenceScoring:
"""Tests for confidence score constants."""