PPF - Python Proxy Finder
A Python 2.7 proxy discovery and validation framework.

Testing: Dockerfile.test builds the production image with pytest baked in; compose.test.yml mounts the source as a volume for fast iteration. Run the suite with: podman-compose -f compose.test.yml run --rm test
Overview
PPF discovers proxy addresses by crawling websites and search engines, validates them through multi-target testing via Tor, and maintains a database of working proxies with automatic protocol detection (SOCKS4/SOCKS5/HTTP).
scraper.py ──> ppf.py ──> proxywatchd.py
    │             │             │
    │ search      │ harvest     │ validate
    │ engines     │ proxies     │ via tor
    v             v             v
           SQLite databases
Requirements
- Python 2.7
- Tor SOCKS proxy (default: 127.0.0.1:9050)
- beautifulsoup4 (optional with --nobs flag)
Installation
Local
pip install -r requirements.txt
cp config.ini.sample config.ini
cp servers.txt.sample servers.txt
Container (Rootless)
# On container host, as dedicated user
podman build -t ppf:latest .
podman run --rm ppf:latest python ppf.py --help
Prerequisites for rootless containers:
- subuid/subgid mappings configured
- linger enabled (loginctl enable-linger $USER)
- passt installed for networking
Configuration
Copy config.ini.sample to config.ini and adjust:
[common]
tor_hosts = 127.0.0.1:9050 # Comma-separated Tor SOCKS addresses
timeout_connect = 10 # Connection timeout (seconds)
timeout_read = 15 # Read timeout (seconds)
[watchd]
threads = 10 # Parallel validation threads
max_fail = 5 # Failures before proxy marked dead
checktime = 1800 # Base recheck interval (seconds)
database = proxies.sqlite # Proxy database path
stale_days = 30 # Days before removing dead proxies
stats_interval = 300 # Seconds between status reports
[ppf]
threads = 3 # URL harvesting threads
search = 1 # Enable search engine discovery
database = websites.sqlite # URL database path
[scraper]
engines = searx,duckduckgo # Comma-separated search engines
max_pages = 5 # Max pages per engine query
[httpd]
enabled = 0 # Enable REST API
port = 8081 # API listen port
listenip = 127.0.0.1 # API bind address
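The options above can be loaded with the stdlib config parser. A minimal sketch, assuming the section and option names from the sample above (the helper name load_tor_hosts is illustrative, not PPF's actual config.py API; the module is spelled ConfigParser on Python 2.7):

```python
import configparser  # "ConfigParser" on Python 2.7


def load_tor_hosts(path):
    """Parse the comma-separated tor_hosts option into (host, port) pairs."""
    # inline_comment_prefixes strips the "# ..." comments used in config.ini
    cfg = configparser.ConfigParser(inline_comment_prefixes=("#",))
    cfg.read(path)
    hosts = []
    for entry in cfg.get("common", "tor_hosts").split(","):
        host, _, port = entry.strip().rpartition(":")
        hosts.append((host, int(port)))
    return hosts
```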
Usage
Proxy Validation Daemon
python proxywatchd.py
Validates proxies from the database against multiple targets. Requires:
- servers.txt with IRC servers (for IRC mode); otherwise built-in HTTP targets are used
- A running Tor instance
URL Harvester
python ppf.py
Crawls URLs for proxy addresses and feeds them to the validator. Also starts the watchd internally.
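Pulling ip:port pairs out of fetched pages amounts to a regex pass plus sanity checks. A hypothetical sketch of such an extractor (not the actual code in ppf.py):

```python
import re

# Dotted-quad IPv4, then ':' or whitespace separators, then a port.
PROXY_RE = re.compile(r"\b((?:\d{1,3}\.){3}\d{1,3})[:\s]+(\d{2,5})\b")


def extract_proxies(text):
    """Return unique ip:port strings found in a blob of HTML or plain text."""
    found = []
    for ip, port in PROXY_RE.findall(text):
        # Reject impossible octets and out-of-range ports.
        if all(int(o) <= 255 for o in ip.split(".")) and 0 < int(port) <= 65535:
            pair = "%s:%s" % (ip, port)
            if pair not in found:
                found.append(pair)
    return found
```

Accepting whitespace as a separator matters in practice: many proxy lists publish "ip port" in columns rather than "ip:port".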
Search Engine Scraper
python scraper.py
Queries search engines for proxy list URLs. Supports:
- SearXNG instances
- DuckDuckGo, Startpage, Brave, Ecosia
- GitHub, GitLab, Codeberg (code search)
Import From File
python ppf.py --file proxies.txt
CLI Flags
--nobs Use stdlib HTMLParser instead of BeautifulSoup
--file FILE Import proxies from file
-q, --quiet Show warnings and errors only
-v, --verbose Show debug messages
REST API
Enable in config with httpd.enabled = 1.
# Get working proxies
curl "http://localhost:8081/proxies?limit=10&proto=socks5"
# Get count
curl http://localhost:8081/proxies/count
# Health check
curl http://localhost:8081/health
Query parameters:
- limit - Max results (default: 100)
- proto - Filter by protocol (socks4/socks5/http)
- country - Filter by country code
- asn - Filter by autonomous system number
- format - Output format (json/plain)
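A client needs nothing beyond the stdlib to consume the endpoint. A sketch assuming the JSON response body is a decodable JSON document (the exact response shape is not documented here, so get_proxies just returns whatever json.loads produces):

```python
import json

try:  # Python 3
    from urllib.request import urlopen
    from urllib.parse import urlencode
except ImportError:  # Python 2.7
    from urllib2 import urlopen
    from urllib import urlencode


def build_url(base, limit=100, proto=None, country=None, fmt="json"):
    """Build a /proxies query URL from the documented parameters."""
    params = {"limit": limit, "format": fmt}
    if proto:
        params["proto"] = proto
    if country:
        params["country"] = country
    # sorted() keeps the query string deterministic
    return "%s/proxies?%s" % (base, urlencode(sorted(params.items())))


def get_proxies(base="http://localhost:8081", **kw):
    """Fetch and decode the proxy list from a running httpd instance."""
    resp = urlopen(build_url(base, **kw), timeout=10)
    return json.loads(resp.read().decode("utf-8"))
```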
Architecture
Components
| File | Purpose |
|---|---|
| proxywatchd.py | Proxy validation daemon with multi-target voting |
| ppf.py | URL harvester and proxy extractor |
| scraper.py | Search engine integration |
| fetch.py | HTTP client with proxy support |
| dbs.py | Database operations |
| mysqlite.py | SQLite wrapper with WAL mode |
| connection_pool.py | Tor connection pooling with health tracking |
| config.py | Configuration management |
| httpd.py | REST API server |
Validation Logic
Each proxy is tested against 3 random targets:
- 2/3 majority required for success
- Protocol auto-detected (tries HTTP, SOCKS5, SOCKS4)
- SSL/TLS tested periodically
- MITM detection via certificate validation
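The 2-of-3 voting described above can be sketched as follows; check_target is a stand-in for the real per-target probe in proxywatchd.py, which this sketch does not reproduce:

```python
import random


def validate(proxy, targets, check_target, votes=3, needed=2):
    """Test `proxy` against `votes` random targets; pass on `needed` successes.

    `check_target(proxy, target)` is a hypothetical probe returning True/False.
    """
    successes = 0
    for target in random.sample(targets, min(votes, len(targets))):
        if check_target(proxy, target):
            successes += 1
        if successes >= needed:
            return True  # early exit: majority already reached
    return False
```

Voting across independent targets protects against a single flaky or blocking target marking a healthy proxy dead.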
Database Schema
-- proxylist (proxies.sqlite)
proxy TEXT UNIQUE -- ip:port
proto TEXT -- socks4/socks5/http
country TEXT -- 2-letter code
asn INT -- autonomous system number
failed INT -- consecutive failures
success_count INT -- total successes
avg_latency REAL -- rolling average (ms)
tested INT -- last test timestamp
-- uris (websites.sqlite)
url TEXT UNIQUE -- source URL
error INT -- consecutive errors
stale_count INT -- checks without new proxies
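The proxylist columns above map onto a straightforward CREATE TABLE. A sketch of opening proxies.sqlite in WAL mode; the authoritative DDL lives in dbs.py / mysqlite.py and may differ in defaults and indexes:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS proxylist (
    proxy TEXT UNIQUE,            -- ip:port
    proto TEXT,                   -- socks4/socks5/http
    country TEXT,                 -- 2-letter code
    asn INT,                      -- autonomous system number
    failed INT DEFAULT 0,         -- consecutive failures
    success_count INT DEFAULT 0,  -- total successes
    avg_latency REAL,             -- rolling average (ms)
    tested INT                    -- last test timestamp
)
"""


def open_db(path):
    """Open the proxy database with WAL journaling, creating it if missing."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(SCHEMA)
    conn.commit()
    return conn
```

The UNIQUE constraint on proxy is what lets harvesters blindly re-insert discoveries without creating duplicates.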
Threading Model
- Priority queue orders jobs by proxy health
- Dynamic thread scaling based on success rate
- Work-stealing ensures even load distribution
- Tor connection pooling with worker affinity
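The priority ordering can be sketched with the stdlib queue module: lower scores dequeue first, so healthy proxies (few failures, low latency) are rechecked before failing ones. The scoring formula here is illustrative, not the one used in proxywatchd.py:

```python
try:
    import queue  # Python 3
except ImportError:
    import Queue as queue  # Python 2.7


def health_score(failed, avg_latency):
    """Lower is better: weight consecutive failures heavily, latency lightly."""
    return failed * 1000 + (avg_latency or 0)


jobs = queue.PriorityQueue()
jobs.put((health_score(0, 120.0), "1.1.1.1:1080"))  # score 120
jobs.put((health_score(3, 50.0), "2.2.2.2:1080"))   # score 3050
jobs.put((health_score(0, 40.0), "3.3.3.3:1080"))   # score 40
```

Worker threads then simply call jobs.get(), and queue.PriorityQueue handles the locking.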
Deployment
Container Deployment
All nodes use podman-compose with role-specific compose files
(rootless, as the podman user). --network=host is required so the
container can reach Tor at 127.0.0.1:9050.
# Build image
podman build -t ppf:latest .
# Start via compose
podman-compose up -d
# View logs / stop
podman-compose logs -f
podman-compose down
Operations Toolkit
The tools/ directory provides CLI wrappers for multi-node operations.
Deployment uses an Ansible playbook over WireGuard for parallel execution
and handler-based restarts.
ppf-deploy [targets...] # validate + deploy + restart (playbook)
ppf-deploy --check # dry run with diff
ppf-logs [node] # view container logs (-f to follow)
ppf-service <cmd> [nodes...] # status / start / stop / restart
ppf-db <cmd> # stats / purge-proxies / vacuum
ppf-status # cluster overview (containers, workers, queue)
See --help on each tool.
Troubleshooting
Low Success Rate
WATCHD X.XX% SUCCESS RATE - tor circuit blocked?
- Tor circuit may be flagged; restart Tor
- Target servers may be blocking; wait for rotation
- Network issues; check connectivity
Database Locked
WAL mode handles most concurrency. If issues persist:
- Reduce thread count
- Check disk I/O
- Verify single instance running
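If lock errors persist, a busy timeout makes writers wait for the lock instead of failing immediately. A small sketch (the function name is illustrative; mysqlite.py may already set these pragmas):

```python
import sqlite3


def open_with_timeout(path, seconds=30):
    """Open a database that waits up to `seconds` on a locked file."""
    conn = sqlite3.connect(path, timeout=seconds)  # driver-level wait on locks
    conn.execute("PRAGMA busy_timeout = %d" % (seconds * 1000))  # SQLite-level
    conn.execute("PRAGMA journal_mode=WAL")  # verify/enable WAL
    return conn
```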
No Proxies Found
- Check search engines in config
- Verify Tor connectivity
- Review scraper logs for rate limiting
License
See LICENSE file.