
PPF - Python Proxy Finder

A Python 2.7 proxy discovery and validation framework.

Overview

PPF discovers proxy addresses by crawling websites and search engines, validates them through multi-target testing via Tor, and maintains a database of working proxies with automatic protocol detection (SOCKS4/SOCKS5/HTTP).

scraper.py ──> ppf.py ──> proxywatchd.py
   │             │              │
   │ search      │ harvest      │ validate
   │ engines     │ proxies      │ via tor
   v             v              v
         SQLite databases

Requirements

  • Python 2.7
  • Tor SOCKS proxy (default: 127.0.0.1:9050)
  • beautifulsoup4 (optional; the --nobs flag falls back to the stdlib HTMLParser)

Installation

Local

pip install -r requirements.txt
cp config.ini.sample config.ini
cp servers.txt.sample servers.txt

Container (Rootless)

# On container host, as dedicated user
podman build -t ppf:latest .
podman run --rm ppf:latest python ppf.py --help

Prerequisites for rootless containers:

  • subuid/subgid mappings configured
  • linger enabled (loginctl enable-linger $USER)
  • passt installed for networking

Configuration

Copy config.ini.sample to config.ini and adjust:

[common]
tor_hosts = 127.0.0.1:9050      # Comma-separated Tor SOCKS addresses
timeout_connect = 10             # Connection timeout (seconds)
timeout_read = 15                # Read timeout (seconds)

[watchd]
threads = 10                     # Parallel validation threads
max_fail = 5                     # Failures before proxy marked dead
checktime = 1800                 # Base recheck interval (seconds)
database = proxies.sqlite        # Proxy database path
stale_days = 30                  # Days before removing dead proxies
stats_interval = 300             # Seconds between status reports

[ppf]
threads = 3                      # URL harvesting threads
search = 1                       # Enable search engine discovery
database = websites.sqlite       # URL database path

[scraper]
engines = searx,duckduckgo       # Comma-separated search engines
max_pages = 5                    # Max pages per engine query

[httpd]
enabled = 0                      # Enable REST API
port = 8081                      # API listen port
listenip = 127.0.0.1             # API bind address
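
The settings above can be loaded with the standard library's configparser, as sketched below. This is an illustration, not the project's actual config.py (which targets Python 2.7, where the module is named ConfigParser); inline_comment_prefixes is needed because the sample file puts "# ..." comments on the same line as the values.

```python
# Minimal sketch of reading config.ini; the repo's config.py may differ.
import configparser

def load_settings(path="config.ini"):
    # Strip inline "# ..." comments that follow values in the sample file.
    cp = configparser.ConfigParser(inline_comment_prefixes=("#",))
    cp.read(path)
    return {
        "tor_hosts": [h.strip() for h in cp.get("common", "tor_hosts").split(",")],
        "threads": cp.getint("watchd", "threads"),
        "checktime": cp.getint("watchd", "checktime"),
    }
```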

Usage

Proxy Validation Daemon

python proxywatchd.py

Validates proxies from the database against multiple targets. Requires:

  • servers.txt listing IRC servers (for IRC mode); otherwise built-in HTTP targets are used
  • Running Tor instance

URL Harvester

python ppf.py

Crawls URLs for proxy addresses and feeds them to the validator. Also starts the watchd internally.

Search Engine Scraper

python scraper.py

Queries search engines for proxy list URLs. Supports:

  • SearXNG instances
  • DuckDuckGo, Startpage, Brave, Ecosia
  • GitHub, GitLab, Codeberg (code search)

Import From File

python ppf.py --file proxies.txt
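
Imported files can be free-form text; a parser only needs to pull out ip:port pairs. The sketch below is a hypothetical illustration of that extraction, not ppf.py's actual implementation.

```python
# Hypothetical ip:port extraction from free-form text; ppf.py may differ.
import re

PROXY_RE = re.compile(r"\b((?:\d{1,3}\.){3}\d{1,3}):(\d{1,5})\b")

def extract_proxies(text):
    """Return unique, plausible ip:port strings in order of appearance."""
    seen = []
    for ip, port in PROXY_RE.findall(text):
        # Reject impossible octets and ports before accepting a match.
        if all(int(o) <= 255 for o in ip.split(".")) and 0 < int(port) < 65536:
            p = "%s:%s" % (ip, port)
            if p not in seen:
                seen.append(p)
    return seen
```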

CLI Flags

--nobs          Use stdlib HTMLParser instead of BeautifulSoup
--file FILE     Import proxies from file
-q, --quiet     Show warnings and errors only
-v, --verbose   Show debug messages

REST API

Enable in config with httpd.enabled = 1.

# Get working proxies (quote the URL so the shell does not interpret '&')
curl 'http://localhost:8081/proxies?limit=10&proto=socks5'

# Get count
curl http://localhost:8081/proxies/count

# Health check
curl http://localhost:8081/health

Query parameters:

  • limit - Max results (default: 100)
  • proto - Filter by protocol (socks4/socks5/http)
  • country - Filter by country code
  • asn - Filter by ASN
  • format - Output format (json/plain)
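
From Python, the query string can be built with the stdlib urlencode, which also takes care of the '&' quoting that trips up unquoted shell commands. This helper is an illustration assuming the endpoint and parameters documented above.

```python
# Sketch of building a /proxies query URL; endpoint names follow the docs above.
try:
    from urllib.parse import urlencode      # Python 3
except ImportError:
    from urllib import urlencode            # Python 2.7

def proxies_url(base="http://localhost:8081", **params):
    """Build a /proxies URL with sorted, properly encoded query parameters."""
    query = urlencode(sorted(params.items()))
    return "%s/proxies?%s" % (base, query) if query else base + "/proxies"
```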

Architecture

Components

File                  Purpose
proxywatchd.py        Proxy validation daemon with multi-target voting
ppf.py                URL harvester and proxy extractor
scraper.py            Search engine integration
fetch.py              HTTP client with proxy support
dbs.py                Database operations
mysqlite.py           SQLite wrapper with WAL mode
connection_pool.py    Tor connection pooling with health tracking
config.py             Configuration management
httpd.py              REST API server

Validation Logic

Each proxy is tested against 3 random targets:

  • 2/3 majority required for success
  • Protocol auto-detected (tries HTTP, SOCKS5, SOCKS4)
  • SSL/TLS tested periodically
  • MITM detection via certificate validation
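
The 2-of-3 vote can be sketched as below; the check function and target list are placeholders, not the daemon's real API, and the real validator layers protocol detection and TLS checks on top.

```python
# Sketch of the 2-of-3 majority vote; `check` is a placeholder callable.
def majority_vote(proxy, targets, check, required=2):
    """Return True once `required` of the first three targets succeed."""
    successes = 0
    for target in targets[:3]:
        if check(proxy, target):
            successes += 1
        if successes >= required:
            return True   # short-circuit: majority already reached
    return False
```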

Database Schema

-- proxylist (proxies.sqlite)
proxy TEXT UNIQUE      -- ip:port
proto TEXT             -- socks4/socks5/http
country TEXT           -- 2-letter code
asn INT                -- autonomous system number
failed INT             -- consecutive failures
success_count INT      -- total successes
avg_latency REAL       -- rolling average (ms)
tested INT             -- last test timestamp

-- uris (websites.sqlite)
url TEXT UNIQUE        -- source URL
error INT              -- consecutive errors
stale_count INT        -- checks without new proxies
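
The proxylist schema above maps directly onto the stdlib sqlite3 module; the sketch below shows that mapping with the WAL pragma the README attributes to mysqlite.py, but it is an illustration rather than the project's wrapper.

```python
# Sketch of creating the proxylist table from the schema above.
import sqlite3

def open_proxy_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("PRAGMA journal_mode=WAL")   # WAL mode, as in mysqlite.py
    db.execute("""CREATE TABLE IF NOT EXISTS proxylist (
        proxy TEXT UNIQUE, proto TEXT, country TEXT, asn INT,
        failed INT DEFAULT 0, success_count INT DEFAULT 0,
        avg_latency REAL, tested INT)""")
    return db
```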

Threading Model

  • Priority queue orders jobs by proxy health
  • Dynamic thread scaling based on success rate
  • Work-stealing ensures even load distribution
  • Tor connection pooling with worker affinity
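
Health-ordered scheduling can be illustrated with a heap: the field names below mirror the schema (failed, avg_latency), but the daemon's real priority function may weigh things differently.

```python
# Illustrative health-first job ordering; not the daemon's actual scheduler.
import heapq

def schedule(proxies):
    """Yield proxies healthiest-first: fewest failures, then lowest latency."""
    heap = [(p["failed"], p["avg_latency"], p["proxy"]) for p in proxies]
    heapq.heapify(heap)
    while heap:
        yield heapq.heappop(heap)[2]
```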

Deployment

Container Deployment

All nodes use podman-compose with role-specific compose files, running rootless as the podman user. --network=host is required so containers can reach Tor at 127.0.0.1:9050.

# Build image
podman build -t ppf:latest .

# Start via compose
podman-compose up -d

# View logs / stop
podman-compose logs -f
podman-compose down

Operations Toolkit

The tools/ directory provides CLI wrappers for multi-node operations. Deployment uses an Ansible playbook over WireGuard for parallel execution and handler-based restarts.

ppf-deploy [targets...]        # validate + deploy + restart (playbook)
ppf-deploy --check             # dry run with diff
ppf-logs [node]                # view container logs (-f to follow)
ppf-service <cmd> [nodes...]   # status / start / stop / restart
ppf-db <cmd>                   # stats / purge-proxies / vacuum

See --help on each tool.

Troubleshooting

Low Success Rate

WATCHD X.XX% SUCCESS RATE - tor circuit blocked?
  • Tor circuit may be flagged; restart Tor
  • Target servers may be blocking; wait for rotation
  • Network issues; check connectivity

Database Locked

WAL mode handles most concurrency. If issues persist:

  • Reduce thread count
  • Check disk I/O
  • Verify single instance running

No Proxies Found

  • Check search engines in config
  • Verify Tor connectivity
  • Review scraper logs for rate limiting

License

See LICENSE file.