# PPF - Python Proxy Finder

A Python 2.7 proxy discovery and validation framework.

## Overview

PPF discovers proxy addresses by crawling websites and search engines, validates them through multi-target testing via Tor, and maintains a database of working proxies with automatic protocol detection (SOCKS4/SOCKS5/HTTP).

```
scraper.py ──> ppf.py ──> proxywatchd.py
    │             │             │
    │ search      │ harvest     │ validate
    │ engines     │ proxies     │ via tor
    v             v             v
          SQLite databases
```

## Requirements

- Python 2.7
- Tor SOCKS proxy (default: 127.0.0.1:9050)
- beautifulsoup4 (optional with --nobs flag)

## Installation

### Local

```sh
pip install -r requirements.txt
cp config.ini.sample config.ini
cp servers.txt.sample servers.txt
```

### Container (Rootless)

```sh
# On container host, as dedicated user
podman build -t ppf:latest .
podman run --rm ppf:latest python ppf.py --help
```

Prerequisites for rootless containers:

- subuid/subgid mappings configured
- linger enabled (`loginctl enable-linger $USER`)
- passt installed for networking

## Configuration

Copy `config.ini.sample` to `config.ini` and adjust:

```ini
[common]
tor_hosts = 127.0.0.1:9050   # Comma-separated Tor SOCKS addresses
timeout_connect = 10         # Connection timeout (seconds)
timeout_read = 15            # Read timeout (seconds)

[watchd]
threads = 10                 # Parallel validation threads
max_fail = 5                 # Failures before proxy marked dead
checktime = 1800             # Base recheck interval (seconds)
database = proxies.sqlite    # Proxy database path
stale_days = 30              # Days before removing dead proxies
stats_interval = 300         # Seconds between status reports

[ppf]
threads = 3                  # URL harvesting threads
search = 1                   # Enable search engine discovery
database = websites.sqlite   # URL database path

[scraper]
engines = searx,duckduckgo   # Comma-separated search engines
max_pages = 5                # Max pages per engine query

[httpd]
enabled = 0                  # Enable REST API
port = 8081                  # API listen port
listenip = 127.0.0.1         # API bind address
```

## Usage

### Proxy Validation Daemon

```sh
python proxywatchd.py
```
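The daemon's per-proxy decision rule (described later under Validation Logic) is a 2-of-3 majority vote across randomly chosen targets. Here is a minimal sketch of that rule, assuming a hypothetical `check_target` probe; the names and targets below are illustrative, not taken from the actual proxywatchd.py code:

```python
import random

# Illustrative target pool; the real daemon uses servers.txt entries
# or its built-in HTTP targets.
TARGETS = ["http://check-a.example", "http://check-b.example",
           "http://check-c.example", "http://check-d.example"]

def check_target(proxy, target):
    """Hypothetical single probe: connect through `proxy` to `target`.

    The real daemon performs this request via Tor with the configured
    timeouts and counts it as a success on an untampered response.
    """
    raise NotImplementedError

def validate(proxy, probe=check_target, targets=TARGETS, picks=3, quorum=2):
    """Probe `picks` random targets; the proxy passes on `quorum` successes."""
    chosen = random.sample(targets, picks)
    successes = sum(1 for t in chosen if probe(proxy, t))
    return successes >= quorum
```

The quorum keeps a single flaky target from condemning a healthy proxy, while a proxy that answers only one probe out of three still fails.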
Validates proxies from the database against multiple targets. Requires:

- `servers.txt` with IRC servers (for IRC mode) or uses built-in HTTP targets
- Running Tor instance

### URL Harvester

```sh
python ppf.py
```

Crawls URLs for proxy addresses and feeds them to the validator. Also starts the watchd internally.

### Search Engine Scraper

```sh
python scraper.py
```

Queries search engines for proxy list URLs. Supports:

- SearXNG instances
- DuckDuckGo, Startpage, Brave, Ecosia
- GitHub, GitLab, Codeberg (code search)

### Import From File

```sh
python ppf.py --file proxies.txt
```

### CLI Flags

```
--nobs         Use stdlib HTMLParser instead of BeautifulSoup
--file FILE    Import proxies from file
-q, --quiet    Show warnings and errors only
-v, --verbose  Show debug messages
```

## REST API

Enable in config with `httpd.enabled = 1`.

```sh
# Get working proxies (quote the URL so the shell does not interpret "&")
curl "http://localhost:8081/proxies?limit=10&proto=socks5"

# Get count
curl http://localhost:8081/proxies/count

# Health check
curl http://localhost:8081/health
```

Query parameters:

- `limit` - Max results (default: 100)
- `proto` - Filter by protocol (socks4/socks5/http)
- `country` - Filter by country code
- `asn` - Filter by ASN number
- `format` - Output format (json/plain)

## Architecture

### Components

| File | Purpose |
|------|---------|
| proxywatchd.py | Proxy validation daemon with multi-target voting |
| ppf.py | URL harvester and proxy extractor |
| scraper.py | Search engine integration |
| fetch.py | HTTP client with proxy support |
| dbs.py | Database operations |
| mysqlite.py | SQLite wrapper with WAL mode |
| connection_pool.py | Tor connection pooling with health tracking |
| config.py | Configuration management |
| httpd.py | REST API server |

### Validation Logic

Each proxy is tested against 3 random targets:

- 2/3 majority required for success
- Protocol auto-detected (tries HTTP, SOCKS5, SOCKS4)
- SSL/TLS tested periodically
- MITM detection via certificate validation

### Database Schema

```sql
-- proxylist (proxies.sqlite)
proxy TEXT UNIQUE        -- ip:port
proto TEXT               -- socks4/socks5/http
country TEXT             -- 2-letter code
asn INT                  -- autonomous system number
failed INT               -- consecutive failures
success_count INT        -- total successes
avg_latency REAL         -- rolling average (ms)
tested INT               -- last test timestamp

-- uris (websites.sqlite)
url TEXT UNIQUE          -- source URL
error INT                -- consecutive errors
stale_count INT          -- checks without new proxies
```

### Threading Model

- Priority queue orders jobs by proxy health
- Dynamic thread scaling based on success rate
- Work-stealing ensures even load distribution
- Tor connection pooling with worker affinity

## Deployment

### Systemd Service

```ini
[Unit]
Description=PPF Python Proxy Finder
After=network-online.target tor.service
Wants=network-online.target

[Service]
Type=simple
User=ppf
WorkingDirectory=/opt/ppf
# ppf.py is the main entry point (runs harvester + validator)
ExecStart=/usr/bin/python2 ppf.py
Restart=on-failure
RestartSec=30

[Install]
WantedBy=multi-user.target
```

### Container Deployment

All nodes use `podman-compose` with role-specific compose files.

```sh
# Build image
podman build -t ppf:latest .

# Run standalone (single node)
podman run -d --name ppf \
  --network=host \
  -v ./data:/app/data:Z \
  -v ./config.ini:/app/config.ini:ro \
  ppf:latest python ppf.py
```

Note: `--network=host` is required for Tor access at 127.0.0.1:9050.

### Operations Toolkit

The `tools/` directory provides CLI wrappers for multi-node operations. Deployment uses an Ansible playbook over WireGuard for parallel execution and handler-based restarts.

```sh
ppf-deploy [targets...]   # validate + deploy + restart (playbook)
ppf-deploy --check        # dry run with diff
ppf-logs [node]           # view container logs (-f to follow)
ppf-service [nodes...]    # status / start / stop / restart
```

See `--help` on each tool.

## Troubleshooting

### Low Success Rate

```
WATCHD X.XX% SUCCESS RATE - tor circuit blocked?
```

- Tor circuit may be flagged; restart Tor
- Target servers may be blocking; wait for rotation
- Network issues; check connectivity

### Database Locked

WAL mode handles most concurrency. If issues persist:

- Reduce thread count
- Check disk I/O
- Verify a single instance is running

### No Proxies Found

- Check search engines in config
- Verify Tor connectivity
- Review scraper logs for rate limiting

## License

See LICENSE file.