# PPF - Python Proxy Finder
A Python 2.7 proxy discovery and validation framework.
## Overview
PPF discovers proxy addresses by crawling websites and search engines, validates them through multi-target testing via Tor, and maintains a database of working proxies with automatic protocol detection (SOCKS4/SOCKS5/HTTP).
```
scraper.py ──> ppf.py ──> proxywatchd.py
    │             │              │
    │ search      │ harvest     │ validate
    │ engines     │ proxies     │ via tor
    v             v              v
           SQLite databases
```
## Requirements
- Python 2.7
- Tor SOCKS proxy (default: 127.0.0.1:9050)
- beautifulsoup4 (optional; pass `--nobs` to use the stdlib HTMLParser instead)
## Installation
### Local
```sh
pip install -r requirements.txt
cp config.ini.sample config.ini
cp servers.txt.sample servers.txt
```
### Container (Rootless)
```sh
# On container host, as dedicated user
podman build -t ppf:latest .
podman run --rm ppf:latest python ppf.py --help
```
Prerequisites for rootless containers:
- subuid/subgid mappings configured
- linger enabled (`loginctl enable-linger $USER`)
- passt installed for networking
## Configuration
Copy `config.ini.sample` to `config.ini` and adjust:
```ini
[common]
tor_hosts = 127.0.0.1:9050 # Comma-separated Tor SOCKS addresses
timeout_connect = 10 # Connection timeout (seconds)
timeout_read = 15 # Read timeout (seconds)

[watchd]
threads = 10 # Parallel validation threads
max_fail = 5 # Failures before proxy marked dead
checktime = 1800 # Base recheck interval (seconds)
database = proxies.sqlite # Proxy database path
stale_days = 30 # Days before removing dead proxies
stats_interval = 300 # Seconds between status reports

[ppf]
threads = 3 # URL harvesting threads
search = 1 # Enable search engine discovery
database = websites.sqlite # URL database path

[scraper]
engines = searx,duckduckgo # Comma-separated search engines
max_pages = 5 # Max pages per engine query

[httpd]
enabled = 0 # Enable REST API
port = 8081 # API listen port
listenip = 127.0.0.1 # API bind address
```
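The options above can be loaded with the standard library's ini parser. A minimal sketch, assuming the `[common]` section layout shown (and an ini file without inline `#` comments, which Python 3's `configparser` does not strip by default); the project's own `config.py` may parse differently:

```python
try:
    from configparser import ConfigParser  # Python 3
except ImportError:
    from ConfigParser import ConfigParser  # Python 2.7


def load_common(path="config.ini"):
    """Parse the [common] section into typed values."""
    cp = ConfigParser()
    cp.read(path)
    hosts = [h.strip() for h in cp.get("common", "tor_hosts").split(",")]
    return {
        "tor_hosts": hosts,                                     # list of ip:port strings
        "timeout_connect": cp.getint("common", "timeout_connect"),
        "timeout_read": cp.getint("common", "timeout_read"),
    }
```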
## Usage
### Proxy Validation Daemon
```sh
python proxywatchd.py
```
Validates proxies from the database against multiple targets. Requires:
- `servers.txt` listing IRC servers (IRC mode); otherwise built-in HTTP targets are used
- Running Tor instance
### URL Harvester
```sh
python ppf.py
```
Crawls URLs for proxy addresses and feeds them to the validator. Also starts the watchd internally.
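Extracting `ip:port` pairs from a crawled page can be sketched with a regex; the actual extraction logic lives in `ppf.py` and may be stricter, so treat this as an illustration:

```python
import re

# ip:port pairs as they commonly appear on proxy-list pages.
PROXY_RE = re.compile(r"\b((?:\d{1,3}\.){3}\d{1,3}):(\d{2,5})\b")


def extract_proxies(text):
    """Yield unique, plausibility-checked ip:port strings from a page body."""
    seen = set()
    for ip, port in PROXY_RE.findall(text):
        # Reject impossible octets and ports before emitting.
        if all(int(o) <= 255 for o in ip.split(".")) and int(port) <= 65535:
            entry = "%s:%s" % (ip, port)
            if entry not in seen:
                seen.add(entry)
                yield entry
```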
### Search Engine Scraper
```sh
python scraper.py
```
Queries search engines for proxy list URLs. Supports:
- SearXNG instances
- DuckDuckGo, Startpage, Brave, Ecosia
- GitHub, GitLab, Codeberg (code search)
### Import From File
```sh
python ppf.py --file proxies.txt
```
### CLI Flags
```
--nobs Use stdlib HTMLParser instead of BeautifulSoup
--file FILE Import proxies from file
-q, --quiet Show warnings and errors only
-v, --verbose Show debug messages
```
## REST API
Enable the API by setting `enabled = 1` in the `[httpd]` section of `config.ini`.
```sh
# Get working proxies
curl "http://localhost:8081/proxies?limit=10&proto=socks5"
# Get count
curl http://localhost:8081/proxies/count
# Health check
curl http://localhost:8081/health
```
Query parameters:
- `limit` - Max results (default: 100)
- `proto` - Filter by protocol (socks4/socks5/http)
- `country` - Filter by country code
- `asn` - Filter by autonomous system number (ASN)
- `format` - Output format (json/plain)
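A client can assemble these query parameters programmatically. A small sketch (the endpoint path and parameter names are taken from the examples above; fetch the resulting URL with any HTTP client):

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2.7

# Parameters the /proxies endpoint documents.
ALLOWED = ("limit", "proto", "country", "asn", "format")


def proxies_url(base="http://localhost:8081", **params):
    """Build a /proxies query URL, dropping unknown parameters."""
    query = sorted((k, v) for k, v in params.items() if k in ALLOWED)
    return "%s/proxies?%s" % (base, urlencode(query))
```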
## Architecture
### Components
| File | Purpose |
|------|---------|
| proxywatchd.py | Proxy validation daemon with multi-target voting |
| ppf.py | URL harvester and proxy extractor |
| scraper.py | Search engine integration |
| fetch.py | HTTP client with proxy support |
| dbs.py | Database operations |
| mysqlite.py | SQLite wrapper with WAL mode |
| connection_pool.py | Tor connection pooling with health tracking |
| config.py | Configuration management |
| httpd.py | REST API server |
### Validation Logic
Each proxy is tested against 3 random targets:
- 2/3 majority required for success
- Protocol auto-detected (tries HTTP, SOCKS5, SOCKS4)
- SSL/TLS tested periodically
- MITM detection via certificate validation
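The voting and protocol-detection steps above can be sketched as follows. This is a simplification: the daemon's real check also records latency and TLS results, and `try_protocol` stands in for an actual connection attempt:

```python
# Protocols are tried in this order until one succeeds.
PROTOCOL_ORDER = ("http", "socks5", "socks4")


def verdict(results, quorum=2):
    """2-of-3 majority vote over per-target test results (booleans)."""
    return sum(bool(r) for r in results) >= quorum


def detect_protocol(try_protocol):
    """Return the first protocol for which the probe callback succeeds."""
    for proto in PROTOCOL_ORDER:
        if try_protocol(proto):
            return proto
    return None  # no protocol worked; proxy counts as a failure
```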
### Database Schema
```sql
-- proxylist (proxies.sqlite)
proxy TEXT UNIQUE -- ip:port
proto TEXT -- socks4/socks5/http
country TEXT -- 2-letter code
asn INT -- autonomous system number
failed INT -- consecutive failures
success_count INT -- total successes
avg_latency REAL -- rolling average (ms)
tested INT -- last test timestamp
-- uris (websites.sqlite)
url TEXT UNIQUE -- source URL
error INT -- consecutive errors
stale_count INT -- checks without new proxies
```
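Querying the proxy table directly follows the schema above. A sketch that assumes "working" means `failed = 0` (the daemon's exact criterion may differ):

```python
import sqlite3


def working_proxies(db_path="proxies.sqlite", proto=None, limit=100):
    """Fetch live proxies ordered by rolling average latency."""
    con = sqlite3.connect(db_path)
    try:
        sql = "SELECT proxy, proto, avg_latency FROM proxylist WHERE failed = 0"
        args = []
        if proto:
            sql += " AND proto = ?"
            args.append(proto)
        sql += " ORDER BY avg_latency LIMIT ?"
        args.append(limit)
        return con.execute(sql, args).fetchall()
    finally:
        con.close()
```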
### Threading Model
- Priority queue orders jobs by proxy health
- Dynamic thread scaling based on success rate
- Work-stealing ensures even load distribution
- Tor connection pooling with worker affinity
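A priority queue ordered by proxy health can be sketched with `heapq`. The field names mirror the `proxylist` columns, but the exact priority formula used by the daemon is an assumption here:

```python
import heapq
import itertools

# Tie-breaker so entries with equal priority stay FIFO and
# heapq never has to compare the payload strings.
_counter = itertools.count()


def push_job(heap, proxy, failed, avg_latency):
    """Queue a recheck; fewer failures and lower latency sort first."""
    priority = (failed, avg_latency)
    heapq.heappush(heap, (priority, next(_counter), proxy))


def pop_job(heap):
    """Return the healthiest queued proxy."""
    return heapq.heappop(heap)[2]
```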
## Deployment
### Container Deployment
All nodes use `podman-compose` with role-specific compose files
(rootless, as the `podman` user). `--network=host` is required for
Tor access at 127.0.0.1:9050.
```sh
# Build image
podman build -t ppf:latest .
# Start via compose
podman-compose up -d
# View logs / stop
podman-compose logs -f
podman-compose down
```
### Operations Toolkit
The `tools/` directory provides CLI wrappers for multi-node operations.
Deployment uses an Ansible playbook over WireGuard for parallel execution
and handler-based restarts.
```sh
ppf-deploy [targets...] # validate + deploy + restart (playbook)
ppf-deploy --check # dry run with diff
ppf-logs [node] # view container logs (-f to follow)
ppf-service <cmd> [nodes...] # status / start / stop / restart
ppf-db <cmd> # stats / purge-proxies / vacuum
ppf-status # cluster overview (containers, workers, queue)
```
See `--help` on each tool.
## Troubleshooting
### Low Success Rate
```
WATCHD X.XX% SUCCESS RATE - tor circuit blocked?
```
- Tor circuit may be flagged; restart Tor
- Target servers may be blocking; wait for rotation
- Network issues; check connectivity
### Database Locked
WAL mode handles most concurrency. If issues persist:
- Reduce thread count
- Check disk I/O
- Verify single instance running
### No Proxies Found
- Check search engines in config
- Verify Tor connectivity
- Review scraper logs for rate limiting
## License
See LICENSE file.