# PPF - Python Proxy Finder

A Python 2.7 proxy discovery and validation framework.

## Overview

PPF discovers proxy addresses by crawling websites and search engines, validates them through multi-target testing via Tor, and maintains a database of working proxies with automatic protocol detection (SOCKS4/SOCKS5/HTTP).

```
scraper.py ──> ppf.py ──> proxywatchd.py
    │            │             │
    │ search     │ harvest     │ validate
    │ engines    │ proxies     │ via tor
    v            v             v
          SQLite databases
```
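
The harvesting stage ultimately boils down to pulling `ip:port` pairs out of fetched pages. A minimal sketch of that extraction step (the regex and the `extract_proxies` helper are illustrative, not PPF's actual implementation):

```python
import re

# Matches IPv4:port pairs such as 203.0.113.7:1080 anywhere in a page.
PROXY_RE = re.compile(r'\b((?:\d{1,3}\.){3}\d{1,3}):(\d{1,5})\b')

def extract_proxies(text):
    """Return unique, plausibly valid ip:port strings found in text."""
    found = set()
    for ip, port in PROXY_RE.findall(text):
        octets = [int(o) for o in ip.split('.')]
        if all(o <= 255 for o in octets) and 0 < int(port) <= 65535:
            found.add('%s:%s' % (ip, port))
    return sorted(found)
```

Deduplication matters here: the same proxy tends to appear on many scraped lists, and only one row per `ip:port` should reach the database.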

## Requirements

- Python 2.7
- Tor SOCKS proxy (default: 127.0.0.1:9050)
- beautifulsoup4 (optional; the --nobs flag falls back to the stdlib parser)

## Installation

### Local

```sh
pip install -r requirements.txt
cp config.ini.sample config.ini
cp servers.txt.sample servers.txt
```

### Container (Rootless)

```sh
# On container host, as dedicated user
podman build -t ppf:latest .
podman run --rm ppf:latest python ppf.py --help
```

Prerequisites for rootless containers:

- subuid/subgid mappings configured
- linger enabled (`loginctl enable-linger $USER`)
- passt installed for networking

## Configuration

Copy `config.ini.sample` to `config.ini` and adjust:

```ini
[common]
tor_hosts = 127.0.0.1:9050    # Comma-separated Tor SOCKS addresses
timeout_connect = 10          # Connection timeout (seconds)
timeout_read = 15             # Read timeout (seconds)

[watchd]
threads = 10                  # Parallel validation threads
max_fail = 5                  # Failures before proxy marked dead
checktime = 1800              # Base recheck interval (seconds)
database = proxies.sqlite     # Proxy database path
stale_days = 30               # Days before removing dead proxies
stats_interval = 300          # Seconds between status reports

[ppf]
threads = 3                   # URL harvesting threads
search = 1                    # Enable search engine discovery
database = websites.sqlite    # URL database path

[scraper]
engines = searx,duckduckgo    # Comma-separated search engines
max_pages = 5                 # Max pages per engine query

[httpd]
enabled = 0                   # Enable REST API
port = 8081                   # API listen port
listenip = 127.0.0.1          # API bind address
```
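
PPF wraps configuration handling in `config.py`; as a standalone sketch, reading a section like `[watchd]` with sample-file defaults looks roughly like this (Python 3 `configparser` shown for brevity; the option names are from the sample above, the helper name is hypothetical):

```python
import configparser

def load_watchd_settings(path):
    """Read the [watchd] section, falling back to the sample defaults."""
    cp = configparser.ConfigParser(inline_comment_prefixes=('#',))
    cp.read(path)  # silently yields defaults if the file is missing
    return {
        'threads': cp.getint('watchd', 'threads', fallback=10),
        'max_fail': cp.getint('watchd', 'max_fail', fallback=5),
        'checktime': cp.getint('watchd', 'checktime', fallback=1800),
        'database': cp.get('watchd', 'database', fallback='proxies.sqlite'),
    }
```

`inline_comment_prefixes` is needed because the sample file puts `#` comments on the same line as the values.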

## Usage

### Proxy Validation Daemon

```sh
python proxywatchd.py
```

Validates proxies from the database against multiple targets. Requires:

- `servers.txt` with IRC servers (for IRC mode); built-in HTTP targets are used otherwise
- A running Tor instance

### URL Harvester

```sh
python ppf.py
```

Crawls URLs for proxy addresses and feeds them to the validator. Also starts the validation daemon (watchd) internally.

### Search Engine Scraper

```sh
python scraper.py
```

Queries search engines for proxy list URLs. Supports:

- SearXNG instances
- DuckDuckGo, Startpage, Brave, Ecosia
- GitHub, GitLab, Codeberg (code search)
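
The engine queries are ordinary paginated search requests. As an illustrative sketch (the instance address, parameter names, and helper are assumptions, not scraper.py's actual code), a SearXNG query URL could be built like this:

```python
from urllib.parse import urlencode

def build_searx_query(instance, query, page=1):
    """Build a search URL for a SearXNG instance (address is hypothetical)."""
    params = {'q': query, 'format': 'json', 'pageno': page}
    return '%s/search?%s' % (instance.rstrip('/'), urlencode(params))
```

The `max_pages` config option above would bound the `page` loop per engine.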

### Import From File

```sh
python ppf.py --file proxies.txt
```

### CLI Flags

```
--nobs          Use stdlib HTMLParser instead of BeautifulSoup
--file FILE     Import proxies from file
-q, --quiet     Show warnings and errors only
-v, --verbose   Show debug messages
```
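
The flags above map naturally onto stdlib argument parsing; a sketch of an equivalent parser (not necessarily how ppf.py defines it):

```python
import argparse

def build_parser():
    p = argparse.ArgumentParser(prog='ppf.py')
    p.add_argument('--nobs', action='store_true',
                   help='Use stdlib HTMLParser instead of BeautifulSoup')
    p.add_argument('--file', metavar='FILE',
                   help='Import proxies from file')
    # Quiet and verbose contradict each other, so make them exclusive.
    g = p.add_mutually_exclusive_group()
    g.add_argument('-q', '--quiet', action='store_true',
                   help='Show warnings and errors only')
    g.add_argument('-v', '--verbose', action='store_true',
                   help='Show debug messages')
    return p
```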

## REST API

Enable in config with `httpd.enabled = 1`.

```sh
# Get working proxies (quote the URL so the shell does not split on &)
curl 'http://localhost:8081/proxies?limit=10&proto=socks5'

# Get count
curl http://localhost:8081/proxies/count

# Health check
curl http://localhost:8081/health
```

Query parameters:

- `limit` - Max results (default: 100)
- `proto` - Filter by protocol (socks4/socks5/http)
- `country` - Filter by country code
- `asn` - Filter by autonomous system number
- `format` - Output format (json/plain)
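
A client can assemble these parameters with the stdlib. A sketch (port taken from the sample config; the JSON response shape is not documented here, so `fetch_proxies` simply decodes whatever the server returns):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = 'http://localhost:8081'

def proxies_url(limit=100, proto=None, country=None, fmt='json'):
    """Build a /proxies query URL from the documented parameters."""
    params = {'limit': limit, 'format': fmt}
    if proto:
        params['proto'] = proto
    if country:
        params['country'] = country
    return '%s/proxies?%s' % (API, urlencode(params))

def fetch_proxies(**kw):
    """Fetch and decode the proxy list (requires a running httpd)."""
    with urlopen(proxies_url(**kw)) as resp:
        return json.load(resp)
```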

## Architecture

### Components

| File | Purpose |
|------|---------|
| proxywatchd.py | Proxy validation daemon with multi-target voting |
| ppf.py | URL harvester and proxy extractor |
| scraper.py | Search engine integration |
| fetch.py | HTTP client with proxy support |
| dbs.py | Database operations |
| mysqlite.py | SQLite wrapper with WAL mode |
| connection_pool.py | Tor connection pooling with health tracking |
| config.py | Configuration management |
| httpd.py | REST API server |

### Validation Logic

Each proxy is tested against 3 random targets:

- 2/3 majority required for success
- Protocol auto-detected (tries HTTP, SOCKS5, SOCKS4)
- SSL/TLS tested periodically
- MITM detection via certificate validation
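
The 2-of-3 voting rule can be sketched as follows; `check` stands in for whatever per-target connection test the daemon actually runs:

```python
import random

def validate(proxy, targets, check, votes=3, quorum=2):
    """Test proxy against `votes` random targets; pass on `quorum` successes.

    `check(proxy, target)` must return True when the proxy reached the target.
    """
    sample = random.sample(targets, min(votes, len(targets)))
    hits = sum(1 for t in sample if check(proxy, t))
    return hits >= quorum
```

Voting across several targets keeps a single flaky or blocking target from condemning an otherwise working proxy.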

### Database Schema

```sql
-- proxylist (proxies.sqlite)
proxy TEXT UNIQUE     -- ip:port
proto TEXT            -- socks4/socks5/http
country TEXT          -- 2-letter code
asn INT               -- autonomous system number
failed INT            -- consecutive failures
success_count INT     -- total successes
avg_latency REAL      -- rolling average (ms)
tested INT            -- last test timestamp

-- uris (websites.sqlite)
url TEXT UNIQUE       -- source URL
error INT             -- consecutive errors
stale_count INT       -- checks without new proxies
```
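
Given that schema, selecting the healthiest proxies is a straightforward query. An illustrative sketch (not dbs.py's actual code):

```python
import sqlite3

def best_proxies(db_path, proto=None, limit=10):
    """Working proxies (no consecutive failures), fastest first."""
    conn = sqlite3.connect(db_path)
    try:
        sql = 'SELECT proxy, proto, avg_latency FROM proxylist WHERE failed = 0'
        args = []
        if proto:
            sql += ' AND proto = ?'
            args.append(proto)
        sql += ' ORDER BY avg_latency LIMIT ?'
        args.append(limit)
        return conn.execute(sql, args).fetchall()
    finally:
        conn.close()
```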

### Threading Model

- Priority queue orders jobs by proxy health
- Dynamic thread scaling based on success rate
- Work-stealing ensures even load distribution
- Tor connection pooling with worker affinity
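
The health-ordered queue can be sketched with the stdlib heap; the scoring formula here (fewest failures first, then lowest latency) is illustrative, not the daemon's exact metric:

```python
import heapq

class HealthQueue:
    """Jobs pop in order of proxy health: fewer failures, lower latency first."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker so equal scores stay FIFO

    def push(self, proxy, failed, avg_latency):
        score = (failed, avg_latency)  # lower tuple compares as healthier
        heapq.heappush(self._heap, (score, self._seq, proxy))
        self._seq += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]
```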

## Deployment

### Container Deployment

All nodes use `podman-compose` with role-specific compose files (rootless, as the `podman` user). `--network=host` is required so containers can reach Tor at 127.0.0.1:9050.

```sh
# Build image
podman build -t ppf:latest .

# Start via compose
podman-compose up -d

# View logs / stop
podman-compose logs -f
podman-compose down
```

### Operations Toolkit

The `tools/` directory provides CLI wrappers for multi-node operations. Deployment uses an Ansible playbook over WireGuard for parallel execution and handler-based restarts.

```
ppf-deploy [targets...]        # validate + deploy + restart (playbook)
ppf-deploy --check             # dry run with diff
ppf-logs [node]                # view container logs (-f to follow)
ppf-service <cmd> [nodes...]   # status / start / stop / restart
ppf-db <cmd>                   # stats / purge-proxies / vacuum
ppf-status                     # cluster overview (containers, workers, queue)
```

See `--help` on each tool.

## Troubleshooting

### Low Success Rate

```
WATCHD X.XX% SUCCESS RATE - tor circuit blocked?
```

- Tor circuit may be flagged; restart Tor
- Target servers may be blocking; wait for rotation
- Network issues; check connectivity

### Database Locked

WAL mode handles most concurrency. If issues persist:

- Reduce thread count
- Check disk I/O
- Verify only a single instance is running
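
For reference, this is how WAL is typically enabled with the stdlib `sqlite3` module (a sketch of the mechanism, not `mysqlite.py`'s exact code; the timeout values are illustrative):

```python
import sqlite3

def connect_wal(path, timeout=30.0):
    """Open a SQLite connection with WAL journaling and a busy timeout."""
    conn = sqlite3.connect(path, timeout=timeout)
    # WAL lets one writer proceed alongside concurrent readers.
    conn.execute('PRAGMA journal_mode=WAL')
    # Wait (ms) instead of failing immediately when the database is busy.
    conn.execute('PRAGMA busy_timeout=5000')
    return conn
```

The journal mode is persistent per database file, but a busy timeout must be set on every connection.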

### No Proxies Found

- Check search engines in config
- Verify Tor connectivity
- Review scraper logs for rate limiting

## License

See LICENSE file.