diff --git a/README.md b/README.md
new file mode 100644
index 0000000..e7d856b
--- /dev/null
+++ b/README.md
@@ -0,0 +1,261 @@
# PPF - Proxy Fetcher

A Python 2.7 proxy discovery and validation framework.

## Overview

PPF discovers proxy addresses by crawling websites and search engines, validates them through multi-target testing via Tor, and maintains a database of working proxies with automatic protocol detection (SOCKS4/SOCKS5/HTTP).

```
scraper.py ──> ppf.py ──> proxywatchd.py
    │             │             │
    │ search      │ harvest     │ validate
    │ engines     │ proxies     │ via tor
    v             v             v
           SQLite databases
```

## Requirements

- Python 2.7
- Tor SOCKS proxy (default: 127.0.0.1:9050)
- beautifulsoup4 (optional; pass `--nobs` to use the stdlib parser instead)

## Installation

### Local

```sh
pip install -r requirements.txt
cp config.ini.sample config.ini
cp servers.txt.sample servers.txt
```

### Container (Rootless)

```sh
# On container host, as dedicated user
podman build -t ppf:latest .
podman run --rm ppf:latest python ppf.py --help
```

Prerequisites for rootless containers:
- subuid/subgid mappings configured
- linger enabled (`loginctl enable-linger $USER`)
- passt installed for networking

## Configuration

Copy `config.ini.sample` to `config.ini` and adjust:

```ini
[common]
tor_hosts = 127.0.0.1:9050    # Comma-separated Tor SOCKS addresses
timeout_connect = 10          # Connection timeout (seconds)
timeout_read = 15             # Read timeout (seconds)

[watchd]
threads = 10                  # Parallel validation threads
max_fail = 5                  # Failures before proxy marked dead
checktime = 1800              # Base recheck interval (seconds)
database = proxies.sqlite     # Proxy database path
stale_days = 30               # Days before removing dead proxies
stats_interval = 300          # Seconds between status reports

[ppf]
threads = 3                   # URL harvesting threads
search = 1                    # Enable search engine discovery
database = websites.sqlite    # URL database path

[scraper]
engines = searx,duckduckgo    # Comma-separated search engines
max_pages = 5                 # Max pages per engine query

[httpd]
enabled = 0                   # Enable REST API
port = 8081                   # API listen port
listenip = 127.0.0.1          # API bind address
```

## Usage

### Proxy Validation Daemon

```sh
python proxywatchd.py
```

Validates proxies from the database against multiple targets. Requires:
- `servers.txt` listing IRC servers (for IRC mode); otherwise the built-in HTTP targets are used
- a running Tor instance

### URL Harvester

```sh
python ppf.py
```

Crawls URLs for proxy addresses and feeds them to the validator. It also starts the watchd internally.

### Search Engine Scraper

```sh
python scraper.py
```

Queries search engines for proxy-list URLs. Supported sources:
- SearXNG instances
- DuckDuckGo, Startpage, Brave, Ecosia
- GitHub, GitLab, Codeberg (code search)

### Import From File

```sh
python ppf.py --file proxies.txt
```

### CLI Flags

```
--nobs         Use stdlib HTMLParser instead of BeautifulSoup
--file FILE    Import proxies from file
-q, --quiet    Show warnings and errors only
-v, --verbose  Show debug messages
```

## REST API

Enable in config with `httpd.enabled = 1`.
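
The sample config ships with the API disabled (`enabled = 0`); one way to flip the flag in place is with sed (a sketch, assuming GNU sed and the sample's exact `enabled = 0` line):

```sh
# Enable the REST API in config.ini (GNU sed; BSD/macOS sed needs -i '')
sed -i 's/^enabled = 0/enabled = 1/' config.ini
```

Restart the daemon afterwards so httpd.py picks up the new setting.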

```sh
# Get working proxies (quote the URL so the shell does not treat "&" as a job separator)
curl "http://localhost:8081/proxies?limit=10&proto=socks5"

# Get count
curl http://localhost:8081/proxies/count

# Health check
curl http://localhost:8081/health
```

Query parameters:
- `limit` - Max results (default: 100)
- `proto` - Filter by protocol (socks4/socks5/http)
- `country` - Filter by country code
- `format` - Output format (json/plain)

## Architecture

### Components

| File | Purpose |
|------|---------|
| proxywatchd.py | Proxy validation daemon with multi-target voting |
| ppf.py | URL harvester and proxy extractor |
| scraper.py | Search engine integration |
| fetch.py | HTTP client with proxy support |
| dbs.py | Database operations |
| mysqlite.py | SQLite wrapper with WAL mode |
| connection_pool.py | Tor connection pooling with health tracking |
| config.py | Configuration management |
| httpd.py | REST API server |

### Validation Logic

Each proxy is tested against 3 random targets:
- a 2/3 majority is required for success
- protocol is auto-detected (tries HTTP, SOCKS5, SOCKS4)
- SSL/TLS is tested periodically
- MITM detection via certificate validation

### Database Schema

```sql
-- proxylist (proxies.sqlite)
proxy TEXT UNIQUE     -- ip:port
proto TEXT            -- socks4/socks5/http
country TEXT          -- 2-letter code
failed INT            -- consecutive failures
success_count INT     -- total successes
avg_latency REAL      -- rolling average (ms)
tested INT            -- last test timestamp

-- uris (websites.sqlite)
url TEXT UNIQUE       -- source URL
error INT             -- consecutive errors
stale_count INT       -- checks without new proxies
```

### Threading Model

- Priority queue orders jobs by proxy health
- Dynamic thread scaling based on success rate
- Work-stealing ensures even load distribution
- Tor connection pooling with worker affinity

## Deployment

### Systemd Service

```ini
[Unit]
Description=PPF Proxy Validator
After=network-online.target tor.service
Wants=network-online.target

[Service]
Type=simple
User=ppf
WorkingDirectory=/opt/ppf
ExecStart=/usr/bin/python2 proxywatchd.py
Restart=on-failure
RestartSec=30

[Install]
WantedBy=multi-user.target
```

### Container Deployment

```sh
# Build
podman build -t ppf:latest .

# Run with persistent storage
podman run -d --name ppf \
  -v ./data:/app/data:Z \
  -v ./config.ini:/app/config.ini:ro \
  ppf:latest python proxywatchd.py

# Generate systemd unit
podman generate systemd --name ppf --files --new
```

## Troubleshooting

### Low Success Rate

```
WATCHD X.XX% SUCCESS RATE - tor circuit blocked?
```

- The Tor circuit may be flagged; restart Tor for a fresh circuit
- Target servers may be blocking; wait for target rotation
- Network issues; check connectivity

### Database Locked

WAL mode handles most concurrent access. If the problem persists:
- Reduce the thread count
- Check disk I/O
- Verify that only a single instance is running

### No Proxies Found

- Check the search engines configured in `config.ini`
- Verify Tor connectivity
- Review the scraper logs for rate limiting

## License

See LICENSE file.
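
## Appendix: Validation Sketch

The 2/3 majority vote described under Validation Logic can be sketched as follows. This is a simplified illustration, not PPF's actual code: the `validate` function and its `check` callback are hypothetical stand-ins for the real per-target test in proxywatchd.py.

```python
import random


def validate(proxy, targets, check, picks=3, needed=2):
    """Test `proxy` against `picks` randomly chosen targets.

    `check(proxy, target)` should return True on a successful fetch
    through the proxy; a target that raises simply contributes no vote.
    The proxy passes when at least `needed` of `picks` targets succeed.
    """
    successes = 0
    for target in random.sample(targets, picks):
        try:
            if check(proxy, target):
                successes += 1
        except Exception:
            pass  # treat errors the same as a failed vote
    return successes >= needed
```

With three targets and a two-vote threshold, a single flaky or blocked target cannot flip the verdict in either direction, which is why the daemon can keep validating even while individual targets rotate or rate-limit.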
diff --git a/ROADMAP.md b/ROADMAP.md
index 30d7cab..8324f87 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -177,18 +177,18 @@ PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework design
├──────────────────────────┬──────────────────────────────────────────────────┤
│ HIGH IMPACT / LOW EFFORT │ HIGH IMPACT / HIGH EFFORT                        │
│                          │                                                  │
-│ ● Unify _known_proxies   │ ● Connection pooling                             │
-│ ● Graceful DB errors     │ ● Dynamic thread scaling                         │
-│ ● Batch inserts          │ ● Unit test infrastructure                       │
-│ ● WAL mode for SQLite    │ ● Latency tracking                               │
+│ [x] Unify _known_proxies │ [x] Connection pooling                           │
+│ [x] Graceful DB errors   │ [x] Dynamic thread scaling                       │
+│ [x] Batch inserts        │ [ ] Unit test infrastructure                     │
+│ [x] WAL mode for SQLite  │ [x] Latency tracking                             │
│                          │                                                  │
├──────────────────────────┼──────────────────────────────────────────────────┤
│ LOW IMPACT / LOW EFFORT  │ LOW IMPACT / HIGH EFFORT                         │
│                          │                                                  │
-│ ● Standardize logging    │ ● Geographic validation                          │
-│ ● Config validation      │ ● Additional scrapers                            │
-│ ● Export functionality   │ ● API sources                                    │
-│ ● Status output          │ ● Protocol fingerprinting                        │
+│ [x] Standardize logging  │ [ ] Geographic validation                        │
+│ [x] Config validation    │ [x] Additional scrapers                          │
+│ [ ] Export functionality │ [ ] API sources                                  │
+│ [x] Status output        │ [ ] Protocol fingerprinting                      │
│                          │                                                  │
└──────────────────────────┴──────────────────────────────────────────────────┘
```
@@ -233,6 +233,41 @@ PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework design
- [x] Stale proxy cleanup (cleanup_stale() with configurable stale_days)
- [x] Timeout config options (timeout_connect, timeout_read)

+### Connection Pooling (Done)
+- [x] TorHostState class tracking per-host health and latency
+- [x] TorConnectionPool with worker affinity for circuit reuse
+- [x] Exponential backoff (5s, 10s, 20s, 40s, max 60s) on failures
+- [x] Pool warmup and health status reporting
+
+### Priority Queue (Done)
+- [x] PriorityJobQueue class with heap-based ordering
+- [x] calculate_priority() assigns priority 0-4 by proxy state
+- [x] New proxies tested first, high-fail proxies last
+
+### Dynamic Thread Scaling (Done)
+- [x] ThreadScaler class adjusts thread count dynamically
+- [x] Scales up when queue deep and success rate acceptable
+- [x] Scales down when queue shallow or success rate drops
+- [x] Respects min/max bounds with cooldown period
+
+### Latency Tracking (Done)
+- [x] avg_latency, latency_samples columns in proxylist
+- [x] Exponential moving average calculation
+- [x] Migration function for existing databases
+- [x] Latency recorded for successful proxy tests
+
+### Container Support (Done)
+- [x] Dockerfile with Python 2.7-slim base
+- [x] docker-compose.yml for local development
+- [x] Rootless podman deployment documentation
+- [x] Volume mounts for persistent data
+
+### Code Style (Done)
+- [x] Normalized indentation (4-space, no tabs)
+- [x] Removed dead code and unused imports
+- [x] Added docstrings to classes and functions
+- [x] Python 2/3 compatible imports (Queue/queue)
+
---

## Technical Debt