docs: add README and update ROADMAP

- README.md: installation, configuration, usage, deployment
- ROADMAP.md: mark completed items (pooling, scaling, latency, containers)
- priority matrix updated with completion status
Author: Username
Date: 2025-12-21 10:19:18 +01:00
Parent: 79475c2bff
Commit: 55bc9a635e
2 changed files with 304 additions and 8 deletions

README.md (new file, 261 lines)

@@ -0,0 +1,261 @@
# PPF - Proxy Fetcher
A Python 2.7 proxy discovery and validation framework.
## Overview
PPF discovers proxy addresses by crawling websites and search engines, validates them through multi-target testing via Tor, and maintains a database of working proxies with automatic protocol detection (SOCKS4/SOCKS5/HTTP).
```
scraper.py ──> ppf.py ──> proxywatchd.py
    │            │             │
    │ search     │ harvest     │ validate
    │ engines    │ proxies     │ via tor
    v            v             v
         SQLite databases
```
## Requirements
- Python 2.7
- Tor SOCKS proxy (default: 127.0.0.1:9050)
- beautifulsoup4 (optional; the `--nobs` flag falls back to the stdlib parser)
## Installation
### Local
```sh
pip install -r requirements.txt
cp config.ini.sample config.ini
cp servers.txt.sample servers.txt
```
### Container (Rootless)
```sh
# On container host, as dedicated user
podman build -t ppf:latest .
podman run --rm ppf:latest python ppf.py --help
```
Prerequisites for rootless containers:
- subuid/subgid mappings configured
- linger enabled (`loginctl enable-linger $USER`)
- passt installed for networking
## Configuration
Copy `config.ini.sample` to `config.ini` and adjust:
```ini
[common]
tor_hosts = 127.0.0.1:9050      ; Comma-separated Tor SOCKS addresses
timeout_connect = 10            ; Connection timeout (seconds)
timeout_read = 15               ; Read timeout (seconds)

[watchd]
threads = 10                    ; Parallel validation threads
max_fail = 5                    ; Failures before proxy marked dead
checktime = 1800                ; Base recheck interval (seconds)
database = proxies.sqlite       ; Proxy database path
stale_days = 30                 ; Days before removing dead proxies
stats_interval = 300            ; Seconds between status reports

[ppf]
threads = 3                     ; URL harvesting threads
search = 1                      ; Enable search engine discovery
database = websites.sqlite      ; URL database path

[scraper]
engines = searx,duckduckgo      ; Comma-separated search engines
max_pages = 5                   ; Max pages per engine query

[httpd]
enabled = 0                     ; Enable REST API
port = 8081                     ; API listen port
listenip = 127.0.0.1            ; API bind address
```
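Reading these settings with the stdlib parser might look like the sketch below (Python 2/3 compatible; the section and option names follow the sample above, while the helper name and fallback defaults are assumptions, not the project's actual `config.py`):

```python
try:
    from configparser import ConfigParser              # Python 3
except ImportError:
    from ConfigParser import SafeConfigParser as ConfigParser  # Python 2

def load_watchd_settings(path="config.ini"):
    """Return [watchd] settings as a dict, falling back to defaults."""
    cp = ConfigParser()
    cp.read(path)

    def get_int(section, option, default):
        # Use the configured value when present, otherwise the default
        if cp.has_option(section, option):
            return cp.getint(section, option)
        return default

    return {
        "threads":   get_int("watchd", "threads", 10),
        "max_fail":  get_int("watchd", "max_fail", 5),
        "checktime": get_int("watchd", "checktime", 1800),
    }
```

Note that inline comments in the sample use `;`, which Python 2's ConfigParser strips; inline `#` comments would be read as part of the value.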
## Usage
### Proxy Validation Daemon
```sh
python proxywatchd.py
```
Validates proxies from the database against multiple targets. Requires:
- `servers.txt` with IRC servers (for IRC mode); built-in HTTP targets are used otherwise
- Running Tor instance
### URL Harvester
```sh
python ppf.py
```
Crawls URLs for proxy addresses and feeds them to the validator. Also starts the watchd internally.
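The extraction step amounts to scanning harvested page text for `ip:port` pairs. A minimal sketch (the actual pattern and validation in `ppf.py` may differ):

```python
import re

# Candidate ip:port pairs; octet and port ranges are validated below.
PROXY_RE = re.compile(r"\b((?:\d{1,3}\.){3}\d{1,3}):(\d{2,5})\b")

def extract_proxies(text):
    """Return unique ip:port strings found in text, preserving order."""
    seen, out = set(), []
    for ip, port in PROXY_RE.findall(text):
        # Reject impossible octets (e.g. 999) and out-of-range ports
        if all(0 <= int(o) <= 255 for o in ip.split(".")) and 0 < int(port) < 65536:
            p = "%s:%s" % (ip, port)
            if p not in seen:
                seen.add(p)
                out.append(p)
    return out
```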
### Search Engine Scraper
```sh
python scraper.py
```
Queries search engines for proxy list URLs. Supports:
- SearXNG instances
- DuckDuckGo, Startpage, Brave, Ecosia
- GitHub, GitLab, Codeberg (code search)
### Import From File
```sh
python ppf.py --file proxies.txt
```
### CLI Flags
```
--nobs Use stdlib HTMLParser instead of BeautifulSoup
--file FILE Import proxies from file
-q, --quiet Show warnings and errors only
-v, --verbose Show debug messages
```
## REST API
Enable by setting `enabled = 1` in the `[httpd]` section of the config.
```sh
# Get working proxies
curl "http://localhost:8081/proxies?limit=10&proto=socks5"
# Get count
curl http://localhost:8081/proxies/count
# Health check
curl http://localhost:8081/health
```
Query parameters:
- `limit` - Max results (default: 100)
- `proto` - Filter by protocol (socks4/socks5/http)
- `country` - Filter by country code
- `format` - Output format (json/plain)
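From a script, request URLs for this endpoint can be assembled with the stdlib (a sketch; the endpoint and parameter names follow the README, the default base URL matches the `[httpd]` defaults):

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

def api_url(base="http://127.0.0.1:8081", **params):
    """Build a /proxies request URL; parameters are sorted for stable output."""
    if not params:
        return "%s/proxies" % base
    return "%s/proxies?%s" % (base, urlencode(sorted(params.items())))
```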
## Architecture
### Components
| File | Purpose |
|------|---------|
| proxywatchd.py | Proxy validation daemon with multi-target voting |
| ppf.py | URL harvester and proxy extractor |
| scraper.py | Search engine integration |
| fetch.py | HTTP client with proxy support |
| dbs.py | Database operations |
| mysqlite.py | SQLite wrapper with WAL mode |
| connection_pool.py | Tor connection pooling with health tracking |
| config.py | Configuration management |
| httpd.py | REST API server |
### Validation Logic
Each proxy is tested against 3 random targets:
- 2/3 majority required for success
- Protocol auto-detected (tries HTTP, SOCKS5, SOCKS4)
- SSL/TLS tested periodically
- MITM detection via certificate validation
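The 2-of-3 voting above can be sketched as follows; `test_once` stands in for the real per-target check in `proxywatchd.py` (the name is hypothetical):

```python
import random

def validate(proxy, targets, test_once, votes_needed=2, sample=3):
    """Test proxy against `sample` random targets; pass on majority success."""
    chosen = random.sample(targets, min(sample, len(targets)))
    successes = sum(1 for t in chosen if test_once(proxy, t))
    return successes >= votes_needed
```

Requiring a majority tolerates one flaky or blocking target without marking an otherwise working proxy as failed.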
### Database Schema
```sql
-- proxylist (proxies.sqlite)
proxy TEXT UNIQUE -- ip:port
proto TEXT -- socks4/socks5/http
country TEXT -- 2-letter code
failed INT -- consecutive failures
success_count INT -- total successes
avg_latency REAL -- rolling average (ms)
tested INT -- last test timestamp
-- uris (websites.sqlite)
url TEXT UNIQUE -- source URL
error INT -- consecutive errors
stale_count INT -- checks without new proxies
```
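Spelled out as DDL, the schema above looks roughly like this (a sketch; the real statements in `dbs.py`/`mysqlite.py` may carry extra columns or indexes):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS proxylist (
    proxy         TEXT UNIQUE,          -- ip:port
    proto         TEXT,                 -- socks4/socks5/http
    country       TEXT,                 -- 2-letter code
    failed        INTEGER DEFAULT 0,    -- consecutive failures
    success_count INTEGER DEFAULT 0,    -- total successes
    avg_latency   REAL,                 -- rolling average (ms)
    tested        INTEGER               -- last test timestamp
);
CREATE TABLE IF NOT EXISTS uris (
    url         TEXT UNIQUE,            -- source URL
    error       INTEGER DEFAULT 0,      -- consecutive errors
    stale_count INTEGER DEFAULT 0       -- checks without new proxies
);
"""

def open_db(path):
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # matches the mysqlite.py wrapper
    conn.executescript(SCHEMA)
    return conn
```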
### Threading Model
- Priority queue orders jobs by proxy health
- Dynamic thread scaling based on success rate
- Work-stealing ensures even load distribution
- Tor connection pooling with worker affinity
## Deployment
### Systemd Service
```ini
[Unit]
Description=PPF Proxy Validator
After=network-online.target tor.service
Wants=network-online.target

[Service]
Type=simple
User=ppf
WorkingDirectory=/opt/ppf
ExecStart=/usr/bin/python2 proxywatchd.py
Restart=on-failure
RestartSec=30

[Install]
WantedBy=multi-user.target
```
### Container Deployment
```sh
# Build
podman build -t ppf:latest .
# Run with persistent storage
podman run -d --name ppf \
-v ./data:/app/data:Z \
-v ./config.ini:/app/config.ini:ro \
ppf:latest python proxywatchd.py
# Generate systemd unit
podman generate systemd --name ppf --files --new
```
## Troubleshooting
### Low Success Rate
```
WATCHD X.XX% SUCCESS RATE - tor circuit blocked?
```
- Tor circuit may be flagged; restart Tor
- Target servers may be blocking; wait for rotation
- Network issues; check connectivity
### Database Locked
WAL mode handles most concurrency. If issues persist:
- Reduce thread count
- Check disk I/O
- Verify single instance running
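A quick way to confirm WAL is active and to make writers wait instead of failing immediately (a sketch; the pragma values are assumptions, not the project's settings):

```python
import sqlite3

def tune_for_concurrency(path, busy_ms=5000):
    """Open the DB in WAL mode with a busy timeout; return (conn, journal mode)."""
    conn = sqlite3.connect(path)
    # journal_mode=WAL returns the mode actually in effect
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    # Wait up to busy_ms for a lock rather than raising "database is locked"
    conn.execute("PRAGMA busy_timeout=%d" % busy_ms)
    return conn, mode
```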
### No Proxies Found
- Check search engines in config
- Verify Tor connectivity
- Review scraper logs for rate limiting
## License
See LICENSE file.

ROADMAP.md

@@ -177,18 +177,18 @@ PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework design
 ├──────────────────────────┬──────────────────────────────────────────────────┤
 │ HIGH IMPACT / LOW EFFORT │ HIGH IMPACT / HIGH EFFORT                        │
 │                          │                                                  │
-│ Unify _known_proxies     │ Connection pooling                               │
-│ Graceful DB errors       │ Dynamic thread scaling                           │
-│ Batch inserts            │ Unit test infrastructure                         │
-│ WAL mode for SQLite      │ Latency tracking                                 │
+│ [x] Unify _known_proxies │ [x] Connection pooling                           │
+│ [x] Graceful DB errors   │ [x] Dynamic thread scaling                       │
+│ [x] Batch inserts        │ [ ] Unit test infrastructure                     │
+│ [x] WAL mode for SQLite  │ [x] Latency tracking                             │
 │                          │                                                  │
 ├──────────────────────────┼──────────────────────────────────────────────────┤
 │ LOW IMPACT / LOW EFFORT  │ LOW IMPACT / HIGH EFFORT                         │
 │                          │                                                  │
-│ Standardize logging      │ Geographic validation                            │
-│ Config validation        │ Additional scrapers                              │
-│ Export functionality     │ ● API sources                                    │
-│ Status output            │ Protocol fingerprinting                          │
+│ [x] Standardize logging  │ [ ] Geographic validation                        │
+│ [x] Config validation    │ [x] Additional scrapers                          │
+│ [ ] Export functionality │ [ ] API sources                                  │
+│ [x] Status output        │ [ ] Protocol fingerprinting                      │
 │                          │                                                  │
 └──────────────────────────┴──────────────────────────────────────────────────┘
```
@@ -233,6 +233,41 @@ PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework design
- [x] Stale proxy cleanup (cleanup_stale() with configurable stale_days)
- [x] Timeout config options (timeout_connect, timeout_read)
### Connection Pooling (Done)
- [x] TorHostState class tracking per-host health and latency
- [x] TorConnectionPool with worker affinity for circuit reuse
- [x] Exponential backoff (5s, 10s, 20s, 40s, max 60s) on failures
- [x] Pool warmup and health status reporting
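The backoff schedule above (5s, 10s, 20s, 40s, capped at 60s) reduces to one line; the real `TorHostState` bookkeeping is more involved:

```python
def backoff_delay(failures, base=5, cap=60):
    """Seconds to wait after `failures` consecutive failures (failures >= 1)."""
    return min(base * 2 ** (failures - 1), cap)
```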
### Priority Queue (Done)
- [x] PriorityJobQueue class with heap-based ordering
- [x] calculate_priority() assigns priority 0-4 by proxy state
- [x] New proxies tested first, high-fail proxies last
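The heap ordering can be sketched as below, with lower numbers tested sooner; the exact thresholds inside `calculate_priority()` are assumptions:

```python
import heapq

def calculate_priority(proxy_row):
    """Map a proxy's state to priority 0-4 (0 = highest priority)."""
    failed = proxy_row.get("failed", 0)
    if proxy_row.get("tested") is None:
        return 0            # never tested: check first
    if failed == 0:
        return 1            # healthy
    if failed <= 2:
        return 2
    if failed <= 4:
        return 3
    return 4                # close to being marked dead: check last

# Push jobs as (priority, tiebreak) tuples onto a min-heap
jobs = []
for row in [{"failed": 5, "tested": 1}, {"tested": None}, {"failed": 0, "tested": 1}]:
    heapq.heappush(jobs, (calculate_priority(row), row.get("failed", 0)))
```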
### Dynamic Thread Scaling (Done)
- [x] ThreadScaler class adjusts thread count dynamically
- [x] Scales up when queue deep and success rate acceptable
- [x] Scales down when queue shallow or success rate drops
- [x] Respects min/max bounds with cooldown period
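The scaling rule described above boils down to a decision function like this sketch; all thresholds are assumptions, not the values in `ThreadScaler`:

```python
def scale_decision(current, queue_depth, success_rate,
                   min_threads=2, max_threads=30,
                   deep=100, shallow=10, ok_rate=0.05):
    """Return the next thread count given queue depth and success rate."""
    # Grow when there is a backlog and validation is still succeeding
    if queue_depth > deep and success_rate >= ok_rate and current < max_threads:
        return current + 1
    # Shrink when idle or when the success rate collapses
    if (queue_depth < shallow or success_rate < ok_rate) and current > min_threads:
        return current - 1
    return current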
### Latency Tracking (Done)
- [x] avg_latency, latency_samples columns in proxylist
- [x] Exponential moving average calculation
- [x] Migration function for existing databases
- [x] Latency recorded for successful proxy tests
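The moving average behind `avg_latency` is a standard exponential update; `alpha` weights new samples, and 0.3 is an illustrative assumption rather than the project's value:

```python
def update_latency(avg, samples, new_ms, alpha=0.3):
    """Return (new_avg, new_sample_count) after one successful test."""
    if samples == 0 or avg is None:
        return new_ms, 1          # first sample seeds the average
    return (1 - alpha) * avg + alpha * new_ms, samples + 1
```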
### Container Support (Done)
- [x] Dockerfile with Python 2.7-slim base
- [x] docker-compose.yml for local development
- [x] Rootless podman deployment documentation
- [x] Volume mounts for persistent data
### Code Style (Done)
- [x] Normalized indentation (4-space, no tabs)
- [x] Removed dead code and unused imports
- [x] Added docstrings to classes and functions
- [x] Python 2/3 compatible imports (Queue/queue)
---
## Technical Debt