docs: add README and update ROADMAP

- README.md: installation, configuration, usage, deployment
- ROADMAP.md: mark completed items (pooling, scaling, latency, containers)
- priority matrix updated with completion status
README.md (new file, 261 lines)

@@ -0,0 +1,261 @@
# PPF - Proxy Fetcher

A Python 2.7 proxy discovery and validation framework.

## Overview

PPF discovers proxy addresses by crawling websites and search engines, validates them through multi-target testing via Tor, and maintains a database of working proxies with automatic protocol detection (SOCKS4/SOCKS5/HTTP).

```
scraper.py ──> ppf.py ──> proxywatchd.py
    │             │             │
    │ search      │ harvest    │ validate
    │ engines     │ proxies    │ via tor
    v             v             v
           SQLite databases
```

## Requirements

- Python 2.7
- Tor SOCKS proxy (default: 127.0.0.1:9050)
- beautifulsoup4 (optional; the `--nobs` flag falls back to the stdlib HTMLParser)

## Installation

### Local

```sh
pip install -r requirements.txt
cp config.ini.sample config.ini
cp servers.txt.sample servers.txt
```

### Container (Rootless)

```sh
# On the container host, as a dedicated user
podman build -t ppf:latest .
podman run --rm ppf:latest python ppf.py --help
```

Prerequisites for rootless containers:

- subuid/subgid mappings configured
- linger enabled (`loginctl enable-linger $USER`)
- passt installed for networking

## Configuration

Copy `config.ini.sample` to `config.ini` and adjust:

```ini
[common]
tor_hosts = 127.0.0.1:9050    # Comma-separated Tor SOCKS addresses
timeout_connect = 10          # Connection timeout (seconds)
timeout_read = 15             # Read timeout (seconds)

[watchd]
threads = 10                  # Parallel validation threads
max_fail = 5                  # Failures before a proxy is marked dead
checktime = 1800              # Base recheck interval (seconds)
database = proxies.sqlite     # Proxy database path
stale_days = 30               # Days before removing dead proxies
stats_interval = 300          # Seconds between status reports

[ppf]
threads = 3                   # URL harvesting threads
search = 1                    # Enable search engine discovery
database = websites.sqlite    # URL database path

[scraper]
engines = searx,duckduckgo    # Comma-separated search engines
max_pages = 5                 # Max pages per engine query

[httpd]
enabled = 0                   # Enable REST API
port = 8081                   # API listen port
listenip = 127.0.0.1          # API bind address
```

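Reading these settings from Python can be sketched as below. `load_tor_hosts` is an illustrative helper, not part of PPF; it strips the inline `#` comments manually, since ConfigParser keeps them in values by default.

```python
# Minimal sketch: parse tor_hosts from the [common] section of config.ini.
try:
    from configparser import ConfigParser  # Python 3
except ImportError:
    from ConfigParser import ConfigParser  # Python 2

def load_tor_hosts(path="config.ini"):
    cp = ConfigParser()
    cp.read(path)
    # Drop any trailing inline comment, then split the comma-separated list.
    raw = cp.get("common", "tor_hosts").split("#", 1)[0]
    hosts = []
    for entry in raw.split(","):
        host, _, port = entry.strip().partition(":")
        hosts.append((host, int(port)))
    return hosts
```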
## Usage

### Proxy Validation Daemon

```sh
python proxywatchd.py
```

Validates proxies from the database against multiple targets. Requires:

- `servers.txt` with IRC servers (for IRC mode); otherwise the built-in HTTP targets are used
- a running Tor instance

### URL Harvester

```sh
python ppf.py
```

Crawls URLs for proxy addresses and feeds them to the validator. Also starts the watchd internally.

### Search Engine Scraper

```sh
python scraper.py
```

Queries search engines for proxy list URLs. Supports:

- SearXNG instances
- DuckDuckGo, Startpage, Brave, Ecosia
- GitHub, GitLab, Codeberg (code search)

### Import From File

```sh
python ppf.py --file proxies.txt
```

### CLI Flags

```
--nobs         Use stdlib HTMLParser instead of BeautifulSoup
--file FILE    Import proxies from file
-q, --quiet    Show warnings and errors only
-v, --verbose  Show debug messages
```

## REST API

Enable in config with `httpd.enabled = 1`.

```sh
# Get working proxies (quote the URL so the shell does not treat '&' as a background operator)
curl "http://localhost:8081/proxies?limit=10&proto=socks5"

# Get count
curl http://localhost:8081/proxies/count

# Health check
curl http://localhost:8081/health
```

Query parameters:

- `limit` - max results (default: 100)
- `proto` - filter by protocol (socks4/socks5/http)
- `country` - filter by country code
- `format` - output format (json/plain)

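Building request URLs for these endpoints can be sketched in Python; `proxies_url` is a hypothetical helper (not part of PPF), and the server's response shape is not assumed here.

```python
# Illustrative URL builder for the /proxies endpoint above.
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

def proxies_url(base="http://127.0.0.1:8081", **params):
    """Build a /proxies request URL; parameters are sorted for stable output."""
    query = urlencode(sorted(params.items()))
    return "%s/proxies?%s" % (base, query) if query else base + "/proxies"
```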
## Architecture

### Components

| File | Purpose |
|------|---------|
| proxywatchd.py | Proxy validation daemon with multi-target voting |
| ppf.py | URL harvester and proxy extractor |
| scraper.py | Search engine integration |
| fetch.py | HTTP client with proxy support |
| dbs.py | Database operations |
| mysqlite.py | SQLite wrapper with WAL mode |
| connection_pool.py | Tor connection pooling with health tracking |
| config.py | Configuration management |
| httpd.py | REST API server |

### Validation Logic

Each proxy is tested against 3 random targets:

- 2/3 majority required for success
- protocol auto-detected (tries HTTP, SOCKS5, SOCKS4)
- SSL/TLS tested periodically
- MITM detection via certificate validation

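The majority rule above can be sketched as follows; `majority_vote` is an illustrative stand-in, and the real daemon's per-target checks involve actual connections through the proxy.

```python
# Sketch of the 2-of-3 voting rule: a proxy passes only if at least
# `required` of its target tests succeeded.
def majority_vote(results, required=2):
    """results: iterable of booleans, one per target tested."""
    return sum(1 for ok in results if ok) >= required
```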
### Database Schema

```sql
-- proxylist (proxies.sqlite)
proxy         TEXT UNIQUE  -- ip:port
proto         TEXT         -- socks4/socks5/http
country       TEXT         -- 2-letter code
failed        INT          -- consecutive failures
success_count INT          -- total successes
avg_latency   REAL         -- rolling average (ms)
tested        INT          -- last test timestamp

-- uris (websites.sqlite)
url         TEXT UNIQUE  -- source URL
error       INT          -- consecutive errors
stale_count INT          -- checks without new proxies
```

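The proxylist columns can be expressed as DDL like the sketch below; this is an illustration only (the actual schema creation lives in PPF's database modules), and the default values are assumptions.

```python
# Open a SQLite database with the proxylist table sketched above.
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS proxylist (
    proxy TEXT UNIQUE,
    proto TEXT,
    country TEXT,
    failed INT DEFAULT 0,
    success_count INT DEFAULT 0,
    avg_latency REAL,
    tested INT
);
"""

def open_db(path=":memory:"):
    con = sqlite3.connect(path)
    con.execute("PRAGMA journal_mode=WAL")  # WAL mode, as the README notes; in-memory DBs ignore it
    con.executescript(DDL)
    return con
```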
### Threading Model

- Priority queue orders jobs by proxy health
- Dynamic thread scaling based on success rate
- Work-stealing ensures even load distribution
- Tor connection pooling with worker affinity

## Deployment

### Systemd Service

```ini
[Unit]
Description=PPF Proxy Validator
After=network-online.target tor.service
Wants=network-online.target

[Service]
Type=simple
User=ppf
WorkingDirectory=/opt/ppf
ExecStart=/usr/bin/python2 proxywatchd.py
Restart=on-failure
RestartSec=30

[Install]
WantedBy=multi-user.target
```

### Container Deployment

```sh
# Build
podman build -t ppf:latest .

# Run with persistent storage
podman run -d --name ppf \
  -v ./data:/app/data:Z \
  -v ./config.ini:/app/config.ini:ro \
  ppf:latest python proxywatchd.py

# Generate systemd unit
podman generate systemd --name ppf --files --new
```

## Troubleshooting

### Low Success Rate

```
WATCHD X.XX% SUCCESS RATE - tor circuit blocked?
```

- Tor circuit may be flagged; restart Tor
- target servers may be blocking; wait for rotation
- network issues; check connectivity

### Database Locked

WAL mode handles most concurrency. If issues persist:

- reduce the thread count
- check disk I/O
- verify that only a single instance is running

### No Proxies Found

- check the search engines configured in `config.ini`
- verify Tor connectivity
- review scraper logs for rate limiting

## License

See LICENSE file.

ROADMAP.md (51 lines changed)

@@ -177,18 +177,18 @@ PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework design
├──────────────────────────┬──────────────────────────────────────────────────┤
│ HIGH IMPACT / LOW EFFORT │ HIGH IMPACT / HIGH EFFORT                        │
│                          │                                                  │
│ [x] Unify _known_proxies │ [x] Connection pooling                           │
│ [x] Graceful DB errors   │ [x] Dynamic thread scaling                       │
│ [x] Batch inserts        │ [ ] Unit test infrastructure                     │
│ [x] WAL mode for SQLite  │ [x] Latency tracking                             │
│                          │                                                  │
├──────────────────────────┼──────────────────────────────────────────────────┤
│ LOW IMPACT / LOW EFFORT  │ LOW IMPACT / HIGH EFFORT                         │
│                          │                                                  │
│ [x] Standardize logging  │ [ ] Geographic validation                        │
│ [x] Config validation    │ [x] Additional scrapers                          │
│ [ ] Export functionality │ [ ] API sources                                  │
│ [x] Status output        │ [ ] Protocol fingerprinting                      │
│                          │                                                  │
└──────────────────────────┴──────────────────────────────────────────────────┘
```
@@ -233,6 +233,41 @@ PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework design
- [x] Stale proxy cleanup (cleanup_stale() with configurable stale_days)
- [x] Timeout config options (timeout_connect, timeout_read)

### Connection Pooling (Done)
- [x] TorHostState class tracking per-host health and latency
- [x] TorConnectionPool with worker affinity for circuit reuse
- [x] Exponential backoff (5s, 10s, 20s, 40s, max 60s) on failures
- [x] Pool warmup and health status reporting

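The backoff schedule listed above (5s, 10s, 20s, 40s, capped at 60s) can be sketched as a small helper; `backoff_delay` is illustrative, not PPF's actual function name.

```python
# Exponential backoff: the delay doubles with each consecutive failure,
# starting at `base` and never exceeding `cap`.
def backoff_delay(failures, base=5.0, cap=60.0):
    return min(base * (2 ** max(failures - 1, 0)), cap)
```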
### Priority Queue (Done)
- [x] PriorityJobQueue class with heap-based ordering
- [x] calculate_priority() assigns priority 0-4 by proxy state
- [x] New proxies tested first, high-fail proxies last

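A heap-based queue of this kind can be sketched with `heapq`; this is a minimal illustration of the ordering behavior, not PPF's actual implementation.

```python
# Minimal priority queue: lower numbers are served first (0 = new proxies,
# 4 = proxies with many failures). A sequence counter keeps FIFO order
# within the same priority and avoids comparing job objects.
import heapq
import itertools

class PriorityJobQueue(object):
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def put(self, priority, job):
        heapq.heappush(self._heap, (priority, next(self._seq), job))

    def get(self):
        return heapq.heappop(self._heap)[2]
```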
### Dynamic Thread Scaling (Done)
- [x] ThreadScaler class adjusts thread count dynamically
- [x] Scales up when queue deep and success rate acceptable
- [x] Scales down when queue shallow or success rate drops
- [x] Respects min/max bounds with cooldown period

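The scaling rules above can be condensed into a single decision function; the thresholds here are assumptions for illustration (PPF's ThreadScaler also applies a cooldown, omitted for brevity).

```python
# Illustrative scaling decision: grow when the queue is deep and the
# success rate is healthy, shrink when the queue is shallow or success
# collapses, always staying within [min_t, max_t].
def scale_decision(queue_depth, success_rate, threads, min_t=2, max_t=20):
    if queue_depth > threads * 10 and success_rate >= 0.5 and threads < max_t:
        return threads + 1
    if (queue_depth < threads or success_rate < 0.2) and threads > min_t:
        return threads - 1
    return threads
```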
### Latency Tracking (Done)
- [x] avg_latency, latency_samples columns in proxylist
- [x] Exponential moving average calculation
- [x] Migration function for existing databases
- [x] Latency recorded for successful proxy tests

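The moving-average update can be sketched as follows; the smoothing factor `alpha` is an assumption, not PPF's actual value.

```python
# Exponential moving average for avg_latency: the first sample seeds the
# average, later samples blend in with weight `alpha`.
def update_avg_latency(avg, sample, alpha=0.3):
    if avg is None:
        return sample
    return alpha * sample + (1 - alpha) * avg
```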
### Container Support (Done)
- [x] Dockerfile with Python 2.7-slim base
- [x] docker-compose.yml for local development
- [x] Rootless podman deployment documentation
- [x] Volume mounts for persistent data

### Code Style (Done)
- [x] Normalized indentation (4-space, no tabs)
- [x] Removed dead code and unused imports
- [x] Added docstrings to classes and functions
- [x] Python 2/3 compatible imports (Queue/queue)

---

## Technical Debt