docs: add README and update ROADMAP
- README.md: installation, configuration, usage, deployment
- ROADMAP.md: mark completed items (pooling, scaling, latency, containers)
- priority matrix updated with completion status
261
README.md
Normal file
@@ -0,0 +1,261 @@
# PPF - Proxy Fetcher

A Python 2.7 proxy discovery and validation framework.

## Overview

PPF discovers proxy addresses by crawling websites and search engines, validates them through multi-target testing via Tor, and maintains a database of working proxies with automatic protocol detection (SOCKS4/SOCKS5/HTTP).

```
scraper.py ──> ppf.py ──> proxywatchd.py
    │            │              │
    │ search     │ harvest      │ validate
    │ engines    │ proxies      │ via tor
    v            v              v
          SQLite databases
```

## Requirements

- Python 2.7
- Tor SOCKS proxy (default: 127.0.0.1:9050)
- beautifulsoup4 (optional; pass `--nobs` to use the stdlib HTMLParser instead)

## Installation

### Local

```sh
pip install -r requirements.txt
cp config.ini.sample config.ini
cp servers.txt.sample servers.txt
```

### Container (Rootless)

```sh
# On container host, as dedicated user
podman build -t ppf:latest .
podman run --rm ppf:latest python ppf.py --help
```

Prerequisites for rootless containers:
- subuid/subgid mappings configured
- linger enabled (`loginctl enable-linger $USER`)
- passt installed for networking

## Configuration

Copy `config.ini.sample` to `config.ini` and adjust:

```ini
[common]
tor_hosts = 127.0.0.1:9050    # Comma-separated Tor SOCKS addresses
timeout_connect = 10          # Connection timeout (seconds)
timeout_read = 15             # Read timeout (seconds)

[watchd]
threads = 10                  # Parallel validation threads
max_fail = 5                  # Failures before proxy marked dead
checktime = 1800              # Base recheck interval (seconds)
database = proxies.sqlite     # Proxy database path
stale_days = 30               # Days before removing dead proxies
stats_interval = 300          # Seconds between status reports

[ppf]
threads = 3                   # URL harvesting threads
search = 1                    # Enable search engine discovery
database = websites.sqlite    # URL database path

[scraper]
engines = searx,duckduckgo    # Comma-separated search engines
max_pages = 5                 # Max pages per engine query

[httpd]
enabled = 0                   # Enable REST API
port = 8081                   # API listen port
listenip = 127.0.0.1          # API bind address
```

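As a rough illustration of how a file in this format can be read, here is a minimal sketch using the standard-library config parser. It is not taken from PPF's `config.py`; `strip_inline_comment` and `load_settings` are hypothetical helpers, and the second `tor_hosts` entry is made up to show the comma-separated form. Note that the stdlib parser does not strip trailing `# ...` comments by default (Python 2.7 only recognizes `;` inline), so the sketch removes them itself.

```python
import io

try:
    import configparser                    # Python 3
except ImportError:
    import ConfigParser as configparser    # Python 2.7

# Illustrative excerpt; the second Tor address is an assumption for the demo.
SAMPLE = u"""
[common]
tor_hosts = 127.0.0.1:9050,127.0.0.1:9051   # Comma-separated Tor SOCKS addresses
timeout_connect = 10                         # Connection timeout (seconds)

[watchd]
threads = 10
"""


def strip_inline_comment(value):
    """Drop a trailing ' # ...' comment, which the stdlib parser keeps."""
    return value.split(' #', 1)[0].strip()


def load_settings(text):
    """Parse INI text into a {(section, key): value} dict of strings."""
    parser = configparser.RawConfigParser()
    # read_file() on Python 3, readfp() on Python 2.7.
    read = getattr(parser, 'read_file', None) or parser.readfp
    read(io.StringIO(text))
    settings = {}
    for section in parser.sections():
        for key, raw in parser.items(section):
            settings[(section, key)] = strip_inline_comment(raw)
    return settings


settings = load_settings(SAMPLE)
tor_hosts = [h.strip() for h in settings[('common', 'tor_hosts')].split(',')]
timeout_connect = int(settings[('common', 'timeout_connect')])
print(tor_hosts, timeout_connect)
```

If your `config.ini` is parsed by a different loader, check how it treats inline comments before relying on them.
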
## Usage

### Proxy Validation Daemon

```sh
python proxywatchd.py
```

Validates proxies from the database against multiple targets. Requires:
- `servers.txt` listing IRC servers (IRC mode only; otherwise the built-in HTTP targets are used)
- A running Tor instance

### URL Harvester

```sh
python ppf.py
```

Crawls URLs for proxy addresses and feeds them to the validator. It also starts the validation daemon (watchd) internally.

### Search Engine Scraper

```sh
python scraper.py
```

Queries search engines for proxy list URLs. Supports:
- SearXNG instances
- DuckDuckGo, Startpage, Brave, Ecosia
- GitHub, GitLab, Codeberg (code search)

### Import From File

```sh
python ppf.py --file proxies.txt
```

### CLI Flags

```
--nobs           Use stdlib HTMLParser instead of BeautifulSoup
--file FILE      Import proxies from file
-q, --quiet      Show warnings and errors only
-v, --verbose    Show debug messages
```

## REST API

Enable it by setting `enabled = 1` in the `[httpd]` section of `config.ini`.

```sh
# Get working proxies (quote the URL so the shell does not treat & as a job separator)
curl 'http://localhost:8081/proxies?limit=10&proto=socks5'

# Get count
curl http://localhost:8081/proxies/count

# Health check
curl http://localhost:8081/health
```

Query parameters:
- `limit` - Max results (default: 100)
- `proto` - Filter by protocol (socks4/socks5/http)
- `country` - Filter by country code
- `format` - Output format (json/plain)

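For scripted access, the query string can be built rather than typed by hand. The following is a small sketch that only assembles the URL from the parameters listed above; `proxies_url` is a hypothetical helper, and the response layout is not assumed here.

```python
try:
    from urllib.parse import urlencode      # Python 3
except ImportError:
    from urllib import urlencode            # Python 2.7

# Parameter names documented for the /proxies endpoint.
ALLOWED = ('limit', 'proto', 'country', 'format')


def proxies_url(base='http://localhost:8081', **params):
    """Build a /proxies URL, rejecting parameters the README does not list."""
    unknown = set(params) - set(ALLOWED)
    if unknown:
        raise ValueError('unsupported parameter(s): %s'
                         % ', '.join(sorted(unknown)))
    # Sort the pairs so the resulting query string is reproducible.
    query = urlencode(sorted(params.items()))
    return '%s/proxies%s' % (base, '?' + query if query else '')


print(proxies_url(limit=10, proto='socks5'))
```

The returned string can be passed to `curl` or any HTTP client; encoding the parameters also sidesteps shell quoting problems with `&`.
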
## Architecture

### Components

| File | Purpose |
|------|---------|
| proxywatchd.py | Proxy validation daemon with multi-target voting |
| ppf.py | URL harvester and proxy extractor |
| scraper.py | Search engine integration |
| fetch.py | HTTP client with proxy support |
| dbs.py | Database operations |
| mysqlite.py | SQLite wrapper with WAL mode |
| connection_pool.py | Tor connection pooling with health tracking |
| config.py | Configuration management |
| httpd.py | REST API server |

### Validation Logic

Each proxy is tested against 3 random targets:
- 2/3 majority required for success
- Protocol auto-detected (tries HTTP, SOCKS5, SOCKS4)
- SSL/TLS tested periodically
- MITM detection via certificate validation

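The 2-of-3 vote can be pictured with the sketch below. It is an illustration only, not the code in `proxywatchd.py`: `validate`, `probe`, and the example targets are hypothetical, and `probe` stands in for whatever single-target test the daemon performs.

```python
import random


def validate(proxy, targets, probe, sample_size=3, required=2, rng=random):
    """Return True if `proxy` passes at least `required` of `sample_size`
    randomly chosen targets. `probe(proxy, target)` must return a bool."""
    chosen = rng.sample(list(targets), sample_size)
    passed = 0
    for index, target in enumerate(chosen):
        if probe(proxy, target):
            passed += 1
        remaining = sample_size - (index + 1)
        if passed >= required:
            return True            # majority reached, skip remaining targets
        if passed + remaining < required:
            return False           # majority can no longer be reached
    return False


# Stand-in probe: pretend only the first two targets are reachable.
targets = ['irc.example.net:6667', 'http://example.com/', 'http://example.org/']
reachable = set(targets[:2])
print(validate('192.0.2.10:1080', targets,
               lambda proxy, target: target in reachable))
```

Stopping as soon as the outcome is decided saves one probe in the common cases of two early passes or two early failures.
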
### Database Schema

```sql
-- proxylist (proxies.sqlite)
proxy         TEXT UNIQUE  -- ip:port
proto         TEXT         -- socks4/socks5/http
country       TEXT         -- 2-letter code
failed        INT          -- consecutive failures
success_count INT          -- total successes
avg_latency   REAL         -- rolling average (ms)
tested        INT          -- last test timestamp

-- uris (websites.sqlite)
url           TEXT UNIQUE  -- source URL
error         INT          -- consecutive errors
stale_count   INT          -- checks without new proxies
```

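To experiment with the `proxylist` columns without touching a live database, the sketch below builds an in-memory table from the column list above and queries it. The `CREATE TABLE` statement, the sample rows, and the `working` helper are assumptions for illustration; in particular, treating `failed = 0` as "currently working" is a guess based on the column description, and the real schema may carry more columns.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE proxylist (
        proxy         TEXT UNIQUE,  -- ip:port
        proto         TEXT,         -- socks4/socks5/http
        country       TEXT,         -- 2-letter code
        failed        INT,          -- consecutive failures
        success_count INT,          -- total successes
        avg_latency   REAL,         -- rolling average (ms)
        tested        INT           -- last test timestamp
    )
""")

# Made-up rows using documentation-only addresses (192.0.2.0/24).
rows = [
    ('192.0.2.10:1080', 'socks5', 'DE', 0, 12, 480.0, 1700000000),
    ('192.0.2.11:8080', 'http', 'US', 0, 3, 950.5, 1700000100),
    ('192.0.2.12:1080', 'socks5', 'FR', 4, 1, 2100.0, 1700000200),
]
conn.executemany('INSERT INTO proxylist VALUES (?, ?, ?, ?, ?, ?, ?)', rows)


def working(conn, proto, limit=100):
    """Proxies of one protocol with no consecutive failures, fastest first."""
    cur = conn.execute(
        'SELECT proxy FROM proxylist '
        'WHERE failed = 0 AND proto = ? '
        'ORDER BY avg_latency ASC LIMIT ?', (proto, limit))
    return [row[0] for row in cur.fetchall()]


print(working(conn, 'socks5'))
```
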
### Threading Model

- Priority queue orders jobs by proxy health
- Dynamic thread scaling based on success rate
- Work-stealing ensures even load distribution
- Tor connection pooling with worker affinity

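The idea of ordering jobs by proxy health can be sketched with a heap. This is a simplified stand-in, not PPF's queue: `JobQueue` and `priority_for` are hypothetical names, the priority thresholds are invented for the example, and the class is not thread-safe as written (a real worker pool would guard it with a lock).

```python
import heapq
import itertools


class JobQueue(object):
    """Min-heap of (priority, sequence, proxy); lower priority pops first."""

    def __init__(self):
        self._heap = []
        # Monotonic counter: keeps FIFO order among equal priorities.
        self._seq = itertools.count()

    def put(self, proxy, priority):
        heapq.heappush(self._heap, (priority, next(self._seq), proxy))

    def get(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def priority_for(tested, failed):
    """Assumed mapping: never-tested proxies first, repeat failures last."""
    if not tested:
        return 0
    if failed == 0:
        return 1
    return 2 if failed < 3 else 3


queue = JobQueue()
for proxy, tested, failed in [
    ('192.0.2.1:1080', 1700000000, 4),   # failing repeatedly
    ('192.0.2.2:1080', 1700000000, 0),   # healthy
    ('192.0.2.3:1080', 0, 0),            # never tested
]:
    queue.put(proxy, priority_for(tested, failed))

order = [queue.get() for _ in range(len(queue))]
print(order)
```
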
## Deployment

### Systemd Service

```ini
[Unit]
Description=PPF Proxy Validator
After=network-online.target tor.service
Wants=network-online.target

[Service]
Type=simple
User=ppf
WorkingDirectory=/opt/ppf
ExecStart=/usr/bin/python2 proxywatchd.py
Restart=on-failure
RestartSec=30

[Install]
WantedBy=multi-user.target
```

### Container Deployment

```sh
# Build
podman build -t ppf:latest .

# Run with persistent storage
podman run -d --name ppf \
  -v ./data:/app/data:Z \
  -v ./config.ini:/app/config.ini:ro \
  ppf:latest python proxywatchd.py

# Generate systemd unit
podman generate systemd --name ppf --files --new
```

## Troubleshooting

### Low Success Rate

```
WATCHD X.XX% SUCCESS RATE - tor circuit blocked?
```

- The Tor circuit may be flagged; restart Tor
- Target servers may be blocking requests; wait for rotation
- There may be network issues; check connectivity

### Database Locked

WAL mode handles most concurrency. If issues persist:
- Reduce the thread count
- Check disk I/O
- Verify that only a single instance is running

### No Proxies Found

- Check the search engines listed in the config
- Verify Tor connectivity
- Review scraper logs for rate limiting

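A quick way to rule out the most basic Tor problem is to check that the SOCKS port accepts connections at all. The sketch below does only that, using the default address from the Requirements section; `socks_port_open` is a hypothetical helper, and a successful TCP connect does not prove that Tor has working circuits.

```python
import socket


def socks_port_open(host='127.0.0.1', port=9050, timeout=5.0):
    """Return True if a TCP connection to the SOCKS port succeeds."""
    try:
        sock = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        return False      # refused, unreachable, or timed out
    sock.close()
    return True


print(socks_port_open())
```

If this prints `False`, check that Tor is running and that `tor_hosts` in `config.ini` matches its SOCKS listener.
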
## License

See LICENSE file.
51
ROADMAP.md
@@ -177,18 +177,18 @@ PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework design
 ├──────────────────────────┬──────────────────────────────────────────────────┤
 │ HIGH IMPACT / LOW EFFORT │ HIGH IMPACT / HIGH EFFORT                        │
 │                          │                                                  │
-│ ● Unify _known_proxies   │ ● Connection pooling                             │
-│ ● Graceful DB errors     │ ● Dynamic thread scaling                         │
-│ ● Batch inserts          │ ● Unit test infrastructure                       │
-│ ● WAL mode for SQLite    │ ● Latency tracking                               │
+│ [x] Unify _known_proxies │ [x] Connection pooling                           │
+│ [x] Graceful DB errors   │ [x] Dynamic thread scaling                       │
+│ [x] Batch inserts        │ [ ] Unit test infrastructure                     │
+│ [x] WAL mode for SQLite  │ [x] Latency tracking                             │
 │                          │                                                  │
 ├──────────────────────────┼──────────────────────────────────────────────────┤
 │ LOW IMPACT / LOW EFFORT  │ LOW IMPACT / HIGH EFFORT                         │
 │                          │                                                  │
-│ ● Standardize logging    │ ● Geographic validation                          │
-│ ● Config validation      │ ● Additional scrapers                            │
-│ ● Export functionality   │ ● API sources                                    │
-│ ● Status output          │ ● Protocol fingerprinting                        │
+│ [x] Standardize logging  │ [ ] Geographic validation                        │
+│ [x] Config validation    │ [x] Additional scrapers                          │
+│ [ ] Export functionality │ [ ] API sources                                  │
+│ [x] Status output        │ [ ] Protocol fingerprinting                      │
 │                          │                                                  │
 └──────────────────────────┴──────────────────────────────────────────────────┘
 ```
@@ -233,6 +233,41 @@ PPF (Proxy Fetcher) is a Python 2 proxy scraping and validation framework design
 - [x] Stale proxy cleanup (cleanup_stale() with configurable stale_days)
 - [x] Timeout config options (timeout_connect, timeout_read)
 
+### Connection Pooling (Done)
+- [x] TorHostState class tracking per-host health and latency
+- [x] TorConnectionPool with worker affinity for circuit reuse
+- [x] Exponential backoff (5s, 10s, 20s, 40s, max 60s) on failures
+- [x] Pool warmup and health status reporting
+
+### Priority Queue (Done)
+- [x] PriorityJobQueue class with heap-based ordering
+- [x] calculate_priority() assigns priority 0-4 by proxy state
+- [x] New proxies tested first, high-fail proxies last
+
+### Dynamic Thread Scaling (Done)
+- [x] ThreadScaler class adjusts thread count dynamically
+- [x] Scales up when queue deep and success rate acceptable
+- [x] Scales down when queue shallow or success rate drops
+- [x] Respects min/max bounds with cooldown period
+
+### Latency Tracking (Done)
+- [x] avg_latency, latency_samples columns in proxylist
+- [x] Exponential moving average calculation
+- [x] Migration function for existing databases
+- [x] Latency recorded for successful proxy tests
+
+### Container Support (Done)
+- [x] Dockerfile with Python 2.7-slim base
+- [x] docker-compose.yml for local development
+- [x] Rootless podman deployment documentation
+- [x] Volume mounts for persistent data
+
+### Code Style (Done)
+- [x] Normalized indentation (4-space, no tabs)
+- [x] Removed dead code and unused imports
+- [x] Added docstrings to classes and functions
+- [x] Python 2/3 compatible imports (Queue/queue)
+
 ---
 
 ## Technical Debt