PPF - Python Proxy Finder
A Python 2.7 proxy discovery and validation framework.

Testing: Dockerfile.test builds the production image with pytest baked in; compose.test.yml mounts the source as a volume for fast iteration. Run the suite with: podman-compose -f compose.test.yml run --rm test
Overview
PPF discovers proxy addresses by crawling websites and search engines, validates them through multi-target testing via Tor, and maintains a database of working proxies with automatic protocol detection (SOCKS4/SOCKS5/HTTP).
scraper.py ──> ppf.py ──> proxywatchd.py
    │             │             │
    │ search      │ harvest     │ validate
    │ engines     │ proxies     │ via tor
    v             v             v
           SQLite databases
Requirements
- Python 2.7
- Tor SOCKS proxy (default: 127.0.0.1:9050)
- beautifulsoup4 (optional with --nobs flag)
Installation
Local
pip install -r requirements.txt
cp config.ini.sample config.ini
cp servers.txt.sample servers.txt
Container (Rootless)
# On container host, as dedicated user
podman build -t ppf:latest .
podman run --rm ppf:latest python ppf.py --help
Prerequisites for rootless containers:
- subuid/subgid mappings configured
- linger enabled (loginctl enable-linger $USER)
- passt installed for networking
Configuration
Copy config.ini.sample to config.ini and adjust:
[common]
tor_hosts = 127.0.0.1:9050 # Comma-separated Tor SOCKS addresses
timeout_connect = 10 # Connection timeout (seconds)
timeout_read = 15 # Read timeout (seconds)
[watchd]
threads = 10 # Parallel validation threads
max_fail = 5 # Failures before proxy marked dead
checktime = 1800 # Base recheck interval (seconds)
database = proxies.sqlite # Proxy database path
stale_days = 30 # Days before removing dead proxies
stats_interval = 300 # Seconds between status reports
[ppf]
threads = 3 # URL harvesting threads
search = 1 # Enable search engine discovery
database = websites.sqlite # URL database path
[scraper]
engines = searx,duckduckgo # Comma-separated search engines
max_pages = 5 # Max pages per engine query
[httpd]
enabled = 0 # Enable REST API
port = 8081 # API listen port
listenip = 127.0.0.1 # API bind address
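The options above can be loaded with the stdlib config parser. A minimal sketch, assuming the section and option names from the sample above (the helper name load_tor_hosts is illustrative, not PPF's actual config.py API; the module is spelled ConfigParser on Python 2.7):

```python
import configparser  # "ConfigParser" on Python 2.7


def load_tor_hosts(path):
    """Parse the comma-separated tor_hosts option into (host, port) pairs."""
    # inline_comment_prefixes strips the "# ..." comments used in config.ini
    cfg = configparser.ConfigParser(inline_comment_prefixes=("#",))
    cfg.read(path)
    hosts = []
    for entry in cfg.get("common", "tor_hosts").split(","):
        host, _, port = entry.strip().rpartition(":")
        hosts.append((host, int(port)))
    return hosts
```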
Usage
Proxy Validation Daemon
python proxywatchd.py
Validates proxies from the database against multiple targets. Requires:
- servers.txt with IRC servers (for IRC mode); otherwise built-in HTTP targets are used
- A running Tor instance
URL Harvester
python ppf.py
Crawls URLs for proxy addresses and feeds them to the validator. Also starts the watchd internally.
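Pulling ip:port pairs out of fetched pages amounts to a regex pass plus sanity checks. A hypothetical sketch of such an extractor (not the actual code in ppf.py):

```python
import re

# Dotted-quad IPv4, then ':' or whitespace separators, then a port.
PROXY_RE = re.compile(r"\b((?:\d{1,3}\.){3}\d{1,3})[:\s]+(\d{2,5})\b")


def extract_proxies(text):
    """Return unique ip:port strings found in a blob of HTML or plain text."""
    found = []
    for ip, port in PROXY_RE.findall(text):
        # Reject impossible octets and out-of-range ports.
        if all(int(o) <= 255 for o in ip.split(".")) and 0 < int(port) <= 65535:
            pair = "%s:%s" % (ip, port)
            if pair not in found:
                found.append(pair)
    return found
```

Accepting whitespace as a separator matters in practice: many proxy lists publish "ip port" in columns rather than "ip:port".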
Search Engine Scraper
python scraper.py
Queries search engines for proxy list URLs. Supports:
- SearXNG instances
- DuckDuckGo, Startpage, Brave, Ecosia
- GitHub, GitLab, Codeberg (code search)
Import From File
python ppf.py --file proxies.txt
CLI Flags
--nobs Use stdlib HTMLParser instead of BeautifulSoup
--file FILE Import proxies from file
-q, --quiet Show warnings and errors only
-v, --verbose Show debug messages
REST API
Enable in config with httpd.enabled = 1.
# Get working proxies
curl "http://localhost:8081/proxies?limit=10&proto=socks5"
# Get count
curl http://localhost:8081/proxies/count
# Health check
curl http://localhost:8081/health
Query parameters:
- limit - Max results (default: 100)
- proto - Filter by protocol (socks4/socks5/http)
- country - Filter by country code
- asn - Filter by autonomous system number
- format - Output format (json/plain)
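A client needs nothing beyond the stdlib to consume the endpoint. A sketch assuming the JSON response body is a decodable JSON document (the exact response shape is not documented here, so get_proxies just returns whatever json.loads produces):

```python
import json

try:  # Python 3
    from urllib.request import urlopen
    from urllib.parse import urlencode
except ImportError:  # Python 2.7
    from urllib2 import urlopen
    from urllib import urlencode


def build_url(base, limit=100, proto=None, country=None, fmt="json"):
    """Build a /proxies query URL from the documented parameters."""
    params = {"limit": limit, "format": fmt}
    if proto:
        params["proto"] = proto
    if country:
        params["country"] = country
    # sorted() keeps the query string deterministic
    return "%s/proxies?%s" % (base, urlencode(sorted(params.items())))


def get_proxies(base="http://localhost:8081", **kw):
    """Fetch and decode the proxy list from a running httpd instance."""
    resp = urlopen(build_url(base, **kw), timeout=10)
    return json.loads(resp.read().decode("utf-8"))
```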
Architecture
Components
| File | Purpose |
|---|---|
| proxywatchd.py | Proxy validation daemon with multi-target voting |
| ppf.py | URL harvester and proxy extractor |
| scraper.py | Search engine integration |
| fetch.py | HTTP client with proxy support |
| dbs.py | Database operations |
| mysqlite.py | SQLite wrapper with WAL mode |
| connection_pool.py | Tor connection pooling with health tracking |
| config.py | Configuration management |
| httpd.py | REST API server |
Validation Logic
Each proxy is tested against 3 random targets:
- 2/3 majority required for success
- Protocol auto-detected (tries HTTP, SOCKS5, SOCKS4)
- SSL/TLS tested periodically
- MITM detection via certificate validation
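The 2-of-3 voting described above can be sketched as follows; check_target is a stand-in for the real per-target probe in proxywatchd.py, which this sketch does not reproduce:

```python
import random


def validate(proxy, targets, check_target, votes=3, needed=2):
    """Test `proxy` against `votes` random targets; pass on `needed` successes.

    `check_target(proxy, target)` is a hypothetical probe returning True/False.
    """
    successes = 0
    for target in random.sample(targets, min(votes, len(targets))):
        if check_target(proxy, target):
            successes += 1
        if successes >= needed:
            return True  # early exit: majority already reached
    return False
```

Voting across independent targets protects against a single flaky or blocking target marking a healthy proxy dead.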
Database Schema
-- proxylist (proxies.sqlite)
proxy TEXT UNIQUE -- ip:port
proto TEXT -- socks4/socks5/http
country TEXT -- 2-letter code
asn INT -- autonomous system number
failed INT -- consecutive failures
success_count INT -- total successes
avg_latency REAL -- rolling average (ms)
tested INT -- last test timestamp
-- uris (websites.sqlite)
url TEXT UNIQUE -- source URL
error INT -- consecutive errors
stale_count INT -- checks without new proxies
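The proxylist columns above map onto a straightforward CREATE TABLE. A sketch of opening proxies.sqlite in WAL mode; the authoritative DDL lives in dbs.py / mysqlite.py and may differ in defaults and indexes:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS proxylist (
    proxy TEXT UNIQUE,            -- ip:port
    proto TEXT,                   -- socks4/socks5/http
    country TEXT,                 -- 2-letter code
    asn INT,                      -- autonomous system number
    failed INT DEFAULT 0,         -- consecutive failures
    success_count INT DEFAULT 0,  -- total successes
    avg_latency REAL,             -- rolling average (ms)
    tested INT                    -- last test timestamp
)
"""


def open_db(path):
    """Open the proxy database with WAL journaling, creating it if missing."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(SCHEMA)
    conn.commit()
    return conn
```

The UNIQUE constraint on proxy is what lets harvesters blindly re-insert discoveries without creating duplicates.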
Threading Model
- Priority queue orders jobs by proxy health
- Dynamic thread scaling based on success rate
- Work-stealing ensures even load distribution
- Tor connection pooling with worker affinity
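The priority ordering can be sketched with the stdlib queue module: lower scores dequeue first, so healthy proxies (few failures, low latency) are rechecked before failing ones. The scoring formula here is illustrative, not the one used in proxywatchd.py:

```python
try:
    import queue  # Python 3
except ImportError:
    import Queue as queue  # Python 2.7


def health_score(failed, avg_latency):
    """Lower is better: weight consecutive failures heavily, latency lightly."""
    return failed * 1000 + (avg_latency or 0)


jobs = queue.PriorityQueue()
jobs.put((health_score(0, 120.0), "1.1.1.1:1080"))  # score 120
jobs.put((health_score(3, 50.0), "2.2.2.2:1080"))   # score 3050
jobs.put((health_score(0, 40.0), "3.3.3.3:1080"))   # score 40
```

Worker threads then simply call jobs.get(), and queue.PriorityQueue handles the locking.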
Deployment
Container Deployment
All nodes use podman-compose with role-specific compose files
(rootless, as the podman user). --network=host is required so the
container can reach Tor at 127.0.0.1:9050.
# Build image
podman build -t ppf:latest .
# Start via compose
podman-compose up -d
# View logs / stop
podman-compose logs -f
podman-compose down
Operations Toolkit
The tools/ directory provides CLI wrappers for multi-node operations.
Deployment uses an Ansible playbook over WireGuard for parallel execution
and handler-based restarts.
ppf-deploy [targets...] # validate + deploy + restart (playbook)
ppf-deploy --check # dry run with diff
ppf-logs [node] # view container logs (-f to follow)
ppf-service <cmd> [nodes...] # status / start / stop / restart
ppf-db <cmd> # stats / purge-proxies / vacuum
ppf-status # cluster overview (containers, workers, queue)
See --help on each tool.
Troubleshooting
Low Success Rate
WATCHD X.XX% SUCCESS RATE - tor circuit blocked?
- Tor circuit may be flagged; restart Tor
- Target servers may be blocking; wait for rotation
- Network issues; check connectivity
Database Locked
WAL mode handles most concurrency. If issues persist:
- Reduce thread count
- Check disk I/O
- Verify single instance running
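If lock errors persist, a busy timeout makes writers wait for the lock instead of failing immediately. A small sketch (the function name is illustrative; mysqlite.py may already set these pragmas):

```python
import sqlite3


def open_with_timeout(path, seconds=30):
    """Open a database that waits up to `seconds` on a locked file."""
    conn = sqlite3.connect(path, timeout=seconds)  # driver-level wait on locks
    conn.execute("PRAGMA busy_timeout = %d" % (seconds * 1000))  # SQLite-level
    conn.execute("PRAGMA journal_mode=WAL")  # verify/enable WAL
    return conn
```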
No Proxies Found
- Check search engines in config
- Verify Tor connectivity
- Review scraper logs for rate limiting
License
See LICENSE file.