feat: add 11 alert backends and fix PyPI/DEV.to search
Add Wikipedia, Stack Exchange, GitLab, npm, PyPI, Docker Hub, arXiv, Lobsters, DEV.to, Medium, and Hugging Face backends to the alert plugin (16 -> 27 total).

Fix PyPI backend to use the RSS updates feed (web search now requires a JS challenge). Fix DEV.to to use the public articles API (the feed_content endpoint returns empty).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ROADMAP.md (21 lines changed)
@@ -76,6 +76,27 @@
- [x] Integration tests with mock IRC server
- [x] `username` plugin (cross-platform username enumeration)

## v1.2.0 -- Subscriptions + Proxy (done)

- [x] `rss` plugin (RSS/Atom feed subscriptions with polling)
- [x] `yt` plugin (YouTube channel follow via Atom feeds)
- [x] `twitch` plugin (livestream notifications via public GQL)
- [x] `alert` plugin (keyword alerts across 27 platforms)
- [x] `searx` plugin (SearXNG web search)
- [x] `tdns` plugin (TCP DNS via SOCKS5 proxy)
- [x] `remind` plugin (one-shot, repeating, calendar-based reminders)
- [x] SOCKS5 proxy transport layer (HTTP, TCP, async connections)
- [x] Alert backends: YouTube, Twitch, SearXNG, Reddit, Mastodon,
  DuckDuckGo, Google News, Kick, Dailymotion, PeerTube, Bluesky,
  Lemmy, Odysee, Archive.org, Hacker News, GitHub, Wikipedia,
  Stack Exchange, GitLab, npm, PyPI, Docker Hub, arXiv, Lobsters,
  DEV.to, Medium, Hugging Face
- [x] Alert result history (SQLite) with short IDs and `!alert info`
- [x] OG tag fetching for keyword matching and date enrichment
- [x] Invite auto-join with persistence
- [x] Graceful SIGTERM shutdown
- [x] InnerTube-based YouTube channel resolution for video URLs

## v2.0.0 -- Multi-Server + Stable API

- [ ] Multi-server support (per-server config, shared plugins)
TASKS.md (27 lines changed)
@@ -1,25 +1,28 @@
# derp - Tasks

## Current Sprint -- v1.1.0 Hardening (2026-02-15)
## Current Sprint -- v1.2.0 Subscriptions + Proxy (2026-02-16)

| Pri | Status | Task |
|-----|--------|------|
| P0 | [x] | IRC 512-byte message truncation (RFC 2812) |
| P0 | [x] | Exponential reconnect backoff with jitter |
| P1 | [x] | `dork` plugin (Google dork query builder) |
| P1 | [x] | `wayback` plugin (Wayback Machine snapshot lookup) |
| P1 | [x] | Config merge/load/resolve unit tests |
| P1 | [x] | Bot API + format_msg + split_utf8 tests |
| P1 | [x] | Per-channel plugin enable/disable |
| P1 | [x] | Structured JSON logging |
| P1 | [x] | Documentation update |
| P1 | [x] | `username` plugin (cross-platform username enumeration) |
| P1 | [x] | Integration tests with mock IRC server (14 tests) |
| P0 | [x] | `rss` plugin (RSS/Atom feed subscriptions) |
| P0 | [x] | `yt` plugin (YouTube channel follow via Atom feeds) |
| P0 | [x] | `twitch` plugin (livestream notifications via GQL) |
| P0 | [x] | `alert` plugin (keyword alerts, 27 backends) |
| P0 | [x] | SOCKS5 proxy transport layer (HTTP, TCP, async) |
| P1 | [x] | `searx` plugin (SearXNG web search) |
| P1 | [x] | `tdns` plugin (TCP DNS via SOCKS5) |
| P1 | [x] | `remind` plugin (one-shot, repeating, calendar) |
| P1 | [x] | Alert history (SQLite) with short IDs + `!alert info` |
| P1 | [x] | OG tag fetching for keyword match + date enrichment |
| P1 | [x] | InnerTube channel resolution for video URLs |
| P2 | [x] | Invite auto-join with persistence |
| P2 | [x] | Graceful SIGTERM shutdown |

## Completed

| Date | Task |
|------|------|
| 2026-02-16 | v1.2.0 (subscriptions, alerts, proxy, reminders) |
| 2026-02-15 | Calendar-based reminders (at/yearly) with persistence |
| 2026-02-15 | v1.1.0 (channel filter, JSON logging, dork, wayback, tests) |
| 2026-02-15 | v1.0.0 (IRCv3, chanmgmt, state persistence) |
@@ -346,14 +346,16 @@ No API credentials needed (uses public GQL endpoint).

!alert history <name> [n] # Show recent results (default 5)
```

Searches keywords across 16 backends: YouTube (yt), Twitch (tw), SearXNG (sx),
Searches keywords across 27 backends: YouTube (yt), Twitch (tw), SearXNG (sx),
Reddit (rd), Mastodon (ft), DuckDuckGo (dg), Google News (gn), Kick (kk),
Dailymotion (dm), PeerTube (pt), Bluesky (bs), Lemmy (ly), Odysee (od),
Archive.org (ia), Hacker News (hn), GitHub (gh). Names: lowercase alphanumeric +
hyphens, 1-20 chars. Keywords: 1-100 chars. Max 20 alerts/channel. Polls every
5min. Format: `[name/yt/a8k2m] Title -- URL`. Use `!alert info <id>` to see full
details. No API credentials needed. Persists across restarts. History stored in
`data/alert_history.db`.
Archive.org (ia), Hacker News (hn), GitHub (gh), Wikipedia (wp),
Stack Exchange (se), GitLab (gl), npm (nm), PyPI (pp), Docker Hub (dh),
arXiv (ax), Lobsters (lb), DEV.to (dv), Medium (md), Hugging Face (hf).
Names: lowercase alphanumeric + hyphens, 1-20 chars. Keywords: 1-100 chars.
Max 20 alerts/channel. Polls every 5min. Format: `[name/yt/a8k2m] Title -- URL`.
Use `!alert info <id>` to see full details. No API credentials needed. Persists
across restarts. History stored in `data/alert_history.db`.
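The short IDs in the `[name/yt/a8k2m]` format are deterministic per result. As an illustration only (the `short_id` helper and its hashing scheme below are hypothetical, not taken from `plugins/alert.py`), such an ID can be derived by hashing the platform tag together with the backend's native result ID into a few base-36 characters:

```python
import hashlib

def short_id(platform: str, result_id: str, length: int = 5) -> str:
    """Derive a short, deterministic, lowercase alphanumeric ID
    from a platform tag and the backend's native result ID."""
    digest = hashlib.sha1(f"{platform}:{result_id}".encode()).hexdigest()
    # Fold the leading hex digits into base-36 (digits + lowercase
    # letters) so the ID stays short and IRC-friendly.
    n = int(digest[:12], 16)
    chars = "0123456789abcdefghijklmnopqrstuvwxyz"
    out = []
    while len(out) < length:
        n, r = divmod(n, 36)
        out.append(chars[r])
    return "".join(out)
```

Because the ID depends only on the inputs, re-polling the same result always maps back to the same `!alert info` key.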

## SearX

@@ -660,10 +660,9 @@ Title Three -- https://example.com/page3

### `!alert` -- Keyword Alert Subscriptions

Search keywords across multiple platforms (YouTube, Twitch, SearXNG, Reddit,
Mastodon/Fediverse) and announce new results. Unlike `!rss`/`!yt`/`!twitch`
which follow specific channels/feeds, `!alert` searches keywords across all
supported platforms simultaneously.
Search keywords across 27 platforms and announce new results. Unlike
`!rss`/`!yt`/`!twitch` which follow specific channels/feeds, `!alert` searches
keywords across all supported platforms simultaneously.

```
!alert add <name> <keyword...> Add keyword alert (admin)
@@ -699,6 +698,17 @@ Platforms searched:

- **Archive.org** (`ia`) -- Internet Archive advanced search, sorted by date (no auth required)
- **Hacker News** (`hn`) -- Algolia search API, sorted by date (no auth required)
- **GitHub** (`gh`) -- Repository search API, sorted by recently updated (no auth required)
- **Wikipedia** (`wp`) -- MediaWiki search API, English Wikipedia (no auth required)
- **Stack Exchange** (`se`) -- Stack Overflow search API, sorted by creation date (no auth required)
- **GitLab** (`gl`) -- Public project search API, sorted by last activity (no auth required)
- **npm** (`nm`) -- npm registry search API (no auth required)
- **PyPI** (`pp`) -- Recent package updates RSS feed, keyword-filtered (no auth required)
- **Docker Hub** (`dh`) -- Public repository search API (no auth required)
- **arXiv** (`ax`) -- Atom search API for academic papers (no auth required)
- **Lobsters** (`lb`) -- Community link aggregator search (no auth required)
- **DEV.to** (`dv`) -- Forem articles API, tag-based search (no auth required)
- **Medium** (`md`) -- Tag-based RSS feed (no auth required)
- **Hugging Face** (`hf`) -- Model search API, sorted by last modified (no auth required)

Polling and announcements:

@@ -706,7 +716,8 @@ Polling and announcements:

- On `add`, existing results are recorded without announcing (prevents flood)
- New results announced as `[name/<tag>/<id>] Title -- URL` where tag is one of:
  `yt`, `tw`, `sx`, `rd`, `ft`, `dg`, `gn`, `kk`, `dm`, `pt`, `bs`, `ly`, `od`, `ia`,
  `hn`, `gh` and `<id>` is a short deterministic ID for use with `!alert info`
  `hn`, `gh`, `wp`, `se`, `gl`, `nm`, `pp`, `dh`, `ax`, `lb`, `dv`, `md`, `hf`
  and `<id>` is a short deterministic ID for use with `!alert info`
- Titles are truncated to 80 characters
- Each platform maintains its own seen list (capped at 200 per platform)
- 5 consecutive errors doubles the poll interval (max 1 hour)
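The backoff rule in the last bullet can be sketched as follows (the function and constant names here are hypothetical; the plugin's real scheduler state may differ):

```python
BASE_INTERVAL = 300     # 5 minutes, the normal poll interval
MAX_INTERVAL = 3600     # cap at 1 hour
ERROR_THRESHOLD = 5     # consecutive errors needed before backing off

def next_interval(current: int, consecutive_errors: int) -> int:
    """Double the poll interval each time 5 more consecutive errors
    accumulate, capped at 1 hour; reset to base on success."""
    if consecutive_errors == 0:
        return BASE_INTERVAL          # success resets the interval
    if consecutive_errors % ERROR_THRESHOLD == 0:
        return min(current * 2, MAX_INTERVAL)
    return current                    # not enough errors yet to back off
```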

plugins/alert.py (519 lines changed)
@@ -64,6 +64,17 @@ _ODYSEE_API = "https://api.na-backend.odysee.com/api/v1/proxy"
|
||||
_ARCHIVE_SEARCH_URL = "https://archive.org/advancedsearch.php"
|
||||
_HN_SEARCH_URL = "https://hn.algolia.com/api/v1/search_by_date"
|
||||
_GITHUB_SEARCH_URL = "https://api.github.com/search/repositories"
|
||||
_WIKIPEDIA_API = "https://en.wikipedia.org/w/api.php"
|
||||
_STACKEXCHANGE_URL = "https://api.stackexchange.com/2.3/search"
|
||||
_GITLAB_SEARCH_URL = "https://gitlab.com/api/v4/projects"
|
||||
_NPM_SEARCH_URL = "https://registry.npmjs.org/-/v1/search"
|
||||
_PYPI_RSS_URL = "https://pypi.org/rss/updates.xml"
|
||||
_DOCKERHUB_SEARCH_URL = "https://hub.docker.com/v2/search/repositories/"
|
||||
_ARXIV_API = "https://export.arxiv.org/api/query"
|
||||
_LOBSTERS_SEARCH_URL = "https://lobste.rs/search"
|
||||
_DEVTO_API = "https://dev.to/api/articles"
|
||||
_MEDIUM_FEED_URL = "https://medium.com/feed/tag"
|
||||
_HUGGINGFACE_API = "https://huggingface.co/api/models"
|
||||
|
||||
# -- Module-level tracking ---------------------------------------------------
|
||||
|
||||
@@ -1125,6 +1136,503 @@ def _search_github(keyword: str) -> list[dict]:
    return results


# -- Wikipedia search (blocking) --------------------------------------------

def _search_wikipedia(keyword: str) -> list[dict]:
    """Search Wikipedia articles via public API. Blocking."""
    import urllib.parse

    params = urllib.parse.urlencode({
        "action": "query", "list": "search", "srsearch": keyword,
        "srlimit": "25", "format": "json",
    })
    url = f"{_WIKIPEDIA_API}?{params}"

    req = urllib.request.Request(url, method="GET")
    req.add_header("User-Agent", "Mozilla/5.0 (compatible; derp-bot)")

    resp = _urlopen(req, timeout=_FETCH_TIMEOUT)
    raw = resp.read()
    resp.close()

    data = json.loads(raw)
    results: list[dict] = []
    for item in (data.get("query") or {}).get("search") or []:
        title = item.get("title", "")
        pageid = str(item.get("pageid", ""))
        if not pageid:
            continue
        date = _parse_date(item.get("timestamp", ""))
        slug = title.replace(" ", "_")
        results.append({
            "id": pageid,
            "title": title,
            "url": f"https://en.wikipedia.org/wiki/{slug}",
            "date": date,
            "extra": "",
        })
    return results

# -- Stack Exchange search (blocking) ---------------------------------------

def _search_stackexchange(keyword: str) -> list[dict]:
    """Search Stack Overflow questions via public API. Blocking."""
    import gzip
    import io
    import urllib.parse

    params = urllib.parse.urlencode({
        "order": "desc", "sort": "creation", "intitle": keyword,
        "site": "stackoverflow", "pagesize": "25",
    })
    url = f"{_STACKEXCHANGE_URL}?{params}"

    req = urllib.request.Request(url, method="GET")
    req.add_header("User-Agent", "Mozilla/5.0 (compatible; derp-bot)")
    req.add_header("Accept-Encoding", "gzip")

    resp = _urlopen(req, timeout=_FETCH_TIMEOUT)
    raw = resp.read()
    resp.close()

    try:
        raw = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    except OSError:
        pass

    data = json.loads(raw)
    results: list[dict] = []
    for item in data.get("items") or []:
        qid = str(item.get("question_id", ""))
        if not qid:
            continue
        title = _strip_html(item.get("title", ""))
        link = item.get("link", "")
        score = item.get("score", 0)
        if score:
            title += f" [{score}v]"
        created = item.get("creation_date")
        date = ""
        if created:
            try:
                date = datetime.fromtimestamp(
                    int(created), tz=timezone.utc,
                ).strftime("%Y-%m-%d")
            except (ValueError, OSError):
                pass
        results.append({
            "id": qid, "title": title, "url": link,
            "date": date, "extra": "",
        })
    return results

# -- GitLab search (blocking) ----------------------------------------------

def _search_gitlab(keyword: str) -> list[dict]:
    """Search GitLab projects via public API. Blocking."""
    import urllib.parse

    params = urllib.parse.urlencode({
        "search": keyword, "order_by": "updated_at",
        "sort": "desc", "per_page": "25",
    })
    url = f"{_GITLAB_SEARCH_URL}?{params}"

    req = urllib.request.Request(url, method="GET")
    req.add_header("User-Agent", "Mozilla/5.0 (compatible; derp-bot)")

    resp = _urlopen(req, timeout=_FETCH_TIMEOUT)
    raw = resp.read()
    resp.close()

    data = json.loads(raw)
    results: list[dict] = []
    for repo in data if isinstance(data, list) else []:
        rid = str(repo.get("id", ""))
        if not rid:
            continue
        name = repo.get("path_with_namespace", "")
        description = repo.get("description") or ""
        web_url = repo.get("web_url", "")
        stars = repo.get("star_count", 0)
        title = name
        if description:
            title += f": {_truncate(description, 50)}"
        if stars:
            title += f" [{stars}*]"
        date = _parse_date(repo.get("last_activity_at", ""))
        results.append({
            "id": rid, "title": title, "url": web_url,
            "date": date, "extra": "",
        })
    return results

# -- npm search (blocking) -------------------------------------------------

def _search_npm(keyword: str) -> list[dict]:
    """Search npm packages via registry API. Blocking."""
    import urllib.parse

    params = urllib.parse.urlencode({"text": keyword, "size": "25"})
    url = f"{_NPM_SEARCH_URL}?{params}"

    req = urllib.request.Request(url, method="GET")
    req.add_header("User-Agent", "Mozilla/5.0 (compatible; derp-bot)")

    resp = _urlopen(req, timeout=_FETCH_TIMEOUT)
    raw = resp.read()
    resp.close()

    data = json.loads(raw)
    results: list[dict] = []
    for obj in data.get("objects") or []:
        pkg = obj.get("package") or {}
        name = pkg.get("name", "")
        if not name:
            continue
        description = pkg.get("description") or ""
        version = pkg.get("version", "")
        links = pkg.get("links") or {}
        npm_url = links.get("npm", f"https://www.npmjs.com/package/{name}")
        title = f"{name}@{version}" if version else name
        if description:
            title += f": {_truncate(description, 50)}"
        date = _parse_date(pkg.get("date", ""))
        results.append({
            "id": name, "title": title, "url": npm_url,
            "date": date, "extra": "",
        })
    return results

# -- PyPI search (blocking) ------------------------------------------------

def _search_pypi(keyword: str) -> list[dict]:
    """Search PyPI recent updates via RSS feed, filtered by keyword. Blocking."""
    import xml.etree.ElementTree as ET

    req = urllib.request.Request(_PYPI_RSS_URL, method="GET")
    req.add_header("User-Agent", "Mozilla/5.0 (compatible; derp-bot)")

    resp = _urlopen(req, timeout=_FETCH_TIMEOUT)
    raw = resp.read()
    resp.close()

    root = ET.fromstring(raw)
    kw_lower = keyword.lower()
    results: list[dict] = []
    for item in root.findall(".//item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        desc = (item.findtext("description") or "").strip()
        if not title or not link:
            continue
        if kw_lower not in title.lower() and kw_lower not in desc.lower():
            continue
        pkg_name = title.split()[0] if title else ""
        display = title
        if desc:
            display += f": {_truncate(desc, 50)}"
        results.append({
            "id": pkg_name or link,
            "title": display,
            "url": link,
            "date": "",
            "extra": "",
        })
    return results

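As a self-contained illustration of the keyword filter that `_search_pypi` applies to the updates feed, the same match logic can be exercised against a canned RSS document (the `filter_feed` helper and the sample feed below are hypothetical, for demonstration only):

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for pypi.org/rss/updates.xml.
FEED = """<rss><channel>
<item><title>foopkg 1.0</title><link>https://pypi.org/project/foopkg/1.0/</link>
<description>A foo helper</description></item>
<item><title>barpkg 2.1</title><link>https://pypi.org/project/barpkg/2.1/</link>
<description>bar utilities</description></item>
</channel></rss>"""

def filter_feed(raw: str, keyword: str) -> list[str]:
    """Return links of items whose title or description contains keyword,
    case-insensitively, mirroring the substring match in _search_pypi."""
    kw = keyword.lower()
    root = ET.fromstring(raw)
    links = []
    for item in root.findall(".//item"):
        title = (item.findtext("title") or "").lower()
        desc = (item.findtext("description") or "").lower()
        if kw in title or kw in desc:
            links.append((item.findtext("link") or "").strip())
    return links
```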
# -- Docker Hub search (blocking) ------------------------------------------

def _search_dockerhub(keyword: str) -> list[dict]:
    """Search Docker Hub repositories via public API. Blocking."""
    import urllib.parse

    params = urllib.parse.urlencode({"query": keyword, "page_size": "25"})
    url = f"{_DOCKERHUB_SEARCH_URL}?{params}"

    req = urllib.request.Request(url, method="GET")
    req.add_header("User-Agent", "Mozilla/5.0 (compatible; derp-bot)")

    resp = _urlopen(req, timeout=_FETCH_TIMEOUT)
    raw = resp.read()
    resp.close()

    data = json.loads(raw)
    results: list[dict] = []
    for item in data.get("results") or []:
        name = item.get("repo_name", "")
        if not name:
            continue
        description = item.get("short_description") or ""
        stars = item.get("star_count", 0)
        title = name
        if description:
            title += f": {_truncate(description, 50)}"
        if stars:
            title += f" [{stars}*]"
        hub_url = (
            f"https://hub.docker.com/r/{name}" if "/" in name
            else f"https://hub.docker.com/_/{name}"
        )
        results.append({
            "id": name, "title": title, "url": hub_url,
            "date": "", "extra": "",
        })
    return results

# -- arXiv search (blocking) -----------------------------------------------

def _search_arxiv(keyword: str) -> list[dict]:
    """Search arXiv preprints via Atom API. Blocking."""
    import urllib.parse
    import xml.etree.ElementTree as ET

    params = urllib.parse.urlencode({
        "search_query": f"all:{keyword}",
        "sortBy": "submittedDate", "sortOrder": "descending",
        "max_results": "25",
    })
    url = f"{_ARXIV_API}?{params}"

    req = urllib.request.Request(url, method="GET")
    req.add_header("User-Agent", "Mozilla/5.0 (compatible; derp-bot)")

    resp = _urlopen(req, timeout=_FETCH_TIMEOUT)
    raw = resp.read()
    resp.close()

    ns = {"a": "http://www.w3.org/2005/Atom"}
    root = ET.fromstring(raw)
    results: list[dict] = []
    for entry in root.findall("a:entry", ns):
        entry_id = (entry.findtext("a:id", "", ns) or "").strip()
        title = (entry.findtext("a:title", "", ns) or "").strip()
        title = " ".join(title.split())  # collapse whitespace
        published = entry.findtext("a:published", "", ns) or ""
        link_url = ""
        for link in entry.findall("a:link", ns):
            if link.get("type") == "text/html":
                link_url = link.get("href", "")
                break
        if not link_url:
            link_url = entry_id
        arxiv_id = entry_id.rsplit("/abs/", 1)[-1] if "/abs/" in entry_id else entry_id
        date = _parse_date(published)
        if title:
            results.append({
                "id": arxiv_id, "title": title, "url": link_url,
                "date": date, "extra": "",
            })
    return results

# -- Lobsters search (blocking) --------------------------------------------

class _LobstersParser(HTMLParser):
    """Extract story links from Lobsters search HTML."""

    def __init__(self):
        super().__init__()
        self.results: list[tuple[str, str]] = []
        self._in_link = False
        self._url = ""
        self._title_parts: list[str] = []

    def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None:
        if tag != "a":
            return
        attr_map = {k: (v or "") for k, v in attrs}
        cls = attr_map.get("class", "")
        if "u-url" in cls:
            self._in_link = True
            self._url = attr_map.get("href", "")
            self._title_parts = []

    def handle_data(self, data: str) -> None:
        if self._in_link:
            self._title_parts.append(data)

    def handle_endtag(self, tag: str) -> None:
        if tag == "a" and self._in_link:
            self._in_link = False
            title = "".join(self._title_parts).strip()
            if self._url and title:
                self.results.append((self._url, title))


def _search_lobsters(keyword: str) -> list[dict]:
    """Search Lobsters stories via HTML search page. Blocking."""
    import urllib.parse

    params = urllib.parse.urlencode({
        "q": keyword, "what": "stories", "order": "newest",
    })
    url = f"{_LOBSTERS_SEARCH_URL}?{params}"

    req = urllib.request.Request(url, method="GET")
    req.add_header("User-Agent", "Mozilla/5.0 (compatible; derp-bot)")

    resp = _urlopen(req, timeout=_FETCH_TIMEOUT)
    raw = resp.read()
    resp.close()

    html = raw.decode("utf-8", errors="replace")
    parser = _LobstersParser()
    parser.feed(html)

    results: list[dict] = []
    seen_urls: set[str] = set()
    for item_url, title in parser.results:
        if item_url in seen_urls:
            continue
        seen_urls.add(item_url)
        results.append({
            "id": item_url,
            "title": title,
            "url": item_url,
            "date": "",
            "extra": "",
        })
    return results

# -- DEV.to search (blocking) ----------------------------------------------

def _search_devto(keyword: str) -> list[dict]:
    """Search DEV.to articles via public articles API. Blocking."""
    import urllib.parse

    tag = re.sub(r"[^a-zA-Z0-9]", "", keyword).lower()
    params = urllib.parse.urlencode({"per_page": "25", "tag": tag})
    url = f"{_DEVTO_API}?{params}"

    req = urllib.request.Request(url, method="GET")
    req.add_header("User-Agent", "Mozilla/5.0 (compatible; derp-bot)")

    resp = _urlopen(req, timeout=_FETCH_TIMEOUT)
    raw = resp.read()
    resp.close()

    data = json.loads(raw)
    if not isinstance(data, list):
        return []
    results: list[dict] = []
    for item in data:
        article_id = str(item.get("id", ""))
        if not article_id:
            continue
        title = item.get("title", "")
        article_url = item.get("url", "")
        user = item.get("user", {})
        if isinstance(user, dict):
            author = user.get("username", "")
        else:
            author = ""
        if author:
            title = f"{author}: {title}"
        date = _parse_date(item.get("published_at", ""))
        results.append({
            "id": article_id, "title": title, "url": article_url,
            "date": date, "extra": "",
        })
    return results

# -- Medium tag feed search (blocking) -------------------------------------

def _search_medium(keyword: str) -> list[dict]:
    """Search Medium via tag RSS feed. Blocking."""
    import urllib.parse
    import xml.etree.ElementTree as ET

    tag = re.sub(r"[^a-zA-Z0-9-]", "-", keyword).lower().strip("-")
    if not tag:
        return []
    url = f"{_MEDIUM_FEED_URL}/{urllib.parse.quote(tag, safe='')}"

    req = urllib.request.Request(url, method="GET")
    req.add_header("User-Agent", "Mozilla/5.0 (compatible; derp-bot)")

    resp = _urlopen(req, timeout=_FETCH_TIMEOUT)
    raw = resp.read()
    resp.close()

    root = ET.fromstring(raw)
    results: list[dict] = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if not link:
            continue
        guid = (item.findtext("guid") or link).strip()
        creator = item.findtext("{http://purl.org/dc/elements/1.1/}creator") or ""
        if creator:
            title = f"{creator}: {title}"
        pub_date = item.findtext("pubDate") or ""
        date = _parse_date(pub_date)
        if not date and pub_date:
            from email.utils import parsedate_to_datetime
            try:
                dt = parsedate_to_datetime(pub_date)
                date = dt.strftime("%Y-%m-%d")
            except (ValueError, TypeError):
                pass
        results.append({
            "id": guid, "title": title, "url": link,
            "date": date, "extra": "",
        })
    return results

# -- Hugging Face search (blocking) ----------------------------------------

def _search_huggingface(keyword: str) -> list[dict]:
    """Search Hugging Face models via public API. Blocking."""
    import urllib.parse

    params = urllib.parse.urlencode({
        "search": keyword, "sort": "lastModified",
        "direction": "-1", "limit": "25",
    })
    url = f"{_HUGGINGFACE_API}?{params}"

    req = urllib.request.Request(url, method="GET")
    req.add_header("User-Agent", "Mozilla/5.0 (compatible; derp-bot)")

    resp = _urlopen(req, timeout=_FETCH_TIMEOUT)
    raw = resp.read()
    resp.close()

    data = json.loads(raw)
    results: list[dict] = []
    for model in data if isinstance(data, list) else []:
        model_id = model.get("modelId") or model.get("id", "")
        if not model_id:
            continue
        downloads = model.get("downloads", 0)
        likes = model.get("likes", 0)
        title = model_id
        if downloads:
            title += f" [{downloads} dl]"
        elif likes:
            title += f" [{likes} likes]"
        date = _parse_date(model.get("lastModified", ""))
        results.append({
            "id": model_id,
            "title": title,
            "url": f"https://huggingface.co/{model_id}",
            "date": date,
            "extra": "",
        })
    return results

# -- Backend registry -------------------------------------------------------

_BACKENDS: dict[str, callable] = {
@@ -1144,6 +1652,17 @@ _BACKENDS: dict[str, callable] = {
    "ia": _search_archive,
    "hn": _search_hackernews,
    "gh": _search_github,
    "wp": _search_wikipedia,
    "se": _search_stackexchange,
    "gl": _search_gitlab,
    "nm": _search_npm,
    "pp": _search_pypi,
    "dh": _search_dockerhub,
    "ax": _search_arxiv,
    "lb": _search_lobsters,
    "dv": _search_devto,
    "md": _search_medium,
    "hf": _search_huggingface,
}
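With the registry filled in, one poll cycle amounts to looking up a backend by tag and diffing its results against a per-platform seen set. A minimal sketch, with a stub backend standing in for the real `_search_*` functions (the `poll` helper and stub names here are hypothetical, not part of the plugin):

```python
# Stub registry standing in for _BACKENDS; each backend maps a keyword
# to a list of result dicts with "id", "title", "url", "date", "extra".
def _fake_backend(keyword: str) -> list[dict]:
    return [{"id": "1", "title": f"hit for {keyword}",
             "url": "https://example.com/1", "date": "", "extra": ""}]

BACKENDS = {"wp": _fake_backend}

def poll(tag: str, keyword: str, seen: set[str]) -> list[dict]:
    """Run one backend and return only results not seen before,
    recording their IDs so the next poll skips them."""
    fresh = []
    for result in BACKENDS[tag](keyword):
        if result["id"] in seen:
            continue
        seen.add(result["id"])
        fresh.append(result)
    return fresh
```

The first call after `!alert add` would populate `seen` without announcing, matching the flood-prevention behavior described in the docs.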