# Audio Engine -- Issues, Fixes, and Consolidation Notes

Technical reference for the Mumble audio pipeline: known issues,
applied fixes, architectural decisions, and areas for future work.

## Architecture Overview

```
yt-dlp -> ffmpeg (decode to s16le 48kHz mono) -> PCM frames (20ms)
    -> volume ramp/scale -> pymumble sound_output -> Opus encode -> Mumble
```
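
The frame geometry implied by this pipeline can be worked out directly.
A minimal sketch, assuming the 48kHz mono s16le format and 20ms frames
shown above; the constant and function names are illustrative and not
taken from the codebase:

```python
SAMPLE_RATE = 48_000   # Hz, ffmpeg output rate
CHANNELS = 1           # mono
BYTES_PER_SAMPLE = 2   # s16le = signed 16-bit little-endian
FRAME_MS = 20          # one PCM frame covers 20ms

# Samples and bytes in a single 20ms frame.
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000
FRAME_BYTES = SAMPLES_PER_FRAME * CHANNELS * BYTES_PER_SAMPLE


def frames_to_seconds(n_frames: int) -> float:
    """Playback time covered by n_frames PCM frames."""
    return n_frames * FRAME_MS / 1000
```

This gives 960 samples (1920 bytes) per frame, which is where the
per-frame sample count used elsewhere in this document comes from.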

Key components:

| File | Role |
|------|------|
| `src/derp/mumble.py` | `stream_audio()` -- PCM feed loop, volume ramp, seek |
| `plugins/music.py` | Queue, play loop, fade orchestration, duck monitor |

### Volume control layers (evaluated per-frame, highest priority first)

1. **fade_vol** -- active during fade-out (skip/stop/pause); set to 0 as target
2. **duck_vol** -- voice-activated ducking; snap to floor, linear restore
3. **volume** -- user-set level (0-100)

The play loop passes a lambda to `stream_audio`:

```python
volume=lambda: (
    ps["fade_vol"]   if ps["fade_vol"] is not None else
    ps["duck_vol"]   if ps["duck_vol"] is not None else
    ps["volume"]
) / 100.0
```

### Per-frame volume ramping

`stream_audio` never jumps to the target volume. Each 20ms frame is
ramped from `_cur_vol` toward `target` by at most `step`, where `step`
is one of:

- **_max_step** = 0.005 (200 frames, ~4s for a full 0-to-1 ramp) --
  ceiling for normal changes
- **fade_in_step** -- computed from fade-in duration (default 5s)
- **fade_step** -- override from plugin (fade-out on skip/stop/pause)

When `abs(diff) < 0.0001`, flat scaling is used (avoids ramp artifacts
on steady-state frames). Otherwise, `_scale_pcm_ramp()` linearly
interpolates across all 960 samples in the frame.
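
The ramp described above could look roughly like the following. This is
a sketch under stated assumptions, not the actual `_scale_pcm_ramp()`
implementation: `scale_pcm_ramp` and `next_volume` are hypothetical
names, and the choice to reach the end gain on the last sample of the
frame is an assumption.

```python
import array
import sys


def next_volume(cur: float, target: float, step: float) -> float:
    """Move cur toward target by at most step (one call per frame)."""
    diff = target - cur
    if abs(diff) <= step:
        return target
    return cur + step if diff > 0 else cur - step


def scale_pcm_ramp(frame: bytes, vol_start: float, vol_end: float) -> bytes:
    """Scale s16le samples with a gain moving linearly from vol_start to vol_end."""
    samples = array.array("h")
    samples.frombytes(frame)
    if sys.byteorder == "big":
        samples.byteswap()  # input is little-endian
    n = len(samples)
    for i in range(n):
        # Gain reaches vol_end on the last sample, so the next frame
        # can start from vol_end without a discontinuity.
        gain = vol_start + (vol_end - vol_start) * (i + 1) / n
        value = int(samples[i] * gain)
        samples[i] = max(-32768, min(32767, value))  # clamp to int16
    if sys.byteorder == "big":
        samples.byteswap()  # back to little-endian for output
    return samples.tobytes()
```

Per frame, the feed loop would call `next_volume()` to get the end gain,
scale the frame from the previous gain to the new one, then store the new
gain as the current volume.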

---

## Issues and Fixes

### 1. Alpine ffmpeg lacks librubberband

**Symptom:** 13/15 voice audition samples failed. `rubberband` audio
filter unavailable in ffmpeg.

**Root cause:** Alpine's ffmpeg package is compiled without
`--enable-librubberband`.

**Fix:** Added `rubberband` CLI package to `Containerfile`. Created
`_split_fx()` in `plugins/voice.py` to parse FX chains: pitch-shifting
goes through the `rubberband` CLI binary, remaining filters (bass, echo)
through ffmpeg. Two-stage pipeline.

**Files:** `Containerfile`, `plugins/voice.py`
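
The split step can be illustrated as below. This is a hypothetical
sketch, not the real `_split_fx()`: it assumes the FX chain is a
comma-separated ffmpeg-style filter string in which pitch-shifting
appears as a `rubberband=...` entry, and it ignores escaped commas.

```python
def split_fx(chain: str) -> tuple[list[str], list[str]]:
    """Split an FX chain into (rubberband filters, remaining ffmpeg filters)."""
    pitch: list[str] = []
    rest: list[str] = []
    for part in chain.split(","):
        part = part.strip()
        if not part:
            continue
        name = part.split("=", 1)[0]  # filter name precedes the first '='
        if name == "rubberband":
            pitch.append(part)  # stage 1: handled by the rubberband CLI
        else:
            rest.append(part)   # stage 2: passed to ffmpeg unchanged
    return pitch, rest
```

For example, `split_fx("rubberband=pitch=1.2,bass=g=6")` separates the
pitch entry from the `bass` filter so each can go to its own stage.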

---

### 2. Self-ducking between bots

**Symptom:** derp's music volume dropped when merlin spoke (TTS).

**Root cause:** merlin's TTS output triggered `_on_sound_received`,
which updated the shared `registry._voice_ts` timestamp. derp's duck
monitor saw recent voice activity and ducked.

**Fix:** `_on_sound_received` checks `registry._bots` and returns early
for any bot username -- no timestamp update, no listener dispatch.

```python
def _on_sound_received(self, user, sound_chunk) -> None:
    name = user["name"] if isinstance(user, dict) else None
    bots = getattr(self.registry, "_bots", {})
    if name and name in bots:
        return  # ignore audio from bots entirely
```

**Files:** `src/derp/mumble.py`

---

### 3. Click/pop on skip/stop (fade-out cancellation)

**Symptom:** Audible glitch at the end of fade-out when skipping or
stopping a track.

**Root cause:** `_fade_and_cancel()` fades volume to 0 over ~3s, then
calls `task.cancel()`. In `stream_audio`, `CancelledError` triggers
`clear_buffer()`, which drops any frames still queued in pymumble's
output -- including frames that were encoded at non-zero amplitude a
few frames earlier. The sudden buffer wipe produces a click.

**Fix (two-part):**

1. **Plugin side** (`music.py`): Added 150ms post-fade drain before
   cancel, giving pymumble time to flush remaining silent frames.

2. **Engine side** (`mumble.py`): `CancelledError` handler only calls
   `clear_buffer()` if `_cur_vol > 0.01`. When a fade-out has already
   driven volume to ~0, the remaining buffer frames are silent and
   clearing them is unnecessary.

```python
# mumble.py -- CancelledError handler
if _cur_vol > 0.01:
    self._mumble.sound_output.clear_buffer()
```

```python
# music.py -- _fade_and_cancel()
await asyncio.sleep(duration)
await asyncio.sleep(0.15)  # drain window
task.cancel()
```

**Files:** `src/derp/mumble.py`, `plugins/music.py`

---

### 4. Fade-out math

**How it works:** `_fade_and_cancel(duration=3.0)` computes the
per-frame step from the current effective volume:

```python
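# Use an explicit None check: `duck_vol or volume` would skip a duck level of 0.
cur_vol = (duck_vol if duck_vol is not None else volume) / 100.0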
n_frames = duration / 0.02  # 150 frames for 3s
step = cur_vol / n_frames
```

The play loop sets `ps["fade_vol"] = 0` (the target) and
`ps["fade_step"] = step` (the rate). `stream_audio` ramps `_cur_vol`
toward 0 at `step` per frame. At 50% volume: step = 0.0033, reaching
zero in exactly 150 frames (3.0s).

**Note:** `fade_vol` is set to 0 immediately, making the volume lambda
return 0 as the target. The ramp code smoothly transitions -- there is
no abrupt jump because `_cur_vol` tracks actual output level, not the
target.
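
The arithmetic can be checked numerically. A small sketch, assuming 20ms
frames and a volume given in percent; `fade_step` and
`frames_to_silence` are illustrative helpers, not functions from the
plugin:

```python
FRAME_SEC = 0.02  # one frame = 20ms


def fade_step(volume_pct: float, duration: float = 3.0) -> float:
    """Per-frame decrement that reaches 0 after `duration` seconds."""
    n_frames = duration / FRAME_SEC
    return (volume_pct / 100.0) / n_frames


def frames_to_silence(volume_pct: float, duration: float = 3.0) -> int:
    """Simulate the ramp and count frames until the level reaches 0."""
    cur = volume_pct / 100.0
    step = fade_step(volume_pct, duration)
    frames = 0
    while cur > 1e-9:  # tolerance absorbs floating-point residue
        cur = max(0.0, cur - step)
        frames += 1
    return frames
```

Because the step is derived from the starting level, the fade takes the
same wall-clock time at any volume; only the slope changes.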

---

### 5. Self-mute lifecycle

**Requirement:** merlin mutes on connect, unmutes only when emitting
audio (TTS), re-mutes after a delay.

**Implementation:**

```
connect -> mute()
stream_audio start -> cancel pending mute task, unmute()
stream_audio finally -> spawn _delayed_mute(3.0)
```

The 3-second delay prevents rapid mute/unmute flicker on back-to-back
TTS. The mute task is cancelled if new audio starts before it fires.

**Config:** `self_mute = true` in `[[mumble.extra]]`

**Files:** `src/derp/mumble.py`
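
The cancel-and-reschedule pattern behind this lifecycle can be sketched
with plain asyncio. `MuteController` and its callbacks are hypothetical
names for illustration; the real code lives inside `stream_audio` and
calls pymumble directly.

```python
import asyncio
from typing import Callable, Optional


class MuteController:
    """Unmute while audio plays; re-mute after a quiet delay."""

    def __init__(self, mute: Callable[[], None], unmute: Callable[[], None],
                 delay: float = 3.0) -> None:
        self._mute = mute
        self._unmute = unmute
        self._delay = delay
        self._task: Optional[asyncio.Task] = None

    def _cancel_pending(self) -> None:
        if self._task is not None:
            self._task.cancel()
            self._task = None

    def audio_started(self) -> None:
        # New audio arrived: drop any pending re-mute, then unmute.
        self._cancel_pending()
        self._unmute()

    def audio_finished(self) -> None:
        # Schedule the re-mute; must be called from a running event loop.
        self._cancel_pending()
        loop = asyncio.get_running_loop()
        self._task = loop.create_task(self._delayed_mute())

    async def _delayed_mute(self) -> None:
        await asyncio.sleep(self._delay)
        self._mute()
```

With back-to-back TTS, the second `audio_started()` cancels the first
pending mute, so the bot stays unmuted across both utterances and mutes
once afterwards.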

---

### 6. Self-deafen on connect

**Requirement:** merlin deafens on connect (no audio reception needed).

**Implementation:** `self_deaf = true` config flag, calls
`self._mumble.users.myself.deafen()` in `_on_connected`.

**Files:** `src/derp/mumble.py`, `config/derp.toml`

---

## Pause/Resume

### Design

`!pause` toggles between paused and playing states:

**Pause:** Captures current track + elapsed position + monotonic
timestamp. Fades out, cancels play loop. Queue is preserved.

**Unpause:** Re-inserts track at queue front, starts play loop with
seek. Three special behaviors:

1. **Rewind:** 3s rewind on unpause for continuity. Applied only if
   paused >= 3s, as an anti-flood guard: rapid toggling doesn't compound
   the rewind.

2. **Stale stream:** If paused > 45s, cached stream files (in
   `data/music/cache/`) are deleted so the play loop re-downloads.
   Kept files (`data/music/`) are never deleted. Stream URLs from
   YouTube et al. expire within minutes.

3. **Fade-in:** Unpause always uses `fade_in=True` (5s ramp from 0).

**State cleanup:** `!stop` clears `ps["paused"]`. The play loop's
`finally` block skips `_cleanup_track` when paused (preserves the file).
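
The rewind and stale-stream rules reduce to a small pure function. A
sketch assuming timestamps come from `time.monotonic()`; `resume_plan`
and the constant names are illustrative, not identifiers from
`music.py`:

```python
REWIND_SEC = 3.0   # rewind applied on unpause
STALE_SEC = 45.0   # pause length after which cached streams are dropped


def resume_plan(elapsed: float, paused_at: float, now: float) -> tuple[float, bool]:
    """Return (seek position in seconds, whether the cached stream is stale)."""
    paused_for = now - paused_at
    seek = elapsed
    if paused_for >= REWIND_SEC:
        # Short pauses skip the rewind so rapid toggling cannot walk
        # the position backwards.
        seek = max(0.0, elapsed - REWIND_SEC)
    stale = paused_for > STALE_SEC
    return seek, stale
```

The caller would seek to the returned position and, when `stale` is
true, delete the cached file before restarting the play loop.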

---

## Autoplay

### Design

When `autoplay = true` (config), the play loop stays alive after the
queue empties:

1. Waits for silence (duck_silence threshold, default 15s)
2. Picks one random kept track
3. Plays it
4. On completion, loops back to step 1

This replaces the previous bulk-queue approach (shuffle all kept tracks
at once). Benefits: no large upfront queue, silence-aware gaps between
tracks, indefinite looping.
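
The loop above can be sketched as follows. All names here
(`autoplay_loop`, `play`, `seconds_since_voice`, `max_tracks`) are
hypothetical; the real play loop also handles the queue, ducking, and
cancellation, which this sketch leaves out.

```python
import asyncio
import random
from typing import Awaitable, Callable, Optional, Sequence


async def autoplay_loop(
    kept: Sequence[str],
    play: Callable[[str], Awaitable[None]],
    seconds_since_voice: Callable[[], float],
    silence: float = 15.0,
    poll: float = 1.0,
    max_tracks: Optional[int] = None,
) -> int:
    """Play random kept tracks one at a time, waiting for silence before each."""
    played = 0
    while kept and (max_tracks is None or played < max_tracks):
        # Step 1: wait until the channel has been quiet long enough.
        while seconds_since_voice() < silence:
            await asyncio.sleep(poll)
        # Steps 2-3: pick exactly one track and play it to completion.
        await play(random.choice(kept))
        played += 1
        # Step 4: loop back and wait for silence again.
    return played
```

`max_tracks` exists only so the sketch can terminate in a test; leaving
it as `None` gives the indefinite looping described above.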

### Resume persistence

A background task saves track URL + elapsed position to the state DB
every 10 seconds during playback:

```python
async def _periodic_save():
    while True:
        await asyncio.sleep(10)
        el = cur_seek + progress[0] * 0.02
        if el > 1.0:
            _save_resume(bot, track, el)
```

On hard kill: resumes from at most ~10s behind. On normal track
completion: `_clear_resume()` wipes the state.

---

## Voice Ducking

### Flow

```
voice detected -> duck_vol = floor (instant)
silence > duck_silence -> linear restore over duck_restore seconds
```

The duck monitor runs as a background task alongside the play loop.
It updates `ps["duck_vol"]` which the volume lambda reads per-frame.

### Restore ramp

Restoration is linear from floor to user volume. The per-frame ramp in
`stream_audio` further smooths each 1-second update from the monitor,
eliminating audible steps.
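
The monitor's target level as a function of silence time can be written
as a pure function. A sketch only: `duck_level` is a hypothetical name,
and returning `None` once restoration completes is an assumption based
on the volume lambda treating `duck_vol is None` as "not ducked".

```python
from typing import Optional


def duck_level(silence_sec: float, floor: float, volume: float,
               duck_restore: float, duck_silence: float = 15.0) -> Optional[float]:
    """Duck volume (0-100) after silence_sec of silence, or None when restored."""
    if silence_sec <= duck_silence:
        return floor  # voice is recent: hold at the floor
    progress = (silence_sec - duck_silence) / duck_restore
    if progress >= 1.0:
        return None   # fully restored: fall through to user volume
    return floor + (volume - floor) * progress  # linear restore
```

Called once per second by the monitor, this produces a staircase of
targets; the per-frame ramp in `stream_audio` then smooths each step.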

### Bot audio isolation

Bot usernames (from `registry._bots`) are excluded from
`_on_sound_received` entirely -- no timestamp update, no listener
dispatch. This prevents self-ducking between derp and merlin.

---

## Seek (in-stream pipeline swap)

### Design

Seek rebuilds the ffmpeg pipeline at the new position without cancelling
the play loop task. This avoids the overhead of re-downloading.

1. Set `_seek_fading = True`, `_seek_fade_out = 10` (0.2s ramp-down)
2. Continue reading frames, scaling by decreasing ratio
3. At fade-out = 0: kill ffmpeg, clear buffer, spawn new pipeline
4. 0.5s fade-in on the new pipeline
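
The "decreasing ratio" in step 2 can be pictured as a per-frame gain
sequence. This sketch assumes the ratio is simply the remaining frame
count divided by the total; the actual scaling inside `stream_audio` may
differ, and `seek_fade_ratios` is an illustrative name.

```python
SEEK_FADE_FRAMES = 10  # 10 frames x 20ms = 0.2s ramp-down


def seek_fade_ratios(n_frames: int = SEEK_FADE_FRAMES) -> list[float]:
    """Gain multiplier for each successive frame of the seek fade-out."""
    return [remaining / n_frames for remaining in range(n_frames - 1, -1, -1)]
```

The last ratio is 0, so the old pipeline is silent at the moment it is
killed and the buffer is cleared.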

### Consolidation note

Seek fade-out (10 frames / 0.2s) is much shorter than skip/stop
fade-out (3s). This is intentional -- seek should feel responsive.
The mechanisms are separate: seek uses frame-counting in
`stream_audio`, skip/stop uses `_fade_and_cancel` in the plugin.

---

## Consolidation Opportunities

### Volume control unification

Three volume layers (fade_vol, duck_vol, volume) are evaluated per-frame
in a lambda. This works, but the priority logic is implicit in a chained
conditional. A future refactor could use a single `effective_volume()`
method that resolves priority explicitly and makes the per-frame cost
clearer.
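
One possible shape for such a helper, as a sketch: it assumes the same
`ps` dict keys used by the current lambda and keeps the same priority
order. Nothing here exists in the codebase yet.

```python
from typing import Mapping, Optional

# Highest priority first; "volume" is the fallback and is never None.
_OVERRIDE_LAYERS = ("fade_vol", "duck_vol")


def effective_volume(ps: Mapping[str, Optional[float]]) -> float:
    """Resolve the active volume layer and return it as a 0.0-1.0 gain."""
    for key in _OVERRIDE_LAYERS:
        value = ps.get(key)
        if value is not None:  # 0 is a valid level, so test for None
            return value / 100.0
    return ps["volume"] / 100.0
```

The play loop could then pass `volume=lambda: effective_volume(ps)`, and
adding a new layer would mean extending one tuple.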

### Fade-out ownership

Skip/stop/pause all route through `_fade_and_cancel()` -- good. But the
fade target is communicated indirectly via `ps["fade_vol"] = 0` and
`ps["fade_step"]`, read by a lambda in the play loop, evaluated in
`stream_audio`. A more explicit signal (e.g. an `asyncio.Event` or a
dedicated fade state machine in `stream_audio`) could simplify reasoning
about timing.

### Buffer drain timing

The 150ms post-fade drain is empirical. A more robust approach would be
to query `sound_output.get_buffer_size()` and wait for it to drop below
a threshold before cancelling. This would adapt to varying network
conditions and pymumble buffer sizes.
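
A sketch of that adaptive drain, assuming `get_buffer_size` is a
callable reporting the buffered audio in seconds (for instance a bound
`sound_output.get_buffer_size`). The threshold, timeout, and poll values
are placeholders, and `wait_for_drain` is a hypothetical helper:

```python
import asyncio
import time
from typing import Callable


async def wait_for_drain(get_buffer_size: Callable[[], float],
                         threshold: float = 0.02,
                         timeout: float = 0.5,
                         poll: float = 0.01) -> bool:
    """Wait until the output buffer falls to threshold; False on timeout."""
    deadline = time.monotonic() + timeout
    while get_buffer_size() > threshold:
        if time.monotonic() >= deadline:
            return False  # give up; caller may cancel anyway
        await asyncio.sleep(poll)
    return True
```

The timeout keeps skip/stop responsive if the buffer never empties, for
example when the connection has stalled.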

### Track duration

Duration is probed via `ffprobe` after download (blocking, run in
executor). For kept tracks, it's stored in state metadata. This is
duplicated: kept track metadata already has a duration from
`_fetch_metadata` (yt-dlp), so the `ffprobe` path is only needed as the
fallback for non-kept tracks. Could unify by always probing locally.

### Periodic resume save interval

Currently 10s fixed. Could be adaptive -- save more frequently near
the start of a track (where losing position is more noticeable) and
less frequently later. Marginal benefit vs. complexity though.