Add Docker user namespace testing guide, rollback runbook, and VM backup playbook

- Add comprehensive Docker user namespace testing documentation
- Add Docker configuration rollback runbook for disaster recovery
- Add VM snapshot backup playbook for system protection

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-11 09:55:20 +01:00
parent 005ab46174
commit e124bc2a96
3 changed files with 1517 additions and 0 deletions


@@ -0,0 +1,762 @@
# Docker User Namespace Remapping - Testing and Implementation Guide
**Document Version:** 1.0
**Last Updated:** 2025-11-11
**Risk Level:** HIGH
**Testing Required:** YES (Mandatory in dev/test first)
---
## Table of Contents
1. [Overview](#overview)
2. [Security Benefits](#security-benefits)
3. [Prerequisites](#prerequisites)
4. [Testing Phase (Week 48-49)](#testing-phase-week-48-49)
5. [Production Implementation (Week 50)](#production-implementation-week-50)
6. [Mailcow-Specific Considerations](#mailcow-specific-considerations)
7. [Troubleshooting](#troubleshooting)
---
## Overview
User namespace remapping is a Docker security feature that maps container UID/GIDs to different values on the host, preventing container root from being host root.
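The remapping itself is simple arithmetic: given a subordinate range like `dockremap:165536:65536`, container UID `u` appears on the host as `165536 + u`. A minimal sketch, assuming the common default base and count (verify against `/etc/subuid` on your host):

```shell
# Compute the host UID for a given container UID under userns remapping.
# 165536/65536 are typical dockremap defaults - verify against /etc/subuid.
SUBUID_BASE=165536
SUBUID_COUNT=65536
container_uid=0   # container root
if [ "$container_uid" -lt "$SUBUID_COUNT" ]; then
  host_uid=$((SUBUID_BASE + container_uid))
  echo "container UID $container_uid -> host UID $host_uid"
else
  echo "container UID $container_uid is outside the mapped range" >&2
fi
```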
### Current Status
| Host | User Namespaces | Risk Level | Implementation Priority |
|------|-----------------|------------|------------------------|
| pihole | Not configured | MEDIUM | Week 49 (after testing) |
| mymx | Not configured | HIGH | Week 50 (mailcow complexity) |
### Impact Assessment
**Benefits:**
- ✅ Container root ≠ host root (major security improvement)
- ✅ Reduces container escape impact
- ✅ CIS Docker Benchmark compliance (2.13)
**Risks:**
- ⚠️ **ALL containers must be recreated**
- ⚠️ Volume permissions must be remapped
- ⚠️ Breaking change for existing deployments
- ⚠️ Mailcow may have specific requirements
**Recommendation:** Test thoroughly in dev, then pihole, then mymx (last)
---
## Security Benefits
### Without User Namespace Remapping (Current State)
```
Container:            Host:
UID 0 (root)    →     UID 0 (root)    ❌ DANGEROUS
UID 1000        →     UID 1000
```
**Problem:** Container root can potentially escape and has host root privileges.
### With User Namespace Remapping (Target State)
```
Container:            Host:
UID 0 (root)    →     UID 165536      ✅ SAFE
UID 1000        →     UID 166536
```
**Benefit:** Container root is unprivileged user on host.
---
## Prerequisites
### Before Starting Testing
1. **VM Snapshots Created**
```bash
ansible-playbook playbooks/backup_vm_snapshot.yml \
  -e '{"target_vms": ["pihole", "mymx"]}'
```
2. **Rollback Procedures Reviewed**
- Read: `docs/runbooks/docker-configuration-rollback.md`
- Understand VM snapshot restore process
- Have emergency contact information ready
3. **Maintenance Window Scheduled**
- Duration: 2-3 hours for testing
- Low-traffic period recommended
- Second person available for verification
4. **Documentation Ready**
- This guide printed or accessible offline
- Docker and mailcow documentation available
- Notepad for documenting issues
---
## Testing Phase (Week 48-49)
### Phase 1: Test Environment Setup (Week 48)
**Objective:** Validate user namespace remapping with simple container
#### Option A: Use derp VM (Recommended)
```bash
# 1. Start derp VM (if stopped)
ssh grokbox "sudo virsh start derp"
# 2. Create ansible user and configure SSH
# (Use deploy_linux_vm role or manual setup)
# 3. Install Docker
ansible derp -m apt -a "name=docker.io state=present" -b
# 4. Create snapshot before testing
ansible-playbook playbooks/backup_vm_snapshot.yml \
  -e '{"target_vms": ["derp"]}'
```
#### Option B: Create temporary test container on existing host
```bash
# On pihole (low risk - only 1 container)
# Create test container first
docker run -d --name userns-test \
-v test-volume:/data \
alpine:latest sleep infinity
```
### Phase 2: Enable User Namespace Remapping (Week 48)
#### Step 1: Configure Docker Daemon
```bash
# On test host (derp or pihole)
sudo tee /etc/docker/daemon.json <<EOF
{
  "userns-remap": "default"
}
EOF
# Validate syntax
cat /etc/docker/daemon.json | jq '.'
```
#### Step 2: Restart Docker
```bash
# Stop all containers first
docker stop $(docker ps -q)
# Restart Docker daemon
sudo systemctl restart docker
# Verify it started
sudo systemctl status docker
# Check for user namespace in docker info
docker info | grep -i "userns"
# Output should include name=userns under Security Options
```
#### Step 3: Verify UID Mapping
```bash
# Check subuid/subgid configuration
cat /etc/subuid
cat /etc/subgid
# Should show something like:
# dockremap:165536:65536
# Verify Docker is using remapping
docker info --format '{{.SecurityOptions}}'
```
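Each `/etc/subuid` entry has the form `user:start:count`. A quick sketch of pulling the fields apart, using a sample line so it runs anywhere:

```shell
# Split a subuid entry into its start/count fields; the sample line mirrors
# the dockremap entry expected on a remapped host.
line='dockremap:165536:65536'
start=$(printf '%s' "$line" | cut -d: -f2)
count=$(printf '%s' "$line" | cut -d: -f3)
echo "remapped container root -> host UID $start (range size $count)"
```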
#### Step 4: Recreate Test Container
```bash
# Remove old container (data is in volume)
docker rm userns-test
# Recreate container
docker run -d --name userns-test \
-v test-volume:/data \
alpine:latest sleep infinity
# Verify it's running
docker ps | grep userns-test
```
#### Step 5: Test Volume Permissions
```bash
# Create test file in container
docker exec userns-test sh -c 'echo "test" > /data/test.txt'
# Check file ownership on host
# Volume location changed! It's now in:
sudo ls -la /var/lib/docker/165536.165536/volumes/test-volume/_data/
# UID should be 165536 (remapped root)
# Test read/write in container
docker exec userns-test cat /data/test.txt
docker exec userns-test sh -c 'echo "test2" >> /data/test.txt'
```
### Phase 3: Test with Real Application (Week 48-49)
#### Test Scenario 1: Simple Web Server (pihole preparation)
```bash
# Deploy nginx with volume
docker run -d --name test-nginx \
-p 8080:80 \
-v nginx-data:/usr/share/nginx/html \
nginx:alpine
# Test access
curl http://localhost:8080
# Create content
docker exec test-nginx sh -c 'echo "<h1>User Namespace Test</h1>" > /usr/share/nginx/html/test.html'
# Verify access
curl http://localhost:8080/test.html
# Check logs
docker logs test-nginx
```
#### Test Scenario 2: Database Container (mailcow preparation)
```bash
# Deploy MariaDB with volume
docker run -d --name test-db \
-e MYSQL_ROOT_PASSWORD=testpass123 \
-v mysql-data:/var/lib/mysql \
mariadb:10.11
# Wait for startup
sleep 30
# Test database
docker exec test-db mysql -ptestpass123 -e "SHOW DATABASES;"
# Create test database
docker exec test-db mysql -ptestpass123 -e "CREATE DATABASE testdb;"
# Stop and restart to test persistence
docker stop test-db
docker start test-db
sleep 20
# Verify data persisted
docker exec test-db mysql -ptestpass123 -e "SHOW DATABASES;" | grep testdb
```
#### Test Scenario 3: Application with File Uploads
```bash
# Create upload directory
mkdir -p /tmp/test-uploads
# Run container with bind mount
docker run -d --name test-upload \
-v /tmp/test-uploads:/uploads \
alpine:latest sleep infinity
# Test file creation
docker exec test-upload sh -c 'echo "test" > /uploads/test.txt'
# Check host permissions
ls -la /tmp/test-uploads/
# File should be owned by UID 165536
# Test file access from container
docker exec test-upload cat /uploads/test.txt
```
### Phase 4: Identify Issues (Week 48-49)
#### Common Issues to Check
1. **Permission Denied Errors**
```bash
# Check container logs
docker logs <container_name> 2>&1 | grep -i "permission"
```
2. **Volume Mount Failures**
```bash
# List volumes
docker volume ls
# Inspect volume
docker volume inspect <volume_name>
# Check actual location on disk
sudo ls -la /var/lib/docker/*/volumes/
```
3. **Bind Mount Issues**
```bash
# For bind mounts, may need to adjust host permissions
# Example: Allow remapped UID to write
sudo chown 165536:165536 /path/to/host/dir
```
4. **Privileged Container Conflicts**
```bash
# Test if privileged containers still work (under userns-remap they must
# also be started with --userns=host)
docker run --rm --privileged --userns=host alpine:latest id
# Note: privileged + --userns=host runs as real root (remapping bypassed)
```
#### Document All Findings
Create test log:
```markdown
## User Namespace Remapping Test Log
Date: <date>
Host: <hostname>
Docker Version: <version>
### Test 1: Simple Container
- Result: PASS/FAIL
- Issues: <none or list>
- Notes: <observations>
### Test 2: Web Server
- Result: PASS/FAIL
- Issues: <none or list>
- Notes: <observations>
### Test 3: Database
- Result: PASS/FAIL
- Issues: <none or list>
- Notes: <observations>
### Conclusion
Ready for production: YES/NO
Blockers: <list if any>
```
---
## Production Implementation (Week 50)
### Implementation Order
1. **pihole** (Week 49 end / Week 50 start) - Lowest risk
2. **mymx** (Week 50 end) - Highest risk, requires mailcow-specific testing
### pihole Implementation
**Prerequisites:**
- ✅ Testing completed successfully on derp/test environment
- ✅ VM snapshot created
- ✅ Maintenance window scheduled
- ✅ Rollback procedure reviewed
**Steps:**
```bash
# 1. Create snapshot
ansible-playbook playbooks/backup_vm_snapshot.yml \
  -e '{"target_vms": ["pihole"], "snapshot_description": "Pre user namespace implementation"}'
# 2. Backup current configuration
ansible pihole -m shell -a "sudo cp /etc/docker/daemon.json /etc/docker/daemon.json.backup.$(date +%s)" -b
# 3. Stop pihole container
ansible pihole -m shell -a "docker stop pihole" -b
# 4. Configure user namespace remapping
ansible pihole -m copy -b -a "
dest=/etc/docker/daemon.json
content='{\"userns-remap\": \"default\"}'
owner=root
group=root
mode='0644'
"
# 5. Restart Docker
ansible pihole -m systemd -a "name=docker state=restarted" -b
# 6. Verify Docker started
ansible pihole -m shell -a "docker info | grep -i userns" -b
# 7. Recreate pihole container (adjust based on actual deployment)
# If using docker run command, re-run it
# If using docker-compose, run: docker-compose up -d
# 8. Verify pihole is working
ansible pihole -m shell -a "docker ps" -b
ansible pihole -m shell -a "docker logs pihole --tail 50" -b
# 9. Test DNS functionality
dig @192.168.122.12 google.com
# 10. Monitor for 1 hour
watch -n 60 'ansible pihole -m shell -a "docker ps" -b'
```
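Step 4 above writes `daemon.json` wholesale, which discards any other daemon settings already in the file. If the host carries more options, a key merge is safer. A sketch, assuming python3 on the host (the demo writes a local `daemon.json`; in practice the path is `/etc/docker/daemon.json`, written with sudo):

```python
# Merge the userns-remap key into an existing daemon.json instead of
# replacing the whole file. Demo path is local; use /etc/docker/daemon.json.
import json
import pathlib

path = pathlib.Path("daemon.json")
cfg = json.loads(path.read_text()) if path.exists() else {}
cfg["userns-remap"] = "default"   # the only key this change should touch
path.write_text(json.dumps(cfg, indent=2) + "\n")
print(sorted(cfg))
```

Existing keys such as log or storage settings survive the rewrite, which keeps the rollback diff small.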
**Rollback if Issues:**
```bash
# Follow docs/runbooks/docker-configuration-rollback.md
# Procedure 3: User Namespace Remapping Rollback
```
---
## Mailcow-Specific Considerations
### Why Mailcow is Complex
1. **Multiple interconnected containers** (24 containers)
2. **Persistent data in multiple volumes** (mail, databases, configs)
3. **File permissions critical** for mail delivery
4. **Active production service** - downtime impact high
### Mailcow Testing Approach (Week 49-50)
#### Phase 1: Research (Week 49)
```bash
# 1. Check mailcow documentation
# Search: "user namespace" or "userns-remap"
# URL: https://docs.mailcow.email/
# 2. Check mailcow GitHub issues
# Search for: userns, user namespace, permission issues
# 3. Check mailcow community forum
# URL: https://community.mailcow.email/
# Search for similar implementations
```
#### Phase 2: Mailcow Test Environment (Week 49)
**Option A: Deploy test mailcow on derp**
```bash
# Requires:
# - 4GB+ RAM (derp may be too small)
# - 20GB+ disk space
# - Domain for testing
# Install mailcow on derp
git clone https://github.com/mailcow/mailcow-dockerized
cd mailcow-dockerized
./generate_config.sh
docker-compose up -d
```
**Option B: Clone mymx mailcow config to test environment**
```bash
# Create test VM clone
# Copy mailcow configuration
# Test with user namespaces
```
#### Phase 3: Mailcow Volume Analysis (Week 49)
```bash
# On mymx, identify all volumes
docker volume ls | grep mailcow
# Check critical volumes
docker volume inspect mailcowdockerized_vmail-vol-1
docker volume inspect mailcowdockerized_mysql-vol-1
# Document current permissions
for vol in $(docker volume ls -q | grep mailcow); do
echo "=== $vol ==="
sudo ls -la /var/lib/docker/volumes/$vol/_data/ | head -20
done > /tmp/mailcow-permissions-before.txt
```
#### Phase 4: Mailcow Implementation (Week 50 - IF testing successful)
**ONLY proceed if:**
- ✅ Testing in dev environment successful
- ✅ pihole implementation successful
- ✅ Mailcow community confirms no known issues
- ✅ Extended maintenance window available (2-4 hours)
- ✅ Full backups completed
- ✅ Rollback tested and confirmed working
**Implementation Steps:**
```bash
# 1. Create snapshot
ansible-playbook playbooks/backup_vm_snapshot.yml \
  -e '{"target_vms": ["mymx"], "snapshot_description": "Pre mailcow user namespace"}'
# 2. Backup ALL mailcow data
ansible mymx -m shell -a "cd /opt/mailcow-dockerized && ./helper-scripts/backup_and_restore.sh backup all" -b
# 3. Stop mailcow
ansible mymx -m shell -a "cd /opt/mailcow-dockerized && docker-compose down" -b
# 4. Backup current state
ansible mymx -m shell -a "
sudo tar -czf /root/mailcow-pre-userns-$(date +%s).tar.gz \
/etc/docker \
/opt/mailcow-dockerized \
/var/lib/docker/volumes/mailcow*
" -b
# 5. Configure user namespace
ansible mymx -m shell -a "
sudo cp /etc/docker/daemon.json /etc/docker/daemon.json.backup.$(date +%s)
echo '{\"userns-remap\": \"default\"}' | sudo tee /etc/docker/daemon.json
" -b
# 6. Restart Docker
ansible mymx -m systemd -a "name=docker state=restarted" -b
# 7. Verify Docker started with user namespaces
ansible mymx -m shell -a "docker info | grep -i userns" -b
# 8. Start mailcow (will recreate all containers)
ansible mymx -m shell -a "cd /opt/mailcow-dockerized && docker-compose up -d" -b
# 9. Monitor startup
watch -n 10 'ansible mymx -m shell -a "cd /opt/mailcow-dockerized && docker-compose ps" -b'
# 10. Check logs for permission errors
ansible mymx -m shell -a "cd /opt/mailcow-dockerized && docker-compose logs --tail 100" -b | grep -i "permission\|denied\|failed"
# 11. Test mail functionality
# - Send test email
# - Receive test email
# - Check webmail access
# - Verify SOGo groupware
# - Test IMAP/SMTP connections
# 12. Monitor for 4-8 hours before declaring success
```
**Known Potential Issues with Mailcow:**
1. **Vmail Volume Permissions**
```bash
# If mail delivery fails with permission errors
# May need to adjust permissions (LAST RESORT)
sudo chown -R 165536:165536 /var/lib/docker/165536.165536/volumes/mailcowdockerized_vmail-vol-1/_data/
```
2. **MySQL Volume Issues**
```bash
# If database won't start
# Check MySQL logs
docker logs mailcowdockerized-mysql-mailcow-1
# May need database permission fixes
# This is why testing is CRITICAL
```
3. **Dovecot Permission Issues**
```bash
# Dovecot is sensitive to mail file permissions
# May require config adjustments in mailcow.conf
```
### Mailcow Rollback Decision Point
**Roll back immediately if:**
- Docker daemon won't start
- MySQL container won't start
- Cannot send/receive mail after 15 minutes
- Permission errors in critical containers
- Data appears missing/inaccessible
**Use VM snapshot restore if:**
- Multiple containers failing
- Data corruption suspected
- Cannot resolve within 30 minutes
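The time-boxed criteria above can be wired into a small health gate: poll a check command and make the rollback call the moment the budget expires. A sketch with a placeholder check (replace it with a real mail round-trip test; the runbook budget is 15 minutes, shortened here for the demo):

```shell
# Time-boxed health gate: succeed fast, or print the rollback decision.
deadline=$(( $(date +%s) + 5 ))   # 5 s for the demo; 900 s per the runbook
check() { true; }                  # placeholder - substitute a real service test
until check; do
  if [ "$(date +%s)" -ge "$deadline" ]; then
    echo "budget exhausted - ROLL BACK"
    exit 1
  fi
  sleep 1
done
echo "service healthy within budget"
```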
---
## Troubleshooting
### Issue 1: Docker Daemon Won't Start
**Symptoms:**
```bash
systemctl status docker
# Failed to start Docker Application Container Engine
```
**Solutions:**
```bash
# Check logs
journalctl -u docker -n 100 --no-pager
# Common causes:
# 1. Invalid daemon.json syntax
cat /etc/docker/daemon.json | jq '.'
# 2. Subuid/subgid not configured
cat /etc/subuid
cat /etc/subgid
# Should have dockremap:165536:65536
# 3. Restore backup
sudo cp /etc/docker/daemon.json.backup.<timestamp> /etc/docker/daemon.json
sudo systemctl start docker
```
### Issue 2: Container Won't Start - Permission Denied
**Symptoms:**
```bash
docker logs <container>
# Permission denied errors
```
**Solutions:**
```bash
# 1. Check volume location
docker volume inspect <volume_name>
# 2. Check permissions on host
sudo ls -la /var/lib/docker/165536.165536/volumes/<volume>/_data/
# 3. If permissions wrong, may need to adjust
# (Avoid this if possible - indicates larger problem)
sudo chown -R 165536:165536 /var/lib/docker/165536.165536/volumes/<volume>/_data/
```
### Issue 3: Bind Mounts Not Working
**Symptoms:**
```bash
docker logs <container>
# Cannot access /bind/mount/path
```
**Solutions:**
```bash
# Bind mounts need host directory permissions adjusted
sudo chown 165536:165536 /path/to/bind/mount
# Or use volumes instead of bind mounts
# Volumes are handled automatically by Docker
```
### Issue 4: Privileged Container Needed
**Note:** With remapping enabled, privileged containers (like mailcow's netfilter) must be started with `--userns=host`; they then run as real root, outside the remapping.
```bash
# Verify privileged container still works
docker inspect <container> | grep -i privileged
# Should show: "Privileged": true
# Under userns-remap, privileged containers also need --userns=host and then
# run as actual root (remapping bypassed)
# This is expected for netfilter, acceptable risk (documented)
```
---
## Success Criteria
### Testing Phase Success (Before Production)
- [ ] Simple container runs successfully
- [ ] Web server container accessible
- [ ] Database container stores/retrieves data
- [ ] Volume permissions correct (165536 UID)
- [ ] Bind mounts work (if needed)
- [ ] No permission errors in logs
- [ ] Can recreate containers after Docker restart
- [ ] Rollback procedure tested and successful
### Production Implementation Success
#### pihole
- [ ] VM snapshot created
- [ ] Docker daemon running with user namespaces
- [ ] pihole container running
- [ ] DNS queries working
- [ ] No permission errors in logs
- [ ] Monitoring shows normal operation for 24+ hours
#### mymx/mailcow
- [ ] VM snapshot created
- [ ] Docker daemon running with user namespaces
- [ ] All 24 containers running
- [ ] Can send email
- [ ] Can receive email
- [ ] Webmail accessible
- [ ] SOGo groupware working
- [ ] No permission errors in logs
- [ ] Monitoring shows normal operation for 48+ hours
- [ ] Full service verification completed
---
## Decision Tree
```
START: Ready to enable user namespaces?
├─ Testing completed in dev?
│ ├─ NO → STOP: Complete testing first
│ └─ YES → Continue
├─ VM snapshots created?
│ ├─ NO → STOP: Create snapshots first
│ └─ YES → Continue
├─ Rollback procedure reviewed?
│ ├─ NO → STOP: Review rollback docs
│ └─ YES → Continue
├─ Which host?
│ ├─ pihole → Proceed (lower risk)
│ └─ mymx → Additional checks needed
│ │
│ ├─ Mailcow community research done?
│ │ ├─ NO → STOP: Research first
│ │ └─ YES → Continue
│ │
│ ├─ pihole implementation successful?
│ │ ├─ NO → STOP: Fix pihole first
│ │ └─ YES → Continue
│ │
│ ├─ Extended maintenance window?
│ │ ├─ NO → STOP: Schedule proper window
│ │ └─ YES → Proceed with caution
│ │
│ └─ Proceed with mymx (high risk)
```
---
## References
- Docker User Namespace Documentation: https://docs.docker.com/engine/security/userns-remap/
- CIS Docker Benchmark 2.13: Enable user namespace support
- Mailcow Documentation: https://docs.mailcow.email/
- NIST SP 800-190: Section 4.4 - Host OS and multi-tenancy
---
**Document Version:** 1.0
**Next Review:** After testing completion (Week 49)
**Owner:** Infrastructure Security Team


@@ -0,0 +1,549 @@
# Docker Configuration Rollback Procedures
**Document Version:** 1.0
**Last Updated:** 2025-11-11
**Owner:** Infrastructure Team
**Risk Level:** HIGH - User Namespace Remapping / LOW - Resource Limits
---
## Table of Contents
1. [Overview](#overview)
2. [Pre-Change Requirements](#pre-change-requirements)
3. [Rollback Procedures](#rollback-procedures)
4. [Specific Scenarios](#specific-scenarios)
5. [Emergency Contacts](#emergency-contacts)
---
## Overview
This runbook provides step-by-step rollback procedures for Docker configuration changes, with special focus on high-risk modifications like user namespace remapping.
### Risk Classification
| Change Type | Risk Level | Rollback Complexity | Downtime |
|-------------|-----------|---------------------|----------|
| Resource limits | LOW | Simple | < 1 min |
| Image version pinning | LOW | Simple | < 1 min |
| User namespace remapping | HIGH | Complex | 5-15 min |
| Network configuration | MEDIUM | Moderate | 2-5 min |
| Storage driver change | CRITICAL | Complex | 15-30 min |
---
## Pre-Change Requirements
### Before ANY Docker Configuration Change
**MANDATORY STEPS - DO NOT SKIP:**
1. **Create VM Snapshot**
```bash
# From Ansible control node
ansible-playbook playbooks/backup_vm_snapshot.yml \
  -e '{"target_vms": ["pihole"], "snapshot_description": "Pre Docker config change"}'
```
2. **Backup Docker Configuration**
```bash
# On target host
sudo cp /etc/docker/daemon.json /etc/docker/daemon.json.backup.$(date +%s)
sudo tar -czf /root/docker-backup-$(date +%s).tar.gz \
/etc/docker \
/var/lib/docker/volumes
```
3. **Document Current State**
```bash
# Capture current container list
docker ps -a --format "table {{.Names}}\t{{.Image}}\t{{.Status}}" > /tmp/containers-before.txt
# Capture current configuration
docker info > /tmp/docker-info-before.txt
# Capture volume list
docker volume ls > /tmp/volumes-before.txt
```
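After a change (or a rollback), the captures above make drift detection a one-line diff against a fresh capture. Sketch with stand-in files; in practice, diff `/tmp/containers-before.txt` against a new `docker ps -a` capture:

```shell
# Compare pre-change and post-change container captures; any diff output
# means container state drifted. Stand-in files used here so this runs
# without a live Docker host.
printf 'pihole Up\n' > /tmp/containers-before.txt
printf 'pihole Up\n' > /tmp/containers-after.txt
if diff -u /tmp/containers-before.txt /tmp/containers-after.txt >/dev/null; then
  echo "no container drift"
else
  echo "container state drifted - investigate before closing the change"
fi
```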
4. **Verify Connectivity**
```bash
# Test from Ansible control node
ansible pihole -m ping
ansible pihole -m shell -a "docker ps"
```
5. **Schedule Maintenance Window**
- Notify stakeholders
- Plan for 30-60 minute window
- Have second person available for verification
---
## Rollback Procedures
### Procedure 1: Quick Rollback (Resource Limits / Image Versions)
**Time Estimate:** 1-2 minutes
**Risk:** LOW
**Downtime:** < 1 minute per container
#### Steps
1. **Stop affected container**
```bash
docker stop <container_name>
```
2. **Restore previous configuration**
```bash
# For docker run commands
# Simply re-run with old parameters
# For docker-compose
git checkout HEAD~1 docker-compose.yml
docker-compose up -d <container_name>
```
3. **Verify service**
```bash
docker ps | grep <container_name>
docker logs <container_name> --tail 50
# Test application functionality
curl -I http://<service_url>
```
#### Success Criteria
- Container running
- Logs show normal operation
- Service accessible
- No errors in `docker logs`
---
### Procedure 2: Daemon Configuration Rollback (Non-Breaking Changes)
**Time Estimate:** 3-5 minutes
**Risk:** MEDIUM
**Downtime:** 2-3 minutes
#### Steps
1. **Restore daemon.json**
```bash
sudo cp /etc/docker/daemon.json.backup.<timestamp> /etc/docker/daemon.json
```
2. **Restart Docker daemon**
```bash
sudo systemctl restart docker
```
3. **Verify Docker is running**
```bash
sudo systemctl status docker
docker info
```
4. **Check all containers**
```bash
docker ps -a
# Restart any stopped containers
docker start $(docker ps -aq)
```
5. **Verify services**
```bash
# Test each service
docker logs <container> --tail 20
```
#### Success Criteria
- Docker daemon running
- All containers started
- Services accessible
- No errors in `journalctl -u docker`
---
### Procedure 3: User Namespace Remapping Rollback (HIGH RISK)
**Time Estimate:** 10-15 minutes
**Risk:** HIGH
**Downtime:** 10-15 minutes
**Data Loss Risk:** LOW (if volumes backed up)
⚠️ **WARNING:** This is the most complex rollback. Follow carefully.
#### Pre-Rollback Verification
```bash
# Verify snapshot exists
ssh grokbox "sudo virsh snapshot-list <vm_name>"
# Verify backup archive exists
ls -lh /root/docker-backup-*.tar.gz
```
#### Steps
1. **Stop all containers gracefully**
```bash
# Mailcow example
cd /opt/mailcow-dockerized
docker-compose down
# Or generic
docker stop $(docker ps -q)
```
2. **Stop Docker daemon**
```bash
sudo systemctl stop docker
```
3. **Restore daemon.json (remove userns-remap)**
```bash
sudo cp /etc/docker/daemon.json.backup.<timestamp> /etc/docker/daemon.json
# Verify userns-remap is removed
grep -i userns /etc/docker/daemon.json
```
4. **CRITICAL: Handle user namespace volume mappings**
```bash
# User namespaced volumes are in a different location
# /var/lib/docker/<uid>.<gid>/volumes/
# List namespaced volumes
sudo ls -la /var/lib/docker/*/volumes/
# Copy volumes back to the default location if needed (the UID.GID directory
# below assumes the default dockremap range; adjust to match your host)
sudo rsync -av /var/lib/docker/165536.165536/volumes/ /var/lib/docker/volumes/
```
5. **Start Docker daemon**
```bash
sudo systemctl start docker
sudo systemctl status docker
```
6. **Verify Docker info**
```bash
docker info | grep -i "userns"
# Should NOT show user namespace remapping
```
7. **Recreate containers**
```bash
# Mailcow example
cd /opt/mailcow-dockerized
docker-compose up -d
# Wait for all containers to start
watch -n 2 'docker ps --format "table {{.Names}}\t{{.Status}}"'
```
8. **Verify all services**
```bash
# Check container logs
docker-compose logs --tail 50
# Test services
curl -I https://cow.mymx.me
# Verify email functionality (mailcow)
docker-compose exec postfix-mailcow postqueue -p
```
#### If Rollback Fails: VM Snapshot Restore
```bash
# From Ansible control node or directly on hypervisor
# 1. Shutdown VM
ssh grokbox "sudo virsh shutdown <vm_name>"
# 2. Wait for shutdown (max 60 seconds)
sleep 30
# 3. Force stop if needed
ssh grokbox "sudo virsh destroy <vm_name>"
# 4. Revert to snapshot
ssh grokbox "sudo virsh snapshot-revert <vm_name> backup_<timestamp>"
# 5. Start VM
ssh grokbox "sudo virsh start <vm_name>"
# 6. Verify SSH access (may take 1-2 minutes)
ansible <vm_name> -m ping
# 7. Verify services
ansible <vm_name> -m shell -a "docker ps"
```
#### Success Criteria
- Docker daemon running WITHOUT user namespace remapping
- All containers running
- All services accessible
- Volume data intact
- No permission errors in logs
---
## Specific Scenarios
### Scenario A: Mailcow Container Won't Start After Namespace Change
**Symptoms:**
- Containers exit immediately
- Permission denied errors in logs
- Volume mount failures
**Solution:**
```bash
# 1. Check volume permissions
docker run --rm -v mailcowdockerized_vmail-vol-1:/volume alpine ls -la /volume
# 2. Fix permissions if needed (DANGEROUS - only if you know UID mapping)
# This example assumes standard userns mapping (165536 offset)
sudo chown -R 165536:165536 /var/lib/docker/volumes/mailcowdockerized_vmail-vol-1
# 3. If permissions are unfixable, revert to snapshot
# See "VM Snapshot Restore" above
```
### Scenario B: Docker Daemon Won't Start After Config Change
**Symptoms:**
- `systemctl start docker` fails
- Errors in `journalctl -u docker`
**Solution:**
```bash
# 1. Check exact error
sudo journalctl -u docker -n 50 --no-pager
# 2. Validate daemon.json syntax
sudo cat /etc/docker/daemon.json | jq '.'
# 3. If syntax error, restore backup
sudo cp /etc/docker/daemon.json.backup.<timestamp> /etc/docker/daemon.json
# 4. If configuration conflict, check docs
sudo dockerd --validate --config-file /etc/docker/daemon.json
# 5. Start daemon
sudo systemctl start docker
```
### Scenario C: Data Loss After Namespace Change
**Symptoms:**
- Volumes appear empty
- Database containers can't find data
- Application state lost
**Solution:**
```bash
# 1. STOP - Do not proceed with data recovery attempts
# 2. DO NOT restart containers
# 3. Immediately revert to snapshot
ssh grokbox "sudo virsh snapshot-revert <vm_name> backup_<timestamp>"
# 4. After VM restore, verify data
docker exec <database_container> <verification_command>
# Example for MySQL
docker exec mailcowdockerized-mysql-mailcow-1 mysql -u root -p<password> -e "SHOW DATABASES;"
```
---
## Testing Rollback Procedures
### Monthly Rollback Drill
**Schedule:** First Monday of each month
**Duration:** 30 minutes
**Environment:** Development/Test VMs only
#### Drill Steps
1. **Create test VM or use derp**
```bash
# Deploy test container
docker run -d --name test-nginx nginx:latest
```
2. **Create snapshot**
```bash
ansible-playbook playbooks/backup_vm_snapshot.yml \
  -e '{"target_vms": ["test-vm"]}'
```
3. **Make intentional breaking change**
```bash
# Break Docker config
echo '{"invalid": json}' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker # This will fail
```
4. **Practice rollback**
```bash
# Follow Procedure 2 above
sudo cp /etc/docker/daemon.json.backup.<timestamp> /etc/docker/daemon.json
sudo systemctl start docker
```
5. **Practice snapshot restore**
```bash
# Follow VM Snapshot Restore procedure
ssh grokbox "sudo virsh snapshot-revert test-vm backup_<timestamp>"
```
6. **Document issues found**
- Update this runbook
- Note any steps that were unclear
- Time each procedure
---
## Emergency Contacts
### Escalation Path
| Level | Contact | Response Time | Responsibility |
|-------|---------|---------------|----------------|
| L1 | Infrastructure Team | Immediate | Execute runbook |
| L2 | Senior Sysadmin | 15 minutes | Complex issues |
| L3 | Vendor Support | 1-4 hours | Critical failures |
### Service-Specific Contacts
**Mailcow:**
- Documentation: https://docs.mailcow.email/
- Community: https://community.mailcow.email/
- Emergency: Check for known issues in GitHub
**Docker:**
- Documentation: https://docs.docker.com/
- Community Forums: https://forums.docker.com/
---
## Post-Rollback Actions
### After Any Rollback
1. **Update incident log**
```markdown
Date: <timestamp>
VM: <vm_name>
Change Attempted: <description>
Rollback Procedure Used: <procedure_number>
Success: Yes/No
Time to Restore: <minutes>
Issues Encountered: <list>
```
2. **Verify service monitoring**
- Check all alerts cleared
- Verify metrics returning to normal
- Test service endpoints
3. **Document lessons learned**
- What went wrong?
- What could be improved?
- Update this runbook
4. **Schedule post-mortem** (for critical incidents)
- Within 48 hours
- All stakeholders present
- Action items assigned
5. **Update change management records**
- Mark change as rolled back
- Document reason for failure
- Plan for retry (if applicable)
---
## Preventive Measures
### Before Making High-Risk Changes
1. **Test in development first**
- Use derp VM or test environment
- Replicate production as closely as possible
- Document exact steps that work
2. **Review Docker/Mailcow changelogs**
- Check for known issues
- Review breaking changes
- Search community forums
3. **Peer review change plan**
- Have colleague review procedure
- Walk through rollback steps
- Verify backup procedures
4. **Schedule during low-traffic period**
- Weekend or late evening
- Notify users in advance
- Have monitoring ready
---
## Appendix A: Quick Reference Commands
### Snapshot Management
```bash
# Create snapshot
ansible-playbook playbooks/backup_vm_snapshot.yml -e '{"target_vms": ["vm"]}'
# List snapshots
ssh grokbox "sudo virsh snapshot-list <vm>"
# Revert to snapshot
ssh grokbox "sudo virsh snapshot-revert <vm> <snapshot_name>"
# Delete snapshot
ssh grokbox "sudo virsh snapshot-delete <vm> <snapshot_name>"
```
### Docker Backup/Restore
```bash
# Backup
sudo tar -czf docker-backup.tar.gz /etc/docker /var/lib/docker/volumes
# Restore
sudo tar -xzf docker-backup.tar.gz -C /
```
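A backup you cannot list is not a backup, and verifying the archive is cheap. Sketch with a throwaway archive so it runs anywhere; substitute the real `docker-backup.tar.gz` in practice:

```shell
# Build a tiny stand-in archive, then verify it the same way you would the
# real docker-backup.tar.gz: a clean listing means the tarball is readable.
mkdir -p /tmp/demo-docker/etc/docker
echo '{}' > /tmp/demo-docker/etc/docker/daemon.json
tar -czf /tmp/docker-backup-demo.tar.gz -C /tmp/demo-docker etc
if tar -tzf /tmp/docker-backup-demo.tar.gz >/dev/null; then
  echo "archive OK"
fi
```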
### Service Verification
```bash
# Docker
systemctl status docker
docker info
docker ps
# Mailcow
cd /opt/mailcow-dockerized
docker-compose ps
docker-compose logs --tail 50
```
---
**Document End**
**Review Schedule:** Monthly
**Next Review:** 2025-12-11
**Approval:** Infrastructure Team Lead


@@ -0,0 +1,206 @@
---
# ==============================================================================
# VM Snapshot Backup Playbook
# ==============================================================================
# Create snapshots of VMs before risky operations
# Supports KVM/libvirt VMs via hypervisor connection
# ==============================================================================
- name: Create VM Snapshots for Backup
hosts: localhost
gather_facts: true
vars:
hypervisor_uri: "qemu+ssh://grok@grok.home.serneels.xyz/system"
snapshot_description: "Pre-maintenance backup"
snapshot_prefix: "backup"
target_vms: [] # Empty list means all running VMs
tasks:
- name: Display snapshot operation information
ansible.builtin.debug:
msg:
- "=== VM Snapshot Backup Operation ==="
- "Hypervisor: {{ hypervisor_uri }}"
- "Date: {{ ansible_date_time.iso8601 }}"
- "Target VMs: {{ target_vms | default('all running VMs') }}"
tags: [always]
- name: Validate target_vms variable
ansible.builtin.assert:
that:
- target_vms is defined
- target_vms is iterable
fail_msg: "target_vms must be a list of VM names"
tags: [always]
# ==========================================================================
# Get VM List
# ==========================================================================
- name: Get list of all running VMs
ansible.builtin.shell: |
ssh grokbox "sudo virsh list --name"
register: all_vms_raw
changed_when: false
when: target_vms | length == 0
tags: [discover]
- name: Parse running VMs list
ansible.builtin.set_fact:
discovered_vms: "{{ all_vms_raw.stdout_lines | select() | list }}"
when: target_vms | length == 0
tags: [discover]
- name: Set final VM list
ansible.builtin.set_fact:
vms_to_backup: "{{ target_vms if target_vms | length > 0 else discovered_vms }}"
tags: [discover]
- name: Display VMs to be backed up
ansible.builtin.debug:
msg: "VMs to backup: {{ vms_to_backup }}"
tags: [discover]
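# Manual equivalent of the discovery above (sketch; drops blank lines the
# same way the select() filter does):
#   ssh grokbox "sudo virsh list --name" | sed '/^[[:space:]]*$/d'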
# ==========================================================================
# Pre-flight Checks
# ==========================================================================
- name: Check if VMs exist and are running
ansible.builtin.shell: |
ssh grokbox "sudo virsh domstate {{ item }}"
register: vm_states
failed_when: vm_states.rc != 0
changed_when: false
loop: "{{ vms_to_backup }}"
tags: [validate]
- name: Verify all VMs are running
ansible.builtin.assert:
that:
- item.stdout == 'running'
fail_msg: "VM {{ item.item }} is not running (state: {{ item.stdout }})"
success_msg: "VM {{ item.item }} is running"
loop: "{{ vm_states.results }}"
tags: [validate]
- name: Check for existing snapshots
ansible.builtin.shell: |
ssh grokbox "sudo virsh snapshot-list {{ item }} --name"
register: existing_snapshots
changed_when: false
loop: "{{ vms_to_backup }}"
tags: [validate]
- name: Display existing snapshots
ansible.builtin.debug:
msg:
- "VM: {{ item.item }}"
- "Existing snapshots: {{ item.stdout_lines | join(', ') | default('none', true) }}"
loop: "{{ existing_snapshots.results }}"
tags: [validate]
# ==========================================================================
# Create Snapshots
# ==========================================================================
- name: Generate snapshot name with timestamp
ansible.builtin.set_fact:
snapshot_timestamp: "{{ ansible_date_time.epoch }}"
tags: [snapshot]
- name: Create VM snapshots
ansible.builtin.shell: |
ssh grokbox "sudo virsh snapshot-create-as {{ item }} \
--name '{{ snapshot_prefix }}_{{ snapshot_timestamp }}' \
--description '{{ snapshot_description }} - {{ ansible_date_time.iso8601 }}' \
--atomic"
register: snapshot_create
loop: "{{ vms_to_backup }}"
tags: [snapshot]
- name: Verify snapshot creation
ansible.builtin.shell: |
ssh grokbox "sudo virsh snapshot-info {{ item }} {{ snapshot_prefix }}_{{ snapshot_timestamp }}"
register: snapshot_info
changed_when: false
loop: "{{ vms_to_backup }}"
tags: [snapshot, verify]
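# Note: snapshot-create-as without --disk-only creates an internal snapshot,
# which requires qcow2-backed disks; on a running VM it also saves guest
# memory state. Manual spot-check (sketch):
#   ssh grokbox "sudo virsh snapshot-list <vm_name>"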
# ==========================================================================
# Generate Backup Report
# ==========================================================================
- name: Create backup report directory
ansible.builtin.file:
path: "./stats/vm_backups"
state: directory
mode: '0755'
tags: [report]
- name: Generate backup report
ansible.builtin.copy:
content: |
================================================================================
VM SNAPSHOT BACKUP REPORT
================================================================================
Date: {{ ansible_date_time.iso8601 }}
Hypervisor: {{ hypervisor_uri }}
Snapshot Name: {{ snapshot_prefix }}_{{ snapshot_timestamp }}
Description: {{ snapshot_description }}
VMs Backed Up:
{% for vm in vms_to_backup %}
- {{ vm }}
{% endfor %}
Snapshot Details:
{% for result in snapshot_info.results %}
VM: {{ result.item }}
{{ result.stdout }}
{% endfor %}
ROLLBACK INSTRUCTIONS
================================================================================
To restore a VM to this snapshot:
1. Stop the VM (if running):
ssh grokbox "sudo virsh shutdown <vm_name>"
2. Revert to snapshot:
ssh grokbox "sudo virsh snapshot-revert <vm_name> {{ snapshot_prefix }}_{{ snapshot_timestamp }}"
3. Start the VM:
ssh grokbox "sudo virsh start <vm_name>"
To delete this snapshot after verification:
ssh grokbox "sudo virsh snapshot-delete <vm_name> {{ snapshot_prefix }}_{{ snapshot_timestamp }}"
================================================================================
END OF REPORT
================================================================================
dest: "./stats/vm_backups/backup_{{ snapshot_timestamp }}.txt"
mode: '0644'
tags: [report]
# ==========================================================================
# Display Summary
# ==========================================================================
- name: Display backup summary
ansible.builtin.debug:
msg:
- "=== VM Snapshot Backup Complete ==="
- "Snapshot Name: {{ snapshot_prefix }}_{{ snapshot_timestamp }}"
- "VMs Backed Up: {{ vms_to_backup | length }}"
- "Backup Report: ./stats/vm_backups/backup_{{ snapshot_timestamp }}.txt"
- ""
- "⚠️ IMPORTANT NOTES:"
- "1. Snapshots are point-in-time copies"
- "2. Test restoration procedure before relying on snapshots"
- "3. Snapshots consume disk space - clean up old snapshots"
- "4. For critical changes, consider full VM backups"
- ""
- "To restore: virsh snapshot-revert <vm> {{ snapshot_prefix }}_{{ snapshot_timestamp }}"
tags: [always]
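# Housekeeping example (sketch): snapshot names embed the creation epoch, so
# sorting by name lists them chronologically before pruning old ones:
#   ssh grokbox "sudo virsh snapshot-list <vm_name> --name" | sort
#   ssh grokbox "sudo virsh snapshot-delete <vm_name> backup_<epoch>"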