Docker Configuration Rollback Procedures
Document Version: 1.0
Last Updated: 2025-11-11
Owner: Infrastructure Team
Risk Level: HIGH - User Namespace Remapping / LOW - Resource Limits
Overview
This runbook provides step-by-step rollback procedures for Docker configuration changes, with special focus on high-risk modifications like user namespace remapping.
Risk Classification
| Change Type | Risk Level | Rollback Complexity | Downtime |
|---|---|---|---|
| Resource limits | LOW | Simple | < 1 min |
| Image version pinning | LOW | Simple | < 1 min |
| User namespace remapping | HIGH | Complex | 5-15 min |
| Network configuration | MEDIUM | Moderate | 2-5 min |
| Storage driver change | CRITICAL | Complex | 15-30 min |
Pre-Change Requirements
Before ANY Docker Configuration Change
MANDATORY STEPS - DO NOT SKIP:
- Create VM Snapshot
  # From Ansible control node
  ansible-playbook playbooks/backup_vm_snapshot.yml \
    -e "target_vms=['pihole']" \
    -e "snapshot_description='Pre Docker config change'"
- Backup Docker Configuration
  # On target host
  sudo cp /etc/docker/daemon.json /etc/docker/daemon.json.backup.$(date +%s)
  sudo tar -czf /root/docker-backup-$(date +%s).tar.gz \
    /etc/docker \
    /var/lib/docker/volumes
- Document Current State
  # Capture current container list
  docker ps -a --format "table {{.Names}}\t{{.Image}}\t{{.Status}}" > /tmp/containers-before.txt
  # Capture current configuration
  docker info > /tmp/docker-info-before.txt
  # Capture volume list
  docker volume ls > /tmp/volumes-before.txt
- Verify Connectivity
  # Test from Ansible control node
  ansible pihole -m ping
  ansible pihole -m shell -a "docker ps"
- Schedule Maintenance Window
  - Notify stakeholders
  - Plan for 30-60 minute window
  - Have second person available for verification
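Because every rollback below depends on a `daemon.json.backup.<timestamp>` file, it can help to confirm one exists before changing anything. The following is a minimal sketch, assuming the backups carry the epoch-seconds suffix produced by the backup step above; the function name `latest_backup` is hypothetical and not part of any existing tooling.

```shell
#!/bin/sh
# Print the newest daemon.json backup in a directory, or return 1 if none exists.
# Assumes backups are named daemon.json.backup.<epoch-seconds>.
latest_backup() {
    dir=${1:-/etc/docker}
    newest=""
    newest_ts=0
    for f in "$dir"/daemon.json.backup.*; do
        [ -e "$f" ] || continue          # glob matched nothing
        ts=${f##*.backup.}               # text after ".backup."
        case $ts in
            ''|*[!0-9]*) continue ;;     # skip suffixes that are not numeric
        esac
        if [ "$ts" -gt "$newest_ts" ]; then
            newest_ts=$ts
            newest=$f
        fi
    done
    [ -n "$newest" ] || return 1
    printf '%s\n' "$newest"
}

# Example (not executed here):
#   latest_backup /etc/docker || echo "No backup found - do not proceed" >&2
```

Comparing the numeric suffix rather than file modification times keeps the result stable if a backup is later copied or touched.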
Rollback Procedures
Procedure 1: Quick Rollback (Resource Limits / Image Versions)
Time Estimate: 1-2 minutes
Risk: LOW
Downtime: < 1 minute per container
Steps
- Stop affected container
  docker stop <container_name>
- Restore previous configuration
  # For docker run commands: re-run with the old parameters
  # For docker-compose: restore the previous file from Git
  # (HEAD~1 assumes the change was the most recent commit)
  git checkout HEAD~1 -- docker-compose.yml
  docker-compose up -d <container_name>
- Verify service
  docker ps | grep <container_name>
  docker logs <container_name> --tail 50
  # Test application functionality
  curl -I http://<service_url>
Success Criteria
- Container running
- Logs show normal operation
- Service accessible
- No errors in docker logs
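The "Container running" criterion can be checked without reading `docker ps` output by eye. This is a small sketch using `docker inspect` with a Go template; the helper name `container_running` is an assumption made for illustration.

```shell
#!/bin/sh
# Return 0 only if the named container exists and reports State.Running=true.
container_running() {
    state=$(docker inspect -f '{{.State.Running}}' "$1" 2>/dev/null) || return 1
    [ "$state" = "true" ]
}

# Example (not executed here):
#   container_running <container_name> && echo "running" || echo "NOT running"
```

A container that is restarting in a loop can briefly report as running, so the log and service checks listed above still apply.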
Procedure 2: Daemon Configuration Rollback (Non-Breaking Changes)
Time Estimate: 3-5 minutes
Risk: MEDIUM
Downtime: 2-3 minutes
Steps
- Restore daemon.json
  sudo cp /etc/docker/daemon.json.backup.<timestamp> /etc/docker/daemon.json
- Restart Docker daemon
  sudo systemctl restart docker
- Verify Docker is running
  sudo systemctl status docker
  docker info
- Check all containers
  docker ps -a
  # Restart containers left in the exited state
  # (review the list first: this also starts containers that were stopped on purpose)
  docker start $(docker ps -aq --filter status=exited)
- Verify services
  # Test each service
  docker logs <container> --tail 20
Success Criteria
- Docker daemon running
- All containers started
- Services accessible
- No errors in journalctl -u docker
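When restoring `daemon.json`, it can be useful to keep the rejected configuration for the post-mortem instead of overwriting it. The sketch below does that; the `.failed.<epoch>` suffix and the function name `restore_daemon_json` are assumptions for illustration, and on a real host the call would need root privileges.

```shell
#!/bin/sh
# Restore a daemon.json backup, keeping the replaced config for later review.
# Both paths are parameters so the function can be tried on scratch files first.
restore_daemon_json() {
    backup=$1
    target=${2:-/etc/docker/daemon.json}
    [ -f "$backup" ] || { echo "backup not found: $backup" >&2; return 1; }
    if [ -f "$target" ]; then
        # Preserve the config being rolled back (hypothetical naming scheme)
        cp -p "$target" "$target.failed.$(date +%s)" || return 1
    fi
    cp -p "$backup" "$target"
}

# Example (not executed here, run as root):
#   restore_daemon_json /etc/docker/daemon.json.backup.<timestamp>
```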
Procedure 3: User Namespace Remapping Rollback (HIGH RISK)
Time Estimate: 10-15 minutes
Risk: HIGH
Downtime: 10-15 minutes
Data Loss Risk: LOW (if volumes backed up)
⚠️ WARNING: This is the most complex rollback. Follow carefully.
Pre-Rollback Verification
# Verify snapshot exists
ssh grokbox "sudo virsh snapshot-list <vm_name>"
# Verify backup archive exists (run the glob as root; /root is not readable otherwise)
sudo sh -c 'ls -lh /root/docker-backup-*.tar.gz'
Steps
- Stop all containers gracefully
  # Mailcow example
  cd /opt/mailcow-dockerized
  docker-compose down
  # Or generic
  docker stop $(docker ps -q)
- Stop Docker daemon
  sudo systemctl stop docker
- Restore daemon.json (remove userns-remap)
  sudo cp /etc/docker/daemon.json.backup.<timestamp> /etc/docker/daemon.json
  # Verify userns-remap is removed (no output expected)
  grep -i userns /etc/docker/daemon.json
- CRITICAL: Handle user namespace volume mappings
  # With remapping enabled, volumes live in a separate directory:
  # /var/lib/docker/<uid>.<gid>/volumes/
  # List namespaced volumes (sh -c so the glob expands as root)
  sudo sh -c 'ls -la /var/lib/docker/*/volumes/'
  # Copy volumes back to main location (if needed)
  # Note: rsync -a preserves the remapped UID/GID ownership, so ownership
  # may need correcting before non-remapped containers can use the files
  sudo sh -c 'rsync -av /var/lib/docker/*/volumes/* /var/lib/docker/volumes/'
- Start Docker daemon
  sudo systemctl start docker
  sudo systemctl status docker
- Verify Docker info
  docker info | grep -i "userns"
  # Should NOT show user namespace remapping
- Recreate containers
  # Mailcow example
  cd /opt/mailcow-dockerized
  docker-compose up -d
  # Wait for all containers to start
  watch -n 2 'docker ps --format "table {{.Names}}\t{{.Status}}"'
- Verify all services
  # Check container logs
  docker-compose logs --tail 50
  # Test services
  curl -I https://cow.mymx.me
  # Verify email functionality (mailcow)
  docker-compose exec postfix-mailcow postqueue -p
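The `grep -i userns` check in the steps above relies on reading the output by eye. A stricter variant is sketched below: it succeeds only when `userns-remap` is set to a non-empty value. It is a line-based pattern match, not a JSON parser, so it assumes the key and its value sit on the same line; the function name `userns_remap_enabled` is hypothetical.

```shell
#!/bin/sh
# Return 0 if the config file sets "userns-remap" to a non-empty string.
# Line-based match only; assumes key and value are on one line.
userns_remap_enabled() {
    grep -Eq '"userns-remap"[[:space:]]*:[[:space:]]*"[^"]+"' \
        "${1:-/etc/docker/daemon.json}"
}

# Example (not executed here):
#   if userns_remap_enabled; then
#       echo "userns-remap still configured - rollback incomplete" >&2
#   fi
```

Remapping can also be enabled through a `dockerd` command-line flag, so `docker info` on the running daemon remains the authoritative check.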
If Rollback Fails: VM Snapshot Restore
# From Ansible control node or directly on hypervisor
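# 1. Shutdown VM
ssh grokbox "sudo virsh shutdown <vm_name>"
# 2. Wait for shutdown (max 60 seconds), then check the state;
#    repeat the check until it reports "shut off"
sleep 30
ssh grokbox "sudo virsh domstate <vm_name>"
# 3. Force stop only if the VM is still running after the wait
ssh grokbox "sudo virsh destroy <vm_name>"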
# 4. Revert to snapshot
ssh grokbox "sudo virsh snapshot-revert <vm_name> backup_<timestamp>"
# 5. Start VM
ssh grokbox "sudo virsh start <vm_name>"
# 6. Verify SSH access (may take 1-2 minutes)
ansible <vm_name> -m ping
# 7. Verify services
ansible <vm_name> -m shell -a "docker ps"
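A fixed `sleep` either waits too long or not long enough. One way to handle the shutdown wait is to poll the domain state until it reads `shut off` and fall back to a forced stop only on timeout. The helper below is a generic sketch; `wait_for_state` is a hypothetical name, and the command to poll is passed in as arguments.

```shell
#!/bin/sh
# Poll a command once per second until its output equals the expected string
# or the timeout (in seconds) is reached.
# Usage: wait_for_state <expected> <timeout_seconds> <command> [args...]
wait_for_state() {
    expected=$1
    timeout=$2
    shift 2
    elapsed=0
    while :; do
        [ "$("$@" 2>/dev/null)" = "$expected" ] && return 0
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
}

# Example (not executed here): force stop only if a clean shutdown times out
#   wait_for_state "shut off" 60 ssh grokbox "sudo virsh domstate <vm_name>" \
#       || ssh grokbox "sudo virsh destroy <vm_name>"
```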
Success Criteria
- Docker daemon running WITHOUT user namespace remapping
- All containers running
- All services accessible
- Volume data intact
- No permission errors in logs
Specific Scenarios
Scenario A: Mailcow Container Won't Start After Namespace Change
Symptoms:
- Containers exit immediately
- Permission denied errors in logs
- Volume mount failures
Solution:
# 1. Check volume permissions
docker run --rm -v mailcowdockerized_vmail-vol-1:/volume alpine ls -la /volume
# 2. Fix permissions if needed (DANGEROUS - only if you know UID mapping)
# This example assumes the remap range starts at 165536; confirm the actual
# start value for the remap user in /etc/subuid before running it
sudo chown -R 165536:165536 /var/lib/docker/volumes/mailcowdockerized_vmail-vol-1
# 3. If permissions are unfixable, revert to snapshot
# See "VM Snapshot Restore" above
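The 165536 offset used above is only an example; the real value is the start of the subordinate UID range listed for the remap user in `/etc/subuid` (lines of the form `name:start:count`). The sketch below reads it, assuming the remap user is `dockremap`, which is the user Docker creates when `userns-remap` is set to `default`. The function name `subuid_start` is hypothetical.

```shell
#!/bin/sh
# Print the first subordinate UID assigned to a user in a subuid-format file.
# Returns 1 if the user has no entry.
subuid_start() {
    user=${1:-dockremap}
    file=${2:-/etc/subuid}
    awk -F: -v u="$user" \
        '$1 == u { print $2; found = 1; exit } END { exit !found }' "$file"
}

# Example (not executed here):
#   offset=$(subuid_start dockremap) && echo "container root maps to host UID $offset"
```

The same lookup against `/etc/subgid` gives the matching group offset, which is not guaranteed to be identical.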
Scenario B: Docker Daemon Won't Start After Config Change
Symptoms:
- systemctl start docker fails
- Errors in journalctl -u docker
Solution:
# 1. Check exact error
sudo journalctl -u docker -n 50 --no-pager
# 2. Validate daemon.json syntax
sudo jq . /etc/docker/daemon.json
# 3. If syntax error, restore backup
sudo cp /etc/docker/daemon.json.backup.<timestamp> /etc/docker/daemon.json
# 4. If the syntax is valid, check for configuration conflicts
#    (requires a Docker Engine release that supports --validate)
sudo dockerd --validate --config-file /etc/docker/daemon.json
# 5. Start daemon
sudo systemctl start docker
Scenario C: Data Loss After Namespace Change
Symptoms:
- Volumes appear empty
- Database containers can't find data
- Application state lost
Solution:
# 1. STOP - Do not proceed with data recovery attempts
# 2. DO NOT restart containers
# 3. Immediately revert to snapshot
ssh grokbox "sudo virsh snapshot-revert <vm_name> backup_<timestamp>"
# 4. After VM restore, verify data
docker exec <database_container> <verification_command>
# Example for MySQL
docker exec mailcowdockerized-mysql-mailcow-1 mysql -u root -p<password> -e "SHOW DATABASES;"
Testing Rollback Procedures
Monthly Rollback Drill
Schedule: First Monday of each month
Duration: 30 minutes
Environment: Development/Test VMs only
Drill Steps
- Create test VM or use derp
  # Deploy test container
  docker run -d --name test-nginx nginx:latest
- Create snapshot
  ansible-playbook playbooks/backup_vm_snapshot.yml \
    -e "target_vms=['test-vm']"
- Make intentional breaking change
  # Back up the working config first so the rollback step has a file to restore
  sudo cp /etc/docker/daemon.json /etc/docker/daemon.json.backup.$(date +%s)
  # Break Docker config
  echo '{"invalid": json}' | sudo tee /etc/docker/daemon.json
  sudo systemctl restart docker  # This will fail
- Practice rollback
  # Follow Procedure 2 above
  sudo cp /etc/docker/daemon.json.backup.<timestamp> /etc/docker/daemon.json
  sudo systemctl start docker
- Practice snapshot restore
  # Follow VM Snapshot Restore procedure
  ssh grokbox "sudo virsh snapshot-revert test-vm backup_<timestamp>"
- Document issues found
  - Update this runbook
  - Note any steps that were unclear
  - Time each procedure
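For the "Time each procedure" item, a small wrapper can record how long each drill command takes. This is a sketch only; `timed` is a hypothetical helper, and it reports whole seconds because it relies on `date +%s`.

```shell
#!/bin/sh
# Run a command, report its duration and exit status on stderr,
# and return the command's own exit status.
# Usage: timed <label> <command> [args...]
timed() {
    label=$1
    shift
    start=$(date +%s)
    status=0
    "$@" || status=$?
    end=$(date +%s)
    echo "$label: $((end - start))s (exit $status)" >&2
    return "$status"
}

# Example (not executed here):
#   timed "daemon restart" sudo systemctl restart docker
```

Redirecting stderr to a file during the drill leaves a record that can be compared against the time estimates in this runbook.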
Emergency Contacts
Escalation Path
| Level | Contact | Response Time | Responsibility |
|---|---|---|---|
| L1 | Infrastructure Team | Immediate | Execute runbook |
| L2 | Senior Sysadmin | 15 minutes | Complex issues |
| L3 | Vendor Support | 1-4 hours | Critical failures |
Service-Specific Contacts
Mailcow:
- Documentation: https://docs.mailcow.email/
- Community: https://community.mailcow.email/
- Emergency: Check for known issues in GitHub
Docker:
- Documentation: https://docs.docker.com/
- Community Forums: https://forums.docker.com/
Post-Rollback Actions
After Any Rollback
- Update incident log
  Date: <timestamp>
  VM: <vm_name>
  Change Attempted: <description>
  Rollback Procedure Used: <procedure_number>
  Success: Yes/No
  Time to Restore: <minutes>
  Issues Encountered: <list>
- Verify service monitoring
  - Check all alerts cleared
  - Verify metrics returning to normal
  - Test service endpoints
- Document lessons learned
  - What went wrong?
  - What could be improved?
  - Update this runbook
- Schedule post-mortem (for critical incidents)
  - Within 48 hours
  - All stakeholders present
  - Action items assigned
- Update change management records
  - Mark change as rolled back
  - Document reason for failure
  - Plan for retry (if applicable)
Preventive Measures
Before Making High-Risk Changes
- Test in development first
  - Use derp VM or test environment
  - Replicate production as closely as possible
  - Document exact steps that work
- Review Docker/Mailcow changelogs
  - Check for known issues
  - Review breaking changes
  - Search community forums
- Peer review change plan
  - Have colleague review procedure
  - Walk through rollback steps
  - Verify backup procedures
- Schedule during low-traffic period
  - Weekend or late evening
  - Notify users in advance
  - Have monitoring ready
Appendix A: Quick Reference Commands
Snapshot Management
# Create snapshot
ansible-playbook playbooks/backup_vm_snapshot.yml -e "target_vms=['vm']"
# List snapshots
ssh grokbox "sudo virsh snapshot-list <vm>"
# Revert to snapshot
ssh grokbox "sudo virsh snapshot-revert <vm> <snapshot_name>"
# Delete snapshot
ssh grokbox "sudo virsh snapshot-delete <vm> <snapshot_name>"
Docker Backup/Restore
# Backup
sudo tar -czf docker-backup.tar.gz /etc/docker /var/lib/docker/volumes
# Restore (stop Docker first so volume data is not modified during extraction)
sudo tar -xzf docker-backup.tar.gz -C /
Service Verification
# Docker
systemctl status docker
docker info
docker ps
# Mailcow
cd /opt/mailcow-dockerized
docker-compose ps
docker-compose logs --tail 50
Document End
Review Schedule: Monthly
Next Review: 2025-12-11
Approval: Infrastructure Team Lead