# Docker Configuration Rollback Procedures

**Document Version:** 1.0
**Last Updated:** 2025-11-11
**Owner:** Infrastructure Team
**Risk Level:** HIGH - User Namespace Remapping / LOW - Resource Limits

---

## Table of Contents

1. [Overview](#overview)
2. [Pre-Change Requirements](#pre-change-requirements)
3. [Rollback Procedures](#rollback-procedures)
4. [Specific Scenarios](#specific-scenarios)
5. [Testing Rollback Procedures](#testing-rollback-procedures)
6. [Emergency Contacts](#emergency-contacts)
7. [Post-Rollback Actions](#post-rollback-actions)
8. [Preventive Measures](#preventive-measures)
9. [Appendix A: Quick Reference Commands](#appendix-a-quick-reference-commands)

---

## Overview

This runbook provides step-by-step rollback procedures for Docker configuration changes, with special focus on high-risk modifications like user namespace remapping.

### Risk Classification

| Change Type | Risk Level | Rollback Complexity | Downtime |
|-------------|-----------|---------------------|----------|
| Resource limits | LOW | Simple | < 1 min |
| Image version pinning | LOW | Simple | < 1 min |
| User namespace remapping | HIGH | Complex | 5-15 min |
| Network configuration | MEDIUM | Moderate | 2-5 min |
| Storage driver change | CRITICAL | Complex | 15-30 min |

---

## Pre-Change Requirements

### Before ANY Docker Configuration Change

**MANDATORY STEPS - DO NOT SKIP:**

1. **Create VM Snapshot**
   ```bash
   # From Ansible control node
   ansible-playbook playbooks/backup_vm_snapshot.yml \
     -e "target_vms=['pihole']" \
     -e "snapshot_description='Pre Docker config change'"
   ```

2. **Backup Docker Configuration**
   ```bash
   # On target host
   # Use one timestamp so the config copy and the archive can be matched later
   ts=$(date +%s)
   sudo cp /etc/docker/daemon.json "/etc/docker/daemon.json.backup.${ts}"
   sudo tar -czf "/root/docker-backup-${ts}.tar.gz" \
     /etc/docker \
     /var/lib/docker/volumes
   ```

3. **Document Current State**
   ```bash
   # Capture current container list
   docker ps -a --format "table {{.Names}}\t{{.Image}}\t{{.Status}}" > /tmp/containers-before.txt

   # Capture current configuration
   docker info > /tmp/docker-info-before.txt

   # Capture volume list
   docker volume ls > /tmp/volumes-before.txt
   ```

4. **Verify Connectivity**
   ```bash
   # Test from Ansible control node
   ansible pihole -m ping
   ansible pihole -m shell -a "docker ps"
   ```

5. **Schedule Maintenance Window**
   - Notify stakeholders
   - Plan for a 30-60 minute window
   - Have a second person available for verification

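
The rollback procedures in this runbook restore from `daemon.json.backup.<timestamp>`, so it can help to locate the newest backup without reading timestamps by eye. The helper below is a minimal sketch of one way to do that. The function name `latest_daemon_backup` is hypothetical, and the sketch assumes backups carry the `date +%s` epoch suffix used in step 2 above.

```shell
#!/usr/bin/env bash
# Print the newest daemon.json backup in a directory, judged by the
# numeric epoch suffix appended when the backup was taken.
# Hypothetical helper; not part of Docker or of the playbooks above.
latest_daemon_backup() {
    local dir="${1:-/etc/docker}"
    local newest="" newest_ts=-1 f ts
    for f in "$dir"/daemon.json.backup.*; do
        [ -e "$f" ] || continue          # glob matched nothing
        ts="${f##*.}"                    # text after the last dot
        case "$ts" in
            ''|*[!0-9]*) continue ;;     # skip non-numeric suffixes
        esac
        if [ "$ts" -gt "$newest_ts" ]; then
            newest_ts="$ts"
            newest="$f"
        fi
    done
    [ -n "$newest" ] || return 1         # no usable backup found
    printf '%s\n' "$newest"
}
```

A non-zero exit status means no timestamped backup was found, which is a reason to stop before making any change.
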
---

## Rollback Procedures

### Procedure 1: Quick Rollback (Resource Limits / Image Versions)

**Time Estimate:** 1-2 minutes
**Risk:** LOW
**Downtime:** < 1 minute per container

#### Steps

1. **Stop affected container**
   ```bash
   docker stop <container_name>
   ```

2. **Restore previous configuration**
   ```bash
   # For docker run commands:
   # remove the stopped container (docker rm <container_name>),
   # then re-run with the old parameters

   # For docker-compose (assumes the change was the most recent commit)
   git checkout HEAD~1 -- docker-compose.yml
   docker-compose up -d <service_name>
   ```

3. **Verify service**
   ```bash
   docker ps | grep <container_name>
   docker logs <container_name> --tail 50

   # Test application functionality
   curl -I http://<service_url>
   ```

#### Success Criteria
- Container running
- Logs show normal operation
- Service accessible
- No errors in `docker logs`

---

### Procedure 2: Daemon Configuration Rollback (Non-Breaking Changes)

**Time Estimate:** 3-5 minutes
**Risk:** MEDIUM
**Downtime:** 2-3 minutes

#### Steps

1. **Restore daemon.json**
   ```bash
   sudo cp /etc/docker/daemon.json.backup.<timestamp> /etc/docker/daemon.json
   ```

2. **Restart Docker daemon**
   ```bash
   sudo systemctl restart docker
   ```

3. **Verify Docker is running**
   ```bash
   sudo systemctl status docker
   docker info
   ```

4. **Check all containers**
   ```bash
   docker ps -a

   # Restart any stopped containers
   # (this also starts containers that were stopped on purpose;
   # compare with /tmp/containers-before.txt)
   docker start $(docker ps -aq --filter status=exited)
   ```

5. **Verify services**
   ```bash
   # Test each service
   docker logs <container> --tail 20
   ```

#### Success Criteria
- Docker daemon running
- All containers started
- Services accessible
- No errors in `journalctl -u docker`

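
To check the "All containers started" criterion against the state captured before the change, the saved table in `/tmp/containers-before.txt` can be compared with the names that are running now. This is a rough sketch under stated assumptions: the function name `missing_containers` is hypothetical, the "before" file is assumed to have the `NAMES IMAGE STATUS` table layout from the pre-change step, and the second file is assumed to hold one running container name per line.

```shell
#!/usr/bin/env bash
# Print containers that were "Up" in the saved table but are absent from
# a file of currently running names (one name per line).
# Hypothetical helper; file layouts are assumptions described above.
missing_containers() {
    local before_file="$1" running_file="$2"
    # Skip the header row, keep rows whose status starts with "Up",
    # then drop every name that appears in the running list.
    awk 'FNR > 1 && $3 == "Up" { print $1 }' "$before_file" |
        grep -vxF -f "$running_file" || true
}

# Example (run on the Docker host):
#   docker ps --format '{{.Names}}' > /tmp/running-now.txt
#   missing_containers /tmp/containers-before.txt /tmp/running-now.txt
```

Empty output means every container that was running before the change is running again.
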
---

### Procedure 3: User Namespace Remapping Rollback (HIGH RISK)

**Time Estimate:** 10-15 minutes
**Risk:** HIGH
**Downtime:** 10-15 minutes
**Data Loss Risk:** LOW (if volumes backed up)

⚠️ **WARNING:** This is the most complex rollback. Follow carefully.

#### Pre-Rollback Verification

```bash
# Verify snapshot exists
ssh grokbox "sudo virsh snapshot-list <vm_name>"

# Verify backup archive exists
ls -lh /root/docker-backup-*.tar.gz
```

#### Steps

1. **Stop all containers gracefully**
   ```bash
   # Mailcow example
   cd /opt/mailcow-dockerized
   docker-compose down

   # Or generic
   docker stop $(docker ps -q)
   ```

2. **Stop Docker daemon**
   ```bash
   sudo systemctl stop docker
   ```

3. **Restore daemon.json (remove userns-remap)**
   ```bash
   sudo cp /etc/docker/daemon.json.backup.<timestamp> /etc/docker/daemon.json

   # Verify userns-remap is removed (no output expected)
   grep -i userns /etc/docker/daemon.json
   ```

4. **CRITICAL: Handle user namespace volume mappings**
   ```bash
   # User namespaced volumes are in a different location
   # /var/lib/docker/<uid>.<gid>/volumes/

   # List namespaced volumes
   sudo ls -la /var/lib/docker/*/volumes/

   # Copy volumes back to main location (if needed)
   # Copied files keep their remapped UIDs/GIDs; see Scenario A if
   # containers report permission errors afterwards
   sudo rsync -av /var/lib/docker/*/volumes/* /var/lib/docker/volumes/
   ```

5. **Start Docker daemon**
   ```bash
   sudo systemctl start docker
   sudo systemctl status docker
   ```

6. **Verify Docker info**
   ```bash
   docker info | grep -i "userns"
   # Should NOT show user namespace remapping (no output expected)
   ```

7. **Recreate containers**
   ```bash
   # Mailcow example
   cd /opt/mailcow-dockerized
   docker-compose up -d

   # Wait for all containers to start
   watch -n 2 'docker ps --format "table {{.Names}}\t{{.Status}}"'
   ```

8. **Verify all services**
   ```bash
   # Check container logs
   docker-compose logs --tail 50

   # Test services
   curl -I https://cow.mymx.me

   # Verify email functionality (mailcow)
   docker-compose exec postfix-mailcow postqueue -p
   ```

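
Step 3 relies on reading `grep` output by eye, where success is the absence of output. A scripted version of the same check might look like the sketch below. The function name `userns_remap_removed` is hypothetical, and this is only a text-level check for the key name: it does not parse JSON, so a key that is present with an empty value is still reported as present.

```shell
#!/usr/bin/env bash
# Return 0 when the given daemon.json no longer mentions "userns-remap",
# 1 when the key is still present, 2 when the file cannot be read.
# Hypothetical helper; a plain text match, not a JSON parser.
userns_remap_removed() {
    local config="${1:-/etc/docker/daemon.json}"
    [ -r "$config" ] || return 2
    if grep -q '"userns-remap"' "$config"; then
        return 1
    fi
    return 0
}

# Example:
#   userns_remap_removed /etc/docker/daemon.json && echo "safe to start Docker"
```
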
#### If Rollback Fails: VM Snapshot Restore

```bash
# From Ansible control node or directly on hypervisor

# 1. Shutdown VM
ssh grokbox "sudo virsh shutdown <vm_name>"

# 2. Wait for shutdown (allow up to 60 seconds), then check the state
sleep 30
ssh grokbox "sudo virsh domstate <vm_name>"

# 3. Force stop only if the VM is still running
ssh grokbox "sudo virsh destroy <vm_name>"

# 4. Revert to snapshot
ssh grokbox "sudo virsh snapshot-revert <vm_name> backup_<timestamp>"

# 5. Start VM
ssh grokbox "sudo virsh start <vm_name>"

# 6. Verify SSH access (may take 1-2 minutes)
ansible <vm_name> -m ping

# 7. Verify services
ansible <vm_name> -m shell -a "docker ps"
```

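
The fixed `sleep` in step 2 either waits longer than necessary or not long enough. One alternative is to poll until a condition holds or a timeout expires, and force-stop only on timeout. The sketch below shows a generic polling helper; `wait_until` and `vm_is_off` are hypothetical names, and the commented usage assumes `virsh domstate` reports `shut off` for a stopped VM.

```shell
#!/usr/bin/env bash
# Poll a command until it succeeds or a timeout (in seconds) expires.
# Usage: wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 once TIMEOUT is reached.
wait_until() {
    local timeout="$1" interval="$2" waited=0
    shift 2
    until "$@"; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Example (hypothetical wrapper around the commands used above):
#   vm_is_off() { ssh grokbox "sudo virsh domstate $1" | grep -q 'shut off'; }
#   wait_until 60 5 vm_is_off "<vm_name>" \
#       || ssh grokbox "sudo virsh destroy <vm_name>"
```
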
#### Success Criteria
- Docker daemon running WITHOUT user namespace remapping
- All containers running
- All services accessible
- Volume data intact
- No permission errors in logs

---

## Specific Scenarios

### Scenario A: Mailcow Container Won't Start After Namespace Change

**Symptoms:**
- Containers exit immediately
- Permission denied errors in logs
- Volume mount failures

**Solution:**
```bash
# 1. Check volume permissions
docker run --rm -v mailcowdockerized_vmail-vol-1:/volume alpine ls -la /volume

# 2. Fix permissions if needed (DANGEROUS - only if you know the UID mapping)
# This example assumes the remapped range starts at 165536; confirm the
# actual start of the range in /etc/subuid and /etc/subgid first
sudo chown -R 165536:165536 /var/lib/docker/volumes/mailcowdockerized_vmail-vol-1

# 3. If permissions are unfixable, revert to snapshot
# See "If Rollback Fails: VM Snapshot Restore" above
```

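
The `chown` in step 2 is only correct if 165536 really is the start of the remapped range on the host. The sketch below reads that value from a subordinate ID file instead of assuming it. The function name `subid_start` is hypothetical; the `name:start:count` line format is the standard `/etc/subuid` and `/etc/subgid` layout, and `dockremap` is the user Docker creates when `userns-remap` is set to `default`.

```shell
#!/usr/bin/env bash
# Print the first subordinate ID assigned to a user in a subuid/subgid
# file, whose lines have the form name:start:count.
# Exits non-zero when the user has no entry. Hypothetical helper.
subid_start() {
    local user="$1" file="${2:-/etc/subuid}"
    awk -F: -v u="$user" '
        $1 == u { print $2; found = 1; exit }
        END     { exit !found }
    ' "$file"
}

# Example:
#   uid_start=$(subid_start dockremap /etc/subuid)
#   gid_start=$(subid_start dockremap /etc/subgid)
```
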
### Scenario B: Docker Daemon Won't Start After Config Change

**Symptoms:**
- `systemctl start docker` fails
- Errors in `journalctl -u docker`

**Solution:**
```bash
# 1. Check exact error
sudo journalctl -u docker -n 50 --no-pager

# 2. Validate daemon.json syntax
sudo jq '.' /etc/docker/daemon.json

# 3. If syntax error, restore backup
sudo cp /etc/docker/daemon.json.backup.<timestamp> /etc/docker/daemon.json

# 4. If the syntax is valid but options conflict, validate the configuration
sudo dockerd --validate --config-file /etc/docker/daemon.json

# 5. Start daemon
sudo systemctl start docker
```

### Scenario C: Data Loss After Namespace Change

**Symptoms:**
- Volumes appear empty
- Database containers can't find data
- Application state lost

**Solution:**
```bash
# 1. STOP - Do not proceed with data recovery attempts
# 2. DO NOT restart containers
# 3. Immediately revert to snapshot

ssh grokbox "sudo virsh snapshot-revert <vm_name> backup_<timestamp>"

# 4. After VM restore, verify data
docker exec <database_container> <verification_command>

# Example for MySQL
docker exec mailcowdockerized-mysql-mailcow-1 mysql -u root -p<password> -e "SHOW DATABASES;"
```

---

## Testing Rollback Procedures

### Monthly Rollback Drill

**Schedule:** First Monday of each month
**Duration:** 30 minutes
**Environment:** Development/Test VMs only

#### Drill Steps

1. **Create a test VM or use the `derp` VM**
   ```bash
   # Deploy test container
   docker run -d --name test-nginx nginx:latest
   ```

2. **Create snapshot**
   ```bash
   ansible-playbook playbooks/backup_vm_snapshot.yml \
     -e "target_vms=['test-vm']"
   ```

3. **Make intentional breaking change**
   ```bash
   # Keep a copy of the working config first (step 4 restores it)
   sudo cp /etc/docker/daemon.json /etc/docker/daemon.json.backup.$(date +%s)

   # Break Docker config
   echo '{"invalid": json}' | sudo tee /etc/docker/daemon.json
   sudo systemctl restart docker  # This will fail
   ```

4. **Practice rollback**
   ```bash
   # Follow Procedure 2 above
   sudo cp /etc/docker/daemon.json.backup.<timestamp> /etc/docker/daemon.json
   sudo systemctl start docker
   ```

5. **Practice snapshot restore**
   ```bash
   # Follow VM Snapshot Restore procedure
   ssh grokbox "sudo virsh snapshot-revert test-vm backup_<timestamp>"
   ```

6. **Document issues found**
   - Update this runbook
   - Note any steps that were unclear
   - Time each procedure

---

## Emergency Contacts

### Escalation Path

| Level | Contact | Response Time | Responsibility |
|-------|---------|---------------|----------------|
| L1 | Infrastructure Team | Immediate | Execute runbook |
| L2 | Senior Sysadmin | 15 minutes | Complex issues |
| L3 | Vendor Support | 1-4 hours | Critical failures |

### Service-Specific Contacts

**Mailcow:**
- Documentation: https://docs.mailcow.email/
- Community: https://community.mailcow.email/
- Emergency: Check for known issues on GitHub

**Docker:**
- Documentation: https://docs.docker.com/
- Community Forums: https://forums.docker.com/

---

## Post-Rollback Actions

### After Any Rollback

1. **Update incident log**
   ```markdown
   Date: <timestamp>
   VM: <vm_name>
   Change Attempted: <description>
   Rollback Procedure Used: <procedure_number>
   Success: Yes/No
   Time to Restore: <minutes>
   Issues Encountered: <list>
   ```

2. **Verify service monitoring**
   - Check all alerts cleared
   - Verify metrics returning to normal
   - Test service endpoints

3. **Document lessons learned**
   - What went wrong?
   - What could be improved?
   - Update this runbook

4. **Schedule post-mortem** (for critical incidents)
   - Within 48 hours
   - All stakeholders present
   - Action items assigned

5. **Update change management records**
   - Mark change as rolled back
   - Document reason for failure
   - Plan for retry (if applicable)

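
If the incident log from step 1 is kept as a plain text file, entries can be appended with a small helper so that the field names stay consistent between incidents. This is only a sketch: the function name `append_incident_entry` and the log file location are hypothetical, and the fields mirror the template in step 1.

```shell
#!/usr/bin/env bash
# Append one incident entry using the field names from the template.
# Arguments: LOG_FILE VM CHANGE PROCEDURE SUCCESS MINUTES ISSUES
# Hypothetical helper; all values are free text supplied by the operator.
append_incident_entry() {
    local log_file="$1" vm="$2" change="$3" procedure="$4"
    local success="$5" minutes="$6" issues="$7"
    {
        printf 'Date: %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')"
        printf 'VM: %s\n' "$vm"
        printf 'Change Attempted: %s\n' "$change"
        printf 'Rollback Procedure Used: %s\n' "$procedure"
        printf 'Success: %s\n' "$success"
        printf 'Time to Restore: %s\n' "$minutes"
        printf 'Issues Encountered: %s\n\n' "$issues"
    } >> "$log_file"
}

# Example:
#   append_incident_entry ./incidents.log pihole "userns-remap" 3 Yes 12 "none"
```
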
---

## Preventive Measures

### Before Making High-Risk Changes

1. **Test in development first**
   - Use the `derp` VM or a test environment
   - Replicate production as closely as possible
   - Document exact steps that work

2. **Review Docker/Mailcow changelogs**
   - Check for known issues
   - Review breaking changes
   - Search community forums

3. **Peer review change plan**
   - Have a colleague review the procedure
   - Walk through rollback steps
   - Verify backup procedures

4. **Schedule during low-traffic period**
   - Weekend or late evening
   - Notify users in advance
   - Have monitoring ready

---

## Appendix A: Quick Reference Commands

### Snapshot Management
```bash
# Create snapshot
ansible-playbook playbooks/backup_vm_snapshot.yml -e "target_vms=['vm']"

# List snapshots
ssh grokbox "sudo virsh snapshot-list <vm>"

# Revert to snapshot
ssh grokbox "sudo virsh snapshot-revert <vm> <snapshot_name>"

# Delete snapshot
ssh grokbox "sudo virsh snapshot-delete <vm> <snapshot_name>"
```

### Docker Backup/Restore
```bash
# Backup
sudo tar -czf docker-backup.tar.gz /etc/docker /var/lib/docker/volumes

# Restore
sudo tar -xzf docker-backup.tar.gz -C /
```

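
Before restoring with `tar -xzf ... -C /`, it may be worth confirming that the archive is readable and actually contains the daemon configuration. The sketch below does that by listing the archive. The function name `archive_has_daemon_json` is hypothetical, and it assumes the archive was created from absolute paths as in the backup command above, in which case `tar` stores members without the leading `/`.

```shell
#!/usr/bin/env bash
# Return 0 when a gzip-compressed tar archive can be listed and contains
# etc/docker/daemon.json (tar drops the leading "/" when it archives
# absolute paths). Returns non-zero otherwise. Hypothetical helper.
archive_has_daemon_json() {
    local archive="$1"
    tar -tzf "$archive" 2>/dev/null | grep -qx 'etc/docker/daemon.json'
}

# Example:
#   archive_has_daemon_json docker-backup.tar.gz \
#       && sudo tar -xzf docker-backup.tar.gz -C /
```
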
### Service Verification
```bash
# Docker
systemctl status docker
docker info
docker ps

# Mailcow
cd /opt/mailcow-dockerized
docker-compose ps
docker-compose logs --tail 50
```

---

**Document End**

**Review Schedule:** Monthly
**Next Review:** 2025-12-11
**Approval:** Infrastructure Team Lead