Add Docker user namespace testing guide, rollback runbook, and VM backup playbook
- Add comprehensive Docker user namespace testing documentation
- Add Docker configuration rollback runbook for disaster recovery
- Add VM snapshot backup playbook for system protection

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
762
docs/docker-userns-testing-guide.md
Normal file
@@ -0,0 +1,762 @@
# Docker User Namespace Remapping - Testing and Implementation Guide
|
||||
|
||||
**Document Version:** 1.0
|
||||
**Last Updated:** 2025-11-11
|
||||
**Risk Level:** HIGH
|
||||
**Testing Required:** YES (Mandatory in dev/test first)
|
||||
|
||||
---
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [Overview](#overview)
|
||||
2. [Security Benefits](#security-benefits)
|
||||
3. [Prerequisites](#prerequisites)
|
||||
4. [Testing Phase (Week 48-49)](#testing-phase-week-48-49)
|
||||
5. [Production Implementation (Week 50)](#production-implementation-week-50)
|
||||
6. [Mailcow-Specific Considerations](#mailcow-specific-considerations)
|
||||
7. [Troubleshooting](#troubleshooting)
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
User namespace remapping is a Docker security feature that maps container UIDs and GIDs to an unprivileged range on the host, so that root inside a container is not root on the host.
|
||||
|
||||
### Current Status
|
||||
|
||||
| Host | User Namespaces | Risk Level | Implementation Priority |
|------|-----------------|------------|------------------------|
| pihole | Not configured | MEDIUM | Week 49 (after testing) |
| mymx | Not configured | HIGH | Week 50 (mailcow complexity) |
|
||||
|
||||
### Impact Assessment
|
||||
|
||||
**Benefits:**
|
||||
- ✅ Container root ≠ host root (major security improvement)
|
||||
- ✅ Reduces container escape impact
|
||||
- ✅ CIS Docker Benchmark compliance (2.13)
|
||||
|
||||
**Risks:**
|
||||
- ⚠️ **ALL containers must be recreated**
|
||||
- ⚠️ Existing volumes are hidden under the remapped data root and must be migrated and re-owned
|
||||
- ⚠️ Breaking change for existing deployments
|
||||
- ⚠️ Mailcow may have specific requirements
|
||||
|
||||
**Recommendation:** Test thoroughly in dev, then pihole, then mymx (last)
|
||||
|
||||
---
|
||||
|
||||
## Security Benefits
|
||||
|
||||
### Without User Namespace Remapping (Current State)
|
||||
|
||||
```
|
||||
Container: Host:
|
||||
UID 0 (root) → UID 0 (root) ❌ DANGEROUS
|
||||
UID 1000 → UID 1000
|
||||
```
|
||||
|
||||
**Problem:** If a process escapes the container as root, it has root privileges on the host.
|
||||
|
||||
### With User Namespace Remapping (Target State)
|
||||
|
||||
```
|
||||
Container: Host:
|
||||
UID 0 (root) → UID 165536 ✅ SAFE
|
||||
UID 1000 → UID 166536
|
||||
```
|
||||
|
||||
**Benefit:** Container root is an unprivileged user on the host.
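
The mapping in the two diagrams is plain addition: host UID = start of the remap range + container UID. A minimal sketch follows, assuming the range starts at 165536 as in the example above. The function name `host_uid` is illustrative only, and the real start value comes from `/etc/subuid` on each host.

```bash
# host_uid BASE CONTAINER_UID
# Prints the host UID that a container UID maps to when the remap range
# starts at BASE (the second field of the dockremap line in /etc/subuid).
host_uid() {
  local base="$1" container_uid="$2"
  echo $(( base + container_uid ))
}

host_uid 165536 0      # container root  -> prints 165536
host_uid 165536 1000   # container UID 1000 -> prints 166536
```

The same arithmetic applies to GIDs, using the range from `/etc/subgid`.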
|
||||
|
||||
---
|
||||
|
||||
## Prerequisites
|
||||
|
||||
### Before Starting Testing
|
||||
|
||||
1. **VM Snapshots Created**
|
||||
```bash
|
||||
ansible-playbook playbooks/backup_vm_snapshot.yml \
|
||||
-e "target_vms=['pihole', 'mymx']"
|
||||
```
|
||||
|
||||
2. **Rollback Procedures Reviewed**
|
||||
- Read: `docs/runbooks/docker-configuration-rollback.md`
|
||||
- Understand VM snapshot restore process
|
||||
- Have emergency contact information ready
|
||||
|
||||
3. **Maintenance Window Scheduled**
|
||||
- Duration: 2-3 hours for testing
|
||||
- Low-traffic period recommended
|
||||
- Second person available for verification
|
||||
|
||||
4. **Documentation Ready**
|
||||
- This guide printed or accessible offline
|
||||
- Docker and mailcow documentation available
|
||||
- Notepad for documenting issues
|
||||
|
||||
---
|
||||
|
||||
## Testing Phase (Week 48-49)
|
||||
|
||||
### Phase 1: Test Environment Setup (Week 48)
|
||||
|
||||
**Objective:** Validate user namespace remapping with a simple container
|
||||
|
||||
#### Option A: Use derp VM (Recommended)
|
||||
|
||||
```bash
|
||||
# 1. Start derp VM (if stopped)
|
||||
ssh grokbox "sudo virsh start derp"
|
||||
|
||||
# 2. Create ansible user and configure SSH
|
||||
# (Use deploy_linux_vm role or manual setup)
|
||||
|
||||
# 3. Install Docker
|
||||
ansible derp -m apt -a "name=docker.io state=present" -b
|
||||
|
||||
# 4. Create snapshot before testing
|
||||
ansible-playbook playbooks/backup_vm_snapshot.yml \
|
||||
-e "target_vms=['derp']"
|
||||
```
|
||||
|
||||
#### Option B: Create temporary test container on existing host
|
||||
|
||||
```bash
|
||||
# On pihole (low risk - only 1 container)
|
||||
# Create test container first
|
||||
|
||||
docker run -d --name userns-test \
|
||||
-v test-volume:/data \
|
||||
alpine:latest sleep infinity
|
||||
```
|
||||
|
||||
### Phase 2: Enable User Namespace Remapping (Week 48)
|
||||
|
||||
#### Step 1: Configure Docker Daemon
|
||||
|
||||
```bash
|
||||
# On test host (derp or pihole)
|
||||
sudo tee /etc/docker/daemon.json <<EOF
|
||||
{
|
||||
"userns-remap": "default"
|
||||
}
|
||||
EOF
|
||||
|
||||
# Validate syntax
|
||||
cat /etc/docker/daemon.json | jq '.'
|
||||
```
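
The `tee` command in Step 1 replaces any existing `/etc/docker/daemon.json`, so it may be worth saving a copy first. The sketch below is one possible way to do that. The helper name `backup_if_present` is hypothetical, and the backup naming simply follows the `daemon.json.backup.<timestamp>` convention used elsewhere in this guide.

```bash
# backup_if_present FILE
# Copies FILE to FILE.backup.<epoch seconds> when FILE exists and prints
# the backup path. Prints nothing and succeeds when there is no FILE yet.
backup_if_present() {
  local file="$1" backup
  [ -e "$file" ] || return 0
  backup="$file.backup.$(date +%s)"
  cp -p "$file" "$backup" && echo "$backup"
}
```

On a real host this would need root, for example by running `backup_if_present /etc/docker/daemon.json` in a root shell before the `tee` command.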
|
||||
|
||||
#### Step 2: Restart Docker
|
||||
|
||||
```bash
|
||||
# Stop all containers first
|
||||
docker stop $(docker ps -q)
|
||||
|
||||
# Restart Docker daemon
|
||||
sudo systemctl restart docker
|
||||
|
||||
# Verify it started
|
||||
sudo systemctl status docker
|
||||
|
||||
# Check for user namespace in docker info
|
||||
docker info | grep -i "userns"
|
||||
# Should print a line containing "userns" (it is listed under Security Options)
|
||||
```
|
||||
|
||||
#### Step 3: Verify UID Mapping
|
||||
|
||||
```bash
|
||||
# Check subuid/subgid configuration
|
||||
cat /etc/subuid
|
||||
cat /etc/subgid
|
||||
|
||||
# Should show something like:
|
||||
# dockremap:165536:65536
|
||||
|
||||
# Verify Docker is using remapping
|
||||
docker info --format '{{.SecurityOptions}}'
|
||||
```
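
Later steps hard-code `165536`, but the actual start of the range depends on what is already in `/etc/subuid` and `/etc/subgid`. As a sketch, the value and the resulting volume directory could be derived instead of assumed. The function names are hypothetical. The code assumes the default `dockremap` user and the default `/var/lib/docker` data root, and uses the first matching line per file.

```bash
# remap_start USER [FILE]
# Prints the first subordinate ID allocated to USER in FILE
# (default /etc/subuid). Returns non-zero when USER has no entry.
remap_start() {
  local user="$1" file="${2:-/etc/subuid}"
  awk -F: -v u="$user" \
    '$1 == u { print $2; found = 1; exit } END { exit !found }' "$file"
}

# remapped_volume_root [SUBUID_FILE] [SUBGID_FILE]
# Prints the volume directory used by a daemon remapped to dockremap.
remapped_volume_root() {
  local subuid="${1:-/etc/subuid}" subgid="${2:-/etc/subgid}" u g
  u=$(remap_start dockremap "$subuid") || return 1
  g=$(remap_start dockremap "$subgid") || return 1
  echo "/var/lib/docker/$u.$g/volumes"
}
```

Calling `remapped_volume_root` with no arguments reads the system files. If the printed path differs from `/var/lib/docker/165536.165536/volumes`, substitute it in the commands that follow.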
|
||||
|
||||
#### Step 4: Recreate Test Container
|
||||
|
||||
```bash
|
||||
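# Containers, images, and volumes created before the switch are hidden:
# the daemon now keeps its data under /var/lib/docker/165536.165536/.
# The Option B userns-test container will not be listed, so there is
# nothing to remove.
docker ps -a

# Recreate container (test-volume is a new, empty volume in the remapped root)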
|
||||
docker run -d --name userns-test \
|
||||
-v test-volume:/data \
|
||||
alpine:latest sleep infinity
|
||||
|
||||
# Verify it's running
|
||||
docker ps | grep userns-test
|
||||
```
|
||||
|
||||
#### Step 5: Test Volume Permissions
|
||||
|
||||
```bash
|
||||
# Create test file in container
|
||||
docker exec userns-test sh -c 'echo "test" > /data/test.txt'
|
||||
|
||||
# Check file ownership on host
|
||||
# Volume location changed! It's now in:
|
||||
sudo ls -la /var/lib/docker/165536.165536/volumes/test-volume/_data/
|
||||
|
||||
# UID should be 165536 (remapped root)
|
||||
|
||||
# Test read/write in container
|
||||
docker exec userns-test cat /data/test.txt
|
||||
docker exec userns-test sh -c 'echo "test2" >> /data/test.txt'
|
||||
```
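
The ownership check in Step 5 can be scripted rather than read off `ls` output. The sketch below assumes GNU `stat` (as found on typical Linux hosts). The function name `owner_in_range` is illustrative, and the start and size of the range should be taken from `/etc/subuid`.

```bash
# owner_in_range PATH START COUNT
# Succeeds when PATH's numeric owner lies in [START, START + COUNT).
# Returns 2 if PATH cannot be inspected.
owner_in_range() {
  local path="$1" start="$2" count="$3" uid
  uid=$(stat -c %u "$path") || return 2
  [ "$uid" -ge "$start" ] && [ "$uid" -lt $(( start + count )) ]
}
```

For the test file above, a root shell could run `owner_in_range /var/lib/docker/165536.165536/volumes/test-volume/_data/test.txt 165536 65536 && echo remapped`.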
|
||||
|
||||
### Phase 3: Test with Real Application (Week 48-49)
|
||||
|
||||
#### Test Scenario 1: Simple Web Server (pihole preparation)
|
||||
|
||||
```bash
|
||||
# Deploy nginx with volume
|
||||
docker run -d --name test-nginx \
|
||||
-p 8080:80 \
|
||||
-v nginx-data:/usr/share/nginx/html \
|
||||
nginx:alpine
|
||||
|
||||
# Test access
|
||||
curl http://localhost:8080
|
||||
|
||||
# Create content
|
||||
docker exec test-nginx sh -c 'echo "<h1>User Namespace Test</h1>" > /usr/share/nginx/html/test.html'
|
||||
|
||||
# Verify access
|
||||
curl http://localhost:8080/test.html
|
||||
|
||||
# Check logs
|
||||
docker logs test-nginx
|
||||
```
|
||||
|
||||
#### Test Scenario 2: Database Container (mailcow preparation)
|
||||
|
||||
```bash
|
||||
# Deploy MariaDB with volume
|
||||
docker run -d --name test-db \
|
||||
-e MYSQL_ROOT_PASSWORD=testpass123 \
|
||||
-v mysql-data:/var/lib/mysql \
|
||||
mariadb:10.11
|
||||
|
||||
# Wait for startup
|
||||
sleep 30
|
||||
|
||||
# Test database
|
||||
docker exec test-db mysql -ptestpass123 -e "SHOW DATABASES;"
|
||||
|
||||
# Create test database
|
||||
docker exec test-db mysql -ptestpass123 -e "CREATE DATABASE testdb;"
|
||||
|
||||
# Stop and restart to test persistence
|
||||
docker stop test-db
|
||||
docker start test-db
|
||||
sleep 20
|
||||
|
||||
# Verify data persisted
|
||||
docker exec test-db mysql -ptestpass123 -e "SHOW DATABASES;" | grep testdb
|
||||
```
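
The fixed `sleep 30` and `sleep 20` delays in Scenario 2 may be too short on a slow host and longer than needed on a fast one. A possible alternative is to poll until the database answers. This is a generic retry sketch, and `wait_for` is a hypothetical helper, not a Docker feature.

```bash
# wait_for TIMEOUT_SECONDS COMMAND [ARGS...]
# Runs COMMAND about once per second until it succeeds or
# TIMEOUT_SECONDS attempts have failed. Command output is discarded.
wait_for() {
  local timeout="$1" waited=0
  shift
  until "$@" >/dev/null 2>&1; do
    waited=$(( waited + 1 ))
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 1
  done
  return 0
}
```

In the database scenario this could replace the sleeps with `wait_for 60 docker exec test-db mysql -ptestpass123 -e "SELECT 1"`, which returns as soon as the server accepts a query.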
|
||||
|
||||
#### Test Scenario 3: Application with File Uploads
|
||||
|
||||
```bash
|
||||
# Create upload directory
|
||||
mkdir -p /tmp/test-uploads
|
||||
|
||||
# Run container with bind mount
|
||||
docker run -d --name test-upload \
|
||||
-v /tmp/test-uploads:/uploads \
|
||||
alpine:latest sleep infinity
|
||||
|
||||
# Test file creation
|
||||
docker exec test-upload sh -c 'echo "test" > /uploads/test.txt'
|
||||
|
||||
# Check host permissions
|
||||
ls -la /tmp/test-uploads/
|
||||
# File should be owned by UID 165536
|
||||
|
||||
# Test file access from container
|
||||
docker exec test-upload cat /uploads/test.txt
|
||||
```
|
||||
|
||||
### Phase 4: Identify Issues (Week 48-49)
|
||||
|
||||
#### Common Issues to Check
|
||||
|
||||
1. **Permission Denied Errors**
|
||||
```bash
|
||||
# Check container logs
|
||||
docker logs <container_name> 2>&1 | grep -i "permission"
|
||||
```
|
||||
|
||||
2. **Volume Mount Failures**
|
||||
```bash
|
||||
# List volumes
|
||||
docker volume ls
|
||||
|
||||
# Inspect volume
|
||||
docker volume inspect <volume_name>
|
||||
|
||||
# Check actual location on disk
|
||||
sudo ls -la /var/lib/docker/*/volumes/
|
||||
```
|
||||
|
||||
3. **Bind Mount Issues**
|
||||
```bash
|
||||
# For bind mounts, may need to adjust host permissions
|
||||
# Example: Allow remapped UID to write
|
||||
sudo chown 165536:165536 /path/to/host/dir
|
||||
```
|
||||
|
||||
4. **Privileged Container Conflicts**
|
||||
```bash
|
||||
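# Test if privileged containers still work
# With userns-remap enabled, --privileged is rejected unless the container
# also opts out of remapping with --userns=host
docker run --rm --privileged --userns=host alpine:latest id
# Note: such containers run as real host root (no remapping)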
|
||||
```
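
The log checks above can be reduced to a single number per container, which is easier to record in the test log. A minimal sketch, assuming the two most common kernel error strings are the ones of interest. The function name `count_permission_errors` is illustrative.

```bash
# count_permission_errors
# Reads log text on stdin and prints how many lines mention
# "permission denied" or "operation not permitted" (case-insensitive).
count_permission_errors() {
  # grep -c exits non-zero when the count is 0, so normalise the status.
  grep -ciE 'permission denied|operation not permitted' || true
}
```

A typical call would be `docker logs userns-test 2>&1 | count_permission_errors`, where a result of `0` supports a PASS entry for that container.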
|
||||
|
||||
#### Document All Findings
|
||||
|
||||
Create test log:
|
||||
```markdown
|
||||
## User Namespace Remapping Test Log
|
||||
|
||||
Date: <date>
|
||||
Host: <hostname>
|
||||
Docker Version: <version>
|
||||
|
||||
### Test 1: Simple Container
|
||||
- Result: PASS/FAIL
|
||||
- Issues: <none or list>
|
||||
- Notes: <observations>
|
||||
|
||||
### Test 2: Web Server
|
||||
- Result: PASS/FAIL
|
||||
- Issues: <none or list>
|
||||
- Notes: <observations>
|
||||
|
||||
### Test 3: Database
|
||||
- Result: PASS/FAIL
|
||||
- Issues: <none or list>
|
||||
- Notes: <observations>
|
||||
|
||||
### Conclusion
|
||||
Ready for production: YES/NO
|
||||
Blockers: <list if any>
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Production Implementation (Week 50)
|
||||
|
||||
### Implementation Order
|
||||
|
||||
1. **pihole** (Week 49 end / Week 50 start) - Lowest risk
|
||||
2. **mymx** (Week 50 end) - Highest risk, requires mailcow-specific testing
|
||||
|
||||
### pihole Implementation
|
||||
|
||||
**Prerequisites:**
|
||||
- ✅ Testing completed successfully on derp/test environment
|
||||
- ✅ VM snapshot created
|
||||
- ✅ Maintenance window scheduled
|
||||
- ✅ Rollback procedure reviewed
|
||||
|
||||
**Steps:**
|
||||
|
||||
```bash
|
||||
# 1. Create snapshot
|
||||
ansible-playbook playbooks/backup_vm_snapshot.yml \
|
||||
-e "target_vms=['pihole']" \
|
||||
-e "snapshot_description='Pre user namespace implementation'"
|
||||
|
||||
# 2. Backup current configuration
|
||||
ansible pihole -m shell -a "sudo cp /etc/docker/daemon.json /etc/docker/daemon.json.backup.$(date +%s)" -b
|
||||
|
||||
# 3. Stop pihole container
|
||||
ansible pihole -m shell -a "docker stop pihole" -b
|
||||
|
||||
# 4. Configure user namespace remapping
|
||||
ansible pihole -m copy -b -a "
|
||||
dest=/etc/docker/daemon.json
|
||||
content='{\"userns-remap\": \"default\"}'
|
||||
owner=root
|
||||
group=root
|
||||
mode='0644'
|
||||
"
|
||||
|
||||
# 5. Restart Docker
|
||||
ansible pihole -m systemd -a "name=docker state=restarted" -b
|
||||
|
||||
# 6. Verify Docker started
|
||||
ansible pihole -m shell -a "docker info | grep -i userns" -b
|
||||
|
||||
# 7. Recreate pihole container (adjust based on actual deployment)
|
||||
# If using docker run command, re-run it
|
||||
# If using docker-compose, run: docker-compose up -d
|
||||
|
||||
# 8. Verify pihole is working
|
||||
ansible pihole -m shell -a "docker ps" -b
|
||||
ansible pihole -m shell -a "docker logs pihole --tail 50" -b
|
||||
|
||||
# 9. Test DNS functionality
|
||||
dig @192.168.122.12 google.com
|
||||
|
||||
# 10. Monitor for 1 hour
|
||||
watch -n 60 'ansible pihole -m shell -a "docker ps" -b'
|
||||
```
|
||||
|
||||
**Rollback if Issues:**
|
||||
```bash
|
||||
# Follow docs/runbooks/docker-configuration-rollback.md
|
||||
# Procedure 3: User Namespace Remapping Rollback
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Mailcow-Specific Considerations
|
||||
|
||||
### Why Mailcow is Complex
|
||||
|
||||
1. **Multiple interconnected containers** (24 containers)
|
||||
2. **Persistent data in multiple volumes** (mail, databases, configs)
|
||||
3. **File permissions critical** for mail delivery
|
||||
4. **Active production service** - downtime impact high
|
||||
|
||||
### Mailcow Testing Approach (Week 49-50)
|
||||
|
||||
#### Phase 1: Research (Week 49)
|
||||
|
||||
```bash
|
||||
# 1. Check mailcow documentation
|
||||
# Search: "user namespace" or "userns-remap"
|
||||
# URL: https://docs.mailcow.email/
|
||||
|
||||
# 2. Check mailcow GitHub issues
|
||||
# Search for: userns, user namespace, permission issues
|
||||
|
||||
# 3. Check mailcow community forum
|
||||
# URL: https://community.mailcow.email/
|
||||
# Search for similar implementations
|
||||
```
|
||||
|
||||
#### Phase 2: Mailcow Test Environment (Week 49)
|
||||
|
||||
**Option A: Deploy test mailcow on derp**
|
||||
|
||||
```bash
|
||||
# Requires:
|
||||
# - 4GB+ RAM (derp may be too small)
|
||||
# - 20GB+ disk space
|
||||
# - Domain for testing
|
||||
|
||||
# Install mailcow on derp
|
||||
git clone https://github.com/mailcow/mailcow-dockerized
|
||||
cd mailcow-dockerized
|
||||
./generate_config.sh
|
||||
docker-compose up -d
|
||||
```
|
||||
|
||||
**Option B: Clone mymx mailcow config to test environment**
|
||||
|
||||
```bash
|
||||
# Create test VM clone
|
||||
# Copy mailcow configuration
|
||||
# Test with user namespaces
|
||||
```
|
||||
|
||||
#### Phase 3: Mailcow Volume Analysis (Week 49)
|
||||
|
||||
```bash
|
||||
# On mymx, identify all volumes
|
||||
docker volume ls | grep mailcow
|
||||
|
||||
# Check critical volumes
|
||||
docker volume inspect mailcowdockerized_vmail-vol-1
|
||||
docker volume inspect mailcowdockerized_mysql-vol-1
|
||||
|
||||
# Document current permissions
|
||||
for vol in $(docker volume ls -q | grep mailcow); do
|
||||
echo "=== $vol ==="
|
||||
sudo ls -la /var/lib/docker/volumes/$vol/_data/ | head -20
|
||||
done > /tmp/mailcow-permissions-before.txt
|
||||
```
|
||||
|
||||
#### Phase 4: Mailcow Implementation (Week 50 - IF testing successful)
|
||||
|
||||
**ONLY proceed if:**
|
||||
- ✅ Testing in dev environment successful
|
||||
- ✅ pihole implementation successful
|
||||
- ✅ Mailcow community confirms no known issues
|
||||
- ✅ Extended maintenance window available (2-4 hours)
|
||||
- ✅ Full backups completed
|
||||
- ✅ Rollback tested and confirmed working
|
||||
|
||||
**Implementation Steps:**
|
||||
|
||||
```bash
|
||||
# 1. Create snapshot
|
||||
ansible-playbook playbooks/backup_vm_snapshot.yml \
|
||||
-e "target_vms=['mymx']" \
|
||||
-e "snapshot_description='Pre mailcow user namespace'"
|
||||
|
||||
# 2. Backup ALL mailcow data
|
||||
ansible mymx -m shell -a "cd /opt/mailcow-dockerized && ./helper-scripts/backup_and_restore.sh backup all" -b
|
||||
|
||||
# 3. Stop mailcow
|
||||
ansible mymx -m shell -a "cd /opt/mailcow-dockerized && docker-compose down" -b
|
||||
|
||||
# 4. Backup current state
|
||||
ansible mymx -m shell -a "
|
||||
sudo tar -czf /root/mailcow-pre-userns-$(date +%s).tar.gz \
|
||||
/etc/docker \
|
||||
/opt/mailcow-dockerized \
|
||||
/var/lib/docker/volumes/mailcow*
|
||||
" -b
|
||||
|
||||
# 5. Configure user namespace
|
||||
ansible mymx -m shell -a "
|
||||
sudo cp /etc/docker/daemon.json /etc/docker/daemon.json.backup.$(date +%s)
|
||||
echo '{\"userns-remap\": \"default\"}' | sudo tee /etc/docker/daemon.json
|
||||
" -b
|
||||
|
||||
# 6. Restart Docker
|
||||
ansible mymx -m systemd -a "name=docker state=restarted" -b
|
||||
|
||||
# 7. Verify Docker started with user namespaces
|
||||
ansible mymx -m shell -a "docker info | grep -i userns" -b
|
||||
|
||||
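# 8. Start mailcow (will recreate all containers)
# Note: the remapped daemon keeps its data under /var/lib/docker/165536.165536/,
# so images are pulled again and the existing volumes under
# /var/lib/docker/volumes/ are not attached until they are migrated and re-owned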
|
||||
ansible mymx -m shell -a "cd /opt/mailcow-dockerized && docker-compose up -d" -b
|
||||
|
||||
# 9. Monitor startup
|
||||
watch -n 10 'ansible mymx -m shell -a "cd /opt/mailcow-dockerized && docker-compose ps" -b'
|
||||
|
||||
# 10. Check logs for permission errors
|
||||
ansible mymx -m shell -a "cd /opt/mailcow-dockerized && docker-compose logs --tail 100" -b | grep -i "permission\|denied\|failed"
|
||||
|
||||
# 11. Test mail functionality
|
||||
# - Send test email
|
||||
# - Receive test email
|
||||
# - Check webmail access
|
||||
# - Verify SOGo groupware
|
||||
# - Test IMAP/SMTP connections
|
||||
|
||||
# 12. Monitor for 4-8 hours before declaring success
|
||||
```
|
||||
|
||||
**Known Potential Issues with Mailcow:**
|
||||
|
||||
1. **Vmail Volume Permissions**
|
||||
```bash
|
||||
# If mail delivery fails with permission errors
|
||||
# May need to adjust permissions (LAST RESORT)
|
||||
sudo chown -R 165536:165536 /var/lib/docker/165536.165536/volumes/mailcowdockerized_vmail-vol-1/_data/
|
||||
```
|
||||
|
||||
2. **MySQL Volume Issues**
|
||||
```bash
|
||||
# If database won't start
|
||||
# Check MySQL logs
|
||||
docker logs mailcowdockerized-mysql-mailcow-1
|
||||
|
||||
# May need database permission fixes
|
||||
# This is why testing is CRITICAL
|
||||
```
|
||||
|
||||
3. **Dovecot Permission Issues**
|
||||
```bash
|
||||
# Dovecot is sensitive to mail file permissions
|
||||
# May require config adjustments in mailcow.conf
|
||||
```
|
||||
|
||||
### Mailcow Rollback Decision Point
|
||||
|
||||
**Roll back immediately if:**
|
||||
- Docker daemon won't start
|
||||
- MySQL container won't start
|
||||
- Cannot send/receive mail after 15 minutes
|
||||
- Permission errors in critical containers
|
||||
- Data appears missing/inaccessible
|
||||
|
||||
**Use VM snapshot restore if:**
|
||||
- Multiple containers failing
|
||||
- Data corruption suspected
|
||||
- Cannot resolve within 30 minutes
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Issue 1: Docker Daemon Won't Start
|
||||
|
||||
**Symptoms:**
|
||||
```bash
|
||||
systemctl status docker
|
||||
# Failed to start Docker Application Container Engine
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
```bash
|
||||
# Check logs
|
||||
journalctl -u docker -n 100 --no-pager
|
||||
|
||||
# Common causes:
|
||||
# 1. Invalid daemon.json syntax
|
||||
cat /etc/docker/daemon.json | jq '.'
|
||||
|
||||
# 2. Subuid/subgid not configured
|
||||
cat /etc/subuid
|
||||
cat /etc/subgid
|
||||
# Should have dockremap:165536:65536
|
||||
|
||||
# 3. Restore backup
|
||||
sudo cp /etc/docker/daemon.json.backup.<timestamp> /etc/docker/daemon.json
|
||||
sudo systemctl start docker
|
||||
```
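
The restore command uses a `<timestamp>` placeholder, and several backups may exist after repeated attempts. As a sketch, the newest one could be picked by its numeric suffix. The helper name `latest_backup` is hypothetical, and it assumes backups were created with `$(date +%s)` as shown in this guide.

```bash
# latest_backup FILE
# Prints the newest FILE.backup.<epoch> path, chosen by the numeric suffix
# rather than by modification time. Returns non-zero if none exist.
latest_backup() {
  local file="$1" newest="" best=-1 candidate stamp
  for candidate in "$file".backup.*; do
    [ -e "$candidate" ] || continue
    stamp="${candidate##*.backup.}"
    # Ignore suffixes that are empty or not purely numeric.
    case "$stamp" in ''|*[!0-9]*) continue ;; esac
    if [ "$stamp" -gt "$best" ]; then
      best="$stamp"
      newest="$candidate"
    fi
  done
  [ -n "$newest" ] && echo "$newest"
}
```

The result can be inspected before restoring, for example `latest_backup /etc/docker/daemon.json` followed by a manual `sudo cp` of the printed path.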
|
||||
|
||||
### Issue 2: Container Won't Start - Permission Denied
|
||||
|
||||
**Symptoms:**
|
||||
```bash
|
||||
docker logs <container>
|
||||
# Permission denied errors
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
```bash
|
||||
# 1. Check volume location
|
||||
docker volume inspect <volume_name>
|
||||
|
||||
# 2. Check permissions on host
|
||||
sudo ls -la /var/lib/docker/165536.165536/volumes/<volume>/_data/
|
||||
|
||||
# 3. If permissions wrong, may need to adjust
|
||||
# (Avoid this if possible - indicates larger problem)
|
||||
sudo chown -R 165536:165536 /var/lib/docker/165536.165536/volumes/<volume>/_data/
|
||||
```
|
||||
|
||||
### Issue 3: Bind Mounts Not Working
|
||||
|
||||
**Symptoms:**
|
||||
```bash
|
||||
docker logs <container>
|
||||
# Cannot access /bind/mount/path
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
```bash
|
||||
# Bind mounts need host directory permissions adjusted
|
||||
sudo chown 165536:165536 /path/to/bind/mount
|
||||
|
||||
# Or use volumes instead of bind mounts
|
||||
# Volumes are handled automatically by Docker
|
||||
```
|
||||
|
||||
### Issue 4: Privileged Container Needed
|
||||
|
||||
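**Note:** With `userns-remap` enabled, Docker refuses to start a privileged container (like mailcow netfilter) unless it also runs with `--userns=host` (`userns_mode: "host"` in Compose), which disables remapping for that container.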
|
||||
|
||||
```bash
|
||||
# Verify privileged container still works
|
||||
docker inspect <container> | grep -i privileged
|
||||
# Should show: "Privileged": true
|
||||
|
||||
# With --userns=host, privileged containers run as actual host root
# This is expected for netfilter, acceptable risk (documented)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Success Criteria
|
||||
|
||||
### Testing Phase Success (Before Production)
|
||||
|
||||
- [ ] Simple container runs successfully
|
||||
- [ ] Web server container accessible
|
||||
- [ ] Database container stores/retrieves data
|
||||
- [ ] Volume permissions correct (165536 UID)
|
||||
- [ ] Bind mounts work (if needed)
|
||||
- [ ] No permission errors in logs
|
||||
- [ ] Can recreate containers after Docker restart
|
||||
- [ ] Rollback procedure tested and successful
|
||||
|
||||
### Production Implementation Success
|
||||
|
||||
#### pihole
|
||||
- [ ] VM snapshot created
|
||||
- [ ] Docker daemon running with user namespaces
|
||||
- [ ] pihole container running
|
||||
- [ ] DNS queries working
|
||||
- [ ] No permission errors in logs
|
||||
- [ ] Monitoring shows normal operation for 24+ hours
|
||||
|
||||
#### mymx/mailcow
|
||||
- [ ] VM snapshot created
|
||||
- [ ] Docker daemon running with user namespaces
|
||||
- [ ] All 24 containers running
|
||||
- [ ] Can send email
|
||||
- [ ] Can receive email
|
||||
- [ ] Webmail accessible
|
||||
- [ ] SOGo groupware working
|
||||
- [ ] No permission errors in logs
|
||||
- [ ] Monitoring shows normal operation for 48+ hours
|
||||
- [ ] Full service verification completed
|
||||
|
||||
---
|
||||
|
||||
## Decision Tree
|
||||
|
||||
```
|
||||
START: Ready to enable user namespaces?
|
||||
│
|
||||
├─ Testing completed in dev?
|
||||
│ ├─ NO → STOP: Complete testing first
|
||||
│ └─ YES → Continue
|
||||
│
|
||||
├─ VM snapshots created?
|
||||
│ ├─ NO → STOP: Create snapshots first
|
||||
│ └─ YES → Continue
|
||||
│
|
||||
├─ Rollback procedure reviewed?
|
||||
│ ├─ NO → STOP: Review rollback docs
|
||||
│ └─ YES → Continue
|
||||
│
|
||||
├─ Which host?
|
||||
│ ├─ pihole → Proceed (lower risk)
|
||||
│ └─ mymx → Additional checks needed
|
||||
│ │
|
||||
│ ├─ Mailcow community research done?
|
||||
│ │ ├─ NO → STOP: Research first
|
||||
│ │ └─ YES → Continue
|
||||
│ │
|
||||
│ ├─ pihole implementation successful?
|
||||
│ │ ├─ NO → STOP: Fix pihole first
|
||||
│ │ └─ YES → Continue
|
||||
│ │
|
||||
│ ├─ Extended maintenance window?
|
||||
│ │ ├─ NO → STOP: Schedule proper window
|
||||
│ │ └─ YES → Proceed with caution
|
||||
│ │
|
||||
│ └─ Proceed with mymx (high risk)
|
||||
```
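
The decision tree can also be expressed as a small preflight check that reports the first blocking step. This is only a sketch of the tree above. The function name `preflight` and the yes/no argument convention are assumptions made for illustration.

```bash
# preflight HOST TESTED SNAPSHOT REVIEWED [RESEARCHED PIHOLE_OK WINDOW]
# Each flag is "yes" or "no". Prints the first blocking step or a proceed
# message, and returns non-zero when the change should not start.
preflight() {
  local host="$1" tested="$2" snapshot="$3" reviewed="$4"
  local researched="${5:-no}" pihole_ok="${6:-no}" window="${7:-no}"
  [ "$tested" = yes ]   || { echo "STOP: Complete testing first"; return 1; }
  [ "$snapshot" = yes ] || { echo "STOP: Create snapshots first"; return 1; }
  [ "$reviewed" = yes ] || { echo "STOP: Review rollback docs"; return 1; }
  case "$host" in
    pihole)
      echo "Proceed (lower risk)" ;;
    mymx)
      [ "$researched" = yes ] || { echo "STOP: Research first"; return 1; }
      [ "$pihole_ok" = yes ]  || { echo "STOP: Fix pihole first"; return 1; }
      [ "$window" = yes ]     || { echo "STOP: Schedule proper window"; return 1; }
      echo "Proceed with mymx (high risk)" ;;
    *)
      echo "STOP: Unknown host $host"; return 1 ;;
  esac
}
```

For example, `preflight mymx yes yes yes yes no yes` prints `STOP: Fix pihole first` and returns a non-zero status.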
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
- Docker User Namespace Documentation: https://docs.docker.com/engine/security/userns-remap/
|
||||
- CIS Docker Benchmark 2.13: Enable user namespace support
|
||||
- Mailcow Documentation: https://docs.mailcow.email/
|
||||
- NIST SP 800-190: Section 4.4 - Host OS and multi-tenancy
|
||||
|
||||
---
|
||||
|
||||
**Document Version:** 1.0
|
||||
**Next Review:** After testing completion (Week 49)
|
||||
**Owner:** Infrastructure Security Team
|
||||
549
docs/runbooks/docker-configuration-rollback.md
Normal file
@@ -0,0 +1,549 @@
|
||||
# Docker Configuration Rollback Procedures
|
||||
|
||||
**Document Version:** 1.0
|
||||
**Last Updated:** 2025-11-11
|
||||
**Owner:** Infrastructure Team
|
||||
**Risk Level:** HIGH - User Namespace Remapping / LOW - Resource Limits
|
||||
|
||||
---
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [Overview](#overview)
|
||||
2. [Pre-Change Requirements](#pre-change-requirements)
|
||||
3. [Rollback Procedures](#rollback-procedures)
|
||||
4. [Specific Scenarios](#specific-scenarios)
|
||||
5. [Emergency Contacts](#emergency-contacts)
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
This runbook provides step-by-step rollback procedures for Docker configuration changes, with special focus on high-risk modifications like user namespace remapping.
|
||||
|
||||
### Risk Classification
|
||||
|
||||
| Change Type | Risk Level | Rollback Complexity | Downtime |
|-------------|-----------|---------------------|----------|
| Resource limits | LOW | Simple | < 1 min |
| Image version pinning | LOW | Simple | < 1 min |
| User namespace remapping | HIGH | Complex | 5-15 min |
| Network configuration | MEDIUM | Moderate | 2-5 min |
| Storage driver change | CRITICAL | Complex | 15-30 min |
|
||||
|
||||
---
|
||||
|
||||
## Pre-Change Requirements
|
||||
|
||||
### Before ANY Docker Configuration Change
|
||||
|
||||
**MANDATORY STEPS - DO NOT SKIP:**
|
||||
|
||||
1. **Create VM Snapshot**
|
||||
```bash
|
||||
# From Ansible control node
|
||||
ansible-playbook playbooks/backup_vm_snapshot.yml \
|
||||
-e "target_vms=['pihole']" \
|
||||
-e "snapshot_description='Pre Docker config change'"
|
||||
```
|
||||
|
||||
2. **Backup Docker Configuration**
|
||||
```bash
|
||||
# On target host
|
||||
sudo cp /etc/docker/daemon.json /etc/docker/daemon.json.backup.$(date +%s)
|
||||
sudo tar -czf /root/docker-backup-$(date +%s).tar.gz \
|
||||
/etc/docker \
|
||||
/var/lib/docker/volumes
|
||||
```
|
||||
|
||||
3. **Document Current State**
|
||||
```bash
|
||||
# Capture current container list
|
||||
docker ps -a --format "table {{.Names}}\t{{.Image}}\t{{.Status}}" > /tmp/containers-before.txt
|
||||
|
||||
# Capture current configuration
|
||||
docker info > /tmp/docker-info-before.txt
|
||||
|
||||
# Capture volume list
|
||||
docker volume ls > /tmp/volumes-before.txt
|
||||
```
|
||||
|
||||
4. **Verify Connectivity**
|
||||
```bash
|
||||
# Test from Ansible control node
|
||||
ansible pihole -m ping
|
||||
ansible pihole -m shell -a "docker ps"
|
||||
```
|
||||
|
||||
5. **Schedule Maintenance Window**
|
||||
- Notify stakeholders
|
||||
- Plan for 30-60 minute window
|
||||
- Have second person available for verification
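
The state captured in step 3 is most useful when it can be compared with a second capture taken after the change or rollback. A minimal sketch follows, assuming a second file (for example a hypothetical `/tmp/containers-after.txt`) produced with the same `docker ps -a --format "table ..."` command. The function name `missing_containers` is illustrative.

```bash
# missing_containers BEFORE_FILE AFTER_FILE
# Prints container names listed in BEFORE_FILE but absent from AFTER_FILE.
# Both files are expected in the table format captured above: one header
# row, then one container per row with the name in the first column.
missing_containers() {
  local before="$1" after="$2"
  awk 'FNR == 1 { next }                            # skip header rows
       FILENAME == ARGV[1] { seen[$1] = 1; next }   # names present afterwards
       !($1 in seen) { print $1 }' "$after" "$before"
}
```

Empty output from `missing_containers /tmp/containers-before.txt /tmp/containers-after.txt` suggests every container that existed before is present again.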
|
||||
|
||||
---
|
||||
|
||||
## Rollback Procedures
|
||||
|
||||
### Procedure 1: Quick Rollback (Resource Limits / Image Versions)
|
||||
|
||||
**Time Estimate:** 1-2 minutes
|
||||
**Risk:** LOW
|
||||
**Downtime:** < 1 minute per container
|
||||
|
||||
#### Steps
|
||||
|
||||
1. **Stop affected container**
|
||||
```bash
|
||||
docker stop <container_name>
|
||||
```
|
||||
|
||||
2. **Restore previous configuration**
|
||||
```bash
|
||||
# For docker run commands
|
||||
# Simply re-run with old parameters
|
||||
|
||||
# For docker-compose
|
||||
git checkout HEAD~1 docker-compose.yml
|
||||
docker-compose up -d <container_name>
|
||||
```
|
||||
|
||||
3. **Verify service**
|
||||
```bash
|
||||
docker ps | grep <container_name>
|
||||
docker logs <container_name> --tail 50
|
||||
|
||||
# Test application functionality
|
||||
curl -I http://<service_url>
|
||||
```
|
||||
|
||||
#### Success Criteria
|
||||
- Container running
|
||||
- Logs show normal operation
|
||||
- Service accessible
|
||||
- No errors in `docker logs`
|
||||
|
||||
---
|
||||
|
||||
### Procedure 2: Daemon Configuration Rollback (Non-Breaking Changes)
|
||||
|
||||
**Time Estimate:** 3-5 minutes
|
||||
**Risk:** MEDIUM
|
||||
**Downtime:** 2-3 minutes
|
||||
|
||||
#### Steps
|
||||
|
||||
1. **Restore daemon.json**
|
||||
```bash
|
||||
sudo cp /etc/docker/daemon.json.backup.<timestamp> /etc/docker/daemon.json
|
||||
```
|
||||
|
||||
2. **Restart Docker daemon**
|
||||
```bash
|
||||
sudo systemctl restart docker
|
||||
```
|
||||
|
||||
3. **Verify Docker is running**
|
||||
```bash
|
||||
sudo systemctl status docker
|
||||
docker info
|
||||
```
|
||||
|
||||
4. **Check all containers**
|
||||
```bash
|
||||
docker ps -a
|
||||
|
||||
# Restart any stopped containers
|
||||
docker start $(docker ps -aq)
|
||||
```
|
||||
|
||||
5. **Verify services**
|
||||
```bash
|
||||
# Test each service
|
||||
docker logs <container> --tail 20
|
||||
```
|
||||
|
||||
#### Success Criteria
|
||||
- Docker daemon running
|
||||
- All containers started
|
||||
- Services accessible
|
||||
- No errors in `journalctl -u docker`
|
||||
|
||||
---
|
||||
|
||||
### Procedure 3: User Namespace Remapping Rollback (HIGH RISK)
|
||||
|
||||
**Time Estimate:** 10-15 minutes
|
||||
**Risk:** HIGH
|
||||
**Downtime:** 10-15 minutes
|
||||
**Data Loss Risk:** LOW (if volumes backed up)
|
||||
|
||||
⚠️ **WARNING:** This is the most complex rollback. Follow carefully.
|
||||
|
||||
#### Pre-Rollback Verification
|
||||
|
||||
```bash
|
||||
# Verify snapshot exists
|
||||
ssh grokbox "sudo virsh snapshot-list <vm_name>"
|
||||
|
||||
# Verify backup archive exists
|
||||
ls -lh /root/docker-backup-*.tar.gz
|
||||
```
|
||||
|
||||
#### Steps
|
||||
|
||||
1. **Stop all containers gracefully**
|
||||
```bash
|
||||
# Mailcow example
|
||||
cd /opt/mailcow-dockerized
|
||||
docker-compose down
|
||||
|
||||
# Or generic
|
||||
docker stop $(docker ps -q)
|
||||
```
|
||||
|
||||
2. **Stop Docker daemon**
|
||||
```bash
|
||||
sudo systemctl stop docker
|
||||
```
|
||||
|
||||
3. **Restore daemon.json (remove userns-remap)**
|
||||
```bash
|
||||
sudo cp /etc/docker/daemon.json.backup.<timestamp> /etc/docker/daemon.json
|
||||
|
||||
# Verify userns-remap is removed (no output expected)
grep -i userns /etc/docker/daemon.json
|
||||
```
|
||||
|
||||
4. **CRITICAL: Handle user namespace volume mappings**
|
||||
```bash
|
||||
# User namespaced volumes are in a different location
|
||||
# /var/lib/docker/<uid>.<gid>/volumes/
|
||||
|
||||
# List namespaced volumes
|
||||
sudo ls -la /var/lib/docker/*/volumes/
|
||||
|
||||
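# Copy volumes back to main location (if needed)
# Copied files keep their remapped owners (165536 and up), so ownership
# must be shifted back afterwards or containers will hit permission errors
sudo rsync -av /var/lib/docker/*/volumes/* /var/lib/docker/volumes/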
|
||||
```
|
||||
|
||||
5. **Start Docker daemon**
|
||||
```bash
|
||||
sudo systemctl start docker
|
||||
sudo systemctl status docker
|
||||
```
|
||||
|
||||
6. **Verify Docker info**
|
||||
```bash
|
||||
docker info | grep -i "userns"
|
||||
# Should NOT show user namespace remapping
|
||||
```
|
||||
|
||||
7. **Recreate containers**
|
||||
```bash
|
||||
# Mailcow example
|
||||
cd /opt/mailcow-dockerized
|
||||
docker-compose up -d
|
||||
|
||||
# Wait for all containers to start
|
||||
watch -n 2 'docker ps --format "table {{.Names}}\t{{.Status}}"'
|
||||
```
|
||||
|
||||
8. **Verify all services**
|
||||
```bash
|
||||
# Check container logs
|
||||
docker-compose logs --tail 50
|
||||
|
||||
# Test services
|
||||
curl -I https://cow.mymx.me
|
||||
|
||||
# Verify email functionality (mailcow)
|
||||
docker-compose exec postfix-mailcow postqueue -p
|
||||
```
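
Step 3 relies on reading `grep` output by eye. A scripted check that returns a status may be easier to use in automation. The sketch below is a plain text match, not a JSON parser, and the function name `has_userns_remap` is hypothetical.

```bash
# has_userns_remap FILE
# Succeeds when FILE contains a "userns-remap" key with a non-empty
# string value. Fails when the key is absent or set to "".
has_userns_remap() {
  grep -Eq '"userns-remap"[[:space:]]*:[[:space:]]*"[^"]+"' "$1"
}
```

During a rollback the expected result is failure, for example `has_userns_remap /etc/docker/daemon.json || echo "remapping disabled in config"`.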
|
||||
|
||||
#### If Rollback Fails: VM Snapshot Restore
|
||||
|
||||
```bash
|
||||
# From Ansible control node or directly on hypervisor
|
||||
|
||||
# 1. Shutdown VM
|
||||
ssh grokbox "sudo virsh shutdown <vm_name>"
|
||||
|
||||
# 2. Wait for shutdown (max 60 seconds), then check the state
sleep 60
ssh grokbox "sudo virsh domstate <vm_name>"

# 3. Force stop only if the state is still "running"
ssh grokbox "sudo virsh destroy <vm_name>"
|
||||
|
||||
# 4. Revert to snapshot
|
||||
ssh grokbox "sudo virsh snapshot-revert <vm_name> backup_<timestamp>"
|
||||
|
||||
# 5. Start VM
|
||||
ssh grokbox "sudo virsh start <vm_name>"
|
||||
|
||||
# 6. Verify SSH access (may take 1-2 minutes)
|
||||
ansible <vm_name> -m ping
|
||||
|
||||
# 7. Verify services
|
||||
ansible <vm_name> -m shell -a "docker ps"
|
||||
```
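
The fixed `sleep` in step 2 either waits too long or not long enough. As a minimal sketch, the wait could be turned into a polling loop that stops as soon as the VM reports the expected state. The helper name `wait_for_output` and its argument order are hypothetical, not part of this runbook; `"shut off"` is the state string `virsh domstate` prints for a stopped domain.

```bash
#!/usr/bin/env bash
# Poll a command until its first non-empty output line equals an expected
# value, or give up once the timeout has passed.
# Usage: wait_for_output <expected> <timeout_seconds> <interval_seconds> <command...>
# (hypothetical helper; interval must be at least 1)
wait_for_output() {
  local expected="$1" timeout="$2" interval="$3"
  shift 3
  local elapsed=0 current
  while [ "$elapsed" -le "$timeout" ]; do
    # Keep only the first non-empty line: virsh pads its output with blank lines.
    current="$("$@" 2>/dev/null | awk 'NF { print; exit }')"
    if [ "$current" = "$expected" ]; then
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1
}

# Possible use in the restore procedure (force stop only on timeout):
#   ssh grokbox "sudo virsh shutdown <vm_name>"
#   wait_for_output "shut off" 60 5 ssh grokbox "sudo virsh domstate <vm_name>" \
#     || ssh grokbox "sudo virsh destroy <vm_name>"
```

Written this way, `virsh destroy` only runs when the graceful shutdown has not completed within the 60-second budget.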

#### Success Criteria
- Docker daemon running WITHOUT user namespace remapping
- All containers running
- All services accessible
- Volume data intact
- No permission errors in logs
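
The first criterion can be checked mechanically rather than by eye. The sketch below assumes that the daemon lists a `userns` entry among its security options while remapping is active, which is the same signal the `grep -i "userns"` step relies on. The function names are hypothetical.

```bash
#!/usr/bin/env bash
# Return 0 if the text on stdin mentions user namespace remapping.
userns_enabled() {
  grep -qi 'userns'
}

# Hypothetical wrapper for the first success criterion.
check_rollback_userns() {
  local options
  # Capture the output first so a stopped daemon is not mistaken for "no userns".
  options="$(docker info --format '{{json .SecurityOptions}}')" || {
    echo "FAIL: docker info did not run (is the daemon up?)"
    return 2
  }
  if printf '%s\n' "$options" | userns_enabled; then
    echo "FAIL: user namespace remapping is still active"
    return 1
  fi
  echo "OK: user namespace remapping is not active"
}
```

Calling `check_rollback_userns` after step 6 gives a pass/fail line that can be pasted into the incident log.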

---

## Specific Scenarios

### Scenario A: Mailcow Container Won't Start After Namespace Change

**Symptoms:**
- Containers exit immediately
- Permission denied errors in logs
- Volume mount failures

**Solution:**
```bash
# 1. Check volume permissions
docker run --rm -v mailcowdockerized_vmail-vol-1:/volume alpine ls -la /volume

# 2. Fix permissions if needed (DANGEROUS - only if you know UID mapping)
# This example assumes a remap offset of 165536; confirm yours in /etc/subuid.
# While remapping is active, the daemon reads volumes from
# /var/lib/docker/<uid>.<gid>/volumes/, not /var/lib/docker/volumes/.
# chown -R sets every file to container root; files owned by another
# container user need <offset + container UID> instead.
sudo chown -R 165536:165536 /var/lib/docker/165536.165536/volumes/mailcowdockerized_vmail-vol-1

# 3. If permissions are unfixable, revert to snapshot
# See "VM Snapshot Restore" above
```
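
Because the `chown` in step 2 is only safe when the mapping is known, it helps to compute the host-side owner instead of hard-coding it. This is a sketch under two assumptions: the daemon uses the default `dockremap` user, and its range is recorded in `/etc/subuid` as `name:start:count`. The function name `userns_host_uid` is hypothetical.

```bash
#!/usr/bin/env bash
# Print the host UID that a container UID maps to under userns-remap.
# Usage: userns_host_uid <container_uid> [remap_user] [subuid_file]
# Fails if the user has no entry or the UID falls outside the mapped range.
userns_host_uid() {
  local container_uid="$1" remap_user="${2:-dockremap}" subuid_file="${3:-/etc/subuid}"
  awk -F: -v user="$remap_user" -v uid="$container_uid" '
    $1 == user {
      found = 1
      if (uid + 0 >= $3 + 0) exit 1   # outside the mapped range
      print $2 + uid                  # start of range + container UID
      exit 0
    }
    END { if (!found) exit 1 }
  ' "$subuid_file"
}

# Example: host owner for container root and for a container user with UID 1000
#   userns_host_uid 0
#   userns_host_uid 1000
```

The same arithmetic applies to GIDs using `/etc/subgid`.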

### Scenario B: Docker Daemon Won't Start After Config Change

**Symptoms:**
- `systemctl start docker` fails
- Errors in `journalctl -u docker`

**Solution:**
```bash
# 1. Check exact error
sudo journalctl -u docker -n 50 --no-pager

# 2. Validate daemon.json syntax
sudo jq '.' /etc/docker/daemon.json

# 3. If syntax error, restore backup
sudo cp /etc/docker/daemon.json.backup.<timestamp> /etc/docker/daemon.json

# 4. If the JSON parses but the daemon still rejects it, validate the
#    configuration itself (needs a Docker Engine release that supports --validate)
sudo dockerd --validate --config-file /etc/docker/daemon.json

# 5. Start daemon
sudo systemctl start docker
```
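
Step 3 needs the name of a backup file, which is easy to mistype under pressure. A small sketch for finding the newest copy is shown below. It assumes backups follow the `<file>.backup.<timestamp>` naming used in this runbook and that the timestamps sort lexicographically (for example `YYYYMMDD-HHMMSS`); the function name `latest_backup` is hypothetical.

```bash
#!/usr/bin/env bash
# Print the newest "<file>.backup.<timestamp>" copy of a file.
# Returns non-zero if no backup exists.
latest_backup() {
  local target="$1" newest="" candidate
  for candidate in "$target".backup.*; do
    [ -e "$candidate" ] || continue   # the glob matched nothing
    newest="$candidate"               # globs expand in sorted order, last one wins
  done
  [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Possible use in step 3:
#   backup="$(latest_backup /etc/docker/daemon.json)" \
#     && sudo cp "$backup" /etc/docker/daemon.json
```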

### Scenario C: Data Loss After Namespace Change

**Symptoms:**
- Volumes appear empty
- Database containers can't find data
- Application state lost

**Solution:**
```bash
# 1. STOP - Do not proceed with data recovery attempts
# 2. DO NOT restart containers
# 3. Immediately revert to snapshot

ssh grokbox "sudo virsh snapshot-revert <vm_name> backup_<timestamp>"

# 4. After VM restore, verify data
docker exec <database_container> <verification_command>

# Example for MySQL
docker exec mailcowdockerized-mysql-mailcow-1 mysql -u root -p<password> -e "SHOW DATABASES;"
```

---

## Testing Rollback Procedures

### Monthly Rollback Drill

**Schedule:** First Monday of each month
**Duration:** 30 minutes
**Environment:** Development/Test VMs only

#### Drill Steps

1. **Create test VM or use derp**
   ```bash
   # Deploy test container
   docker run -d --name test-nginx nginx:latest
   ```

2. **Create snapshot**
   ```bash
   ansible-playbook playbooks/backup_vm_snapshot.yml \
     -e "target_vms=['test-vm']"
   ```

3. **Make intentional breaking change**
   ```bash
   # Keep a known-good copy first so step 4 has something to restore
   sudo cp /etc/docker/daemon.json /etc/docker/daemon.json.backup.<timestamp>

   # Break Docker config
   echo '{"invalid": json}' | sudo tee /etc/docker/daemon.json
   sudo systemctl restart docker  # This will fail
   ```

4. **Practice rollback**
   ```bash
   # Follow Procedure 2 above
   sudo cp /etc/docker/daemon.json.backup.<timestamp> /etc/docker/daemon.json
   sudo systemctl start docker
   ```

5. **Practice snapshot restore**
   ```bash
   # Follow VM Snapshot Restore procedure
   ssh grokbox "sudo virsh snapshot-revert test-vm backup_<timestamp>"
   ```

6. **Document issues found**
   - Update this runbook
   - Note any steps that were unclear
   - Time each procedure
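
The drill overwrites `daemon.json` on purpose, so the rollback in step 4 only works if a backup was taken beforehand. A minimal sketch of a timestamped backup helper follows; the function name `backup_file` and the `YYYYMMDD-HHMMSS` stamp format are assumptions, chosen to match the `<file>.backup.<timestamp>` pattern used in the steps above.

```bash
#!/usr/bin/env bash
# Copy a file to "<file>.backup.<timestamp>" and print the name of the copy.
# Run as root (or through sudo) for files under /etc/docker.
backup_file() {
  local target="$1" stamp copy
  stamp="$(date +%Y%m%d-%H%M%S)"
  copy="${target}.backup.${stamp}"
  # -p keeps mode and timestamps so the restored file looks like the original
  cp -p -- "$target" "$copy" && printf '%s\n' "$copy"
}

# Possible use before step 3 of the drill:
#   backup="$(backup_file /etc/docker/daemon.json)"
#   echo "Restore with: cp $backup /etc/docker/daemon.json"
```

Printing the name of the copy makes it easy to record the exact restore command while timing the drill.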

---

## Emergency Contacts

### Escalation Path

| Level | Contact | Response Time | Responsibility |
|-------|---------|---------------|----------------|
| L1 | Infrastructure Team | Immediate | Execute runbook |
| L2 | Senior Sysadmin | 15 minutes | Complex issues |
| L3 | Vendor Support | 1-4 hours | Critical failures |

### Service-Specific Contacts

**Mailcow:**
- Documentation: https://docs.mailcow.email/
- Community: https://community.mailcow.email/
- Emergency: Check for known issues in GitHub

**Docker:**
- Documentation: https://docs.docker.com/
- Community Forums: https://forums.docker.com/

---

## Post-Rollback Actions

### After Any Rollback

1. **Update incident log**
   ```markdown
   Date: <timestamp>
   VM: <vm_name>
   Change Attempted: <description>
   Rollback Procedure Used: <procedure_number>
   Success: Yes/No
   Time to Restore: <minutes>
   Issues Encountered: <list>
   ```

2. **Verify service monitoring**
   - Check all alerts cleared
   - Verify metrics returning to normal
   - Test service endpoints

3. **Document lessons learned**
   - What went wrong?
   - What could be improved?
   - Update this runbook

4. **Schedule post-mortem** (for critical incidents)
   - Within 48 hours
   - All stakeholders present
   - Action items assigned

5. **Update change management records**
   - Mark change as rolled back
   - Document reason for failure
   - Plan for retry (if applicable)

---

## Preventive Measures

### Before Making High-Risk Changes

1. **Test in development first**
   - Use derp VM or test environment
   - Replicate production as closely as possible
   - Document exact steps that work

2. **Review Docker/Mailcow changelogs**
   - Check for known issues
   - Review breaking changes
   - Search community forums

3. **Peer review change plan**
   - Have colleague review procedure
   - Walk through rollback steps
   - Verify backup procedures

4. **Schedule during low-traffic period**
   - Weekend or late evening
   - Notify users in advance
   - Have monitoring ready

---

## Appendix A: Quick Reference Commands

### Snapshot Management
```bash
# Create snapshot
ansible-playbook playbooks/backup_vm_snapshot.yml -e "target_vms=['vm']"

# List snapshots
ssh grokbox "sudo virsh snapshot-list <vm>"

# Revert to snapshot
ssh grokbox "sudo virsh snapshot-revert <vm> <snapshot_name>"

# Delete snapshot
ssh grokbox "sudo virsh snapshot-delete <vm> <snapshot_name>"
```
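
The snapshot playbook names its snapshots `<snapshot_prefix>_<epoch>` (`backup_` followed by a Unix timestamp by default), so the newest one can be picked out of a name listing numerically. The sketch below reads names on stdin, for example from `virsh snapshot-list <vm> --name`; the function name `newest_snapshot` is hypothetical.

```bash
#!/usr/bin/env bash
# Read snapshot names on stdin and print the newest "<prefix>_<epoch>" name.
# Names that do not match the pattern (manual snapshots, blank lines) are ignored.
# Returns non-zero if no matching snapshot is found.
newest_snapshot() {
  local prefix="${1:-backup}" epoch
  epoch="$(sed -n "s/^${prefix}_\([0-9][0-9]*\)\$/\1/p" | sort -n | tail -n 1)"
  [ -n "$epoch" ] && printf '%s_%s\n' "$prefix" "$epoch"
}

# Possible use:
#   snap="$(ssh grokbox "sudo virsh snapshot-list <vm> --name" | newest_snapshot)"
#   ssh grokbox "sudo virsh snapshot-revert <vm> $snap"
```

Sorting on the numeric suffix rather than on the whole name avoids surprises if two timestamps ever differ in length.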

### Docker Backup/Restore
```bash
# Backup
sudo tar -czf docker-backup.tar.gz /etc/docker /var/lib/docker/volumes

# Restore
sudo tar -xzf docker-backup.tar.gz -C /
```
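
Before relying on an archive for a restore, it is worth confirming that it actually contains the files you expect. The following sketch lists the archive and looks for an exact member path. Note that `tar` stores members without the leading `/`, so `/etc/docker/daemon.json` appears as `etc/docker/daemon.json`. The function name `archive_contains` is hypothetical.

```bash
#!/usr/bin/env bash
# Return 0 if the gzip-compressed tar archive lists the given member path.
# Usage: archive_contains <archive> <member_path_without_leading_slash>
archive_contains() {
  local archive="$1" member="$2"
  # -x matches the whole line, -F treats the path as a literal string
  tar -tzf "$archive" | grep -qxF -- "$member"
}

# Possible use after the backup command:
#   archive_contains docker-backup.tar.gz etc/docker/daemon.json \
#     || echo "WARNING: daemon.json is missing from the backup"
```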

### Service Verification
```bash
# Docker
systemctl status docker
docker info
docker ps

# Mailcow
cd /opt/mailcow-dockerized
docker-compose ps
docker-compose logs --tail 50
```

---

**Document End**

**Review Schedule:** Monthly
**Next Review:** 2025-12-11
**Approval:** Infrastructure Team Lead

206
playbooks/backup_vm_snapshot.yml
Normal file
@@ -0,0 +1,206 @@
---
# ==============================================================================
# VM Snapshot Backup Playbook
# ==============================================================================
# Create snapshots of VMs before risky operations
# Supports KVM/libvirt VMs via hypervisor connection
# ==============================================================================

- name: Create VM Snapshots for Backup
  hosts: localhost
  gather_facts: true
  vars:
    hypervisor_uri: "qemu+ssh://grok@grok.home.serneels.xyz/system"
    snapshot_description: "Pre-maintenance backup"
    snapshot_prefix: "backup"
    target_vms: []  # Empty list means all running VMs

  tasks:
    - name: Display snapshot operation information
      ansible.builtin.debug:
        msg:
          - "=== VM Snapshot Backup Operation ==="
          - "Hypervisor: {{ hypervisor_uri }}"
          - "Date: {{ ansible_date_time.iso8601 }}"
          # The second argument makes default() also apply to an empty list
          - "Target VMs: {{ target_vms | default('all running VMs', true) }}"
      tags: [always]

    - name: Validate target_vms variable
      ansible.builtin.assert:
        that:
          - target_vms is defined
          - target_vms is iterable
        fail_msg: "target_vms must be a list of VM names"
      tags: [always]

    # ==========================================================================
    # Get VM List
    # ==========================================================================

    - name: Get list of all running VMs
      ansible.builtin.shell: |
        ssh grokbox "sudo virsh list --name"
      register: all_vms_raw
      changed_when: false
      when: target_vms | length == 0
      tags: [discover]

    - name: Parse running VMs list
      ansible.builtin.set_fact:
        discovered_vms: "{{ all_vms_raw.stdout_lines | select() | list }}"
      when: target_vms | length == 0
      tags: [discover]

    - name: Set final VM list
      ansible.builtin.set_fact:
        vms_to_backup: "{{ target_vms if target_vms | length > 0 else discovered_vms }}"
      tags: [discover]

    - name: Display VMs to be backed up
      ansible.builtin.debug:
        msg: "VMs to backup: {{ vms_to_backup }}"
      tags: [discover]

    # ==========================================================================
    # Pre-flight Checks
    # ==========================================================================

    - name: Check if VMs exist and are running
      ansible.builtin.shell: |
        ssh grokbox "sudo virsh domstate {{ item }}"
      register: vm_states
      failed_when: vm_states.rc != 0
      changed_when: false
      loop: "{{ vms_to_backup }}"
      tags: [validate]

    - name: Verify all VMs are running
      ansible.builtin.assert:
        that:
          - item.stdout == 'running'
        fail_msg: "VM {{ item.item }} is not running (state: {{ item.stdout }})"
        success_msg: "VM {{ item.item }} is running"
      loop: "{{ vm_states.results }}"
      tags: [validate]

    - name: Check for existing snapshots
      ansible.builtin.shell: |
        ssh grokbox "sudo virsh snapshot-list {{ item }} --name"
      register: existing_snapshots
      changed_when: false
      loop: "{{ vms_to_backup }}"
      tags: [validate]

    - name: Display existing snapshots
      ansible.builtin.debug:
        msg:
          - "VM: {{ item.item }}"
          # stdout_lines is always defined, so drop blank lines and fall back on an empty list
          - "Existing snapshots: {{ item.stdout_lines | select() | list | default(['none'], true) | join(', ') }}"
      loop: "{{ existing_snapshots.results }}"
      tags: [validate]

    # ==========================================================================
    # Create Snapshots
    # ==========================================================================

    - name: Generate snapshot name with timestamp
      ansible.builtin.set_fact:
        snapshot_timestamp: "{{ ansible_date_time.epoch }}"
      tags: [snapshot]

    - name: Create VM snapshots
      ansible.builtin.shell: |
        ssh grokbox "sudo virsh snapshot-create-as {{ item }} \
          --name '{{ snapshot_prefix }}_{{ snapshot_timestamp }}' \
          --description '{{ snapshot_description }} - {{ ansible_date_time.iso8601 }}' \
          --atomic"
      register: snapshot_create
      loop: "{{ vms_to_backup }}"
      tags: [snapshot]

    - name: Verify snapshot creation
      ansible.builtin.shell: |
        ssh grokbox "sudo virsh snapshot-info {{ item }} {{ snapshot_prefix }}_{{ snapshot_timestamp }}"
      register: snapshot_info
      changed_when: false
      loop: "{{ vms_to_backup }}"
      tags: [snapshot, verify]

    # ==========================================================================
    # Generate Backup Report
    # ==========================================================================

    - name: Create backup report directory
      ansible.builtin.file:
        path: "./stats/vm_backups"
        state: directory
        mode: '0755'
      tags: [report]

    - name: Generate backup report
      ansible.builtin.copy:
        content: |
          ================================================================================
          VM SNAPSHOT BACKUP REPORT
          ================================================================================
          Date: {{ ansible_date_time.iso8601 }}
          Hypervisor: {{ hypervisor_uri }}
          Snapshot Name: {{ snapshot_prefix }}_{{ snapshot_timestamp }}
          Description: {{ snapshot_description }}

          VMs Backed Up:
          {% for vm in vms_to_backup %}
          - {{ vm }}
          {% endfor %}

          Snapshot Details:
          {% for result in snapshot_info.results %}

          VM: {{ result.item }}
          {{ result.stdout }}
          {% endfor %}

          ROLLBACK INSTRUCTIONS
          ================================================================================

          To restore a VM to this snapshot:

          1. Stop the VM (if running):
             ssh grokbox "sudo virsh shutdown <vm_name>"

          2. Revert to snapshot:
             ssh grokbox "sudo virsh snapshot-revert <vm_name> {{ snapshot_prefix }}_{{ snapshot_timestamp }}"

          3. Start the VM:
             ssh grokbox "sudo virsh start <vm_name>"

          To delete this snapshot after verification:
             ssh grokbox "sudo virsh snapshot-delete <vm_name> {{ snapshot_prefix }}_{{ snapshot_timestamp }}"

          ================================================================================
          END OF REPORT
          ================================================================================
        dest: "./stats/vm_backups/backup_{{ snapshot_timestamp }}.txt"
        mode: '0644'
      tags: [report]

    # ==========================================================================
    # Display Summary
    # ==========================================================================

    - name: Display backup summary
      ansible.builtin.debug:
        msg:
          - "=== VM Snapshot Backup Complete ==="
          - "Snapshot Name: {{ snapshot_prefix }}_{{ snapshot_timestamp }}"
          - "VMs Backed Up: {{ vms_to_backup | length }}"
          - "Backup Report: ./stats/vm_backups/backup_{{ snapshot_timestamp }}.txt"
          - ""
          - "⚠️ IMPORTANT NOTES:"
          - "1. Snapshots are point-in-time copies"
          - "2. Test restoration procedure before relying on snapshots"
          - "3. Snapshots consume disk space - clean up old snapshots"
          - "4. For critical changes, consider full VM backups"
          - ""
          - "To restore: virsh snapshot-revert <vm> {{ snapshot_prefix }}_{{ snapshot_timestamp }}"
      tags: [always]