# Docker User Namespace Remapping - Testing and Implementation Guide
**Document Version:** 1.0
**Last Updated:** 2025-11-11
**Risk Level:** HIGH
**Testing Required:** YES (Mandatory in dev/test first)
---
## Table of Contents
1. [Overview](#overview)
2. [Security Benefits](#security-benefits)
3. [Prerequisites](#prerequisites)
4. [Testing Phase (Week 48-49)](#testing-phase-week-48-49)
5. [Production Implementation (Week 50)](#production-implementation-week-50)
6. [Mailcow-Specific Considerations](#mailcow-specific-considerations)
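7. [Troubleshooting](#troubleshooting)
8. [Success Criteria](#success-criteria)
9. [Decision Tree](#decision-tree)
10. [References](#references)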
---
## Overview
User namespace remapping is a Docker security feature that maps container UID/GIDs to different values on the host, preventing container root from being host root.
### Current Status
| Host | User Namespaces | Risk Level | Implementation Priority |
|------|-----------------|------------|------------------------|
| pihole | Not configured | MEDIUM | Week 49 (after testing) |
| mymx | Not configured | HIGH | Week 50 (mailcow complexity) |
### Impact Assessment
**Benefits:**
- ✅ Container root ≠ host root (major security improvement)
- ✅ Reduces container escape impact
- ✅ CIS Docker Benchmark compliance (2.13)
**Risks:**
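- ⚠️ **ALL containers must be recreated**: Docker switches to a separate data root (`/var/lib/docker/<uid>.<gid>/`), so existing containers, images, and named volumes are no longer visible while remapping is enabled
- ⚠️ Existing volume data is not carried over automatically; it must be copied into the new data root and its ownership shifted into the remapped range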
- ⚠️ Breaking change for existing deployments
- ⚠️ Mailcow may have specific requirements
**Recommendation:** Test thoroughly in dev, then pihole, then mymx (last)
---
## Security Benefits
### Without User Namespace Remapping (Current State)
```
Container: Host:
UID 0 (root) → UID 0 (root) ❌ DANGEROUS
UID 1000 → UID 1000
```
**Problem:** Container root is host root. A process that escapes the container as root holds full root privileges on the host.
### With User Namespace Remapping (Target State)
```
Container: Host:
UID 0 (root) → UID 165536 ✅ SAFE
UID 1000 → UID 166536
```
**Benefit:** Container root is an unprivileged user on the host.
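The mapping in the diagram is plain offset arithmetic: host UID = start of the subordinate range + container UID. A minimal sketch, assuming the range starts at 165536 as shown above (the real value comes from `/etc/subuid` on each host); `host_uid` is a hypothetical helper name, not part of Docker.

```bash
# host_uid BASE CONTAINER_UID  (hypothetical helper)
# Prints the host UID that a container UID maps to under userns-remap.
# BASE is the first subordinate UID of the remap user (assumed 165536 here).
host_uid() {
    local base=$1 container_uid=$2
    echo $(( base + container_uid ))
}

host_uid 165536 0      # container root
host_uid 165536 1000   # first regular user inside the container
```

The two calls reproduce the values in the diagram (165536 and 166536).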
---
## Prerequisites
### Before Starting Testing
1. **VM Snapshots Created**
```bash
ansible-playbook playbooks/backup_vm_snapshot.yml \
-e "target_vms=['pihole', 'mymx']"
```
2. **Rollback Procedures Reviewed**
- Read: `docs/runbooks/docker-configuration-rollback.md`
- Understand VM snapshot restore process
- Have emergency contact information ready
3. **Maintenance Window Scheduled**
- Duration: 2-3 hours for testing
- Low-traffic period recommended
- Second person available for verification
4. **Documentation Ready**
- This guide printed or accessible offline
- Docker and mailcow documentation available
- Notepad for documenting issues
---
## Testing Phase (Week 48-49)
### Phase 1: Test Environment Setup (Week 48)
**Objective:** Validate user namespace remapping with a simple container
#### Option A: Use derp VM (Recommended)
```bash
# 1. Start derp VM (if stopped)
ssh grokbox "sudo virsh start derp"
# 2. Create ansible user and configure SSH
# (Use deploy_linux_vm role or manual setup)
# 3. Install Docker
ansible derp -m apt -a "name=docker.io state=present" -b
# 4. Create snapshot before testing
ansible-playbook playbooks/backup_vm_snapshot.yml \
-e "target_vms=['derp']"
```
#### Option B: Create temporary test container on existing host
```bash
# On pihole (low risk - only 1 container)
# Create test container first
docker run -d --name userns-test \
-v test-volume:/data \
alpine:latest sleep infinity
```
### Phase 2: Enable User Namespace Remapping (Week 48)
#### Step 1: Configure Docker Daemon
```bash
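# On test host (derp or pihole)
# WARNING: this overwrites any existing daemon.json; merge by hand if the file already contains settings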
sudo tee /etc/docker/daemon.json <<EOF
{
"userns-remap": "default"
}
EOF
# Validate syntax
cat /etc/docker/daemon.json | jq '.'
```
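If the host already has a `daemon.json` with other settings, merging is safer than overwriting. The sketch below assumes `jq` is installed (the guide already uses it for validation); `merge_userns_remap` is a hypothetical helper name.

```bash
# merge_userns_remap  (hypothetical helper)
# Reads an existing daemon.json on stdin and prints it with
# "userns-remap": "default" added; all other keys are left untouched.
merge_userns_remap() {
    jq '. + {"userns-remap": "default"}'
}

# Example usage (writes to a temporary file for review before installing it):
#   merge_userns_remap < /etc/docker/daemon.json > /tmp/daemon.json.new
```

Reviewing the generated file before copying it into place keeps the change easy to roll back.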
#### Step 2: Restart Docker
```bash
# Stop all containers first (xargs -r skips the call when nothing is running)
docker ps -q | xargs -r docker stop
# Restart Docker daemon
sudo systemctl restart docker
# Verify it started
sudo systemctl status docker
# Check for user namespace in docker info
docker info | grep -i "userns"
# Should print a line containing "userns" (listed under Security Options)
```
#### Step 3: Verify UID Mapping
```bash
# Check subuid/subgid configuration
cat /etc/subuid
cat /etc/subgid
# Should show something like:
# dockremap:165536:65536
# Verify Docker is using remapping
docker info --format '{{.SecurityOptions}}'
```
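The `/etc/subuid` check can be scripted so later steps do not hard-code 165536. A minimal sketch, assuming the standard `name:start:count` file format; `subid_start` is a hypothetical helper name.

```bash
# subid_start USER FILE  (hypothetical helper)
# Prints the first subordinate ID assigned to USER in a subuid/subgid-style
# file (format: name:start:count). Returns non-zero if USER has no entry.
subid_start() {
    local user=$1 file=$2
    awk -F: -v u="$user" '
        $1 == u { print $2; found = 1; exit }
        END     { exit !found }
    ' "$file"
}

# Example usage:
#   base_uid=$(subid_start dockremap /etc/subuid)
#   base_gid=$(subid_start dockremap /etc/subgid)
```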
#### Step 4: Recreate Test Container
```bash
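# Remove the old container if it is still listed. With remapping enabled,
# Docker uses a new data root, so a container and volume created before the
# change are normally no longer visible and this is a no-op.
docker rm -f userns-test 2>/dev/null || true
# Recreate container (this creates a fresh, empty test-volume in the new data root)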
docker run -d --name userns-test \
-v test-volume:/data \
alpine:latest sleep infinity
# Verify it's running
docker ps | grep userns-test
```
#### Step 5: Test Volume Permissions
```bash
# Create test file in container
docker exec userns-test sh -c 'echo "test" > /data/test.txt'
# Check file ownership on host
# Volume location changed! It's now in:
sudo ls -la /var/lib/docker/165536.165536/volumes/test-volume/_data/
# UID should be 165536 (remapped root)
# Test read/write in container
docker exec userns-test cat /data/test.txt
docker exec userns-test sh -c 'echo "test2" >> /data/test.txt'
```
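The path and ownership check above can be expressed as two small helpers. This is a sketch under the same assumption as the rest of the guide (base UID/GID 165536); `remapped_data_root` and `owned_by` are hypothetical names.

```bash
# remapped_data_root UID GID  (hypothetical helper)
# Prints the data root Docker uses when remapping to UID.GID.
remapped_data_root() {
    printf '/var/lib/docker/%s.%s\n' "$1" "$2"
}

# owned_by PATH EXPECTED_UID  (hypothetical helper)
# Succeeds if PATH is owned by the numeric UID EXPECTED_UID.
owned_by() {
    local path=$1 expected_uid=$2 actual_uid
    actual_uid=$(ls -ldn "$path" | awk '{ print $3 }')
    [ "$actual_uid" = "$expected_uid" ]
}

# Example usage (run as root, since the data root is not world-readable):
#   root=$(remapped_data_root 165536 165536)
#   owned_by "$root/volumes/test-volume/_data/test.txt" 165536 && echo "ownership OK"
```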
### Phase 3: Test with Real Application (Week 48-49)
#### Test Scenario 1: Simple Web Server (pihole preparation)
```bash
# Deploy nginx with volume
docker run -d --name test-nginx \
-p 8080:80 \
-v nginx-data:/usr/share/nginx/html \
nginx:alpine
# Test access
curl http://localhost:8080
# Create content
docker exec test-nginx sh -c 'echo "<h1>User Namespace Test</h1>" > /usr/share/nginx/html/test.html'
# Verify access
curl http://localhost:8080/test.html
# Check logs
docker logs test-nginx
```
#### Test Scenario 2: Database Container (mailcow preparation)
```bash
# Deploy MariaDB with volume
docker run -d --name test-db \
-e MYSQL_ROOT_PASSWORD=testpass123 \
-v mysql-data:/var/lib/mysql \
mariadb:10.11
# Wait for startup
sleep 30
# Test database
docker exec test-db mysql -ptestpass123 -e "SHOW DATABASES;"
# Create test database
docker exec test-db mysql -ptestpass123 -e "CREATE DATABASE testdb;"
# Stop and restart to test persistence
docker stop test-db
docker start test-db
sleep 20
# Verify data persisted
docker exec test-db mysql -ptestpass123 -e "SHOW DATABASES;" | grep testdb
```
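The fixed `sleep 30` and `sleep 20` waits can be too short on a slow test VM or needlessly long on a fast one. A polling loop is one alternative; this is a sketch, and `wait_for` is a hypothetical helper name.

```bash
# wait_for TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]  (hypothetical helper)
# Re-runs COMMAND until it succeeds. Returns 1 if it is still failing after
# roughly TIMEOUT_SECONDS of waiting.
wait_for() {
    local timeout=$1 interval=$2 waited=0
    shift 2
    until "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$(( waited + interval ))
    done
    return 0
}

# Example usage with the test database from this scenario:
#   wait_for 60 2 docker exec test-db mysql -ptestpass123 -e "SELECT 1" \
#       || echo "database did not come up within 60s"
```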
#### Test Scenario 3: Application with File Uploads
```bash
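# Create upload directory and hand it to remapped root; otherwise
# container root (host UID 165536) cannot write to the bind mount
mkdir -p /tmp/test-uploads
sudo chown 165536:165536 /tmp/test-uploads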
# Run container with bind mount
docker run -d --name test-upload \
-v /tmp/test-uploads:/uploads \
alpine:latest sleep infinity
# Test file creation
docker exec test-upload sh -c 'echo "test" > /uploads/test.txt'
# Check host permissions
ls -la /tmp/test-uploads/
# File should be owned by UID 165536
# Test file access from container
docker exec test-upload cat /uploads/test.txt
```
### Phase 4: Identify Issues (Week 48-49)
#### Common Issues to Check
1. **Permission Denied Errors**
```bash
# Check container logs
docker logs <container_name> 2>&1 | grep -i "permission"
```
2. **Volume Mount Failures**
```bash
# List volumes
docker volume ls
# Inspect volume
docker volume inspect <volume_name>
# Check actual location on disk
sudo ls -la /var/lib/docker/*/volumes/
```
3. **Bind Mount Issues**
```bash
# For bind mounts, may need to adjust host permissions
# Example: Allow remapped UID to write
sudo chown 165536:165536 /path/to/host/dir
```
4. **Privileged Container Conflicts**
```bash
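# With userns-remap enabled, --privileged is rejected unless the container
# opts out of remapping with --userns=host
docker run --rm --privileged --userns=host alpine:latest id
# Note: such containers run as real host root (no remapping applies)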
```
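The log check from item 1 can be widened slightly, since user namespace problems often surface as "Operation not permitted" rather than "Permission denied". A minimal sketch; `permission_errors` is a hypothetical helper name and the pattern list is an assumption, not an exhaustive set.

```bash
# permission_errors  (hypothetical helper)
# Filters stdin down to lines that look like permission failures.
# Always returns 0 so it is safe to use under "set -e".
permission_errors() {
    grep -iE 'permission denied|operation not permitted' || true
}

# Example usage:
#   docker logs <container_name> 2>&1 | permission_errors
```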
#### Document All Findings
Create test log:
```markdown
## User Namespace Remapping Test Log
Date: <date>
Host: <hostname>
Docker Version: <version>
### Test 1: Simple Container
- Result: PASS/FAIL
- Issues: <none or list>
- Notes: <observations>
### Test 2: Web Server
- Result: PASS/FAIL
- Issues: <none or list>
- Notes: <observations>
### Test 3: Database
- Result: PASS/FAIL
- Issues: <none or list>
- Notes: <observations>
### Conclusion
Ready for production: YES/NO
Blockers: <list if any>
```
---
## Production Implementation (Week 50)
### Implementation Order
1. **pihole** (Week 49 end / Week 50 start) - Lowest risk
2. **mymx** (Week 50 end) - Highest risk, requires mailcow-specific testing
### pihole Implementation
**Prerequisites:**
- ✅ Testing completed successfully on derp/test environment
- ✅ VM snapshot created
- ✅ Maintenance window scheduled
- ✅ Rollback procedure reviewed
**Steps:**
```bash
# 1. Create snapshot
ansible-playbook playbooks/backup_vm_snapshot.yml \
-e "target_vms=['pihole']" \
-e "snapshot_description='Pre user namespace implementation'"
# 2. Backup current configuration (skip if daemon.json does not exist yet; -b already runs the command as root)
ansible pihole -m shell -a "cp /etc/docker/daemon.json /etc/docker/daemon.json.backup.$(date +%s)" -b
# 3. Stop pihole container
ansible pihole -m shell -a "docker stop pihole" -b
# 4. Configure user namespace remapping
ansible pihole -m copy -b -a "
dest=/etc/docker/daemon.json
content='{\"userns-remap\": \"default\"}'
owner=root
group=root
mode='0644'
"
# 5. Restart Docker
ansible pihole -m systemd -a "name=docker state=restarted" -b
# 6. Verify Docker started
ansible pihole -m shell -a "docker info | grep -i userns" -b
# 7. Recreate pihole container (adjust based on actual deployment)
# If using docker run command, re-run it
# If using docker-compose, run: docker-compose up -d
# 8. Verify pihole is working
ansible pihole -m shell -a "docker ps" -b
ansible pihole -m shell -a "docker logs pihole --tail 50" -b
# 9. Test DNS functionality
dig @192.168.122.12 google.com
# 10. Monitor for 1 hour
watch -n 60 'ansible pihole -m shell -a "docker ps" -b'
```
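Step 9 can be turned into a pass/fail check for use in scripts. This sketch only inspects the output of `dig +short`; `dns_answer_ok` is a hypothetical helper name, and the pattern is a loose IPv4 match rather than a strict validator.

```bash
# dns_answer_ok  (hypothetical helper)
# Reads `dig +short` output on stdin and succeeds if at least one line
# looks like an IPv4 address (loose match, octet ranges are not validated).
dns_answer_ok() {
    grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# Example usage against the pihole address used in this guide:
#   dig +short @192.168.122.12 google.com | dns_answer_ok && echo "DNS OK"
```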
**Rollback if Issues:**
```bash
# Follow docs/runbooks/docker-configuration-rollback.md
# Procedure 3: User Namespace Remapping Rollback
```
---
## Mailcow-Specific Considerations
### Why Mailcow is Complex
1. **Multiple interconnected containers** (24 containers)
2. **Persistent data in multiple volumes** (mail, databases, configs)
3. **File permissions critical** for mail delivery
4. **Active production service** - downtime impact high
### Mailcow Testing Approach (Week 49-50)
#### Phase 1: Research (Week 49)
```bash
# 1. Check mailcow documentation
# Search: "user namespace" or "userns-remap"
# URL: https://docs.mailcow.email/
# 2. Check mailcow GitHub issues
# Search for: userns, user namespace, permission issues
# 3. Check mailcow community forum
# URL: https://community.mailcow.email/
# Search for similar implementations
```
#### Phase 2: Mailcow Test Environment (Week 49)
**Option A: Deploy test mailcow on derp**
```bash
# Requires:
# - 4GB+ RAM (derp may be too small)
# - 20GB+ disk space
# - Domain for testing
# Install mailcow on derp
git clone https://github.com/mailcow/mailcow-dockerized
cd mailcow-dockerized
./generate_config.sh
docker-compose up -d
```
**Option B: Clone mymx mailcow config to test environment**
```bash
# Create test VM clone
# Copy mailcow configuration
# Test with user namespaces
```
#### Phase 3: Mailcow Volume Analysis (Week 49)
```bash
# On mymx, identify all volumes
docker volume ls | grep mailcow
# Check critical volumes
docker volume inspect mailcowdockerized_vmail-vol-1
docker volume inspect mailcowdockerized_mysql-vol-1
# Document current permissions
for vol in $(docker volume ls -q | grep mailcow); do
echo "=== $vol ==="
sudo ls -la /var/lib/docker/volumes/$vol/_data/ | head -20
done > /tmp/mailcow-permissions-before.txt
```
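`ls -la | head -20` only samples the top of each volume. A numeric owner summary is easier to compare before and after the change. A sketch, assuming standard `find`, `ls`, and `awk`; `owner_histogram` is a hypothetical helper name.

```bash
# owner_histogram DIR  (hypothetical helper)
# Prints how many files under DIR (including DIR itself) belong to each
# numeric uid:gid pair, most frequent first.
owner_histogram() {
    local dir=$1
    find "$dir" -exec ls -ldn {} + \
        | awk '{ print $3 ":" $4 }' \
        | sort | uniq -c | sort -rn
}

# Example usage (run as root on mymx):
#   owner_histogram /var/lib/docker/volumes/mailcowdockerized_vmail-vol-1/_data
```

Each distinct owner listed before the change needs a corresponding shifted owner afterwards.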
#### Phase 4: Mailcow Implementation (Week 50 - IF testing successful)
**ONLY proceed if:**
- ✅ Testing in dev environment successful
- ✅ pihole implementation successful
- ✅ Mailcow community confirms no known issues
- ✅ Extended maintenance window available (2-4 hours)
- ✅ Full backups completed
- ✅ Rollback tested and confirmed working
**Implementation Steps:**
```bash
# 1. Create snapshot
ansible-playbook playbooks/backup_vm_snapshot.yml \
-e "target_vms=['mymx']" \
-e "snapshot_description='Pre mailcow user namespace'"
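# 2. Backup ALL mailcow data
# The script prompts for a target directory unless MAILCOW_BACKUP_LOCATION is set;
# /var/backups/mailcow is an example path, adjust to your backup location
ansible mymx -m shell -a "cd /opt/mailcow-dockerized && MAILCOW_BACKUP_LOCATION=/var/backups/mailcow ./helper-scripts/backup_and_restore.sh backup all" -b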
# 3. Stop mailcow
ansible mymx -m shell -a "cd /opt/mailcow-dockerized && docker-compose down" -b
# 4. Backup current state
ansible mymx -m shell -a "
sudo tar -czf /root/mailcow-pre-userns-$(date +%s).tar.gz \
/etc/docker \
/opt/mailcow-dockerized \
/var/lib/docker/volumes/mailcow*
" -b
# 5. Configure user namespace
ansible mymx -m shell -a "
sudo cp /etc/docker/daemon.json /etc/docker/daemon.json.backup.$(date +%s)
echo '{\"userns-remap\": \"default\"}' | sudo tee /etc/docker/daemon.json
" -b
# 6. Restart Docker
ansible mymx -m systemd -a "name=docker state=restarted" -b
# 7. Verify Docker started with user namespaces
ansible mymx -m shell -a "docker info | grep -i userns" -b
# 8. Start mailcow (recreates all containers; volumes in the new data root
#    start empty unless the data was migrated into it beforehand)
ansible mymx -m shell -a "cd /opt/mailcow-dockerized && docker-compose up -d" -b
# 9. Monitor startup
watch -n 10 'ansible mymx -m shell -a "cd /opt/mailcow-dockerized && docker-compose ps" -b'
# 10. Check logs for permission errors
ansible mymx -m shell -a "cd /opt/mailcow-dockerized && docker-compose logs --tail 100" -b | grep -i "permission\|denied\|failed"
# 11. Test mail functionality
# - Send test email
# - Receive test email
# - Check webmail access
# - Verify SOGo groupware
# - Test IMAP/SMTP connections
# 12. Monitor for 4-8 hours before declaring success
```
**Known Potential Issues with Mailcow:**
1. **Vmail Volume Permissions**
```bash
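# If mail delivery fails with permission errors
# May need to adjust permissions (LAST RESORT)
# Caution: a blanket chown to 165536 makes every file owned by container root;
# files that belong to a non-root user inside the container need base + that UID
sudo chown -R 165536:165536 /var/lib/docker/165536.165536/volumes/mailcowdockerized_vmail-vol-1/_data/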
```
2. **MySQL Volume Issues**
```bash
# If database won't start
# Check MySQL logs
docker logs mailcowdockerized-mysql-mailcow-1
# May need database permission fixes
# This is why testing is CRITICAL
```
3. **Dovecot Permission Issues**
```bash
# Dovecot is sensitive to mail file permissions
# May require config adjustments in mailcow.conf
```
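When ownership does have to be fixed, shifting each existing owner by the remap base preserves the distinction between container root and service users. The sketch below only prints the commands for review and executes nothing. `shifted_chown_plan` is a hypothetical helper name, the base of 165536 is the guide's assumption, and UID 5000 in the example is purely illustrative.

```bash
# shifted_chown_plan BASE  (hypothetical helper)
# Reads "uid:gid path" lines on stdin and prints the chown commands that
# would shift each owner by BASE. Nothing is executed.
shifted_chown_plan() {
    local base=$1 owner path uid gid
    while IFS=' ' read -r owner path; do
        [ -n "$owner" ] || continue
        uid=${owner%%:*}
        gid=${owner##*:}
        printf 'chown -h %d:%d %q\n' $(( uid + base )) $(( gid + base )) "$path"
    done
}

# Example with an illustrative owner of 5000:5000:
printf '5000:5000 /data/mail\n' | shifted_chown_plan 165536
```

The input lines could come from GNU `find <dir> -printf '%U:%G %p\n'` run against a copy of the pre-remap volume data.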
### Mailcow Rollback Decision Point
**Roll back immediately if:**
- Docker daemon won't start
- MySQL container won't start
- Cannot send/receive mail after 15 minutes
- Permission errors in critical containers
- Data appears missing/inaccessible
**Use VM snapshot restore if:**
- Multiple containers failing
- Data corruption suspected
- Cannot resolve within 30 minutes
---
## Troubleshooting
### Issue 1: Docker Daemon Won't Start
**Symptoms:**
```bash
systemctl status docker
# Failed to start Docker Application Container Engine
```
**Solutions:**
```bash
# Check logs
journalctl -u docker -n 100 --no-pager
# Common causes:
# 1. Invalid daemon.json syntax
cat /etc/docker/daemon.json | jq '.'
# 2. Subuid/subgid not configured
cat /etc/subuid
cat /etc/subgid
# Should have dockremap:165536:65536
# 3. Restore backup
sudo cp /etc/docker/daemon.json.backup.<timestamp> /etc/docker/daemon.json
sudo systemctl start docker
```
### Issue 2: Container Won't Start - Permission Denied
**Symptoms:**
```bash
docker logs <container>
# Permission denied errors
```
**Solutions:**
```bash
# 1. Check volume location
docker volume inspect <volume_name>
# 2. Check permissions on host
sudo ls -la /var/lib/docker/165536.165536/volumes/<volume>/_data/
# 3. If permissions wrong, may need to adjust
# (Avoid this if possible - indicates larger problem)
sudo chown -R 165536:165536 /var/lib/docker/165536.165536/volumes/<volume>/_data/
```
### Issue 3: Bind Mounts Not Working
**Symptoms:**
```bash
docker logs <container>
# Cannot access /bind/mount/path
```
**Solutions:**
```bash
# Bind mounts need host directory permissions adjusted
sudo chown 165536:165536 /path/to/bind/mount
# Or use volumes instead of bind mounts
# Volumes are handled automatically by Docker
```
### Issue 4: Privileged Container Needed
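**Note:** Privileged containers (like mailcow netfilter) cannot run inside a remapped user namespace. They must opt out with `--userns=host` (`userns_mode: "host"` in Compose), which means they run as real host root.
```bash
# Verify the container is privileged and uses the host user namespace
docker inspect <container> | grep -i privileged
# Should show: "Privileged": true
docker inspect --format '{{.HostConfig.UsernsMode}}' <container>
# Should show: host
# This is expected for netfilter, acceptable risk (documented)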
```
---
## Success Criteria
### Testing Phase Success (Before Production)
- [ ] Simple container runs successfully
- [ ] Web server container accessible
- [ ] Database container stores/retrieves data
- [ ] Volume permissions correct (165536 UID)
- [ ] Bind mounts work (if needed)
- [ ] No permission errors in logs
- [ ] Can recreate containers after Docker restart
- [ ] Rollback procedure tested and successful
### Production Implementation Success
#### pihole
- [ ] VM snapshot created
- [ ] Docker daemon running with user namespaces
- [ ] pihole container running
- [ ] DNS queries working
- [ ] No permission errors in logs
- [ ] Monitoring shows normal operation for 24+ hours
#### mymx/mailcow
- [ ] VM snapshot created
- [ ] Docker daemon running with user namespaces
- [ ] All 24 containers running
- [ ] Can send email
- [ ] Can receive email
- [ ] Webmail accessible
- [ ] SOGo groupware working
- [ ] No permission errors in logs
- [ ] Monitoring shows normal operation for 48+ hours
- [ ] Full service verification completed
---
## Decision Tree
```
START: Ready to enable user namespaces?
├─ Testing completed in dev?
│ ├─ NO → STOP: Complete testing first
│ └─ YES → Continue
├─ VM snapshots created?
│ ├─ NO → STOP: Create snapshots first
│ └─ YES → Continue
├─ Rollback procedure reviewed?
│ ├─ NO → STOP: Review rollback docs
│ └─ YES → Continue
├─ Which host?
│ ├─ pihole → Proceed (lower risk)
│ └─ mymx → Additional checks needed
│ │
│ ├─ Mailcow community research done?
│ │ ├─ NO → STOP: Research first
│ │ └─ YES → Continue
│ │
│ ├─ pihole implementation successful?
│ │ ├─ NO → STOP: Fix pihole first
│ │ └─ YES → Continue
│ │
│ ├─ Extended maintenance window?
│ │ ├─ NO → STOP: Schedule proper window
│ │ └─ YES → Proceed with caution
│ │
│ └─ Proceed with mymx (high risk)
```
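The decision tree can also be encoded as a pre-flight gate for use before running the implementation steps. This is a sketch that mirrors the tree above; `userns_gate` is a hypothetical helper name and the yes/no arguments are supplied by the operator, not detected automatically.

```bash
# userns_gate HOST TESTED SNAPSHOT ROLLBACK [RESEARCH PIHOLE_OK WINDOW]
# (hypothetical helper) Every argument after HOST is "yes" or "no".
# Prints the decision and returns non-zero on any STOP.
userns_gate() {
    local host=$1 tested=$2 snapshot=$3 rollback=$4
    local research=${5:-no} pihole_ok=${6:-no} window=${7:-no}
    [ "$tested" = yes ]   || { echo "STOP: Complete testing first"; return 1; }
    [ "$snapshot" = yes ] || { echo "STOP: Create snapshots first"; return 1; }
    [ "$rollback" = yes ] || { echo "STOP: Review rollback docs"; return 1; }
    case $host in
        pihole)
            echo "Proceed (lower risk)" ;;
        mymx)
            [ "$research" = yes ]  || { echo "STOP: Research first"; return 1; }
            [ "$pihole_ok" = yes ] || { echo "STOP: Fix pihole first"; return 1; }
            [ "$window" = yes ]    || { echo "STOP: Schedule proper window"; return 1; }
            echo "Proceed with mymx (high risk)" ;;
        *)
            echo "STOP: Unknown host $host"; return 1 ;;
    esac
}

# Example usage:
#   userns_gate mymx yes yes yes yes yes yes && echo "gate passed"
```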
---
## References
- Docker User Namespace Documentation: https://docs.docker.com/engine/security/userns-remap/
- CIS Docker Benchmark 2.13: Enable user namespace support
- Mailcow Documentation: https://docs.mailcow.email/
- NIST SP 800-190: Section 4.4 - Host OS and multi-tenancy
---
**Next Review:** After testing completion (Week 49)
**Owner:** Infrastructure Security Team