Docker User Namespace Remapping - Testing and Implementation Guide

Document Version: 1.0
Last Updated: 2025-11-11
Risk Level: HIGH
Testing Required: YES (Mandatory in dev/test first)


Table of Contents

  1. Overview
  2. Security Benefits
  3. Prerequisites
  4. Testing Phase (Week 48-49)
  5. Production Implementation (Week 50)
  6. Mailcow-Specific Considerations
  7. Troubleshooting
  8. Success Criteria
  9. Decision Tree

Overview

User namespace remapping is a Docker security feature that maps container UID/GIDs to different values on the host, preventing container root from being host root.

Current Status

Host     User Namespaces   Risk Level   Implementation Priority
pihole   Not configured    MEDIUM       Week 49 (after testing)
mymx     Not configured    HIGH         Week 50 (mailcow complexity)

Impact Assessment

Benefits:

  • Container root ≠ host root (major security improvement)
  • Reduces container escape impact
  • CIS Docker Benchmark compliance (2.13)

Risks:

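  • ⚠️ ALL containers must be recreated (existing images, containers, and named volumes are not visible to the remapped daemon, which uses its own data directory such as /var/lib/docker/165536.165536/)
  • ⚠️ Volume data must be migrated to the new location and its ownership remapped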
  • ⚠️ Breaking change for existing deployments
  • ⚠️ Mailcow may have specific requirements

Recommendation: Test thoroughly in dev, then pihole, then mymx (last)


Security Benefits

Without User Namespace Remapping (Current State)

Container:     Host:
UID 0 (root) → UID 0 (root)     ❌ DANGEROUS
UID 1000     → UID 1000

Problem: Container root is host root, so a container escape yields full host root privileges.

With User Namespace Remapping (Target State)

Container:     Host:
UID 0 (root) → UID 165536       ✅ SAFE
UID 1000     → UID 166536

Benefit: Container root maps to an unprivileged user on the host.
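The mapping shown above is plain offset arithmetic: the host UID is the start of the subordinate range plus the container UID, and it is only defined while the container UID is smaller than the range length. A minimal sketch follows; `host_uid` is a hypothetical helper, and 165536 and 65536 are the example range values used in this guide, not values guaranteed on every host.

```shell
#!/bin/sh
# host_uid BASE COUNT CONTAINER_UID
# Prints the host UID that a container UID maps to under userns-remap.
# BASE and COUNT are the start and length of the dockremap range
# (hypothetical helper; the values below are this guide's examples).
host_uid() {
  base=$1 count=$2 cuid=$3
  if [ "$cuid" -lt 0 ] || [ "$cuid" -ge "$count" ]; then
    echo "container UID $cuid is outside the mapped range" >&2
    return 1
  fi
  echo $(($base + $cuid))
}

host_uid 165536 65536 0      # container root  -> 165536
host_uid 165536 65536 1000   # container 1000  -> 166536
```

The same arithmetic applies to GIDs using the range from /etc/subgid.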


Prerequisites

Before Starting Testing

  1. VM Snapshots Created

    ansible-playbook playbooks/backup_vm_snapshot.yml \
      -e "target_vms=['pihole', 'mymx']"
    
  2. Rollback Procedures Reviewed

    • Read: docs/runbooks/docker-configuration-rollback.md
    • Understand VM snapshot restore process
    • Have emergency contact information ready
  3. Maintenance Window Scheduled

    • Duration: 2-3 hours for testing
    • Low-traffic period recommended
    • Second person available for verification
  4. Documentation Ready

    • This guide printed or accessible offline
    • Docker and mailcow documentation available
    • Notepad for documenting issues

Testing Phase (Week 48-49)

Phase 1: Test Environment Setup (Week 48)

Objective: Validate user namespace remapping with a simple container

Option A: Use the dedicated test VM (derp)

# 1. Start derp VM (if stopped)
ssh grokbox "sudo virsh start derp"

# 2. Create ansible user and configure SSH
# (Use deploy_linux_vm role or manual setup)

# 3. Install Docker
ansible derp -m apt -a "name=docker.io state=present" -b

# 4. Create snapshot before testing
ansible-playbook playbooks/backup_vm_snapshot.yml \
  -e "target_vms=['derp']"

Option B: Create temporary test container on existing host

# On pihole (low risk - only 1 container)
# Create test container first

docker run -d --name userns-test \
  -v test-volume:/data \
  alpine:latest sleep infinity

Phase 2: Enable User Namespace Remapping (Week 48)

Step 1: Configure Docker Daemon

# On test host (derp or pihole)
# Note: this overwrites any existing daemon.json; back it up first
sudo tee /etc/docker/daemon.json <<EOF
{
  "userns-remap": "default"
}
EOF

# Validate syntax
jq '.' /etc/docker/daemon.json
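Writing the file with a heredoc replaces whatever daemon.json contained before. Where a host already carries daemon settings, merging the key may be safer. The following is a sketch only, assuming jq is installed (this guide already uses it for validation); `merge_userns` is a hypothetical helper name.

```shell
#!/bin/sh
# merge_userns FILE
# Prints FILE's JSON with "userns-remap": "default" added, keeping every
# other key. A missing file is treated as an empty configuration.
# (Hypothetical helper; requires jq.)
merge_userns() {
  if [ -f "$1" ]; then
    jq '. + {"userns-remap": "default"}' "$1"
  else
    jq -n '{"userns-remap": "default"}'
  fi
}

# Usage: write to a temporary file first so a jq error never truncates
# the live configuration.
#   merge_userns /etc/docker/daemon.json > /tmp/daemon.json.new \
#     && sudo install -m 0644 /tmp/daemon.json.new /etc/docker/daemon.json
```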

Step 2: Restart Docker

# Stop all containers first (xargs -r skips the call when none are running)
docker ps -q | xargs -r docker stop

# Restart Docker daemon
sudo systemctl restart docker

# Verify it started
sudo systemctl status docker

# Check for user namespace in docker info
docker info | grep -i "userns"
# Should show "userns" in the Security Options list

Step 3: Verify UID Mapping

# Check subuid/subgid configuration
cat /etc/subuid
cat /etc/subgid

# Should show something like:
# dockremap:165536:65536

# Verify Docker is using remapping
docker info --format '{{.SecurityOptions}}'
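The two checks in Steps 2 and 3 can be scripted so they give a pass/fail result instead of output to eyeball. This is a sketch under assumptions: `subuid_entry` and `has_userns` are hypothetical helper names, and the SecurityOptions string is assumed to contain `name=userns` when remapping is active.

```shell
#!/bin/sh
# subuid_entry NAME [FILE]
# Prints "START COUNT" for NAME's first entry in a subuid-style file
# (defaults to /etc/subuid). Fails if NAME has no entry.
subuid_entry() {
  awk -F: -v name="$1" '$1 == name { print $2, $3; found = 1; exit }
                        END { exit !found }' "${2:-/etc/subuid}"
}

# has_userns SECURITY_OPTIONS
# Succeeds if the string printed by
#   docker info --format '{{.SecurityOptions}}'
# lists the userns option (assumed to appear as "name=userns").
has_userns() {
  case "$1" in
    *name=userns*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage on the test host:
#   has_userns "$(docker info --format '{{.SecurityOptions}}')" \
#     && subuid_entry dockremap
```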

Step 4: Recreate Test Container

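# The container created before the change lives in the old data root and is
# no longer visible to the daemon, so there is normally nothing to remove
docker rm userns-test 2>/dev/null || true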

# Recreate container
docker run -d --name userns-test \
  -v test-volume:/data \
  alpine:latest sleep infinity

# Verify it's running
docker ps | grep userns-test

Step 5: Test Volume Permissions

# Create test file in container
docker exec userns-test sh -c 'echo "test" > /data/test.txt'

# Check file ownership on host
# Volume location changed! It's now in:
sudo ls -la /var/lib/docker/165536.165536/volumes/test-volume/_data/

# UID should be 165536 (remapped root)

# Test read/write in container
docker exec userns-test cat /data/test.txt
docker exec userns-test sh -c 'echo "test2" >> /data/test.txt'
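Checking ownership by eye gets tedious once a volume holds more than a few files. The sketch below lists anything under a directory whose owner falls outside the remapped range. It assumes GNU find (for `-printf`); `list_unmapped` is a hypothetical helper, and the range values must come from the host's own /etc/subuid.

```shell
#!/bin/sh
# list_unmapped DIR BASE COUNT
# Prints "UID PATH" for every entry under DIR whose owner lies outside
# the host range [BASE, BASE+COUNT). No output means every entry is
# owned by a remapped UID. (Hypothetical helper; needs GNU find.)
list_unmapped() {
  find "$1" -printf '%U %p\n' |
    awk -v base="$2" -v count="$3" '$1 < base || $1 >= base + count'
}

# Usage (run as root, since /var/lib/docker is not world-readable):
#   list_unmapped /var/lib/docker/165536.165536/volumes/test-volume/_data 165536 65536
```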

Phase 3: Test with Real Application (Week 48-49)

Test Scenario 1: Simple Web Server (pihole preparation)

# Deploy nginx with volume
docker run -d --name test-nginx \
  -p 8080:80 \
  -v nginx-data:/usr/share/nginx/html \
  nginx:alpine

# Test access
curl http://localhost:8080

# Create content
docker exec test-nginx sh -c 'echo "<h1>User Namespace Test</h1>" > /usr/share/nginx/html/test.html'

# Verify access
curl http://localhost:8080/test.html

# Check logs
docker logs test-nginx

Test Scenario 2: Database Container (mailcow preparation)

# Deploy MariaDB with volume
docker run -d --name test-db \
  -e MYSQL_ROOT_PASSWORD=testpass123 \
  -v mysql-data:/var/lib/mysql \
  mariadb:10.11

# Wait for startup
sleep 30

# Test database
docker exec test-db mysql -ptestpass123 -e "SHOW DATABASES;"

# Create test database
docker exec test-db mysql -ptestpass123 -e "CREATE DATABASE testdb;"

# Stop and restart to test persistence
docker stop test-db
docker start test-db
sleep 20

# Verify data persisted
docker exec test-db mysql -ptestpass123 -e "SHOW DATABASES;" | grep testdb
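The fixed `sleep 30` and `sleep 20` waits in this scenario can be too short on a slow test VM and needlessly long on a fast one. A polling loop is one alternative. This is a minimal sketch; `wait_for` is a hypothetical helper, and the 60-second timeout in the usage note is an arbitrary example.

```shell
#!/bin/sh
# wait_for TIMEOUT_SECONDS COMMAND [ARGS...]
# Runs COMMAND roughly once per second until it succeeds or
# TIMEOUT_SECONDS attempts have failed. (Hypothetical helper.)
wait_for() {
  timeout=$1
  shift
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    elapsed=$(($elapsed + 1))
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep 1
  done
  return 0
}

# Usage against the test database from this scenario:
#   wait_for 60 docker exec test-db mysql -ptestpass123 -e "SELECT 1" \
#     || echo "database did not become ready" >&2
```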

Test Scenario 3: Application with File Uploads

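# Create upload directory and hand it to the remapped root UID,
# otherwise container root cannot write to it
mkdir -p /tmp/test-uploads
sudo chown 165536:165536 /tmp/test-uploads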

# Run container with bind mount
docker run -d --name test-upload \
  -v /tmp/test-uploads:/uploads \
  alpine:latest sleep infinity

# Test file creation
docker exec test-upload sh -c 'echo "test" > /uploads/test.txt'

# Check host permissions
ls -la /tmp/test-uploads/
# File should be owned by UID 165536

# Test file access from container
docker exec test-upload cat /uploads/test.txt

Phase 4: Identify Issues (Week 48-49)

Common Issues to Check

  1. Permission Denied Errors

    # Check container logs
    docker logs <container_name> 2>&1 | grep -i "permission"
    
  2. Volume Mount Failures

    # List volumes
    docker volume ls
    
    # Inspect volume
    docker volume inspect <volume_name>
    
    # Check actual location on disk
    sudo ls -la /var/lib/docker/*/volumes/
    
  3. Bind Mount Issues

    # For bind mounts, may need to adjust host permissions
    # Example: Allow remapped UID to write
    sudo chown 165536:165536 /path/to/host/dir
    
  4. Privileged Container Conflicts

    # Privileged mode is incompatible with userns-remap unless the
    # container opts out of remapping with --userns=host
    docker run --rm --privileged --userns=host alpine:latest id
    # Note: such containers run as real host root (no remapping)
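The log check from item 1 can be turned into a count so results are easy to record in the test log. This is a sketch only; `count_permission_errors` is a hypothetical helper, and the two phrases it matches are common kernel error strings rather than an exhaustive list.

```shell
#!/bin/sh
# count_permission_errors
# Reads log text on stdin and prints how many lines mention a permission
# problem (case-insensitive). Always exits 0 so it is safe in pipelines.
# (Hypothetical helper; the matched phrases are an assumption.)
count_permission_errors() {
  grep -ciE 'permission denied|operation not permitted' || true
}

# Usage for a single container:
#   docker logs <container_name> 2>&1 | count_permission_errors
```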
    

Document All Findings

Create test log:

## User Namespace Remapping Test Log

Date: <date>
Host: <hostname>
Docker Version: <version>

### Test 1: Simple Container
- Result: PASS/FAIL
- Issues: <none or list>
- Notes: <observations>

### Test 2: Web Server
- Result: PASS/FAIL
- Issues: <none or list>
- Notes: <observations>

### Test 3: Database
- Result: PASS/FAIL
- Issues: <none or list>
- Notes: <observations>

### Conclusion
Ready for production: YES/NO
Blockers: <list if any>

Production Implementation (Week 50)

Implementation Order

  1. pihole (Week 49 end / Week 50 start) - Lowest risk
  2. mymx (Week 50 end) - Highest risk, requires mailcow-specific testing

pihole Implementation

Prerequisites:

  • Testing completed successfully on derp/test environment
  • VM snapshot created
  • Maintenance window scheduled
  • Rollback procedure reviewed

Steps:

# 1. Create snapshot
ansible-playbook playbooks/backup_vm_snapshot.yml \
  -e "target_vms=['pihole']" \
  -e "snapshot_description='Pre user namespace implementation'"

# 2. Backup current configuration
ansible pihole -m shell -a "sudo cp /etc/docker/daemon.json /etc/docker/daemon.json.backup.$(date +%s)" -b

# 3. Stop pihole container
ansible pihole -m shell -a "docker stop pihole" -b

# 4. Configure user namespace remapping
# (this replaces the whole file; merge by hand if daemon.json holds other settings)
ansible pihole -m copy -b -a "
  dest=/etc/docker/daemon.json
  content='{\"userns-remap\": \"default\"}'
  owner=root
  group=root
  mode='0644'
"

# 5. Restart Docker
ansible pihole -m systemd -a "name=docker state=restarted" -b

# 6. Verify Docker started
ansible pihole -m shell -a "docker info | grep -i userns" -b

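# 7. Recreate pihole container (adjust based on actual deployment)
# Images are pulled again and named volumes start empty under the remapped
# data root; copy existing volume data over before starting
# If using docker run command, re-run it
# If using docker-compose, run: docker-compose up -d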

# 8. Verify pihole is working
ansible pihole -m shell -a "docker ps" -b
ansible pihole -m shell -a "docker logs pihole --tail 50" -b

# 9. Test DNS functionality
dig @192.168.122.12 google.com

# 10. Monitor for 1 hour
watch -n 60 'ansible pihole -m shell -a "docker ps" -b'
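The DNS test in step 9 prints a full dig response that someone has to read. For unattended monitoring, the answer can be reduced to a pass/fail check. This is a sketch; `has_ipv4_answer` is a hypothetical helper, and it only checks that at least one line of `dig +short` output looks like an IPv4 address.

```shell
#!/bin/sh
# has_ipv4_answer DIG_SHORT_OUTPUT
# Succeeds if at least one line of the given text is a dotted-quad
# IPv4 address. (Hypothetical helper; no range check on the octets.)
has_ipv4_answer() {
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# Usage with the resolver address from step 9:
#   has_ipv4_answer "$(dig +short @192.168.122.12 google.com A)" \
#     || echo "pihole did not return an A record" >&2
```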

Rollback if Issues:

# Follow docs/runbooks/docker-configuration-rollback.md
# Procedure 3: User Namespace Remapping Rollback

Mailcow-Specific Considerations

Why Mailcow is Complex

  1. Multiple interconnected containers (24 containers)
  2. Persistent data in multiple volumes (mail, databases, configs)
  3. File permissions critical for mail delivery
  4. Active production service - downtime impact high

Mailcow Testing Approach (Week 49-50)

Phase 1: Research (Week 49)

# 1. Check mailcow documentation
# Search: "user namespace" or "userns-remap"
# URL: https://docs.mailcow.email/

# 2. Check mailcow GitHub issues
# Search for: userns, user namespace, permission issues

# 3. Check mailcow community forum
# URL: https://community.mailcow.email/
# Search for similar implementations

Phase 2: Mailcow Test Environment (Week 49)

Option A: Deploy test mailcow on derp

# Requires:
# - 4GB+ RAM (derp may be too small)
# - 20GB+ disk space
# - Domain for testing

# Install mailcow on derp
git clone https://github.com/mailcow/mailcow-dockerized
cd mailcow-dockerized
./generate_config.sh
docker-compose up -d

Option B: Clone mymx mailcow config to test environment

# Create test VM clone
# Copy mailcow configuration
# Test with user namespaces

Phase 3: Mailcow Volume Analysis (Week 49)

# On mymx, identify all volumes
docker volume ls | grep mailcow

# Check critical volumes
docker volume inspect mailcowdockerized_vmail-vol-1
docker volume inspect mailcowdockerized_mysql-vol-1

# Document current permissions
for vol in $(docker volume ls -q | grep mailcow); do
  echo "=== $vol ==="
  sudo ls -la /var/lib/docker/volumes/$vol/_data/ | head -20
done > /tmp/mailcow-permissions-before.txt
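The listing above records where the volumes live before the change. Once remapping is enabled, the daemon works from a subdirectory named after the remapped UID and GID, as the test-volume path in the testing phase shows. A small sketch for deriving the new location of each volume follows; `remapped_volume_path` is a hypothetical helper, and the IDs must match the host's dockremap range.

```shell
#!/bin/sh
# remapped_volume_path VOLUME UID GID [DOCKER_ROOT]
# Prints where a named volume's data directory lives once userns-remap
# is active with the given remapped root UID and GID.
# (Hypothetical helper; DOCKER_ROOT defaults to /var/lib/docker.)
remapped_volume_path() {
  printf '%s/%s.%s/volumes/%s/_data\n' "${4:-/var/lib/docker}" "$2" "$3" "$1"
}

remapped_volume_path mailcowdockerized_vmail-vol-1 165536 165536
```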

Phase 4: Mailcow Implementation (Week 50 - IF testing successful)

ONLY proceed if:

  • Testing in dev environment successful
  • pihole implementation successful
  • Mailcow community confirms no known issues
  • Extended maintenance window available (2-4 hours)
  • Full backups completed
  • Rollback tested and confirmed working

Implementation Steps:

# 1. Create snapshot
ansible-playbook playbooks/backup_vm_snapshot.yml \
  -e "target_vms=['mymx']" \
  -e "snapshot_description='Pre mailcow user namespace'"

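# 2. Backup ALL mailcow data
# (set MAILCOW_BACKUP_LOCATION so the script does not prompt for a target directory)
ansible mymx -m shell -a "cd /opt/mailcow-dockerized && MAILCOW_BACKUP_LOCATION=<backup_dir> ./helper-scripts/backup_and_restore.sh backup all" -b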

# 3. Stop mailcow
ansible mymx -m shell -a "cd /opt/mailcow-dockerized && docker-compose down" -b

# 4. Backup current state
ansible mymx -m shell -a "
  sudo tar -czf /root/mailcow-pre-userns-$(date +%s).tar.gz \
    /etc/docker \
    /opt/mailcow-dockerized \
    /var/lib/docker/volumes/mailcow*
" -b

# 5. Configure user namespace
ansible mymx -m shell -a "
  sudo cp /etc/docker/daemon.json /etc/docker/daemon.json.backup.$(date +%s)
  echo '{\"userns-remap\": \"default\"}' | sudo tee /etc/docker/daemon.json
" -b

# 6. Restart Docker
ansible mymx -m systemd -a "name=docker state=restarted" -b

# 7. Verify Docker started with user namespaces
ansible mymx -m shell -a "docker info | grep -i userns" -b

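# 8. Start mailcow (re-pulls images and recreates all containers)
# Named volumes start empty under the remapped data root; restore or
# migrate the mail and database data before relying on the service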
ansible mymx -m shell -a "cd /opt/mailcow-dockerized && docker-compose up -d" -b

# 9. Monitor startup
watch -n 10 'ansible mymx -m shell -a "cd /opt/mailcow-dockerized && docker-compose ps" -b'

# 10. Check logs for permission errors
ansible mymx -m shell -a "cd /opt/mailcow-dockerized && docker-compose logs --tail 100" -b | grep -i "permission\|denied\|failed"

# 11. Test mail functionality
# - Send test email
# - Receive test email
# - Check webmail access
# - Verify SOGo groupware
# - Test IMAP/SMTP connections

# 12. Monitor for 4-8 hours before declaring success

Known Potential Issues with Mailcow:

  1. Vmail Volume Permissions

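    # If mail delivery fails with permission errors
    # May need to adjust permissions (LAST RESORT)
    # Keep each in-container owner: host ID = 165536 + container ID.
    # Do NOT chown everything to 165536 (container root). Example for files
    # owned by container UID/GID 5000 (verify the vmail ID in your deployment):
    sudo chown -R 170536:170536 /var/lib/docker/165536.165536/volumes/mailcowdockerized_vmail-vol-1/_data/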
    
  2. MySQL Volume Issues

    # If database won't start
    # Check MySQL logs
    docker logs mailcowdockerized-mysql-mailcow-1
    
    # May need database permission fixes
    # This is why testing is CRITICAL
    
  3. Dovecot Permission Issues

    # Dovecot is sensitive to mail file permissions
    # May require config adjustments in mailcow.conf
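Mail and database volumes hold files owned by several different in-container users, so a blanket recursive chown to one ID would break them. If data copied from a pre-remap volume has to be re-owned, every owner needs the same offset added. The sketch below is a dry run that only prints the commands. It assumes GNU find, does not handle file names containing newlines or single quotes, and must only be used on data that still carries its original (unshifted) owners; `plan_remap_chown` is a hypothetical helper.

```shell
#!/bin/sh
# plan_remap_chown DIR BASE
# Dry run: prints one chown command per entry under DIR that shifts the
# current owner and group by BASE, preserving each in-container ID.
# Nothing is changed; review the output before running any of it.
# (Hypothetical helper; needs GNU find.)
plan_remap_chown() {
  find "$1" -printf '%U %G %p\n' |
    while read -r uid gid path; do
      printf "chown -h %s:%s '%s'\n" "$(($uid + $2))" "$(($gid + $2))" "$path"
    done
}

# Usage (as root), saving the plan for review:
#   plan_remap_chown /path/to/copied/volume/_data 165536 > /tmp/chown-plan.sh
```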
    

Mailcow Rollback Decision Point

Roll back immediately if:

  • Docker daemon won't start
  • MySQL container won't start
  • Cannot send/receive mail after 15 minutes
  • Permission errors in critical containers
  • Data appears missing/inaccessible

Use VM snapshot restore if:

  • Multiple containers failing
  • Data corruption suspected
  • Cannot resolve within 30 minutes

Troubleshooting

Issue 1: Docker Daemon Won't Start

Symptoms:

systemctl status docker
# Failed to start Docker Application Container Engine

Solutions:

# Check logs
journalctl -u docker -n 100 --no-pager

# Common causes:
# 1. Invalid daemon.json syntax
cat /etc/docker/daemon.json | jq '.'

# 2. Subuid/subgid not configured
cat /etc/subuid
cat /etc/subgid
# Should have dockremap:165536:65536

# 3. Restore backup
sudo cp /etc/docker/daemon.json.backup.<timestamp> /etc/docker/daemon.json
sudo systemctl start docker
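The restore step needs the `<timestamp>` of the most recent backup. Since the backups in this guide are suffixed with `date +%s`, the newest one is the file with the largest numeric suffix. This is a sketch; `latest_backup` is a hypothetical helper and assumes the `daemon.json.backup.<epoch>` naming used above.

```shell
#!/bin/sh
# latest_backup [DIR]
# Prints the daemon.json.backup.<epoch> file in DIR (default /etc/docker)
# with the largest epoch suffix. Prints nothing if no backup exists.
# (Hypothetical helper.)
latest_backup() {
  dir=${1:-/etc/docker}
  for f in "$dir"/daemon.json.backup.*; do
    [ -e "$f" ] || continue
    printf '%s %s\n' "${f##*.}" "$f"
  done | sort -n | tail -n 1 | cut -d' ' -f2-
}

# Usage:
#   backup=$(latest_backup) && [ -n "$backup" ] \
#     && sudo cp "$backup" /etc/docker/daemon.json
```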

Issue 2: Container Won't Start - Permission Denied

Symptoms:

docker logs <container>
# Permission denied errors

Solutions:

# 1. Check volume location
docker volume inspect <volume_name>

# 2. Check permissions on host
sudo ls -la /var/lib/docker/165536.165536/volumes/<volume>/_data/

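# 3. If permissions wrong, may need to adjust
# (Avoid this if possible - indicates larger problem)
# 165536 is container root; for data owned by another container UID,
# use 165536 + that UID instead
sudo chown -R 165536:165536 /var/lib/docker/165536.165536/volumes/<volume>/_data/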

Issue 3: Bind Mounts Not Working

Symptoms:

docker logs <container>
# Cannot access /bind/mount/path

Solutions:

# Bind mounts need host directory permissions adjusted
sudo chown 165536:165536 /path/to/bind/mount

# Or use volumes instead of bind mounts
# Volumes are handled automatically by Docker

Issue 4: Privileged Container Needed

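Note: Privileged containers (like mailcow netfilter) are not remapped. With userns-remap enabled, Docker refuses to start a privileged container unless it opts out with --userns=host (userns_mode: "host" in Compose). The same applies to containers that use --network=host or --pid=host.

# Verify the container is privileged and opted out of remapping
docker inspect <container> | grep -iE '"Privileged"|"UsernsMode"'
# Should show: "Privileged": true and "UsernsMode": "host"

# Such containers run as actual host root (userns not applied)
# This is expected for netfilter, acceptable risk (documented)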

Success Criteria

Testing Phase Success (Before Production)

  • Simple container runs successfully
  • Web server container accessible
  • Database container stores/retrieves data
  • Volume permissions correct (165536 UID)
  • Bind mounts work (if needed)
  • No permission errors in logs
  • Can recreate containers after Docker restart
  • Rollback procedure tested and successful

Production Implementation Success

pihole

  • VM snapshot created
  • Docker daemon running with user namespaces
  • pihole container running
  • DNS queries working
  • No permission errors in logs
  • Monitoring shows normal operation for 24+ hours

mymx/mailcow

  • VM snapshot created
  • Docker daemon running with user namespaces
  • All 24 containers running
  • Can send email
  • Can receive email
  • Webmail accessible
  • SOGo groupware working
  • No permission errors in logs
  • Monitoring shows normal operation for 48+ hours
  • Full service verification completed

Decision Tree

START: Ready to enable user namespaces?
│
├─ Testing completed in dev?
│  ├─ NO → STOP: Complete testing first
│  └─ YES → Continue
│
├─ VM snapshots created?
│  ├─ NO → STOP: Create snapshots first
│  └─ YES → Continue
│
├─ Rollback procedure reviewed?
│  ├─ NO → STOP: Review rollback docs
│  └─ YES → Continue
│
├─ Which host?
│  ├─ pihole → Proceed (lower risk)
│  └─ mymx → Additional checks needed
│     │
│     ├─ Mailcow community research done?
│     │  ├─ NO → STOP: Research first
│     │  └─ YES → Continue
│     │
│     ├─ pihole implementation successful?
│     │  ├─ NO → STOP: Fix pihole first
│     │  └─ YES → Continue
│     │
│     ├─ Extended maintenance window?
│     │  ├─ NO → STOP: Schedule proper window
│     │  └─ YES → Proceed with caution
│     │
│     └─ Proceed with mymx (high risk)
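
The tree above can also be expressed as a pre-flight gate in a wrapper script, so a rollout refuses to start while any check is still open. This is a sketch only; `preflight` is a hypothetical helper, and each answer is passed in by the operator as "yes" or "no" rather than detected automatically.

```shell
#!/bin/sh
# preflight HOST TESTING SNAPSHOT ROLLBACK [RESEARCH PIHOLE_OK WINDOW]
# Each flag is "yes" or "no". Prints the first blocking step from the
# decision tree and fails, or prints PROCEED and succeeds.
# (Hypothetical helper; the last three flags are only read for mymx.)
preflight() {
  host=$1
  [ "$2" = yes ] || { echo "STOP: Complete testing first"; return 1; }
  [ "$3" = yes ] || { echo "STOP: Create snapshots first"; return 1; }
  [ "$4" = yes ] || { echo "STOP: Review rollback docs"; return 1; }
  if [ "$host" = mymx ]; then
    [ "${5:-no}" = yes ] || { echo "STOP: Research first"; return 1; }
    [ "${6:-no}" = yes ] || { echo "STOP: Fix pihole first"; return 1; }
    [ "${7:-no}" = yes ] || { echo "STOP: Schedule proper window"; return 1; }
    echo "PROCEED with caution (high risk)"
  else
    echo "PROCEED (lower risk)"
  fi
}

# Usage:
#   preflight pihole yes yes yes && echo "starting pihole rollout"
```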


Document Version: 1.0
Next Review: After testing completion (Week 49)
Owner: Infrastructure Security Team