jmp-stack/.agents/skills/container-infrastructure-ops/SKILL.md
2026-01-10 23:34:39 +01:00


---
name: container-infrastructure-ops
description: Maintains, troubleshoots, and optimizes containerized infrastructure using Docker, Docker Compose, Kubernetes, Helm, and CI/CD pipelines. Enables system stability, security, reproducibility, and clear technical execution. Use for deployment operations, container management, networking, storage, secrets management, monitoring, and infrastructure troubleshooting.
---

Container Infrastructure Operations

Comprehensive skill for managing and maintaining software stacks hosted on containerized infrastructure.

Core Capabilities

  • Docker & Docker Compose: Service orchestration, container lifecycle, volume management, networking
  • Kubernetes & Helm: Cluster operations, deployment manifests, package management, upgrades
  • CI/CD Pipelines: GitLab CI, GitHub Actions, and runner configuration
  • Networking & Routing: Reverse proxies (Traefik, nginx), TLS/HTTPS, service discovery
  • Storage & Data: Volume mounting, backup/restore, database operations, data persistence
  • Security: Secrets management, access control, network policies, RBAC
  • Monitoring & Logging: Health checks, log aggregation, observability
  • Troubleshooting: Container debugging, resource issues, log analysis, dependency resolution

Operational Workflows

1. Service Startup & Deployment

Docker Compose:

# Start all services
docker compose up -d

# Start specific service
docker compose up -d [service_name]

# Build and start
docker compose up --build -d

# With environment file
docker compose --env-file .env up -d
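A minimal bash sketch of a start-and-verify step. The helper name compose_up_verified is hypothetical, and it assumes a Docker Compose v2 release that supports up --wait (which blocks until services are running or healthy). On failure it prints status and recent logs, which is the starting point for the troubleshooting workflow.

```bash
# Hypothetical helper: start the given Compose services (or all of them),
# wait until they are running/healthy, and show diagnostics on failure.
# Assumes Docker Compose v2 with support for `up --wait`.
compose_up_verified() {
  if docker compose up -d --wait "$@"; then
    echo "OK: services are up${1:+: $*}"
    return 0
  fi
  echo "FAILED: services did not reach a running/healthy state" >&2
  docker compose ps -a "$@" >&2               # include stopped/exited containers
  docker compose logs --tail=50 "$@" >&2      # last 50 lines per service
  return 1
}
```

Example: compose_up_verified gitea traefik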

Kubernetes:

# Apply manifest
kubectl apply -f deployment.yaml

# Rolling update
kubectl set image deployment/[name] [container]=[image]:[tag]

# Check rollout status
kubectl rollout status deployment/[name]
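The set image, rollout status, and rollback steps can be combined so that a stalled rollout is undone automatically. This is a sketch: the helper name rollout_image and the 180 s timeout are assumptions; kubectl rollout undo returns the Deployment to its previous revision.

```bash
# Hypothetical helper: update a Deployment's image, wait for the rollout,
# and undo it if it does not complete within the timeout.
# Arguments: deployment, container, image:tag, [namespace].
rollout_image() {
  local deploy=$1 container=$2 image=$3 ns=${4:-default}
  kubectl -n "$ns" set image "deployment/$deploy" "$container=$image" || return 1
  if kubectl -n "$ns" rollout status "deployment/$deploy" --timeout=180s; then
    echo "Rollout of $deploy to $image complete"
  else
    echo "Rollout of $deploy stalled; rolling back" >&2
    kubectl -n "$ns" rollout undo "deployment/$deploy"
    return 1
  fi
}
```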

Helm:

# Install release
helm install [release-name] [chart] -f values.yaml

# Upgrade existing release
helm upgrade [release-name] [chart] -f values.yaml

# Roll back (omit [revision] to return to the previous revision)
helm rollback [release-name] [revision]
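A sketch of an install-or-upgrade step that falls back to the previous revision, assuming Helm 3 flag names. The helper name helm_deploy and the 5-minute timeout are hypothetical. If the very first install fails there is no earlier revision, so the rollback is reported as impossible instead.

```bash
# Hypothetical helper: install or upgrade a release, waiting for readiness;
# on failure, roll back to the previous revision if one exists.
# Arguments: release, chart, values file, [namespace].
helm_deploy() {
  local release=$1 chart=$2 values=$3 ns=${4:-default}
  if helm upgrade --install "$release" "$chart" -f "$values" \
       -n "$ns" --wait --timeout 5m; then
    echo "Release $release deployed"
    return 0
  fi
  echo "Upgrade of $release failed" >&2
  # Without a revision argument, helm rollback targets the previous revision.
  helm rollback "$release" -n "$ns" --wait || echo "No revision to roll back to" >&2
  return 1
}
```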

2. Service Inspection & Monitoring

Docker Compose:

# View running services
docker compose ps

# View logs (follow)
docker compose logs -f [service_name]

# View logs with time range
docker compose logs --since 10m [service_name]

# Inspect container stats
docker stats [container_id]

Kubernetes:

# List resources
kubectl get pods -n [namespace]
kubectl get svc -n [namespace]

# Describe resource (detailed info)
kubectl describe pod [pod_name] -n [namespace]

# View logs
kubectl logs [pod_name] -n [namespace]
kubectl logs -f [pod_name] -n [namespace]  # Follow

# Watch resources in real-time
kubectl get pods -w -n [namespace]
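When a namespace has many pods, it helps to list only the ones that need attention. This sketch (hypothetical helper unhealthy_pods) uses a field selector on pod phase, then checks the READY column of Running pods, which catches crash-looping containers whose pod is still in the Running phase. Completed Job pods (phase Succeeded) also appear in the first list.

```bash
# Hypothetical helper: show pods that are not Running, plus Running pods
# whose containers are not all ready. Argument: [namespace].
unhealthy_pods() {
  local ns=${1:-default}
  kubectl get pods -n "$ns" --field-selector=status.phase!=Running --no-headers
  # READY column ($2) is "ready/total"; print Running pods where ready < total.
  kubectl get pods -n "$ns" --field-selector=status.phase=Running --no-headers |
    awk '{ split($2, r, "/"); if (r[1] != r[2]) print }'
}
```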

3. Environment & Configuration Management

Load environment variables:

# From .env file
set -a
source .env
set +a

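# Apply to a single command only (the subshell leaves the current shell unchanged;
# .env must use shell-compatible KEY=value syntax)
(set -a; . ./.env; set +a; docker compose up -d)

# Note: docker compose also reads ./.env from the project directory for variable substitution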
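Because .env is untracked and .env.example is the tracked template, a common failure is a key that was added to the template but never filled in locally. A bash sketch of that check, assuming simple KEY=value lines (optionally prefixed with export); the helper names are hypothetical.

```bash
# Hypothetical helper: print KEY names assigned in a file, one per line, sorted.
_env_keys() {
  sed -nE 's/^[[:space:]]*(export[[:space:]]+)?([A-Za-z_][A-Za-z0-9_]*)=.*/\2/p' "$1" |
    LC_ALL=C sort -u
}

# Hypothetical helper: report keys present in .env.example but missing from .env.
# Comments and blank lines are ignored because they do not match KEY=.
missing_env_keys() {
  local example=${1:-.env.example} envfile=${2:-.env}
  LC_ALL=C comm -23 <(_env_keys "$example") <(_env_keys "$envfile")
}
```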

Manage secrets:

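# Docker Swarm only (docker secret requires swarm mode)
docker secret create [name] /path/to/secret

# Docker Compose: declare a top-level secrets: entry with file: /path/to/secret;
# services that list the secret can read it at /run/secrets/[name]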

# Kubernetes
kubectl create secret generic [name] --from-file=key=/path/to/secret
kubectl create secret docker-registry [name] --docker-server=[url] --docker-username=[user] --docker-password=[password]
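Secret files should not be world-readable on the host before they are handed to Compose or Kubernetes. This sketch (hypothetical helper names) writes a random value with umask 077 so the file is created with mode 0600, then renders a Secret manifest with --dry-run=client instead of applying it, so the result can be reviewed or encrypted first.

```bash
# Hypothetical helper: write 32 random bytes, base64-encoded, to a file
# readable only by the current user (umask is set inside a subshell).
make_secret_file() {
  local out=$1
  ( umask 077 && head -c 32 /dev/urandom | base64 | tr -d '\n' > "$out" )
}

# Hypothetical helper: render (not apply) a generic Secret from that file.
# Arguments: file, secret name, [key name].
render_k8s_secret() {
  local file=$1 name=$2 key=${3:-password}
  kubectl create secret generic "$name" --from-file="$key=$file" \
    --dry-run=client -o yaml
}
```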

4. Troubleshooting Workflow

Container health check:

  1. Verify container is running: docker compose ps or kubectl get pods
  2. Check logs: docker compose logs [service] or kubectl logs [pod]
  3. Inspect configuration: Check environment variables, mounted volumes, network connectivity
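  4. Test connectivity: docker compose exec [service] curl http://[other_service]:[port] or kubectl exec [pod] -- curl http://[service]:[port] (requires curl in the image)
  5. Resource analysis: docker stats or kubectl top pods (requires metrics-server in the cluster)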
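The five health-check steps above can be run in one pass for a Compose service, so the evidence is collected in a single output. A bash sketch; the helper name triage_service and the optional URL argument are hypothetical, the connectivity step assumes curl exists inside the image, and the environment dump may contain secrets.

```bash
# Hypothetical helper: run the health-check steps for one Compose service.
# Arguments: service, [URL to probe from inside the container].
triage_service() {
  local svc=$1 url=${2:-} ids
  echo "== 1. status";      docker compose ps -a "$svc"
  echo "== 2. recent logs"; docker compose logs --tail=100 "$svc"
  echo "== 3. configuration"
  docker compose config "$svc"               # resolved service definition
  docker compose exec -T "$svc" env          # runtime environment (may show secrets)
  if [ -n "$url" ]; then
    echo "== 4. connectivity"
    docker compose exec -T "$svc" curl -fsS -o /dev/null -w '%{http_code}\n' "$url"
  fi
  echo "== 5. resources"
  ids=$(docker compose ps -q "$svc")
  # Container IDs contain no spaces, so word splitting of $ids is intended.
  [ -z "$ids" ] || docker stats --no-stream $ids
}
```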

Network troubleshooting:

# Docker Compose
docker network ls
docker network inspect [network_name]

# Kubernetes
kubectl get networkpolicies -n [namespace]
kubectl describe networkpolicy [name] -n [namespace]

Volume & storage issues:

# Docker Compose
docker volume ls
docker volume inspect [volume_name]

# Kubernetes
kubectl get pv
kubectl get pvc -n [namespace]
kubectl describe pvc [name] -n [namespace]
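Two quick checks narrow down most storage problems: which containers mount a volume (worth knowing before a backup, restore, or removal) and which PersistentVolumeClaims are not Bound. A sketch with hypothetical helper names; note that Compose usually prefixes volume names with the project name, so confirm the exact name with docker volume ls.

```bash
# Hypothetical helper: containers (running or stopped) that mount a volume.
volume_users() {
  docker ps -a --filter "volume=$1" --format '{{.Names}}\t{{.Status}}'
}

# Hypothetical helper: PVCs whose STATUS column ($2) is not "Bound".
unbound_pvcs() {
  kubectl get pvc -n "${1:-default}" --no-headers | awk '$2 != "Bound"'
}
```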

5. Backup & Restore Operations

Docker Compose volumes:

# Backup volume
docker run --rm -v [volume]:/data -v "$(pwd)":/backup busybox tar czf /backup/backup.tar.gz -C /data .

# Restore volume
docker run --rm -v [volume]:/data -v "$(pwd)":/backup busybox tar xzf /backup/backup.tar.gz -C /data
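For repeated backups, a timestamped file name avoids overwriting the previous archive, and listing the archive confirms it is readable. A bash sketch with hypothetical helper names; the default retention of 7 archives is an assumption, and the volume is mounted read-only because a backup never needs to write to it.

```bash
# Hypothetical helper: back up a volume to <dest>/<volume>-<timestamp>.tar.gz
# and verify the archive. Arguments: volume, [destination dir].
backup_volume() {
  local vol=$1 dest=${2:-.}
  local name="${vol}-$(date +%Y%m%d-%H%M%S).tar.gz"
  docker run --rm -v "${vol}:/data:ro" -v "$(cd "$dest" && pwd):/backup" \
    busybox tar czf "/backup/${name}" -C /data . || return 1
  tar tzf "${dest}/${name}" > /dev/null && echo "${dest}/${name}"
}

# Hypothetical helper: keep only the newest <keep> archives for a volume.
# Timestamps in the names sort chronologically, so reverse sort = newest first.
prune_backups() {
  local vol=$1 dest=$2 keep=${3:-7} f
  ls -1 "$dest/$vol"-*.tar.gz 2>/dev/null | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r f; do rm -f -- "$f"; done
}
```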

Database backup within containers:

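# PostgreSQL (-T disables pseudo-TTY allocation so the redirected dump stays clean)
docker compose exec -T [postgres_service] pg_dump -U [user] [db] > backup.sql

# MySQL/MariaDB (an interactive -p prompt does not work reliably without a TTY;
# this assumes the container defines MYSQL_PASSWORD)
docker compose exec -T [mysql_service] sh -c 'mysqldump -u [user] -p"$MYSQL_PASSWORD" [db]' > backup.sql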
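A plain redirect hides failures: if pg_dump exits with an error, the shell still leaves a truncated file behind. This bash sketch (hypothetical helper dump_postgres) compresses the dump, uses pipefail so a pg_dump failure fails the whole pipeline, removes the partial file in that case, and tests the gzip archive on success.

```bash
# Hypothetical helper: dump a PostgreSQL database from a Compose service to
# a gzip file. Arguments: service, user, database, output file.
dump_postgres() {
  local svc=$1 user=$2 db=$3 out=$4
  # pipefail is set inside the subshell, so it does not leak into the caller.
  if ( set -o pipefail
       docker compose exec -T "$svc" pg_dump -U "$user" "$db" | gzip > "$out" ); then
    gzip -t "$out" && echo "$out"
  else
    rm -f -- "$out"
    echo "pg_dump of $db failed" >&2
    return 1
  fi
}
```

To restore such a dump: gunzip -c backup.sql.gz | docker compose exec -T [postgres_service] psql -U [user] [db]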

6. Security & Access Control

Docker security best practices:

  • Use read-only root filesystem: read_only: true
  • Drop unnecessary capabilities: cap_drop: [ALL]
  • Run as non-root user: user: "1000:1000"
  • Use secrets for sensitive data (not environment variables)
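The practices above can be checked on a running container from docker inspect output: HostConfig.ReadonlyRootfs, Config.User, and HostConfig.CapDrop. A bash sketch; the helper name audit_container is hypothetical, and treating an empty User as root is a simplification.

```bash
# Hypothetical helper: report deviations from the practices above.
# Exit status = number of findings (0 means all three checks passed).
audit_container() {
  local c=$1 ro user capdrop issues=0
  IFS='|' read -r ro user capdrop < <(docker inspect --format \
    '{{.HostConfig.ReadonlyRootfs}}|{{.Config.User}}|{{.HostConfig.CapDrop}}' "$c") ||
    { echo "cannot inspect $c" >&2; return 255; }
  [ "$ro" = "true" ] || { echo "$c: root filesystem is writable"; issues=$((issues + 1)); }
  case "$user" in
    ""|root|0|0:*|root:*) echo "$c: runs as root"; issues=$((issues + 1)) ;;
  esac
  case "$capdrop" in
    *ALL*) ;;
    *) echo "$c: capabilities not dropped (CapDrop=$capdrop)"; issues=$((issues + 1)) ;;
  esac
  return "$issues"
}
```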

Kubernetes RBAC:

# Create service account
kubectl create serviceaccount [name] -n [namespace]

# Bind a ClusterRole to the account within a namespace (use --role=[role] for a namespaced Role)
kubectl create rolebinding [binding-name] --clusterrole=[role] --serviceaccount=[namespace]:[account] -n [namespace]
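After binding a role, the effective permissions can be confirmed by impersonating the service account with kubectl auth can-i, which exits 0 when the action is allowed. A sketch with a hypothetical helper name; the caller needs permission to impersonate service accounts.

```bash
# Hypothetical helper: check "verb resource" pairs for a service account.
# Arguments: namespace, service account, then one or more "verb resource" strings.
check_sa_permissions() {
  local ns=$1 sa=$2 pair verb resource
  shift 2
  for pair in "$@"; do
    read -r verb resource <<< "$pair"
    if kubectl auth can-i "$verb" "$resource" -n "$ns" \
         --as="system:serviceaccount:$ns:$sa" > /dev/null; then
      echo "ALLOWED  $verb $resource"
    else
      echo "DENIED   $verb $resource"
    fi
  done
}
```

Example: check_sa_permissions prod deployer "get pods" "patch deployments" "delete secrets"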

Debugging Strategies

Container execution:

# Docker Compose
docker compose exec [service] /bin/bash      # Interactive shell (use /bin/sh if the image has no bash)
docker compose exec [service] ps aux         # List processes
docker compose exec [service] env            # View environment

# Kubernetes
kubectl exec -it [pod] -n [namespace] -- /bin/bash   # or /bin/sh
kubectl exec [pod] -n [namespace] -- ps aux

Log analysis:

  • Check application logs: docker logs or kubectl logs
  • Check container startup logs: Look for early exit, missing dependencies, config errors
  • Cross-reference with timestamps to correlate events across services
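To correlate events across services, the timestamped output of docker compose logs -t can be merged into a single timeline. A bash sketch with a hypothetical helper name; it assumes lines shaped like "service | timestamp message" with fixed-width UTC timestamps, so that sorting the timestamp as text also sorts it chronologically.

```bash
# Hypothetical helper: sort "<service> | <timestamp> <msg>" lines by timestamp.
sort_logs_by_time() {
  local tab
  tab=$(printf '\t')
  # Prefix each line with its timestamp, sort stably on that key, then drop it.
  awk -F' [|] ' '{ split($2, a, " "); print a[1] "\t" $0 }' |
    sort -s -t "$tab" -k1,1 | cut -f2-
}
```

Example: docker compose logs -t --no-color --since 30m | sort_logs_by_time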

Resource constraints:

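# Docker (Memory is in bytes, NanoCpus in billionths of a CPU; 0 means not set)
docker inspect --format 'Memory={{.HostConfig.Memory}} NanoCpus={{.HostConfig.NanoCpus}}' [container]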

# Kubernetes
kubectl top nodes
kubectl top pods -n [namespace]
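kubectl top shows current usage, but contention often comes from containers with no limit at all. This sketch (hypothetical helper name) lists pod/container pairs without a memory limit using a nested jsonpath range; it relies on kubectl's default of printing missing template keys as empty strings.

```bash
# Hypothetical helper: print "pod/container" for containers lacking a memory limit.
pods_without_memory_limit() {
  local ns=${1:-default}
  # One line per pod: "<pod>\t<container>=<limit> <container>=<limit> ..."
  kubectl get pods -n "$ns" -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .spec.containers[*]}{.name}{"="}{.resources.limits.memory}{" "}{end}{"\n"}{end}' |
    awk -F'\t' '{ n = split($2, c, " ")
                  for (i = 1; i <= n; i++)
                    if (c[i] ~ /=$/) print $1 "/" substr(c[i], 1, length(c[i]) - 1) }'
}
```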

Configuration Best Practices

  • Immutable infrastructure: Rebuild containers rather than modifying running instances
  • Health checks: Define Compose healthcheck entries and Kubernetes liveness and readiness probes
  • Resource limits: Set CPU/memory requests and limits to prevent resource contention
  • Rolling updates: Use rolling deployment strategies to maintain availability
  • Secrets separation: Store secrets outside version control (use .env, K8s secrets, or secret managers)
  • Logging: Aggregate logs centrally; avoid storing logs in containers

Common Error Patterns

| Issue | Symptom | Troubleshooting |
|---|---|---|
| Port conflict | bind: address already in use | Find the owning process: lsof -i :[port]; stop it or remap the port |
| Missing dependency | Service fails to start | Check logs for a missing service or network; verify startup order (depends_on) |
| Resource exhaustion | Slow or hanging containers, OOM-killed restarts | Check CPU/memory usage; increase limits; reduce replica count |
| Networking | Services can't communicate | Verify the network name; check firewall rules; test DNS resolution |
| Volume mount | Permission denied in container | Verify the mount path exists; check file permissions; confirm the container's user ID |
| Config error | Parse/validation error at startup | Validate YAML syntax (docker compose config); check environment variable substitution |
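For the Config error row, unset variables are easy to miss because Compose substitutes a blank string and only warns. A bash sketch with a hypothetical helper name that lists ${VAR} references in a Compose file defined neither in the env file nor in the exported environment. Limitations: it only recognizes the plain ${VAR} form, skips $VAR and ${VAR:-default}, and does not understand $$ escapes.

```bash
# Hypothetical helper: print ${VAR} names used in a Compose file that are not
# set in the env file or the exported environment.
missing_compose_vars() {
  local compose=${1:-docker-compose.yaml} envfile=${2:-.env} var
  grep -oE '\$\{[A-Za-z_][A-Za-z0-9_]*\}' "$compose" | tr -d '${}' | sort -u |
    while IFS= read -r var; do
      grep -qE "^[[:space:]]*(export[[:space:]]+)?${var}=" "$envfile" 2>/dev/null && continue
      printenv "$var" > /dev/null && continue
      echo "$var"
    done
}
```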

File Structure Reference

Docker Compose project:

project/
├── docker-compose.yaml      # Main orchestration
├── .env                     # Environment variables and secrets (git-ignored)
├── .env.example             # Template (tracked in git)
├── config/                  # Configuration files
│   ├── traefik.yml
│   └── app.config
└── data/                    # Persistent volumes
    ├── db/
    └── uploads/
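The Compose layout above can be created idempotently, with .env kept out of version control from the start. A bash sketch; the helper name is hypothetical, and ignoring data/ as well as .env is an assumption (runtime data rarely belongs in git). Existing files are left untouched and .gitignore entries are not duplicated.

```bash
# Hypothetical helper: create the project layout and git-ignore secrets/data.
scaffold_compose_project() {
  local dir=$1 entry
  mkdir -p "$dir/config" "$dir/data/db" "$dir/data/uploads" || return 1
  [ -e "$dir/.env.example" ] ||
    printf '# Copy to .env and fill in real values\n' > "$dir/.env.example"
  touch "$dir/.gitignore"
  # Append each entry only if an identical line is not already present.
  for entry in .env data/; do
    grep -qxF -- "$entry" "$dir/.gitignore" || printf '%s\n' "$entry" >> "$dir/.gitignore"
  done
}
```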

Kubernetes project:

k8s/
├── manifests/               # YAML definitions
│   ├── deployment.yaml
│   ├── service.yaml
│   └── configmap.yaml
├── helm/                    # Helm charts
│   └── [chart-name]/
├── kustomization.yaml       # Kustomize configuration (resources, overlays)
└── secrets/                 # Sealed/encrypted secrets

Context-Specific Workflows

Working with JMP Server

For the jmp-server Docker Compose stack:

# View all services
docker compose ps

# Start specific service
docker compose up -d gitea      # Or: bookstack, traefik, etc.

# View logs for troubleshooting
docker compose logs -f traefik
docker compose logs -f gitea

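# Backup database (-T: no TTY, so the redirected dump stays clean)
docker compose exec -T gitea-db pg_dump -U gitea gitea > gitea-backup.sql

# Restart service (does not apply docker-compose.yaml changes; use docker compose up -d gitea for that)
docker compose restart gitea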

Actionable Execution

When troubleshooting or deploying:

  1. State the objective clearly
  2. Run targeted diagnostic commands
  3. Report findings with specific evidence (logs, output, metrics)
  4. Execute corrective actions with clear before/after confirmation
  5. Document any configuration changes for reproducibility
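Steps 4 and 5 can be supported by a small wrapper that records each corrective action with its objective, exact command, exit status, and UTC timestamps. A bash sketch; the helper name run_logged and the OPS_LOG variable (defaulting to ops-changes.log) are hypothetical.

```bash
# Hypothetical helper: run one command and append a before/after record to a log.
# Usage: run_logged "objective" command [args...]
run_logged() {
  local objective=$1 log=${OPS_LOG:-ops-changes.log} rc=0 cmd
  shift
  printf -v cmd '%q ' "$@"                   # shell-quoted, re-runnable command line
  printf '%s BEGIN %s\n  cmd: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$objective" "$cmd" >> "$log"
  "$@" || rc=$?
  printf '%s END   %s (exit %d)\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$objective" "$rc" >> "$log"
  return "$rc"
}
```

Example: run_logged "restart gitea after config change" docker compose up -d gitea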