Deploy AI Models with Docker

A complete guide to containerizing and deploying open-source AI models with Docker.

Docker provides a consistent, portable way to deploy AI models in any environment.

Prerequisites

  • Docker installed (20.10+)
  • NVIDIA Container Toolkit (for GPU support)
  • Basic Docker knowledge
  • Sufficient disk space (50GB+ recommended)
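
Before pulling multi-gigabyte images, a quick environment sanity check helps (the GPU check itself is covered under Troubleshooting below):

# Confirm Docker version and available disk space
docker --version          # expect 20.10 or newer
df -h /var/lib/docker     # default storage location for images and layers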

Basic Deployment

Step 1: Pull Pre-built Image

# vLLM (recommended for LLMs)
docker pull vllm/vllm-openai:latest

# Text Generation Inference
docker pull ghcr.io/huggingface/text-generation-inference:latest

# Ollama
docker pull ollama/ollama:latest

Step 2: Run Container

# Run Llama with vLLM (gated model: set HF_TOKEN in your environment first)
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4   # one shard per GPU; a 70B model does not fit on a single GPU
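
Once the model has finished loading, the container serves an OpenAI-compatible API on port 8000. A minimal smoke test; the model name must match the one passed to --model:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-70B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'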

Custom Dockerfile

For LLM Deployment

FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

# Install Python and dependencies
RUN apt-get update && apt-get install -y \
    python3.10 \
    python3-pip \
    git

# Install vLLM
RUN pip3 install vllm

# Download the model at build time (optional; gated models need a token, see below)
RUN python3 -c "from huggingface_hub import snapshot_download; snapshot_download('meta-llama/Llama-3.1-8B-Instruct')"

# Expose port
EXPOSE 8000

# Run vLLM server
CMD ["python3", "-m", "vllm.entrypoints.openai.api_server", "--host", "0.0.0.0", "--model", "meta-llama/Llama-3.1-8B-Instruct"]
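
Note that the snapshot_download step fails for gated models such as Llama unless a Hugging Face token is available during the build. One approach, sketched here assuming BuildKit is enabled, is a build secret (hf_token is an arbitrary secret id, token.txt a local file holding the token), which keeps the token out of the final image layers:

# The secret is mounted at /run/secrets/hf_token for this step only;
# huggingface_hub reads the token from the HF_TOKEN environment variable
RUN --mount=type=secret,id=hf_token \
    HF_TOKEN=$(cat /run/secrets/hf_token) \
    python3 -c "from huggingface_hub import snapshot_download; snapshot_download('meta-llama/Llama-3.1-8B-Instruct')"

Build with: DOCKER_BUILDKIT=1 docker build --secret id=hf_token,src=token.txt -t my-llama-model .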

Build and Run

# Build image
docker build -t my-llama-model .

# Run container
docker run --gpus all -p 8000:8000 my-llama-model
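
A quick check that the server came up; vLLM's OpenAI-compatible server lists loaded models at /v1/models:

curl http://localhost:8000/v1/models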

Docker Compose

Multi-Service Setup

version: '3.8'

services:
  llama-model:
    image: vllm/vllm-openai:latest
    command: --model meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 4
    ports:
      - "8000:8000"
    volumes:
      - model-cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 4   # match --tensor-parallel-size
              capabilities: [gpu]
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}   # read from the host environment; avoid hardcoding tokens

  nginx:
    image: nginx:latest
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - llama-model

volumes:
  model-cache:
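
The compose file mounts ./nginx.conf but the guide does not show it. A minimal sketch that proxies all traffic to the model service (the hostname llama-model resolves through Docker's internal DNS; the timeout value is illustrative):

events {}

http {
    server {
        listen 80;

        location / {
            proxy_pass http://llama-model:8000;
            proxy_read_timeout 300s;    # token generation can take a while
        }
    }
}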

Run with Compose

docker compose up -d    # use "docker-compose up -d" with the legacy v1 binary
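
Check that both services are up and follow the model server's startup logs; loading 70B weights can take several minutes:

docker compose ps
docker compose logs -f llama-model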

Optimization

Multi-Stage Builds

# Build stage
FROM python:3.10 AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --user -r requirements.txt

# Runtime stage
FROM python:3.10-slim
WORKDIR /app
COPY --from=builder /root/.local /root/.local
COPY . .
ENV PATH=/root/.local/bin:$PATH
CMD ["python", "app.py"]

Layer Caching

  • Order Dockerfile commands from least to most frequently changing
  • Use a .dockerignore file to exclude unnecessary files (example below)
  • Leverage build cache with --cache-from
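
A starter .dockerignore for a typical model-serving repo; the entries are illustrative:

.git
.venv/
__pycache__/
*.pyc
*.log
# large local weights: mount them at runtime instead of copying into the build context
models/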

Resource Limits

docker run --gpus all \
  --memory="16g" \
  --cpus="4" \
  --shm-size="8g" \
  -p 8000:8000 \
  vllm/vllm-openai:latest

Monitoring

Health Checks

# Note: curl must be installed in the image for this check to work
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD curl -f http://localhost:8000/health || exit 1

Logging

# View logs
docker logs -f container-name

# Configure logging driver
docker run --log-driver=json-file --log-opt max-size=10m --log-opt max-file=3 ...
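
To make this rotation the default for every container, set it in the daemon config instead (standard Linux path: /etc/docker/daemon.json), then restart Docker with sudo systemctl restart docker:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}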

Resource Monitoring

# Monitor container stats
docker stats

# Inspect container
docker inspect container-name

Security

Best Practices

  1. Use official base images
  2. Run as a non-root user:

RUN useradd -m -u 1000 appuser
USER appuser

  3. Scan for vulnerabilities ("docker scan" is deprecated in favor of Docker Scout):

docker scout cves my-image:latest

  4. Use secrets management (a Compose alternative follows below):

docker secret create hf_token token.txt
docker service create --secret hf_token my-service
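
The docker secret commands above require Swarm mode. Plain Docker Compose supports file-based secrets from the Compose specification, which mount the token at /run/secrets/hf_token inside the container; a minimal sketch:

services:
  llama-model:
    image: vllm/vllm-openai:latest
    secrets:
      - hf_token

secrets:
  hf_token:
    file: ./token.txt   # keep this file out of version control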

Troubleshooting

GPU Not Detected

# Check NVIDIA Docker runtime
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# Reinstall the NVIDIA Container Toolkit if needed (current apt instructions)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Out of Memory

  • Increase --shm-size (shared memory is used for inter-process communication)
  • Reduce the batch size
  • Use model quantization (see the sketch after this list)
  • Add swap space
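
Quantization and batch-size tuning can be combined with vLLM's built-in memory controls; the flags below are real vLLM options, with illustrative values:

# Cap GPU memory use and context length to avoid OOM at startup
docker run --gpus all --shm-size=16g -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096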

Slow Performance

  • Use volume mounts for model cache
  • Enable GPU support
  • Optimize batch size
  • Use faster storage (SSD)

Production Checklist

  • [ ] Use multi-stage builds
  • [ ] Implement health checks
  • [ ] Configure resource limits
  • [ ] Set up logging
  • [ ] Scan for vulnerabilities
  • [ ] Use secrets management
  • [ ] Implement restart policies (example after this checklist)
  • [ ] Configure networking
  • [ ] Set up monitoring
  • [ ] Document deployment process
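
For the restart-policy item, a minimal example; unless-stopped survives daemon restarts but respects a manual docker stop:

docker run -d --restart unless-stopped --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest --model meta-llama/Llama-3.1-8B-Instruct

# Compose equivalent: add "restart: unless-stopped" under the service definition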