Deploy AI Models with Docker

A complete guide to containerizing and deploying open-source AI models with Docker.

Docker provides a consistent, portable way to deploy AI models in any environment.

Prerequisites

  • Docker installed (20.10+)
  • NVIDIA Container Toolkit (for GPU support)
  • Basic Docker knowledge
  • Sufficient disk space (50GB+ recommended)
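
Before pulling multi-gigabyte images, a quick environment sanity check helps (the GPU check itself is covered under Troubleshooting below):

# Confirm Docker version and available disk space
docker --version          # expect 20.10 or newer
df -h /var/lib/docker     # default storage location for images and layers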

Basic Deployment

Step 1: Pull Pre-built Image

# vLLM (recommended for LLMs)
docker pull vllm/vllm-openai:latest

# Text Generation Inference
docker pull ghcr.io/huggingface/text-generation-inference:latest

# Ollama
docker pull ollama/ollama:latest

Step 2: Run Container

# Run Llama with vLLM (gated model: set HF_TOKEN in your environment first)
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4   # one shard per GPU; a 70B model does not fit on a single GPU
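
Once the model has finished loading, the container serves an OpenAI-compatible API on port 8000. A minimal smoke test; the model name must match the one passed to --model:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-70B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'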

Custom Dockerfile

For LLM Deployment

FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

# Install Python and dependencies
RUN apt-get update && apt-get install -y \
    python3.10 \
    python3-pip \
    git

# Install vLLM
RUN pip3 install vllm

# Download the model at build time (optional; gated models need a token, see below)
RUN python3 -c "from huggingface_hub import snapshot_download; snapshot_download('meta-llama/Llama-3.1-8B-Instruct')"

# Expose port
EXPOSE 8000

# Run vLLM server
CMD ["python3", "-m", "vllm.entrypoints.openai.api_server", "--host", "0.0.0.0", "--model", "meta-llama/Llama-3.1-8B-Instruct"]
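
Note that the snapshot_download step fails for gated models such as Llama unless a Hugging Face token is available during the build. One approach, sketched here assuming BuildKit is enabled, is a build secret (hf_token is an arbitrary secret id, token.txt a local file holding the token), which keeps the token out of the final image layers:

# The secret is mounted at /run/secrets/hf_token for this step only;
# huggingface_hub reads the token from the HF_TOKEN environment variable
RUN --mount=type=secret,id=hf_token \
    HF_TOKEN=$(cat /run/secrets/hf_token) \
    python3 -c "from huggingface_hub import snapshot_download; snapshot_download('meta-llama/Llama-3.1-8B-Instruct')"

Build with: DOCKER_BUILDKIT=1 docker build --secret id=hf_token,src=token.txt -t my-llama-model .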

Build and Run

# Build image
docker build -t my-llama-model .

# Run container
docker run --gpus all -p 8000:8000 my-llama-model
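
A quick check that the server came up; vLLM's OpenAI-compatible server lists loaded models at /v1/models:

curl http://localhost:8000/v1/models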

Docker Compose

Multi-Service Setup

version: '3.8'

services:
  llama-model:
    image: vllm/vllm-openai:latest
    command: --model meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 4
    ports:
      - "8000:8000"
    volumes:
      - model-cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 4   # match --tensor-parallel-size
              capabilities: [gpu]
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}   # read from the host environment; avoid hardcoding tokens

  nginx:
    image: nginx:latest
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - llama-model

volumes:
  model-cache:
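
The compose file mounts ./nginx.conf but the guide does not show it. A minimal sketch that proxies all traffic to the model service (the hostname llama-model resolves through Docker's internal DNS; the timeout value is illustrative):

events {}

http {
    server {
        listen 80;

        location / {
            proxy_pass http://llama-model:8000;
            proxy_read_timeout 300s;    # token generation can take a while
        }
    }
}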

Run with Compose

docker compose up -d    # use "docker-compose up -d" with the legacy v1 binary
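
Check that both services are up and follow the model server's startup logs; loading 70B weights can take several minutes:

docker compose ps
docker compose logs -f llama-model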

Optimization

Multi-Stage Builds

# Build stage
FROM python:3.10 AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --user -r requirements.txt

# Runtime stage
FROM python:3.10-slim
WORKDIR /app
COPY --from=builder /root/.local /root/.local
COPY . .
ENV PATH=/root/.local/bin:$PATH
CMD ["python", "app.py"]

Layer Caching

  • Order Dockerfile commands from least to most frequently changing
  • Use a .dockerignore file to exclude unnecessary files (example below)
  • Leverage build cache with --cache-from
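
A starter .dockerignore for a typical model-serving repo; the entries are illustrative:

.git
.venv/
__pycache__/
*.pyc
*.log
# large local weights: mount them at runtime instead of copying into the build context
models/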

Resource Limits

docker run --gpus all \
  --memory="16g" \
  --cpus="4" \
  --shm-size="8g" \
  -p 8000:8000 \
  vllm/vllm-openai:latest

Monitoring

Health Checks

# Note: curl must be installed in the image for this check to work
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD curl -f http://localhost:8000/health || exit 1

Logging

# View logs
docker logs -f container-name

# Configure logging driver
docker run --log-driver=json-file --log-opt max-size=10m --log-opt max-file=3 ...
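
To make this rotation the default for every container, set it in the daemon config instead (standard Linux path: /etc/docker/daemon.json), then restart Docker with sudo systemctl restart docker:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}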

Resource Monitoring

# Monitor container stats
docker stats

# Inspect container
docker inspect container-name

Security

Best Practices

  1. Use official base images
  2. Run as a non-root user:

RUN useradd -m -u 1000 appuser
USER appuser

  3. Scan for vulnerabilities ("docker scan" is deprecated in favor of Docker Scout):

docker scout cves my-image:latest

  4. Use secrets management (a Compose alternative follows below):

docker secret create hf_token token.txt
docker service create --secret hf_token my-service
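
The docker secret commands above require Swarm mode. Plain Docker Compose supports file-based secrets from the Compose specification, which mount the token at /run/secrets/hf_token inside the container; a minimal sketch:

services:
  llama-model:
    image: vllm/vllm-openai:latest
    secrets:
      - hf_token

secrets:
  hf_token:
    file: ./token.txt   # keep this file out of version control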

Troubleshooting

GPU Not Detected

# Check NVIDIA Docker runtime
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# Reinstall the NVIDIA Container Toolkit if needed (current apt instructions)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Out of Memory

  • Increase --shm-size (shared memory is used for inter-process communication)
  • Reduce the batch size
  • Use model quantization (see the sketch after this list)
  • Add swap space
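
Quantization and batch-size tuning can be combined with vLLM's built-in memory controls; the flags below are real vLLM options, with illustrative values:

# Cap GPU memory use and context length to avoid OOM at startup
docker run --gpus all --shm-size=16g -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096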

Slow Performance

  • Use volume mounts for model cache
  • Enable GPU support
  • Optimize batch size
  • Use faster storage (SSD)

Production Checklist

  • [ ] Use multi-stage builds
  • [ ] Implement health checks
  • [ ] Configure resource limits
  • [ ] Set up logging
  • [ ] Scan for vulnerabilities
  • [ ] Use secrets management
  • [ ] Implement restart policies (example after this checklist)
  • [ ] Configure networking
  • [ ] Set up monitoring
  • [ ] Document deployment process
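
For the restart-policy item, a minimal example; unless-stopped survives daemon restarts but respects a manual docker stop:

docker run -d --restart unless-stopped --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest --model meta-llama/Llama-3.1-8B-Instruct

# Compose equivalent: add "restart: unless-stopped" under the service definition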