Deploy Open-Source AI Models on Google Cloud Platform
Google Cloud Platform offers several paths for deploying open-source machine learning models at scale, from raw GPU VMs to a fully managed ML platform. This guide covers three options: Compute Engine, Google Kubernetes Engine (GKE), and Vertex AI.
Prerequisites
- GCP Account with billing enabled
- gcloud CLI installed
- Project created in GCP Console
- Basic understanding of Kubernetes (for GKE)
Deployment Options
1. Compute Engine Deployment
Best for: Full control over infrastructure
Step 1: Create GPU Instance
```bash
gcloud compute instances create ai-model-instance \
  --zone=us-central1-a \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --image-family=ubuntu-2004-lts \
  --image-project=ubuntu-os-cloud \
  --boot-disk-size=100GB \
  --maintenance-policy=TERMINATE
```
Step 2: Install NVIDIA Drivers
```bash
# SSH into the instance
gcloud compute ssh ai-model-instance --zone=us-central1-a

# Download and run Google's GPU driver installer
curl https://raw.githubusercontent.com/GoogleCloudPlatform/compute-gpu-installation/main/linux/install_gpu_driver.py --output install_gpu_driver.py
sudo python3 install_gpu_driver.py
```
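With drivers installed, one straightforward way to serve a model is vLLM's OpenAI-compatible server. A minimal sketch, run on the instance (the model ID is an example; substitute any open model you have access to):

```bash
# Confirm the driver sees the GPU
nvidia-smi

# Install vLLM and serve a model on port 8000 (model ID is an example)
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --port 8000
```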
2. Google Kubernetes Engine (GKE)
Best for: Scalable, production workloads
Step 1: Create GKE Cluster
```bash
gcloud container clusters create ai-cluster \
  --zone=us-central1-a \
  --machine-type=n1-standard-4 \
  --num-nodes=3 \
  --enable-autoscaling \
  --min-nodes=1 \
  --max-nodes=10
```
Step 2: Add GPU Node Pool
```bash
gcloud container node-pools create gpu-pool \
  --cluster=ai-cluster \
  --zone=us-central1-a \
  --machine-type=n1-standard-4 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --num-nodes=1 \
  --enable-autoscaling \
  --min-nodes=0 \
  --max-nodes=5
```
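GPU nodes in GKE also need NVIDIA drivers. Unless the node pool was created with automatic driver installation, Google's driver-installer DaemonSet handles this; the manifest below assumes Container-Optimized OS nodes, the GKE default:

```bash
# Point kubectl at the new cluster
gcloud container clusters get-credentials ai-cluster --zone=us-central1-a

# Install NVIDIA drivers on GPU nodes (Container-Optimized OS)
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
```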
Step 3: Deploy Model
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llama
  template:
    metadata:
      labels:
        app: llama
    spec:
      containers:
      - name: llama
        image: vllm/vllm-openai:latest
        # The vLLM image expects a --model argument; this model ID is an example
        args: ["--model", "mistralai/Mistral-7B-Instruct-v0.2"]
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8000
```
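Assuming the manifest is saved as llama-deployment.yaml, applying and exposing it could look like this (the port mapping is illustrative):

```bash
# Create the deployment
kubectl apply -f llama-deployment.yaml

# Put a cloud load balancer in front of it
kubectl expose deployment llama-deployment \
  --type=LoadBalancer --port=80 --target-port=8000

# Wait for an EXTERNAL-IP, then hit the OpenAI-compatible API
kubectl get service llama-deployment
curl http://EXTERNAL_IP/v1/models
```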
3. Vertex AI Deployment
Best for: Managed ML platform
```python
from google.cloud import aiplatform

aiplatform.init(project='your-project', location='us-central1')

# Register the model; note that Vertex AI expects the serving image
# to be available in Artifact Registry, so push the vLLM image there first
model = aiplatform.Model.upload(
    display_name='llama-model',
    artifact_uri='gs://your-bucket/model',
    serving_container_image_uri='vllm/vllm-openai:latest',
)

# Deploy to a GPU-backed endpoint
endpoint = model.deploy(
    machine_type='n1-standard-4',
    accelerator_type='NVIDIA_TESLA_T4',
    accelerator_count=1,
)
```
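Once the endpoint is ready, you can smoke-test it from the CLI. A sketch, where the endpoint ID and request.json are placeholders and the request body must match the serving container's expected schema:

```bash
# Find the endpoint ID
gcloud ai endpoints list --region=us-central1

# Send a test prediction
gcloud ai endpoints predict ENDPOINT_ID \
  --region=us-central1 \
  --json-request=request.json
```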
Cost Optimization
Preemptible VMs
Preemptible (Spot) VMs are discounted 60-91% relative to on-demand pricing, with the trade-off that GCP can reclaim them at any time:
```bash
gcloud compute instances create preemptible-instance \
  --preemptible \
  --zone=us-central1-a \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1
```
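A preemptible instance is reclaimed with roughly 30 seconds' notice, so the serving process should shut down gracefully. From inside the VM, the metadata server reports preemption status:

```bash
# Returns TRUE once the instance has been preempted
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
```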
Committed Use Discounts
- 1-year: 25% discount
- 3-year: 52% discount
Auto-scaling
```bash
gcloud compute instance-groups managed set-autoscaling ai-group \
  --zone=us-central1-a \
  --max-num-replicas=10 \
  --min-num-replicas=1 \
  --target-cpu-utilization=0.6
```
Monitoring
Cloud Monitoring
Custom metrics let you track model-specific signals such as inference latency:

```python
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{project_id}"  # project_id defined elsewhere
# Write one data point to a custom inference-latency metric
series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/inference_latency"
series.resource.type = "global"
series.points = [monitoring_v3.Point({
    "interval": {"end_time": {"seconds": int(time.time())}},
    "value": {"double_value": 42.0}})]
client.create_time_series(name=project_name, time_series=[series])
```
Security
- Use VPC Service Controls
- Enable Binary Authorization
- Implement Workload Identity
- Use Secret Manager for credentials (see the sketch below)
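As a concrete example of the last item, an API token can live in Secret Manager rather than in the container image. A minimal sketch (the secret name hf-token is illustrative):

```bash
# Store the token as a secret
gcloud secrets create hf-token --replication-policy=automatic
echo -n "your-token" | gcloud secrets versions add hf-token --data-file=-

# Read it back at deploy time
gcloud secrets versions access latest --secret=hf-token
```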
Troubleshooting
GPU Quota Issues
Request a quota increase in the GCP Console under IAM & Admin > Quotas. New projects typically start with a GPU quota of zero, so this is usually the first blocker.
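Before filing the request, you can inspect current regional quotas from the CLI:

```bash
# List quotas for a region; look for the NVIDIA_* entries
gcloud compute regions describe us-central1 --format="yaml(quotas)"
```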
Pod Scheduling Failures
Check that the node pool has GPU capacity available and that pod tolerations match the node taints. GKE automatically taints GPU nodes with nvidia.com/gpu and adds the matching toleration to pods that request GPUs, so a mismatch usually means the pod isn't requesting the GPU resource.
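Two kubectl checks usually pinpoint the cause (POD_NAME is a placeholder):

```bash
# Why is the pod Pending? Check the Events section at the bottom
kubectl describe pod POD_NAME

# Do any nodes actually advertise GPU capacity?
kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```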
Production Checklist
- [ ] Set up Cloud Monitoring
- [ ] Configure Cloud Logging
- [ ] Enable auto-scaling
- [ ] Implement load balancing
- [ ] Set up backup strategy
- [ ] Configure VPC and firewall rules
- [ ] Enable encryption
- [ ] Set up CI/CD with Cloud Build (see the sketch below)
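For the CI/CD item, the smallest possible Cloud Build invocation builds and pushes a serving image from the current directory's Dockerfile (the image name is a placeholder):

```bash
gcloud builds submit --tag gcr.io/PROJECT_ID/model-server:latest
```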
Related Guides
- Deploy AI Models on AWS: Complete guide to deploying open-source AI models on Amazon Web Services
- Deploy AI Models on Microsoft Azure: Complete guide to deploying open-source AI models on Azure
- Deploy AI Models with Docker: Complete guide to containerizing and deploying AI models with Docker