jeremy-vertex-terraform
by jeremylongshorev1.0.0
Terraform configurations for Vertex AI platform and Agent Engine
Documentation
# Jeremy Vertex Terraform
**🎯 VERTEX AI MODEL GARDEN & AI INFRASTRUCTURE**
Terraform infrastructure specialist for **broader Vertex AI services** including Model Garden, Gemini endpoints, vector search, ML pipelines, and enterprise AI infrastructure (NOT Agent Engine - use jeremy-adk-terraform for that).
## ⚠️ Important: What This Plugin Is For
**✅ THIS PLUGIN IS FOR:**
- **Vertex AI Model Garden** deployments (foundation models)
- **Gemini API endpoints** (gemini-pro, gemini-2.0-flash)
- **Vector Search** infrastructure (ScaNN-based similarity search)
- **Vertex AI Pipelines** (Kubeflow Pipelines for ML workflows)
- **Endpoint deployment** (model serving infrastructure)
- **Batch prediction** jobs
- **ML model training** infrastructure
- **Feature Store** for ML feature management
**❌ THIS PLUGIN IS NOT FOR:**
- Agent Engine infrastructure (use `jeremy-adk-terraform` for ADK agents)
- Cloud Run deployments (use `jeremy-genkit-terraform`)
- Self-managed ML infrastructure
## Overview
This plugin provides Terraform modules for deploying Vertex AI services including Model Garden foundation models, Gemini API endpoints, vector search for RAG applications, ML pipelines, and production model serving infrastructure.
**Key Infrastructure Components:**
- `google_vertex_ai_endpoint` for model serving
- `google_vertex_ai_deployed_model` for model versions
- `google_vertex_ai_index` for vector search
- `google_vertex_ai_index_endpoint` for similarity search
- `google_vertex_ai_feature_store` for feature management
- Cloud Storage for model artifacts
- BigQuery for ML model training
## Installation
```bash
/plugin install jeremy-vertex-terraform@claude-code-plugins-plus
```
## Prerequisites & Dependencies
### Required Tools
**1. Terraform:**
```bash
# Install Terraform 1.5+
wget https://releases.hashicorp.com/terraform/1.6.0/terraform_1.6.0_linux_amd64.zip
unzip terraform_1.6.0_linux_amd64.zip
sudo mv terraform /usr/local/bin/
# Verify
terraform version # Should show 1.5.0+
```
**2. gcloud CLI:**
```bash
# Install gcloud
curl https://sdk.cloud.google.com | bash
exec -l $SHELL
# Update to latest
gcloud components update
# Authenticate
gcloud auth application-default login
```
**3. Terraform Google Provider:**
```hcl
terraform {
required_version = ">= 1.5.0"
required_providers {
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
google-beta = {
source = "hashicorp/google-beta"
version = "~> 5.0"
}
}
}
```
### Required Google Cloud APIs
```bash
# Enable all required APIs
gcloud services enable \
aiplatform.googleapis.com \
compute.googleapis.com \
storage.googleapis.com \
bigquery.googleapis.com \
logging.googleapis.com \
monitoring.googleapis.com \
cloudtrace.googleapis.com \
--project=YOUR_PROJECT_ID
```
### Required IAM Permissions
```yaml
# Service account for Terraform needs:
- roles/aiplatform.admin # Deploy Vertex AI resources
- roles/storage.admin # Manage model artifacts
- roles/bigquery.admin # ML training datasets
- roles/compute.networkAdmin # VPC for private endpoints
- roles/monitoring.admin # Observability
- roles/iam.serviceAccountAdmin # Service account management
```
## Features
✅ **Model Garden Deployment**: Foundation models (Gemini, PaLM, Claude, Llama)
✅ **Gemini API Endpoints**: Dedicated endpoints with rate limiting
✅ **Vector Search**: ScaNN-based similarity search for RAG
✅ **ML Pipelines**: Kubeflow Pipelines for training workflows
✅ **Model Serving**: Production endpoints with auto-scaling
✅ **Batch Predictions**: Large-scale inference jobs
✅ **Feature Store**: Centralized feature management
✅ **Monitoring**: Model performance tracking and drift detection
## Quick Start
### Natural Language Activation
```
"Create Terraform for Gemini endpoint deployment"
"Deploy vector search for RAG application"
"Set up Vertex AI Pipeline for model training"
"Create Feature Store for ML features"
"Deploy custom model to Vertex AI endpoint"
```
## Terraform Module Structure
### 1. Gemini API Endpoint
```hcl
# gemini_endpoint.tf
# Gemini 2.0 Flash endpoint
resource "google_vertex_ai_endpoint" "gemini_endpoint" {
display_name = "gemini-2-0-flash-endpoint"
location = var.region
project = var.project_id
description = "Production Gemini 2.0 Flash endpoint"
# Network configuration
network = google_compute_network.vertex_vpc.id
# Encryption
encryption_spec {
kms_key_name = google_kms_crypto_key.model_key.id
}
}
# Deploy Gemini model
resource "google_vertex_ai_deployed_model" "gemini_flash" {
endpoint = google_vertex_ai_endpoint.gemini_endpoint.id
model = "publishers/google/models/gemini-2.0-flash-001"
display_name = "gemini-2-0-flash-001"
dedicated_resources {
machine_spec {
machine_type = "n1-standard-4"
}
min_replica_count = var.min_replicas
max_replica_count = var.max_replicas
autoscaling_metric_specs {
metric_name = "aiplatform.googleapis.com/prediction/online/accelerator/duty_cycle"
target = 70
}
}
# Traffic split
traffic_split = {
"0" = 100
}
}
# Service account for endpoint
resource "google_service_account" "vertex_sa" {
account_id = "vertex-ai-endpoint-sa"
display_name = "Vertex AI Endpoint Service Account"
}
resource "google_project_iam_member" "vertex_permissions" {
for_each = toset([
"roles/aiplatform.user",
"roles/storage.objectViewer",
"roles/logging.logWriter"
])
project = var.project_id
role = each.value
member = "serviceAccount:${google_service_account.vertex_sa.email}"
}
```
### 2. Vector Search Infrastructure
```hcl
# vector_search.tf
# Vector index for embeddings
resource "google_vertex_ai_index" "embeddings_index" {
display_name = "${var.app_name}-embeddings-index"
region = var.region
project = var.project_id
description = "Vector search index for RAG application"
metadata {
contents_delta_uri = google_storage_bucket.embeddings.url
config {
dimensions = 768 # text-embedding-gecko dimensions
approximate_neighbors_count = 150
distance_measure_type = "DOT_PRODUCT_DISTANCE"
algorithm_config {
tree_ah_config {
leaf_node_embedding_count = 1000
leaf_nodes_to_search_percent = 7
}
}
shard_size = "SHARD_SIZE_MEDIUM"
}
}
index_update_method = "STREAM_UPDATE"
}
# Index endpoint for queries
resource "google_vertex_ai_index_endpoint" "embeddings_endpoint" {
display_name = "${var.app_name}-embeddings-endpoint"
region = var.region
project = var.project_id
description = "Vector search endpoint"
# Private VPC
network = "projects/${data.google_project.project.number}/global/networks/${google_compute_network.vertex_vpc.name}"
public_endpoint_enabled = false
}
# Deploy index to endpoint
resource "google_vertex_ai_index_endpoint_deployed_index" "deployed" {
index_endpoint = google_vertex_ai_index_endpoint.embeddings_endpoint.id
index = google_vertex_ai_index.embeddings_index.id
deployed_index_id = "deployed_embeddings_index"
display_name = "Deployed Embeddings Index"
dedicated_resources {
machine_spec {
machine_type = "n1-standard-16"
}
min_replica_count = 2
max_replica_count = 10
autoscaling_metric_specs {
metric_name = "aiplatform.googleapis.com/prediction/online/cpu/utilization"
target = 70
}
}
enable_access_logging = true
}
# Storage bucket for embeddings
resource "google_storage_bucket" "embeddings" {
name = "${var.project_id}-${var.app_name}-embeddings"
location = var.region
uniform_bucket_level_access = true
versioning {
enabled = true
}
lifecycle_rule {
condition {
num_newer_versions = 3
}
action {
type = "Delete"
}
}
}
```
### 3. Custom Model Deployment
```hcl
# custom_model.tf
# Upload model to Cloud Storage
resource "google_storage_bucket" "model_artifacts" {
name = "${var.project_id}-ml-models"
location = var.region
uniform_bucket_level_access = true
versioning {
enabled = true
}
}
resource "google_storage_bucket_object" "model_artifact" {
name = "models/${var.model_name}/model.pkl"
bucket = google_storage_bucket.model_artifacts.name
source = var.model_path
}
# Register model
resource "google_vertex_ai_model" "custom_model" {
display_name = var.model_name
region = var.region
project = var.project_id
description = "Custom ML model"
version_aliases = ["production"]
# Model artifact location
artifact_uri = "gs://${google_storage_bucket.model_artifacts.name}/models/${var.model_name}/"
# Container spec for serving
container_spec {
image_uri = "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"
env {
name = "MODEL_NAME"
value = var.model_name
}
ports {
container_port = 8080
}
predict_route = "/predict"
health_route = "/health"
}
# Encryption
encryption_spec {
kms_key_name = google_kms_crypto_key.model_key.id
}
}
# Create serving endpoint
resource "google_vertex_ai_endpoint" "model_endpoint" {
display_name = "${var.model_name}-endpoint"
location = var.region
project = var.project_id
network = google_compute_network.vertex_vpc.id
}
# Deploy model to endpoint
resource "google_vertex_ai_deployed_model" "deployed" {
endpoint = google_vertex_ai_endpoint.model_endpoint.id
model = google_vertex_ai_model.custom_model.id
display_name = "${var.model_name}-v1"
dedicated_resources {
machine_spec {
machine_type = "n1-standard-4"
accelerator_type = "NVIDIA_TESLA_T4"
accelerator_count = 1
}
min_replica_count = 1
max_replica_count = 5
autoscaling_metric_specs {
metric_name = "aiplatform.googleapis.com/prediction/online/cpu/utilization"
target = 60
}
}
traffic_split = {
"0" = 100
}
}
```
### 4. Vertex AI Pipelines
```hcl
# pipelines.tf
# Pipeline storage bucket
resource "google_storage_bucket" "pipeline_root" {
name = "${var.project_id}-pipeline-root"
location = var.region
uniform_bucket_level_access = true
lifecycle_rule {
condition {
age = 30
}
action {
type = "Delete"
}
}
}
# Artifact Registry for pipeline containers
resource "google_artifact_registry_repository" "pipeline_containers" {
repository_id = "vertex-pipelines"
location = var.region
format = "DOCKER"
description = "Container images for Vertex AI Pipelines"
}
# Service account for pipelines
resource "google_service_account" "pipeline_sa" {
account_id = "vertex-pipeline-runner"
display_name = "Vertex AI Pipeline Runner"
}
resource "google_project_iam_member" "pipeline_permissions" {
for_each = toset([
"roles/aiplatform.user",
"roles/storage.admin",
"roles/bigquery.dataEditor",
"roles/logging.logWriter"
])
project = var.project_id
role = each.value
member = "serviceAccount:${google_service_account.pipeline_sa.email}"
}
# Example: Training pipeline trigger
resource "google_cloudfunctions2_function" "pipeline_trigger" {
name = "trigger-training-pipeline"
location = var.region
build_config {
runtime = "python312"
entry_point = "trigger_pipeline"
source {
storage_source {
bucket = google_storage_bucket.pipeline_root.name
object = "functions/trigger.zip"
}
}
}
service_config {
available_memory = "256M"
timeout_seconds = 60
service_account_email = google_service_account.pipeline_sa.email
environment_variables = {
PROJECT_ID = var.project_id
PIPELINE_ROOT = "gs://${google_storage_bucket.pipeline_root.name}"
REGION = var.region
}
}
}
```
### 5. Feature Store
```hcl
# feature_store.tf
resource "google_vertex_ai_featurestore" "main" {
name = "${var.app_name}-featurestore"
region = var.region
online_serving_config {
fixed_node_count = 1
}
encryption_spec {
kms_key_name = google_kms_crypto_key.feature_key.id
}
force_destroy = false
}
# Entity type (e.g., users, items)
resource "google_vertex_ai_featurestore_entitytype" "users" {
name = "users"
featurestore = google_vertex_ai_featurestore.main.id
monitoring_config {
snapshot_analysis {
disabled = false
}
}
}
# Features
resource "google_vertex_ai_featurestore_entitytype_feature" "user_age" {
name = "age"
entitytype = google_vertex_ai_featurestore_entitytype.users.id
value_type = "INT64"
description = "User age in years"
}
resource "google_vertex_ai_featurestore_entitytype_feature" "user_ltv" {
name = "lifetime_value"
entitytype = google_vertex_ai_featurestore_entitytype.users.id
value_type = "DOUBLE"
description = "User lifetime value"
}
```
### 6. Batch Prediction
```hcl
# batch_prediction.tf
# Batch prediction job
resource "google_vertex_ai_batch_prediction_job" "batch_inference" {
display_name = "${var.model_name}-batch-prediction"
location = var.region
model = google_vertex_ai_model.custom_model.id
input_config {
instances_format = "jsonl"
gcs_source {
uris = ["gs://${google_storage_bucket.model_artifacts.name}/batch-input/*.jsonl"]
}
}
output_config {
predictions_format = "jsonl"
gcs_destination {
output_uri_prefix = "gs://${google_storage_bucket.model_artifacts.name}/batch-output/"
}
}
dedicated_resources {
machine_spec {
machine_type = "n1-standard-4"
accelerator_type = "NVIDIA_TESLA_T4"
accelerator_count = 1
}
starting_replica_count = 1
max_replica_count = 10
}
service_account = google_service_account.vertex_sa.email
}
```
### 7. Monitoring & Observability
```hcl
# monitoring.tf
# Dashboard for model endpoints
resource "google_monitoring_dashboard" "vertex_dashboard" {
dashboard_json = jsonencode({
displayName = "${var.app_name} Vertex AI Dashboard"
mosaicLayout = {
columns = 12
tiles = [
# Prediction requests
{
width = 6
height = 4
widget = {
title = "Prediction Requests"
xyChart = {
dataSets = [{
timeSeriesQuery = {
timeSeriesFilter = {
filter = "metric.type=\"aiplatform.googleapis.com/prediction/online/prediction_count\" resource.type=\"aiplatform.googleapis.com/Endpoint\""
}
}
}]
}
}
},
# Prediction latency
{
xPos = 6
width = 6
height = 4
widget = {
title = "Prediction Latency (p95)"
xyChart = {
dataSets = [{
timeSeriesQuery = {
timeSeriesFilter = {
filter = "metric.type=\"aiplatform.googleapis.com/prediction/online/response_latencies\" resource.type=\"aiplatform.googleapis.com/Endpoint\""
aggregation = {
alignmentPeriod = "60s"
perSeriesAligner = "ALIGN_DELTA"
crossSeriesReducer = "REDUCE_PERCENTILE_95"
}
}
}
}]
}
}
},
# Error rate
{
yPos = 4
width = 6
height = 4
widget = {
title = "Error Rate"
xyChart = {
dataSets = [{
timeSeriesQuery = {
timeSeriesFilter = {
filter = "metric.type=\"aiplatform.googleapis.com/prediction/online/error_count\" resource.type=\"aiplatform.googleapis.com/Endpoint\""
}
}
}]
}
}
},
# Replica utilization
{
xPos = 6
yPos = 4
width = 6
height = 4
widget = {
title = "Replica Utilization"
xyChart = {
dataSets = [{
timeSeriesQuery = {
timeSeriesFilter = {
filter = "metric.type=\"aiplatform.googleapis.com/prediction/online/replicas\" resource.type=\"aiplatform.googleapis.com/Endpoint\""
}
}
}]
}
}
}
]
}
})
}
# Alert: High latency
resource "google_monitoring_alert_policy" "high_latency" {
display_name = "${var.app_name} - High Prediction Latency"
combiner = "OR"
conditions {
display_name = "P95 latency > 5s"
condition_threshold {
filter = "metric.type=\"aiplatform.googleapis.com/prediction/online/response_latencies\" resource.type=\"aiplatform.googleapis.com/Endpoint\""
duration = "300s"
comparison = "COMPARISON_GT"
threshold_value = 5000
aggregations {
alignment_period = "60s"
per_series_aligner = "ALIGN_DELTA"
cross_series_reducer = "REDUCE_PERCENTILE_95"
}
}
}
notification_channels = [google_monitoring_notification_channel.email.id]
}
# Alert: High error rate
resource "google_monitoring_alert_policy" "high_errors" {
display_name = "${var.app_name} - High Error Rate"
combiner = "OR"
conditions {
display_name = "Error rate > 5%"
condition_threshold {
filter = "metric.type=\"aiplatform.googleapis.com/prediction/online/error_count\" resource.type=\"aiplatform.googleapis.com/Endpoint\""
duration = "300s"
comparison = "COMPARISON_GT"
threshold_value = 0.05
aggregations {
alignment_period = "60s"
per_series_aligner = "ALIGN_RATE"
}
}
}
notification_channels = [google_monitoring_notification_channel.email.id]
}
```
## Variables
```hcl
# variables.tf
variable "project_id" {
description = "Google Cloud project ID"
type = string
}
variable "region" {
description = "Region for Vertex AI resources"
type = string
default = "us-central1"
}
variable "app_name" {
description = "Application name prefix"
type = string
}
# Endpoint configuration
variable "min_replicas" {
description = "Minimum number of replicas"
type = number
default = 1
}
variable "max_replicas" {
description = "Maximum number of replicas"
type = number
default = 10
}
# Model configuration
variable "model_name" {
description = "Custom model name"
type = string
}
variable "model_path" {
description = "Local path to model artifact"
type = string
}
# Vector search
variable "embedding_dimensions" {
description = "Dimensions for embeddings (768 for gecko, 1536 for OpenAI)"
type = number
default = 768
}
# Alerting
variable "alert_email" {
description = "Email for monitoring alerts"
type = string
}
```
## Deployment Workflow
### 1. Initialize Terraform
```bash
terraform init
```
### 2. Plan Infrastructure
```bash
terraform plan \
-var="project_id=my-project" \
-var="app_name=my-app" \
-var="model_name=custom-model" \
-var="model_path=./model.pkl" \
-var="[email protected]"
```
### 3. Apply Configuration
```bash
terraform apply \
-var="project_id=my-project" \
-var="app_name=my-app" \
-var="model_name=custom-model" \
-var="model_path=./model.pkl" \
-var="[email protected]"
```
### 4. Verify Deployment
```bash
# Check endpoints
gcloud ai endpoints list --region=us-central1
# Check deployed models
gcloud ai models list --region=us-central1
# Check vector search indexes
gcloud ai indexes list --region=us-central1
# Test prediction
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://us-central1-aiplatform.googleapis.com/v1/projects/PROJECT/locations/us-central1/endpoints/ENDPOINT_ID:predict \
-d '{"instances": [{"feature1": 1.0, "feature2": "value"}]}'
```
## Integration with Other Plugins
### jeremy-adk-terraform
- jeremy-adk-terraform: Agent Engine (ADK agents)
- jeremy-vertex-terraform: Model serving & ML infrastructure (this plugin)
### jeremy-vertex-engine
- Terraform provisions endpoints → Engine inspector validates (for Agent Engine only)
### jeremy-vertex-validator
- Terraform provisions infrastructure → Validator checks production readiness
## Use Cases
### Gemini API Deployment
```
"Create Terraform for Gemini 2.0 Flash endpoint"
"Deploy Gemini Pro with auto-scaling"
```
### Vector Search for RAG
```
"Set up vector search infrastructure for RAG application"
"Deploy embeddings index with 768 dimensions"
```
### Custom Model Serving
```
"Deploy custom scikit-learn model to Vertex AI"
"Create endpoint for TensorFlow model with GPU"
```
### Batch Predictions
```
"Set up batch prediction job for large dataset"
"Deploy batch inference with T4 GPUs"
```
### Feature Store
```
"Create Feature Store for user features"
"Deploy feature serving for real-time predictions"
```
## Best Practices
✅ **Private Endpoints**: Use VPC for production endpoints
✅ **Auto-scaling**: Configure based on traffic patterns
✅ **Monitoring**: Deploy dashboards and alerts
✅ **Encryption**: Use CMEK for sensitive models
✅ **Version Control**: Tag model versions
✅ **Cost Optimization**: Use preemptible VMs for batch jobs
✅ **Traffic Splitting**: Blue/green deployments
✅ **Model Registry**: Organize models in Vertex AI Model Registry
## Requirements
- Terraform >= 1.5.0
- Google Cloud Provider >= 5.0
- Google Cloud Project with billing enabled
- Appropriate IAM permissions
- Model artifacts prepared
- gcloud CLI
## License
MIT
## Support
- Issues: https://github.com/jeremylongshore/claude-code-plugins/issues
- Discussions: https://github.com/jeremylongshore/claude-code-plugins/discussions
## Version
1.0.1 (2025) - Comprehensive Vertex AI infrastructure (Model Garden, Vector Search, Pipelines)