name: google-cloud-configs
description: Google Cloud Platform configuration templates for BigQuery ML and Vertex AI training with authentication setup, GPU/TPU configs, and cost estimation tools. Use when setting up GCP ML training, configuring BigQuery ML models, deploying Vertex AI training jobs, estimating GCP costs, configuring cloud authentication, selecting GPUs/TPUs for training, or when user mentions BigQuery ML, Vertex AI, GCP training, cloud ML setup, TPU training, or Google Cloud costs.
allowed-tools: Bash, Read, Write, Edit
Use when:
- Setting up BigQuery ML for SQL-based machine learning
- Configuring Vertex AI custom training jobs
- Setting up GCP authentication for ML workflows
- Selecting appropriate GPU/TPU configurations
- Estimating costs for GCP ML training
- Deploying models to Vertex AI endpoints
- Configuring distributed training on GCP
- Optimizing cost vs performance for cloud ML
Platform Overview
BigQuery ML
What it is: SQL-based machine learning directly in BigQuery
Best for:
- Quick ML prototypes using existing data warehouse data
- Classification, regression, forecasting on structured data
- Users familiar with SQL but not Python/ML frameworks
- Large-scale batch predictions
Available Models:
- Linear/Logistic Regression
- XGBoost (BOOSTED_TREE)
- Deep Neural Networks (DNN)
- AutoML Tables
- TensorFlow/PyTorch imported models
Pricing:
- Based on data processed (same as BigQuery queries)
- $5 per TB processed for analysis
- AutoML: $19.32/hour for training
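A minimal sketch of what a BigQuery ML training run looks like from the Python client; the project, dataset, table, and column names are placeholders, not values from this skill's templates:

```python
# Minimal sketch: train a logistic regression model in BigQuery ML via the
# Python client. Project, dataset, table, and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

create_model_sql = """
CREATE OR REPLACE MODEL `my_dataset.churn_model`
OPTIONS(model_type='LOGISTIC_REG', input_label_cols=['churned']) AS
SELECT churned, tenure_months, monthly_charges, contract_type
FROM `my_dataset.customers`
"""
client.query(create_model_sql).result()  # blocks until training finishes

# Inspect training quality with ML.EVALUATE
for row in client.query(
    "SELECT * FROM ML.EVALUATE(MODEL `my_dataset.churn_model`)"
).result():
    print(dict(row))
```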
Vertex AI Training
What it is: Fully managed ML training platform
Best for:
- Custom PyTorch/TensorFlow training
- Large-scale distributed training
- GPU/TPU-accelerated workloads
- Production ML pipelines
Available Compute:
- CPUs: n1-standard, n1-highmem, n1-highcpu
- GPUs: NVIDIA T4, P4, V100, P100, A100, L4
- TPUs: v2, v3, v4, v5e (8 cores to 512 cores)
Pricing:
- CPU: $0.05-0.30/hour depending on machine type
- GPU T4: $0.35/hour
- GPU A100: $3.67/hour (40GB) or $4.95/hour (80GB)
- TPU v3: $8.00/hour (8 cores)
- TPU v4: $11.00/hour (8 cores)
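For orientation, a minimal sketch of submitting a custom training job through the Vertex AI Python SDK; the bucket, container image, and arguments are placeholders, and the setup scripts below generate the real configuration:

```python
# Minimal sketch: submit a Vertex AI custom training job from Python.
# Project, staging bucket, and container image are placeholders.
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",
)

job = aiplatform.CustomJob(
    display_name="pytorch-training",
    worker_pool_specs=[{
        "machine_spec": {
            "machine_type": "n1-standard-8",
            "accelerator_type": "NVIDIA_TESLA_T4",
            "accelerator_count": 1,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": "us-docker.pkg.dev/my-project/trainers/train:latest",
            "args": ["--epochs", "3"],
        },
    }],
)
job.run(sync=True)  # blocks until the job completes
```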
GPU/TPU Selection Guide
GPU Selection (Vertex AI)
T4 (16GB VRAM):
- Use case: Inference, light training, small models
- Cost: $0.35/hour
- Good for: BERT-base, small CNNs, inference serving
V100 (16GB VRAM):
- Use case: Mid-size training, mixed precision training
- Cost: $2.48/hour
- Good for: ResNet training, medium transformers
A100 (40GB/80GB VRAM):
- Use case: Large model training, distributed training
- Cost: $3.67/hour (40GB), $4.95/hour (80GB)
- Good for: GPT-style models, large vision models, multi-GPU training
L4 (24GB VRAM):
- Use case: Modern alternative to T4, better performance
- Cost: $0.66/hour
- Good for: Mid-size models, efficient inference
TPU Selection (Vertex AI)
TPU v2 (8 cores):
- Use case: TensorFlow/JAX training, matrix operations
- Cost: $4.50/hour
- Memory: 8GB per core (64GB total)
- Good for: Legacy TensorFlow models
TPU v3 (8 cores):
- Use case: Standard TPU training
- Cost: $8.00/hour
- Memory: 16GB per core (128GB total)
- Good for: BERT, T5, image classification
TPU v4 (8 cores):
- Use case: Latest generation, best performance
- Cost: $11.00/hour
- Memory: 32GB per core (256GB total)
- Good for: Large language models, cutting-edge research
TPU v5e (8 cores):
- Use case: Cost-optimized TPU
- Cost: $2.50/hour
- Good for: Development, large-scale training on a budget
Multi-node TPU Pods:
- v3-32: 32 cores, $32/hour
- v3-128: 128 cores, $128/hour
- v4-128: 128 cores, $176/hour
- Use for: Massive distributed training (GPT-3 scale)
Usage
Setup BigQuery ML Environment
bash scripts/setup-bigquery-ml.sh
Prompts for:
- GCP Project ID
- BigQuery dataset name
- Service account credentials
- Default model type preference
Creates:
- bigquery_config.json - Project configuration
- .bigqueryrc - CLI configuration
- Example training SQL in examples/
Setup Vertex AI Training Environment
bash scripts/setup-vertex-ai.sh
Prompts for:
- GCP Project ID
- Region (us-central1, europe-west4, etc.)
- Service account credentials
- Default machine type
- GPU/TPU preference
Creates:
- vertex_config.yaml - Training job configuration
- vertex_requirements.txt - Python dependencies
- Training script template
Configure GCP Authentication
bash scripts/configure-auth.sh
Prompts for:
- Authentication method (service account, user account, workload identity)
- Service account key path (if applicable)
- IAM roles needed
Creates:
- .gcp_auth_config - Authentication configuration
- Sets GOOGLE_APPLICATION_CREDENTIALS environment variable
- Validates permissions
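A short sketch of validating credentials from Python once the script has run, assuming a service-account key file; the path is a placeholder:

```python
# Minimal sketch: point ADC at a service-account key and confirm it resolves.
# The key path is a placeholder; prefer Workload Identity where available.
import os
import google.auth
from google.auth.transport.requests import Request

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/sa-key.json"

credentials, project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(Request())  # fails fast if the key is invalid or revoked
print(f"Authenticated against project: {project_id}")
```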
Required IAM Roles:
- BigQuery ML: roles/bigquery.dataEditor, roles/bigquery.jobUser
- Vertex AI: roles/aiplatform.user, roles/storage.objectAdmin
- Both: roles/serviceusage.serviceUsageConsumer
Estimate GCP Training Costs
bash scripts/estimate-gcp-cost.sh
Interactive prompts:
- Platform: BigQuery ML or Vertex AI
- If BigQuery ML: Data size to process
- If Vertex AI:
- Machine type (CPU/GPU/TPU)
- Number of machines
- Training duration estimate
- Storage requirements
Output:
- Estimated compute cost
- Storage cost
- Data transfer cost (if applicable)
- Total estimated cost
- Cost comparison with other GCP options
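The underlying arithmetic is straightforward; a rough sketch using the list prices quoted above (rates vary by region and change over time):

```python
# Rough sketch of the cost arithmetic, using the list prices quoted above.
# Rates vary by region and over time; treat results as estimates only.
GPU_HOURLY = {"T4": 0.35, "L4": 0.66, "V100": 2.48, "A100_40GB": 3.67}
MACHINE_HOURLY = 0.38       # assumed n1-standard-8 host rate (placeholder)
STORAGE_GB_MONTH = 0.02     # standard Cloud Storage, approximate

def estimate_vertex_cost(gpu, gpu_count, hours, dataset_gb):
    compute = (GPU_HOURLY[gpu] * gpu_count + MACHINE_HOURLY) * hours
    storage = dataset_gb * STORAGE_GB_MONTH
    return compute + storage

# Example: 4x A100 (40GB) for 24 hours with a 500 GB dataset
print(f"${estimate_vertex_cost('A100_40GB', 4, 24, 500):.2f}")
```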
Templates
BigQuery ML Training Template (templates/bigquery_ml_training.sql)
SQL template for creating and training models:
- Model creation syntax
- Feature engineering examples
- Training options (L1/L2 regularization, learning rate, etc.)
- Evaluation queries
- Prediction queries
Supported model types:
- LINEAR_REG, LOGISTIC_REG
- BOOSTED_TREE_CLASSIFIER, BOOSTED_TREE_REGRESSOR
- DNN_CLASSIFIER, DNN_REGRESSOR
- AUTOML_CLASSIFIER, AUTOML_REGRESSOR
Vertex AI Training Job Template (templates/vertex_training_job.py)
Python template for custom training:
- Training loop structure
- Distributed training setup (PyTorch DDP)
- Checkpointing and model saving
- Metrics logging to Vertex AI
- Hyperparameter tuning integration
Includes:
- Single GPU training
- Multi-GPU training (DataParallel, DistributedDataParallel)
- TPU training with PyTorch/XLA
- Cloud Storage integration
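A condensed sketch of the DistributedDataParallel setup the template wraps; the model and dataset are stand-ins for your own:

```python
# Condensed sketch of the DDP setup used in the training template.
# Model and dataset are stand-ins; launch one process per GPU
# (e.g. with torchrun), which sets RANK/LOCAL_RANK/WORLD_SIZE.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(128, 2).cuda(local_rank)      # placeholder model
model = DDP(model, device_ids=[local_rank])

dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 2, (1024,)))
sampler = DistributedSampler(dataset)                 # shards data per rank
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(2):
    sampler.set_epoch(epoch)                          # reshuffle across ranks
    for x, y in loader:
        x, y = x.cuda(local_rank), y.cuda(local_rank)
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()               # DDP syncs gradients
        optimizer.step()

dist.destroy_process_group()
```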
GPU Configuration Template (templates/vertex_gpu_config.yaml)
YAML configuration for GPU training jobs:
- Machine type selection
- GPU type and count
- Disk configuration
- Network configuration
- Environment variables
Presets included:
- Single T4 (budget)
- Single A100 (standard)
- 4x A100 (distributed)
- 8x A100 (large-scale)
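These presets correspond to Vertex AI worker pool specs; a sketch of two of them as Python dicts (machine types and accelerator enum names should be verified against current Vertex AI docs, and the container image is a placeholder):

```python
# Sketch of two GPU presets as Vertex AI worker_pool_specs dicts.
# Verify machine types and accelerator enum names against current docs.
SINGLE_T4_BUDGET = [{
    "machine_spec": {
        "machine_type": "n1-standard-8",
        "accelerator_type": "NVIDIA_TESLA_T4",
        "accelerator_count": 1,
    },
    "replica_count": 1,
    "container_spec": {"image_uri": "us-docker.pkg.dev/my-project/trainers/train:latest"},
}]

QUAD_A100_DISTRIBUTED = [{
    "machine_spec": {
        "machine_type": "a2-highgpu-4g",       # host with 4x A100 40GB
        "accelerator_type": "NVIDIA_TESLA_A100",
        "accelerator_count": 4,
    },
    "replica_count": 1,
    "container_spec": {"image_uri": "us-docker.pkg.dev/my-project/trainers/train:latest"},
}]
```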
TPU Configuration Template (templates/vertex_tpu_config.yaml)
YAML configuration for TPU training jobs:
- TPU type and topology
- TPU version selection
- JAX/TensorFlow runtime
- XLA compilation flags
Presets included:
- v3-8 (single TPU)
- v4-32 (TPU pod slice)
- v5e-8 (cost-optimized)
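For older TPU generations, a custom job can request a cloud-tpu machine type directly; a hedged sketch of a v3-8 spec (v4/v5e slices use dedicated TPU machine types, so check current documentation for those):

```python
# Hedged sketch of a v3-8 worker pool spec for a Vertex AI custom job.
# v4/v5e slices use dedicated TPU machine types rather than an accelerator
# enum; verify field values against current Vertex AI documentation.
TPU_V3_8 = [{
    "machine_spec": {
        "machine_type": "cloud-tpu",
        "accelerator_type": "TPU_V3",
        "accelerator_count": 8,
    },
    "replica_count": 1,
    "container_spec": {"image_uri": "us-docker.pkg.dev/my-project/trainers/tpu-train:latest"},
}]
```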
GCP Authentication Template (templates/gcp_auth.json)
Service account configuration template:
- Project ID
- Service account email
- Key file path
- Required scopes
- IAM role assignments
Security notes:
- Uses placeholders only (never real keys)
- Documents how to create service accounts
- Includes .gitignore protection
Examples
BigQuery ML Regression Example (examples/bigquery-regression-example.sql)
Complete example:
- Dataset: NYC taxi trip data
- Task: Predict trip duration
- Model: BOOSTED_TREE_REGRESSOR
- Includes feature engineering, training, evaluation
Demonstrates:
- CREATE MODEL syntax
- TRANSFORM clause for feature engineering
- Model evaluation with ML.EVALUATE
- Batch predictions
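A trimmed sketch of the pattern the example follows; table and column names here are illustrative, not the exact ones in the example file:

```python
# Trimmed sketch: boosted-tree regression with a TRANSFORM clause, then
# batch predictions. Table and column names are illustrative only.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

client.query("""
CREATE OR REPLACE MODEL `my_dataset.trip_duration_model`
TRANSFORM(
  ML.STANDARD_SCALER(trip_distance) OVER() AS trip_distance_scaled,
  EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour,
  passenger_count,
  trip_duration_seconds
)
OPTIONS(
  model_type='BOOSTED_TREE_REGRESSOR',
  input_label_cols=['trip_duration_seconds'],
  max_iterations=50
) AS
SELECT trip_distance, pickup_datetime, passenger_count, trip_duration_seconds
FROM `my_dataset.taxi_trips`
""").result()

# Batch predictions with ML.PREDICT
predictions = client.query("""
SELECT * FROM ML.PREDICT(
  MODEL `my_dataset.trip_duration_model`,
  (SELECT trip_distance, pickup_datetime, passenger_count
   FROM `my_dataset.taxi_trips_new`))
""").result()
```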
Vertex AI PyTorch Training Example (examples/vertex-pytorch-training.py)
Complete training script:
- Dataset: IMDB sentiment analysis
- Model: DistilBERT fine-tuning
- Training: Single GPU
- Logging: Vertex AI experiments
Demonstrates:
- Loading data from GCS
- Training loop with mixed precision
- Checkpointing to GCS
- Metrics logging
- Model export to Vertex AI
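A compact sketch of the mixed-precision step and GCS checkpointing pattern used in the script; the model, batch, and bucket path are placeholders:

```python
# Compact sketch of a mixed-precision (FP16) training step with AMP and
# periodic checkpointing. Model, data, and the GCS-mounted path are placeholders.
import torch

device = torch.device("cuda")
model = torch.nn.Linear(768, 2).to(device)             # stands in for DistilBERT
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scaler = torch.cuda.amp.GradScaler()
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(100):
    x = torch.randn(32, 768, device=device)            # placeholder batch
    y = torch.randint(0, 2, (32,), device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                     # forward pass in FP16
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()                       # scaled backward pass
    scaler.step(optimizer)
    scaler.update()
    if step % 50 == 0:
        # Vertex AI mounts Cloud Storage under /gcs/<bucket> via gcsfuse
        torch.save(model.state_dict(), "/gcs/my-bucket/checkpoints/ckpt.pt")
```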
Vertex AI Distributed Training Example (examples/vertex-distributed-training.py)
Multi-GPU training example:
- Dataset: ImageNet subset
- Model: ResNet-50
- Training: 4x A100 with DDP
- Scaling: Linear scaling rule
Demonstrates:
- PyTorch DistributedDataParallel
- Gradient accumulation
- Learning rate scaling
- Synchronized batch norm
- Multi-node coordination
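Two details the example relies on, sketched in isolation: the linear learning-rate scaling rule and synchronized batch norm conversion.

```python
# Sketch of two details from the distributed example: linear LR scaling
# with world size, and converting BatchNorm layers to SyncBatchNorm.
import torch
import torch.distributed as dist
import torchvision

base_lr = 0.1                      # reference LR for a single-GPU batch
world_size = dist.get_world_size() if dist.is_initialized() else 1
scaled_lr = base_lr * world_size   # linear scaling: LR grows with global batch size

model = torchvision.models.resnet50()
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)  # sync BN stats across ranks
optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr, momentum=0.9)
```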
Hugging Face Fine-tuning on Vertex AI (examples/vertex-huggingface-finetuning.py)
Production fine-tuning template:
- Dataset: Custom text classification
- Model: BERT/RoBERTa/DeBERTa
- Training: Hugging Face Trainer API
- Deployment: Vertex AI endpoint
Demonstrates:
- Hugging Face Trainer integration
- Hyperparameter tuning with Vertex AI
- Model versioning
- Endpoint deployment
- Online predictions
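A shortened sketch of the overall flow (Trainer fine-tune, then register and deploy on Vertex AI); the dataset, buckets, and serving container are placeholders:

```python
# Shortened sketch: fine-tune with the Hugging Face Trainer, then register
# the artifact with Vertex AI. Dataset, buckets, and serving image are
# placeholders; the full example handles tokenization and evaluation properly.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import load_dataset
from google.cloud import aiplatform

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

dataset = load_dataset("imdb", split="train[:2000]")   # placeholder dataset
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length"),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="/gcs/my-bucket/bert-output",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("/gcs/my-bucket/bert-output/final")

# Register the model and deploy an endpoint (serving image is a placeholder)
aiplatform.init(project="my-project", location="us-central1")
uploaded = aiplatform.Model.upload(
    display_name="bert-text-classifier",
    artifact_uri="gs://my-bucket/bert-output/final",
    serving_container_image_uri="us-docker.pkg.dev/my-project/serving/bert-serve:latest",
)
endpoint = uploaded.deploy(machine_type="n1-standard-4")
```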
Cost Optimization Tips
BigQuery ML
Reduce data processed:
- Use partitioned tables
- Filter data in WHERE clause before training
- Use table sampling for experimentation
- Cache intermediate results
Use appropriate model types:
- Start with LINEAR_REG/LOGISTIC_REG (cheapest)
- Use BOOSTED_TREE for better accuracy at moderate cost
- Reserve AutoML for when simpler models fail
Optimize queries:
- Avoid SELECT * (specify columns)
- Use clustering on filter columns
- Materialize views for repeated training
Vertex AI
Machine type selection:
- Start with CPU for prototyping
- Use T4 for small models (cheapest GPU)
- Use A100 only for large models that need it
- Consider TPU v5e for TensorFlow/JAX (very cost-effective)
Training optimization:
- Use preemptible instances (60-70% cheaper, can be interrupted)
- Enable automatic checkpoint/resume for preemptible
- Use mixed precision training (FP16/BF16) for faster training
- Profile to eliminate CPU bottlenecks
Storage optimization:
- Store datasets in Cloud Storage (cheaper than persistent disk)
- Use Filestore only if needed for POSIX filesystem
- Clean up old model artifacts
- Use lifecycle policies to archive old data
Multi-GPU efficiency:
- Ensure near-linear scaling before adding more GPUs
- Profile inter-GPU communication
- Use gradient accumulation instead of larger batch sizes
- Consider 2x GPUs instead of 1x larger GPU (often same cost, better availability)
Integration with ML Training Plugin
This skill integrates with other ml-training components:
- training-patterns: Provides GCP configs for generated training scripts
- cost-calculator: Uses GCP pricing data for budget planning
- monitoring-dashboard: Integrates with Vertex AI TensorBoard
- validation-scripts: Validates GCP credentials and permissions
- integration-helpers: Deploys trained models to Vertex AI endpoints
Common Workflows
Workflow 1: Quick BigQuery ML Prototype
- Run bash scripts/setup-bigquery-ml.sh
- Copy templates/bigquery_ml_training.sql to your project
- Modify SQL for your dataset and features
- Run training query in BigQuery console
- Evaluate with built-in ML.EVALUATE()
- Export predictions with ML.PREDICT()
Time: 30 minutes setup + training time
Cost: $5 per TB of data processed
Workflow 2: Custom PyTorch Training on Vertex AI
- Run bash scripts/configure-auth.sh
- Run bash scripts/setup-vertex-ai.sh
- Copy templates/vertex_training_job.py
- Customize training loop for your model
- Copy templates/vertex_gpu_config.yaml
- Submit job: gcloud ai custom-jobs create ...
- Monitor in Vertex AI console
Time: 1 hour setup + training time
Cost: Depends on GPU/TPU selection
Workflow 3: Large-Scale Distributed Training
- Setup Vertex AI (workflow 2)
- Copy examples/vertex-distributed-training.py
- Adapt for your model architecture
- Test locally with 1 GPU
- Test with 2 GPUs to verify scaling
- Scale to 4-8 GPUs for full training
- Use preemptible instances with checkpointing (see the resume sketch below)
Time: 2-4 hours setup + training time
Cost: $15-60/hour depending on GPU count
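Checkpoint/resume on preemptible capacity reduces to saving state regularly and restoring it at startup; a minimal sketch, assuming checkpoints live on a GCS-mounted path:

```python
# Minimal checkpoint/resume sketch for preemptible (Spot) training.
# Assumes the checkpoint directory is on a GCS-mounted path; adjust the
# path and checkpoint contents for your own training loop.
import os
import torch

CKPT_PATH = "/gcs/my-bucket/checkpoints/latest.pt"   # placeholder path

model = torch.nn.Linear(128, 2)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
start_epoch = 0

if os.path.exists(CKPT_PATH):                        # resume after preemption
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_epoch = ckpt["epoch"] + 1

for epoch in range(start_epoch, 10):
    # ... run one epoch of training here ...
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "epoch": epoch},
        CKPT_PATH,
    )
```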
Troubleshooting
BigQuery ML Issues
"Insufficient permissions":
- Verify roles/bigquery.dataEditor and roles/bigquery.jobUser
- Check dataset-level permissions
- Ensure billing is enabled
"Model training failed":
- Check for NULL values in features
- Verify data types match model expectations
- Review feature engineering TRANSFORM clause
- Check for sufficient training data
Vertex AI Issues
"Service account lacks permissions":
- Verify roles/aiplatform.user
- Add roles/storage.objectAdmin for GCS access
- Check project-level IAM policies
"GPU/TPU quota exceeded":
- Request quota increase in GCP console
- Use different region with availability
- Start with smaller GPU/TPU configuration
- Use preemptible instances (separate quota)
"Training job crashes":
- Check for CUDA OOM (reduce batch size)
- Verify dependencies in requirements.txt
- Review logs in Cloud Logging
- Test locally before submitting to Vertex
Security Best Practices
Credentials Management
DO:
- ✅ Use service accounts with minimal permissions
- ✅ Store credentials in Secret Manager
- ✅ Use Workload Identity for GKE deployments
- ✅ Rotate service account keys regularly
- ✅ Add a .gitignore entry for *.json key files
DON'T:
- ❌ Hardcode credentials in code
- ❌ Commit service account keys to git
- ❌ Use overly permissive roles (e.g., Owner)
- ❌ Share service account keys across projects
- ❌ Use personal credentials for production
IAM Best Practices
- Use separate service accounts for training vs serving
- Grant roles at resource level, not project level when possible
- Use Workload Identity Federation instead of keys when possible
- Enable Cloud Audit Logs for ML API usage
- Review IAM permissions quarterly
Performance Benchmarks
BigQuery ML vs Vertex AI
BigQuery ML:
- Best for: Structured data, SQL users, quick prototypes
- Training time: Minutes to hours (depends on data size)
- Scalability: Automatic (serverless)
- Cost: $5/TB processed
Vertex AI Custom Training:
- Best for: Deep learning, custom architectures, GPU/TPU workloads
- Training time: Hours to days (configurable hardware)
- Scalability: Manual (choose machine type)
- Cost: $0.35-20/hour depending on hardware
Rule of thumb:
- Use BigQuery ML for tabular data with < 100M rows
- Use Vertex AI for images, text, audio, or custom models
- Use Vertex AI for models requiring GPU/TPU acceleration
Additional Resources