name: cloudflare-workers-ai
description: |
Run LLMs and AI models on Cloudflare's global GPU network with Workers AI. Includes Llama 4, Gemma 3, Mistral 3.1, Flux image generation, BGE embeddings (2x faster, 2025), streaming support, and AI Gateway for cost tracking.

Use when: implementing LLM inference, generating images, building RAG with embeddings, streaming AI responses, using AI Gateway, troubleshooting max_tokens defaults (breaking change 2025), BGE pooling parameter (not backwards compatible), or handling AI_ERROR, rate limits, model deprecations, token limits.

Keywords: workers ai, cloudflare ai, ai bindings, llm workers, @cf/meta/llama-4-scout, @cf/google/gemma-3-12b-it, @cf/mistralai/mistral-small-3.1-24b-instruct, @cf/openai/gpt-oss-120b, workers ai models, ai inference, cloudflare llm, ai streaming, text generation ai, ai embeddings, bge pooling cls mean, image generation ai, workers ai rag, ai gateway, llama workers, flux image generation, deepgram aura, leonardo image generation, vision models ai, ai chat completion, AI_ERROR, rate limit ai, model not found, max_tokens breaking change, bge pooling backwards compatibility, model deprecations october 2025, token limit exceeded, neurons exceeded, workers ai hono, ai gateway workers, vercel ai sdk workers, openai compatible workers, workers ai vectorize, workers-ai-provider v2, ai sdk v5, lora adapters rank 32

Cloudflare Workers AI

Status: Production Ready ✅
Last Updated: 2025-11-25
Dependencies: cloudflare-worker-base (for Worker setup)
Latest Versions: wrangler@4.50.0, @cloudflare/workers-types@4.20251125.0

Recent Updates (2025):

April 2025 - Performance: Llama 3.3 70B 2-4x faster (speculative decoding, prefix caching), BGE embeddings 2x faster
April 2025 - Breaking Changes: max_tokens now correctly defaults to 256 (the default was previously not respected); BGE models accept a pooling parameter, and embeddings produced with cls are NOT backwards compatible with those produced with mean
2025 - New Models (14): Mistral 3.1 24B (vision+tools), Gemma 3 12B (128K context), EmbeddingGemma 300M, Llama 4 Scout, GPT-OSS 120B/20B, Qwen models (QwQ 32B, Coder 32B), Leonardo image gen, Deepgram Aura 2, Whisper v3 Turbo, IBM Granite, Nova 3
2025 - Platform: Context windows API change (tokens not chars), unit-based pricing with per-model granularity, workers-ai-provider v2.0.0 (AI SDK v5), LoRA rank up to 32 (was 8), 100 adapters per account
October 2025 - Model deprecations (use Llama 4, GPT-OSS instead)

Quick Start (5 Minutes)

// 1. Add AI binding to wrangler.jsonc
{ "ai": { "binding": "AI" } }

// 2. Run model with streaming (recommended)
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: [{ role: 'user', content: 'Tell me a story' }],
      stream: true, // Always stream for text generation!
    });

    return new Response(stream, {
      headers: { 'content-type': 'text/event-stream' },
    });
  },
};

Why streaming? It avoids buffering the full response in memory, delivers the first token sooner, and reduces the risk of Worker timeouts on long generations.
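If the stream needs to be consumed inside the Worker (or in a client) rather than passed straight through, the events have to be parsed. A minimal sketch, assuming the stream carries server-sent events whose payload is JSON with a `response` field and that it ends with `data: [DONE]`; `createSseTokenParser` is a hypothetical helper, not part of the Workers AI API:

```typescript
// Incremental SSE parser: feed it decoded text chunks, get back the tokens
// completed by that chunk.
// Assumption: events look like `data: {"response":"..."}` and the stream
// ends with `data: [DONE]`.
function createSseTokenParser() {
  let buffer = '';
  let done = false;

  return {
    push(chunk: string): string[] {
      buffer += chunk;
      const lines = buffer.split('\n');
      // The last element may be an incomplete line; keep it for the next chunk.
      buffer = lines.pop() ?? '';

      const tokens: string[] = [];
      for (const rawLine of lines) {
        const line = rawLine.trim();
        if (done || !line.startsWith('data:')) continue;

        const payload = line.slice('data:'.length).trim();
        if (payload === '[DONE]') {
          done = true;
          continue;
        }
        try {
          const parsed = JSON.parse(payload) as { response?: unknown };
          if (typeof parsed.response === 'string') tokens.push(parsed.response);
        } catch {
          // Skip malformed events rather than failing the whole stream.
        }
      }
      return tokens;
    },
    isDone(): boolean {
      return done;
    },
  };
}
```

Decode each chunk with `new TextDecoder().decode(value, { stream: true })` before calling `push`, so multi-byte characters split across chunks are handled.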

API Reference

env.AI.run(
  model: string,
  inputs: ModelInputs,
  options?: { gateway?: { id: string; skipCache?: boolean } }
): Promise<ModelOutput | ReadableStream>
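Since max_tokens defaults to 256, longer completions are cut off unless the limit is passed in `inputs`. A minimal sketch of making that limit explicit; `withMaxTokens` and the 1024 default are hypothetical choices for illustration, and the input types are simplified stand-ins for the real ones:

```typescript
// Simplified stand-ins for the chat input types (assumption, not the official types).
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

interface ChatInputs {
  messages: ChatMessage[];
  max_tokens?: number;
  stream?: boolean;
}

// Hypothetical helper: sets an explicit output limit instead of relying on
// the 256-token default.
function withMaxTokens(inputs: ChatInputs, maxTokens = 1024): ChatInputs {
  if (!Number.isInteger(maxTokens) || maxTokens <= 0) {
    throw new RangeError('maxTokens must be a positive integer');
  }
  // Respect a limit the caller already set; otherwise apply ours.
  return { ...inputs, max_tokens: inputs.max_tokens ?? maxTokens };
}

// Usage inside a Worker (env.AI is the binding from wrangler.jsonc):
// const result = await env.AI.run(
//   '@cf/meta/llama-3.1-8b-instruct',
//   withMaxTokens({ messages: [{ role: 'user', content: 'Summarize this report' }] }),
// );
```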

Model Selection Guide (Updated 2025)

Text Generation (LLMs)

Model	Best For	Rate Limit	Size	Notes
2025 Models
`@cf/meta/llama-4-scout-17b-16e-instruct`	Latest Llama, general purpose	300/min	17B	NEW 2025
`@cf/openai/gpt-oss-120b`	Largest open-source GPT	300/min	120B	NEW 2025
`@cf/openai/gpt-oss-20b`	Smaller open-source GPT	300/min	20B	NEW 2025
`@cf/google/gemma-3-12b-it`	128K context, 140+ languages	300/min	12B	NEW 2025, vision
`@cf/mistralai/mistral-small-3.1-24b-instruct`	Vision + tool calling	300/min	24B	NEW 2025
`@cf/qwen/qwq-32b`	Reasoning, complex tasks	300/min	32B	NEW 2025
`@cf/qwen/qwen2.5-coder-32b-instruct`	Coding specialist	300/min	32B	NEW 2025
`@cf/qwen/qwen3-30b-a3b-fp8`	Fast quantized	300/min	30B	NEW 2025
`@cf/ibm-granite/granite-4.0-h-micro`	Small, efficient	300/min	Micro	NEW 2025
Performance (2025)
`@cf/meta/llama-3.3-70b-instruct-fp8-fast`	2-4x faster (2025 update)	300/min	70B	Speculative decoding
`@cf/meta/llama-3.1-8b-instruct-fp8-fast`	Fast 8B variant	300/min	8B	-
Standard Models
`@cf/meta/llama-3.1-8b-instruct`	General purpose	300/min	8B	-
`@cf/meta/llama-3.2-1b-instruct`	Ultra-fast, simple tasks	300/min	1B	-
`@cf/deepseek-ai/deepseek-r1-distill-qwen-32b`	Coding, technical	300/min	32B	-

Text Embeddings (2x Faster - 2025)

Model	Dimensions	Best For	Rate Limit	Notes
`@cf/google/embeddinggemma-300m`	768	Best-in-class RAG	3000/min	NEW 2025
`@cf/baai/bge-base-en-v1.5`	768	General RAG (2x faster)	3000/min	pooling: "cls" recommended
`@cf/baai/bge-large-en-v1.5`	1024	High accuracy (2x faster)	1500/min	pooling: "cls" recommended
`@cf/baai/bge-small-en-v1.5`	384	Fast, low storage (2x faster)	3000/min	pooling: "cls" recommended
`@cf/qwen/qwen3-embedding-0.6b`	768	Qwen embeddings	3000/min	NEW 2025

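CRITICAL (2025): BGE models now accept a pooling parameter. pooling: "cls" is recommended, but embeddings generated with "cls" are NOT compatible with embeddings generated with pooling: "mean" (the default). Use one pooling mode for both indexing and querying, and re-embed existing vectors before switching.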
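One way to avoid mixing pooling modes is to fix the mode once per index and build every embedding input from it. A minimal sketch; `makeEmbeddingInputs` is a hypothetical helper, and the input shape (`text` plus `pooling`) is assumed from the parameters named above:

```typescript
type Pooling = 'cls' | 'mean';

interface EmbeddingInputs {
  text: string[];
  pooling: Pooling;
}

// Hypothetical helper: pins one pooling mode so documents and queries
// for the same index are always embedded the same way.
function makeEmbeddingInputs(indexPooling: Pooling) {
  return (text: string | string[]): EmbeddingInputs => ({
    text: Array.isArray(text) ? text : [text],
    pooling: indexPooling,
  });
}

const embedInputs = makeEmbeddingInputs('cls');

// Usage inside a Worker:
// const docs = await env.AI.run('@cf/baai/bge-base-en-v1.5', embedInputs(chunks));
// const query = await env.AI.run('@cf/baai/bge-base-en-v1.5', embedInputs(userQuery));
```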

Image Generation

Model	Best For	Rate Limit	Notes
`@cf/black-forest-labs/flux-1-schnell`	High quality, photorealistic	720/min	-
`@cf/leonardo/lucid-origin`	Leonardo AI style	720/min	NEW 2025
`@cf/leonardo/phoenix-1.0`	Leonardo AI variant	720/min	NEW 2025
`@cf/stabilityai/stable-diffusion-xl-base-1.0`	General purpose	720/min	-

Vision Models

Model	Best For	Rate Limit	Notes
`@cf/meta/llama-3.2-11b-vision-instruct`	Image understanding	720/min	-
`@cf/google/gemma-3-12b-it`	Vision + text (128K context)	300/min	NEW 2025

Audio Models (2025)

Model	Type	Rate Limit	Notes
`@cf/deepgram/aura-2-en`	Text-to-speech (English)	720/min	NEW 2025
`@cf/deepgram/aura-2-es`	Text-to-speech (Spanish)	720/min	NEW 2025
`@cf/deepgram/nova-3`	Speech-to-text (+ WebSocket)	720/min	NEW 2025
`@cf/openai/whisper-large-v3-turbo`	Speech-to-text (faster)	720/min	NEW 2025

Common Patterns

RAG (Retrieval Augmented Generation)

// 1. Generate embeddings
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: [userQuery] });

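// 2. Search Vectorize (request metadata explicitly; it is not returned by default)
const matches = await env.VECTORIZE.query(embeddings.data[0], { topK: 3, returnMetadata: 'all' });
const context = matches.matches.map((m) => m.metadata?.text).filter(Boolean).join('\n\n');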

// 3. Generate with context
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [
    { role: 'system', content: `Answer using this context:\n${context}` },
    { role: 'user', content: userQuery },
  ],
  stream: true,
});
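Step 2 can be made more defensive by dropping weak matches and capping the context size before it reaches the prompt. A minimal sketch, assuming matches carry a `text` metadata field, that higher scores mean closer matches (as with cosine similarity), and that the 0.5 and 4000 thresholds are placeholder values; `buildContext` and the `Match` type are hypothetical:

```typescript
// Simplified stand-in for a Vectorize match (assumption, not the official type).
interface Match {
  id: string;
  score: number;
  metadata?: Record<string, unknown>;
}

// Hypothetical helper: keeps the best-scoring matches that have text metadata,
// stopping once the character budget would be exceeded.
function buildContext(matches: Match[], minScore = 0.5, maxChars = 4000): string {
  const parts: string[] = [];
  let used = 0;

  // Copy before sorting so the caller's array is left untouched.
  for (const m of [...matches].sort((a, b) => b.score - a.score)) {
    const text = m.metadata?.text;
    if (m.score < minScore || typeof text !== 'string' || text.length === 0) continue;
    if (used + text.length > maxChars) break;
    parts.push(text);
    used += text.length;
  }
  return parts.join('\n\n');
}

// Usage: const context = buildContext(matches.matches);
```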

Structured Output with Zod

import { z } from 'zod';

const Schema = z.object({ name: z.string(), items: z.array(z.string()) });

const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [{
    role: 'user',
    content: 'Return only a JSON object of the form {"name": string, "items": string[]}'
  }],
});

const validated = Schema.parse(JSON.parse(response.response));
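Models often wrap JSON in extra prose or code fences, which makes a bare `JSON.parse` throw. A minimal sketch of a more tolerant extraction step that could run before schema validation; `extractJson` is a hypothetical helper and only handles a single top-level object:

```typescript
// Hypothetical helper: pulls the outermost {...} span out of a model reply
// and parses it. Surrounding prose and code fences are ignored.
function extractJson(raw: string): unknown {
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) {
    throw new SyntaxError('No JSON object found in model output');
  }
  // JSON.parse still throws if the span itself is not valid JSON.
  return JSON.parse(raw.slice(start, end + 1));
}

// Usage with the schema above:
// const validated = Schema.parse(extractJson(response.response));
```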

AI Gateway Integration

Provides caching, logging, cost tracking, and analytics for AI requests.

const response = await env.AI.run(
  '@cf/meta/llama-3.1-8b-instruct',
  { prompt: 'Hello' },
  { gateway: { id: 'my-gateway', skipCache: false } }
);

// Access logs and send feedback (aiGatewayLogId refers to the most recent run)
const gateway = env.AI.gateway('my-gateway');
await gateway.patchLog(env.AI.aiGatewayLogId, {
  feedback: 1, // 1 = positive, -1 = negative
  metadata: { comment: 'Great response' },
});

Benefits: Cost tracking, caching (reduces duplicate inference), logging, rate limiting, analytics.

Rate Limits & Pricing (Updated 2025)

Rate Limits (per minute)

Task Type	Default Limit	Notes
Text Generation	300/min	Some fast models: 400-1500/min
Text Embeddings	3000/min	BGE-large: 1500/min
Image Generation	720/min	All image models
Vision Models	720/min	Image understanding
Audio (TTS/STT)	720/min	Deepgram, Whisper
Translation	720/min	M2M100, Opus MT
Classification	2000/min	Text classification

Pricing (Unit-Based, Billed in Neurons - 2025)

Free Tier:

10,000 neurons per day
Resets daily at 00:00 UTC

Paid Tier ($0.011 per 1,000 neurons):

10,000 neurons/day included
No daily cap: usage above the included 10,000 neurons is billed at the rate above
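The paid-tier arithmetic can be sketched as follows, assuming the 10,000 included neurons are subtracted per day before the $0.011 per 1,000 rate applies; `dailyNeuronCostUsd` is a hypothetical helper:

```typescript
const FREE_NEURONS_PER_DAY = 10000;
const USD_PER_1K_NEURONS = 0.011;

// Hypothetical helper: estimated paid-tier charge for one day's usage.
function dailyNeuronCostUsd(neuronsUsed: number): number {
  if (neuronsUsed < 0) throw new RangeError('neuronsUsed must not be negative');
  // Only neurons beyond the daily included allocation are billed.
  const billable = Math.max(0, neuronsUsed - FREE_NEURONS_PER_DAY);
  return (billable / 1000) * USD_PER_1K_NEURONS;
}
```

For example, a day with 110,000 neurons has 100,000 billable neurons, which comes to roughly $1.10.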

2025 Model Costs (per 1M tokens):

Model	Input	Output	Notes
2025 Models
Llama 4 Scout 17B	$0.270	$0.850	NEW 2025
GPT-OSS 120B	$0.350	$0.750	NEW 2025
GPT-OSS 20B	$0.200	$0.300	NEW 2025
Gemma 3 12B	$0.345	$0.556	NEW 2025
Mistral 3.1 24B	$0.351	$0.555	NEW 2025
Qwen QwQ 32B	$0.660	$1.000	NEW 2025
Qwen Coder 32B	$0.660	$1.000	NEW 2025
IBM Granite Micro	$0.017	$0.112	NEW 2025
EmbeddingGemma 300M	$0.012	N/A	NEW 2025
Qwen3 Embedding 0.6B	$0.012	N/A	NEW 2025
Performance (2025)
Llama 3.3 70B Fast	$0.293	$2.253	2-4x faster
Llama 3.1 8B FP8 Fast	$0.045	$0.384	Fast variant
Standard Models
Llama 3.2 1B	$0.027	$0.201	-
Llama 3.1 8B	$0.282	$0.827	-
Deepseek R1 32B	$0.497	$4.881	-
BGE-base (2x faster)	$0.067	N/A	2025 speedup
BGE-large (2x faster)	$0.204	N/A	2025 speedup
Image Models (2025)
Flux 1 Schnell	$0.0000528 per 512x512 tile	-
Leonardo Lucid	$0.006996 per 512x512 tile	NEW 2025
Leonardo Phoenix	$0.005830 per 512x512 tile	NEW 2025
Audio Models (2025)
Deepgram Aura 2	$0.030 per 1k chars	NEW 2025
Deepgram Nova 3	$0.0052 per audio min	NEW 2025
Whisper v3 Turbo	$0.0005 per audio min	NEW 2025
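Per-request cost for text models follows directly from the table. A minimal sketch, with a few rows copied from the table above into a lookup; `PRICES` and `estimateCostUsd` are hypothetical names, and the model IDs are mapped to the price rows by assumption:

```typescript
interface TokenPrice {
  inputPerMillion: number;  // USD per 1M input tokens
  outputPerMillion: number; // USD per 1M output tokens
}

// A few rows from the pricing table; extend as needed.
const PRICES: Record<string, TokenPrice> = {
  '@cf/meta/llama-3.1-8b-instruct': { inputPerMillion: 0.282, outputPerMillion: 0.827 },
  '@cf/meta/llama-3.2-1b-instruct': { inputPerMillion: 0.027, outputPerMillion: 0.201 },
  '@cf/openai/gpt-oss-20b': { inputPerMillion: 0.2, outputPerMillion: 0.3 },
};

// Hypothetical helper: estimated USD cost of one request.
function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICES[model];
  if (!price) throw new Error(`No price configured for model: ${model}`);
  return (
    (inputTokens / 1_000_000) * price.inputPerMillion +
    (outputTokens / 1_000_000) * price.outputPerMillion
  );
}
```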

Error Handling with Retry

async function runAIWithRetry(
  env: Env,
  model: string,
  inputs: any,
  maxRetries = 3
): Promise<any> {
  let lastError = new Error('runAIWithRetry: no attempts were made');

  for (let i = 0; i < maxRetries; i++) {
    try {
      return await env.AI.run(model, inputs);
    } catch (error) {
      lastError = error as Error;

      // Rate limit - retry with exponential backoff (1s, 2s, 4s, ...)
      if (lastError.message.toLowerCase().includes('rate limit')) {
        // No point waiting after the final attempt
        if (i < maxRetries - 1) {
          await new Promise((resolve) => setTimeout(resolve, Math.pow(2, i) * 1000));
        }
        continue;
      }

      throw error; // Other errors - fail immediately
    }
  }

  throw lastError; // All retries exhausted
}

OpenAI Compatibility

import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: env.CLOUDFLARE_API_KEY,
  baseURL: `https://api.cloudflare.com/client/v4/accounts/${env.ACCOUNT_ID}/ai/v1`,
});

// Chat completions
await openai.chat.completions.create({
  model: '@cf/meta/llama-3.1-8b-instruct',
  messages: [{ role: 'user', content: 'Hello!' }],
});

Endpoints: /v1/chat/completions, /v1/embeddings
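The embeddings endpoint can also be called with plain `fetch`. A minimal sketch that builds the request from the base URL shown above, assuming the endpoint accepts the OpenAI-style `{ model, input }` body and a bearer token; `buildEmbeddingsRequest` is a hypothetical helper:

```typescript
interface EmbeddingsRequest {
  url: string;
  init: {
    method: 'POST';
    headers: Record<string, string>;
    body: string;
  };
}

// Hypothetical helper: assembles the URL and fetch options for /v1/embeddings.
function buildEmbeddingsRequest(
  accountId: string,
  apiToken: string,
  input: string[],
  model = '@cf/baai/bge-base-en-v1.5',
): EmbeddingsRequest {
  return {
    url: `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/v1/embeddings`,
    init: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${apiToken}`,
        'Content-Type': 'application/json',
      },
      // Assumption: OpenAI-style request body.
      body: JSON.stringify({ model, input }),
    },
  };
}

// Usage:
// const { url, init } = buildEmbeddingsRequest(env.ACCOUNT_ID, env.CLOUDFLARE_API_KEY, ['Hello']);
// const res = await fetch(url, init);
```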

Vercel AI SDK Integration (workers-ai-provider v2.0.0)

import { createWorkersAI } from 'workers-ai-provider'; // v2.0.0 with AI SDK v5
import { generateText, streamText } from 'ai';

const workersai = createWorkersAI({ binding: env.AI });

// Generate (streamText accepts the same model and prompt arguments for streaming)
await generateText({
  model: workersai('@cf/meta/llama-3.1-8b-instruct'),
  prompt: 'Write a poem',
});

References

Workers AI Docs
Models Catalog
AI Gateway
Pricing
Changelog
LoRA Adapters
MCP Tool: Use mcp__cloudflare-docs__search_cloudflare_documentation for latest docs
