Production-ready knowledge domain for building AI-powered applications with Cloudflare Workers AI.

Install:

```
/plugin marketplace add secondsky/claude-skills
/plugin install cloudflare-workers-ai@claude-skills
```

This skill inherits all available tools. When active, it can use any tool Claude has access to.

Bundled files:

- references/best-practices.md
- references/integrations.md
- references/models-catalog.md
- templates/ai-embeddings-rag.ts
- templates/ai-gateway-integration.ts
- templates/ai-image-generation.ts
- templates/ai-text-generation.ts
- templates/ai-vision-models.ts
- templates/wrangler-ai-config.jsonc

Status: Production Ready | Last Updated: 2025-11-21
Dependencies: cloudflare-worker-base (for Worker setup)
Latest Versions: wrangler@4.43.0, @cloudflare/workers-types@4.20251014.0
wrangler.jsonc:

```jsonc
{
  "ai": {
    "binding": "AI"
  }
}
```
```typescript
export interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      prompt: 'What is Cloudflare?',
    });
    return Response.json(response);
  },
};
```
```typescript
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [{ role: 'user', content: 'Tell me a story' }],
  stream: true, // Always use streaming for text generation!
});

return new Response(stream, {
  headers: { 'content-type': 'text/event-stream' },
});
```
Why streaming? Tokens are returned as they are generated, so users see output immediately instead of waiting for the full completion, and the Worker avoids buffering the entire response in memory.
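Where the Worker needs the assembled text rather than a pass-through stream, the SSE chunks can be decoded and accumulated. A minimal sketch, assuming each event is a `data: ` line carrying JSON with a `response` field and that the stream ends with `data: [DONE]` (verify the exact event shape against the current docs):

```typescript
// Hedged sketch: consume the SSE stream inside the Worker instead of
// forwarding it to the client.
const stream = (await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [{ role: 'user', content: 'Tell me a story' }],
  stream: true,
})) as ReadableStream<Uint8Array>;

const decoder = new TextDecoder();
let buffer = '';
let fullText = '';
for await (const chunk of stream) {
  buffer += decoder.decode(chunk, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop() ?? ''; // keep any partial line for the next chunk
  for (const line of lines) {
    if (!line.startsWith('data: ') || line === 'data: [DONE]') continue;
    fullText += JSON.parse(line.slice('data: '.length)).response ?? '';
  }
}
```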
env.AI.run()

```typescript
const response = await env.AI.run(model, inputs, options?);
```
| Parameter | Type | Description |
|---|---|---|
| `model` | string | Model ID (e.g., `@cf/meta/llama-3.1-8b-instruct`) |
| `inputs` | object | Model-specific inputs (see the category table below) |
| `options.gateway.id` | string | AI Gateway ID for caching/logging |
| `options.gateway.skipCache` | boolean | Skip the AI Gateway cache |

Returns: `Promise<ModelOutput>` (non-streaming) or `ReadableStream` (streaming)
| Category | Key Inputs | Output |
|---|---|---|
| Text Generation | `messages[]`, `stream`, `max_tokens`, `temperature` | `{ response: string }` |
| Embeddings | `text: string \| string[]` | `{ data: number[][], shape: number[] }` |
| Image Generation | `prompt`, `num_steps`, `guidance` | Binary PNG |
| Vision | `messages[].content[].image_url` | `{ response: string }` |
📚 Full model details: Load references/models-catalog.md for the complete model list, parameters, and rate limits.
| Model | Best For | Rate Limit | Size |
|---|---|---|---|
| `@cf/meta/llama-3.1-8b-instruct` | General purpose, fast | 300/min | 8B |
| `@cf/meta/llama-3.2-1b-instruct` | Ultra-fast, simple tasks | 300/min | 1B |
| `@cf/qwen/qwen1.5-14b-chat-awq` | High quality, complex reasoning | 150/min | 14B |
| `@cf/deepseek-ai/deepseek-r1-distill-qwen-32b` | Coding, technical content | 300/min | 32B |
| `@hf/thebloke/mistral-7b-instruct-v0.1-awq` | Fast, efficient | 400/min | 7B |
| Model | Dimensions | Best For | Rate Limit |
|---|---|---|---|
| `@cf/baai/bge-base-en-v1.5` | 768 | General purpose RAG | 3000/min |
| `@cf/baai/bge-large-en-v1.5` | 1024 | High accuracy search | 1500/min |
| `@cf/baai/bge-small-en-v1.5` | 384 | Fast, low storage | 3000/min |
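A minimal sketch using the `{ data, shape }` output documented above; with bge-base-en-v1.5 the vectors should be 768-dimensional, so two input strings yield a shape of [2, 768]:

```typescript
// Hedged sketch: output shape assumed from the embeddings row in the
// category table above.
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: ['What is Cloudflare?', 'What is a Worker?'],
});
console.log(embeddings.shape); // expected: [2, 768]
const firstVector = embeddings.data[0]; // number[] of length 768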
| Model | Best For | Rate Limit | Speed |
|---|---|---|---|
| `@cf/black-forest-labs/flux-1-schnell` | High quality, photorealistic | 720/min | Fast |
| `@cf/stabilityai/stable-diffusion-xl-base-1.0` | General purpose | 720/min | Medium |
| `@cf/lykon/dreamshaper-8-lcm` | Artistic, stylized | 720/min | Fast |
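A minimal sketch returning a generated image directly from the Worker. It uses stable-diffusion-xl because the output table above lists binary PNG; some models (e.g., flux-1-schnell) may instead return base64 JSON, so verify per model:

```typescript
// Hedged sketch: assumes the model returns binary PNG bytes as documented
// in the category table above. num_steps is optional.
const image = await env.AI.run('@cf/stabilityai/stable-diffusion-xl-base-1.0', {
  prompt: 'A lighthouse at dusk, photorealistic',
  num_steps: 20,
});
return new Response(image, { headers: { 'content-type': 'image/png' } });
```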
| Model | Best For | Rate Limit |
|---|---|---|
| `@cf/meta/llama-3.2-11b-vision-instruct` | Image understanding | 720/min |
| `@cf/unum/uform-gen2-qwen-500m` | Fast image captioning | 720/min |
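A hedged sketch of an image-understanding call, following the `messages[].content[].image_url` input shape listed in the category table; verify the exact field names against the model's current schema:

```typescript
// Hedged sketch: content-array shape assumed from the Vision row above;
// the image URL is a placeholder.
const result = await env.AI.run('@cf/meta/llama-3.2-11b-vision-instruct', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Describe this image.' },
        { type: 'image_url', image_url: { url: 'https://example.com/photo.png' } },
      ],
    },
  ],
});
return Response.json(result); // { response: string }
```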
```typescript
import { Hono } from 'hono';

const app = new Hono<{ Bindings: Env }>();

app.post('/chat', async (c) => {
  const { messages } = await c.req.json<{ messages: Array<{ role: string; content: string }> }>();
  const stream = await c.env.AI.run('@cf/meta/llama-3.1-8b-instruct', { messages, stream: true });
  return new Response(stream, { headers: { 'content-type': 'text/event-stream' } });
});

export default app;
```
```typescript
// Assumes a Vectorize binding (VECTORIZE) alongside the AI binding.

// 1. Generate an embedding for the query
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: [userQuery] });

// 2. Search Vectorize
const matches = await env.VECTORIZE.query(embeddings.data[0], { topK: 3 });

// 3. Build context
const context = matches.matches.map((m) => m.metadata.text).join('\n\n');

// 4. Generate with context
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [
    { role: 'system', content: `Answer using this context:\n${context}` },
    { role: 'user', content: userQuery },
  ],
  stream: true,
});

return new Response(stream, { headers: { 'content-type': 'text/event-stream' } });
```
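The ingestion side of this pattern, sketched under the same assumptions (a VECTORIZE binding whose index dimensions match the embedding model, 768 for bge-base-en-v1.5):

```typescript
// Hedged sketch: embed documents and upsert them so the query above can
// retrieve them via m.metadata.text. Document strings are illustrative.
const docs = [
  'Cloudflare Workers run on V8 isolates at the edge.',
  'Workers AI runs models on Cloudflare GPUs.',
];
const { data } = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: docs });
await env.VECTORIZE.upsert(
  docs.map((text, i) => ({
    id: `doc-${i}`,
    values: data[i],
    metadata: { text },
  }))
);
```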
📚 More patterns: Load references/best-practices.md for structured output, image generation, multi-model consensus, and production patterns.
Enable caching, logging, and cost tracking with AI Gateway:

```typescript
const response = await env.AI.run(
  '@cf/meta/llama-3.1-8b-instruct',
  { prompt: 'Hello' },
  { gateway: { id: 'my-gateway', skipCache: false } }
);
```

Benefits: cost tracking, response caching (50-90% savings on repeated queries), request logging, rate limiting, analytics.
Information last verified: 2025-01-14
Rate limits and pricing vary significantly by model. Always check the official documentation for the most current information:
Free Tier: 10,000 neurons/day
Paid Tier: $0.011 per 1,000 neurons
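As a rough worked example: 1,000 requests that each consume about 100 neurons total 100,000 neurons; the first 10,000 are covered by the free allowance, and the remaining 90,000 cost about 90 × $0.011 ≈ $0.99. (The 100-neuron figure is illustrative; actual neuron consumption varies by model and input size.)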
📚 Per-model details: See references/models-catalog.md for specific rate limits and pricing for each model.
📚 Essential before deploying: Load references/best-practices.md for the complete production checklist, error handling patterns, monitoring, and cost optimization.
Workers AI supports OpenAI SDK compatibility and the Vercel AI SDK:

```typescript
// OpenAI SDK - use the same patterns with Workers AI models via the
// OpenAI-compatible endpoint
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: env.CLOUDFLARE_API_KEY,
  baseURL: `https://api.cloudflare.com/client/v4/accounts/${env.CLOUDFLARE_ACCOUNT_ID}/ai/v1`,
});
```

```typescript
// Vercel AI SDK - native integration
import { createWorkersAI } from 'workers-ai-provider';

const workersai = createWorkersAI({ binding: env.AI });
```
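A hedged usage sketch for the Vercel AI SDK provider, assuming the `ai` package's `streamText` API; see references/integrations.md for verified, up-to-date examples:

```typescript
import { streamText } from 'ai';

// Hedged sketch: stream a completion through the workersai provider
// created above and return it as a plain text stream.
const result = streamText({
  model: workersai('@cf/meta/llama-3.1-8b-instruct'),
  prompt: 'What is Cloudflare?',
});
return result.toTextStreamResponse();
```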
📚 Full integration guide: Load references/integrations.md for OpenAI SDK, Vercel AI SDK, and REST API examples.
| Feature | Limit |
|---|---|
| Concurrent requests | No hard limit (rate limits apply) |
| Max input tokens | Varies by model (typically 2K-128K) |
| Max output tokens | Varies by model (typically 512-2048) |
| Streaming chunk size | ~1 KB |
| Image size (output) | ~5 MB |
| Request timeout | Workers timeout applies (30s default, 5m max CPU) |
| Daily free neurons | 10,000 |
| Rate limits | See "Rate Limits & Pricing" section |
| Reference File | Load When... |
|---|---|
| references/models-catalog.md | Choosing a model, checking rate limits, comparing model capabilities |
| references/best-practices.md | Production deployment, error handling, cost optimization, security |
| references/integrations.md | Using the OpenAI SDK, Vercel AI SDK, or REST API instead of the native binding |