This skill should be used when the user asks about "Workers AI", "AI models", "text generation", "embeddings", "semantic search", "RAG", "Retrieval Augmented Generation", "AI inference", "LLaMA", "Llama", "bge embeddings", "@cf/ models", "AI Gateway", or discusses implementing AI features, choosing AI models, generating embeddings, or building RAG systems on Cloudflare Workers.
/plugin marketplace add involvex/involvex-claude-marketplace
/plugin install cloudflare-expert@involvex-claude-marketplace
This skill inherits all available tools. When active, it can use any tool Claude has access to.
This skill provides comprehensive guidance for using Workers AI, Cloudflare's AI inference platform. It covers available models, inference patterns, embedding generation, RAG (Retrieval Augmented Generation) architectures, AI Gateway integration, and best practices for AI workloads. Use this skill when implementing AI features, selecting models, building RAG systems, or optimizing AI inference on Workers.
Workers AI provides serverless AI inference at the edge, with hosted models for text generation, embeddings, image generation, and speech recognition.
LLaMA 3.1 (Recommended):
@cf/meta/llama-3.1-8b-instruct - Chat and instruction following

Mistral:
@cf/mistral/mistral-7b-instruct-v0.2 - Fast instruction following

Qwen:
@cf/qwen/qwen1.5-14b-chat-awq - Quantized for efficiency

See references/workers-ai-models.md for the complete model catalog with specifications and use cases.
BGE Base (Recommended for English):
@cf/baai/bge-base-en-v1.5 - High-quality English embeddings

BGE Large (Higher Quality):
@cf/baai/bge-large-en-v1.5 - Higher quality, more compute

BGE Small (Faster):
@cf/baai/bge-small-en-v1.5 - Faster, smaller model

Multilingual:
@cf/baai/bge-m3 - Multilingual support

Stable Diffusion:
@cf/stabilityai/stable-diffusion-xl-base-1.0 - Text-to-image
@cf/bytedance/stable-diffusion-xl-lightning - Faster generation

Image Classification:
@cf/microsoft/resnet-50 - Object recognition

Translation:
@cf/meta/m2m100-1.2b - Multilingual translation

Automatic Speech Recognition:
@cf/openai/whisper - Speech-to-text

Basic text generation example:
export default {
async fetch(request, env, ctx) {
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'What is Cloudflare Workers?' }
]
});
return new Response(JSON.stringify(response));
}
};
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [
{ role: 'user', content: 'Write a story about...' }
],
stream: true
});
return new Response(stream, {
headers: { 'Content-Type': 'text/event-stream' }
});
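On the client, the event stream can be read incrementally with the standard Streams API. A minimal sketch; the /generate route is an assumption for where the streaming Worker above is deployed:
const res = await fetch('/generate');
const reader = res.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // Chunks arrive as "data: {...}" server-sent event lines
  console.log(decoder.decode(value, { stream: true }));
}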
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [/* messages */],
max_tokens: 512, // Max tokens to generate
temperature: 0.7, // Creativity (0-1, higher = more random)
top_p: 0.9, // Nucleus sampling
top_k: 40, // Top-k sampling
repetition_penalty: 1.2 // Penalize repetition
});
Parameter guidelines: use a low temperature (around 0.1-0.3) for factual tasks such as extraction, a higher temperature (around 0.7-0.9) for creative generation, and raise max_tokens only as far as the task needs, since it bounds both cost and latency.
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
text: ['Hello world', 'Another sentence']
}) as { data: number[][] };
const vector1 = embeddings.data[0]; // [0.123, -0.456, ...]
const vector2 = embeddings.data[1];
Important TypeScript note: Always add as { data: number[][] } type assertion when using embeddings API.
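To compare vectors without a Vectorize index (for example, in tests or small in-memory searches), cosine similarity is the standard metric. A minimal sketch:
// Cosine similarity between two embedding vectors
// (Vectorize handles this for you; this is only for ad-hoc comparisons)
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
const similarity = cosineSimilarity(vector1, vector2); // 1 = same direction, ~0 = unrelated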
// Batch multiple texts for efficiency
const texts = documents.map(d => d.content);
// Process in batches of 100 (recommended batch size)
const batchSize = 100;
const allEmbeddings = [];
for (let i = 0; i < texts.length; i += batchSize) {
const batch = texts.slice(i, i + batchSize);
const result = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
text: batch
}) as { data: number[][] };
allEmbeddings.push(...result.data);
}
For long documents, split into chunks before embedding:
// Note: on newer LangChain versions the import path is '@langchain/textsplitters'
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 500, // Characters per chunk
chunkOverlap: 50 // Overlap between chunks
});
const chunks = await splitter.splitText(longDocument);
// Generate embedding for each chunk
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
text: chunks
}) as { data: number[][] };
// Store each chunk with its embedding
for (let i = 0; i < chunks.length; i++) {
await env.VECTOR_INDEX.insert([{
id: `${docId}-chunk-${i}`,
values: embeddings.data[i],
metadata: { text: chunks[i], docId, chunkIndex: i }
}]);
}
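If you would rather avoid a dependency, a simple fixed-size splitter with overlap covers many cases. A minimal sketch; unlike the recursive splitter above, it ignores sentence boundaries:
// Dependency-free chunking: fixed-size windows with overlap
function splitText(text, chunkSize = 500, chunkOverlap = 50) {
  const chunks = [];
  for (let start = 0; start < text.length; start += chunkSize - chunkOverlap) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}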
See references/rag-architecture-patterns.md for complete RAG implementation patterns.
async function answerQuestion(question, env) {
// 1. Generate question embedding
const questionEmbedding = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
text: [question]
}) as { data: number[][] };
// 2. Find similar documents
const similar = await env.VECTOR_INDEX.query(questionEmbedding.data[0], {
topK: 3,
returnMetadata: true
});
// 3. Build context from retrieved documents
const context = similar.matches
.map(match => match.metadata.text)
.join('\n\n');
// 4. Generate answer with context
const answer = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [
{
role: 'system',
content: 'Answer the question using only the provided context. If the answer is not in the context, say "I don\'t have enough information."'
},
{
role: 'user',
content: `Context:\n${context}\n\nQuestion: ${question}`
}
]
});
return {
answer: answer.response,
sources: similar.matches.map(m => ({
score: m.score,
text: m.metadata.text
}))
};
}
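A minimal way to expose the helper above from a Worker — a sketch assuming requests are POSTs with a JSON body of the form { question } and the same AI and VECTOR_INDEX bindings:
export default {
  async fetch(request, env) {
    if (request.method !== 'POST') {
      return new Response('Method not allowed', { status: 405 });
    }
    const { question } = await request.json();
    const result = await answerQuestion(question, env);
    return new Response(JSON.stringify(result), {
      headers: { 'Content-Type': 'application/json' }
    });
  }
};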
async function advancedRAG(question, env) {
// 1. Retrieve more candidates (top 10)
const questionEmbedding = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
text: [question]
}) as { data: number[][] };
const candidates = await env.VECTOR_INDEX.query(questionEmbedding.data[0], {
topK: 10
});
// 2. Rerank with LLM for relevance
const reranked = [];
for (const candidate of candidates.matches) {
const relevance = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [{
role: 'user',
content: `Rate the relevance of this passage to the question on a scale of 0-10:\n\nQuestion: ${question}\n\nPassage: ${candidate.metadata.text}\n\nRating (just the number):`
}],
max_tokens: 5
});
const score = parseInt(relevance.response);
if (score >= 7) {
reranked.push({ ...candidate, rerankScore: score });
}
}
// 3. Use top reranked results
reranked.sort((a, b) => b.rerankScore - a.rerankScore);
const topResults = reranked.slice(0, 3);
const context = topResults.map(r => r.metadata.text).join('\n\n');
// 4. Generate answer
const answer = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [{
role: 'system',
content: 'Answer based on the context provided.'
}, {
role: 'user',
content: `Context:\n${context}\n\nQuestion: ${question}`
}]
});
return { answer: answer.response, sources: topResults };
}
See examples/rag-implementation.js for complete RAG examples.
AI Gateway provides caching, rate limiting, and analytics for AI requests.
// wrangler.jsonc
{
"ai": {
"binding": "AI",
"gateway_id": "my-gateway"
}
}
// Requests automatically go through AI Gateway when configured
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [{ role: 'user', content: 'Hello' }]
});
// Gateway handles caching, rate limiting, analytics automatically
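Depending on your runtime version, the gateway can also be selected per request via the options argument to env.AI.run(); verify the exact shape against the current Workers AI binding docs:
const gatewayResponse = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [{ role: 'user', content: 'Hello' }]
}, {
  gateway: { id: 'my-gateway' } // per-request gateway selection; check current docs
});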
Text Generation:
mistral-7b-instruct
llama-3.1-8b-instruct
llama-3.1-8b-instruct (128K context)

Embeddings:
bge-base-en-v1.5
bge-m3
bge-small-en-v1.5
bge-large-en-v1.5

Good prompts:
// Be specific
{ role: 'user', content: 'Summarize this article in 3 bullet points: ...' }
// Provide context
{ role: 'system', content: 'You are an expert programmer.' }
// Use examples (few-shot)
{
role: 'user',
content: 'Example: Input "hello" -> Output "HELLO"\nInput "world" ->'
}
Avoid: vague prompts with no context, and packing several unrelated tasks into a single request.
Cache results: Use KV to cache AI responses
const cacheKey = `ai:${hash(prompt)}`; // hash() is a helper you supply (e.g. SHA-256 of the prompt)
let cached = await env.CACHE.get(cacheKey, 'json');
if (!cached) {
cached = await env.AI.run(model, params);
await env.CACHE.put(cacheKey, JSON.stringify(cached), {
expirationTtl: 3600
});
}
Use AI Gateway: Automatic caching and rate limiting
Batch embeddings: Process multiple texts together
Right-size models: Use smaller models when possible
Optimize prompts: Shorter prompts = lower cost
Streaming: Use streaming for long responses to improve perceived latency
Parallel requests: Use Promise.all() for independent AI calls
const [summary, sentiment] = await Promise.all([
env.AI.run(model, { messages: [summaryPrompt] }),
env.AI.run(model, { messages: [sentimentPrompt] })
]);
Early termination: Use max_tokens to limit output
Async with waitUntil: For non-critical AI tasks
ctx.waitUntil(
generateAnalytics(request, env)
);
Chunk size: 300-500 characters for optimal retrieval
Overlap: 10-20% overlap between chunks to preserve context
Top-K selection: 3-5 documents usually optimal
Reranking: Consider LLM-based reranking for better quality
Metadata: Store source information for citation
Hybrid search: Combine vector search with keyword search for best results
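For hybrid search, one approach is to run both retrievals and merge the candidate lists. A minimal sketch; keywordSearch() is a hypothetical helper (for example, backed by a D1 full-text query), not a Workers API:
async function hybridSearch(question, env) {
  const embedding = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: [question] });
  const vectorHits = await env.VECTOR_INDEX.query(embedding.data[0], {
    topK: 10,
    returnMetadata: true
  });
  const keywordHits = await keywordSearch(question, env); // assumed to return [{ id, text, score }]

  // Merge by id, boosting documents found by both retrieval methods
  const byId = new Map();
  for (const m of vectorHits.matches) {
    byId.set(m.id, { id: m.id, text: m.metadata.text, score: m.score });
  }
  for (const k of keywordHits) {
    const existing = byId.get(k.id);
    if (existing) existing.score += k.score;
    else byId.set(k.id, k);
  }
  return [...byId.values()].sort((a, b) => b.score - a.score).slice(0, 5);
}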
See Cloudflare documentation or use cloudflare-docs-specialist agent for current pricing.
// Maintain conversation history
const history = await env.KV.get(`chat:${sessionId}`, 'json') || [];
history.push({ role: 'user', content: userMessage });
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: history
});
history.push({ role: 'assistant', content: response.response });
await env.KV.put(`chat:${sessionId}`, JSON.stringify(history), {
expirationTtl: 3600
});
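Conversation history grows without bound; capping it before each call keeps requests inside the model's context window. The cap below is an arbitrary example value, not a platform limit:
// Before calling env.AI.run, keep only the most recent turns
const MAX_MESSAGES = 20;
const trimmed = history.length > MAX_MESSAGES ? history.slice(-MAX_MESSAGES) : history;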
// Analyze document with AI
const analysis = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [{
role: 'user',
content: `Analyze this document and extract:\n1. Main topics\n2. Key entities\n3. Sentiment\n\nDocument: ${documentText}`
}]
});
// Generate content with specific format
const blogPost = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [{
role: 'system',
content: 'You are a professional blog writer.'
}, {
role: 'user',
content: `Write a blog post about ${topic}. Format:\n# Title\n## Introduction\n## Main Points\n## Conclusion`
}],
temperature: 0.8 // Higher creativity for content generation
});
// Extract structured data from unstructured text
const extracted = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [{
role: 'user',
content: `Extract the following from this email and return as JSON:\n- Name\n- Email\n- Company\n- Message\n\nEmail: ${emailText}\n\nJSON:`
}],
temperature: 0.1 // Low temperature for factual extraction
});
const data = JSON.parse(extracted.response); // in practice, wrap in try/catch — the model may emit text around the JSON
Issue: "Model not found"
@cf/Issue: "Rate limit exceeded"
Issue: "Embeddings dimension mismatch"
Issue: "Timeout on long generation"
max_tokens, or split into smaller requestsIssue: "Poor RAG results"
For detailed information, consult:
references/workers-ai-models.md - Complete model catalog with specs and use cases
references/rag-architecture-patterns.md - RAG implementation patterns and strategies

Working examples in examples/:
rag-implementation.js - Complete RAG system with Vectorize
text-generation-examples.js - Various text generation patterns

For the latest Workers AI documentation:
Use the cloudflare-docs-specialist agent to search AI documentation and the workers-ai-specialist agent for implementation guidance.