By BBuf
Autonomously optimize LLM serving infrastructure — profile torch traces, benchmark SGLang/vLLM/TensorRT-LLM, simulate capacity and compute, and run RLCR loops that patch code to match or beat competitor performance. Also includes human-like PR review and incident triage for production serving.
Inspect LLM torch profiler traces at forward-pass, layer, and kernel level. Use when you need layer timings, anchor-kernel boundaries, representative kernel flows, or Perfetto time ranges.
Run an autonomous, Humanize-governed SGLang SOTA performance loop for one LLM. First perform the fixed, fair SGLang/vLLM/TensorRT-LLM deployment search and benchmark. Then start one RLCR loop that repeatedly assesses the remaining gap, profiles the current bottleneck, runs layer/kernel pipeline analysis, patches SGLang code, optionally uses ncu-report-skill for kernel evidence, and revalidates until SGLang matches or beats the best observed framework under the same workload and SLA.
Run an autonomous, Humanize-governed vLLM SOTA performance loop for one LLM. First perform the fixed, fair vLLM/SGLang/TensorRT-LLM deployment search and benchmark. Then start one RLCR loop that repeatedly assesses the remaining gap, profiles the current bottleneck, runs layer/kernel pipeline analysis, patches vLLM code, optionally uses ncu-report-skill for kernel evidence, and revalidates until vLLM matches or beats the best observed framework under the same workload and SLA.
Use when creating or revising model PR optimization history documents for SGLang, vLLM, or another serving framework that cite GitHub PRs. Requires manual, per-PR source-diff review and documentation of motivation, key implementation approach, most important code excerpts, reviewed files, and validation implications instead of generated or one-line summaries.
Framework-independent LLM serving benchmark skill for comparing SGLang, vLLM, TensorRT-LLM, or another serving framework. Use when a user wants to find the best deployment command for one model across multiple serving frameworks under the same workload, GPU budget, and latency SLA.
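The selection rule this skill describes (same workload, same GPU budget, best command that still meets the latency SLA) can be sketched in a few lines of Python. This is an illustration only: the `Candidate` fields, the `pick_best` helper, the framework labels, and every number below are hypothetical and are not taken from the skill's implementation or from real benchmark runs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    """One benchmarked deployment (all fields are hypothetical)."""
    framework: str
    command: str
    throughput_tok_s: float  # output tokens per second
    p99_ttft_ms: float       # p99 time to first token, in milliseconds


def pick_best(candidates: list[Candidate], sla_ttft_ms: float) -> Optional[Candidate]:
    """Return the highest-throughput candidate that meets the SLA, or None."""
    eligible = [c for c in candidates if c.p99_ttft_ms <= sla_ttft_ms]
    return max(eligible, key=lambda c: c.throughput_tok_s, default=None)


# Made-up results, for illustration only.
results = [
    Candidate("framework-a", "<launch command A>", 5200.0, 480.0),
    Candidate("framework-b", "<launch command B>", 6100.0, 910.0),  # fastest, but violates the SLA
    Candidate("framework-c", "<launch command C>", 4800.0, 350.0),
]

best = pick_best(results, sla_ttft_ms=500.0)
```

Filtering on the SLA before ranking by throughput keeps a fast but non-compliant configuration from winning the comparison.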
Agent-ready playbooks for LLM serving benchmarks, capacity planning, torch-profiler triage, pipeline analysis, compute simulation, SGLang/vLLM optimization, human code review, production incidents, and model PR intelligence.
This repository is built for AI infrastructure engineers who want agents to do real work, not recite generic prompts.
It gives an agent the operational memory needed to benchmark SGLang, vLLM, and TensorRT-LLM fairly; explain serving capacity from startup logs; split prefill and decode profiler evidence; inspect traces at layer and kernel level; estimate operator FLOPs and MFU; review SGLang patches against real maintainer discussion patterns; run Humanize-governed SGLang and vLLM SOTA loops; triage SGLang production incidents from a replay; and keep model-family optimization history close to the code that actually changed.
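As a rough illustration of kernel-level trace inspection, the sketch below sums GPU kernel time per kernel name from a dictionary in Chrome trace format, the JSON layout that the torch profiler exports. It assumes complete events (`"ph": "X"`) tagged with a `cat` of `kernel` and durations in microseconds; category names can differ across PyTorch versions. The `kernel_time_by_name` helper and the sample events are made up for this example and are not taken from the skills.

```python
from collections import defaultdict


def kernel_time_by_name(trace: dict) -> dict[str, float]:
    """Sum the duration (microseconds) of GPU kernel events, grouped by kernel name."""
    totals: dict[str, float] = defaultdict(float)
    for event in trace.get("traceEvents", []):
        # Complete events carry a duration; assume kernels are tagged cat == "kernel".
        if event.get("ph") == "X" and str(event.get("cat", "")).lower() == "kernel":
            totals[event["name"]] += float(event.get("dur", 0.0))
    # Order by total time, largest first.
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Tiny hand-written trace, for illustration only.
sample_trace = {
    "traceEvents": [
        {"ph": "X", "cat": "kernel", "name": "gemm_kernel", "ts": 0, "dur": 120},
        {"ph": "X", "cat": "kernel", "name": "attention_kernel", "ts": 130, "dur": 80},
        {"ph": "X", "cat": "kernel", "name": "gemm_kernel", "ts": 220, "dur": 100},
        {"ph": "X", "cat": "cpu_op", "name": "aten::mm", "ts": 0, "dur": 15},
    ]
}

totals = kernel_time_by_name(sample_trace)
```

A real trace file could be read with `json.load` and passed to the same function.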
For standalone kernel campaigns and kernel evidence tools, see the sibling project KDA-Pilot.
If this saves you one stale model-support assumption, one misleading profiler trace, or one late-night benchmark loop, a star helps more AI-infra engineers find it.
| Skill | Use it when |
|---|---|
| llm-serving-auto-benchmark | You need a fair, bounded serving benchmark search for SGLang, vLLM, TensorRT-LLM, or another OpenAI-compatible stack. |
| llm-serving-capacity-planner | You need to explain SGLang or vLLM startup memory, KV cache budget, request capacity, or OOM pressure from logs. |
| llm-torch-profiler-analysis | You need a three-table profiler report that keeps extend/prefill and decode evidence separate. |
| llm-pipeline-analysis | You need forward-pass, layer, and kernel-level timing from a torch profiler trace, including anchor boundaries and Perfetto ranges. |
| model-compute-simulation | You need operator shapes, FLOPs, MFU estimates, kernel-to-op mapping, or parallelism what-if analysis for an LLM serving shape. |
| sglang-humanize-review | You need SGLang code-review findings grounded in full human PR review episodes from project start through the latest refresh (June 2026), including inline code context, top-level discussion, review summaries, and multi-round replies. Every review opens with a PR comprehension pass (a change summary plus a Mermaid execution flowchart with the diff's modified steps marked) so the reviewer sees how the PR runs before the findings. |
| sglang-sota-humanize-loop | You want one model-level Humanize RLCR loop that owns gap decisions, profiler triage, required layer-pipeline deep dives, SGLang patches, optional ncu-report-skill evidence, and real-model revalidation after the fixed fair benchmark. |
| vllm-sota-humanize-loop | You want one model-level Humanize RLCR loop that owns gap decisions, profiler triage, required layer-pipeline deep dives, vLLM patches, optional ncu-report-skill evidence, and real-model revalidation after the fixed fair benchmark. |
| sglang-prod-incident-triage | You need to turn queue growth, timeouts, wrong outputs, crashes, or distributed stalls into a replay and next debug step. |
| model-architecture-diagram | You need original public architecture diagrams for popular LLM, VLM, MoE, OCR, and diffusion model families. |
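
For the FLOPs and MFU estimates mentioned in the model-compute-simulation row, a minimal sketch follows. It uses the standard count of 2·M·K·N floating-point operations for an (M×K)·(K×N) matrix multiply and takes MFU as achieved FLOP/s divided by peak FLOP/s. The shapes, the elapsed time, and the peak figure are placeholder assumptions rather than measurements, and the helper names are hypothetical.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) @ (k x n) matmul: one multiply and one add per term."""
    return 2 * m * k * n


def mfu(flops: float, elapsed_s: float, peak_flops_per_s: float) -> float:
    """Model FLOPs utilization: achieved FLOP/s over hardware peak FLOP/s."""
    return (flops / elapsed_s) / peak_flops_per_s


# Placeholder shape: 4096 tokens through a 4096 -> 16384 projection.
flops = matmul_flops(m=4096, k=4096, n=16384)

# Assumed kernel time (1 ms) and assumed hardware peak (1e15 FLOP/s).
utilization = mfu(flops, elapsed_s=1.0e-3, peak_flops_per_s=1.0e15)

print(f"{flops:.3e} FLOPs, MFU = {utilization:.1%}")
```

Swapping in a measured kernel time from a profiler trace and the vendor-quoted peak for the target dtype turns this into a per-operator utilization check.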
npx claudepluginhub bbuf/ai-infra-auto-driven-skills --plugin ai-infra-auto-driven-skills