Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.
/plugin marketplace add zechenzhangAGI/AI-research-SKILLs
/plugin install serving-llms-vllm@zechenzhangAGI/AI-research-SKILLs
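A minimal sketch of the serving workflow this skill targets, assuming a local vLLM install; the model name, port, and parallelism settings are illustrative assumptions, not part of the skill itself:

```python
# Sketch only: launch vLLM's OpenAI-compatible server, then query it with the
# standard OpenAI client. Server launch (run in a shell, not Python):
#
#   vllm serve meta-llama/Llama-3.1-8B-Instruct \
#       --tensor-parallel-size 2 \
#       --gpu-memory-utilization 0.90
#
# (Add --quantization awq/gptq only when pointing at a matching quantized checkpoint.)

from openai import OpenAI

# vLLM's OpenAI-compatible server listens on port 8000 by default; the API key
# is ignored unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```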