From external-gitcode-ascend-skills
Deploys vLLM inference services on Ascend NPU servers with automatic model detection, quantization handling, tensor parallelism configuration, and service health verification. Supports local/remote deployment on bare metal, containers, or Docker images.
Slash command: /external-gitcode-ascend-skills:vllm-ascend-server
Reference files:
- references/environment-variables.md
- references/features.md
- references/graph-mode.md
- references/launch-templates/docker.md
- references/launch-templates/health-check.md
- references/launch-templates/offline-inference.md
- references/launch-templates/online-serving.md
- references/launch-templates/speculative-decoding.md
- references/launch_templates.md
- references/model_configs/deepseek-v3.yaml
- references/model_configs/glm-4.x.yaml
- references/model_configs/qwen2.5-vl.yaml
- references/model_configs/qwen3-235b-a22b.yaml
- references/model_configs/qwen3-30b.yaml
- references/model_configs/qwen3-8b.yaml
- references/model_configs/qwen3-embedding.yaml
- references/model_configs/qwen3-reranker.yaml
- references/model_configs/qwen3-vl.yaml
- references/models.md
- references/parameters.md
This skill deploys vLLM inference services on Ascend NPU servers with automatic model detection, quantization handling, and performance optimization.
Key Features:
- Automatic quantization detection (via quant_model_description.json)
Phase 0: Platform (Local/Remote, Bare metal/Container)
↓
Phase 1: Environment Check (NPU, vLLM, Memory)
↓
Phase 2: Model Discovery (Find models, detect quantization)
↓
Phase 3: Gather Requirements (Port, TP size, mode selection)
↓
Phase 4: Generate Config (Env vars, vLLM command)
↓
Phase 5: Execute (Deploy and start service)
↓
Phase 6: Verify (Health check, test inference)
Detailed workflow: workflow-guide.md
1. Local - Deploy on this machine
2. Remote - Deploy via SSH (→ remote-server-guide skill)
1. Bare metal - Virtual environment on host
2. Existing container - Connect to running container
3. Docker image - Create with npu-docker-launcher
Docker image defaults:
- Model mount: -v <model-path>:/model
- Network: host (default) or bridge with port mapping
# NPU check
npu-smi info
# vLLM check
pip show vllm vllm-ascend
# Memory check
npu-smi info | grep -A 5 "Memory-Usage"
Before deployment, verify selected NPU cards are not occupied:
# Check NPU usage status
npu-smi info
# Check for running processes on specific cards
fuser -v /dev/davinci0 2>/dev/null && echo "Card 0 in use" || echo "Card 0 available"
fuser -v /dev/davinci1 2>/dev/null && echo "Card 1 in use" || echo "Card 1 available"
# Alternative: Check memory usage (high usage = occupied)
npu-smi info -t board | grep -E "NPU|Memory-Usage"
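The per-card checks above can be wrapped in a small helper so that every selected card is tested the same way. This is a minimal sketch: the function name `card_status` is hypothetical, and it assumes the NPU device nodes follow the `/dev/davinciN` naming used in the commands above.

```bash
#!/usr/bin/env bash
# Report whether a device node is absent, held open by a process, or free.
# Assumption: NPU device nodes are named /dev/davinciN, as in the checks above.
card_status() {
  local dev="$1"
  if [ ! -e "$dev" ]; then
    echo "missing"
  elif fuser "$dev" >/dev/null 2>&1; then
    # fuser exits 0 when at least one process has the file open
    echo "in use"
  else
    echo "available"
  fi
}

# Hypothetical usage: check the cards selected for deployment.
# for id in 0 1; do
#   echo "Card $id: $(card_status "/dev/davinci$id")"
# done
```

Without root privileges, `fuser` may not see processes owned by other users, so a card reported as available should still be cross-checked against `npu-smi info`.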
If selected cards are occupied:
## NPU Card Status
Card 0: ❌ In use (Memory: 28GB/32GB, PID: 12345)
Card 1: ✅ Available (Memory: 0GB/32GB)
Card 2: ✅ Available (Memory: 0GB/32GB)
Card 3: ❌ In use (Memory: 30GB/32GB, PID: 67890)
Selected cards [0,1] have conflicts:
- Card 0 is occupied by process 12345
Options:
1. Select different cards
2. Kill occupying process (with user confirmation)
3. Wait and retry
How to proceed? [1/2/3]
Kill process (with confirmation):
# Show what's using the card
ps aux | grep <PID>
# Confirm before killing
read -r -p "Kill process <PID> (<process-name>)? [yes/no] " answer
# Kill only if confirmed
[ "$answer" = "yes" ] && kill -9 <PID>
Detailed NPU check: workflow-guide.md
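Default search paths: /home/weights, /home/weight, /home/data*, /data*
Models are found by searching these paths recursively for config.json; each directory that contains one is treated as a model.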
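The recursive search can be sketched with `find`. The function name `find_models` is hypothetical, and the sketch assumes that any directory containing a `config.json` is a model directory, so nested config files inside a model would show up as separate entries.

```bash
#!/usr/bin/env bash
# Print every directory under the given roots that contains a config.json.
# Roots that do not exist (including unmatched globs) are skipped silently.
find_models() {
  local root
  for root in "$@"; do
    [ -d "$root" ] || continue
    find "$root" -type f -name config.json -exec dirname {} \; 2>/dev/null
  done | sort -u
}

# Hypothetical usage with the default search paths:
# find_models /home/weights /home/weight /home/data* /data*
```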
# Quantized model
[ -f "<model>/quant_model_description.json" ] → --quantization ascend
# Non-quantized model
[ ! -f "<model>/quant_model_description.json" ] → No param
Critical: Never add --quantization ascend for non-quantized models!
See quantization.md for details.
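The detection rule can be expressed as a helper that prints the flag only when the description file is present. This is a sketch; the function name `quant_param` is hypothetical.

```bash
#!/usr/bin/env bash
# Print "--quantization ascend" only if the model directory contains
# quant_model_description.json; print nothing for non-quantized models.
quant_param() {
  local model_dir="$1"
  if [ -f "$model_dir/quant_model_description.json" ]; then
    echo "--quantization ascend"
  fi
}

# Hypothetical usage:
# QUANT_PARAM=$(quant_param /model/Qwen3-8B-mxfp8)
```

Because the function prints nothing for a non-quantized model, the result can be placed unquoted in a command line without adding a stray argument.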
| Parameter | Default | Notes |
|---|---|---|
| Mode | online | online/offline |
| Port | 8000 | Default for vLLM |
| NPU cards | 0 | 0,1 for TP2 |
| TP size | Auto | Based on model |
| Scenario | Mode | Config |
|---|---|---|
| Production | Graph | --no-enforce-eager |
| Development | Eager | --enforce-eager |
| Debugging | Eager | --enforce-eager |
See graph-mode.md for details.
| Model Size | TP | Cards |
|---|---|---|
| ≤14B | 1 | 1 |
| 14B-70B | 2-4 | 2-4 |
| >70B | 4-8 | 4-8 |
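As a rough illustration of the table above, the following sketch maps a parameter count to a starting TP size. The function name `suggest_tp` is hypothetical, it accepts whole billions only, and it picks the smallest value of each range, which is an assumption; the right value also depends on available NPU memory and on whether the model is quantized.

```bash
#!/usr/bin/env bash
# Suggest a starting tensor-parallel size from the model size in billions
# of parameters (integer). Returns the low end of each range in the table.
suggest_tp() {
  local size_b="$1"
  if [ "$size_b" -le 14 ]; then
    echo 1
  elif [ "$size_b" -le 70 ]; then
    echo 2
  else
    echo 4
  fi
}

# Hypothetical usage:
# TP=$(suggest_tp 30)
```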
Single card:
export TASK_QUEUE_ENABLE=1
export VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE=1
export ASCEND_RT_VISIBLE_DEVICES=0
Multi-card:
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
export ASCEND_RT_VISIBLE_DEVICES=0,1
export HCCL_BUFFSIZE=1024
export HCCL_CONNECT_TIMEOUT=600
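The two variable sets above differ only by card count, so they can be generated from the card list. This is a sketch with a hypothetical function name, `ascend_env`; it emits exactly the variables listed above and nothing else.

```bash
#!/usr/bin/env bash
# Print the export lines for a comma-separated card list such as "0" or "0,1".
ascend_env() {
  local cards="$1"
  echo "export TASK_QUEUE_ENABLE=1"
  echo "export ASCEND_RT_VISIBLE_DEVICES=$cards"
  case "$cards" in
    *,*)
      # More than one card: multi-card settings
      echo "export VLLM_USE_V1=1"
      echo "export HCCL_BUFFSIZE=1024"
      echo "export HCCL_CONNECT_TIMEOUT=600"
      ;;
    *)
      # Single card
      echo "export VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE=1"
      ;;
  esac
}

# Hypothetical usage: review the output first, then apply it.
# ascend_env 0,1
# eval "$(ascend_env 0,1)"
```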
vllm serve /model/<model-name> \
--host 0.0.0.0 \
--port 8000 \
--trust-remote-code \
--max-num-seqs 256 \
--max-model-len 32768 \
--tensor-parallel-size <tp> \
[QUANT_PARAM] \
--gpu-memory-utilization 0.9 \
--async-scheduling \
--additional-config '{"enable_cpu_binding":true}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}'
QUANT_PARAM:
- Quantized model: --quantization ascend
- Non-quantized model: omit the parameter
Display the generated config, confirm with the user, then execute.
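One way to keep the display-then-confirm step honest is to assemble the command as a bash array and print it before anything runs. The sketch below uses a hypothetical function name, `build_vllm_cmd`, and includes only a subset of the flags from the template above; the quantization flag is appended only when the description file exists.

```bash
#!/usr/bin/env bash
# Assemble a vllm serve command and print it for review; nothing is executed.
build_vllm_cmd() {
  local model_dir="$1" port="$2" tp="$3"
  local cmd=(vllm serve "$model_dir"
             --host 0.0.0.0 --port "$port"
             --trust-remote-code
             --tensor-parallel-size "$tp"
             --gpu-memory-utilization 0.9)
  # Add the quantization flag only for quantized models
  if [ -f "$model_dir/quant_model_description.json" ]; then
    cmd+=(--quantization ascend)
  fi
  # %q shell-quotes each argument so the printed line can be pasted safely
  printf '%q ' "${cmd[@]}"
  echo
}

# Hypothetical usage:
# build_vllm_cmd /model/Qwen3-8B 8000 1
```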
Persistent session (tmux): If you connected via tmux and are already inside the target environment (remote host / container / both), execute commands directly — same as bare metal.
Stateless (SSH key / sshpass / paramiko / fabric):
| Platform | Method |
|---|---|
| Bare metal | Execute directly in shell |
| Existing container | docker exec to run command |
| Remote | SSH → run command |
| Remote container | SSH → docker exec -d for background |
For containers, start vLLM in background with logging:
# Inside container (bare metal / tmux already in container)
nohup vllm serve /model ... > /tmp/vllm.log 2>&1 &
# From host (stateless, background in container)
docker exec -d <container> bash -c 'cd /workspace && vllm serve /model ... 2>&1 | tee /tmp/vllm.log'
# Remote via SSH (stateless)
ssh user@host "docker exec -d <container> bash -c 'vllm serve /model ... 2>&1 | tee /tmp/vllm.log'"
Inside container (bare metal / tmux already in container):
ps aux | grep vllm
tail -50 /tmp/vllm.log
tail -f /tmp/vllm.log
From host or remote (stateless, each command needs docker exec):
docker exec <container> ps aux | grep vllm
docker exec <container> tail -50 /tmp/vllm.log
docker exec <container> tail -f /tmp/vllm.log
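Instead of watching the log by hand, a script can poll it for a readiness message. This is a sketch: the function name `wait_for_log` is hypothetical, and the marker string in the usage line ("Application startup complete") is an assumption about what the server prints when it is ready, so check it against an actual log first.

```bash
#!/usr/bin/env bash
# Poll a log file once per second until a fixed string appears or the
# timeout (in seconds) expires. Returns 0 on match, 1 on timeout.
wait_for_log() {
  local log="$1" pattern="$2" timeout="${3:-300}" waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if [ -f "$log" ] && grep -qF -- "$pattern" "$log"; then
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 1
}

# Hypothetical usage (marker string is an assumption):
# wait_for_log /tmp/vllm.log "Application startup complete" 600 || tail -50 /tmp/vllm.log
```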
# Health check
curl http://localhost:8000/health
# Test inference
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "<name>", "messages": [{"role": "user", "content": "Hello"}]}'
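Large models can take several minutes to load, so a single health request right after launch often fails. The sketch below retries the health endpoint until it answers; the function name `wait_for_health` and the default retry counts are assumptions, not values from the skill.

```bash
#!/usr/bin/env bash
# Retry a health URL until it returns a successful HTTP status.
# Arguments: URL, number of attempts (default 60), delay in seconds (default 5).
wait_for_health() {
  local url="$1" retries="${2:-60}" delay="${3:-5}" i
  for ((i = 0; i < retries; i++)); do
    # -f makes curl fail on HTTP errors; -s silences progress output
    if curl -sf -o /dev/null --max-time 5 "$url"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Hypothetical usage:
# wait_for_health http://localhost:8000/health && echo "Service is up"
```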
Quantized model (single card):
export TASK_QUEUE_ENABLE=1
export VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE=1
export ASCEND_RT_VISIBLE_DEVICES=0
vllm serve /model/Qwen3-8B-mxfp8 \
--port 8000 \
--trust-remote-code \
--quantization ascend \
--gpu-memory-utilization 0.9 \
--async-scheduling
Non-quantized model (single card):
vllm serve /model/Qwen3-8B \
--port 8000 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--async-scheduling
# NO --quantization param!
Quantized model (two cards, TP2):
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
export ASCEND_RT_VISIBLE_DEVICES=0,1
export HCCL_BUFFSIZE=1024
vllm serve /model/Qwen3-30B-mxfp8 \
--port 8000 \
--tensor-parallel-size 2 \
--quantization ascend \
--gpu-memory-utilization 0.9
Docker image (bridge network):
docker run -it -d \
--name vllm-server \
--network bridge \
-p 8000:8000 \
-v /home/weights/Qwen3-8B:/model \
-e ASCEND_RT_VISIBLE_DEVICES=0 \
vllm-ascend:latest
# Inside container (add --quantization ascend only if /model/quant_model_description.json exists)
vllm serve /model ...
| Detection | Action |
|---|---|
| quant_model_description.json exists | Add --quantization ascend |
| File not found | No quantization param |
| Use Case | Mode |
|---|---|
| Production | Graph (AclGraph) |
| First deployment | Eager → test → Graph |
| Errors in graph | Fall back to Eager |
| Debugging | Eager |
| Network | Port Config |
|---|---|
| host | No mapping needed |
| bridge | Ask user for host port |
| Error | Solution |
|---|---|
| OOM | Reduce --max-num-seqs or --max-model-len |
| Graph capture failed | Use --enforce-eager |
| Quantization error | Check if model is actually quantized |
| Port in use | Change port or kill process |
| HCCL timeout | Increase HCCL_CONNECT_TIMEOUT |
| Connection reset/timeout | Network issue, retry SSH connection |
| Container exits immediately | Check docker logs, verify mounts exist |
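The "Port in use" case can be caught before launch. This sketch relies on bash's built-in `/dev/tcp` redirection (a bash feature, not available in plain `sh`) and only detects listeners reachable on the loopback address; the function name `port_in_use` is hypothetical.

```bash
#!/usr/bin/env bash
# Return 0 if something accepts TCP connections on 127.0.0.1:<port>,
# non-zero otherwise. The connection is opened in a subshell and closed
# as soon as the subshell exits.
port_in_use() {
  local port="$1"
  (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null
}

# Hypothetical usage:
# if port_in_use 8000; then echo "Port 8000 is taken, choose another"; fi
```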