serving-llms-vllm
Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.
High-throughput LLM serving with vLLM for efficient production deployment and inference.
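As an illustrative sketch (not one of the skill's files), the snippet below shows the pattern this skill targets: launch vLLM's OpenAI-compatible server, then query it with the standard `openai` Python client. The model name, port, and server flags are assumptions chosen for the example; PagedAttention and continuous batching are handled internally by the vLLM engine.

```python
# Assumed setup (example only): start vLLM's OpenAI-compatible server first, e.g.
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2
# For GPTQ/AWQ/FP8, pass --quantization with a checkpoint quantized in that format.

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM endpoint (8000 is vLLM's default port).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the model the server was started with
    messages=[{"role": "user", "content": "Summarize what PagedAttention does in one sentence."}],
    max_tokens=128,
    temperature=0.2,
)
print(response.choices[0].message.content)
```

Because vLLM exposes the same /v1 chat-completions schema as OpenAI, existing client code typically only needs the `base_url` changed.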
Install this skill with one command: /learn @davila7/inference-serving-vllm
Category: development
Updated: March 16, 2026
davila7/claude-code-templates