vllm — vLLM inference server

High-throughput, low-latency LLM serving with PagedAttention and continuous batching. Exposes an OpenAI-compatible API.

Image tag

docker.io/manvarharsh/vllm:cuda12

What’s in this image

  • Base: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
  • Python 3.10 (conda)
  • PyTorch with CUDA 12
  • vLLM (latest stable)
  • Hugging Face Hub CLI for model downloads
  • OpenSSH server

Default ports

Port  Service
22    SSH
8000  vLLM OpenAI-compatible API

Use cases

  • High-throughput LLM serving for production traffic
  • OpenAI-API drop-in replacement (/v1/chat/completions, /v1/completions, /v1/embeddings)
  • Tensor-parallel serving across multiple GPUs
  • LoRA-adapter swapping at request time

Environment variables

Variable         Description
ENABLE_SSH       Enable the SSH server
ENABLE_VLLM      Start the vLLM server on port 8000
VLLM_MODEL       Hugging Face model ID to load (e.g. meta-llama/Llama-3.1-8B-Instruct)
VLLM_EXTRA_ARGS  Extra vLLM CLI args (e.g. --tensor-parallel-size 2, --max-model-len 8192, --quantization awq)
HF_TOKEN         Hugging Face token for gated models
SSH_PUBLIC_KEY   Public key for SSH access
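
The variables above would typically be passed with -e flags when the container is started. The following is a minimal sketch in Python that assembles such a docker run command. The "1" value for ENABLE_VLLM, the --gpus all flag, the choice to publish only port 8000, and the helper name build_docker_run are assumptions for illustration, not documented behaviour of this image.

```python
import shlex

IMAGE = "docker.io/manvarharsh/vllm:cuda12"  # image tag from this page


def build_docker_run(env, image=IMAGE, ports=(8000,)):
    """Assemble a `docker run` argv list (hypothetical helper, not part of the image)."""
    # --gpus all assumes the NVIDIA container toolkit is installed on the host
    cmd = ["docker", "run", "--rm", "--gpus", "all"]
    for port in ports:
        cmd += ["-p", f"{port}:{port}"]  # publish each port one-to-one
    for name, value in env.items():
        cmd += ["-e", f"{name}={value}"]  # one -e flag per environment variable
    cmd.append(image)
    return cmd


example_env = {
    "ENABLE_VLLM": "1",  # assumption: the accepted value format is not documented here
    "VLLM_MODEL": "meta-llama/Llama-3.1-8B-Instruct",
    "VLLM_EXTRA_ARGS": "--max-model-len 8192",
}

# shlex.join quotes values that contain spaces so the line can be pasted into a shell
print(shlex.join(build_docker_run(example_env)))
```

Building the command as a list keeps multi-word values such as VLLM_EXTRA_ARGS intact as a single argument; the same list can be passed directly to subprocess.run.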

Quick test

curl http://<pod-url>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"<model>","messages":[{"role":"user","content":"Hello"}]}'
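
The same request can be sent from Python using only the standard library. This is a sketch: the base URL and model name are placeholders you must supply, build_chat_request and chat are illustrative helper names, and the response parsing assumes the usual OpenAI-style choices[0].message.content shape.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build the POST request for /v1/chat/completions (illustrative helper)."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, prompt, timeout=60):
    """Send the request and return the text of the first reply."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (needs a running server; placeholders as in the curl example):
# print(chat("http://<pod-url>:8000", "<model>", "Hello"))
```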

Persistence

Mount a persistent volume at /data. Store downloaded Hugging Face weights under /data/models/ and set HF_HOME=/data/hf-cache so that the Hugging Face cache also lives on the volume.
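
A sketch of the layout described above, assuming a volume is already mounted at /data. The per-model directory naming (/data/models/<org>/<name>) and the helper names are assumptions, and the download step relies on snapshot_download from the huggingface_hub package, which is assumed to be importable in the image's Python environment.

```python
import os
from pathlib import Path

DATA_ROOT = Path("/data")  # persistent mount point from this page


def persistent_env(data_root=DATA_ROOT):
    """Environment settings that keep the Hugging Face cache on the volume."""
    return {"HF_HOME": str(data_root / "hf-cache")}


def model_dir(model_id, data_root=DATA_ROOT):
    """Target directory for a model's weights (the naming scheme is an assumption)."""
    return data_root / "models" / model_id


def download_model(model_id, data_root=DATA_ROOT):
    """Download a model's files into the persistent models directory."""
    # HF_HOME is set before the import so the library picks it up when it loads
    os.environ.update(persistent_env(data_root))
    from huggingface_hub import snapshot_download  # third-party, imported lazily

    target = model_dir(model_id, data_root)
    target.mkdir(parents=True, exist_ok=True)
    # HF_TOKEN is only needed for gated models; None is passed when it is unset
    return snapshot_download(
        repo_id=model_id,
        local_dir=str(target),
        token=os.environ.get("HF_TOKEN"),
    )


# Usage (downloads from the Hub, so it needs network access):
# download_model("meta-llama/Llama-3.1-8B-Instruct")
```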

See also