vllm — vLLM inference server

High-throughput, low-latency LLM serving with PagedAttention and continuous batching. Exposes an OpenAI-compatible API.

Image tag

docker.io/manvarharsh/vllm:cuda12

What’s in this image

Base: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
Python 3.10 (conda)
PyTorch with CUDA 12
vLLM (latest stable)
Hugging Face Hub CLI for model downloads
OpenSSH server

Default ports

Port	Service
22	SSH
8000	vLLM OpenAI-compatible API

Use cases

High-throughput LLM serving for production traffic
OpenAI-API drop-in replacement (/v1/chat/completions, /v1/completions, /v1/embeddings)
Tensor-parallel serving across multiple GPUs
LoRA-adapter swapping at request time

Environment variables

Variable	Description
`ENABLE_SSH`	Enable SSH server
`ENABLE_VLLM`	Start the vLLM server on port 8000
`VLLM_MODEL`	Hugging Face model ID to load (e.g. `meta-llama/Llama-3.1-8B-Instruct`)
`VLLM_EXTRA_ARGS`	Extra CLI arguments for the vLLM server (e.g. `--tensor-parallel-size 2`, `--max-model-len 8192`, `--quantization awq`)
`HF_TOKEN`	Hugging Face token for gated models
`SSH_PUBLIC_KEY`	Public key for SSH
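
As a rough illustration of how the variables in the table fit together, the sketch below assembles them into a mapping and prints it in `KEY=VALUE` form. The helper name `build_pod_env` is hypothetical, and the value `"true"` for the `ENABLE_*` switches is an assumption: this page does not say which values the image expects.

```python
import shlex


def build_pod_env(model, extra_args=(), hf_token=None, ssh_public_key=None):
    """Assemble the environment variables from the table into a dict."""
    # ASSUMPTION: the ENABLE_* switches accept the string "true"; the
    # accepted values are not documented on this page.
    env = {"ENABLE_VLLM": "true", "VLLM_MODEL": model}
    if extra_args:
        # Join the individual CLI tokens into one shell-safe string.
        env["VLLM_EXTRA_ARGS"] = shlex.join(extra_args)
    if hf_token:
        # Only needed for gated models.
        env["HF_TOKEN"] = hf_token
    if ssh_public_key:
        env["ENABLE_SSH"] = "true"
        env["SSH_PUBLIC_KEY"] = ssh_public_key
    return env


env = build_pod_env(
    "meta-llama/Llama-3.1-8B-Instruct",
    extra_args=["--tensor-parallel-size", "2", "--max-model-len", "8192"],
)
for key, value in env.items():
    print(f"{key}={value}")
```

Keeping the extra arguments as a list until the last step avoids quoting mistakes when a value contains spaces.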

Quick test

curl http://<pod-url>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"<model>","messages":[{"role":"user","content":"Hello"}]}'
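
The same request can be made from Python. The following is a minimal sketch using only the standard library; the function names `build_chat_request` and `chat` are illustrative, and the response parsing assumes the server returns the standard OpenAI chat-completion shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, prompt, timeout=60):
    """Send the request and return the text of the first choice."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

For example, `chat("http://<pod-url>:8000", "<model>", "Hello")` sends the same payload as the curl command, with `<pod-url>` and `<model>` filled in for your pod.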

Persistence

Mount a persistent volume at /data. Place downloaded Hugging Face weights under /data/models/ and set HF_HOME=/data/hf-cache so the Hugging Face cache survives pod restarts.
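
One possible way to follow this layout from Python is sketched below. The helper names are hypothetical, the one-directory-per-model-ID layout under /data/models/ is an assumption, and the download step assumes the `huggingface_hub` Python package is importable in the image (the page lists only the Hub CLI).

```python
import os
from pathlib import PurePosixPath

DATA_ROOT = PurePosixPath("/data")  # persistent mount point from this page


def persistent_paths(model_id):
    """Return the cache and weights locations for a model ID."""
    # ASSUMPTION: one sub-directory per model ID under /data/models/.
    return {
        "HF_HOME": str(DATA_ROOT / "hf-cache"),
        "model_dir": str(DATA_ROOT / "models" / model_id),
    }


def download_weights(model_id, token=None):
    """Download a model snapshot into the persistent volume."""
    paths = persistent_paths(model_id)
    # Set HF_HOME before importing huggingface_hub, which reads it at
    # import time; this has no effect if the package was imported earlier.
    os.environ["HF_HOME"] = paths["HF_HOME"]
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=model_id,
        local_dir=paths["model_dir"],
        token=token,  # needed only for gated models
    )
```

Calling `download_weights("meta-llama/Llama-3.1-8B-Instruct", token=os.environ.get("HF_TOKEN"))` would place the files under /data/models/meta-llama/Llama-3.1-8B-Instruct.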