sglang — SGLang structured-generation server

Name: Podstack GPU Cloud
Brand: Podstack
SKU: PODSTACK-GPU-CLOUD
Availability: InStock
Rating: 4.9 (180 reviews)

LMSYS’s SGLang — an LLM serving runtime optimized for structured output, multi-turn agent workloads, and complex prompt programs. OpenAI-compatible API.

Image tag

docker.io/manvarharsh/sglang:cuda12

What’s in this image

Base: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
Python 3.10 (conda)
PyTorch with CUDA 12
SGLang runtime + frontend
Flash Attention, FlashInfer
OpenSSH server

Default ports

Port	Service
22	SSH
8000	SGLang OpenAI-compatible API

Use cases

JSON-schema-constrained generation (function calling, structured output)
Multi-turn agent tool-use with low overhead
Speculative decoding via SGLang’s runtime
Faster constrained generation than vanilla vLLM in many cases

Environment variables

Variable	Description
`ENABLE_SSH`	Enable SSH server
`ENABLE_SGLANG`	Start the SGLang server
`SGLANG_MODEL`	Hugging Face model ID to load
`SGLANG_EXTRA_ARGS`	Extra CLI args (`--tp 2`, `--mem-fraction 0.85`)
`HF_TOKEN`	Hugging Face token for gated models
`SSH_PUBLIC_KEY`	Public key for SSH

Persistence

Mount at /data. Set HF_HOME=/data/hf-cache to keep model weights persistent across pod restarts.