Serverless Inference

Serverless Inference (internally Podvirt) is Podstack’s pay-per-use GPU inference surface. Models cold-start on demand, run for the duration of your requests, and sleep when idle, so you pay only for the usage you actually generate (billed per token; see Billing).

Find it under Inference > Serverless in the sidebar. The feature is gated by REACT_APP_ENABLE_PODVIRT_INFERENCE.

Model Catalog

The catalog (/inference/serverless) lists every serverless-enabled model with:

Vendor and display name
Type — chat, code, embedding, or video-generation
Token usage bar (used vs. plan limit, color-coded green → yellow → red)
API snippet to call the model from your code

Filter the catalog by category (chat / code / embedding / video-generation) or search by name, vendor, or type.

Chat Playground

For chat- and code-type models, click Chat to open a streaming playground (/inference/serverless/:id/chat):

Token-streamed responses
Per-conversation chat history (sidebar lists prior sessions)
Configurable system prompt
JSON tab showing the exact request/response payload sent to the API — useful for porting prompts into your own code
Stop button cancels in-flight generation

Chain-of-Thought Display

Models that emit <think>...</think> reasoning blocks have those segments rendered separately from the user-visible answer. The clean answer is what’s saved to chat history; the think block is shown collapsed for inspection.
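If you call a reasoning model from your own code, you may want the same separation the playground performs. The following is a minimal sketch, assuming the full response text is already available and that every `<think>` block is properly closed; `split_reasoning` is a hypothetical helper, not part of any Podstack SDK.

```python
import re

# Matches one <think>...</think> block, including newlines inside it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a completed model response."""
    # Collect the contents of every think block, in order.
    reasoning = "\n".join(m.strip() for m in THINK_RE.findall(text))
    # The answer is whatever remains once the blocks are removed.
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer


raw = "<think>User greets me; reply briefly.</think>\nHello! How can I help?"
reasoning, answer = split_reasoning(raw)
print(answer)
```

Storing only `answer` in your conversation history mirrors the behaviour described above. A streamed response would need extra handling for a think block that has been opened but not yet closed, which this sketch does not cover.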

Session Affinity

Chat sessions are pinned to a backing GPU instance via the X-Podvirt-Session-Id header. This keeps the KV cache warm across messages in the same conversation, reducing per-message latency.
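The sketch below shows one way a client could reuse a single session ID for every message in a conversation. It rests on two assumptions not confirmed by this page: that the public API accepts a client-supplied `X-Podvirt-Session-Id` value, and that any opaque string is a valid ID. `new_session_id` and `build_headers` are hypothetical helpers.

```python
import uuid


def new_session_id() -> str:
    """Generate an opaque per-conversation ID (the format is an assumption)."""
    return uuid.uuid4().hex


def build_headers(api_key: str, session_id: str) -> dict[str, str]:
    """Headers for one chat request; pass the same session_id for each turn."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "X-Podvirt-Session-Id": session_id,
    }


# One ID per conversation, reused across messages.
session_id = new_session_id()
first = build_headers("YOUR_API_KEY", session_id)
second = build_headers("YOUR_API_KEY", session_id)
```

Starting a new conversation would call `new_session_id()` again, so unrelated chats do not share a session.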

Video Generation

Video-generation models have a dedicated Video view (/inference/serverless/:id/video) with a prompt input and embedded player. Each generation goes through the same serverless GPU lifecycle: the first generation cold-starts the model, and subsequent generations reuse the warm instance.

The underlying engine is vllm-omni, a multimodal extension of vLLM that supports video and audio outputs alongside text.

API Access

Every model exposes a chat-completions-style endpoint:

curl -X POST https://inference.podstack.ai/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MODEL_ID",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'

Streaming, system prompts, temperature, max tokens, and stop sequences all follow the OpenAI chat-completions contract.
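Because the endpoint follows the OpenAI chat-completions contract, a streamed response should arrive as server-sent-event lines of the form `data: {json}`, terminated by `data: [DONE]`. The following is a minimal sketch of building a request body and reassembling streamed text; `build_payload` and `iter_deltas` are hypothetical helpers, the default parameter values are arbitrary, and the sample lines are hand-written rather than captured from the service.

```python
import json
from typing import Iterable, Iterator


def build_payload(model: str, user_text: str, system_prompt: str = "",
                  temperature: float = 0.7, max_tokens: int = 256) -> dict:
    """Build a streaming chat-completions request body."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_text})
    return {"model": model, "messages": messages, "stream": True,
            "temperature": temperature, "max_tokens": max_tokens}


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        choices = json.loads(data).get("choices") or []
        if not choices:
            continue
        content = choices[0].get("delta", {}).get("content")
        if content:
            yield content


payload = build_payload("MODEL_ID", "Hello", system_prompt="Be brief.")
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = "".join(iter_deltas(sample))
print(text)
```

With a real HTTP client you would POST `payload` to the endpoint above and pass the decoded response lines to `iter_deltas` in place of `sample`.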

Generate API keys at Inference > API Keys.

GPU Dashboard

The GPU dashboard (/inference/gpu-dashboard) shows fleet-wide health and economics:

Per-model status: Active / Warming Up / Sleeping / Cold / Failed, plus GPU type, memory, total requests, and startup duration
Last 1 hour and last 24 hours usage summaries: request count, token totals, average latency, average time-to-first-token (TTFT), total cost
Per-model breakdown: which models drove your spend
Hourly time series: request rate, total tokens, average latency, tokens-per-second, and cost

Use it to spot cold-start tax, identify which models dominate your bill, and watch for latency regressions after a model swap.

Lifecycle States

State	Meaning
Cold	No instance running — next request cold-starts the model
Warming Up	Model loading onto a GPU (duration reported as `startup_duration_ms`)
Active / Running	Serving requests
Sleeping	Idle, instance held briefly for fast re-warm
Failed / Error	Last startup or request errored — see logs
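Client code that wants to warn users about a likely wait could model the states in the table as an enum. This is a sketch under assumptions: the string values simply copy the labels shown above, and the page does not say how (or whether) the API reports state, so `ModelState` and `expect_startup_delay` are hypothetical names.

```python
from enum import Enum


class ModelState(Enum):
    # Values mirror the dashboard labels; API field values are an assumption.
    COLD = "Cold"
    WARMING_UP = "Warming Up"
    ACTIVE = "Active"
    SLEEPING = "Sleeping"
    FAILED = "Failed"


def expect_startup_delay(state: ModelState) -> bool:
    """True when the next request will likely wait on model loading."""
    # Sleeping instances are held for fast re-warm, so they are not counted.
    # Failed is left to the caller, since the outcome depends on the error.
    return state in (ModelState.COLD, ModelState.WARMING_UP)


print(expect_startup_delay(ModelState("Cold")))
```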

Billing

Serverless inference is billed per token (prompt + completion) at a model-specific rate. GPU seconds spent on cold-start warm-up are not billed separately; you pay only for the tokens processed.
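As a rough illustration of per-token billing, the sketch below estimates the cost of one request. The rate unit (price per million tokens) and the 0.50 figure are assumptions chosen for the arithmetic, not published Podstack prices; `request_cost` is a hypothetical helper.

```python
def request_cost(prompt_tokens: int, completion_tokens: int,
                 rate_per_million: float) -> float:
    """Estimate cost when prompt and completion tokens share one rate."""
    total_tokens = prompt_tokens + completion_tokens
    return total_tokens * rate_per_million / 1_000_000


# 1,200 prompt tokens + 300 completion tokens at a hypothetical 0.50 per million.
cost = request_cost(1200, 300, 0.50)
print(f"{cost:.5f}")
```

Warm-up time does not appear in the formula because, per the paragraph above, cold-start GPU seconds are not billed separately.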

Find rates and usage under the Usage Analytics tab and on monthly invoices.

Use Cases

Low-volume LLM workloads where keeping a dedicated GPU idle is wasteful
Bursty inference traffic that can tolerate cold-start latency
Quick model comparison via the playground before committing to a managed deployment
Video generation without managing a long-running serving cluster

Next Steps

Model Catalog — for always-on managed inference
API Keys — generate keys for programmatic access
Playground — interactive testing for managed inference models