Inference
The Podstack Inference service provides an OpenAI-compatible API for running large language models, embeddings, and audio transcription. Deploy models from the catalog or bring your own fine-tuned models.
Overview
Inference includes:
- Model Catalog — browse and deploy inference-ready models
- Playground — test models interactively in the browser
- API Keys — manage authentication for API access
- Serverless Inference — pay-per-token cold-start GPU inference (chat, code, embedding, video)
- OpenAI-Compatible API — drop-in replacement for OpenAI endpoints
- Streaming and system prompts — first-class support for streamed responses and per-request system prompts
- Chain-of-thought rendering —
<think>blocks separated from user-visible answers - Usage Analytics — request counts, token usage, latency, and cost breakdowns
- GPU Dashboard — fleet-wide health and economics for serverless models
Getting Started
- Browse the Model Catalog to find a model
- Generate an API Key for authentication
- Test in the Playground or call the API directly
- For pay-per-token cold-start workloads, see Serverless Inference
OpenAI-Compatible API
The Inference API is compatible with the OpenAI SDK format, making migration easy:
Chat Completions
import openai
client = openai.OpenAI(
api_key="YOUR_PODSTACK_INFERENCE_KEY",
base_url="https://cloud.podstack.ai/inference/v1"
)
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum computing in simple terms."}
],
temperature=0.7,
max_tokens=500
)
print(response.choices[0].message.content)
Embeddings
response = client.embeddings.create(
model="BAAI/bge-large-en-v1.5",
input="The quick brown fox jumps over the lazy dog."
)
print(response.data[0].embedding[:5]) # First 5 dimensions
Audio Transcription
with open("audio.mp3", "rb") as audio_file:
response = client.audio.transcriptions.create(
model="whisper-large-v3",
file=audio_file
)
print(response.text)
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
/v1/chat/completions | POST | Chat completions (streaming supported) |
/v1/embeddings | POST | Generate embeddings |
/v1/audio/transcriptions | POST | Audio transcription |
/v1/models | GET | List available models |
/v1/models/:id | GET | Get model details |
Pricing
Inference is billed per token:
- Input tokens: Cost per million input tokens (varies by model)
- Output tokens: Cost per million output tokens (varies by model)
- Pricing is shown per model in the catalog
Feature Availability
Inference requires:
- Feature flag enabled
- Sufficient wallet balance
- An active API key
Contact support if Inference isn’t visible in your portal.
Streaming
Set stream: true to receive tokens as they’re generated:
stream = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Hello"}],
stream=True,
)
for chunk in stream:
print(chunk.choices[0].delta.content or "", end="")
The playground also streams responses by default and exposes a Stop button to cancel in-flight generation.
System Prompts and Chat History
The chat playground supports a persistent system prompt per session and stores chat history in a sidebar so you can revisit prior conversations. Use the JSON tab to see the exact request payload — useful when porting a prompt into your own code.
Chain-of-Thought Display
Models that emit <think>...</think> reasoning blocks have those segments rendered separately from the final answer. Only the cleaned answer is saved to chat history; the think block is shown collapsed for inspection.
Usage Analytics
Per-API-key analytics show:
- Request count and token totals (prompt + completion)
- Average latency and time-to-first-token (TTFT)
- Per-model cost breakdown
- Hourly time series
Use these to detect runaway clients, attribute spend across teams, and pick the right model for your latency budget.
Cost Recommendations
The platform suggests cheaper or faster model alternatives based on your recent traffic shape. Recommendations appear on the Cost Recommendations view inside the Inference section.
Next Steps
- Browse the Model Catalog
- Generate API Keys
- Test in the Playground
- Serverless Inference for cold-start pay-per-token workloads