tei — Text Embeddings Inference

Hugging Face’s Text Embeddings Inference (TEI) is a Rust-based server for embedding, re-ranking, and classification models. It is production-grade, with batching, ONNX support, and FlashAttention.

Image tag

docker.io/manvarharsh/tei:cuda12

What’s in this image

  • Base: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
  • Hugging Face Text Embeddings Inference (CUDA build)
  • Hugging Face Hub CLI
  • OpenSSH server

Default ports

Port  Service
22    SSH
8080  TEI HTTP API

Use cases

  • Hosting BGE / E5 / GTE embedding models for RAG pipelines
  • Hosting BAAI / Cohere re-rankers
  • Bulk embedding of document corpora
  • Drop-in replacement for OpenAI’s /v1/embeddings API (with TEI’s OpenAI-compatible mode)
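The re-ranker and OpenAI-compatible use cases can be sketched with two small shell helpers. This is a minimal sketch, not part of the image: the function names and the `TEI_URL` variable are hypothetical, and the default of `http://localhost:8080` is an assumption. `/rerank` only works when a re-ranker model is loaded, and `/v1/embeddings` only when an embedding model is loaded.

```shell
# Base URL of the TEI server (assumption: replace with your pod URL).
TEI_URL="${TEI_URL:-http://localhost:8080}"

# Score candidate passages against a query via TEI's /rerank route.
# Usage: tei_rerank '<query>' '<JSON array of texts>'
# Note: the query is spliced into JSON as-is, so it must not contain
# characters that need JSON escaping (quotes, backslashes, newlines).
tei_rerank() {
  curl -sS "$TEI_URL/rerank" \
    -H 'Content-Type: application/json' \
    -d "{\"query\":\"$1\",\"texts\":$2}"
}

# Request embeddings through the OpenAI-compatible route.
# Usage: tei_openai_embed '<text>'
# "tei" is a placeholder model name; the server answers with whichever
# model it was started with.
tei_openai_embed() {
  curl -sS "$TEI_URL/v1/embeddings" \
    -H 'Content-Type: application/json' \
    -d "{\"input\":\"$1\",\"model\":\"tei\"}"
}
```

For example, `tei_rerank 'What is RAG?' '["RAG combines retrieval with generation.","Foxes are mammals."]'` should return one score per candidate text.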

Environment variables

Variable        Description
ENABLE_SSH      Enable SSH server
ENABLE_TEI      Start the TEI server on port 8080
TEI_MODEL_ID    Hugging Face model ID (e.g. BAAI/bge-large-en-v1.5)
TEI_EXTRA_ARGS  Extra CLI args
HF_TOKEN        Hugging Face token for gated models
SSH_PUBLIC_KEY  Public key for SSH
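As an illustration of how these variables fit together, the following sketch starts the image locally with Docker. Several details are assumptions rather than documented behaviour: that `true` is an accepted value for the `ENABLE_*` switches, that host port 2222 is free for SSH, and that the hypothetical `run_tei` wrapper suits your setup.

```shell
# Start the image with the environment variables described above.
# Assumptions: "true" enables the ENABLE_* switches; 2222 is an arbitrary
# host port chosen for SSH.
run_tei() {
  # A bare "-e NAME" forwards the variable from the host environment
  # only if it is exported there; otherwise it stays unset in the container.
  docker run --rm --gpus all \
    -p 8080:8080 -p 2222:22 \
    -e ENABLE_TEI=true \
    -e ENABLE_SSH=true \
    -e TEI_MODEL_ID="${TEI_MODEL_ID:-BAAI/bge-large-en-v1.5}" \
    -e TEI_EXTRA_ARGS \
    -e HF_TOKEN \
    -e SSH_PUBLIC_KEY \
    docker.io/manvarharsh/tei:cuda12
}
```

Export `HF_TOKEN` and `SSH_PUBLIC_KEY` in the calling shell before running `run_tei` if you need gated models or SSH access.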

Quick test

curl http://<pod-url>:8080/embed \
  -H 'Content-Type: application/json' \
  -d '{"inputs":"The quick brown fox"}'
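
The same endpoint also accepts a JSON array of strings, which is useful for bulk embedding. The sketch below wraps that in a helper and adds a readiness check against TEI's `/health` route; the function names and the `TEI_URL` variable (defaulting to `http://localhost:8080`) are hypothetical.

```shell
# Base URL of the TEI server (assumption: replace with your pod URL).
TEI_URL="${TEI_URL:-http://localhost:8080}"

# Succeed (exit status 0) once the server answers on /health.
tei_ready() {
  curl -sSf -o /dev/null "$TEI_URL/health"
}

# Embed several texts in one request.
# Usage: tei_embed_batch '["first text","second text"]'
tei_embed_batch() {
  curl -sS "$TEI_URL/embed" \
    -H 'Content-Type: application/json' \
    -d "{\"inputs\":$1}"
}
```

A typical call is `tei_ready && tei_embed_batch '["The quick brown fox","jumps over the lazy dog"]'`; the response is a JSON array with one embedding vector per input.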

Persistence

Mount a persistent volume at /data and set HF_HOME=/data/hf-cache so that downloaded model weights persist across restarts.
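
With plain Docker, the persistence setup might look like the sketch below. The volume name `tei-data` and the `run_tei_persistent` wrapper are hypothetical, and `ENABLE_TEI=true` assumes that `true` is an accepted value; on a hosted pod platform you would attach the platform's own volume at /data instead.

```shell
# Keep downloaded weights in a named Docker volume mounted at /data.
# Assumption: "tei-data" is an arbitrary volume name; a host path works too.
run_tei_persistent() {
  docker run --rm --gpus all -p 8080:8080 \
    -v tei-data:/data \
    -e HF_HOME=/data/hf-cache \
    -e ENABLE_TEI=true \
    -e TEI_MODEL_ID="${TEI_MODEL_ID:-BAAI/bge-large-en-v1.5}" \
    docker.io/manvarharsh/tei:cuda12
}
```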

See also