Servary Notify me

Servary

Run open-weight LLMs on your own infrastructure.

Register a model. Pick an environment. Ship a stable URL. The full lifecycle in one console, on your hardware.

servary.example.com / deployments
Deployment
Runtime
Environment
Status
Tags
llama-3.3-70b-instruct
vLLM
prod-us-east
Ready
8×H100 FP8
qwen3-30b-a3b
SGLang
prod-eu-west
Ready
MoE BF16
deepseek-r1-distill-32b
vLLM
staging
Rolling out
reasoning
mistral-small-3.1-24b
SGLang
prod-us-east
Ready
multimodal
phi-4-14b
vLLM
shared-pool
Queued
scratch

Shipping an LLM in production has become a tax. Bespoke YAML, glue scripts, Slack threads, a quarter spent rediscovering why the last rollout broke. Most teams pay it. Few enjoy it.

Servary collapses that work into one console. Pull weights from Hugging Face or your S3 bucket, target a VM or any Kubernetes cluster you own, and the platform handles the rest: rollout, gateway, audit. You keep the weights, the traffic, and the hardware. We never touch your network.

How it works

Three steps from a model name on Hugging Face to a stable endpoint your clients can hit. Servary owns everything in between.

  1. 1

    Register a model

    Point Servary at Hugging Face, an S3-compatible bucket, or your private registry. Every revision is content-addressed and reproducible across environments.

  2. 2

    Pick an environment

    Target a single VM for a quick proof, or a managed Kubernetes cluster for production. The control plane is the same; the runner adapts.

  3. 3

    Ship a stable URL

    Servary handles the rollout, the gateway, the audit log, and the lifecycle. Your clients integrate once with a URL that survives pod churn and model upgrades.

Features

The pieces of an LLM deployment, in one place.

Supported models

Any open-weight LLM supported by vLLM or SGLang runs on day one, with more runtimes on the way. The list below is what the team uses in production; the long tail of community models follows the same code path.

Need something not listed? Open an issue or message us. Most additions are a runtime config change, not a code change.

Questions

What is Servary, exactly?

Servary is a self-hosted control plane for serving open-weight LLMs on your own infrastructure. Point it at a model registry and a target environment, and it handles the full lifecycle: registration, rollout, a stable gateway, and an audit trail. No SaaS, no proxy, no shared tenancy. Your weights and traffic never leave your network.

Which models can I serve?

Any open-weight LLM supported by the inference runtimes Servary drives today, vLLM and SGLang, with more runtimes on the roadmap. That covers Llama 3.3 / 4, the full Qwen 3 family including the MoE variants, DeepSeek V3 / R1, Mistral Small / Large, Phi-4, Gemma 3, Command-R+, and many more. The supported-models list is updated continuously.

Do I need Kubernetes?

No. Single-VM installs and managed Kubernetes are both first-class targets. The same control plane drives both, so a project can graduate from one to the other without changing tools.

Is Servary open source?

Not on day one, but it is on our roadmap. Servary is self-host by default, so from the start you keep the weights, the traffic, and the audit log. We plan to open up the core over time and will share specifics as we get closer.

When can I use it?

We're finalising the public preview. Click Notify me at the top, and you'll hear from us the day it ships.