Servary
Run open-weight LLMs on your own infrastructure.
Register a model. Pick an environment. Ship a stable URL. The full lifecycle in one console, on your hardware.
- VM or Kubernetes, one console
- Your weights, your hardware
- 60+ models, day-one support
Shipping an LLM in production has become a tax. Bespoke YAML, glue scripts, Slack threads, a quarter spent rediscovering why the last rollout broke. Most teams pay it. Few enjoy it.
Servary collapses that work into one console. Pull weights from Hugging Face or your S3 bucket, target a VM or any Kubernetes cluster you own, and the platform handles the rest: rollout, gateway, audit. You keep the weights, the traffic, and the hardware. We never touch your network.
How it works
Three steps from a model name on Hugging Face to a stable endpoint your clients can hit. Servary owns everything in between.
1. Register a model
   Point Servary at Hugging Face, an S3-compatible bucket, or your private registry. Every revision is content-addressed and reproducible across environments.
2. Pick an environment
   Target a single VM for a quick proof, or a managed Kubernetes cluster for production. The control plane is the same; the runner adapts.
3. Ship a stable URL
   Servary handles the rollout, the gateway, the audit log, and the lifecycle. Your clients integrate once with a URL that survives pod churn and model upgrades.
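To make "content-addressed" in step 1 concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Servary's actual implementation: it assumes a revision is identified by a SHA-256 digest over the relative path, size, and bytes of every file in a model directory. The function name `revision_hash` is hypothetical.

```python
import hashlib
from pathlib import Path


def revision_hash(model_dir: str) -> str:
    """Return a SHA-256 hex digest over every file under model_dir.

    Hypothetical helper: files are visited in sorted relative-path order,
    and each path and file size is mixed into the digest, so two identical
    directory trees always produce the same digest.
    """
    root = Path(model_dir)
    files = sorted(
        (p for p in root.rglob("*") if p.is_file()),
        key=lambda p: p.relative_to(root).as_posix(),
    )
    digest = hashlib.sha256()
    for path in files:
        rel = path.relative_to(root).as_posix().encode("utf-8")
        # Length-prefix the path and the content so boundaries are unambiguous.
        digest.update(len(rel).to_bytes(8, "big"))
        digest.update(rel)
        digest.update(path.stat().st_size.to_bytes(8, "big"))
        with path.open("rb") as fh:
            # Read in 1 MiB chunks so large weight files are never loaded whole.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()
```

Under this scheme, two copies of the same weights in different environments hash to the same value, and changing a single byte in any file yields a different revision.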
Features
The pieces of an LLM deployment, in one place.
- One registry, every model
  Pull from Hugging Face or any S3 bucket. Every revision is content-addressed and reproducible: the same hash deploys identically across clusters, regions, and rollbacks.
- Secrets, scoped and rotated
  Encrypted at rest, attached to environments by reference, and rotated without rebuilding deployments. HF tokens, model API keys, and registry credentials never sit in YAML, git, or a teammate's clipboard.
- Deployments that stay in sync
  Each deployment carries a spec hash; Servary's reconciler keeps the cluster in lockstep. Drift is detected in seconds, fixed automatically, and surfaced in the UI before customers notice.
- A live view of every deployment
  Pods, logs, events, and metrics on one tab. The pieces you'd otherwise stitch from kubectl, Grafana, and a terminal land where you ship from, with warm-up traces and status changes pushed live.
- Endpoints your clients can trust
  Every deployment gets a gateway-managed URL that survives pod churn, model upgrades, and environment moves. Your clients integrate once; you replace what's behind the URL whenever you need to.
- Runs anywhere you do
  Spin up Servary against a single VM for a quick proof, or against a managed Kubernetes cluster for production. The console and the API don't change; only the runner adapts.
- Audit-ready by default
  Every API call lands in an immutable audit log: who, what, when, which spec hash. Compliance, postmortems, and security reviews get the trail they need with no extra tooling.
- And much more…
  Per-deployment metrics, multi-runtime support, traffic management, multi-LoRA, scale-to-zero, and cost estimation are all on the way.
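The spec hash described under "Deployments that stay in sync" can be illustrated with a short Python sketch. This is a guess at the general technique, not Servary's reconciler: it assumes a spec is a plain dictionary hashed over canonical JSON, and that drift means the observed spec hashes differently from the desired one. The names `spec_hash` and `reconcile` and the spec fields `model` and `replicas` are hypothetical.

```python
import hashlib
import json


def spec_hash(spec: dict) -> str:
    """Hash a deployment spec via canonical JSON (sorted keys, no whitespace)."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(desired: dict, observed: dict) -> dict:
    """Return the spec that should be running after one reconcile pass.

    If the observed spec hashes differently from the desired spec, that is
    treated as drift and a copy of the desired spec wins. Otherwise the
    observed spec is returned unchanged.
    """
    if spec_hash(observed) != spec_hash(desired):
        return dict(desired)
    return observed


# Hypothetical specs: someone scaled the deployment by hand.
desired = {"model": "example-model", "replicas": 2}
drifted = {"replicas": 3, "model": "example-model"}

print(spec_hash(desired) == spec_hash(drifted))  # False: drift detected
print(reconcile(desired, drifted) == desired)    # True: desired spec restored
```

Sorting keys before hashing means two specs that differ only in key order produce the same hash, so only a real change in values registers as drift.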
Supported models
Any open-weight LLM supported by vLLM or SGLang runs on day one, with more runtimes on the way. The list below is what the team uses in production; the long tail of community models follows the same code path.
- Llama 3.3 70B
- Llama 4 Scout / Maverick
- Qwen 3 0.6B → 235B MoE
- DeepSeek V3 / R1
- Mistral Small 3.1
- Mistral Large 2
- Phi-4 14B
- Gemma 3 1B → 27B
- Command-R+
Need something not listed? Open an issue or message us. Most additions are a runtime config change, not a code change.
Questions
What is Servary, exactly?
Servary is a self-hosted control plane for serving open-weight LLMs on your own infrastructure. Point it at a model registry and a target environment, and it handles the full lifecycle: registration, rollout, a stable gateway, and an audit trail. No SaaS, no proxy, no shared tenancy. Your weights and traffic never leave your network.
Which models can I serve?
Any open-weight LLM supported by the inference runtimes Servary drives today: vLLM and SGLang. More runtimes are on the roadmap. That covers Llama 3.3 / 4, the full Qwen 3 family including the MoE variants, DeepSeek V3 / R1, Mistral Small / Large, Phi-4, Gemma 3, Command-R+, and many more. The supported-models list is updated continuously.
Do I need Kubernetes?
No. Single-VM installs and managed Kubernetes are both first-class targets. The same control plane drives both, so a project can graduate from one to the other without changing tools.
Is Servary open source?
Not on day one, but it is on our roadmap. Servary is self-hosted by default, so from the start you keep the weights, the traffic, and the audit log. We plan to open up the core over time and will share specifics as we get closer.
When can I use it?
We're finalising the public preview. Click "Notify me" at the top, and you'll hear from us the day it ships.