Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains
Summary
JetBrains released Mellum2, a 12B-parameter MoE model that activates 2.5B parameters per token, aimed at latency-sensitive code and text tasks. It's Apache 2.0 licensed and available on Hugging Face.
JetBrains announced Mellum2, a Mixture-of-Experts model with 12B total parameters designed for fast inference on text and code workloads. The release was published on the Hugging Face Blog with the model weights available under the Apache 2.0 license.
What’s actually new
Mellum2 is a from-scratch MoE architecture, not a fine-tune of an existing base. It activates only 2.5B of its 12B parameters per token, which is the basis of the inference speed claim: JetBrains reports more than 2x faster inference than similarly sized open models. The model is deliberately scoped to text and code and skips multimodal capabilities entirely.
JetBrains positions it not as a standalone general-purpose model but as a “focal” component: something you’d slot into a larger system for routing, RAG post-processing, sub-agent tasks, or context compression. The original Mellum was a code completion model; Mellum2 broadens that to natural language while keeping the same efficiency-first design philosophy. A full technical report covering architecture details, training setup, and benchmark methodology is available on arXiv and linked from the announcement.
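To make the 2.5B-of-12B figure concrete, here is a rough back-of-the-envelope sketch. It is not from the announcement: the helper name `moe_footprint` is made up for illustration, the "2 FLOPs per active parameter per token" figure is a common rule of thumb rather than a measured number, and the memory estimate assumes 16-bit weights with every expert kept resident.

```python
def moe_footprint(total_params: float, active_params: float, bytes_per_param: int = 2) -> dict:
    """Rough sizing for an MoE model. All figures are estimates, not measurements."""
    return {
        # Share of the weights that participate in each token's forward pass.
        "active_fraction": active_params / total_params,
        # Rule of thumb (assumption): ~2 FLOPs per active parameter per token,
        # forward pass only, ignoring attention cost over long contexts.
        "approx_flops_per_token": 2 * active_params,
        # Assumption: all experts stay loaded, so memory scales with TOTAL params.
        "weight_memory_gb": total_params * bytes_per_param / 1e9,
    }


stats = moe_footprint(total_params=12e9, active_params=2.5e9)
print(f"active fraction: {stats['active_fraction']:.1%}")
print(f"approx FLOPs per token: {stats['approx_flops_per_token']:.2e}")
print(f"weight memory at 16-bit: {stats['weight_memory_gb']:.0f} GB")
```

The takeaway under these assumptions: compute per token tracks the 2.5B active parameters, but memory still tracks the full 12B, so the speed benefit does not come with a matching reduction in hardware footprint.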
What it means for your config
Mellum2 is a model release, not a library or framework update, so there’s no direct config migration path to worry about. That said, if you’re running inference pipelines that reference Hugging Face model IDs (in transformers configs, vLLM serving configs, or orchestration YAML), note the new collection path (JetBrains/mellum-2) for when you swap it in. MoE models can behave differently from dense models in serving setups: tensor parallelism strategies, memory allocation, and expert routing overhead may require adjustments to your inference server configuration.
The announcement doesn’t detail specific serving recommendations (quantization support, recommended frameworks, or minimum hardware), so check the model card and technical report before committing to a deployment config. If you use model routing logic that dispatches to different models based on task type, which is the kind of setup Mellum2 is explicitly designed for, this is a new candidate to add to your routing table alongside larger reasoning models.
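As an illustration of the routing-table idea, here is a minimal sketch. Everything in it is hypothetical: the task names, the `pick_model` helper, and both model ID strings are placeholders. The announcement gives a collection path rather than a specific model ID, so the real ID has to come from the model card.

```python
# Hypothetical placeholders: replace with the real IDs from the model cards.
SMALL_MODEL = "<mellum2-model-id>"
LARGE_MODEL = "<large-reasoning-model-id>"

# Task types are illustrative; they mirror the uses named in the announcement.
ROUTING_TABLE = {
    "routing": SMALL_MODEL,
    "rag_postprocess": SMALL_MODEL,
    "context_compression": SMALL_MODEL,
    "sub_agent": SMALL_MODEL,
    "complex_reasoning": LARGE_MODEL,
}


def pick_model(task_type: str, default: str = LARGE_MODEL) -> str:
    """Return the model ID for a task type, falling back to the larger model."""
    return ROUTING_TABLE.get(task_type, default)


print(pick_model("context_compression"))
print(pick_model("something_unlisted"))
```

Falling back to the larger model for unknown task types is a deliberately conservative choice here; a cost-sensitive setup might prefer the opposite default.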
Recommended next step
Pull up the model collection on Hugging Face and the linked arXiv technical report. Before integrating, verify that the benchmark results are relevant to your specific workload: “competitive with similarly sized models” covers a wide range of tasks, and performance on code generation versus general reasoning can vary significantly. If you’re running multi-model agent systems where intermediate calls (routing, summarization, validation) are bottlenecked on latency or cost, Mellum2’s 2.5B active parameter footprint makes it worth benchmarking against whatever you’re currently using for those steps.
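For the benchmarking step, a small latency harness along these lines may be enough to get a first comparison. This is a sketch under assumptions: `benchmark` and `fake_model` are invented names, and `fake_model` is a stand-in you would replace with a function that calls your actual inference endpoint. It measures wall-clock latency only, not output quality or cost.

```python
import math
import statistics
import time
from typing import Callable, Sequence


def benchmark(call: Callable[[str], str], prompts: Sequence[str], warmup: int = 1) -> dict:
    """Time `call` on each prompt and report median and p95 latency in seconds."""
    if not prompts:
        raise ValueError("need at least one prompt")
    # Warm-up calls are not timed (caches, lazy initialization, connections).
    for prompt in prompts[:warmup]:
        call(prompt)
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        call(prompt)
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    # Nearest-rank p95: the smallest sample with at least 95% of samples at or below it.
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return {
        "n": len(latencies),
        "median_s": statistics.median(latencies),
        "p95_s": latencies[p95_index],
    }


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; replace with your inference client."""
    return prompt.upper()


report = benchmark(fake_model, ["classify this ticket", "summarize this diff", "route this query"])
print(report)
```

Run the same prompt set against your current model and the candidate, ideally with prompts sampled from the intermediate steps you actually care about, and compare the two reports.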
Read the full announcement on Hugging Face Blog → Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains