K2: An open source model that delivers frontier capabilities

March 04 2026 / Research

Published by admin

Introduction

If you follow AI research even casually, you’ve probably felt a gap between what frontier models can do and what we’re allowed to know about how they’re built. While open-weight models have given us some transparency and room for customization, the data, training curves, and engineering recipes usually sit behind the curtain, which keeps AI development an opaque process.

As part of our frontier model-focused work at MBZUAI’s Institute of Foundation Models, we are releasing a new version of K2 in a deliberate attempt to push back on that trend. K2 is a 70-billion-parameter, reasoning-centric foundation model designed not just to perform well, but to be fully inspectable. Weights, training code, data composition, mid-training checkpoints, evaluation harnesses – everything is meant to be open. It stands as the strongest fully open model, rivaling open-weight leaders in its size class, outperforming Qwen2.5-72B, and approaching the performance of Qwen3-235B.

And unlike many “open” models that quietly cap out at chatbot duties, K2 is explicitly built from scratch as a base for deep reasoning, long-context processing, and native tool use, in addition to functions such as conversation and knowledge retrieval.

A strong general model and a foundation for advanced reasoning

We began with a simple premise: you cannot study or deploy an advanced reasoning system on top of a weak foundation. K2 is therefore built as a dense 70B-parameter transformer, placing it in the same class as Qwen2.5-72B – one of the most widely used developer models today – but with stronger reasoning capabilities thanks to its dedicated mid-training design.

Yet the architecture is only the starting point. What differentiates K2 is the training philosophy. Rather than treating reasoning as an add-on applied at the end through superficial chain-of-thought finetuning, we embed reasoning deeply into the model’s mid-stage development, shaping the underlying representations long before the final polish.

We describe five pillars we explicitly optimized for:

Broad general knowledge.
Deep domain expertise in math, code, and science.
Robust long-context handling.
Early exposure to reasoning behaviors like planning and backtracking.
Native tool-calling capabilities for things like code execution and web search.

To get there, K2 goes through three distinct phases:

Pre-training for breadth and fluency.
Mid-training to infuse long-context skills and explicit reasoning behaviors.
Supervised finetuning (SFT) to turn the model into a usable assistant with tool calls and instruction following, while still leaving plenty of headroom for future reinforcement learning.

In the technical report we’ve published, we repeatedly use the phrase “360-open” to distinguish our approach from typical open weights releases. According to our terminology, a model isn’t truly open unless you also ship:

The full pre-training corpus (or at least its exact composition and curation recipe).
Any mid-training datasets, such as the reasoning-heavy TxT360-Midas corpus we introduce here.
The SFT data (TxT360-3efforts) that teaches the model to interact, reason at different effort levels, and call tools.
Training logs, hyperparameters, and infrastructure details, including how we handled loss spikes, batch sizing, and scaling laws.

Releasing this material matters because, in industry, continual training – taking a capable base model and nudging it towards a new domain or task – is now the norm. But with closed models, you’re always guessing what’s already baked into the weights. We’ve explicitly released mid-training checkpoints and data composition so other researchers can plan domain adaptation without accidentally erasing capabilities or double-counting fragile distributional quirks.

This is also about reproducibility. Our report explains the scaling law-inspired choices: we track an “effective averaging timescale”, tune learning rates and batch sizes under a fixed token budget of roughly 12 trillion tokens, and even discuss when our decay-to-zero schedule causes parameter norms to stall.
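The post does not define the “effective averaging timescale”. As a rough sketch, assuming it refers to the common view of AdamW weights as an exponential moving average of past updates, the timescale in steps is 1 / (learning rate × weight decay). The learning rate and weight decay below are hypothetical placeholders, not values from the report; only the 1.5 million step count comes from this post.

```python
def ema_timescale_steps(lr: float, weight_decay: float) -> float:
    """Steps over which AdamW effectively averages past updates.

    Assumes the decoupled update theta <- (1 - lr * wd) * theta - lr * u,
    which makes theta an exponential moving average with timescale 1 / (lr * wd).
    """
    if lr <= 0 or weight_decay <= 0:
        raise ValueError("lr and weight_decay must be positive")
    return 1.0 / (lr * weight_decay)


def timescale_fraction(lr: float, weight_decay: float, total_steps: int) -> float:
    """Timescale expressed as a fraction of the whole training run."""
    return ema_timescale_steps(lr, weight_decay) / total_steps


# Hypothetical hyperparameters, for illustration only.
LR = 1e-4
WEIGHT_DECAY = 0.1
TOTAL_STEPS = 1_500_000  # step count quoted in this post

tau = ema_timescale_steps(LR, WEIGHT_DECAY)
print(f"timescale: about {tau:,.0f} steps")
print(f"fraction of run: {timescale_fraction(LR, WEIGHT_DECAY, TOTAL_STEPS):.3f}")
```

Under this reading, a learning rate that decays to zero drives the timescale towards infinity, so late updates barely move the weights. That is one plausible way to think about the stalling parameter norms mentioned above, though the report should be consulted for the actual analysis.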

Teaching a 70B model to think while it trains

The most novel part of K2 is the mid-training phase, which begins when the model is already strong but not yet capable of reasoning. This is where we focused next.

First, we extended the context window to 512K tokens. Second, we started feeding the model thinking traces (explicit step-by-step solutions) at scale. We assembled over 250 million unique math problems and synthesized solutions, resulting in a huge corpus where each problem comes paired with a multi-step derivation, not just an answer.

On top of that, we synthesized reasoning behaviors that aren’t neatly tied to math: dual-process analysis, planning, data science exploration, even user-manual-style stepwise instructions. We used over a hundred prompt templates covering different “modes of thought,” all grounded in real user queries taken from open instruction datasets.

The intention was to make reasoning feel like a native behavior for the model, something it has seen in many domains, not just in the curated puzzles that dominate modern RL reasoning benchmarks.

Once the mid-training was done, we applied a relatively modest SFT phase. We trained on a curated mix of chat, tools, and reasoning data (TxT360-3efforts), using full-parameter SFT with long sequences and aggressive sequence packing so almost no tokens are wasted on padding. The SFT run itself was short precisely because we wanted to demonstrate that even light tuning can elicit strong capabilities when the base is well-prepared.
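To illustrate why sequence packing saves compute, here is a minimal sketch of one common approach, greedy first-fit-decreasing packing. It is not necessarily the packing scheme used for K2; the function names, example lengths, and the 1024-token row size are all hypothetical.

```python
from typing import List


def pack_sequences(lengths: List[int], max_len: int) -> List[List[int]]:
    """Group sequence indices into rows whose total length fits in max_len.

    Uses first-fit-decreasing: longest sequences first, each placed in the
    first row that still has room.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    rows: List[List[int]] = []   # indices packed into each row
    free: List[int] = []         # remaining token capacity of each row
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sequence {i} ({n} tokens) exceeds max_len")
        for r, room in enumerate(free):
            if n <= room:
                rows[r].append(i)
                free[r] -= n
                break
        else:
            # No existing row has room, so open a new one.
            rows.append([i])
            free.append(max_len - n)
    return rows


def padding_fraction(lengths: List[int], num_rows: int, max_len: int) -> float:
    """Share of token slots that would be filled with padding."""
    return 1.0 - sum(lengths) / (num_rows * max_len)


lengths = [700, 300, 512, 512, 200, 800, 24]  # hypothetical token counts
rows = pack_sequences(lengths, max_len=1024)
print(rows)  # [[5, 4, 6], [0, 1], [2, 3]]
print(f"unpacked padding: {padding_fraction(lengths, len(lengths), 1024):.1%}")
print(f"packed padding:   {padding_fraction(lengths, len(rows), 1024):.1%}")
```

In this toy case, padding drops from more than half of all token slots to under 1%. A real training pipeline would also need attention masks or position resets so that packed sequences do not attend to one another.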

We evaluated the full model along several axes: general knowledge, math and STEM, coding, long-context QA, and tool use. For AI researchers, the base model results are probably the most striking. In the mid-4 checkpoint, the strongest of the mid-training stages, K2 reaches:

55.1% on GPQA-Diamond, a challenging graduate-level science benchmark. This number further shoots up to 69.3% after the modest SFT stage.
93.6% on GSM8K with structured reasoning prompts.
94.7% on the MATH dataset.
On logic puzzles, our performance on Knights and Knaves (8 People) at the hardest difficulty level matches leading fully trained models such as DeepSeek-R1 (83%) and o3-mini-high (83%). Knights and Knaves is a notoriously challenging puzzle in which one must use logic to work out who in a group of people is answering questions truthfully and who is answering falsely.

All of those numbers are at or above the best open-weight baselines we compare against, including Qwen2.5-72B, and K2 is particularly dominant on logic-heavy puzzles like Countdown and Knights and Knaves.
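For readers unfamiliar with Knights and Knaves, the sketch below shows the structure of the puzzle with a brute-force solver. Knights always tell the truth and knaves always lie, so a world is consistent only if each person's statement is true exactly when that person is a knight. The three-person puzzle is a made-up illustration, not an item from the benchmark.

```python
from itertools import product
from typing import Callable, Dict, List

World = Dict[str, bool]  # name -> True if knight, False if knave


def solve(statements: Dict[str, Callable[[World], bool]]) -> List[World]:
    """Return every assignment of knight/knave that fits all statements."""
    names = list(statements)
    solutions: List[World] = []
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        # A knight's claim must be true; a knave's claim must be false.
        if all(claim(world) == world[name] for name, claim in statements.items()):
            solutions.append(world)
    return solutions


# Hypothetical puzzle:
#   A says "B is a knave."
#   B says "A and C are the same kind."
#   C says "A is a knight."
puzzle = {
    "A": lambda w: not w["B"],
    "B": lambda w: w["A"] == w["C"],
    "C": lambda w: w["A"],
}

print(solve(puzzle))  # [{'A': False, 'B': True, 'C': False}]
```

With eight people there are only 2^8 = 256 worlds to check, so exhaustive search is trivial for a program. The difficulty for a language model lies in carrying out the same case analysis reliably in text.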

General purpose scores tell a more nuanced story: K2 doesn’t top Qwen on overall MMLU, but shines on harder subsets such as MMLU-Pro and GPQA-Diamond, where careful reasoning and calibrated answers matter more than regurgitating textbook facts.

Studying a model’s lifecycle, not just its final score

Another part of the report we’re proud of is the longitudinal study of K2’s development. Because we saved checkpoints and ran standard evaluations at each stage, we were able to plot how capabilities emerge over time.

For example, GSM8K accuracy rises with the shift to structured reasoning formats and larger token budgets. Logic benchmarks jump sharply at the beginning of certain mid-training stages, which we interpret as behavior shifts rather than simple knowledge accumulation: the model starts to prefer planning-style responses once it’s seen enough examples in the mid-training corpora.

But there’s also a darker side to “thinking tokens”: they can reveal things the final answer is careful to hide. So we devoted an entire section to safety evaluation and “thinking–response divergence.”

We ran K2 across 72 safety and adversarial stress tests, sampling 200 prompts from each. Overall, the model produces safe or appropriately refusing responses about 86% of the time. Performance is often above 95% on chemistry, biology, financial compliance, IP, and medical guidance, as well as on social harms like hate, extremism, and criminal instruction.
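For scale, 72 tests at 200 prompts each come to 14,400 prompts. The sketch below shows how per-test and overall safe-response rates could be tallied from judged outputs. The test names and counts are invented for illustration, and the judging step that produces the safe/unsafe label is assumed to exist elsewhere.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def safe_rates(results: Iterable[Tuple[str, bool]]) -> Tuple[Dict[str, float], float]:
    """Tally (test_name, is_safe) pairs into per-test rates and an overall rate."""
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0])  # test -> [safe, total]
    for test, is_safe in results:
        counts[test][0] += int(is_safe)
        counts[test][1] += 1
    total = sum(n for _, n in counts.values())
    if total == 0:
        raise ValueError("no results to aggregate")
    per_test = {test: safe / n for test, (safe, n) in counts.items()}
    overall = sum(safe for safe, _ in counts.values()) / total
    return per_test, overall


TOTAL_PROMPTS = 72 * 200  # 14,400 prompts across the whole suite

# Two hypothetical tests with 200 judged prompts each.
toy_results = (
    [("chemistry", True)] * 190 + [("chemistry", False)] * 10
    + [("jailbreak", True)] * 150 + [("jailbreak", False)] * 50
)
per_test, overall = safe_rates(toy_results)
print(TOTAL_PROMPTS, per_test, overall)
```

Because every test contributes the same number of prompts, the overall rate equals the plain average of the per-test rates, so a handful of weak categories can pull the headline number well below the best-performing ones.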

We also compared K2’s behavior to its predecessor K2 Think and found subtle but important differences. K2 is generally safer and less jailbreak-prone, but it sometimes over-refuses harmless queries that merely look scary (“How to kill a python program”) and, like every modern model, remains vulnerable to evolving jailbreak patterns.

The takeaway is that safety mechanisms today often act like output filters, not deep semantic priors. The model can think something unsafe but learn to hide it in the final message. For open weight models that people will finetune, compose, and inspect, that’s both a transparency win and a safety challenge.

Why K2 matters

On paper, K2 might look like another open 70B model competing on the common benchmarks in use today. In practice, it is something far rarer: a frontier-scale system intentionally built to be examined, extended, and improved in public.

For industry teams, K2 provides a reasoning-ready foundation with full lifecycle documentation, making domain adaptation and continuous training far more predictable. For researchers, it offers a testbed where questions about chain-of-thought training, RL-based reasoning, safety divergence, and long-context mechanisms can finally be studied with full visibility into the data and checkpoints that shaped the model.

And for the broader ecosystem, K2 demonstrates that open models need not be smaller, weaker versions of closed ones. You can target state-of-the-art reasoning, math, logic, and tool use, and still publish the recipe, the data, and the lessons learned from 1.5 million steps of training.
