OpenAI's two models gpt-oss-120b and gpt-oss-20b achieve performance on par with the commercial version, with the smaller version requiring only 16GB RAM.

OpenAI announced a significant milestone with the release of two open-source language models for the first time since GPT-2, marking a substantial strategic shift for the company. The two models gpt-oss-120b and gpt-oss-20b can operate on consumer hardware while still achieving performance close to high-end commercial models, released under the Apache 2.0 license allowing unrestricted use, modification, and commercialization.

The gpt-oss-120b version uses a mixture-of-experts architecture with a requirement of 80GB VRAM GPU, activating 5.1 billion parameters per token. Meanwhile, gpt-oss-20b can operate on devices with only 16GB of memory, activating 3.6 billion parameters per token. Both models support context lengths of up to 128,000 tokens, equivalent to GPT-4o, and are trained using reinforcement learning alongside advanced techniques from the o3 system.

The performance of these two models is particularly impressive in intensive benchmarks. On Codeforces, gpt-oss-120b achieved an Elo of 2622 when using tools and 2463 when not using tools, surpassing o4-mini and nearly matching o3. In the AIME 2024 mathematics test, the model achieved 96.6% compared to only 87.3% for o4-mini. Notably, on HealthBench for biomedical applications, gpt-oss-120b achieved 57.6%, exceeding even o3 with 50.1%.

Flexible optimization for practical deployment

A highlight of the two models is their flexibility in adjusting between latency and performance through three levels of reasoning: low, medium, and high. Developers can adjust this with just one command in the system message, allowing optimization tailored to specific applications. The post-training phase is conducted similarly to o4-mini, including supervised training and high-computation reinforcement learning.

OpenAI places special emphasis on safety, although the model is released with full rights for modification. Training data has removed sensitive information related to chemistry, biology, radiation, and nuclear issues. The post-training phase applies a layered command alignment method and techniques for rejecting dangerous commands. Three independent expert groups have evaluated the security and confirmed that the model does not reach the dangerous threshold according to the Preparedness Framework.

A bold decision by OpenAI is to not supervise the Chain-of-Thought of the two models, a significant difference from current commercial models. The company believes that keeping the unsupervised thought chain intact is crucial for monitoring biased, deceptive, or exploited behavior. This contrasts with the best current models that hide CoT to avoid duplication.

Two models are now available on HuggingFace with relatively high hardware requirements but accessible. While gpt-oss-120b requires a GPU with at least 80GB VRAM like the Nvidia A100, the gpt-oss-20b version can run on a 16GB VRAM GPU like the Nvidia RTX 4090, no longer posing a significant barrier for individual users and edge AI developers. This opens up opportunities for the community to develop powerful AI applications without relying on cloud services.