All subjects are at a postdoctoral level.
The long-awaited next-generation large model from xAI—Grok 4 has finally been released! Its capabilities exceed our expectations.
Around noon today Beijing time, the long-awaited xAI launch event finally began, with Musk appearing in the live stream, saying, 'This is the best AI in the world, let us show it off.'
Musk stated that Grok 4 can achieve full marks in the SAT exam (American college entrance exam) every time without previewing the questions; it can also achieve near-full marks in any subject of the GRE, surpassing the level of all graduate students worldwide. The most powerful aspect of Grok 4 is its reasoning ability, which has achieved reasoning levels beyond that of humans.
Musk believes that Grok 4 can achieve new scientific discoveries within this year.
Thanks to enhanced computing power and reinforcement learning training, Grok 4's reasoning ability has improved by 10 times compared to its predecessors. From Grok 2 to Grok 4, the technological paradigms used are different, including next token prediction, pre-training computation, pre-training + RL, and RL computation.
Among them, the computational power from Grok 2 to Grok 3 during pre-training increased by 10 times, and Grok 3 reasoning introduced RL fine-tuning for the first time, bringing deep reasoning capabilities. Grok 4 reasoning's reinforcement learning further increased the computational power by 10 times, resulting in significant improvement in reasoning abilities.
Additionally, due to the enhanced tool-calling capabilities, Grok 4 has further amplified its own intelligence. Therefore, it can achieve results far exceeding SOTA on various challenging benchmarks.
Next comes the main event: the benchmark results for Grok 4.
First is HLE (Humanities Last Exam), which includes mathematics, chemistry, and logic. In the benchmark results leaked last Saturday, Grok 4 scored a standard score of 35% on HLE, which improved to 45% after using reasoning techniques, but most netizens expressed skepticism.
In today's live broadcast, xAI researchers stated that the previous SOTA models could achieve a maximum score of 41.0% when using tools.
Now, Grok 4 has further improved this benchmark score.
Specifically, compared to other SOTA models (o3, Gemini 2.5 Pro), Grok 4 scored 38.6% when using tools, while Grok 4 Heavy's score soared to 44.4%. If the large model spends more time thinking during the test and appropriately uses more external tools, the HLE score can further increase to 50.7%.
Regarding other benchmark results, including GPQA (graduate-level Google verification question answering benchmark), AIME25 (American Mathematics Competition Invitational), LCB (Jan-May) (programming competition/online algorithm competition), HMMT25 (high school team mathematics competition), and USAMO25 (top American high school mathematics competition). From the chart below, it can be seen that Grok 4 Heavy has achieved the latest SOTA.
In contrast, humans can hardly answer a few questions in the HLE test. Musk emphasized multiple times: Grok has reached a postdoctoral level in all subjects, without exception. It hasn't discovered new sciences or new physical laws, but that is just a matter of time.
"If Grok does not discover practical new scientific technologies within this year, I would be very surprised," Musk said.
The complete benchmark results from the large model performance evaluation platform Artificial Analysis indicate that Grok 4 has become the leading AI model, achieving a total score of 73, ahead of o3, Gemini 2.5 Pro, Claude 4 Opus, and DeepSeek R1 0528.
Imagine where we are now; we are in the midst of an explosion of intelligence development, unprecedented in human history. It’s time to see what Grok 4 can specifically do.
Let’s look at one or two demos, such as 'HTML animations based on physical principles, simulating a 30-second visualization of two black holes colliding and generating gravitational waves':
Grok 4 almost completely presents the simulation effect of gravitational waves from two black holes approaching to the final merge. One side of the animation shows the reasoning process and steps of the calculations along with the code, with links to each paper consulted.
Grok 4's versatility has strengthened.
In addition to the improvement in scores across various language benchmarks, Grok 4 has also been enhanced in other areas.
Among them, Grok 4's voice capabilities are twice as fast compared to the previous generation, with lower end-to-end latency; it supports 5 types of voices; the total daily user engagement time has increased by 10 times.
The newly added Grok characters Eve and Sal are now available in the iOS version of Grok, with Sal supporting various personalities, while Eve can sing and whisper.
In the ARC-AGI benchmark test set, it is specifically designed to evaluate the general reasoning abilities of artificial intelligence systems and is regarded as an important touchstone on the path to AGI, aimed at testing whether the model can flexibly solve new problems it has never encountered, just like humans.
In this challenging benchmark targeting the core capabilities of AGI, Grok 4 also achieved the latest SOTA, reaching 15.9% on ARC-AGI-2, nearly doubling the previous commercial SOTA and surpassing the current Kaggle competition SOTA.
In the Vending-Bench benchmark test, it focuses on evaluating the ability of agents to perform complex operational tasks in the real physical world, with the core goal being to solve the 'Sim2Real Gap' (the gap from simulation to reality) between traditional simulation environments (like Habitat, AI2-THOR) and the real world, promoting the practical application capabilities of robotics in open scenarios.
It can be seen that Grok 4 has taken the lead compared to Claude Opus 4, Human, Gemini 2.5 Pro, and o3.
Grok 4 can be called via API, providing a context window of 256K tokens. It is currently open for use, with the version number grok-4-0709, and the price is the same as Grok 3.
According to tests from Artificial Analysis, xAI's API currently provides Grok 4 services at a rate of 75 tokens per second. Although it does not match o3 (188 tokens per second), it is better than Claude 4 Opus Thinking (66 tokens per second).
Finally, in terms of game experience, DannyLimanseta created an FPS shooting game in 4 hours using Grok 4. Grok can not only be used to create games but can also run games, gain insights into the elements of excellent games, and provide improvement suggestions. The results look really good.
Next, xAI is expected to release code models, multimodal agents, and video generation models, and it looks like the new product releases will reach a monthly update speed.
Currently, Grok 4 is online, but it requires a paid subscription, and the price is quite expensive. The payment model is divided into annual and monthly payments, where SuperGrok is $300 per year (approximately 2154 RMB), and SuperGrok Heavy is $3000 per year (approximately 21540 RMB).
Official website link: https://grok.com/