OpenAI’s o3 IQ score is making headlines after the model scored 136 on Mensa Norway’s public test.
This places the model’s performance above 98% of the human population, based on the standard IQ distribution. The data comes from the independent platform TrackingAI.org and points to a significant leap in AI cognitive benchmarks.
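For readers who want to check the percentile claim, the conversion from an IQ score to a population percentile can be sketched in a few lines. This is a minimal illustration, assuming the common norming of mean 100 and standard deviation 15; the article does not say which norming TrackingAI.org or Mensa Norway applies, and the function name `iq_percentile` is ours.

```python
from statistics import NormalDist

# Assumption: IQ scores follow a normal distribution with mean 100 and
# standard deviation 15. Other tests use different norms (e.g. SD 16 or 24).
IQ_DISTRIBUTION = NormalDist(mu=100, sigma=15)


def iq_percentile(score: float) -> float:
    """Return the percentage of the population expected to score below `score`."""
    return 100 * IQ_DISTRIBUTION.cdf(score)


if __name__ == "__main__":
    # Scores reported in the article for o3.
    for label, score in [("Mensa Norway", 136), ("Offline Test", 116)]:
        print(f"o3 on {label}: IQ {score} -> {iq_percentile(score):.1f}th percentile")
```

Under that assumption, a score of 136 sits 2.4 standard deviations above the mean, which is consistent with the "above 98%" figure quoted here.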
The o3 model is part of OpenAI’s elite “o-series,” which has dominated recent intelligence testing. Its 136 score qualifies it for Mensa Norway, marking the first time an AI model meets that threshold under test conditions designed for humans.
The benchmark utilized two different evaluations — an Offline Test and the public Mensa Norway test. While o3 scored a modest 116 on the Offline evaluation, its Mensa score surged to 136, possibly due to its better alignment with human-oriented testing or some subtle overlap in prompt familiarity.
Proprietary edge: o3 outpaces GPT-4o and Llama 4
The OpenAI o3 IQ score clearly highlights the widening performance gap between proprietary and open-source AI models. While o3 led with 136, GPT-4o scored only 95 on the same Mensa test. Even Meta’s best open model, Llama 4 Maverick, reached only 106.
TrackingAI’s testing method includes a prompt with four Likert-style response options. Each language model must choose one and justify the answer in 2–5 sentences. The best of the seven latest completions is used for scoring, with refusal events logged separately.
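The "best of the seven latest completions" rule described above can be sketched as follows. This is a hypothetical illustration, not TrackingAI's actual code: the article does not say how completions are stored or how "best" is determined, so the sketch assumes each completion carries a numeric score, that a refusal is recorded as a missing score, and that "best" means the highest score. The names `Completion` and `summarize_recent` are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Completion:
    """One test attempt. Assumption: a refusal is stored as score=None."""
    score: Optional[int]


def summarize_recent(
    completions: List[Completion], window: int = 7
) -> Tuple[Optional[int], int]:
    """Return (best score, refusal count) over the most recent `window` completions.

    Refusals are counted separately and never enter the score comparison.
    If every recent completion was a refusal, the best score is None.
    """
    recent = completions[-window:]
    refusals = sum(1 for c in recent if c.score is None)
    scores = [c.score for c in recent if c.score is not None]
    best = max(scores) if scores else None
    return best, refusals


if __name__ == "__main__":
    # Made-up history, oldest first; these are not real TrackingAI results.
    history = [Completion(s) for s in [140, 100, 110, None, 120, 105, None, 115, 118]]
    best, refusals = summarize_recent(history)
    print(f"best of last 7: {best}, refusals logged: {refusals}")
```

In this made-up history the early score of 140 falls outside the seven-completion window, so it does not count toward the reported result.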
Though the scores are compelling, some have noted the lack of confidence intervals and transparency in the prompting process. Without this, reproducibility and interpretation remain limited, even with structured evaluations.
OpenAI o3 IQ score bucks the multimodal underperformance trend
Another standout insight is how o3 defies the trend of underperformance in multimodal models. Previous models like o1 Pro saw a drop in IQ when vision was activated — from 122 to 86 on the Mensa test. But o3 maintains top-tier text comprehension while significantly improving image analysis.
This suggests OpenAI may have made a breakthrough in integrating multimodal data without sacrificing reasoning strength.
Still, even with an IQ of 136, critics argue that short-context reasoning — the kind tested by Mensa — doesn’t reflect real-world capabilities like long-term planning or contextual dialogue. The utility of these high scores remains debated.
Despite questions around methodology, the OpenAI o3 IQ score sets a new standard in AI cognitive benchmarking. As transparency in corporate model development remains elusive, third-party groups like LM-Eval and GPTZero are becoming essential.
More nuanced evaluations will be needed to measure deeper cognitive behaviors beyond IQ-style testing. Still, o3’s Mensa-level score confirms a clear evolution in the reasoning power of today’s best AI systems.