mmJEE-Eval
A Bilingual Multimodal Benchmark for Evaluating Scientific Reasoning in Vision-Language Models
🤗 Dataset · Leaderboard · Code · arXiv

Introduction
Contemporary vision-language models (VLMs) perform well on existing multimodal reasoning benchmarks (78-85% accuracy on MMMU and MathVista), yet these results fail to distinguish genuine scientific reasoning from pattern matching. To address this gap, we introduce mmJEE-Eval, a multimodal, bilingual (English and Hindi) benchmark comprising 1,460 questions from India's JEE Advanced examination (2019-2025), spanning pre-college Physics, Chemistry, and Mathematics. Our evaluation of 17 state-of-the-art models reveals that while frontier VLMs (GPT-5, Gemini 2.5 Pro/Flash) achieve 77-84% accuracy on held-out 2025 questions, open-source models plateau at 37-45% despite scaling to 400B parameters, a gap that existing benchmarks do not reveal. Although closed frontier models from Google and OpenAI reach high problem-solving accuracies (up to 100% pass@3 and pass@5), they collapse when the meta-cognitive load is increased (GPT-5 fixes just 5.2% of its errors). Systematic ablations show that mmJEE-Eval's difficulty stems from complexity and reasoning depth rather than memorization. In effect, our benchmark separates superior training and reasoning methodologies where existing alternatives cannot. We publicly release our code and data: https://github.com/ArkaMukherjee0/mmJEE-Eval
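As a reference point for the pass@k numbers quoted above, one common way to compute them is the standard unbiased pass@k estimator; the sketch below is illustrative only and may not match the exact protocol used for mmJEE-Eval (the function name and sample counts are our own assumptions).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator: the probability that at least
    one of k attempts, drawn without replacement from n sampled attempts
    of which c are correct, answers the question correctly."""
    if n - c < k:
        return 1.0  # every size-k draw necessarily contains a correct attempt
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical usage: 5 sampled attempts on a question, 2 of them correct
print(pass_at_k(n=5, c=2, k=3))  # 0.9
```

With n = k (e.g., exactly three attempts for pass@3), this reduces to simply checking whether any attempt was correct.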
Leaderboard
Performance comparison on mmJEE-Eval and other industry-standard benchmarks. The first four metric columns report mmJEE-Eval (ours); the remaining columns (MMMU, MMMU Pro, MathVista, CharXiv) are industry-standard benchmarks.

| Model | Acc. % (Full) | Acc. % (2025 set) | Marks (%) | Marks w/ CT (%) | MMMU | MMMU Pro | MathVista | CharXiv |
|---|---|---|---|---|---|---|---|---|
| Random choice | 8.3 ± 0.3 | 9.3 ± 1.6 | -12.7% | -1.9% | 22.1% | 12.6% | — | 10.8% |
| Aya Vision 8B | 8.5 ± 0.6 | 9.2 ± 1.0 | -12.5% | -4.4% | 39.9% | — | — | — |
| Kimi VL Thinking 2506 16B | 9.8 ± 0.2 | 11.2 ± 2.7 | -4.44% | -3.06% | 64.0% | 46.3% | 68.7% | — |
| Qwen 2.5 VL 7B | 10.9 ± 0.8 | 12.4 ± 1.9 | -5.83% | 1.4% | 58.6% | 46.2% | 68.2% | 42.5% |
| InternVL3 8B | 10.8 ± 0.5 | 12.5 ± 1.3 | -6.1% | 3.6% | 62.7% | — | 71.6% | 37.6% |
| InternVL 3.5 14B | 15.6 ± 0.9 | 16.1 ± 1.6 | 5.0% | 7.2% | 73.3% | — | 80.5% | — |
| InternVL 3.5 30B | 18.8 ± 1.2 | 18.8 ± 1.1 | 6.1% | 7.2% | 75.6% | — | 80.9% | — |
| Grok 4 Fast | 19.3 ± 1.1 | 17.6 ± 1.5 | 7.5% | 11.1% | — | — | — | — |
| InternVL3 78B | 22.0 ± 0.8 | 23.6 ± 3.4 | 8.33% | 12.22% | 72.2% | — | 79.0% | 46.0% |
| Human (qualifying cutoff) | — | — | — | 20.6% | 76.2% | 73% | — | — |
| Gemma 3 27B | 29.8 ± 0.3 | 30.6 ± 0.7 | 18.6% | 21.9% | 64.9% | — | 63.3% | — |
| Llama4 Scout 109B | 40.4 ± 0.5 | 37.1 ± 2.3 | 28.1% | 33.9% | 69.4% | — | 70.7% | — |
| Qwen3 VL 235B Instruct | 44.6 ± 0.4 | 46.2 ± 4.9 | 36.1% | 37.2% | 78.7% | 68.1% | 84.9% | — |
| Llama4 Maverick 400B | 50.9 ± 0.3 | 43.8 ± 2.8 | 32.8% | 35.8% | 73.4% | — | 73.7% | — |
| OpenAI o3 | 77.4 ± 0.7 | 72.7 ± 3.7 | 66.7% | 66.9% | 82.9% | 76.4% | 81.1% | 78.6% |
| GPT-5-mini 'Medium' | 75.3 ± 0.7 | 71.3 ± 6.7 | 70.8% | 71.94% | 80.0% | — | — | — |
| Gemini 2.5 Pro | 81.2 ± 0.7 | 77.4 ± 0.8 | 70.6% | 79.2% | 84.0% | 71.2% | 84.6% | — |
| Gemini 2.5 Flash 09-2025 | 82.6 ± 0.4 | 79.8 ± 1.2 | 83.3% | 83.3% | 79.7% | — | 81.2% | — |
| GPT-5 | 83.9 ± 0.6 | 79.5 ± 1.5 | 80.8% | 81.9% | 84.2% | 78.4% | 82.7% | 81.1% |
| Human (Top 10 performers) | — | — | — | 90.1% | 82.6% | 80.8% | — | — |
| Human (Rank 1, Topper) | — | — | — | 92.2% | 88.6% | 85.4% | 60.3% | 80.5% |
| Δ over human (Human – Best model) | — | — | — | 10.1% | 4.4% | 7% | -24.3% | -0.6% |
Note: For mmJEE-Eval, Acc. % (Full) reports Pass@1 accuracy on the full set of 1,460 questions; Acc. % (2025 set) reports Pass@1 accuracy on the held-out 2025 subset; Marks (%) is the total score across the two papers of JEE Advanced 2025 under the official marking scheme; and Marks w/ CT (%) reports confidence-thresholded scores. For the other benchmarks, we source Pass@1 accuracies from the respective leaderboards.
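To illustrate how Marks w/ CT (%) could be derived, the snippet below is a minimal sketch assuming a simplified single-correct MCQ scheme of +3 for a correct answer, -1 for an incorrect one, and 0 for an abstention below a confidence threshold; the actual JEE Advanced marking scheme also includes multiple-correct and numerical questions with partial marking, and the threshold value and names here are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    predicted: str     # model's chosen option, e.g. "B"
    gold: str          # reference answer
    confidence: float  # model-reported confidence in [0, 1]

def thresholded_marks(attempts: list[Attempt], threshold: float = 0.5) -> int:
    """Illustrative confidence-thresholded scoring: questions the model is
    unsure about are skipped (0 marks, like an unattempted question);
    the rest are scored with a simplified +3 / -1 single-correct scheme."""
    total = 0
    for a in attempts:
        if a.confidence < threshold:
            continue  # abstain: no marks gained or lost
        total += 3 if a.predicted == a.gold else -1
    return total

# Hypothetical usage: report the score as a percentage of the maximum
attempts = [Attempt("B", "B", 0.9), Attempt("C", "A", 0.3), Attempt("D", "D", 0.7)]
print(100 * thresholded_marks(attempts) / (3 * len(attempts)))  # ~66.7
```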