mmJEE-Eval

A Bilingual Multimodal Benchmark for Evaluating Scientific Reasoning in Vision-Language Models

1Kalinga Institute of Industrial Technology (KIIT), 2Indian Institute of Technology (IIT) Bhubaneswar
*(Work done while at IIT Bhubaneswar)

IJCNLP-AACL 2025 (Findings)
NeurIPS 2025 MATH-AI Workshop (Poster, Non-Archival)
🤗 Dataset · Leaderboard · Code · arXiv

Introduction

Contemporary vision-language models (VLMs) perform well on existing multimodal reasoning benchmarks (78-85% accuracy on MMMU and MathVista), yet these results do not reliably distinguish genuine scientific reasoning from pattern matching. To address this gap, we introduce mmJEE-Eval, a multimodal, bilingual (English and Hindi) benchmark of 1,460 questions from India's JEE Advanced examination (2019-2025), spanning pre-college Physics, Chemistry, and Mathematics. Our evaluation of 17 state-of-the-art models shows that while frontier VLMs (GPT-5, Gemini 2.5 Pro/Flash) reach 77-84% accuracy on held-out 2025 questions, open-source models plateau at 37-45% even when scaled to 400B parameters, a gap not observed on existing benchmarks. Although closed frontier models from Google and OpenAI post high problem-solving accuracies (up to 100% pass@3 and pass@5), they collapse once the metacognitive load increases (GPT-5 corrects just 5.2% of errors). Systematic ablations show that mmJEE-Eval's difficulty stems from problem complexity and reasoning depth rather than memorization. In effect, our benchmark separates superior training and reasoning methodologies where alternative benchmarks fail to discriminate. We publicly release our code and data: https://github.com/ArkaMukherjee0/mmJEE-Eval
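
The full evaluation harness lives in the linked repository; as a rough illustration of the pass@k numbers quoted above, the sketch below implements the standard unbiased pass@k estimator computed from n sampled answers per question, of which c are judged correct. The function name and the example counts are illustrative assumptions, not the paper's exact procedure.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k answers
    drawn without replacement from n samples (c of them correct) is correct."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than k, so some draw must be correct
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 5 sampled answers for a question, 2 judged correct.
print(round(pass_at_k(n=5, c=2, k=1), 3))  # 0.4
print(round(pass_at_k(n=5, c=2, k=3), 3))  # 0.9
```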

Video Presentation

Leaderboard

Performance comparison on mmJEE-Eval and other industry standard benchmarks

The table interleaves closed-source models, open-source models, and human baselines; the first four metric columns report mmJEE-Eval (ours), the last four report industry-standard benchmarks.

| Model | Acc. % (Full) | Acc. % (2025 set) | Marks (%) | Marks w/ CT (%) | MMMU | MMMU-Pro | MathVista | CharXiv |
|---|---|---|---|---|---|---|---|---|
| Random choice | 8.3 ± 0.3 | 9.3 ± 1.6 | -12.7% | -1.9% | 22.1% | 12.6% | — | 10.8% |
| Aya Vision 8B | 8.5 ± 0.6 | 9.2 ± 1.0 | -12.5% | -4.4% | 39.9% | — | — | — |
| Kimi VL Thinking 2506 16B | 9.8 ± 0.2 | 11.2 ± 2.7 | -4.44% | -3.06% | 64.0% | 46.3% | 68.7% | — |
| Qwen 2.5 VL 7B | 10.9 ± 0.8 | 12.4 ± 1.9 | -5.83% | 1.4% | 58.6% | 46.2% | 68.2% | 42.5% |
| InternVL3 8B | 10.8 ± 0.5 | 12.5 ± 1.3 | -6.1% | 3.6% | 62.7% | — | 71.6% | 37.6% |
| InternVL 3.5 14B | 15.6 ± 0.9 | 16.1 ± 1.6 | 5.0% | 7.2% | 73.3% | — | 80.5% | — |
| InternVL 3.5 30B | 18.8 ± 1.2 | 18.8 ± 1.1 | 6.1% | 7.2% | 75.6% | — | 80.9% | — |
| Grok 4 Fast | 19.3 ± 1.1 | 17.6 ± 1.5 | 7.5% | 11.1% | — | — | — | — |
| InternVL3 78B | 22.0 ± 0.8 | 23.6 ± 3.4 | 8.33% | 12.22% | 72.2% | — | 79.0% | 46.0% |
| Human (qualifying cutoff) | — | — | — | 20.6% | 76.2% | 73% | — | — |
| Gemma 3 27B | 29.8 ± 0.3 | 30.6 ± 0.7 | 18.6% | 21.9% | 64.9% | — | 63.3% | — |
| Llama4 Scout 109B | 40.4 ± 0.5 | 37.1 ± 2.3 | 28.1% | 33.9% | 69.4% | — | 70.7% | — |
| Qwen3 VL 235B Instruct | 44.6 ± 0.4 | 46.2 ± 4.9 | 36.1% | 37.2% | 78.7% | 68.1% | 84.9% | — |
| Llama4 Maverick 400B | 50.9 ± 0.3 | 43.8 ± 2.8 | 32.8% | 35.8% | 73.4% | — | 73.7% | — |
| Claude Sonnet 4.5 | 57.5 ± 0.7 | 56.8 ± 4.5 | 49.7% | 52.7% | 77.8% | — | — | — |
| OpenAI o3 | 77.4 ± 0.7 | 72.7 ± 3.7 | 66.7% | 66.9% | 82.9% | 76.4% | 81.1% | 78.6% |
| GPT-5-mini 'Medium' | 75.3 ± 0.7 | 71.3 ± 6.7 | 70.8% | 71.94% | 80.0% | — | — | — |
| Gemini 2.5 Pro | 81.2 ± 0.7 | 77.4 ± 0.8 | 70.6% | 79.2% | 84.0% | 71.2% | 84.6% | — |
| Gemini 2.5 Flash 09-2025 | 82.6 ± 0.4 | 79.8 ± 1.2 | 83.3% | 83.3% | 79.7% | — | 81.2% | — |
| GPT-5 | 83.9 ± 0.6 | 79.5 ± 1.5 | 80.8% | 81.9% | 84.2% | 78.4% | 82.7% | 81.1% |
| Human (Top 10 performers) | — | — | — | 90.1% | 82.6% | 80.8% | — | — |
| Human (Rank 1, Topper) | — | — | — | 92.2% | 88.6% | 85.4% | 60.3% | 80.5% |
| Δ over human (Human – Best model) | — | — | — | 10.1% | 4.4% | 7% | –24.3% | –0.6% |

Note: For mmJEE-Eval, Acc. % (Full) is Pass@1 accuracy on the full set of 1,460 questions; Acc. % (2025 set) is Pass@1 accuracy on the held-out 2025 subset; Marks (%) is the total score on the two papers of JEE Advanced 2025 under the official marking scheme; and Marks w/ CT (%) is the confidence-thresholded score. For the other benchmarks, we source Pass@1 accuracies from the respective leaderboards.
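
To make the Marks (%) and Marks w/ CT (%) columns concrete, the sketch below scores a list of answers under a simplified JEE-style scheme (+4 for a correct answer, -1 for an incorrect one, 0 for an unattempted one) and, when a confidence threshold is given, leaves low-confidence answers unattempted. This is a minimal illustration only: the official JEE Advanced 2025 scheme also includes partial credit and question-type-specific rules, and the field names, marks values, and threshold here are assumptions rather than the paper's exact procedure.

```python
def marks_percentage(preds, threshold=None, correct=4, wrong=-1):
    """Score records of the form {'confidence': float, 'is_correct': bool}.

    With a threshold, answers below it count as unattempted (0 marks),
    mimicking a confidence-thresholded submission; without one, every
    answer is scored. Returns the total as a percentage of full marks.
    """
    total = 0
    for p in preds:
        if threshold is not None and p["confidence"] < threshold:
            continue  # leave the question unattempted
        total += correct if p["is_correct"] else wrong
    full_marks = correct * len(preds)
    return 100.0 * total / full_marks

preds = [
    {"confidence": 0.9, "is_correct": True},
    {"confidence": 0.3, "is_correct": False},
    {"confidence": 0.8, "is_correct": True},
]
print(round(marks_percentage(preds), 1))                 # 58.3 -> (4 - 1 + 4) / 12
print(round(marks_percentage(preds, threshold=0.5), 1))  # 66.7 -> (4 + 0 + 4) / 12
```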