mmJEE-Eval

A Bilingual Multimodal Benchmark for Evaluating Scientific Reasoning in Vision-Language Models

¹Kalinga Institute of Industrial Technology (KIIT), ²Indian Institute of Technology (IIT) Bhubaneswar
*(Work done while at IIT Bhubaneswar)

IJCNLP-AACL 2025 (Findings)
NeurIPS 2025 MATH-AI Workshop (Poster, Non-Archival)
🤗 Dataset · Leaderboard · Code · arXiv

Introduction

Contemporary vision-language models (VLMs) perform well on existing multimodal reasoning benchmarks (78-85% accuracy on MMMU and MathVista), yet these results fail to sufficiently distinguish genuine scientific reasoning from pattern matching. To address this gap, we introduce mmJEE-Eval, a multimodal, bilingual (English and Hindi) benchmark comprising 1,460 questions from India's JEE Advanced examination (2019-2025), spanning the pre-college Physics, Chemistry, and Mathematics domains. Our evaluation of 17 state-of-the-art models reveals that while frontier VLMs (GPT-5, Gemini 2.5 Pro/Flash) achieve 77-84% accuracy on held-out 2025 questions, open-source models plateau at 37-45% despite scaling to 400B parameters, a significant gap not observed on existing benchmarks. Moreover, although closed frontier models from Google and OpenAI reach high problem-solving accuracy (up to 100% pass@3 and pass@5), they collapse when the meta-cognitive load is increased (GPT-5 corrects just 5.2% of its errors). Systematic ablations show that mmJEE-Eval's difficulty stems from problem complexity and reasoning depth rather than memorization. In effect, our benchmark separates models with superior training and reasoning methodologies where existing benchmarks fail to discriminate. We publicly release our code and data: https://github.com/ArkaMukherjee0/mmJEE-Eval
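
The pass@k figures quoted above can be estimated with the standard unbiased combinatorial estimator (1 − C(n−c, k)/C(n, k) over n sampled attempts with c correct). The Python sketch below is purely illustrative: the function name and the per-question sampling counts are assumptions, and it may not match the paper's exact evaluation protocol.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k answers
    drawn (without replacement) from n sampled attempts is correct, given
    that c of the n attempts were correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct attempt
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical usage: 5 sampled answers to one question, 2 of them correct.
print(pass_at_k(n=5, c=2, k=3))  # pass@3 for this question -> 0.9
print(pass_at_k(n=5, c=2, k=1))  # pass@1 for this question -> 0.4
```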

Leaderboard

Performance comparison on mmJEE-Eval and other industry-standard benchmarks

Rows cover closed-source models, open-source models, and human baselines. The first four score columns report mmJEE-Eval (ours); the remaining four report industry-standard benchmarks.

| Model | Acc. % (Full) | Acc. % (2025 set) | Marks (%) | Marks w/ CT (%) | MMMU | MMMU Pro | MathVista | CharXiv |
|---|---|---|---|---|---|---|---|---|
| Random choice | 8.3 ± 0.3 | 9.3 ± 1.6 | -12.7% | -1.9% | 22.1% | 12.6% | — | 10.8% |
| Aya Vision 8B | 8.5 ± 0.6 | 9.2 ± 1.0 | -12.5% | -4.4% | 39.9% | — | — | — |
| Kimi VL Thinking 2506 16B | 9.8 ± 0.2 | 11.2 ± 2.7 | -4.44% | -3.06% | 64.0% | 46.3% | 68.7% | — |
| Qwen 2.5 VL 7B | 10.9 ± 0.8 | 12.4 ± 1.9 | -5.83% | 1.4% | 58.6% | 46.2% | 68.2% | 42.5% |
| InternVL3 8B | 10.8 ± 0.5 | 12.5 ± 1.3 | -6.1% | 3.6% | 62.7% | — | 71.6% | 37.6% |
| InternVL 3.5 14B | 15.6 ± 0.9 | 16.1 ± 1.6 | 5.0% | 7.2% | 73.3% | — | 80.5% | — |
| InternVL 3.5 30B | 18.8 ± 1.2 | 18.8 ± 1.1 | 6.1% | 7.2% | 75.6% | — | 80.9% | — |
| Grok 4 Fast | 19.3 ± 1.1 | 17.6 ± 1.5 | 7.5% | 11.1% | — | — | — | — |
| InternVL3 78B | 22.0 ± 0.8 | 23.6 ± 3.4 | 8.33% | 12.22% | 72.2% | — | 79.0% | 46.0% |
| Human (qualifying cutoff) | — | — | — | 20.6% | 76.2% | 73% | — | — |
| Gemma 3 27B | 29.8 ± 0.3 | 30.6 ± 0.7 | 18.6% | 21.9% | 64.9% | — | 63.3% | — |
| Llama4 Scout 109B | 40.4 ± 0.5 | 37.1 ± 2.3 | 28.1% | 33.9% | 69.4% | — | 70.7% | — |
| Qwen3 VL 235B Instruct | 44.6 ± 0.4 | 46.2 ± 4.9 | 36.1% | 37.2% | 78.7% | 68.1% | 84.9% | — |
| Llama4 Maverick 400B | 50.9 ± 0.3 | 43.8 ± 2.8 | 32.8% | 35.8% | 73.4% | — | 73.7% | — |
| OpenAI o3 | 77.4 ± 0.7 | 72.7 ± 3.7 | 66.7% | 66.9% | 82.9% | 76.4% | 81.1% | 78.6% |
| GPT-5-mini 'Medium' | 75.3 ± 0.7 | 71.3 ± 6.7 | 70.8% | 71.94% | 80.0% | — | — | — |
| Gemini 2.5 Pro | 81.2 ± 0.7 | 77.4 ± 0.8 | 70.6% | 79.2% | 84.0% | 71.2% | 84.6% | — |
| Gemini 2.5 Flash 09-2025 | 82.6 ± 0.4 | 79.8 ± 1.2 | 83.3% | 83.3% | 79.7% | — | 81.2% | — |
| GPT-5 | 83.9 ± 0.6 | 79.5 ± 1.5 | 80.8% | 81.9% | 84.2% | 78.4% | 82.7% | 81.1% |
| Human (Top 10 performers) | — | — | — | 90.1% | 82.6% | 80.8% | — | — |
| Human (Rank 1, Topper) | — | — | — | 92.2% | 88.6% | 85.4% | 60.3% | 80.5% |
| Δ over human (Human – Best model) | — | — | — | 10.1% | 4.4% | 7% | –24.3% | –0.6% |

Note: For mmJEE-Eval, Acc. % (Full) is Pass@1 accuracy on the full set of 1,460 questions; Acc. % (2025 set) is Pass@1 accuracy on the held-out 2025 subset; Marks (%) is the total score on the two papers of JEE Advanced 2025 under the official marking scheme; and Marks w/ CT (%) is the confidence-thresholded score. For the other benchmarks, we source Pass@1 accuracies from the respective leaderboards.
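
As an illustration of how a confidence-thresholded score might be computed, the sketch below assumes the model reports a confidence with each answer and that low-confidence answers are treated as unattempted instead of risking negative marks. The function name, threshold, and mark values are hypothetical and are not the official JEE Advanced marking scheme, which varies by question type and year.

```python
def confidence_thresholded_score(predictions, threshold=0.7,
                                 correct_marks=4, wrong_marks=-1):
    """Score a list of (is_correct, confidence) pairs.

    Answers whose confidence falls below `threshold` are treated as
    unattempted (0 marks); the rest earn `correct_marks` if correct and
    `wrong_marks` otherwise. The mark values are illustrative only and
    do not reproduce the official JEE Advanced scheme.
    """
    total = 0
    for is_correct, confidence in predictions:
        if confidence < threshold:
            continue  # abstain: no marks gained or lost
        total += correct_marks if is_correct else wrong_marks
    return total

# Hypothetical usage: four answers, one skipped for low confidence.
preds = [(True, 0.92), (False, 0.55), (False, 0.81), (True, 0.88)]
print(confidence_thresholded_score(preds))  # 4 - 1 + 4 = 7
```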