mmJEE-Eval
A Bilingual Multimodal Benchmark for Evaluating Scientific Reasoning in Vision-Language Models
🤗 Dataset · Leaderboard · Code · arXiv

Introduction
Contemporary vision-language models (VLMs) perform well on existing multimodal reasoning benchmarks (78-85% accuracy on MMMU and MathVista), yet these results fail to distinguish genuine scientific reasoning from pattern matching. To address this gap, we introduce mmJEE-Eval, a multimodal, bilingual (English and Hindi) benchmark comprising 1,460 questions from India's JEE Advanced examination (2019-2025), spanning pre-college Physics, Chemistry, and Mathematics. Our evaluation of 17 state-of-the-art models reveals that while frontier VLMs (GPT-5, Gemini 2.5 Pro/Flash) achieve 77-84% accuracy on held-out 2025 questions, open-source models plateau at 37-45% despite scaling to 400B parameters, a gap that existing benchmarks do not reveal. Although closed frontier models from Google and OpenAI reach high problem-solving accuracies (up to 100% pass@3 and pass@5), they collapse when the meta-cognitive load is increased (GPT-5 fixes just 5.2% of its errors). Systematic ablations show that mmJEE-Eval's difficulty stems from complexity and reasoning depth rather than memorization. In effect, our benchmark separates superior training and reasoning methodologies where existing alternatives cannot. We publicly release our code and data: https://github.com/ArkaMukherjee0/mmJEE-Eval
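As a reference point for the pass@k numbers quoted above, one common way to compute them is the standard unbiased pass@k estimator; the sketch below is illustrative only and may not match the exact protocol used for mmJEE-Eval (the function name and sample counts are our own assumptions).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator: the probability that at least
    one of k attempts, drawn without replacement from n sampled attempts
    of which c are correct, answers the question correctly."""
    if n - c < k:
        return 1.0  # every size-k draw necessarily contains a correct attempt
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical usage: 5 sampled attempts on a question, 2 of them correct
print(pass_at_k(n=5, c=2, k=3))  # 0.9
```

With n = k (e.g., exactly three attempts for pass@3), this reduces to simply checking whether any attempt was correct.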
Leaderboard
Performance comparison on mmJEE-Eval and other industry-standard benchmarks. The first four metric columns report mmJEE-Eval (ours); the remaining columns (MMMU, MMMU Pro, MathVista, CharXiv) are industry-standard benchmarks.

| Model | Acc. % (Full) | Acc. % (2025 set) | Marks (%) | Marks w/ CT (%) | MMMU | MMMU Pro | MathVista | CharXiv |
|---|---|---|---|---|---|---|---|---|
| Random choice | 8.3 ± 0.3 | 9.3 ± 1.6 | -12.7% | -1.9% | 22.1% | 12.6% | — | 10.8% |
| Aya Vision 8B | 8.5 ± 0.6 | 9.2 ± 1.0 | -12.5% | -4.4% | 39.9% | — | — | — |
| Kimi VL Thinking 2506 16B | 9.8 ± 0.2 | 11.2 ± 2.7 | -4.44% | -3.06% | 64.0% | 46.3% | 68.7% | — |
| Qwen 2.5 VL 7B | 10.9 ± 0.8 | 12.4 ± 1.9 | -5.83% | 1.4% | 58.6% | 46.2% | 68.2% | 42.5% |
| InternVL3 8B | 10.8 ± 0.5 | 12.5 ± 1.3 | -6.1% | 3.6% | 62.7% | — | 71.6% | 37.6% |
| InternVL 3.5 14B | 15.6 ± 0.9 | 16.1 ± 1.6 | 5.0% | 7.2% | 73.3% | — | 80.5% | — |
| InternVL 3.5 30B | 18.8 ± 1.2 | 18.8 ± 1.1 | 6.1% | 7.2% | 75.6% | — | 80.9% | — |
| Grok 4 Fast | 19.3 ± 1.1 | 17.6 ± 1.5 | 7.5% | 11.1% | — | — | — | — |
| InternVL3 78B | 22.0 ± 0.8 | 23.6 ± 3.4 | 8.33% | 12.22% | 72.2% | — | 79.0% | 46.0% |
| Human (qualifying cutoff) | — | — | — | 20.6% | 76.2% | 73% | — | — |
| Gemma 3 27B | 29.8 ± 0.3 | 30.6 ± 0.7 | 18.6% | 21.9% | 64.9% | — | 63.3% | — |
| Llama4 Scout 109B | 40.4 ± 0.5 | 37.1 ± 2.3 | 28.1% | 33.9% | 69.4% | — | 70.7% | — |
| Qwen3 VL 235B Instruct | 44.6 ± 0.4 | 46.2 ± 4.9 | 36.1% | 37.2% | 78.7% | 68.1% | 84.9% | — |
| Llama4 Maverick 400B | 50.9 ± 0.3 | 43.8 ± 2.8 | 32.8% | 35.8% | 73.4% | — | 73.7% | — |
| OpenAI o3 | 77.4 ± 0.7 | 72.7 ± 3.7 | 66.7% | 66.9% | 82.9% | 76.4% | 81.1% | 78.6% |
| GPT-5-mini 'Medium' | 75.3 ± 0.7 | 71.3 ± 6.7 | 70.8% | 71.94% | 80.0% | — | — | — |
| Gemini 2.5 Pro | 81.2 ± 0.7 | 77.4 ± 0.8 | 70.6% | 79.2% | 84.0% | 71.2% | 84.6% | — |
| Gemini 2.5 Flash 09-2025 | 82.6 ± 0.4 | 79.8 ± 1.2 | 83.3% | 83.3% | 79.7% | — | 81.2% | — |
| GPT-5 | 83.9 ± 0.6 | 79.5 ± 1.5 | 80.8% | 81.9% | 84.2% | 78.4% | 82.7% | 81.1% |
| Human (Top 10 performers) | — | — | — | 90.1% | 82.6% | 80.8% | — | — |
| Human (Rank 1, Topper) | — | — | — | 92.2% | 88.6% | 85.4% | 60.3% | 80.5% |
| Δ over human (Human – Best model) | — | — | — | 10.1% | 4.4% | 7% | -24.3% | -0.6% |
Note: For mmJEE-Eval, Acc. % (Full) reports Pass@1 accuracy on the full set of 1,460 questions; Acc. % (2025 set) reports Pass@1 accuracy on the held-out 2025 subset; Marks (%) is the total score across the two papers of JEE Advanced 2025 under the official marking scheme; and Marks w/ CT (%) reports confidence-thresholded scores. For the other benchmarks, we source Pass@1 accuracies from the respective leaderboards.
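To illustrate how Marks w/ CT (%) could be derived, the snippet below is a minimal sketch assuming a simplified single-correct MCQ scheme of +3 for a correct answer, -1 for an incorrect one, and 0 for an abstention below a confidence threshold; the actual JEE Advanced marking scheme also includes multiple-correct and numerical questions with partial marking, and the threshold value and names here are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    predicted: str     # model's chosen option, e.g. "B"
    gold: str          # reference answer
    confidence: float  # model-reported confidence in [0, 1]

def thresholded_marks(attempts: list[Attempt], threshold: float = 0.5) -> int:
    """Illustrative confidence-thresholded scoring: questions the model is
    unsure about are skipped (0 marks, like an unattempted question);
    the rest are scored with a simplified +3 / -1 single-correct scheme."""
    total = 0
    for a in attempts:
        if a.confidence < threshold:
            continue  # abstain: no marks gained or lost
        total += 3 if a.predicted == a.gold else -1
    return total

# Hypothetical usage: report the score as a percentage of the maximum
attempts = [Attempt("B", "B", 0.9), Attempt("C", "A", 0.3), Attempt("D", "D", 0.7)]
print(100 * thresholded_marks(attempts) / (3 * len(attempts)))  # ~66.7
```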