URSA:
Understanding and Verifying Chain-of-Thought Reasoning in Multimodal Mathematics

Ruilin Luo➹*, Zhuofan Zheng❀*, Yifan Wang, Yiyao Yu,
Xinzhe Ni, Zicheng Lin, Jin Zeng❀†, Yujiu Yang➹†
➹Tsinghua University   ❀ByteDance
*Equal Contribution   †Corresponding Author

Open-source work is still in progress! The model and code are under review, and the training data is already available!

Abstract

Chain-of-thought (CoT) reasoning has been widely applied to the mathematical reasoning of large language models (LLMs). Recently, derivative process supervision methods have sparked discussion of scaling capabilities at test time. However, in multimodal mathematical reasoning, the scarcity of high-quality CoT training data has hindered existing models from achieving high-precision CoT reasoning and has limited the realization of reasoning potential at test time. In this work, we propose a three-module synthesis strategy that integrates CoT distillation, trajectory-format rewriting, and format unification, and with it we contribute MMathCoT-1M, a high-quality CoT reasoning instruction fine-tuning dataset for multimodal mathematics. We comprehensively validate the state-of-the-art (SOTA) performance of the trained URSA-7B on multiple multimodal mathematical reasoning benchmarks. For test-time scaling, we introduce a data synthesis strategy that automatically generates process annotation datasets focusing on both interpretation and logic. By further training from URSA-7B on this data, the resulting model inherits URSA-7B's CoT reasoning capability and gains robust supervision ability. The trained URSA-RM-7B acts as a verifier that effectively enhances the performance of URSA-7B at test time on multimodal mathematical reasoning and achieves superior results. Model weights and training data will be fully open-sourced.


Figure 1: Comparison of reasoning abilities across different disciplines and under varying modalities of information. We use MathVista and MathVerse as benchmarks.

Overview

In this work, we first perform vision-language alignment using the roughly 1M collected alignment data. We then train the multimodal mathematical reasoning model URSA-7B and the reward model URSA-RM-7B on the synthesized MMathCoT-1M and DualMath-1.1M datasets, respectively.
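For orientation, the following is a minimal, illustrative sketch of this three-stage pipeline; the stage names, the `train_stage` helper, and the base-model placeholder are hypothetical and do not correspond to the released training code.

```python
# Illustrative three-stage recipe (hypothetical helpers, not the released code):
# 1) vision-language alignment, 2) CoT instruction tuning -> URSA-7B,
# 3) process reward model training initialized from URSA-7B -> URSA-RM-7B.

STAGES = [
    ("vl_alignment", "alignment data (~1M)", "aligned vision-language backbone"),
    ("math_cot_sft", "MMathCoT-1M",          "URSA-7B (CoT reasoning model)"),
    ("prm_training", "DualMath-1.1M",        "URSA-RM-7B (process reward model)"),
]

def train_stage(model, dataset):
    """Placeholder for one fine-tuning stage: load `dataset`, build batches,
    and optimize `model`; all details are omitted in this sketch."""
    return model

model = "pretrained multimodal LLM"  # hypothetical starting checkpoint
for name, dataset, product in STAGES:
    model = train_stage(model, dataset)
    print(f"{name}: trained on {dataset} -> {product}")
```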


Figure 2: Overview of model training process.

Data Synthesis

VL-alignment Data: This portion of the data is collected from existing sources, including MAVIS_Caption, Multimath-EN, and Geo170k-Alignment, totaling 960K samples.
Math SFT Data: We collect and categorize existing open-source data and apply different enhancement strategies (a minimal sketch of this routing appears after this list): (1) Answer-only: this portion is distilled by Gemini-1.5-Flash-002 into CoT reasoning trajectories; (2) Analysis-formatted: this portion is rewritten by Gemini-1.5-Flash-002 to enhance the diversity of CoT logical organization and language; (3) CoT-formatted: this portion is standardized, for example by rewriting structured mathematical language into a natural-language format that interleaves thought processes with mathematical formulas, thereby leveraging natural language understanding to better learn CoT reasoning. Finally, we obtain MMathCoT-1M.
PRM Training Data: The training data for the PRM comes from process-labeled data built from two perspectives: a binary error localization engine and a misreading insertion engine (a hedged sketch of this dual labeling follows Figure 3). This allows PRM training in multimodal scenarios to focus simultaneously on logical correctness and clear image understanding. Following this approach, we organize DualMath-1.1M.
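The snippet below is a hedged sketch of how the three-module synthesis could be wired together: answer-only items are distilled into CoT trajectories, analysis-formatted items are rewritten for diversity, and CoT-formatted items are standardized into natural-language reasoning. The `call_llm` helper and the prompt wording are hypothetical stand-ins, not the actual pipeline.

```python
# Hedged sketch of the three-module CoT synthesis (distillation, rewriting,
# format unification). `call_llm` is a hypothetical wrapper around a teacher
# model such as Gemini-1.5-Flash-002; prompts are illustrative only.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to the teacher model."""
    raise NotImplementedError

PROMPTS = {
    # Answer-only data -> distill a full CoT trajectory ending at the answer.
    "answer_only": (
        "Solve the problem step by step and end with the final answer.\n"
        "Problem: {question}\nGround-truth answer: {answer}"
    ),
    # Analysis-formatted data -> rewrite into a trajectory-style solution
    # to diversify the logical organization and language.
    "analysis": (
        "Rewrite the following analysis as a step-by-step reasoning "
        "trajectory in your own words.\nProblem: {question}\nAnalysis: {solution}"
    ),
    # CoT-formatted data -> unify format: structured mathematical language
    # becomes natural-language thoughts interleaved with formulas.
    "cot": (
        "Rewrite this solution so that natural-language reasoning is "
        "interleaved with the formulas, keeping every step correct.\n"
        "Problem: {question}\nSolution: {solution}"
    ),
}

def synthesize(sample: dict) -> dict:
    """Route one sample through the module matching its source category."""
    prompt = PROMPTS[sample["category"]].format(**sample)
    sample["cot_trajectory"] = call_llm(prompt)
    return sample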


Figure 3: Data sources used by the URSA-7B model during the VL-alignment and SFT phases.
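For the PRM training data described above (DualMath-1.1M), the sketch below illustrates, under stated assumptions, how step-level labels from the two perspectives might be assembled: a logic view that marks everything from the first incorrect step onward as negative, and an interpretation view that injects a visual misreading into a step and labels it as the error. `locate_first_logic_error` and `insert_misreading` are hypothetical placeholders for the two engines, not the released implementation.

```python
# Hedged sketch of assembling DualMath-style PRM training examples from two
# labeling perspectives; the two engine functions are hypothetical placeholders.

from typing import List, Tuple

def locate_first_logic_error(steps: List[str], answer: str) -> int:
    """Placeholder: return the index of the first logically incorrect step
    (or len(steps) if the trajectory is fully correct)."""
    raise NotImplementedError

def insert_misreading(steps: List[str], image_caption: str) -> Tuple[List[str], int]:
    """Placeholder: rewrite one step so it misreads a visual condition and
    return the corrupted trajectory plus the index of the corrupted step."""
    raise NotImplementedError

def label_trajectory(steps: List[str], first_error: int) -> List[Tuple[str, int]]:
    """Steps before the first error are positives (1), the rest negatives (0)."""
    return [(s, 1 if i < first_error else 0) for i, s in enumerate(steps)]

def build_dual_labels(steps, answer, image_caption):
    # Logic view: find where the reasoning first goes wrong.
    logic_examples = label_trajectory(steps, locate_first_logic_error(steps, answer))
    # Interpretation view: inject a visual misreading and label it as the error.
    corrupted, err_idx = insert_misreading(steps, image_caption)
    interp_examples = label_trajectory(corrupted, err_idx)
    return logic_examples + interp_examples
```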

Comparison on MathVista and MathVerse

Accuracy scores on the testmini subsets of MathVista (1,000 examples) and MathVerse (4,728 examples). For MathVista, GPS, MWP, FQA, TQA, and VQA denote the geometry problem solving, math word problem, figure QA, textbook QA, and visual QA subsets; for MathVerse, TD, TL, TO, VI, VD, and VO denote the Text Dominant, Text Lite, Text Only, Vision Intensive, Vision Dominant, and Vision Only versions.

Model | MathVista: ALL, GPS, MWP, FQA, TQA, VQA | MathVerse: ALL, TD, TL, TO, VI, VD, VO
Baselines
Random 17.9 21.6 3.8 18.2 19.6 26.3 12.4 12.4 12.4 12.4 12.4 12.4 12.4
Human 60.3 48.4 73.0 59.7 63.2 55.9 64.9 71.2 70.9 41.7 61.4 68.3 66.7
Closed-Source MLLMs
GPT-4o 63.8 - - - - - - - - - - - -
GPT-4V 49.9 50.5 57.5 43.1 65.2 38.0 54.4 63.1 56.6 60.3 51.4 32.8 50.3
Gemini-1.5-Flash-002 58.4 - - - - - - - - - - - -
Gemini-1.5-Pro 63.9 - - - - - 35.3 39.8 34.7 44.5 32.0 36.8 33.3
Claude-3.5-Sonnet 67.7 - - - - - - - - - - - -
Qwen-VL-Plus 43.3 35.5 31.2 54.6 48.1 51.4 21.3 26.0 21.2 25.2 18.5 19.1 21.8
Open-Source General MLLMs
mPLUG-Owl2-7B 22.2 23.6 10.2 22.7 27.2 27.9 10.3 11.6 11.4 13.8 11.1 9.4 8.0
MiniGPT4-7B 23.1 26.0 13.4 18.6 30.4 30.2 12.2 12.3 12.9 13.4 12.5 14.8 8.7
LLaVA-1.5-13B 27.7 22.7 18.9 23.8 43.0 30.2 12.7 17.1 12.0 22.6 12.6 12.7 9.0
SPHINX-V2-13B 36.7 16.4 23.1 54.6 41.8 43.0 16.1 20.8 14.1 14.0 16.4 15.6 16.2
LLaVA-NeXT-34B 46.5 - - - - - 34.6 49.0 37.6 30.1 35.2 28.9 22.4
InternLM-XComposer2-VL-7B 57.6 63.0 73.7 55.0 56.3 39.7 25.9 36.9 28.3 42.5 20.1 24.4 19.8
Deepseek-VL 34.9 28.4 55.9 26.8 32.9 34.6 19.3 23.0 23.2 23.1 20.2 18.4 11.8
InternVL2-8B 58.3 62.0 59.1 58.7 61.4 49.7 35.9 39.0 33.8 36.0 32.2 30.9 27.7
Qwen2-VL 58.9 40.9 64.0 69.1 60.1 58.1 33.6 37.4 33.5 35.0 31.3 30.3 28.1
Open-Source Math MLLMs
G-LLaVA-7B 25.1 48.7 3.6 19.1 25.0 28.7 16.6 20.9 20.7 21.1 17.2 14.6 9.4
Math-LLaVA-13B 46.6 57.7 56.5 37.2 51.3 33.5 22.9 27.3 24.9 27.0 24.5 21.7 16.1
Math-PUMA-Qwen2-7B 47.9 48.1 68.3 46.5 46.2 30.2 33.6 42.1 35.0 39.8 33.4 31.6 26.0
Math-PUMA-DeepSeek-Math 44.7 39.9 67.7 42.8 42.4 31.3 31.8 43.4 35.4 47.5 33.6 31.6 14.7
MAVIS-7B - 64.1 - - - - 35.2 43.2 37.2 - 34.1 29.7 31.8
InfiMM-Math - - - - - - 34.5 46.7 32.4 - 38.1 32.4 15.8
MultiMath-7B 50.0 66.8 61.8 40.1 50.0 33.0 27.7 34.8 30.8 35.3 28.1 25.9 15.0
Ours
URSA-7B 59.8 79.3 75.3 44.6 63.9 40.2 45.7 55.3 48.3 51.8 46.4 43.9 28.6
∆ over SOTA Open-Source Math MLLMs +9.8 +12.5 +7.0 -1.9 +12.6 +6.7 +10.5 +8.6 +11.1 +4.3 +8.3 +11.5 -3.2

Comparison on WE-MATH

Performance comparison on the four-dimensional metrics of the WE-MATH testmini reasoning evaluation, reported under both strict and loose criteria. IK, IG, CM, and RM denote Insufficient Knowledge, Inadequate Generalization, Complete Mastery, and Rote Memorization, respectively; ↑/↓ indicate whether higher or lower is better.

Model | Strict: AVG↑, IK↓, IG↓, CM↑, RM↓ | Loose: AVG↑, IK↓, IG↓, CM↑, RM↓
Closed-source MLLMs
Qwen-VL-Max 10.5 65.1 7.6 6.7 75.5 25.5 65.1 7.6 21.7 20.3
Gemini-1.5-Pro 26.4 42.9 11.2 20.8 54.8 46.0 42.9 11.2 40.4 12.0
GPT-4V 31.1 39.8 14.5 23.8 47.9 51.4 39.8 14.5 44.2 3.3
GPT-4o 42.9 31.2 15.2 35.2 34.2 60.6 31.2 15.2 53.0 1.1
Open-source General MLLMs
LLaVA-1.6-7B 3.3 78.3 2.5 2.1 89.1 13.8 78.3 2.5 12.6 34.7
LLaVA-1.6-13B 5.2 69.1 3.2 3.6 86.9 22.0 69.1 3.2 20.4 26.2
InternVL-Chat-V1.5-26B 12.7 56.4 10.5 7.4 77.6 31.0 56.4 10.5 25.7 22.4
LLaVA-NeXT-72B 13.4 58.9 7.1 9.9 71.0 31.5 58.9 7.1 28.0 17.9
DeepSeek-VL-7B 6.3 69.1 4.6 4.0 84.8 21.0 69.1 4.6 18.7 29.0
Phi3-Vision-4.2B 10.6 58.9 9.0 6.1 81.1 29.8 58.9 9.0 25.3 21.3
GLM-4V-9B 14.9 53.0 9.5 10.1 73.1 35.1 53.0 9.5 30.3 19.3
InternLM-XComposer2-VL-7B 12.7 56.4 10.5 7.4 77.6 31.0 56.4 10.5 25.7 22.4
InternVL2-8B 26.6 45.5 13.5 19.8 51.6 44.9 45.5 13.5 38.1 7.0
Qwen2-VL (7B) 25.6 47.1 14.7 18.3 52.2 43.0 47.1 14.7 35.6 7.0
Open-source Math MLLMs
G-LLaVA-13B 6.5 64.2 4.6 4.2 86.6 22.3 64.2 4.6 20.0 36.0
Math-LLaVA-13B 11.1 62.1 3.6 9.3 72.8 31.3 62.1 3.6 29.5 13.9
Math-PUMA-Qwen2-7B 19.2 47.8 13.7 12.4 67.8 41.0 47.8 13.7 34.1 11.4
Math-PUMA-DeepSeek-Math-7B 15.6 56.0 7.2 12.0 67.4 35.8 56.0 7.2 32.2 12.4
InfiMM-Math 20.6 48.8 12.2 15.2 61.7 - - - - -
Ours
URSA-7B 32.2 37.5 10.7 26.9 48.2 53.5 37.5 10.7 48.2 7.0
∆ over SOTA Open-Source Math MLLMs +11.6 +10.3 -7.1 +11.7 +13.5 +12.5 +10.3 -7.1 +14.1 +4.4

Test-time Scaling

High Diversity: We show the Pass@N performance of URSA-7B below. With only 4 samples, URSA-7B can achieve a pass rate of 90.9 on the GPS task of MathVista. Its Pass@64 performance on MathVista is 97.1, indicating that URSA-7B already has substantial coverage of geometric problem-solving strategies. On the MathVerse dataset, URSA-7B also demonstrates a high upper limit of reasoning capability.
We also compare the Pass@N of MultiMath-7B and find that URSA-7B has a significantly higher upper limit of reasoning performance. On MathVerse, the Pass@64 of URSA-7B shows a relative improvement of 74.2% over single inference, while MultiMath-7B improves by only 58.5%.
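For reference, the function below implements the standard unbiased pass@k estimator from the Codex evaluation methodology: given n samples per problem of which c are correct, it estimates the probability that at least one of k randomly drawn samples is correct. Whether URSA's Pass@N numbers use exactly this estimator or a direct empirical count is not stated here; this is only a sketch of the common computation.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: 1 - C(n-c, k) / C(n, k),
    where n = samples drawn per problem and c = number of correct samples."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Benchmark-level Pass@N averages the per-problem estimate, e.g.:
# scores = [pass_at_k(n=64, c=correct_counts[i], k=4) for i in range(num_problems)]
# pass_at_4 = sum(scores) / len(scores)
```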

Great Verifier: Best-of-N answer selection with URSA-RM-7B significantly surpasses self-consistency in accuracy, making it a better reward-model tool for test-time scaling in multimodal mathematical reasoning.
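To make the comparison concrete, here is a hedged sketch of the two test-time strategies: self-consistency takes a majority vote over sampled final answers, while verifier-based best-of-N scores each sampled trajectory with a process reward model and returns the answer of the highest-scoring one. The `prm_step_scores` callable and the min-over-steps aggregation are assumptions for illustration (min is one common aggregation choice), not necessarily URSA-RM-7B's exact scoring rule.

```python
from collections import Counter
from typing import Callable, List

def self_consistency(answers: List[str]) -> str:
    """Majority vote over the final answers of N sampled solutions."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(trajectories: List[List[str]], answers: List[str],
              prm_step_scores: Callable[[List[str]], List[float]]) -> str:
    """Return the answer whose trajectory the reward model scores highest.
    Each trajectory is scored step by step; the minimum step score is used
    as the trajectory score (one common aggregation choice)."""
    scores = [min(prm_step_scores(steps)) for steps in trajectories]
    best = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best]
```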

Extensible OOD Capability: We choose MultiMath-7B as the base model, as it produces CoT solutions relatively stably after mathematical instruction tuning. We find that URSA-RM-7B effectively serves as a verifier for MultiMath's CoT solutions, surpassing the performance of self-consistency.

BibTeX

@misc{2501.04686,
  Author = {Ruilin Luo and Zhuofan Zheng and Yifan Wang and Yiyao Yu and Xinzhe Ni and Zicheng Lin and Jin Zeng and Yujiu Yang},
  Title = {URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics},
  Year = {2025},
  Eprint = {arXiv:2501.04686},
}