URSA:
Understanding and Verifying Chain-of-Thought Reasoning in Multimodal Mathematics

Ruilin Luo➹*, Zhuofan Zheng❀*, Yifan Wang, Yiyao Yu,
Xinzhe Ni, Zicheng Lin, Jin Zeng❀†, Yujiu Yang➹†
➹Tsinghua University   ❀ByteDance
*Equal Contribution   †Corresponding Author

Open-source work is still in progress! The model and code are under review, and the training data is already available!

Abstract

Chain-of-thought (CoT) reasoning has been widely applied to the mathematical reasoning of large language models (LLMs). Recently, derivative process supervision methods have sparked discussion of scaling capabilities at test time. However, in multimodal mathematical reasoning, the scarcity of high-quality CoT training data has hindered existing models from achieving high-precision CoT reasoning and has limited the realization of reasoning potential at test time. In this work, we propose a three-module synthesis strategy that integrates CoT distillation, trajectory-format rewriting, and format unification, and with it we contribute MMathCoT-1M, a high-quality CoT reasoning instruction fine-tuning dataset for multimodal mathematics. We comprehensively validate the state-of-the-art (SOTA) performance of the trained URSA-7B on multiple multimodal mathematical reasoning benchmarks. For test-time scaling, we introduce a data synthesis strategy that automatically generates process annotation datasets focusing on both interpretation and logic. By further training from URSA-7B on this data, the resulting model inherits URSA-7B's CoT reasoning capability and gains robust supervision ability. The trained URSA-RM-7B acts as a verifier that effectively enhances the performance of URSA-7B at test time on multimodal mathematical reasoning and achieves superior results. Model weights and training data will be fully open-sourced.


Figure 1: Comparison of reasoning abilities across different disciplines and under varying modalities of information. We use MathVista and MathVerse as benchmarks.

Overview

In this work, we first perform vision-language alignment using the roughly 1M collected alignment data. We then train the multimodal mathematical reasoning model URSA-7B and the reward model URSA-RM-7B on the synthesized MMathCoT-1M and DualMath-1.1M datasets, respectively.
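For orientation, the following is a minimal, illustrative sketch of this three-stage pipeline; the stage names, the `train_stage` helper, and the base-model placeholder are hypothetical and do not correspond to the released training code.

```python
# Illustrative three-stage recipe (hypothetical helpers, not the released code):
# 1) vision-language alignment, 2) CoT instruction tuning -> URSA-7B,
# 3) process reward model training initialized from URSA-7B -> URSA-RM-7B.

STAGES = [
    ("vl_alignment", "alignment data (~1M)", "aligned vision-language backbone"),
    ("math_cot_sft", "MMathCoT-1M",          "URSA-7B (CoT reasoning model)"),
    ("prm_training", "DualMath-1.1M",        "URSA-RM-7B (process reward model)"),
]

def train_stage(model, dataset):
    """Placeholder for one fine-tuning stage: load `dataset`, build batches,
    and optimize `model`; all details are omitted in this sketch."""
    return model

model = "pretrained multimodal LLM"  # hypothetical starting checkpoint
for name, dataset, product in STAGES:
    model = train_stage(model, dataset)
    print(f"{name}: trained on {dataset} -> {product}")
```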


Figure 2: Overview of model training process.

Data Synthesis

VL-alignment Data: This portion of the data is collected from existing sources, including MAVIS_Caption, Multimath-EN, and Geo170k-Alignment, totaling 960K samples.
Math SFT Data: We collect and categorize existing open-source data and apply different enhancement strategies (a minimal sketch of this routing appears after this list): (1) Answer-only: this portion is distilled by Gemini-1.5-Flash-002 into CoT reasoning trajectories; (2) Analysis-formatted: this portion is rewritten by Gemini-1.5-Flash-002 to enhance the diversity of CoT logical organization and language; (3) CoT-formatted: this portion is standardized, for example by rewriting structured mathematical language into a natural-language format that interleaves thought processes with mathematical formulas, thereby leveraging natural language understanding to better learn CoT reasoning. Finally, we obtain MMathCoT-1M.
PRM Training Data: The training data for the PRM comes from process-labeled data built from two perspectives: a binary error localization engine and a misreading insertion engine (a hedged sketch of this dual labeling follows Figure 3). This allows PRM training in multimodal scenarios to focus simultaneously on logical correctness and clear image understanding. Following this approach, we organize DualMath-1.1M.
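The snippet below is a hedged sketch of how the three-module synthesis could be wired together: answer-only items are distilled into CoT trajectories, analysis-formatted items are rewritten for diversity, and CoT-formatted items are standardized into natural-language reasoning. The `call_llm` helper and the prompt wording are hypothetical stand-ins, not the actual pipeline.

```python
# Hedged sketch of the three-module CoT synthesis (distillation, rewriting,
# format unification). `call_llm` is a hypothetical wrapper around a teacher
# model such as Gemini-1.5-Flash-002; prompts are illustrative only.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to the teacher model."""
    raise NotImplementedError

PROMPTS = {
    # Answer-only data -> distill a full CoT trajectory ending at the answer.
    "answer_only": (
        "Solve the problem step by step and end with the final answer.\n"
        "Problem: {question}\nGround-truth answer: {answer}"
    ),
    # Analysis-formatted data -> rewrite into a trajectory-style solution
    # to diversify the logical organization and language.
    "analysis": (
        "Rewrite the following analysis as a step-by-step reasoning "
        "trajectory in your own words.\nProblem: {question}\nAnalysis: {solution}"
    ),
    # CoT-formatted data -> unify format: structured mathematical language
    # becomes natural-language thoughts interleaved with formulas.
    "cot": (
        "Rewrite this solution so that natural-language reasoning is "
        "interleaved with the formulas, keeping every step correct.\n"
        "Problem: {question}\nSolution: {solution}"
    ),
}

def synthesize(sample: dict) -> dict:
    """Route one sample through the module matching its source category."""
    prompt = PROMPTS[sample["category"]].format(**sample)
    sample["cot_trajectory"] = call_llm(prompt)
    return sample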


Figure 3: Data sources used by the URSA-7B model during the VL-alignment and SFT phases.
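For the PRM training data described above (DualMath-1.1M), the sketch below illustrates, under stated assumptions, how step-level labels from the two perspectives might be assembled: a logic view that marks everything from the first incorrect step onward as negative, and an interpretation view that injects a visual misreading into a step and labels it as the error. `locate_first_logic_error` and `insert_misreading` are hypothetical placeholders for the two engines, not the released implementation.

```python
# Hedged sketch of assembling DualMath-style PRM training examples from two
# labeling perspectives; the two engine functions are hypothetical placeholders.

from typing import List, Tuple

def locate_first_logic_error(steps: List[str], answer: str) -> int:
    """Placeholder: return the index of the first logically incorrect step
    (or len(steps) if the trajectory is fully correct)."""
    raise NotImplementedError

def insert_misreading(steps: List[str], image_caption: str) -> Tuple[List[str], int]:
    """Placeholder: rewrite one step so it misreads a visual condition and
    return the corrupted trajectory plus the index of the corrupted step."""
    raise NotImplementedError

def label_trajectory(steps: List[str], first_error: int) -> List[Tuple[str, int]]:
    """Steps before the first error are positives (1), the rest negatives (0)."""
    return [(s, 1 if i < first_error else 0) for i, s in enumerate(steps)]

def build_dual_labels(steps, answer, image_caption):
    # Logic view: find where the reasoning first goes wrong.
    logic_examples = label_trajectory(steps, locate_first_logic_error(steps, answer))
    # Interpretation view: inject a visual misreading and label it as the error.
    corrupted, err_idx = insert_misreading(steps, image_caption)
    interp_examples = label_trajectory(corrupted, err_idx)
    return logic_examples + interp_examples
```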

Comparison on MathVista and MathVerse

Accuracy scores on the testmini subsets of MathVista (1,000 examples) and MathVerse (4,728 examples). For MathVista, GPS, MWP, FQA, TQA, and VQA denote the geometry problem solving, math word problem, figure QA, textbook QA, and visual QA subsets; for MathVerse, TD, TL, TO, VI, VD, and VO denote the Text Dominant, Text Lite, Text Only, Vision Intensive, Vision Dominant, and Vision Only versions.

Model | MathVista: ALL, GPS, MWP, FQA, TQA, VQA | MathVerse: ALL, TD, TL, TO, VI, VD, VO
Baselines
Random 17.9 21.6 3.8 18.2 19.6 26.3 12.4 12.4 12.4 12.4 12.4 12.4 12.4
Human 60.3 48.4 73.0 59.7 63.2 55.9 64.9 71.2 70.9 41.7 61.4 68.3 66.7
Closed-Source MLLMs
GPT-4o 63.8 - - - - - - - - - - - -
GPT-4V 49.9 50.5 57.5 43.1 65.2 38.0 54.4 63.1 56.6 60.3 51.4 32.8 50.3
Gemini-1.5-Flash-002 58.4 - - - - - - - - - - - -
Gemini-1.5-Pro 63.9 - - - - - 35.3 39.8 34.7 44.5 32.0 36.8 33.3
Claude-3.5-Sonnet 67.7 - - - - - - - - - - - -
Qwen-VL-Plus 43.3 35.5 31.2 54.6 48.1 51.4 21.3 26.0 21.2 25.2 18.5 19.1 21.8
Open-Source General MLLMs
mPLUG-Owl2-7B 22.2 23.6 10.2 22.7 27.2 27.9 10.3 11.6 11.4 13.8 11.1 9.4 8.0
MiniGPT4-7B 23.1 26.0 13.4 18.6 30.4 30.2 12.2 12.3 12.9 13.4 12.5 14.8 8.7
LLaVA-1.5-13B 27.7 22.7 18.9 23.8 43.0 30.2 12.7 17.1 12.0 22.6 12.6 12.7 9.0
SPHINX-V2-13B 36.7 16.4 23.1 54.6 41.8 43.0 16.1 20.8 14.1 14.0 16.4 15.6 16.2
LLaVA-NeXT-34B 46.5 - - - - - 34.6 49.0 37.6 30.1 35.2 28.9 22.4
InternLM-XComposer2-VL-7B 57.6 63.0 73.7 55.0 56.3 39.7 25.9 36.9 28.3 42.5 20.1 24.4 19.8
Deepseek-VL 34.9 28.4 55.9 26.8 32.9 34.6 19.3 23.0 23.2 23.1 20.2 18.4 11.8
InternVL2-8B 58.3 62.0 59.1 58.7 61.4 49.7 35.9 39.0 33.8 36.0 32.2 30.9 27.7
Qwen2-VL 58.9 40.9 64.0 69.1 60.1 58.1 33.6 37.4 33.5 35.0 31.3 30.3 28.1
Open-Source Math MLLMs
G-LLaVA-7B 25.1 48.7 3.6 19.1 25.0 28.7 16.6 20.9 20.7 21.1 17.2 14.6 9.4
Math-LLaVA-13B 46.6 57.7 56.5 37.2 51.3 33.5 22.9 27.3 24.9 27.0 24.5 21.7 16.1
Math-PUMA-Qwen2-7B 47.9 48.1 68.3 46.5 46.2 30.2 33.6 42.1 35.0 39.8 33.4 31.6 26.0
Math-PUMA-DeepSeek-Math 44.7 39.9 67.7 42.8 42.4 31.3 31.8 43.4 35.4 47.5 33.6 31.6 14.7
MAVIS-7B - 64.1 - - - - 35.2 43.2 37.2 - 34.1 29.7 31.8
InfiMM-Math - - - - - - 34.5 46.7 32.4 - 38.1 32.4 15.8
MultiMath-7B 50.0 66.8 61.8 40.1 50.0 33.0 27.7 34.8 30.8 35.3 28.1 25.9 15.0
Ours
URSA-7B 59.8 79.3 75.3 44.6 63.9 40.2 45.7 55.3 48.3 51.8 46.4 43.9 28.6
∆ over SOTA Open-Source Math MLLMs +9.8 +12.5 +7.0 -1.9 +12.6 +6.7 +10.5 +8.6 +11.1 +4.3 +8.3 +11.5 -3.2

Comparison on WE-MATH

Performance comparison on the four-dimensional metrics of the WE-MATH testmini reasoning evaluation, reported under both strict and loose criteria. IK, IG, CM, and RM denote Insufficient Knowledge, Inadequate Generalization, Complete Mastery, and Rote Memorization, respectively; ↑/↓ indicate whether higher or lower is better.

Model | Strict: AVG↑, IK↓, IG↓, CM↑, RM↓ | Loose: AVG↑, IK↓, IG↓, CM↑, RM↓
Closed-source MLLMs
Qwen-VL-Max 10.5 65.1 7.6 6.7 75.5 25.5 65.1 7.6 21.7 20.3
Gemini-1.5-Pro 26.4 42.9 11.2 20.8 54.8 46.0 42.9 11.2 40.4 12.0
GPT-4V 31.1 39.8 14.5 23.8 47.9 51.4 39.8 14.5 44.2 3.3
GPT-4o 42.9 31.2 15.2 35.2 34.2 60.6 31.2 15.2 53.0 1.1
Open-source General MLLMs
LLaVA-1.6-7B 3.3 78.3 2.5 2.1 89.1 13.8 78.3 2.5 12.6 34.7
LLaVA-1.6-13B 5.2 69.1 3.2 3.6 86.9 22.0 69.1 3.2 20.4 26.2
InternVL-Chat-V1.5-26B 12.7 56.4 10.5 7.4 77.6 31.0 56.4 10.5 25.7 22.4
LLaVA-NeXT-72B 13.4 58.9 7.1 9.9 71.0 31.5 58.9 7.1 28.0 17.9
DeepSeek-VL-7B 6.3 69.1 4.6 4.0 84.8 21.0 69.1 4.6 18.7 29.0
Phi3-Vision-4.2B 10.6 58.9 9.0 6.1 81.1 29.8 58.9 9.0 25.3 21.3
GLM-4V-9B 14.9 53.0 9.5 10.1 73.1 35.1 53.0 9.5 30.3 19.3
InternLM-XComposer2-VL-7B 12.7 56.4 10.5 7.4 77.6 31.0 56.4 10.5 25.7 22.4
InternVL2-8B 26.6 45.5 13.5 19.8 51.6 44.9 45.5 13.5 38.1 7.0
Qwen2-VL (7B) 25.6 47.1 14.7 18.3 52.2 43.0 47.1 14.7 35.6 7.0
Open-source Math MLLMs
G-LLaVA-13B 6.5 64.2 4.6 4.2 86.6 22.3 64.2 4.6 20.0 36.0
Math-LLaVA-13B 11.1 62.1 3.6 9.3 72.8 31.3 62.1 3.6 29.5 13.9
Math-PUMA-Qwen2-7B 19.2 47.8 13.7 12.4 67.8 41.0 47.8 13.7 34.1 11.4
Math-PUMA-DeepSeek-Math-7B 15.6 56.0 7.2 12.0 67.4 35.8 56.0 7.2 32.2 12.4
InfiMM-Math 20.6 48.8 12.2 15.2 61.7 - - - - -
Ours
URSA-7B 32.2 37.5 10.7 26.9 48.2 53.5 37.5 10.7 48.2 7.0
∆ over SOTA Open-Source Math MLLMs +11.6 +10.3 -7.1 +11.7 +13.5 +12.5 +10.3 -7.1 +14.1 +4.4

Test-time Scaling

High Diversity: We show the Pass@N performance of URSA-7B below. With only 4 samples, URSA-7B can achieve a pass rate of 90.9 on the GPS task of MathVista. Its Pass@64 performance on MathVista is 97.1, indicating that URSA-7B already has substantial coverage of geometric problem-solving strategies. On the MathVerse dataset, URSA-7B also demonstrates a high upper limit of reasoning capability.
We also compare the Pass@N of MultiMath-7B and find that URSA-7B has a significantly higher upper limit of reasoning performance. On MathVerse, the Pass@64 of URSA-7B shows a relative improvement of 74.2% over single inference, while MultiMath-7B improves by only 58.5%.
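For reference, the function below implements the standard unbiased pass@k estimator from the Codex evaluation methodology: given n samples per problem of which c are correct, it estimates the probability that at least one of k randomly drawn samples is correct. Whether URSA's Pass@N numbers use exactly this estimator or a direct empirical count is not stated here; this is only a sketch of the common computation.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: 1 - C(n-c, k) / C(n, k),
    where n = samples drawn per problem and c = number of correct samples."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Benchmark-level Pass@N averages the per-problem estimate, e.g.:
# scores = [pass_at_k(n=64, c=correct_counts[i], k=4) for i in range(num_problems)]
# pass_at_4 = sum(scores) / len(scores)
```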

Great Verifier: Best-of-N answer selection with URSA-RM-7B significantly surpasses self-consistency in accuracy, making it a better reward-model tool for test-time scaling in multimodal mathematical reasoning.
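To make the comparison concrete, here is a hedged sketch of the two test-time strategies: self-consistency takes a majority vote over sampled final answers, while verifier-based best-of-N scores each sampled trajectory with a process reward model and returns the answer of the highest-scoring one. The `prm_step_scores` callable and the min-over-steps aggregation are assumptions for illustration (min is one common aggregation choice), not necessarily URSA-RM-7B's exact scoring rule.

```python
from collections import Counter
from typing import Callable, List

def self_consistency(answers: List[str]) -> str:
    """Majority vote over the final answers of N sampled solutions."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(trajectories: List[List[str]], answers: List[str],
              prm_step_scores: Callable[[List[str]], List[float]]) -> str:
    """Return the answer whose trajectory the reward model scores highest.
    Each trajectory is scored step by step; the minimum step score is used
    as the trajectory score (one common aggregation choice)."""
    scores = [min(prm_step_scores(steps)) for steps in trajectories]
    best = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best]
```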

Extensible OOD Capability: We choose MultiMath-7B as the base model, as it produces CoT solutions relatively stably after mathematical instruction tuning. We find that URSA-RM-7B effectively serves as a verifier for MultiMath's CoT solutions, surpassing the performance of self-consistency.

BibTeX

@misc{2501.04686,
  Author = {Ruilin Luo and Zhuofan Zheng and Yifan Wang and Yiyao Yu and Xinzhe Ni and Zicheng Lin and Jin Zeng and Yujiu Yang},
  Title = {URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics},
  Year = {2025},
  Eprint = {arXiv:2501.04686},
}