CoT-VTM: Chain-of-Thought Visual-to-Music Generation

Xikang Guan¹, Zheng Gu², Jing Huo¹*, Tianyu Ding³, Yang Gao¹

¹Nanjing University    ²Shenzhen University    ³Microsoft Corporation

Abstract

Applications of visual-to-music (VTM) generation are growing rapidly. However, current VTM methods struggle to capture the relationship between visuals and music in open-domain settings, mainly because of two challenges: the lack of large-scale, high-quality paired visual-music datasets and the absence of a direct semantic correspondence between visuals and music. In this work, we propose CoT-VTM, a framework that distills Chain-of-Thought (CoT) reasoning to enable visual-to-music generation without paired data, while efficiently producing music aligned with visual content in open-domain settings. We first bridge the gap between visual, music, and text data using appropriate foundation models. Next, we identify the key elements of the visual-music relationship and design a CoT prompt for visual-to-music mapping. To fully distill the CoT reasoning, we use latent information from the intermediate reasoning steps as supervisory signals alongside visual and music supervision. Finally, we design a two-stage mapping distillation training process: the first stage uses discriminative MLP modules, while the second uses a generative embedding diffusion model (EmbedDiff). Our model achieves the best performance on both image-to-music and video-to-music tasks.

Qualitative Results

Ground Truth

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Sample 8

Sample 9

Sample 10

Caption2music

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Sample 8

Sample 9

Sample 10

CoDi

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Sample 8

Sample 9

Sample 10

M2UGen

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Sample 8

Sample 9

Sample 10

VidMuse

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Sample 8

Sample 9

Sample 10

CoT-VTM

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Sample 8

Sample 9

Sample 10