Abstract

We present OpenMU-Bench, a large-scale benchmark suite for addressing the data scarcity issue in training multimodal language models to understand music. To construct OpenMU-Bench, we leveraged existing datasets and bootstrapped new annotations. OpenMU-Bench also broadens the scope of music understanding by including lyrics understanding and music tool usage. Using OpenMU-Bench, we trained our music understanding model, OpenMU, with extensive ablations, demonstrating that OpenMU outperforms baseline models such as MU-Llama. Both OpenMU and OpenMU-Bench are open-sourced to facilitate future research in music understanding and to enhance creative music production efficiency.

Music Understanding Examples





Input music clips Instruction OpenMU outputs
Give a short and clear description of the clip provided. This music clip features a fast tempo with a high valence, giving it a lively and positive feel. The energy level is moderate, striking a balance between calmness and excitement. The clip is highly danceable, likely due to its rhythmic stability and strong beat. With elements of jazz and soul genres, the music combines smooth melodies with soulful rhythms, creating a vibrant and engaging sound that is both uplifting and groovy.
How could a musician alter this music clip to make it more danceable? To make this music clip more danceable, a musician could consider increasing the tempo slightly to create a more upbeat feel. They could also emphasize the rhythm by adding percussive elements like a stronger beat or syncopated patterns to enhance the groove. Additionally, incorporating catchy melodies and repetitive hooks can make the music more engaging for dancing. By adjusting these elements, the musician can transform the track into a more dance-friendly piece.
Describe the audio The music clip features a blend of ambient and electronic elements. The prominent instruments include a bass guitar, drums, and synthesizer. The bass provides a deep, rhythmic foundation, while the drums add a driving beat. The synthesizer contributes to the electronic sound palette, adding layers of atmospheric textures. The overall sound is immersive and modern, creating a captivating and dynamic listening experience.
What type of vocal technique is being used in this music clip? The vocal technique in this music clip appears to be operatic, characterized by powerful projection, wide range, and dramatic expression commonly found in classical and opera music.
What is the tempo of this music clip? The tempo is [GetAudioTempo() -> 117.6] beats per minute.

Citation

Acknowledgements

The website template was borrowed from Michaƫl Gharbi and GenWarp authors.