𝗦𝗼𝗻𝗶𝗰𝗩𝗲𝗿𝘀𝗲 🎶 An open-source framework for temporally-aware music caption generation

Jun 19, 2025

Do you want to generate text descriptions of music MP3s? Just upload them to the SonicVerse demo page and you will get long, informative captions that include 𝗱𝗲𝘁𝗮𝗶𝗹𝗲𝗱 𝗺𝘂𝘀𝗶𝗰 𝗳𝗲𝗮𝘁𝘂𝗿𝗲𝘀 (e.g. key/chords) as well as 𝙩𝙞𝙢𝙚-𝙖𝙬𝙖𝙧𝙚 𝙙𝙚𝙨𝙘𝙧𝙞𝙥𝙩𝙞𝙤𝙣𝙨 𝙤𝙛 𝙚𝙫𝙤𝙡𝙫𝙞𝙣𝙜 𝙢𝙪𝙨𝙞𝙘𝙖𝙡 𝙘𝙤𝙣𝙩𝙚𝙣𝙩.

📂 𝗚𝗶𝘁𝗛𝘂𝗯: https://github.com/AMAAI-Lab/sonicverse
👩‍💻 Try a 𝗗𝗲𝗺𝗼: https://huggingface.co/spaces/amaai-lab/SonicVerse
🎶 𝗘𝘅𝗮𝗺𝗽𝗹𝗲𝘀: https://amaai-lab.github.io/SonicVerse/
📖 𝗣𝗮𝗽𝗲𝗿: https://arxiv.org/abs/2506.15154

𝗞𝗲𝘆 𝗙𝗲𝗮𝘁𝘂𝗿𝗲𝘀:
- 𝗠𝘂𝗹𝘁𝗶-𝗧𝗮𝘀𝗸 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴: Combines caption generation with music feature detection (key detection, vocals detection, etc.)
- 𝗣𝗿𝗼𝗷𝗲𝗰𝘁𝗶𝗼𝗻-𝗕𝗮𝘀𝗲𝗱 𝗔𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲: Transforms audio input into language tokens while maintaining feature detection capabilities
- Enhanced Captioning: Produces 𝗿𝗶𝗰𝗵, 𝗱𝗲𝘀𝗰𝗿𝗶𝗽𝘁𝗶𝘃𝗲 𝗰𝗮𝗽𝘁𝗶𝗼𝗻𝘀 that incorporate detected music features
- Long-Form Description: Enables detailed 𝘁𝗶𝗺𝗲-𝗶𝗻𝗳𝗼𝗿𝗺𝗲𝗱 𝗱𝗲𝘀𝗰𝗿𝗶𝗽𝘁𝗶𝗼𝗻𝘀 for longer music pieces through LLM chaining
- Trained on 𝗼𝗽𝗲𝗻 𝗱𝗮𝘁𝗮; the model, weights, and demo are all available as 𝗼𝗽𝗲𝗻-𝘀𝗼𝘂𝗿𝗰𝗲
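
To make the long-form idea above more concrete, here is a minimal Python sketch of what chaining segment captions could look like. It is an illustration only, not the SonicVerse code: the 10-second window, the prompt wording, and the names `segment_bounds`, `build_chain_prompt`, `describe_track`, `caption_segment`, and `summarize` are all assumptions made for this example. The captioning model and the LLM are passed in as plain callables so the sketch runs without any model installed.

```python
from typing import Callable, List, Tuple

Bounds = Tuple[float, float]


def segment_bounds(duration_s: float, window_s: float = 10.0) -> List[Bounds]:
    """Split [0, duration_s) into consecutive windows of at most window_s seconds.

    The 10-second default is an assumption for illustration only.
    """
    if duration_s <= 0 or window_s <= 0:
        return []
    bounds: List[Bounds] = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds


def build_chain_prompt(bounds: List[Bounds], captions: List[str]) -> str:
    """Prefix each segment caption with its time range and join them into one prompt."""
    lines = [f"[{s:.0f}s-{e:.0f}s] {c}" for (s, e), c in zip(bounds, captions)]
    header = (
        "Combine these segment captions into one description "
        "of how the music evolves over time:"
    )
    return header + "\n" + "\n".join(lines)


def describe_track(
    duration_s: float,
    caption_segment: Callable[[float, float], str],  # hypothetical per-segment captioner
    summarize: Callable[[str], str],                 # hypothetical LLM call
    window_s: float = 10.0,
) -> str:
    """Caption each window, then ask the LLM to merge the timestamped captions."""
    bounds = segment_bounds(duration_s, window_s)
    captions = [caption_segment(s, e) for s, e in bounds]
    return summarize(build_chain_prompt(bounds, captions))
```

In practice `caption_segment` would wrap a music captioning model and `summarize` would wrap an LLM; keeping them as arguments makes the chaining step easy to test with stand-in functions.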

Developed by the Audio, Music, and AI (𝗔𝗠𝗔𝗔𝗜) 𝗟𝗮𝗯 at Singapore University of Technology and Design (SUTD):

Anuradha Chopra, Abhinaba Roy, Dorien Herremans (2025).
𝙎𝙤𝙣𝙞𝙘𝙑𝙚𝙧𝙨𝙚: 𝙈𝙪𝙡𝙩𝙞-𝙏𝙖𝙨𝙠 𝙇𝙚𝙖𝙧𝙣𝙞𝙣𝙜 𝙛𝙤𝙧 𝙈𝙪𝙨𝙞𝙘 𝙁𝙚𝙖𝙩𝙪𝙧𝙚-𝙄𝙣𝙛𝙤𝙧𝙢𝙚𝙙 𝘾𝙖𝙥𝙩𝙞𝙤𝙣𝙞𝙣𝙜.
6th Conference on AI Music Creativity (AIMC 2025), Brussels.
arXiv:2506.15154

Let us know if you find SonicVerse useful! We'd love to hear your feedback!


#musicAI #genAI #NLP #ISMIR #womenintech #womeninscience

