TL;DR:
• NVIDIA releases Granary, a 1-million-hour multilingual speech dataset for 25 European languages
• New Canary-1b-v2 model matches the accuracy of models roughly three times its size while running up to 10x faster
• Models trained on Granary reach target accuracy with roughly half the training data that comparable datasets require
• Open-source release democratizes speech AI for underrepresented languages like Estonian and Maltese
NVIDIA just cracked open the gates to multilingual AI with Granary, a massive dataset containing nearly 1 million hours of speech across 25 European languages, released alongside two models that promise to reshape how developers build voice AI for global markets. The release, announced today, targets a critical gap in speech AI: only a fraction of the world's roughly 7,000 languages have meaningful AI support, and open data at this scale could help democratize voice technology for underserved linguistic communities.
The timing couldn't be more strategic. While tech giants pour billions into English-language AI systems, NVIDIA is betting on linguistic diversity as the next competitive frontier. The company's speech AI team, working alongside researchers from Carnegie Mellon University and Italy's Fondazione Bruno Kessler, has created what amounts to a linguistic data goldmine for European languages that have historically been AI afterthoughts.
[Embedded image: NVIDIA Granary dataset visualization showing coverage of 25 European languages]
"We're seeing a fundamental shift in how speech AI gets built," explains the methodology behind Granary in NVIDIA's technical paper being presented at the Interspeech conference in the Netherlands this week. The dataset doesn't just collect audio – it transforms unlabeled speech into structured, training-ready data using NVIDIA's NeMo Speech Data Processor toolkit.
The breakthrough lies in efficiency. Internal testing shows that models trained on Granary reach the same accuracy for automatic speech recognition and translation with roughly half the data that competing datasets demand. That efficiency gain translates directly into lower development costs and faster deployment cycles for companies building multilingual voice applications.