Drax model from aiOla makes AI speech recognition viable and reliable in noisy environments

Artificial intelligence startup aiOla says it’s ready to take on giants such as OpenAI Group PBC and Alibaba Group Holding Ltd. in the field of speech recognition with a new voice AI model based on flow-matching training techniques.

The new model is called Drax, and it reinvents the way AI speech algorithms are trained to better capture the nuances of speech in real-world settings. According to the startup, this allows it to eliminate the delays and inaccuracies of today’s best speech models. It says Drax can recognize spoken words in the noisiest environments with superior accuracy and lower latency than anything that’s come before it, including OpenAI’s Whisper and Alibaba’s Qwen2.

The startup says its goal is to transform AI-powered speech recognition and make it reliable enough to be used in everyday life and business settings. Existing AI models don’t perform well enough for this, because the way they’re trained forces them to sacrifice either speed or accuracy.

According to aiOla, OpenAI’s Whisper performs impressively, but its accuracy stems from the way it processes speech token by token, which is far too slow to make it useful in many situations. While useful for recording short notes or dictations, Whisper struggles to keep up with long-form audio such as hour-long office meetings or lengthy, complex customer calls.

To enhance the speed of speech recognition, some of OpenAI’s rivals have instead looked at diffusion-based models that can process multiple tokens simultaneously. Alibaba is one of the leaders in this realm, and its Qwen2 system is much faster than Whisper, but it struggles to match Whisper’s level of accuracy. Part of the problem is that it’s trained on clean, idealized data rather than the noisy and unpredictable speech heard in real-world settings, where a speaker might be partly drowned out by people whispering in the background or by a noisy office.
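
To make the speed trade-off concrete, here is a toy Python sketch that counts only how many dependent model calls each decoding style needs. It is not aiOla’s, OpenAI’s or Alibaba’s code. The function names and the step count of 16 are hypothetical and chosen purely for illustration.

```python
def sequential_passes(num_tokens: int) -> int:
    """Token-by-token decoding: one dependent model call per output token."""
    return num_tokens


def parallel_passes(num_tokens: int, refinement_steps: int) -> int:
    """Parallel decoding: each refinement step updates every position at once,
    so the number of dependent calls does not grow with transcript length."""
    return refinement_steps if num_tokens > 0 else 0


if __name__ == "__main__":
    steps = 16  # hypothetical step count, used only for illustration
    for tokens in (50, 1_000, 10_000):
        seq = sequential_passes(tokens)
        par = parallel_passes(tokens, steps)
        print(f"{tokens:>6} tokens: {seq:>6} sequential calls vs {par} parallel calls")
```

Under these assumptions, the sequential count grows with the length of the recording while the parallel count stays fixed, which is why long meetings and calls are the hard case for token-by-token models. Real-world speedups also depend on how expensive each call is, which this sketch ignores.
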
Parallel flow-based processing

With Drax, aiOla says, it has come up with a novel technique for training speech recognition systems that strikes the right balance between speed and accuracy. Similar to diffusion models that refine a picture from random noise, Drax tries to reconstruct human speech from a noisy representation.

The training process involves a three-step probability path. It begins with meaningless noise, then transitions to a “speech-like” but still imperfect middle state. This exposes Drax to the realistic, acoustically plausible errors that occur in the real world, improving its ability to understand speech amid background noise, including accented speech. Finally, it converges on a clear and correct transcript.

Vice President of AI Gil Hetz said there’s no room for error in the most critical applications of speech technology. For instance, a company cannot use AI chatbots to handle customer support if the bot has trouble understanding what people are trying to tell it, he said. It needs something that’s reliable in all situations.

“That’s why Drax is such a breakthrough,” he said. “It combines accuracy and speed without compromise, handling real-world speech no matter the background noise, accent or jargon.”

While OpenAI’s Whisper and most other automatic speech recognition models process spoken words sequentially, predicting one token at a time, Drax can output the entire token sequence in parallel, capturing the entire conversation at once. The company calls this a “parallel, flow-based approach” to training speech models, one that it says eliminates latency while preventing the compounding errors that occur during longer transcriptions.

The startup believes Drax is a major development because voice is by far the most natural and efficient medium for data entry, and will likely become the default way that humans communicate with machines.
At present, this hasn’t occurred because AI transcription fails to keep up. But Drax promises to be different.

“We’re closing that gap and making voice technology actually practical at scale,” said aiOla President Amir Haramty. “That’s why advancing speech recognition is so important – it’s the future of the enterprise.”

Open source

The startup said it’s making Drax available on GitHub and Hugging Face under a permissive open-source license, with three model sizes to choose from. They include a lightweight Flash version, a mid-sized model and the full-scale base model, so it can be run on low-powered devices with minimal processing power, or alternatively at large scale in the cloud.

“By open-sourcing it, we hope to spark further discovery and collaboration from the community,” said aiOla Chief Scientist Yossi Keshet.

Alongside the model, aiOla has released research that compares Drax’s performance with that of its main competitors. In English benchmarks, Drax scored an average word error rate of 7.4%, the company said, ahead of OpenAI’s Whisper-large-v3, which achieved 7.6%. It also outperformed Whisper and Alibaba’s Qwen2-audio on select datasets, while running up to 32 times faster, the company said. And when tested on French, German, Mandarin and Spanish language benchmarks, it delivered comparable or better accuracy at a similar level of performance.

The startup ultimately hopes to see widespread adoption of Drax in the enterprise, allowing almost any business process to run at the speed of voice. It envisions a world where AI agents can understand natural language commands and transform them into structured data and real-world task execution, even in noisy and jargon-heavy environments.

Image: SiliconANGLE
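
For readers unfamiliar with the word error rate figures quoted above, the metric is conventionally the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the model’s output, divided by the number of reference words. The Python sketch below shows that standard calculation. It is not aiOla’s evaluation code, and the function name and the lowercase-and-split normalization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("turn on the pump", "turn off the pump"))
```

On this definition, a 7.4% word error rate corresponds to roughly seven or eight word-level errors per 100 reference words. Published benchmark figures also depend on text normalization choices that this sketch does not attempt to reproduce.
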

