Coinciding with the introduction of the ChatGPT API, OpenAI has released the Whisper API, a hosted version of the Whisper speech-to-text model that was made available to the public in September.
Whisper is an automatic speech recognition system that, according to OpenAI, can perform “robust” transcription in multiple languages and translate from those languages into English, at a cost of $0.006 per minute. It accepts files in the M4A, MP3, MP4, MPEG, MPGA, WAV, and WEBM formats.
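For a sense of what calling the hosted API looks like, here is a minimal sketch using the official `openai` Python package's v1 client. The `transcribe` wrapper, the file names, and the cost helper are illustrative assumptions, not code from OpenAI's announcement; only the model identifier `whisper-1`, the per-minute price, and the format list come from the launch details.

```python
# Sketch: sending an audio file to the hosted Whisper API.
# Assumes the `openai` package (v1.x) is installed and OPENAI_API_KEY is set.
from pathlib import Path

# File formats the Whisper API accepts, per OpenAI's announcement.
SUPPORTED_FORMATS = {"m4a", "mp3", "mp4", "mpeg", "mpga", "wav", "webm"}

PRICE_PER_MINUTE_USD = 0.006  # OpenAI's stated Whisper API pricing

def cost_usd(minutes: float) -> float:
    """Estimate the transcription cost for a recording of the given length."""
    return round(minutes * PRICE_PER_MINUTE_USD, 4)

def transcribe(path: str) -> str:
    """Upload an audio file to the Whisper API and return the transcript text."""
    ext = Path(path).suffix.lstrip(".").lower()
    if ext not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported audio format: {ext}")
    # Import here so format validation works without network access or a key.
    from openai import OpenAI
    client = OpenAI()
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(model="whisper-1", file=audio)
    return result.text
```

At the stated rate, a one-hour recording would cost roughly $0.36 to transcribe, which is what makes the hosted API competitive with running the open-source model yourself.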
Speech recognition is a crowded field: state-of-the-art systems sit at the heart of products from industry leaders like Google, Amazon, and Meta. According to Greg Brockman, president and chairman of OpenAI, Whisper’s improved handling of unique accents, background noise, and technical jargon comes from its training on 680,000 hours of multilingual and “multitask” data collected from the web.
Speaking in a video call with TechCrunch yesterday afternoon, Brockman said that releasing the model alone was not enough to spur an entire developer ecosystem around it. “While the same large model is available as open source, we’ve optimized the Whisper API to the nth degree. It’s incredibly quick and easy to use,” he said.
Brockman’s point is borne out by the many obstacles standing in the way of widespread business adoption of voice transcription software. A 2020 Statista survey found that the main barriers preventing businesses from adopting tech like speech-to-text were concerns over accuracy, accent- or dialect-related recognition issues, and cost.
Whisper has its limits, though, particularly around “next-word” prediction. Because the system was trained on a large amount of noisy data and tries to predict the next word in the audio at the same time as it transcribes the recording, OpenAI warns that Whisper’s transcriptions may contain words that were never actually spoken. The error rate also climbs for speakers of languages that are under-represented in the training data, meaning Whisper’s performance is not consistent across languages.
Unfortunately, that last problem is hardly unique to Whisper. Even the best systems have long been plagued by bias; a 2020 Stanford study found that systems from Amazon, Apple, Google, IBM, and Microsoft made far fewer errors with white users than with Black users — a word error rate of about 19% versus roughly 35%.
Despite this, OpenAI expects Whisper’s transcription capabilities to be used to enhance other apps, services, products, and tools. The AI-driven language learning app Speak already uses the Whisper API to power an in-app virtual speaking companion.
Microsoft-backed OpenAI stands to profit handsomely if it can break into the speech-to-text market in a big way: by one estimate, the market could be worth $5.4 billion by 2026, up from $2.2 billion in 2021.
“Our vision is that we really want to be this universal intelligence,” Brockman said. “We really want to, very flexibly, be able to take in whatever kind of data you have — whatever kind of task you want to accomplish — and be a force multiplier on that attention.”