The latest speech-to-text models promise more accurate, better contextualized transcriptions with lower latency.
Model architectures are evolving and practice is following. Although OpenAI's Whisper is still widely used in companies, it is no longer the best transcription model on the market. New approaches that handle the different modalities within a single neural network deliver more reliable and far better contextualized transcriptions. Built respectively on GPT-4o and GPT-4o-mini, OpenAI's new models promise state-of-the-art performance. Explanations.
GPT-4o-transcribe clearly ahead of Whisper
Introduced in March, the new transcription models GPT-4o-transcribe and GPT-4o-mini-transcribe now outperform Whisper, even in its latest large-v3 version. Both models show a word error rate (WER, the percentage of words transcribed incorrectly relative to a reference) significantly lower than the Whisper v2 and v3 models. On the FLEURS benchmark (Few-shot Learning Evaluation of Universal Representations of Speech, a multilingual test covering more than 100 languages with manually transcribed audio samples), GPT-4o-transcribe and GPT-4o-mini-transcribe demonstrate much more robust transcription accuracy, whatever the language. In French, for example, GPT-4o-transcribe achieves a WER of 3.46%, against 5.33% for Whisper and 4.84% for Gemini 2.0 Flash.
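As a reminder of how the metric works, here is a minimal sketch of a word-level WER computation (the benchmark uses its own normalization and scoring code; this is only illustrative):

```python
# Minimal word-level WER: (substitutions + deletions + insertions) / reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance computed by dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution and one deletion over six reference words: WER ≈ 0.33.
print(wer("le chat dort sur le canapé", "le chat dors sur canapé"))
```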
Unlike Whisper, which works as a standalone speech recognition system specialized in audio only, these new models integrate speech processing directly into GPT-4o's main neural network. Unifying the modalities in a single network lets the audio representations benefit from the linguistic capabilities the OpenAI LLM acquired during pre-training. OpenAI thus claims that its model is particularly strong in complex scenarios: varied accents, noisy environments or different speaking rates.
A new way of transcribing
While Whisper usage was modular and one-way, this is no longer the case with GPT-4o-transcribe and GPT-4o-mini-transcribe. In addition to the audio file to be transcribed, the OpenAI models accept a text prompt as input. The goal? Give the model context so that the final transcription is even more accurate. The model will then tend to spell correctly words from uncommon or domain-specific lexical fields. The prompt can also be used to request a different register, to keep or remove filler words ("euh", "hmm"), or to improve context when transcribing a file split into two parts (by passing the previous transcription in the prompt).
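In practice, the call looks like this with the OpenAI Python SDK (a minimal sketch: the file name and the prompt content are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("meeting.mp3", "rb") as audio_file:  # hypothetical file
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        # Context prompt: domain vocabulary, proper nouns, desired register,
        # whether to keep or drop filler words, etc.
        prompt="Product meeting about RAG pipelines and the models "
               "GPT-4o-transcribe and Whisper large-v3. Remove filler words.",
    )

print(transcript.text)
```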
To automate the creation of the contextual prompt for GPT-4o-transcribe or GPT-4o-mini-transcribe, it is possible to have it generated by another model. A first model (Gemini 2.0 Flash, for example) takes the audio file as input and returns a short description. That description is then sent to GPT-4o-transcribe or GPT-4o-mini-transcribe as context. The final transcription will be even more accurate and detailed. To illustrate this principle, we have tested the method in a ready-to-use Google Colaboratory notebook, available here. Just enter your OpenAI (OPENAI_API_KEY) and Google AI Studio (GOOGLE_API_KEY) API keys in the secrets and run the script.
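The two-step pipeline can be sketched as follows (this is not the Colab script mentioned above; the file name and the instructions are illustrative, using the google-generativeai and openai SDKs):

```python
import os
import google.generativeai as genai
from openai import OpenAI

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Step 1: Gemini 2.0 Flash produces a short description of the audio.
audio = genai.upload_file("interview.mp3")  # hypothetical file
gemini = genai.GenerativeModel("gemini-2.0-flash")
description = gemini.generate_content(
    ["In two sentences, describe the topic, the speakers and the key vocabulary of this audio.",
     audio]
).text

# Step 2: that description becomes the context prompt of GPT-4o-transcribe.
with open("interview.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=f,
        prompt=description,
    )

print(transcript.text)
```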
The limits of transcription with GPT-4o-transcribe
This is one of the main limits for long audio files: the price. Unlike Whisper, GPT-4o-transcribe and GPT-4o-mini-transcribe are proprietary models and can only be run through the OpenAI API. Expect on average $0.006 per minute of transcription with GPT-4o-transcribe and $0.003 with GPT-4o-mini-transcribe, whereas Whisper, run locally, only costs the energy of the machine it runs on (negligible, in practice).
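A back-of-the-envelope calculation gives the order of magnitude (the monthly volume is chosen arbitrarily):

```python
# Illustrative cost estimate based on the per-minute prices quoted above.
minutes_per_month = 10_000  # hypothetical volume

print(f"GPT-4o-transcribe:      ${minutes_per_month * 0.006:,.0f} / month")  # $60
print(f"GPT-4o-mini-transcribe: ${minutes_per_month * 0.003:,.0f} / month")  # $30
```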
For very basic use, it is therefore more economical to run Whisper locally. But for use cases where latency and accuracy matter (a voice agent on the phone, for example), GPT-4o will be preferred. It is also possible to route between models according to the nature and difficulty of the audio to be transcribed. For instance, a multimodal evaluator model can decide which model to use among Whisper, GPT-4o-mini-transcribe and GPT-4o-transcribe depending on the domain or the quality of the recording, as sketched below.
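A minimal sketch of such a router, assuming Gemini 2.0 Flash as the evaluator (the classification prompt and the difficulty labels are hypothetical):

```python
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
evaluator = genai.GenerativeModel("gemini-2.0-flash")

def pick_transcription_model(audio_path: str) -> str:
    """Route a recording to a transcription model based on a cheap
    multimodal evaluation of its difficulty (hypothetical labels)."""
    audio = genai.upload_file(audio_path)
    verdict = evaluator.generate_content(
        ["Answer with exactly one word, 'easy', 'medium' or 'hard': how difficult "
         "is this recording to transcribe (noise, accents, technical jargon)?",
         audio]
    ).text.strip().lower()
    if "hard" in verdict:
        return "gpt-4o-transcribe"         # most accurate, most expensive
    if "medium" in verdict:
        return "gpt-4o-mini-transcribe"
    return "whisper-large-v3 (local)"      # cheapest for clean audio
```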