Microsoft-backed OpenAI has announced a new flagship generative AI model, GPT-4o, calling it a step towards "much more natural human-computer interaction". GPT-4o accepts any combination of text, audio, and image as input and generates any combination of text, audio, and image outputs. The "o" in GPT-4o stands for "omni", referring to the model's ability to process text, speech, and video.
"It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time(opens in a new window) in a conversation," OpenAI says on the official release of the product.
In terms of performance, GPT-4o matches GPT-4 Turbo on English text and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. "GPT-4o is especially better at vision and audio understanding compared to existing models," the company adds.
"our new model: GPT-4o, is our best model ever. it is smart, it is fast, it is natively multimodal (!), and…it is available to all ChatGPT users, including on the free plan! so far, GPT-4 class models have only been available to people who pay a monthly subscription. this is important to our mission; we want to put great AI tools in the hands of everyone," Sam Altman, CEO, OpenAI, said on the launch of the product on Monday.
Altman says a key part of OpenAI's mission is to put very capable AI tools in the hands of people for free (or at a great price). "I am very proud that we’ve made the best model in the world available for free in ChatGPT, without ads or anything like that."
He added that the new voice (and video) mode is the "best computer interface I’ve ever used". "It feels like AI from the movies; and it’s still a bit surprising to me that it’s real. Getting to human-level response times and expressiveness turns out to be a big change."
Notably, prior to GPT-4o, one could use "Voice Mode" to talk to ChatGPT, with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average.
That Voice Mode is a pipeline of three separate models: a simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio.
"This process means that the main source of intelligence, GPT-4, loses a lot of information—it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion," says OpenAI.
The company says GPT-4o is better at understanding and discussing images, too. "For example, you can now take a picture of a menu in a different language and talk to GPT-4o to translate it, learn about the food's history and significance, and get recommendations."
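As a rough sketch of what such an image-plus-text request looks like through the API, assuming the openai Python SDK and its documented chat-completions image input format; the image URL and prompt text are illustrative placeholders.

```python
# Illustrative sketch: asking GPT-4o about an image via the API.
# The image URL and prompt text are placeholders.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Translate this menu into English and recommend a dish.",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/menu.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```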
In terms of language capabilities, GPT-4o improves quality and speed and now supports more than 50 languages. "We are also starting to roll out to ChatGPT Free with usage limits today."
When using GPT-4o, ChatGPT free users will now be able to experience GPT-4-level intelligence, get responses from both the model and the web, analyse data and create charts, chat about photos, and upload files for help with summarising, writing or analysing.