Sam Altman-led OpenAI has created a lot of buzz with the company's recent launch of GPT-4o, a flagship AI model that can process text, audio and images as both inputs and outputs and has been dubbed OpenAI's first natively "fully multimodal AI".
The new model has been widely hailed for the fluidity of its conversations and its ability to address user issues, perceive its surroundings and respond based on what it sees and hears. It can translate speech in real time, take notes, solve math problems, and even sing.
The man who reportedly steered OpenAI's ambitious new project is Prafulla Dhariwal, an Indian scientist who has received praise from Altman himself.
"GPT-4o would not have happened without the vision, talent, conviction, and determination of @prafdhar over a long period of time. That (along with the work of many others) led to what I hope will turn out to be a revolution in how we use computers," the OpenAI CEO said in a post on X.
Reflecting on the successful launch, Dhariwal said GPT-4o was a huge organisation-wide effort. A year or so back, he said, OpenAI had a number of multimodal efforts under way but no single push to make its largest GPTs fully multimodal. "I’ve been so fortunate to get to work with such an amazing group of people to finally make this possible!"
He said GPT-4o is the "first model to come out of the omni team, OpenAI’s first natively fully multimodal model". "This launch was a huge org-wide effort, but I’d like to give a shout out to a few of my awesome team members who made this magical model even possible!"
According to his website, Dhariwal is from Pune and lives in San Francisco. He is currently a research scientist at OpenAI, where he works on generative models and unsupervised learning. He did his undergraduate studies at MIT in computer science, mathematics and physics, and joined OpenAI as a research intern in May 2016. He was also the winner of the prestigious International Astronomy Olympiad held in China in 2009, and before joining MIT he had secured the 165th rank in the IIT-JEE entrance exam in 2013.
What is GPT-4o?
In GPT-4o, “o” stands for “omni”. OpenAI says it is a step towards much more "natural human-computer interaction" — it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs.
"It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time(opens in a new window) in a conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models," the company said on the launch.
The company says GPT-4o was trained as a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. "Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations," says the company.
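For developers, the multimodal behaviour described above is most easily seen through OpenAI's API. The following is a minimal sketch, assuming the OpenAI Python SDK with an API key set in the environment; the image URL is a placeholder, and the snippet only illustrates the kind of text-plus-image request the model accepts, not a complete application.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Send text and an image together in a single request; GPT-4o processes
# both modalities with the same model.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {
                    "type": "image_url",
                    # Placeholder URL for illustration only
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
)

# Print the model's text reply
print(response.choices[0].message.content)
```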