OpenAI successfully diverted attention from Google in the weeks leading up to Google’s major event, Google I/O. However, when OpenAI’s own announcement finally arrived, it fell short, revealing only a language model marginally better than its predecessor, with the “magic” component not even in Alpha testing.
This left many feeling disappointed, akin to a mom receiving a vacuum cleaner for Mother’s Day. Nevertheless, OpenAI effectively reduced the media focus on Google’s significant event.
The first hint of playful teasing is in the name of the latest GPT model, GPT-4 “o”: the letter “o” echoes the name of Google’s event, I/O.
While OpenAI says the “o” stands for Omni, signifying all-encompassing capabilities, the choice of letter also reads as a sly nod toward Google’s event.
On the Friday before the announcement, Sam Altman teased on Twitter that “new stuff” was coming that felt like “magic” to him, ruling out GPT-5 and a search engine but hinting at something innovative in the pipeline.
Then OpenAI co-founder Greg Brockman unveiled GPT-4o in a tweet, describing it as a groundbreaking model capable of reasoning across text, audio, and video in real time. He emphasized its versatility, its interactivity, and its potential to significantly improve human-computer interaction, even extending to human-computer-computer interaction.
During the announcement, it was revealed that previous versions of ChatGPT relied on a pipeline of three separate models to handle voice: one to convert audio to text, another to complete the task and produce a text response, and a third to convert that text back into audio. The breakthrough with GPT-4o is that it handles audio input and output within a single model, responding in roughly the time a human takes to listen to a question and answer it.
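To make the contrast concrete, here is a minimal sketch of that older, cascaded approach using OpenAI’s public speech and chat endpoints. The specific models (whisper-1, gpt-4-turbo, tts-1), the “alloy” voice, and the file names are illustrative assumptions; OpenAI has not published exactly which models ChatGPT’s Voice Mode chained together internally.

```python
# Sketch of the cascaded voice pipeline described above (illustrative only).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1) Speech-to-text: transcribe the user's spoken question.
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",   # assumed speech-to-text model
        file=audio_file,
    )

# 2) Text-to-text: generate an answer from the transcript.
completion = client.chat.completions.create(
    model="gpt-4-turbo",     # assumed text model
    messages=[{"role": "user", "content": transcript.text}],
)
answer = completion.choices[0].message.content

# 3) Text-to-speech: read the answer back to the user.
speech = client.audio.speech.create(
    model="tts-1",           # assumed text-to-speech model
    voice="alloy",
    input=answer,
)
speech.stream_to_file("answer.mp3")
```

Each hand-off in that chain adds latency and discards information such as tone of voice and background sound, which is why collapsing the three steps into one natively multimodal model is the headline change in GPT-4o.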
However, there’s a snag: the audio component isn’t available online yet. The team is still ironing out the kinks with the guardrails, and it’ll be a few weeks before an Alpha version is rolled out to select users for testing. Alpha versions are prone to bugs, whereas Beta versions tend to be closer to the final product.
OpenAI clarified the frustrating delay:
“We understand that the audio features of GPT-4o introduce a range of new challenges. Currently, we’ve made text and image inputs and text outputs accessible to the public. In the coming weeks and months, we’ll be focusing on refining the technical infrastructure, enhancing usability post-training, and ensuring safety measures before releasing the other features.”
OpenAI says that while the core functionality of GPT-4o’s audio input and output is complete, it is still working to ensure the feature meets the required safety standards before making it available to the public.
It was inevitable that an incomplete and overhyped product would stir up negative feedback on social media.
AI engineer Maziyar Panahi (per his LinkedIn profile) expressed his disappointment on Twitter:
“Just tried out the new GPT-4o (Omni) in ChatGPT. Honestly, not impressed at all! Faster, cheaper, multimodal – these features don’t resonate with me. All I need is a code interpreter, and it’s just as lackluster as before!”
He later added:
“I get that for startups and businesses, faster, cheaper, audio, etc., are enticing. But for me, who primarily uses the chat feature, it feels pretty much the same. Especially as a Data Analytics assistant. And I don’t see any added value for my $20. Not today!”
Similar sentiments were echoed by others on platforms like Facebook and X, although there were also many who were pleased with the perceived improvements in speed and cost for API usage.
It’s hard to ignore the feeling that the release of GPT-4o was timed to coincide with, and overshadow, Google I/O. But rushing it out just before Google’s major event, in an incomplete state, may have undercut its own significance, making it look like a minor upgrade.
In its current state, GPT-4o doesn’t represent a groundbreaking leap forward. Once the audio component exits Alpha and makes it through Beta testing, we may see genuinely revolutionary advances in large language models. Yet by the time that milestone is reached, Google and Anthropic may have already solidified their positions in this domain.
OpenAI’s announcement presents a rather subdued picture of the new model, positioning its performance as comparable to GPT-4 Turbo. The standout features include notable enhancements in languages other than English and cost-effectiveness for API users.
OpenAI elaborates:
“GPT-4o matches GPT-4 Turbo’s performance on English text and code, while showcasing significant advancements in non-English text processing. Additionally, it offers improved speed and a 50% reduction in API costs.”
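For API users the new model is essentially a drop-in replacement in the chat completions endpoint; only the model name changes. Below is a minimal sketch with the OpenAI Python SDK (the prompt text is made up, and current pricing and rate limits are whatever OpenAI’s documentation lists):

```python
# Minimal sketch: calling GPT-4o through the chat completions endpoint.
# The 50% cost reduction is a pricing detail, not something visible in code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # previously e.g. "gpt-4-turbo"
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the difference between GPT-4o and GPT-4 Turbo."},
    ],
)

print(response.choices[0].message.content)
```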
However, when examining ratings across six benchmarks, GPT-4o barely edges past GPT-4T in most tests, but lags behind in a crucial benchmark for reading comprehension.
Here are the performance metrics:
MMLU (Massive Multitask Language Understanding)
This assessment evaluates multitasking accuracy and problem-solving across over fifty subjects, including mathematics, science, history, and law. GPT-4o leads slightly with a score of 88.7, while GPT-4 Turbo follows closely behind at 86.9.
GPQA (Graduate-Level Google-Proof Q&A Benchmark)
Consisting of 448 multiple-choice questions crafted by domain experts in fields like biology, chemistry, and physics, this benchmark saw GPT-4o achieve a score of 53.6, edging out GPT-4T’s 48.0.
Math
In mathematics, GPT-4o excels with a score of 76.6, surpassing GPT-4T by four points (72.6).
HumanEval
This coding benchmark showcases GPT-4o’s slight advantage with a score of 90.2 over GPT-4T’s 87.1, demonstrating a three-point lead.
MGSM (Multilingual Grade School Math Benchmark)
Evaluating grade-school level math proficiency across ten languages, GPT-4o secures a score of 90.5, outperforming GPT-4T’s 88.5.
DROP (Discrete Reasoning Over Paragraphs)
Comprising 96k questions to gauge language model comprehension of paragraph content, GPT-4o achieves a score of 83.4, trailing GPT-4T’s 86.0 by nearly three points.
Naming the model with the letter “o” certainly drew attention, and perhaps drew some of it away from Google’s I/O conference. Whether intentional or not, OpenAI managed to divert significant focus from Google’s search conference.
But does a language model that only marginally outperforms its predecessor truly merit all the hype and media buzz it received? Despite its modest advancements, the impending announcement monopolized news headlines, overshadowing Google’s major event. For OpenAI, it’s evident that the hype was worth it.
Original news from SearchEngineJournal