OpenAI has unveiled the latest iteration of its flagship language model, GPT-4o. The appended “o” stands for “omni” (not zero), reflecting the model’s ability to accept audio, image, and text inputs and to produce outputs across those same modalities.
GPT-4o (Omni)
OpenAI characterizes this latest version of GPT-4 as a step toward more natural human-machine interaction, responding to user input with a fluidity closer to human-to-human conversation. It matches GPT-4 Turbo in English proficiency and surpasses it in other languages, and the API is significantly faster while costing 50% less to operate.
The announcement elaborates: “As evaluated against conventional benchmarks, GPT-4o attains GPT-4 Turbo-level proficiency in text comprehension, reasoning, and coding intelligence, while establishing new benchmarks in multilingual, audio, and visual capabilities.”
Advanced Voice Processing
Previously, communicating through voice required stitching together three distinct models: one to transcribe the voice input into text, another (such as GPT-3.5 or GPT-4) to process the text and generate a response, and a third to convert that response back into audio. This method was criticized for losing nuance at each conversion between modalities.
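For illustration, here is a minimal sketch of that kind of stitched-together pipeline using OpenAI’s Python SDK, assuming Whisper for transcription, a text-only chat model for the reply, and a separate text-to-speech model; the file paths are placeholders.

```python
from openai import OpenAI

client = OpenAI()

# Step 1: transcribe the spoken question into text (speech-to-text model).
with open("user_question.wav", "rb") as audio_file:  # placeholder path
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# Step 2: pass the transcribed text to a text-only chat model for a reply.
completion = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
reply_text = completion.choices[0].message.content

# Step 3: convert the text reply back into audio with a text-to-speech model.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply_text,
)
with open("assistant_reply.mp3", "wb") as out_file:  # placeholder path
    out_file.write(speech.content)
```

Tone, background sounds, and speaker identity are all discarded at the transcription step, which is the loss of nuance the announcement describes.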
OpenAI explained the drawbacks of the former approach, which the new method is designed to address: “This process results in the primary intelligence source, GPT-4, missing out on a wealth of information—it cannot directly discern tone, distinguish multiple speakers, or account for background noises, and it lacks the ability to convey laughter, singing, or express emotions.”
The latest version eliminates the need for three separate models: all inputs and outputs are handled within a single model, enabling end-to-end audio processing. Interestingly, OpenAI acknowledges it has yet to fully explore the new model’s capabilities or understand its limitations.
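As a point of contrast, a single GPT-4o request can already mix modalities in one call. The following is a minimal sketch using the standard Chat Completions API with a text-plus-image prompt; the image URL is a placeholder, and audio input and output were not yet exposed through the public API at the time of the announcement.

```python
from openai import OpenAI

client = OpenAI()

# One request to a single model that accepts mixed text and image content;
# no separate transcription or synthesis models are chained together.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is happening in this photo."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},  # placeholder URL
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```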
New Guardrails And An Iterative Release
OpenAI’s GPT-4o introduces enhanced guardrails and filters to ensure safety and prevent unintended voice outputs. However, the announcement specifies that at launch only text and image inputs with text outputs are being rolled out, with audio capabilities following in limited form. GPT-4o is available on both free and paid tiers, with Plus users getting message limits five times higher than free users.
Audio capabilities are slated for a limited alpha-phase release to ChatGPT Plus and API users in the coming weeks.
The announcement elaborated: “We acknowledge that GPT-4o’s audio features entail various novel risks. Therefore, we are publicly launching text and image inputs and text outputs today. In the subsequent weeks and months, we will focus on enhancing technical infrastructure, post-training usability, and safety measures required to introduce other modalities. For instance, at launch, audio outputs will be constrained to a selection of preset voices and will adhere to our existing safety protocols.”
Original news from SearchEngineJournal