Google DeepMind has released a rival to ChatGPT, named Gemini, and it can understand and generate multiple types of media including images, videos, audio, and text.
Most artificial intelligence (AI) tools only understand and generate one type of content. For example, OpenAI’s ChatGPT “reads” and creates only text. But Gemini can generate multiple types of output based on any form of input, Google said in a blog post.
The three versions of Gemini 1.0 are Gemini Ultra, the largest version, Gemini Pro, which is being rolled out into Google’s digital services, and Gemini Nano, designed to be used on devices like smartphones.
According to DeepMind’s technical report on the chatbot, Gemini Ultra beat GPT-4 and other leading AI models in 30 of 32 key academic benchmarks used in AI research and development. These include high school exams and tests on morality and law.
Specifically, Gemini Ultra came out ahead in nine image comprehension benchmarks, six video understanding tests, five speech recognition and translation tests, and 10 of 12 text and reasoning benchmarks. The two in which Gemini Ultra failed to beat GPT-4 were in common-sense reasoning, according to the report.
Building models that process multiple forms of media is hard because biases in the training data are likely to be amplified, performance tends to drop significantly, and models tend to overfit — meaning they perform well when tested against the training data, but can’t perform when exposed to new input.
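Overfitting can be illustrated with a toy example that has nothing to do with Gemini itself. The sketch below assumes a made-up dataset (points between 0 and 1 whose labels are randomly flipped a quarter of the time) and a hypothetical "memorizing" model that simply copies the label of the closest training point. The names `make_data`, `predict_1nn` and `accuracy` are invented for this illustration.

```python
import random


def make_data(n, rng, noise=0.25):
    """Return n (x, label) pairs. The true label is x > 0.5,
    but each label is flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label  # simulate noisy training data
        data.append((x, label))
    return data


def predict_1nn(train, x):
    """Memorizing model: copy the label of the closest training point."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]


def accuracy(train, samples):
    """Fraction of samples whose label the memorizing model reproduces."""
    hits = sum(predict_1nn(train, x) == y for x, y in samples)
    return hits / len(samples)


rng = random.Random(0)        # fixed seed so the run is repeatable
train = make_data(200, rng)   # data the model is allowed to memorize
test = make_data(200, rng)    # fresh data the model has never seen

train_acc = accuracy(train, train)
test_acc = accuracy(train, test)
print(f"training accuracy: {train_acc:.2f}")
print(f"held-out accuracy: {test_acc:.2f}")
```

Because the model has memorized the noise along with the signal, it reproduces every training label exactly but does noticeably worse on fresh points drawn the same way. That gap between training and held-out performance is the signature of overfitting.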
Multimodal training also normally involves training different components of a model separately, each on a single type of medium, and then stitching these components together. But Gemini was trained jointly across text, image, audio and video data at the same time. Scientists sourced this data from web documents, books and code.
Scientists trained Gemini by curating the training data and incorporating human supervision in the feedback process.
The team deployed servers across multiple data centers on a much larger scale than previous AI training efforts and relied on thousands of Google’s AI accelerator chips, known as tensor processing units (TPUs).
Google designed these chips specifically to accelerate AI workloads, and the team packaged them into clusters of 4,096 chips known as “SuperPods” before training its system. Together, the reconfigured infrastructure and methods raised goodput from 85% in previous training efforts to 97%, according to the technical report. The report defines goodput as the share of a training job's elapsed time that is spent computing useful new steps, as opposed to time lost to failures and recovery.
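To see why a jump from 85% to 97% matters, it helps to look at what is left over. The short sketch below treats goodput simply as the useful share of some total. The function name `goodput_fraction` and the 100-unit total are assumptions chosen for illustration, not figures from the technical report.

```python
def goodput_fraction(useful, total):
    """Return the useful share of a total as a fraction between 0 and 1."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= useful <= total:
        raise ValueError("useful must be between 0 and total")
    return useful / total


# Hypothetical totals of 100 units, chosen to match the reported percentages.
previous = goodput_fraction(85.0, 100.0)
current = goodput_fraction(97.0, 100.0)

# The wasted share is whatever is not goodput.
wasted_before = 1.0 - previous
wasted_after = 1.0 - current
waste_ratio = wasted_before / wasted_after

print(f"wasted share before: {wasted_before:.0%}")
print(f"wasted share after:  {wasted_after:.0%}")
print(f"waste shrank by a factor of about {waste_ratio:.1f}")
```

Seen this way, the 12-percentage-point gain means the wasted share fell from 15% to 3%, roughly a fivefold reduction.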
DeepMind scientists envision the technology being used in scenarios such as a person uploading photos of a meal being prepared in real time, and Gemini responding with instructions on the next step in the process.
That said, the scientists conceded that hallucinations, a phenomenon in which AI models confidently return false information, remain an issue for Gemini. Hallucinations are normally caused by limitations or biases in the training data, and they’re difficult to eradicate.