Multimodal learning has always been a challenge in artificial intelligence, as different types of data require different approaches for efficient processing, a capability that many machine learning systems have yet to achieve.
However, researchers from the Chinese University of Hong Kong and the Shanghai AI Lab have come up with an innovative solution: the “Meta-Transformer”, a unified AI framework that can handle multiple data modalities using the same set of parameters. Check out the details below!
The human brain is the inspiration for this new approach. It simultaneously processes information from multiple sensory inputs, such as visual, auditory, and tactile signals, and understanding one source helps it interpret another.
However, replicating this capability in the field of AI has been challenging due to the modality gap in deep learning.
(Image: Thinkhubstudio/iStock/playback)
Each data modality has distinct characteristics. Images carry spatial information with heavy redundancy across densely packed pixels. Point clouds are difficult to describe because of their sparse distribution in 3D space.
Audio spectrograms are non-stationary, time-varying patterns. Video, in turn, comprises a series of image frames, capturing both spatial information and temporal dynamics.
Until now, approaches to handling different modalities involved building a separate network for each data type, which meant fine-tuning each model individually. The Chinese researchers have proposed a new way to deal with this complexity.
The Meta-Transformer is composed of three main components: a modality-specialist tokenizer that converts data into token sequences, a modality-shared encoder that extracts representations across modalities, and task-specific heads for downstream tasks.
The framework maps multimodal data into shared token sequences and extracts representations with an encoder whose parameters are frozen, so only the lightweight tokenizers and task heads need to be trained. This straightforward design learns both generic and task-specific representations efficiently.
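To make the three-part design concrete, here is a minimal PyTorch sketch of the idea. The class names, dimensions, and the plain TransformerEncoder standing in for the pretrained backbone are all illustrative assumptions, not the authors' code: per-modality tokenizers project data into a shared token space, the shared encoder stays frozen, and only a lightweight head is task-specific.

```python
# Minimal sketch of the Meta-Transformer idea: per-modality tokenizers feed a
# shared, frozen Transformer encoder, and a lightweight task head is trained
# on top. Names and sizes below are hypothetical, chosen for illustration.
import torch
import torch.nn as nn

EMBED_DIM = 256  # assumed width of the shared token space


class ImageTokenizer(nn.Module):
    """Turns an image into a sequence of patch tokens (assumed patch size)."""
    def __init__(self, patch=16, channels=3, dim=EMBED_DIM):
        super().__init__()
        self.proj = nn.Conv2d(channels, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                          # x: (B, 3, H, W)
        tokens = self.proj(x)                      # (B, dim, H/16, W/16)
        return tokens.flatten(2).transpose(1, 2)   # (B, num_patches, dim)


class AudioTokenizer(nn.Module):
    """Projects spectrogram frames into the same token space."""
    def __init__(self, n_mels=128, dim=EMBED_DIM):
        super().__init__()
        self.proj = nn.Linear(n_mels, dim)

    def forward(self, x):                          # x: (B, time, n_mels)
        return self.proj(x)                        # (B, time, dim)


# Modality-shared encoder with frozen parameters (a plain TransformerEncoder
# stands in for the pretrained backbone described in the paper).
encoder_layer = nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=8, batch_first=True)
shared_encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
for p in shared_encoder.parameters():
    p.requires_grad = False                        # keep the shared encoder frozen

# Task-specific head: only this (and the tokenizers) would be trained.
classifier_head = nn.Linear(EMBED_DIM, 10)         # e.g. a 10-class downstream task

image_tokenizer = ImageTokenizer()
audio_tokenizer = AudioTokenizer()

# Two different modalities pass through the same frozen encoder.
images = torch.randn(2, 3, 224, 224)
audio = torch.randn(2, 100, 128)

for tokens in (image_tokenizer(images), audio_tokenizer(audio)):
    features = shared_encoder(tokens)              # modality-shared representation
    logits = classifier_head(features.mean(dim=1)) # pool tokens, then classify
    print(logits.shape)                            # torch.Size([2, 10])
```

In this sketch, the same frozen encoder produces usable features for both image patches and spectrogram frames; only the small projection layers and the classifier would be updated for a new task.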
The results of the Meta-Transformer experiments were impressive. The framework achieved exceptional performance on multiple datasets spanning 12 different modalities.
This innovative approach points toward a modality-agnostic framework that unifies all types of data and significantly improves multimodal understanding.
With Meta-Transformer, multimodal research is poised to take a big step forward, delivering significant advances in artificial intelligence and machine learning.
The ability to process multiple data modalities with a single, unified framework represents an important milestone on the journey to more powerful and efficient AI.