Meta has announced the release of five new research models from its Fundamental AI Research (FAIR) team. They include an image-and-text generation model, a text-to-music generation model, a multi-token prediction model, and a technique for detecting AI-generated speech.
The first is the release of the key components of its Chameleon model.
“Chameleon is a family of mixed-modal models that can process and deliver both image and text at the same time,” read the official statement on the website.
What makes Chameleon different from conventional LLMs, which produce only unimodal results, is its ability to take any combination of text and images as both input and output. This could be used to generate captions for images, or to combine text prompts and images to create entirely new scenes.
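Conceptually, Chameleon's early-fusion design represents images as discrete tokens and interleaves them with text tokens in a single sequence, so one model can both read and write either modality. The toy sketch below illustrates that idea only; every token id, vocabulary size, and helper name here is made up and is not Chameleon's actual tokenisation scheme.

```python
# Toy illustration of a mixed-modal token sequence in the Chameleon style:
# images are mapped to discrete tokens and interleaved with text tokens in
# one stream. All ids, markers, and sizes below are made up.
from typing import List

TEXT_VOCAB = 32000                     # hypothetical text vocabulary size
IMAGE_VOCAB = 8192                     # hypothetical image-token codebook size
BOI = TEXT_VOCAB + IMAGE_VOCAB         # begin-of-image marker
EOI = TEXT_VOCAB + IMAGE_VOCAB + 1     # end-of-image marker

def image_to_tokens(patch_codes: List[int]) -> List[int]:
    """Wrap discrete image codes (e.g. from a VQ-style tokenizer) with markers
    and shift them past the text vocabulary so both share one id space."""
    return [BOI] + [TEXT_VOCAB + c for c in patch_codes] + [EOI]

def build_sequence(text_tokens: List[int], patch_codes: List[int]) -> List[int]:
    """Interleave a caption and an image into a single token sequence that
    one decoder can consume, and also emit, in any order."""
    return text_tokens + image_to_tokens(patch_codes)

# Fake text ids for a prompt, plus four fake image codes.
sequence = build_sequence([101, 2456, 789, 12], [5, 731, 42, 900])
print(sequence)
```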
Each model comes with its own licence restrictions; the Chameleon models are accessible under a research-only licence.
The second release, a multi-token prediction model, is aimed at quicker code completion and smoother text prediction (in a simplified sense, the kind of suggestions that appear above the keyboard while typing a message on WhatsApp).
Training LLMs to predict only the next word may be simple and scalable, but it remains inefficient: models trained this way require far larger amounts of text than children need to reach the same degree of language fluency.
Meta introduced the multi-token prediction approach, which trains models to predict several future words at once, in April, replacing the earlier one-word-at-a-time scheme. This model, too, has been released under a non-commercial, research-only licence.
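As an illustration of the technique (not Meta's released code), here is a minimal PyTorch sketch with made-up layer sizes: a shared trunk encodes the context once, several small output heads each predict a different future token, and their losses are summed during training.

```python
import torch
import torch.nn as nn

class MultiTokenPredictor(nn.Module):
    """Toy multi-token prediction model: one shared causal trunk feeding
    one output head per future token offset. Sizes are illustrative only."""

    def __init__(self, vocab_size=32000, d_model=256, n_future=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=2)
        # One linear head per future offset (+1, +2, ..., +n_future).
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(n_future)])

    def forward(self, tokens):
        x = self.embed(tokens)
        # Causal mask so each position only attends to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.trunk(x, mask=mask)                 # (batch, seq, d_model)
        return [head(h) for head in self.heads]      # one logits tensor per head

def multi_token_loss(logits_per_head, tokens):
    """Sum the cross-entropy of head k against the token k steps ahead."""
    loss = 0.0
    for k, logits in enumerate(logits_per_head, start=1):
        pred = logits[:, :-k, :]                     # positions with a target k ahead
        target = tokens[:, k:]
        loss = loss + nn.functional.cross_entropy(
            pred.reshape(-1, pred.size(-1)), target.reshape(-1))
    return loss

# Usage: one training step on random token ids.
model = MultiTokenPredictor()
batch = torch.randint(0, 32000, (2, 64))
loss = multi_token_loss(model(batch), batch)
loss.backward()
```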
JASCO is another of the models, aimed at music composition. Earlier text-to-music models such as MusicGen rely mainly on text inputs to generate music. Meta's new model, JASCO, can additionally accept various conditioning inputs, such as chords or beats, giving greater control over the generated output. This means one can use instrumental tunes alongside textual conditions to generate music, incorporating both symbolic and audio signals within the same text-to-music generation model.
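To make the idea of multiple conditioning inputs concrete, here is a purely hypothetical sketch of how a text prompt might be bundled with symbolic and audio conditions; none of the field names or structures below come from JASCO's actual interface.

```python
# Purely illustrative: pairing a text prompt with symbolic conditions (chords,
# beats) and an optional audio stem for a controllable text-to-music model.
import numpy as np

sample_rate = 32000

conditions = {
    "text": "laid-back lo-fi hip hop with warm keys",
    # Symbolic condition: chord symbols with start times in seconds.
    "chords": [("Am7", 0.0), ("Dm7", 2.0), ("G7", 4.0), ("Cmaj7", 6.0)],
    # Symbolic condition: beat positions (seconds) defining the groove.
    "beats": [i * 0.5 for i in range(16)],
    # Optional audio condition: an instrumental stem as a raw waveform.
    "melody_stem": np.zeros(8 * sample_rate, dtype=np.float32),
}

print(f"{len(conditions['chords'])} chords, {len(conditions['beats'])} beats, "
      f"{conditions['melody_stem'].shape[0] / sample_rate:.0f}s audio stem")
```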
Meta also released AudioSeal, which aids the detection of AI-generated speech. Described as the first audio watermarking technique designed for localised detection, it makes it possible to pinpoint AI-generated segments within a longer audio clip. This localised detection approach is up to 485 times faster than previous methods, making AudioSeal suitable for large-scale and real-time applications. It is the only one of the five recent releases to have been made available under a commercial licence.
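As a rough illustration, the open-source audioseal package can be used along the following lines; the model card names and return values here follow the public repository's documentation and are assumptions rather than details from this announcement.

```python
# Sketch of watermarking and detecting speech with the audioseal package
# (https://github.com/facebookresearch/audioseal). Model names and return
# values follow the public README and may differ across versions.
import torch
from audioseal import AudioSeal

# Load the watermark generator and detector (16-bit message variants).
generator = AudioSeal.load_generator("audioseal_wm_16bits")
detector = AudioSeal.load_detector("audioseal_detector_16bits")

# Placeholder audio: 5 seconds of 16 kHz mono, shaped (batch, channels, samples).
sample_rate = 16000
audio = torch.randn(1, 1, 5 * sample_rate)

# Embed an imperceptible watermark into the waveform.
watermark = generator.get_watermark(audio, sample_rate)
watermarked = audio + watermark

# Detection returns a probability that the clip is watermarked plus the
# decoded 16-bit message; frame-level scores enable localised detection.
probability, message = detector.detect_watermark(watermarked, sample_rate)
print(f"Watermark probability: {probability:.3f}")
```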
Meta also released additional responsible-AI artefacts aimed at increasing diversity in text-to-image generation systems. It has developed automatic indicators to evaluate potential geographical disparities in text-to-image models, with the goal of improving how geographical and cultural preferences are represented. To better understand how perceptions of geographic representation vary, Meta also conducted a large-scale annotation study, collecting more than 65,000 annotations and more than twenty survey responses per example.
In the coming months, Meta is expected to introduce newer capabilities, including longer context windows, additional model sizes, and enhanced performance, alongside the Llama 3 research paper.