After announcing Gemma 2 at I/O 2024 in May, Google today introduced PaliGemma 2, its latest open vision-language model (VLM).
The first version of PaliGemma launched in May for use cases like captioning images and short videos, understanding text in images, object detection, object segmentation, and “visual question answering.”
PaliGemma 2 now touts “long captioning,” with the ability to generate “detailed, contextually relevant captions for images, going beyond simple object identification to describe actions, emotions, and the overall narrative of the scene.” It is available in 3B, 10B, and 28B parameter sizes, and in 224px, 448px, and 896px input resolutions.
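To give a sense of how developers would try the captioning capability, here is a minimal sketch using the Hugging Face transformers library. The checkpoint name and the “caption en” task prefix are assumptions carried over from the original PaliGemma’s conventions, not details confirmed in this announcement.

```python
# Minimal captioning sketch with Hugging Face transformers.
# The checkpoint id below is an assumption (3B parameters, 224px input).
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-3b-pt-224"  # assumed checkpoint name
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("photo.jpg")  # placeholder path to a local image
# PaliGemma-style task prefix; the exact prompt format may differ for PaliGemma 2.
prompt = "<image>caption en"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, i.e. the caption itself.
caption = processor.decode(output[0][inputs["input_ids"].shape[-1]:],
                           skip_special_tokens=True)
print(caption)
```

Swapping to a larger size or higher resolution would, under these assumptions, just mean changing the checkpoint id (e.g. a 10B or 448px variant).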
There’s also “accurate optical character recognition and understanding the structure and content of tables in documents.” Google has found PaliGemma 2 to offer leading performance in chemical formula recognition, music score recognition, spatial reasoning, and chest X-ray report generation.
Google says PaliGemma 2 is designed to be a “drop-in replacement” for those using the original model. Developers should benefit from “immediate performance gains on most tasks without major code modifications.” Another touted benefit is the ease of fine-tuning it for specific tasks.
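As a rough illustration of both claims, an existing PaliGemma fine-tuning loop should carry over with little more than the checkpoint id changed. The sketch below assumes the Hugging Face Trainer workflow and the PaliGemma processor’s suffix argument for labels, following the original PaliGemma fine-tuning recipe; the checkpoint id and the toy dataset are placeholders.

```python
# Minimal fine-tuning sketch. Checkpoint id and dataset are placeholders;
# the suffix-based labeling follows the original PaliGemma recipe.
import torch
from PIL import Image
from transformers import (AutoProcessor, PaliGemmaForConditionalGeneration,
                          Trainer, TrainingArguments)

model_id = "google/paligemma2-3b-pt-224"  # assumed PaliGemma 2 checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16)

# Toy stand-in for a real dataset of image/caption pairs.
train_data = [{"image": Image.new("RGB", (224, 224)), "caption": "a blank image"}]

def collate(examples):
    texts = ["<image>caption en" for _ in examples]   # task prefix per example
    labels = [ex["caption"] for ex in examples]       # target captions
    images = [ex["image"] for ex in examples]
    batch = processor(text=texts, images=images, suffix=labels,
                      return_tensors="pt", padding="longest")
    return batch.to(torch.bfloat16)  # casts only float tensors (pixel_values)

trainer = Trainer(
    model=model,
    train_dataset=train_data,
    data_collator=collate,
    args=TrainingArguments(output_dir="paligemma2-ft", num_train_epochs=1,
                           per_device_train_batch_size=1, bf16=True,
                           remove_unused_columns=False),
)
trainer.train()
```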
Pre-trained models and code for PaliGemma 2 are available today on Kaggle, Hugging Face, and Ollama.