Challenges in Localized Captioning for Vision-Language Models
Describing specific regions within images or videos remains a persistent challenge in vision-language modeling. While general-purpose vision-language models (VLMs) perform well at generating global captions, they often fall short in producing detailed, region-specific descriptions. These limitations are amplified in video data, where models must account for temporal…
Starting today, Gemini Advanced users can generate and share videos using our state-of-the-art video model, Veo 2. In Gemini, you can now translate text-based prompts into dynamic videos. Google Labs is also making Veo 2 available through Whisk, a generative AI experiment that allows you to create new images using both text and image prompts,…
Designing intelligent systems that function reliably in dynamic physical environments remains one of the more difficult frontiers in AI. While significant advances have been made in perception and planning within simulated or controlled contexts, the real world is noisy, unpredictable, and resistant to abstraction. Traditional AI systems often rely on high-level representations detached from their…
The recent launch of the DeepSeek-R1 model sent ripples across the global AI community. It delivered breakthroughs on par with the reasoning models from Meta and OpenAI, developed in a fraction of the time and at significantly lower cost.
Beyond the headlines and online buzz, how can we assess the model’s reasoning abilities…
Despite rapid advances in vision-language modeling, much of the progress in this field has been shaped by models trained on proprietary datasets, often relying on distillation from closed-source systems. This reliance creates barriers to scientific transparency and reproducibility, particularly for tasks involving fine-grained image and video understanding. Benchmark performance may reflect the training data and…
Today we are rolling out an early version of Gemini 2.5 Flash in preview through the Gemini API via Google AI Studio and Vertex AI. Building upon the popular foundation of 2.0 Flash, this new version delivers a major upgrade in reasoning capabilities, while still prioritizing speed and cost. Gemini 2.5 Flash is our first…
Marine robotic platforms support various applications, including marine exploration, underwater infrastructure inspection, and ocean environment monitoring. While reliable perception systems enable robots to sense their surroundings, detect objects, and navigate complex underwater terrains independently, developing these systems presents unique difficulties compared to their terrestrial counterparts. Collecting real-world underwater data requires complex hardware, controlled experimental setups,…
Deploying your Large Language Model (LLM) is not necessarily the final step in productionizing your Generative AI application. An often forgotten yet crucial part of the MLOps lifecycle is properly load testing your LLM and ensuring it is ready to withstand your expected production traffic. Load testing at a high level is the practice of…
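As a rough illustration of the idea, a minimal load test fires many concurrent requests at the model and records per-request latency and overall throughput. The sketch below is self-contained: `call_llm` is a hypothetical stand-in (a sleep simulating inference latency) that you would replace with your actual endpoint client; the harness around it is the part that generalizes.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM endpoint call; swap in your client SDK here."""
    time.sleep(0.05)  # simulate network + inference latency
    return f"response to: {prompt}"

def load_test(num_requests: int = 100, concurrency: int = 10) -> dict:
    """Send num_requests prompts with `concurrency` parallel workers,
    collecting per-request latency and overall throughput."""
    latencies = []

    def timed_call(i: int) -> None:
        start = time.perf_counter()
        call_llm(f"prompt {i}")
        latencies.append(time.perf_counter() - start)

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # map() blocks until every request has completed
        list(pool.map(timed_call, range(num_requests)))
    wall = time.perf_counter() - wall_start

    latencies.sort()
    return {
        "throughput_rps": num_requests / wall,
        "p50_s": latencies[len(latencies) // 2],
        "p95_s": latencies[min(int(len(latencies) * 0.95), len(latencies) - 1)],
    }

if __name__ == "__main__":
    print(load_test())
```

In practice you would ramp `concurrency` upward until p95 latency or error rate crosses your service-level threshold; dedicated tools such as Locust wrap this same loop with ramp-up schedules and reporting.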