Sep 25, 2024

Lessons from Using Google Gemini for Video Analysis

Introduction

The rise of multi-modal models like Gemini has ushered in a new era of LLMs that seamlessly integrate video, text, images, and other inputs. As we’ve been using Gemini at Decipher AI for session replay analysis, we’ve picked up several insights that have significantly improved our results.

In this post, we'll delve into our discoveries and provide actionable tips to help you maximize the effectiveness of multi-modal models in your projects.

Slow it down!

Video quality plays a crucial role in a model's ability to interpret visual content accurately. Some models, including Google Gemini, sample frames at a rate of 1 frame per second (FPS); for comparison, a typical movie plays at 24 FPS. While 1 FPS works well for slower-paced content, it can lead to significant loss of detail in fast-moving sequences, and that detail is essential to the LLM's ability to understand what's happening in the video.

Our solution? Slow down the videos. This simple adjustment allows the model to capture more frames during critical moments, resulting in more accurate analysis. For instance, when examining a user interaction video showcasing a product page load, reducing the speed can reveal subtle errors like delayed UI element rendering that might otherwise go unnoticed.
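
As a rough sketch of what this pre-processing can look like, here is one way to slow a clip down with ffmpeg before uploading it. The file names and the 2x factor are illustrative choices, not tuned values from our pipeline:

```python
import subprocess

def slow_down(src: str, dst: str, factor: float = 2.0) -> None:
    """Stretch the video timeline so a ~1 FPS sampler sees more of the original frames.

    A factor of 2.0 doubles the duration (half speed). Audio is dropped (-an)
    since session replays rarely need it.
    """
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", src,
            "-filter:v", f"setpts={factor}*PTS",
            "-an", dst,
        ],
        check=True,
    )

# Hypothetical usage: double the duration of a fast page-load sequence.
slow_down("replay.mp4", "replay_slow.mp4", factor=2.0)
```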

Key Takeaway: Adjusting video speed enables multi-modal models to perform more accurate analysis by capturing crucial details in fast-paced content.

The Power of Short Form

When it comes to video length, we've found that brevity is important. While lengthy videos offer a wealth of information, they can overwhelm models, leading to less precise inferences and hallucinations. Shorter, more focused videos allow the model to concentrate on specific moments without the distraction of extraneous context.

For example, when diagnosing errors, submitting a concise 30-second clip of the issue is far more effective than a 10-minute video of the entire user journey. This targeted approach helps the model zero in on the problem, resulting in faster and more accurate diagnoses.
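
As a small sketch of that workflow, here is an ffmpeg-based helper for cutting a focused clip out of a longer session. The file names and timestamps are hypothetical:

```python
import subprocess

def extract_clip(src: str, dst: str, start: str, duration: int = 30) -> None:
    """Cut a short, focused clip (e.g. the moment an error appears).

    -ss before -i seeks to the start time; -c copy avoids re-encoding,
    at the cost of cut points snapping to the nearest keyframe.
    """
    subprocess.run(
        [
            "ffmpeg", "-y", "-ss", start, "-i", src,
            "-t", str(duration), "-c", "copy", dst,
        ],
        check=True,
    )

# Hypothetical usage: pull the 30 seconds around a failed checkout.
extract_clip("full_session.mp4", "error_clip.mp4", start="00:04:30")
```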

Key Takeaway: Aim for videos under two minutes to maximize the model's ability to deliver targeted insights.

Using Timestamps for Precision

One of the most powerful features of multi-modal models is their ability to associate external context with specific video moments. To fully leverage this capability, it's crucial to format timestamps in a way the model understands, which depends on how the model represents video frames. For instance, Google's models expect timestamps in the MM:SS format (minutes:seconds).

By referencing exact times (e.g., "02:15") when asking the model to analyze a specific point in a video, you ensure that it focuses on the precise moment of interest.
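
For illustration, here is a hypothetical example of building a prompt anchored to MM:SS timestamps. The event time and prompt wording are made up:

```python
def mmss(seconds: int) -> str:
    """Format an offset in seconds as MM:SS, the timestamp format Gemini expects."""
    return f"{seconds // 60:02d}:{seconds % 60:02d}"

# Hypothetical: the moment of interest is 135 seconds into the clip.
error_at = 135
prompt = (
    f"At {mmss(error_at)} the user clicks 'Checkout'. "
    f"Describe what changes on screen between {mmss(error_at)} and {mmss(error_at + 5)}, "
    f"and note any UI elements that fail to render."
)
```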

Key Takeaway: Use accurate timestamps to guide the model's attention to specific video moments for more precise analysis.

Navigating Model Quirks

Each multi-modal model has its own idiosyncrasies, and learning to navigate these can significantly improve your results. For instance, we discovered that when working with video input with Gemini, placing the text prompt after the video rather than before often yields better results. It’s possible this lets the model process the visual content before considering the textual context.
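
As a minimal sketch of that ordering, here is what a request can look like with the google-generativeai Python SDK (as of late 2024). The API key, model name, file name, and prompt are placeholders:

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the clip via the File API and wait for processing to finish.
video = genai.upload_file(path="error_clip.mp4")
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-pro")

# Order matters: pass the video part first, then the text prompt.
response = model.generate_content([
    video,
    "Summarize the user's actions and flag any errors or UI glitches.",
])
print(response.text)
```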

Key Takeaway: Read all documentation and experiment with the little things!

Conclusion

Maximizing the potential of multi-modal models like Gemini requires thoughtful implementation and experimentation. By slowing down fast-paced content, keeping videos concise, utilizing precise timestamps, and understanding model-specific quirks, you can significantly enhance the accuracy and effectiveness of your multi-modal analysis. These strategies have proven invaluable in our work at Decipher AI, and we encourage you to adopt similar practices to achieve more accurate and efficient outcomes with multi-modal models.

Curious to see how Decipher AI puts these multi-modal AI techniques to work to catch hidden bugs and user frustrations in your product? Reach out to me at michael (at) getdecipher.com or schedule a demo.
