Workshop on Video Large Language Models (VidLLMs)

Challenge Tracks

  1. Composed Video Retrieval (CoVR)
    Composed Video Retrieval (CoVR) is a challenging computer vision task that aims to retrieve target videos from a large database using a combination of a visual query (an image or a video) and a modification text that specifies how the target video differs from the query. It builds on the principles of Composed Image Retrieval (CoIR), where an image and a textual description are used together to search for relevant images. CoVR emphasizes the need to integrate textual descriptions with visual information to refine the search, capturing subtle changes and contextual nuances that cannot be easily expressed through visual data alone. The task requires models that can understand and combine these multi-modal inputs to produce accurate, contextually appropriate video embeddings. Existing approaches rely primarily on the modification text to guide retrieval, but they often struggle to preserve the full context and specificity of the target video. The challenge lies in learning discriminative representations that align visual and textual features so that the retrieved videos accurately match the modifications described in the query. CoVR remains an active area of research, pushing the boundaries of multi-modal understanding and retrieval in large-scale video datasets.
    For this task, we use video pairs from the WebVid-CoVR dataset (Ventura et al.), together with our own improved modification texts and captions for this challenge track. The training set is composed of synthetically generated triplets, each consisting of an input video, a change text, and a target video, while the test set is manually curated with the assistance of a model-in-the-loop process. The change text in each triplet is created by comparing the captions of the input and target videos with a large language model (LLM), capturing the differences between the two videos. The training set includes 131,000 unique videos and 467,000 distinct change texts; each video is linked to an average of 12.7 triplets, and the average change text is 4.8 words long. WebVid-CoVR also provides validation and test sets derived from the WebVid10M corpus: the validation set contains 7,000 triplets, and the test set comprises 3,200 manually curated triplets to ensure high quality. A minimal retrieval-and-scoring sketch follows this track's links below.
    Reference Paper: arXiv
    Errata: We thank the authors of the CoVR paper for curating the original video pairs and apologize for omitting a reference to their work. Please also cite their papers if you use the dataset: CoVR, CoVR-2

    CoVR Challenge Website (eval.ai)
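    The retrieval pipeline described above can be summarized in a short sketch. The snippet below is a minimal illustration, not the official baseline: it assumes precomputed, L2-normalized video and text embeddings from any vision-language encoder, fuses them with a simple convex combination in place of a learned fusion module, and scores the result with Recall@K, the metric commonly reported for CoVR. The function names and toy data are illustrative only.

        import torch
        import torch.nn.functional as F

        def compose_query(video_emb: torch.Tensor, text_emb: torch.Tensor,
                          alpha: float = 0.5) -> torch.Tensor:
            """Fuse the query-video embedding with the modification-text embedding.
            A convex combination stands in for the learned fusion module that
            actual CoVR models train end to end (illustrative assumption)."""
            v = F.normalize(video_emb, dim=-1)
            t = F.normalize(text_emb, dim=-1)
            return F.normalize(alpha * v + (1 - alpha) * t, dim=-1)

        def rank_gallery(query: torch.Tensor, gallery: torch.Tensor) -> torch.Tensor:
            """Rank all gallery videos by cosine similarity to the composed query.
            `gallery` holds one L2-normalized embedding per candidate video (N, d)."""
            return (gallery @ query).argsort(descending=True)

        def recall_at_k(ranking: torch.Tensor, target_idx: int, k: int) -> float:
            """1.0 if the annotated target video appears in the top-k results."""
            return float(target_idx in ranking[:k].tolist())

        # Toy usage with random embeddings standing in for a real video/text encoder.
        d, n_gallery = 512, 1000
        video_emb, text_emb = torch.randn(d), torch.randn(d)
        gallery = F.normalize(torch.randn(n_gallery, d), dim=-1)
        ranking = rank_gallery(compose_query(video_emb, text_emb), gallery)
        print("Recall@10 for target index 42:", recall_at_k(ranking, 42, k=10))

    In practice, CoVR systems learn the fusion (and often the encoders themselves) from the training triplets; the fixed combination above is only a placeholder for that component.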

  2. Complex Video Reasoning & Robustness Evaluation
    Our Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES) is a comprehensive benchmark designed to assess the reasoning and robustness capabilities of Video Large Language Models (Video-LLMs) in real-world scenarios. Unlike existing video benchmarks that focus primarily on simple video comprehension, CVRR-ES evaluates models on their ability to handle intricate reasoning tasks and to respond robustly to diverse user prompts across a range of real-world contexts. The benchmark comprises 11 diverse real-world video category dimensions: multiple actions in a single video, fine-grained action understanding, partial actions, time-order understanding, non-existent actions with existent scene depictions, non-existent actions with non-existent scene depictions, continuity and object instance count, unusual and physically anomalous activities, interpretation of social context, understanding of emotional context, and interpretation of visual context. Overall, it consists of 2,400 open-ended question-answer (QA) pairs spanning 214 unique videos. The average video duration is 22.3 seconds, with maximum and minimum durations of 183 and 2 seconds, respectively.
    Evaluating Video-LLMs on the CVRR-ES benchmark has revealed several unique insights into the current state of video reasoning and robustness in AI models. Most notably, open-source Video-LLMs tend to struggle with complex reasoning tasks, often exhibiting over-affirmative behavior by agreeing with user prompts even when they are misleading or incorrect. These models also tend to complete partial actions, inaccurately assuming the entire sequence has occurred when only a fragment is shown. The benchmark further highlights a critical limitation in the generalization ability of Video-LLMs when confronted with out-of-distribution (OOD) videos containing unusual or rare activities. These observations underline the need for training techniques that focus on nuanced reasoning, robust interaction with user prompts, and improved handling of real-world complexities. The CVRR-ES benchmark provides a challenging and diverse testbed for pushing the boundaries of Video-LLMs in complex video reasoning and robustness evaluation; a minimal evaluation-loop sketch follows this track's links below.
    Reference Paper: arXiv

    CVRR Challenge Website (eval.ai)
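    Since CVRR-ES consists of open-ended QA pairs, submissions are typically scored by comparing free-form model answers against the reference answers, often with an LLM acting as judge. The sketch below is a hypothetical harness, not the official evaluation script: the annotation field names, the answer_fn wrapper around the model under test, and the judge_fn correctness check are all assumptions.

        import json
        from collections import defaultdict
        from typing import Callable

        def evaluate_open_ended(qa_file: str,
                                answer_fn: Callable[[str, str], str],
                                judge_fn: Callable[[str, str, str], bool]) -> dict:
            """Run a Video-LLM over open-ended QA pairs and report accuracy per
            category dimension. `answer_fn(video_path, question)` wraps the model
            under test; `judge_fn(question, reference, prediction)` decides whether
            a free-form answer is correct (often an LLM judge in practice)."""
            with open(qa_file) as f:
                qa_pairs = json.load(f)                   # assumed: a list of annotation dicts

            per_category = defaultdict(lambda: [0, 0])    # category -> [num correct, total]
            for item in qa_pairs:
                prediction = answer_fn(item["video"], item["question"])
                correct = judge_fn(item["question"], item["answer"], prediction)
                per_category[item["category"]][0] += int(correct)
                per_category[item["category"]][1] += 1

            return {cat: c / t for cat, (c, t) in per_category.items()}

    Per-category accuracies can then be averaged if a single summary score is desired.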

  3. Multilingual Video Reasoning Evaluation
    The Multilingual Video Reasoning Evaluation is a comprehensive benchmark under preparation to assess the multilingual capabilities of Video Large Language Models (Video-LLMs). While several multilingual evaluation sets exist for multimodal LLMs, diverse and truly multilingual benchmarks for Video-LLMs remain scarce. This work will introduce the first such evaluation suite for Video-LLMs, spanning 13 languages, including Chinese, German, Japanese, Russian, Spanish, Swedish, Arabic, French, Hindi, Sinhala, and Tamil. The test set will consist of a variety of question types, such as open-ended questions (with both short and long answers) and multiple-choice questions, which are currently rare in Video-LLM evaluations. It draws data from four existing benchmarks: Video-MME, VCGBench-Diverse, CVRR-ES, and MV-Bench, totaling 6,500 English question-answer pairs. These pairs are being translated into the 13 target languages and verified by humans, yielding a total test set of 84,500 samples.
    To promote fair multilingual evaluation, the benchmark will introduce 10 cultural categories for each language, covering Lifestyle and Festivals, Food & Cuisine, Sports, Architecture, Education, Heritage, Media, Agriculture, Landmarks, Art, and History. Initial results show that current Video-LLMs, both open-source and proprietary, struggle to capture cultural nuances in local languages, underscoring the need for culturally aware Video-LLMs that handle multilingual inputs effectively. The Multilingual Video Reasoning Evaluation benchmark extends the multilingual scope of current Video-LLM evaluation, providing a standard for testing the cultural robustness of state-of-the-art models; a minimal per-language scoring sketch follows this track's link below.

    MVRE Challenge Website (eval.ai) - coming soon
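    A minimal aggregation sketch for the multilingual setting is shown below. It assumes per-sample results already tagged with language and cultural category (the field names are illustrative, not the benchmark's official schema) and verifies the arithmetic behind the test-set size: 6,500 English QA pairs translated into 13 target languages gives 84,500 samples.

        from collections import defaultdict

        def accuracy_by_language_and_category(results: list[dict]) -> dict:
            """Aggregate per-sample correctness into accuracy per (language, category).
            Each result is assumed to look like
            {"language": "Hindi", "category": "Food & Cuisine", "correct": True}."""
            buckets = defaultdict(lambda: [0, 0])     # (language, category) -> [correct, total]
            for r in results:
                key = (r["language"], r["category"])
                buckets[key][0] += int(r["correct"])
                buckets[key][1] += 1
            return {key: c / t for key, (c, t) in buckets.items()}

        # Size check: 6,500 English QA pairs x 13 target languages = 84,500 samples.
        assert 6_500 * 13 == 84_500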

Each track tests a different aspect of VidLLMs. Submit your solutions via the corresponding EvalAI links: CoVR Challenge, CVRR Challenge, and MVRE (coming soon).

Amazon Challenge Prize

Prizes will be awarded across the three tracks. Winners will be decided by leaderboard ranking and review.

Key Dates