General information

Abstract: Vision-language models (VLMs) are widely evaluated for their cross-modal understanding using handcrafted datasets with contrastive image-caption pairs. While effective for controlled comparisons, such evaluations rely on small candidate sets and limited negative examples, potentially obscuring systematic failure modes under more realistic retrieval settings. In this work, we introduce a post-retrieval analysis framework that evaluates VLMs by inspecting their top-K retrievals from large candidate pools, providing a more diagnostic view of model behavior. We evaluate five VLMs (CLIP, BLIP-2, FLAVA, SigLIP2, and a finetuned Qwen2.5-VL) on the SVO-Probes benchmark, targeting image–text relations such as subjects, verbs, and objects. Our evaluation combines standard retrieval metrics, a novel semantic-similarity metric, human judgments, and large vision–language model based assessments to account for incomplete annotations and semantically valid alternatives. Our results indicate that while VLMs achieve high pairwise accuracy and image–text matching accuracy, they struggle in top-K retrieval, particularly with actions and relational content (human evaluation success rate at 1 ≈ 70%). The o3 assessments closely align with human judgments, enabling scalable evaluation. Overall, our results highlight important limitations of current benchmarks and demonstrate the value of post-retrieval analysis for diagnosing robustness and semantic sensitivity in vision–language models.

Paper: currently under review.

The data are available upon request for research purposes only.

This research was supported by the EU NextGenerationEU through the Recovery and Resilience Plan for Slovakia under project No. 09I01-03-V04-00007, DisAI-AMPLIFIED.

We randomly selected 100 samples from SVO-Probes. We used both the images and the texts of these samples for retrieval with all five of the mentioned VLMs.
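The post-retrieval setup described above can be sketched as follows: embed the query, rank every candidate in the pool by cosine similarity, and keep the top-K for inspection. This is a minimal sketch only; the function name and the random toy embeddings are illustrative, and the actual framework uses each VLM's own image and text encoders.

```python
import numpy as np

def top_k_retrieval(query_emb, candidate_embs, k=10):
    """Rank candidates by cosine similarity to the query; return top-k indices."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    sims = c @ q                  # cosine similarity of each candidate to the query
    return np.argsort(-sims)[:k]  # indices of the k most similar candidates

# Toy usage with random 4-dimensional embeddings and a pool of 5 candidates.
rng = np.random.default_rng(0)
query = rng.normal(size=4)
pool = rng.normal(size=(5, 4))
print(top_k_retrieval(query, pool, k=2))
```

The retrieved indices map back to captions (for image queries) or images (for text queries), whose top-K lists are then judged by humans and LVLMs as described below.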
For each model there are two JSON files, and each file contains entries that look like this (from BLIP2_selected_samples_text_retrieval_evaluation_o3_experiment.json):

```json
{
  "221": {
    "BLIP2 retrieved captions": [
      "Girl sits on a ball.",
      "A girl sits on a ball.",
      "A girl sitting on a ball.",
      "girl sits on ball",
      "A woman is sitting on a ball.",
      "the girl sits in the pool",
      "The girl can sit with no background.",
      "woman, ball, outside",
      "A girl sits on a soccer ball.",
      "One girl enjoy in a beach."
    ],
    "human evaluation": [
      "correct", "correct", "correct", "correct", "correct",
      "object incorrect", "correct", "correct", "object incorrect", "object incorrect"
    ],
    "GPT evaluation": [
      "correct", "correct", "correct", "correct", "subject incorrect",
      "object incorrect", "object incorrect", "object incorrect", "object incorrect", "object incorrect"
    ],
    "human evaluation 2": [
      "1", "1", "1", "1", "1", "-1", "1", "-1", "-1", "-1"
    ],
    "human evaluation 3": [
      "1", "1", "1", "1", "1", "-1", "1", "-1", "-1", "-1"
    ],
    "o3 evaluation": [
      "correct", "correct", "correct", "correct", "correct",
      "object incorrect", "correct", "object incorrect", "object incorrect", "object incorrect"
    ]
  },
  ...
}
```

Here "221" is the ID of the image used as the query for retrieval, and "BLIP2 retrieved captions" contains the top 10 captions retrieved for that query. "human evaluation", "GPT evaluation", and "o3 evaluation" each contain 10 labels, one per retrieved caption, assigned by a human annotator or by an LVLM. The labels are "correct", "subject incorrect", "verb incorrect", or "object incorrect". The _text_retrieval_ files additionally contain "human evaluation 2" and "human evaluation 3", produced by two further annotators who evaluated the retrieved samples; there the labels are "1", "0", or "-1".
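Given files in this layout, per-query success at K and human–LVLM agreement can be computed directly from the label lists. A minimal sketch follows; in practice a whole file would be read with `json.load`, the entry below is abbreviated to two captions from the example above, and the helper names are illustrative, not part of the released code.

```python
import json

# Abbreviated two-caption entry in the file layout described above.
SAMPLE = json.loads("""
{
  "221": {
    "BLIP2 retrieved captions": ["Girl sits on a ball.", "the girl sits in the pool"],
    "human evaluation": ["correct", "object incorrect"],
    "o3 evaluation": ["correct", "object incorrect"]
  }
}
""")

def success_at_k(entries, label_key="human evaluation", k=1):
    """Fraction of queries whose top-k retrieved captions include a 'correct' label."""
    hits = sum("correct" in entry[label_key][:k] for entry in entries.values())
    return hits / len(entries)

def label_agreement(entries, key_a="human evaluation", key_b="o3 evaluation"):
    """Per-caption agreement rate between two annotation columns."""
    pairs = [(a, b) for e in entries.values() for a, b in zip(e[key_a], e[key_b])]
    return sum(a == b for a, b in pairs) / len(pairs)

print(success_at_k(SAMPLE, k=1))  # 1.0 here: the top-1 caption is labeled "correct"
print(label_agreement(SAMPLE))    # 1.0 here: human and o3 agree on both captions
```

The same two helpers apply to the "1"/"0"/"-1" columns of the text-retrieval files by swapping the target label ("1" instead of "correct") in the membership test.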