Vision Large Language Models (VLLMs) are rapidly reshaping how machines perceive, reason about, and communicate with the visual world. Unlike conventional vision systems that primarily map images or videos to fixed labels, VLLMs tightly integrate visual perception with language-driven reasoning, enabling open-ended recognition, visual grounding, semantic explanation, and multi-step decision making across diverse tasks. Driven by recent breakthroughs in large-scale multimodal pretraining and instruction tuning, VLLMs have achieved remarkable progress and are increasingly deployed in real-world applications. In this survey, we present a comprehensive and systematic review of Vision Large Language Models, covering their foundations, methodological developments, and open challenges. Specifically, we organize the survey as follows: (1) we first introduce the background and motivation behind VLLMs, outlining their relationship to prior vision and vision–language models; (2) we then summarize key development stages and major paradigm transitions in VLLM research; (3) we review and categorize representative VLLM architectures based on their design choices and learning objectives; (4) we curate and analyze widely used datasets; (5) we consolidate commonly adopted benchmarking protocols and evaluation metrics; and (6) finally, we discuss critical challenges and outline research directions.