Vision Large Language Models (VLLMs) are rapidly reshaping how machines perceive, reason about, and communicate with the visual world. Unlike conventional vision systems that primarily map images or videos to fixed labels, VLLMs tightly integrate visual perception with language-driven reasoning, enabling open-ended recognition, visual grounding, semantic explanation, and multi-step decision making across diverse tasks. Driven by recent breakthroughs in large-scale multimodal pretraining and instruction tuning, VLLMs have achieved remarkable progress and are increasingly deployed in real-world applications. In this survey, we present a comprehensive and systematic review of Vision Large Language Models, covering their foundations, methodological developments, and open challenges. Specifically, we organize the survey as follows: (1) we first introduce the background and motivation behind VLLMs, outlining their relationship to prior vision and vision–language models; (2) we then summarize key development stages and major paradigm transitions in VLLM research; (3) we review and categorize representative VLLM architectures based on their design choices and learning objectives; (4) we curate and analyze widely used datasets; (5) we consolidate commonly adopted benchmarking protocols and evaluation metrics; and (6) finally, we discuss critical challenges and outline research directions.