Image captioning delivers important assistance to visually impaired individuals, improves how digital content is organized through visual indexing, and supports automated posting on social media platforms. For drone applications, image captioning capabilities significantly enhance navigation precision in complex airspace, improve target identification accuracy during emergency response missions, and support detailed crop monitoring in agriculture. Recent advances in large language models present new opportunities to significantly improve image captioning capabilities. Vision-language navigation systems also benefit from integrating image captions with vision-language models, which can effectively improve navigation performance. Nevertheless, current image captioning systems frequently produce outputs with insufficiently detailed reasoning, struggle with visually ambiguous situations, and often generate either overly generic statements or factually inconsistent content. To address these limitations, we propose a framework that integrates the Tree of Thoughts methodology to restructure the caption generation workflow. Our approach incorporates three key technical contributions: establishing clear reasoning chains, implementing self-reflection mechanisms that evaluate and select optimal reasoning paths, and enabling multi-path exploration to generate more reliable descriptions. This organized decision-making approach helps large language models create more accurate and thorough image captions with enhanced contextual awareness. Extensive experiments confirm that our method achieves consistent performance improvements across various large language models and evaluation metrics, demonstrating its applicability and effectiveness.
DOI: 10.1117/12.3108016
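The workflow outlined in the abstract (multi-path exploration over candidate reasoning chains, self-reflection scoring, and selection of the best chain before emitting a caption) can be illustrated with a minimal sketch. This is an illustrative reconstruction under assumptions, not the authors' implementation: `llm` stands in for any text-completion call, `observations` for whatever visual evidence the underlying vision-language model extracts, and `BRANCH_FACTOR` and `MAX_DEPTH` are hypothetical parameters.

```python
# Minimal sketch of Tree-of-Thoughts-style caption generation (assumed interfaces):
# `llm(prompt)` is a hypothetical text-completion call, and `observations` is a
# string of raw visual evidence from some vision front end; neither is specified
# by the paper.

from dataclasses import dataclass

BRANCH_FACTOR = 3   # candidate reasoning steps explored per node (multi-path exploration)
MAX_DEPTH = 2       # reasoning-chain length before the final caption is emitted

@dataclass
class Thought:
    text: str       # partial reasoning chain about the image
    score: float    # self-reflection score assigned by the model

def expand(llm, observations: str, partial: str) -> list[str]:
    """Propose several candidate next reasoning steps from the current partial chain."""
    prompt = (
        f"Visual observations: {observations}\n"
        f"Reasoning so far: {partial or '(none)'}\n"
        f"Propose one concise next reasoning step about the scene."
    )
    return [llm(prompt) for _ in range(BRANCH_FACTOR)]

def reflect(llm, observations: str, chain: str) -> float:
    """Self-reflection: ask the model to rate how well a chain fits the image."""
    prompt = (
        f"Visual observations: {observations}\n"
        f"Candidate reasoning: {chain}\n"
        f"Rate factual consistency and specificity from 0 to 10. Answer with a number."
    )
    try:
        return float(llm(prompt).strip())
    except ValueError:
        return 0.0

def tot_caption(llm, observations: str) -> str:
    """Explore multiple reasoning paths, keep the best-scored ones, emit a caption."""
    frontier = [Thought(text="", score=0.0)]
    for _ in range(MAX_DEPTH):
        candidates = []
        for t in frontier:
            for step in expand(llm, observations, t.text):
                chain = (t.text + " " + step).strip()
                candidates.append(Thought(chain, reflect(llm, observations, chain)))
        # prune the tree: keep only the top-scored reasoning paths
        frontier = sorted(candidates, key=lambda t: t.score, reverse=True)[:BRANCH_FACTOR]
    best = frontier[0]
    return llm(
        f"Observations: {observations}\n"
        f"Reasoning: {best.text}\n"
        f"Write one accurate, detailed caption."
    )
```

In this sketch the breadth-first expansion plus pruning plays the role of multi-path exploration, and the scoring prompt plays the role of the self-reflection mechanism; the actual prompts, scoring scheme, and search budget used in the paper may differ.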