Image captioning delivers important assistance to visually impaired individuals, improves how digital content is organized through visual indexing, and supports automated posting on social media platforms. For drone applications, image captioning capabilities significantly enhance navigation precision in complex airspace, improve target identification accuracy during emergency response missions, and support detailed crop monitoring in agriculture. Recent advances in large language models present new opportunities to significantly improve image captioning capabilities. Vision-language navigation systems also benefit from integrating image captions with vision-language models, which can effectively improve navigation performance. Nevertheless, current image captioning systems frequently produce outputs with insufficiently detailed reasoning, struggle with visually ambiguous situations, and often generate either overly generic statements or factually inconsistent content. To address these limitations, we propose a framework that integrates the Tree of Thoughts methodology to restructure the caption generation workflow. Our approach incorporates three key technical contributions: establishing clear reasoning chains, implementing self-reflection mechanisms that evaluate and select optimal reasoning paths, and enabling multi-path exploration to generate more reliable descriptions. This organized decision-making approach helps large language models create more accurate and thorough image captions with enhanced contextual awareness. Extensive experiments confirm that our method achieves consistent performance improvements across various large language models and evaluation metrics, demonstrating its applicability and effectiveness.
DOI: 10.1117/12.3108016
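The workflow outlined in the abstract (multi-path exploration over candidate reasoning chains, self-reflection scoring, and selection of the best chain before emitting a caption) can be illustrated with a minimal sketch. This is an illustrative reconstruction under assumptions, not the authors' implementation: `llm` stands in for any text-completion call, `observations` for whatever visual evidence the underlying vision-language model extracts, and `BRANCH_FACTOR` and `MAX_DEPTH` are hypothetical parameters.

```python
# Minimal sketch of Tree-of-Thoughts-style caption generation (assumed interfaces):
# `llm(prompt)` is a hypothetical text-completion call, and `observations` is a
# string of raw visual evidence from some vision front end; neither is specified
# by the paper.

from dataclasses import dataclass

BRANCH_FACTOR = 3   # candidate reasoning steps explored per node (multi-path exploration)
MAX_DEPTH = 2       # reasoning-chain length before the final caption is emitted

@dataclass
class Thought:
    text: str       # partial reasoning chain about the image
    score: float    # self-reflection score assigned by the model

def expand(llm, observations: str, partial: str) -> list[str]:
    """Propose several candidate next reasoning steps from the current partial chain."""
    prompt = (
        f"Visual observations: {observations}\n"
        f"Reasoning so far: {partial or '(none)'}\n"
        f"Propose one concise next reasoning step about the scene."
    )
    return [llm(prompt) for _ in range(BRANCH_FACTOR)]

def reflect(llm, observations: str, chain: str) -> float:
    """Self-reflection: ask the model to rate how well a chain fits the image."""
    prompt = (
        f"Visual observations: {observations}\n"
        f"Candidate reasoning: {chain}\n"
        f"Rate factual consistency and specificity from 0 to 10. Answer with a number."
    )
    try:
        return float(llm(prompt).strip())
    except ValueError:
        return 0.0

def tot_caption(llm, observations: str) -> str:
    """Explore multiple reasoning paths, keep the best-scored ones, emit a caption."""
    frontier = [Thought(text="", score=0.0)]
    for _ in range(MAX_DEPTH):
        candidates = []
        for t in frontier:
            for step in expand(llm, observations, t.text):
                chain = (t.text + " " + step).strip()
                candidates.append(Thought(chain, reflect(llm, observations, chain)))
        # prune the tree: keep only the top-scored reasoning paths
        frontier = sorted(candidates, key=lambda t: t.score, reverse=True)[:BRANCH_FACTOR]
    best = frontier[0]
    return llm(
        f"Observations: {observations}\n"
        f"Reasoning: {best.text}\n"
        f"Write one accurate, detailed caption."
    )
```

In this sketch the breadth-first expansion plus pruning plays the role of multi-path exploration, and the scoring prompt plays the role of the self-reflection mechanism; the actual prompts, scoring scheme, and search budget used in the paper may differ.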