Text-to-image models such as Stable Diffusion (SD) require comprehensive, fine-grained, and high-precision methods for evaluating text–image alignment. A prior method, the text–image alignment metric (TIAM), employs a template-based approach for fine-grained, high-precision evaluation; however, it is restricted to objects and colors, limiting its comprehensiveness. This study extends TIAM by incorporating attention maps and vision–language models to deliver a fine-grained, high-precision evaluation framework that goes beyond colors and objects to cover attributes, actions, and positions. In our experiments, we analyze the scores that the proposed method assigns to generated images and compare them with human judgments. The results demonstrate that the proposed method outperforms existing methods, exhibiting a stronger correlation with human judgments (r = 0.853, p < 10⁻⁴⁸). In addition, we apply the proposed method to evaluate the generation abilities of three SD models (SD1.4, SD2, and SD3.5). Each experiment uses over 900 images, totaling 9,858 images across all experiments to ensure statistical significance. The results indicate that SD3.5 exhibits superior expressiveness compared with SD1.4 and SD2. Nevertheless, for more complex tasks such as multi-attribute or multi-action generation, limitations in text–image alignment remain evident.
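As a minimal sketch of the kind of correlation analysis reported above (not the authors' evaluation code), the snippet below computes a Pearson correlation between per-image metric scores and human ratings. The data are synthetic placeholders; only the use of `scipy.stats.pearsonr` reflects the statistic (r, p) cited in the abstract.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical data: per-image human alignment ratings in [0, 1]
# (e.g., averaged annotator scores) and an automatic metric that
# tracks them with some noise. Both arrays are placeholders.
rng = np.random.default_rng(0)
human_ratings = rng.uniform(0.0, 1.0, size=200)
metric_scores = np.clip(human_ratings + rng.normal(0.0, 0.15, size=200), 0.0, 1.0)

# Pearson correlation between the automatic metric and human judgments;
# a strong r with a small p indicates good agreement with humans.
r, p = pearsonr(metric_scores, human_ratings)
print(f"Pearson r = {r:.3f}, p = {p:.2e}")
```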