Information retrieval (IR) systems play a critical role in navigating information overload across various applications. Existing IR benchmarks primarily focus on simple queries that are semantically analogous to single- and multi-hop relations, overlooking complex logical queries involving first-order logic operations such as conjunction (∧), disjunction (∨), and negation (¬). As a result, these benchmarks cannot adequately evaluate the performance of IR models on complex queries in real-world scenarios. To address this problem, we propose a novel method that leverages large language models (LLMs) to construct a new IR dataset, ComLQ, for Complex Logical Queries, which comprises 2,909 queries and 11,251 candidate passages. A key challenge in constructing the dataset lies in capturing the underlying logical structures within unstructured text. We therefore design a subgraph-guided prompt with a subgraph indicator, which guides an LLM (such as GPT-4o) to generate queries with specific logical structures based on selected passages. Expert annotation verifies structure conformity and evidence distribution for all query-passage pairs in ComLQ. To better evaluate whether retrievers can handle queries with negation, we further propose a new evaluation metric, Log-Scaled Negation Consistency (LSNC@K). As a supplement to standard relevance-based metrics (such as nDCG and mAP), LSNC@K measures whether the top-K retrieved passages violate negation conditions in queries. Our experimental results under zero-shot settings demonstrate that existing retrieval models perform poorly on complex logical queries, especially on queries with negation, exposing their weak ability to model exclusion. In summary, ComLQ offers a comprehensive and fine-grained exploration, paving the way for future research on complex logical queries in IR.
Published in: Proceedings of the AAAI Conference on Artificial Intelligence
Volume 40, Issue 40, pp. 34115-34123
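The abstract describes LSNC@K only informally, as a measure of whether the top-K retrieved passages violate a query's negation conditions, and does not give its formula. The sketch below is therefore not the paper's definition. It is a hypothetical stand-in that illustrates the general idea of a rank-aware negation check, assuming a log-scaled position discount similar to the one used in nDCG. The function name `negation_violation_score`, its parameters, and the normalization are all assumptions made for illustration.

```python
import math
from typing import Sequence, Set


def negation_violation_score(
    ranked_ids: Sequence[str],
    violating_ids: Set[str],
    k: int,
) -> float:
    """Hypothetical negation-consistency score in [0, 1].

    NOT the paper's LSNC@K formula (the abstract does not state it).
    Each top-k passage that violates a negation condition adds a
    penalty of 1 / log2(rank + 1), so violations near the top of the
    ranking cost more. The total penalty is normalized by the worst
    case, in which every top-k passage is a violation.
    """
    top_k = ranked_ids[:k]
    if not top_k:
        # Nothing retrieved, so nothing can violate the negation.
        return 1.0

    penalty = sum(
        1.0 / math.log2(rank + 1)
        for rank, pid in enumerate(top_k, start=1)
        if pid in violating_ids
    )
    worst_case = sum(
        1.0 / math.log2(rank + 1) for rank in range(1, len(top_k) + 1)
    )
    return 1.0 - penalty / worst_case


if __name__ == "__main__":
    # Illustrative data only: passage IDs and violation labels are made up.
    ranking = ["p1", "p2", "p3"]
    print(negation_violation_score(ranking, {"p1"}, k=3))  # violation at rank 1
    print(negation_violation_score(ranking, {"p3"}, k=3))  # violation at rank 3
```

Under these assumptions, a violating passage at rank 1 lowers the score more than the same passage at rank 3, which is the behavior a log-scaled measure is meant to capture. A relevance metric such as nDCG would be reported alongside it, since a score like this says nothing about whether the non-violating passages are actually relevant.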