Yejun Yoon

Hypothetical Documents or Knowledge Leakage? Rethinking LLM-based Query Expansion

Apr 19, 2025

Yejun Yoon, Jaeyoon Jung, Seunghyun Yoon, Kunwoo Park

Abstract: Query expansion methods powered by large language models (LLMs) have demonstrated effectiveness in zero-shot retrieval tasks. These methods assume that LLMs can generate hypothetical documents that, when incorporated into a query vector, enhance the retrieval of real evidence. However, we challenge this assumption by investigating whether knowledge leakage in benchmarks contributes to the observed performance gains. Using fact verification as a testbed, we analyzed whether the generated documents contained information entailed by ground truth evidence and assessed their impact on performance. Our findings indicate that performance improvements occurred consistently only for claims whose generated documents included sentences entailed by ground truth evidence. This suggests that knowledge leakage may be present in these benchmarks, inflating the perceived performance of LLM-based query expansion methods, particularly in real-world scenarios that require retrieving niche or novel knowledge.

* preprint
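
To make the phrase "incorporated into a query vector" concrete, here is a minimal sketch of one common way to do it: averaging the query embedding with the embeddings of the generated documents, then ranking candidates by cosine similarity. The averaging rule, the function names, and the toy 2-dimensional vectors are all assumptions for illustration; the abstract does not say which combination rule or encoder the authors used.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def expand_query(query_vec: Sequence[float],
                 hypothetical_doc_vecs: Sequence[Sequence[float]]) -> Vector:
    """Assumed rule: average the query embedding with the generated-document embeddings."""
    return mean_vector([query_vec, *hypothetical_doc_vecs])


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vec: Sequence[float],
         doc_vecs: Sequence[Sequence[float]]) -> List[int]:
    """Indices of candidate documents, most similar first."""
    return sorted(range(len(doc_vecs)),
                  key=lambda i: cosine(query_vec, doc_vecs[i]),
                  reverse=True)


# Toy embeddings standing in for real encoder outputs (hypothetical values).
query = [1.0, 0.0]
generated = [[0.0, 1.0], [0.0, 1.0]]
corpus = [[1.0, 0.0], [0.0, 1.0]]

expanded = expand_query(query, generated)
baseline_order = rank(query, corpus)
expanded_order = rank(expanded, corpus)
```

In this toy case the expanded query is pulled toward the generated documents, so the ranking flips. That is the mechanism the paper scrutinizes: if the generated text already contains the evidence, the gain may reflect leaked knowledge rather than retrieval ability.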

HerO at AVeriTeC: The Herd of Open Large Language Models for Verifying Real-World Claims

Oct 16, 2024

Yejun Yoon, Jaeyoon Jung, Seunghyun Yoon, Kunwoo Park

Abstract: To tackle the AVeriTeC shared task hosted by FEVER-24, we introduce a system that only employs publicly available large language models (LLMs) for each step of automated fact-checking, dubbed the Herd of Open LLMs for verifying real-world claims (HerO). HerO employs multiple LLMs for each step of automated fact-checking. For evidence retrieval, a language model is used to enhance a query by generating hypothetical fact-checking documents. We prompt pretrained and fine-tuned LLMs for question generation and veracity prediction by crafting prompts with retrieved in-context samples. HerO achieved 2nd place on the leaderboard with an AVeriTeC score of 0.57, suggesting the potential of open LLMs for verifying real-world claims. For future research, we make our code publicly available at https://github.com/ssu-humane/HerO.

* A system description paper for the AVeriTeC shared task, hosted by the seventh FEVER workshop (co-located with EMNLP 2024)
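
As a rough illustration of "crafting prompts with retrieved in-context samples", the sketch below picks the labeled examples most similar to a new claim and formats them as a few-shot prompt. Everything here is an assumption made for illustration: the token-overlap (Jaccard) similarity stands in for whatever retriever HerO actually uses, and the prompt template, function names, and example pool are hypothetical. The linked repository is the authoritative source for the real pipeline.

```python
from typing import List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; a simple stand-in for a real retriever's similarity."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def build_prompt(claim: str, pool: List[Tuple[str, str]], k: int = 2) -> str:
    """Select the k most similar labeled claims and format a few-shot prompt."""
    nearest = sorted(pool, key=lambda ex: jaccard(claim, ex[0]), reverse=True)[:k]
    blocks = [f"Claim: {ex_claim}\nVerdict: {ex_label}\n"
              for ex_claim, ex_label in nearest]
    # The target claim goes last, with the verdict left for the model to fill in.
    blocks.append(f"Claim: {claim}\nVerdict:")
    return "\n".join(blocks)


# Hypothetical labeled pool; not taken from the AVeriTeC data.
pool = [
    ("The moon is made of cheese", "Refuted"),
    ("Water boils at 100 C at sea level", "Supported"),
    ("The moon orbits the Earth", "Supported"),
]
prompt = build_prompt("The moon is made of rock", pool, k=2)
```

With `k=2`, the two moon-related examples are selected and the unrelated boiling-point example is left out, so the prompt stays short and topically close to the claim being verified.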

Understanding News Thumbnail Representativeness by Counterfactual Text-Guided Contrastive Language-Image Pretraining

Feb 21, 2024

Yejun Yoon, Seunghyun Yoon, Kunwoo Park

Abstract: This paper delves into the critical challenge of understanding the representativeness of news thumbnail images, which often serve as the first visual engagement for readers when an article is disseminated on social media. We focus on whether a news image represents the main subject discussed in the news text. To address this challenge, we introduce NewsTT, a manually annotated dataset of news thumbnail image and text pairs. We found that pretrained vision and language models, such as CLIP and BLIP-2, struggle with this task. Since news subjects frequently involve named entities or proper nouns, a pretrained model may lack the ability to match their visual and textual appearances. To fill the gap, we propose CFT-CLIP, a counterfactual text-guided contrastive language-image pretraining framework. We hypothesize that learning to contrast news text with its counterfactual, in which named entities are replaced, can enhance the cross-modal matching ability in the target task. Evaluation experiments using NewsTT show that CFT-CLIP outperforms the pretrained models, such as CLIP and BLIP-2. Our code and data will be made accessible to the public after the paper is accepted.

* preprint
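
The idea of contrasting a news text with a counterfactual whose named entities are replaced can be sketched in two small steps: build the counterfactual text, then score it as a hard negative in a contrastive loss. This is only an illustration under stated assumptions: the dictionary-based entity swap, the InfoNCE-style loss, the temperature value, and all names below are hypothetical choices, since the abstract does not specify how CFT-CLIP generates counterfactuals or which loss it optimizes.

```python
import math
from typing import Dict, Sequence


def make_counterfactual(text: str, replacements: Dict[str, str]) -> str:
    """Swap each listed named entity for a different one (assumed to be given)."""
    for entity, substitute in replacements.items():
        text = text.replace(entity, substitute)
    return text


def contrastive_loss(sim_pos: float,
                     sim_negs: Sequence[float],
                     temperature: float = 0.07) -> float:
    """InfoNCE-style loss for one image anchor.

    sim_pos:  similarity between the image and its original news text.
    sim_negs: similarities between the image and negative texts, such as
              the counterfactual text with replaced entities.
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[0]


original = "Biden visits Seoul"
counterfactual = make_counterfactual(original, {"Biden": "Macron"})

# Hypothetical similarity scores for illustration.
loss_undecided = contrastive_loss(0.5, [0.5])   # model cannot tell the texts apart
loss_good = contrastive_loss(0.9, [0.1])        # image matches the original text
loss_bad = contrastive_loss(0.1, [0.9])         # image matches the counterfactual
```

The loss is low only when the image is closer to the original text than to its entity-swapped counterpart, which is the behavior the hypothesis in the abstract aims to encourage.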

How does fake news use a thumbnail? CLIP-based Multimodal Detection on the Unrepresentative News Image

Apr 27, 2022

Hyewon Choi, Yejun Yoon, Seunghyun Yoon, Kunwoo Park

Abstract: This study investigates how fake news uses a thumbnail for a news article, with a focus on whether a news article's thumbnail represents the news content correctly. A news article shared with an irrelevant thumbnail can mislead readers into having a wrong impression of the issue, especially in social media environments where users are less likely to click the link and consume the entire content. We propose to capture the degree of semantic incongruity in the multimodal relation by using the pretrained CLIP representation. From a source-level analysis, we found that fake news employs images that are more incongruous with the main content than those of general news. Going further, we attempted to detect news articles with image-text incongruity. Evaluation experiments suggest that CLIP-based methods can successfully detect news articles in which the thumbnail is semantically irrelevant to news text. This study contributes to the research by providing a novel view on tackling online fake news and misinformation. Code and datasets are available at https://github.com/ssu-humane/fake-news-thumbnail.

* 9 pages, 8 figures including appendix figure, 2 tables. Published in Findings of ACL workshop, CONSTRAINT 2022 (Long paper). The manuscript is slightly revised after the camera-ready version.
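
One simple way to read "the degree of semantic incongruity" from CLIP representations is as one minus the cosine similarity between the thumbnail embedding and the text embedding. The sketch below assumes that reading; the score definition, the threshold, the function names, and the toy vectors are hypothetical and stand in for real CLIP outputs. The paper's repository documents the method the authors actually used.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def incongruity_score(image_emb: Sequence[float],
                      text_emb: Sequence[float]) -> float:
    """Assumed score: 0 for perfectly aligned embeddings, larger when they diverge."""
    return 1.0 - cosine_similarity(image_emb, text_emb)


def is_unrepresentative(image_emb: Sequence[float],
                        text_emb: Sequence[float],
                        threshold: float = 0.75) -> bool:
    """Flag a thumbnail whose score exceeds a (hypothetical) tuned threshold."""
    return incongruity_score(image_emb, text_emb) > threshold


# Toy 2-dimensional embeddings standing in for CLIP image/text features.
matching = incongruity_score([1.0, 0.0], [2.0, 0.0])
mismatched = incongruity_score([1.0, 0.0], [0.0, 1.0])
```

In practice the threshold would be chosen on labeled validation pairs rather than fixed in advance.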
