Susik Yoon

References Indeed Matter? Reference-Free Preference Optimization for Conversational Query Reformulation

May 10, 2025

Doyoung Kim, Youngjun Lee, Joeun Kim, Jihwan Bang, Hwanjun Song, Susik Yoon, Jae-Gil Lee

Abstract: Conversational query reformulation (CQR) has become indispensable for improving retrieval in dialogue-based applications. However, existing approaches typically rely on reference passages for optimization, which are impractical to acquire in real-world scenarios. To address this limitation, we introduce DualReform, a novel reference-free preference optimization framework that generates pseudo reference passages from commonly encountered conversational datasets containing only queries and responses. DualReform attains this goal through two key innovations: (1) response-based inference, where responses serve as proxies to infer pseudo reference passages, and (2) response refinement via the dual role of CQR, where a CQR model refines responses based on the shared objectives between response refinement and CQR. Despite not relying on reference passages, DualReform achieves 96.9% to 99.1% of the retrieval accuracy attainable only with reference passages and surpasses the state-of-the-art method by up to 31.6%.

Why These Documents? Explainable Generative Retrieval with Hierarchical Category Paths

Nov 08, 2024

Sangam Lee, Ryang Heo, SeongKu Kang, Susik Yoon, Jinyoung Yeo, Dongha Lee

Abstract: Generative retrieval has recently emerged as an alternative to traditional information retrieval approaches. However, existing generative retrieval methods directly decode a docid when a query is given, making it impossible to give users an explanation that answers "Why was this document retrieved?". To address this limitation, we propose Hierarchical Category Path-Enhanced Generative Retrieval (HyPE), which enhances explainability by generating hierarchical category paths step by step before decoding the docid. HyPE leverages hierarchical category paths as explanations, progressing from broad to specific semantic categories. This approach enables diverse explanations for the same document depending on the query by using category paths shared between the query and the document, and provides reasonable explanations by reflecting the document's semantic structure in a coarse-to-fine manner. HyPE constructs category paths with an external high-quality semantic hierarchy, leverages an LLM to select appropriate candidate paths for each document, and optimizes the generative retrieval model with a path-augmented dataset. During inference, HyPE uses a path-aware reranking strategy to aggregate diverse topic information, allowing the most relevant documents to be prioritized in the final ranked list of docids. Our extensive experiments demonstrate that HyPE not only offers a high level of explainability but also improves retrieval performance in the document retrieval task.

Online Drift Detection with Maximum Concept Discrepancy

Jul 07, 2024

Ke Wan, Yi Liang, Susik Yoon

Abstract: Continuous learning from an immense volume of data streams has become exceptionally critical in the internet era. However, data streams often do not conform to the same distribution over time, leading to a phenomenon called concept drift. Since a fixed static model is unreliable for inference on concept-drifted data streams, establishing an adaptive mechanism for detecting concept drift is crucial. Current methods for concept drift detection primarily assume that the labels or error rates of downstream models are given and/or that underlying statistical properties exist in data streams. These approaches, however, struggle to address high-dimensional data streams with intricate irregular distribution shifts, which are more prevalent in real-world scenarios. In this paper, we propose MCD-DD, a novel concept drift detection method based on maximum concept discrepancy, inspired by the maximum mean discrepancy. Our method can adaptively identify varying forms of concept drift by contrastive learning of concept embeddings without relying on labels or statistical properties. With thorough experiments under synthetic and real-world scenarios, we demonstrate that the proposed method outperforms existing baselines in identifying concept drifts and enables qualitative analysis with high explainability.
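
For readers unfamiliar with the maximum mean discrepancy (MMD) that the abstract cites as inspiration, the following is a minimal sketch of a biased squared-MMD estimate between two windows of a stream, using an RBF kernel. It is not the MCD-DD method itself, which learns concept embeddings contrastively; the function names, the `gamma` bandwidth, and the fixed `threshold` are assumptions chosen only for illustration.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    # RBF (Gaussian) kernel between two equal-length vectors.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def mmd_squared(window_a, window_b, gamma=1.0):
    # Biased estimate of squared MMD: E[k(a,a')] + E[k(b,b')] - 2 E[k(a,b)].
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))
    return (mean_kernel(window_a, window_a)
            + mean_kernel(window_b, window_b)
            - 2 * mean_kernel(window_a, window_b))

def drift_detected(window_a, window_b, threshold=0.1, gamma=1.0):
    # Hypothetical decision rule: flag drift when the estimate exceeds a
    # fixed threshold (an assumption; real detectors calibrate this).
    return mmd_squared(window_a, window_b, gamma) > threshold

old_window = [[0.0, 0.0], [0.1, 0.0]]
new_window = [[5.0, 5.0], [5.1, 5.0]]
print(drift_detected(old_window, old_window))  # same distribution
print(drift_detected(old_window, new_window))  # shifted distribution
```

A fixed threshold is the simplest possible rule; in practice the threshold would typically be calibrated, for example with a permutation test over the two windows.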

SCStory: Self-supervised and Continual Online Story Discovery

Nov 27, 2023

Susik Yoon, Yu Meng, Dongha Lee, Jiawei Han

Abstract: We present SCStory, a framework for online story discovery that helps people digest rapidly published news article streams in real time without human annotations. To organize news article streams into stories, existing approaches directly encode the articles and cluster them based on representation similarity. However, these methods yield noisy and inaccurate story discovery results because the generic article embeddings do not effectively reflect the story-indicative semantics in an article and cannot adapt to the rapidly evolving news article streams. SCStory employs self-supervised and continual learning with a novel idea of story-indicative adaptive modeling of news article streams. With a lightweight hierarchical embedding module that first learns sentence representations and then article representations, SCStory identifies story-relevant information in news articles and uses it to discover stories. The embedding module is continuously updated to adapt to evolving news streams with a contrastive learning objective, backed by two unique techniques, confidence-aware memory replay and prioritized augmentation, which address the label absence and data scarcity problems. Thorough experiments on real and recent news data sets demonstrate that SCStory outperforms existing state-of-the-art algorithms for unsupervised online story discovery.

* Presented at WWW'23

One Size Fits All for Semantic Shifts: Adaptive Prompt Tuning for Continual Learning

Nov 18, 2023

Doyoung Kim, Susik Yoon, Dongmin Park, Youngjun Lee, Hwanjun Song, Jihwan Bang, Jae-Gil Lee

Abstract: In real-world continual learning (CL) scenarios, tasks often exhibit intricate and unpredictable semantic shifts, posing challenges for fixed prompt management strategies. We identify the inadequacy of universal and specific prompting in handling these dynamic shifts. Universal prompting is ineffective for tasks with abrupt semantic changes, while specific prompting struggles with overfitting under mild semantic shifts. To overcome these limitations, we propose an adaptive prompting approach that tailors minimal yet sufficient prompts based on the task semantics. Our methodology, SemPrompt, incorporates a two-level semantic grouping process: macroscopic semantic assignment and microscopic semantic refinement. This process ensures optimal prompt utilization for varying task semantics, improving the efficiency and effectiveness of learning in real-world CL settings. Our experimental results demonstrate that SemPrompt consistently outperforms existing methods in adapting to diverse semantic shifts in tasks.

Unsupervised Story Discovery from Continuous News Streams via Scalable Thematic Embedding

May 04, 2023

Susik Yoon, Dongha Lee, Yunyi Zhang, Jiawei Han

Abstract: Unsupervised discovery of stories with correlated news articles in real time helps people digest massive news streams without expensive human annotations. A common approach in existing studies of unsupervised online story discovery is to represent news articles with symbolic- or graph-based embeddings and incrementally cluster them into stories. Recent large language models are expected to improve the embeddings further, but a straightforward adoption of the models that indiscriminately encodes all information in articles is ineffective for dealing with text-rich and evolving news streams. In this work, we propose a novel thematic embedding with an off-the-shelf pretrained sentence encoder to dynamically represent articles and stories by considering their shared temporal themes. To realize the idea for unsupervised online story discovery, we introduce a scalable framework, USTORY, with two main techniques, theme- and time-aware dynamic embedding and novelty-aware adaptive clustering, fueled by lightweight story summaries. A thorough evaluation with real news data sets demonstrates that USTORY achieves higher story discovery performance than baselines while being robust and scalable to various streaming settings.

* Accepted by SIGIR'23

MEGClass: Text Classification with Extremely Weak Supervision via Mutually-Enhancing Text Granularities

Apr 04, 2023

Priyanka Kargupta, Tanay Komarlu, Susik Yoon, Xuan Wang, Jiawei Han

Abstract: Text classification typically requires a substantial amount of human-annotated data to serve as supervision, which is costly to obtain in dynamic emerging domains. Certain methods seek to address this problem by relying solely on the surface text of class names to serve as extremely weak supervision. However, existing methods fail to account for single-class documents discussing multiple topics. Both topic diversity and vague sentences may introduce noise into the document's underlying representation and consequently reduce the precision of the predicted class. Furthermore, current work focuses on text granularities (documents, sentences, or words) independently, which limits the degree of coarse- or fine-grained context that we can jointly extract from all three to identify significant subtext for classification. To address this problem, we propose MEGClass, an extremely weakly supervised text classification method that exploits Mutually-Enhancing Text Granularities. Specifically, MEGClass constructs class-oriented sentence and class representations based on keywords to perform a sentence-level confidence-weighted label ensemble that estimates a document's initial class distribution. This serves as the target distribution for a multi-head attention network with a class-weighted contrastive loss. This network learns contextualized sentence representations and weights to form document representations that reflect the original document-level and sentence-level topic diversity. Retaining this heterogeneity allows MEGClass to select the most class-indicative documents to serve as iterative feedback for enhancing the class representations. Finally, these top documents are used to fine-tune a pre-trained text classifier. As demonstrated through extensive experiments on six benchmark datasets, MEGClass outperforms other weakly and extremely weakly supervised methods.

* Code: https://github.com/pkargupta/MEGClass/

PDSum: Prototype-driven Continuous Summarization of Evolving Multi-document Sets Stream

Feb 10, 2023

Susik Yoon, Hou Pong Chan, Jiawei Han

Abstract: Summarizing text-rich documents has long been studied in the literature, but most existing efforts summarize a static and predefined multi-document set. With the rapid development of online platforms for generating and distributing text-rich documents, there is an urgent need for continuously summarizing dynamically evolving multi-document sets, where the composition of documents and sets changes over time. This is especially challenging because the summarization should be not only effective in incorporating relevant, novel, and distinctive information from each concurrent multi-document set, but also efficient in serving online applications. In this work, we propose a new summarization problem, Evolving Multi-Document sets stream Summarization (EMDS), and introduce a novel unsupervised algorithm, PDSum, based on the idea of prototype-driven continuous summarization. PDSum builds a lightweight prototype of each multi-document set and exploits it to adapt to new documents while preserving accumulated knowledge from previous documents. To update summaries, the most representative sentences for each multi-document set are extracted by measuring their similarities to the prototypes. A thorough evaluation with real multi-document set streams demonstrates that PDSum outperforms state-of-the-art unsupervised multi-document summarization algorithms in EMDS in terms of relevance, novelty, and distinctiveness and is also robust to various evaluation settings.

* Accepted by WWW'23
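
The extraction step described in the abstract, choosing the sentences most similar to a set's prototype, can be illustrated with a small sketch. This is not PDSum's actual algorithm: the vector passed as the prototype, the cosine similarity measure, and every name below are assumptions made for illustration, and the toy two-dimensional embeddings stand in for real sentence encodings.

```python
import math

def cosine(u, v):
    # Cosine similarity; returns 0.0 if either vector has zero norm.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k_sentences(sentences, embeddings, prototype, k=2):
    # Rank sentences by similarity to the prototype and keep the top k.
    ranked = sorted(
        zip(sentences, embeddings),
        key=lambda pair: cosine(pair[1], prototype),
        reverse=True,
    )
    return [sentence for sentence, _ in ranked[:k]]

sentences = ["storm hits coast", "storm damages homes", "team wins final"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy vectors (assumed)
prototype = [1.0, 0.0]  # hypothetical prototype of a "storm" document set
print(top_k_sentences(sentences, embeddings, prototype, k=2))
```

Because the summary is rebuilt from similarities to a compact prototype vector, updating it for a new document only requires scoring the new sentences, which is one plausible reason such a design suits online settings.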

Adaptive Model Pooling for Online Deep Anomaly Detection from a Complex Evolving Data Stream

Jun 09, 2022

Susik Yoon, Youngjun Lee, Jae-Gil Lee, Byung Suk Lee

Abstract: Online anomaly detection from a data stream is critical for the safety and security of many applications but faces severe challenges due to complex and evolving data streams from IoT devices and cloud-based infrastructures. Unfortunately, existing approaches fall short of meeting these challenges: online anomaly detection methods struggle to handle the complexity, while offline deep anomaly detection methods suffer from the evolving data distribution. This paper presents a framework for online deep anomaly detection, ARCUS, which can be instantiated with any autoencoder-based deep anomaly detection method. It handles complex and evolving data streams using an adaptive model pooling approach with two novel techniques: concept-driven inference and drift-aware model pool update. The former detects anomalies with a combination of models most appropriate for the complexity, and the latter adapts the model pool dynamically to fit the evolving data streams. In comprehensive experiments with ten data sets that are both high-dimensional and concept-drifted, ARCUS improved the anomaly detection accuracy of the streaming variants of state-of-the-art autoencoder-based methods and that of the state-of-the-art streaming anomaly detection methods by up to 22% and 37%, respectively.

* Accepted by KDD 2022 Research Track

TaxoCom: Topic Taxonomy Completion with Hierarchical Discovery of Novel Topic Clusters

Jan 19, 2022

Dongha Lee, Jiaming Shen, SeongKu Kang, Susik Yoon, Jiawei Han, Hwanjo Yu

Abstract: Topic taxonomies, which represent the latent topic (or category) structure of document collections, provide valuable knowledge of contents in many applications such as web search and information filtering. Recently, several unsupervised methods have been developed to automatically construct a topic taxonomy from a text corpus, but it is challenging to generate the desired taxonomy without any prior knowledge. In this paper, we study how to leverage partial (or incomplete) information about the topic structure as guidance to discover the complete topic taxonomy. We propose a novel framework for topic taxonomy completion, named TaxoCom, which recursively expands the topic taxonomy by discovering novel sub-topic clusters of terms and documents. To effectively identify novel topics within a hierarchical topic structure, TaxoCom designs its embedding and clustering techniques to be closely linked with each other: (i) locally discriminative embedding optimizes the text embedding space to be discriminative among known (i.e., given) sub-topics, and (ii) novelty adaptive clustering assigns terms to either one of the known sub-topics or novel sub-topics. Our comprehensive experiments on two real-world datasets demonstrate that TaxoCom not only generates a high-quality topic taxonomy in terms of term coherency and topic coverage but also outperforms all other baselines on a downstream task.

* The Web Conference (WWW) 2022, 11 pages, 7 figures
