Seungjun Baek

Syntriever: How to Train Your Retriever with Synthetic Data from LLMs

Feb 06, 2025

Minsang Kim, Seungjun Baek

Abstract: LLMs have boosted progress in many AI applications. Recently, there have been attempts to distill the vast knowledge of LLMs into information retrieval systems. These distillation methods mostly use the output probabilities of LLMs, which are unavailable in the latest black-box LLMs. We propose Syntriever, a training framework for retrievers using synthetic data from black-box LLMs. Syntriever consists of two stages. First, in the distillation stage, we synthesize relevant and plausibly irrelevant passages and augmented queries using chain-of-thought for the given queries. The LLM is asked to self-verify the synthetic data for possible hallucinations, after which retrievers are trained with a loss designed to cluster the embeddings of relevant passages. Second, in the alignment stage, we align the retriever with the preferences of LLMs. We propose a preference modeling method called partial Plackett-Luce ranking to learn LLM preferences, with regularization that prevents the model from deviating excessively from the one trained in the distillation stage. Experiments show that Syntriever achieves state-of-the-art performance on benchmark datasets from various domains in terms of nDCG@K. The code is available at https://github.com/kmswin1/Syntriever.

* the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), Findings, Accepted
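
The alignment stage described above builds on Plackett-Luce ranking. As a rough illustration only, the sketch below computes the negative log-likelihood of a full ranking under the standard Plackett-Luce model. It is not the paper's partial variant and omits its regularization term; the function name `plackett_luce_nll` and the score layout are assumptions made for this example.

```python
import math


def plackett_luce_nll(scores_in_preferred_order):
    """Negative log-likelihood of a ranking under the standard Plackett-Luce model.

    scores_in_preferred_order[i] is the retriever's score for the passage
    ranked i-th by the LLM (most preferred first). Hypothetical helper,
    not taken from the Syntriever code base.
    """
    nll = 0.0
    n = len(scores_in_preferred_order)
    # The last position is chosen with probability 1, so it adds nothing.
    for i in range(n - 1):
        remaining = scores_in_preferred_order[i:]
        # Numerically stable log-sum-exp over the items not yet ranked.
        m = max(remaining)
        log_z = m + math.log(sum(math.exp(s - m) for s in remaining))
        # Log-probability that item i is picked first among the remaining ones.
        nll -= scores_in_preferred_order[i] - log_z
    return nll
```

With two passages this reduces to the familiar pairwise logistic loss, and a ranking that agrees with the scores yields a lower loss than the reversed ranking.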

Exploring Large Language Models on Cross-Cultural Values in Connection with Training Methodology

Dec 12, 2024

Minsang Kim, Seungjun Baek

Abstract: Large language models (LLMs) closely interact with humans, and thus need an intimate understanding of the cultural values of human society. In this paper, we explore how open-source LLMs make judgments on diverse categories of cultural values across countries, and how these judgments relate to training methodology, such as model size, training corpus, and alignment. Our analysis shows that LLMs can judge socio-cultural norms similarly to humans, but less so on social systems and progress. In addition, LLMs tend to judge cultural values with a bias toward Western culture, which can be improved by training on a multilingual corpus. We also find that increasing model size leads to a better understanding of social values, but smaller models can be enhanced by using synthetic data. Our analysis reveals valuable insights into the design methodology of LLMs in connection with their understanding of cultural values.

Augmenting Query and Passage for Retrieval-Augmented Generation using LLMs for Open-Domain Question Answering

Jun 20, 2024

Minsang Kim, Cheoneum Park, Seungjun Baek

Abstract: Retrieval-augmented generation (RAG) has received much attention for open-domain question-answering (ODQA) tasks as a means to compensate for the parametric knowledge of large language models (LLMs). While previous approaches focused on processing retrieved passages to remove irrelevant context, they still rely heavily on the quality of retrieved passages, which can degrade if the question is ambiguous or complex. In this paper, we propose a simple yet efficient method called question and passage augmentation via LLMs for open-domain QA. Our method first decomposes the original question into multi-step sub-questions. By augmenting the original question with detailed sub-questions and planning, we make the query more specific about what needs to be retrieved, improving retrieval performance. In addition, to compensate for cases where the retrieved passages contain distracting information or divided opinions, we augment the retrieved passages with passages self-generated by LLMs to guide answer extraction. Experimental results show that the proposed scheme outperforms the previous state-of-the-art and achieves a significant performance gain over existing RAG methods.

Measuring Sample Importance in Data Pruning for Training LLMs from a Data Compression Perspective

Jun 20, 2024

Minsang Kim, Seungjun Baek

Abstract: Compute-efficient training of large language models (LLMs) has become an important research problem. In this work, we consider data pruning as a method of data-efficient training of LLMs, taking a data compression view of data pruning. We argue that the amount of information in a sample, or the achievable compression of its description length, represents its sample importance. The key idea is that less informative samples are likely to contain redundant information, and thus should be pruned first. We leverage the log-likelihood function of trained models as a surrogate to measure the information content of samples. Experiments reveal a surprising insight: information-based pruning can enhance the generalization capability of the model, improving on both language modeling and downstream tasks compared to a model trained on the entire dataset.
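
The pruning rule in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions: each sample comes with per-token natural-log probabilities from some trained model, information content is taken as the total negative log-likelihood in bits, and the names `information_bits` and `prune_least_informative` are hypothetical. Whether the paper normalizes by length or uses a different surrogate is not stated in the abstract.

```python
import math


def information_bits(token_logprobs):
    """Information content of one sample, in bits.

    token_logprobs holds natural-log probabilities assigned by a trained
    model to each token; the negated sum is the code length in nats.
    """
    return -sum(token_logprobs) / math.log(2)


def prune_least_informative(samples, keep_fraction):
    """Keep the most informative samples and drop the rest.

    samples maps a sample id to its list of token log-probabilities.
    Returns the ids that are kept, most informative first.
    """
    n_keep = max(1, int(len(samples) * keep_fraction))
    ranked = sorted(
        samples,
        key=lambda sample_id: information_bits(samples[sample_id]),
        reverse=True,
    )
    return ranked[:n_keep]
```

For example, calling `prune_least_informative(samples, 0.5)` on a dictionary of four samples keeps the two whose token sequences the model finds least predictable.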

Hierarchical Position Embedding of Graphs with Landmarks and Clustering for Link Prediction

Feb 13, 2024

Minsang Kim, Seungjun Baek

Abstract: Learning positional information of nodes in a graph is important for link prediction tasks. We propose a representation of positional information using representative nodes called landmarks. A small number of nodes with high degree centrality are selected as landmarks, which serve as reference points for the nodes' positions. We justify this selection strategy for well-known random graph models and derive closed-form bounds on the average path lengths involving landmarks. In a model for power-law graphs, we prove that landmarks provide asymptotically exact information on inter-node distances. We apply these theoretical insights to practical networks and propose Hierarchical Position embedding with Landmarks and Clustering (HPLC). HPLC combines landmark selection and graph clustering, where the graph is partitioned into densely connected clusters in which nodes with the highest degree are selected as landmarks. HPLC leverages the positional information of nodes based on landmarks at various levels of hierarchy, such as nodes' distances to landmarks, inter-landmark distances, and hierarchical grouping of clusters. Experiments show that HPLC achieves state-of-the-art link prediction performance on various datasets in terms of HIT@K, MRR, and AUC. The code is available at https://github.com/kmswin1/HPLC.

* The International World Wide Web Conference (WWW) 2024, Accepted paper
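
The landmark idea in the abstract above can be illustrated with a small sketch: pick the highest-degree nodes as landmarks and describe every node by its hop distances to them. This is only the basic ingredient, assuming an unweighted graph given as an adjacency dictionary. It leaves out the clustering and hierarchy that HPLC adds, and the function names are hypothetical rather than taken from the HPLC repository.

```python
from collections import deque


def select_landmarks(adj, k):
    """Return the k nodes with the highest degree (ties broken by node id)."""
    return sorted(adj, key=lambda v: (-len(adj[v]), v))[:k]


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def landmark_embedding(adj, k):
    """Map each node to its vector of distances to the k landmarks."""
    landmarks = select_landmarks(adj, k)
    per_landmark = [bfs_distances(adj, lm) for lm in landmarks]
    # Unreachable landmarks get an infinite distance.
    embedding = {
        v: [d.get(v, float("inf")) for d in per_landmark] for v in adj
    }
    return landmarks, embedding
```

Given two nodes' distance vectors, the smallest sum over landmarks is an upper bound on their true distance by the triangle inequality, which is one way such vectors can inform link prediction.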

CEmb-SAM: Segment Anything Model with Condition Embedding for Joint Learning from Heterogeneous Datasets

Aug 14, 2023

Dongik Shin, Beomsuk Kim, Seungjun Baek

Abstract: Automated segmentation of ultrasound images can assist medical experts with diagnostic and therapeutic procedures. Although ultrasound is a common modality, one typically needs separate datasets in order to segment, for example, different anatomical structures or lesions with different levels of malignancy. In this paper, we consider the problem of jointly learning from heterogeneous datasets so that the model can improve its generalization ability by leveraging the inherent variability among datasets. We merge the heterogeneous datasets into one dataset and refer to each component dataset as a subgroup. We propose to train a single segmentation model so that the model can adapt to each subgroup. For robust segmentation, we leverage the recently proposed Segment Anything Model (SAM) in order to incorporate subgroup information into the model. We propose SAM with Condition Embedding block (CEmb-SAM), which encodes subgroup conditions and combines them with image embeddings from SAM. The conditional embedding block effectively adapts SAM to each image subgroup by incorporating dataset properties through learnable parameters for normalization. Experiments show that CEmb-SAM outperforms the baseline methods on ultrasound image segmentation for peripheral nerves and breast cancer. The experiments highlight the effectiveness of CEmb-SAM in learning from heterogeneous datasets in medical image segmentation tasks.

NeBLa: Neural Beer-Lambert for 3D Reconstruction of Oral Structures from Panoramic Radiographs

Apr 08, 2023

Sihwa Park, Seongjun Kim, Doeyoung Kwon, Yohan Jang, Seungjun Baek

Abstract: Panoramic radiography (panoramic X-ray, PX) is a widely used imaging modality for dental examination. However, its applicability is limited compared to 3D cone-beam computed tomography (CBCT), because PX only provides 2D flattened images of the oral structure. In this paper, we propose a new framework which estimates 3D oral structure from real-world PX images. Since matched PX and CBCT data are scarce, we use PX simulated from CBCT for training, but real-world panoramic radiographs at inference time. We propose a new ray-sampling method to make simulated panoramic radiographs, inspired by the principle of panoramic radiography, along with a rendering function derived from the Beer-Lambert law. Our model consists of three parts: a translation module, a generation module, and a refinement module. The translation module converts the real-world panoramic radiograph to the style of the simulated training images. The generation module produces the 3D structure from the input image without any prior information such as a dental arch. Our ray-based generation approach makes it possible to reverse the process of generating PX from oral structure in order to reconstruct CBCT data. Lastly, the refinement module enhances the quality of the 3D output. Results show that our approach works better on both simulated and real-world images than other state-of-the-art methods.

* 10 pages, 4 figures
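
For readers unfamiliar with the Beer-Lambert law mentioned in the abstract above, the sketch below shows its textbook discretized form along a single ray: the transmitted intensity decays exponentially with the summed attenuation of the samples the ray passes through. This is not the paper's rendering function or ray-sampling scheme. The function name, the uniform step size, and the input layout are assumptions made for illustration.

```python
import math


def transmitted_intensity(i0, attenuation, step):
    """Discretized Beer-Lambert law along one ray.

    i0          -- incident intensity
    attenuation -- attenuation coefficients sampled along the ray
    step        -- spacing between samples (same length unit as 1/coefficient)

    Computes I = i0 * exp(-sum(mu_k * step)).
    """
    optical_depth = sum(mu * step for mu in attenuation)
    return i0 * math.exp(-optical_depth)
```

A ray that passes only through non-attenuating samples keeps its incident intensity, and splitting a ray into two segments gives the same result as tracing it in one pass, since the exponentials multiply.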

ComDensE : Combined Dense Embedding of Relation-aware and Common Features for Knowledge Graph Completion

Jun 29, 2022

Minsang Kim, Seungjun Baek

Abstract: Real-world knowledge graphs (KGs) are mostly incomplete. The problem of recovering missing relations, called KG completion, has recently become an active research area. KG embedding, a low-dimensional representation of entities and relations, is a crucial technique for KG completion. Convolutional neural networks in models such as ConvE, SACN, InteractE, and RGCN have achieved recent successes. This paper takes a different architectural view and proposes ComDensE, which combines relation-aware and common features using dense neural networks. In the relation-aware feature extraction, we attempt to create relational inductive bias by applying an encoding function specific to each relation. In the common feature extraction, we apply a common encoding function to all input embeddings. These encoding functions are implemented using dense layers in ComDensE. ComDensE achieves state-of-the-art performance in link prediction in terms of MRR and HIT@1 on FB15k-237 and HIT@1 on WN18RR compared to previous baseline approaches. We conduct an extensive ablation study to examine the effects of the relation-aware layer and the common layer of ComDensE. Experimental results illustrate that the combined dense architecture as implemented in ComDensE achieves the best performance.

* International Conference on Pattern Recognition 2022, Accepted
