Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Sanghyeok Lee

ProGraph-R1: Progress-aware Reinforcement Learning for Graph Retrieval Augmented Generation

Jan 25, 2026

Jinyoung Park, Sanghyeok Lee, Omar Zia Khan, Hyunwoo J. Kim, Joo-Kyung Kim

Abstract:Graph Retrieval-Augmented Generation (GraphRAG) has been successfully applied in various knowledge-intensive question answering tasks by organizing external knowledge into structured graphs of entities and relations. It enables large language models (LLMs) to perform complex reasoning beyond text-chunk retrieval. Recent works have employed reinforcement learning (RL) to train agentic GraphRAG frameworks that perform iterative interactions between LLMs and knowledge graphs. However, existing RL-based frameworks such as Graph-R1 suffer from two key limitations: (1) they primarily depend on semantic similarity for retrieval, often overlooking the underlying graph structure, and (2) they rely on sparse, outcome-level rewards, failing to capture the quality of intermediate retrieval steps and their dependencies. To address these limitations, we propose ProGraph-R1, a progress-aware agentic framework for graph-based retrieval and multi-step reasoning. ProGraph-R1 introduces a structure-aware hypergraph retrieval mechanism that jointly considers semantic relevance and graph connectivity, encouraging coherent traversal along multi-hop reasoning paths. We also design a progress-based step-wise policy optimization, which provides dense learning signals by modulating advantages according to intermediate reasoning progress within a graph, rather than relying solely on final outcomes. Experiments on multi-hop question answering benchmarks demonstrate that ProGraph-R1 consistently improves reasoning accuracy and generation quality over existing GraphRAG methods.

* In progress

Via

Access Paper or Ask Questions

TabFlash: Efficient Table Understanding with Progressive Question Conditioning and Token Focusing

Nov 17, 2025

Jongha Kim, Minseong Bae, Sanghyeok Lee, Jinsung Yoon, Hyunwoo J. Kim

Abstract:Table images present unique challenges for effective and efficient understanding due to the need for question-specific focus and the presence of redundant background regions. Existing Multimodal Large Language Model (MLLM) approaches often overlook these characteristics, resulting in uninformative and redundant visual representations. To address these issues, we aim to generate visual features that are both informative and compact to improve table understanding. We first propose progressive question conditioning, which injects the question into Vision Transformer layers with gradually increasing frequency, considering each layer's capacity to handle additional information, to generate question-aware visual features. To reduce redundancy, we introduce a pruning strategy that discards background tokens, thereby improving efficiency. To mitigate information loss from pruning, we further propose token focusing, a training strategy that encourages the model to concentrate essential information in the retained tokens. By combining these approaches, we present TabFlash, an efficient and effective MLLM for table understanding. TabFlash achieves state-of-the-art performance, outperforming both open-source and proprietary MLLMs, while requiring 27% less FLOPs and 30% less memory usage compared to the second-best MLLM.

* AAAI 2026 (Main Technical Track)

Via

Access Paper or Ask Questions

Transferable Model-agnostic Vision-Language Model Adaptation for Efficient Weak-to-Strong Generalization

Aug 13, 2025

Jihwan Park, Taehoon song, Sanghyeok Lee, Miso Choi, Hyunwoo J. Kim

Figure 1 for Transferable Model-agnostic Vision-Language Model Adaptation for Efficient Weak-to-Strong Generalization

Figure 2 for Transferable Model-agnostic Vision-Language Model Adaptation for Efficient Weak-to-Strong Generalization

Figure 3 for Transferable Model-agnostic Vision-Language Model Adaptation for Efficient Weak-to-Strong Generalization

Figure 4 for Transferable Model-agnostic Vision-Language Model Adaptation for Efficient Weak-to-Strong Generalization

Abstract:Vision-Language Models (VLMs) have been widely used in various visual recognition tasks due to their remarkable generalization capabilities. As these models grow in size and complexity, fine-tuning becomes costly, emphasizing the need to reuse adaptation knowledge from 'weaker' models to efficiently enhance 'stronger' ones. However, existing adaptation transfer methods exhibit limited transferability across models due to their model-specific design and high computational demands. To tackle this, we propose Transferable Model-agnostic adapter (TransMiter), a light-weight adapter that improves vision-language models 'without backpropagation'. TransMiter captures the knowledge gap between pre-trained and fine-tuned VLMs, in an 'unsupervised' manner. Once trained, this knowledge can be seamlessly transferred across different models without the need for backpropagation. Moreover, TransMiter consists of only a few layers, inducing a negligible additional inference cost. Notably, supplementing the process with a few labeled data further yields additional performance gain, often surpassing a fine-tuned stronger model, with a marginal training cost. Experimental results and analyses demonstrate that TransMiter effectively and efficiently transfers adaptation knowledge while preserving generalization abilities across VLMs of different sizes and architectures in visual recognition tasks.

Via

Access Paper or Ask Questions

EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality

Nov 22, 2024

Sanghyeok Lee, Joonmyung Choi, Hyunwoo J. Kim

Figure 1 for EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality

Figure 2 for EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality

Figure 3 for EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality

Figure 4 for EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality

Abstract:For the deployment of neural networks in resource-constrained environments, prior works have built lightweight architectures with convolution and attention for capturing local and global dependencies, respectively. Recently, the state space model has emerged as an effective global token interaction with its favorable linear computational cost in the number of tokens. Yet, efficient vision backbones built with SSM have been explored less. In this paper, we introduce Efficient Vision Mamba (EfficientViM), a novel architecture built on hidden state mixer-based state space duality (HSM-SSD) that efficiently captures global dependencies with further reduced computational cost. In the HSM-SSD layer, we redesign the previous SSD layer to enable the channel mixing operation within hidden states. Additionally, we propose multi-stage hidden state fusion to further reinforce the representation power of hidden states, and provide the design alleviating the bottleneck caused by the memory-bound operations. As a result, the EfficientViM family achieves a new state-of-the-art speed-accuracy trade-off on ImageNet-1k, offering up to a 0.7% performance improvement over the second-best model SHViT with faster speed. Further, we observe significant improvements in throughput and accuracy compared to prior works, when scaling images or employing distillation training. Code is available at https://github.com/mlvlab/EfficientViM.

* preprint

Via

Access Paper or Ask Questions

Robust Multimodal 3D Object Detection via Modality-Agnostic Decoding and Proximity-based Modality Ensemble

Jul 27, 2024

Juhan Cha, Minseok Joo, Jihwan Park, Sanghyeok Lee, Injae Kim, Hyunwoo J. Kim

Abstract:Recent advancements in 3D object detection have benefited from multi-modal information from the multi-view cameras and LiDAR sensors. However, the inherent disparities between the modalities pose substantial challenges. We observe that existing multi-modal 3D object detection methods heavily rely on the LiDAR sensor, treating the camera as an auxiliary modality for augmenting semantic details. This often leads to not only underutilization of camera data but also significant performance degradation in scenarios where LiDAR data is unavailable. Additionally, existing fusion methods overlook the detrimental impact of sensor noise induced by environmental changes, on detection performance. In this paper, we propose MEFormer to address the LiDAR over-reliance problem by harnessing critical information for 3D object detection from every available modality while concurrently safeguarding against corrupted signals during the fusion process. Specifically, we introduce Modality Agnostic Decoding (MOAD) that extracts geometric and semantic features with a shared transformer decoder regardless of input modalities and provides promising improvement with a single modality as well as multi-modality. Additionally, our Proximity-based Modality Ensemble (PME) module adaptively utilizes the strengths of each modality depending on the environment while mitigating the effects of a noisy sensor. Our MEFormer achieves state-of-the-art performance of 73.9% NDS and 71.5% mAP in the nuScenes validation set. Extensive analyses validate that our MEFormer improves robustness against challenging conditions such as sensor malfunctions or environmental changes. The source code is available at https://github.com/hanchaa/MEFormer

Via

Access Paper or Ask Questions

Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers

Apr 01, 2024

Sanghyeok Lee, Joonmyung Choi, Hyunwoo J. Kim

Abstract:Vision Transformer (ViT) has emerged as a prominent backbone for computer vision. For more efficient ViTs, recent works lessen the quadratic cost of the self-attention layer by pruning or fusing the redundant tokens. However, these works faced the speed-accuracy trade-off caused by the loss of information. Here, we argue that token fusion needs to consider diverse relations between tokens to minimize information loss. In this paper, we propose a Multi-criteria Token Fusion (MCTF), that gradually fuses the tokens based on multi-criteria (e.g., similarity, informativeness, and size of fused tokens). Further, we utilize the one-step-ahead attention, which is the improved approach to capture the informativeness of the tokens. By training the model equipped with MCTF using a token reduction consistency, we achieve the best speed-accuracy trade-off in the image classification (ImageNet1K). Experimental results prove that MCTF consistently surpasses the previous reduction methods with and without training. Specifically, DeiT-T and DeiT-S with MCTF reduce FLOPs by about 44% while improving the performance (+0.5%, and +0.3%) over the base model, respectively. We also demonstrate the applicability of MCTF in various Vision Transformers (e.g., T2T-ViT, LV-ViT), achieving at least 31% speedup without performance degradation. Code is available at https://github.com/mlvlab/MCTF.

* Conference on Computer Vision and Pattern Recognition (CVPR), 2024

Via

Access Paper or Ask Questions

vid-TLDR: Training Free Token merging for Light-weight Video Transformer

Mar 30, 2024

Joonmyung Choi, Sanghyeok Lee, Jaewon Chu, Minhyuk Choi, Hyunwoo J. Kim

Figure 1 for vid-TLDR: Training Free Token merging for Light-weight Video Transformer

Figure 2 for vid-TLDR: Training Free Token merging for Light-weight Video Transformer

Figure 3 for vid-TLDR: Training Free Token merging for Light-weight Video Transformer

Figure 4 for vid-TLDR: Training Free Token merging for Light-weight Video Transformer

Abstract:Video Transformers have become the prevalent solution for various video downstream tasks with superior expressive power and flexibility. However, these video transformers suffer from heavy computational costs induced by the massive number of tokens across the entire video frames, which has been the major barrier to training the model. Further, the patches irrelevant to the main contents, e.g., backgrounds, degrade the generalization performance of models. To tackle these issues, we propose training free token merging for lightweight video Transformer (vid-TLDR) that aims to enhance the efficiency of video Transformers by merging the background tokens without additional training. For vid-TLDR, we introduce a novel approach to capture the salient regions in videos only with the attention map. Further, we introduce the saliency-aware token merging strategy by dropping the background tokens and sharpening the object scores. Our experiments show that vid-TLDR significantly mitigates the computational complexity of video Transformers while achieving competitive performance compared to the base model without vid-TLDR. Code is available at https://github.com/mlvlab/vid-TLDR.

* Conference on Computer Vision and Pattern Recognition (CVPR), 2024

Via

Access Paper or Ask Questions

Read-only Prompt Optimization for Vision-Language Few-shot Learning

Aug 29, 2023

Dongjun Lee, Seokwon Song, Jihee Suh, Joonmyeong Choi, Sanghyeok Lee, Hyunwoo J. Kim

Figure 1 for Read-only Prompt Optimization for Vision-Language Few-shot Learning

Figure 2 for Read-only Prompt Optimization for Vision-Language Few-shot Learning

Figure 3 for Read-only Prompt Optimization for Vision-Language Few-shot Learning

Figure 4 for Read-only Prompt Optimization for Vision-Language Few-shot Learning

Abstract:In recent years, prompt tuning has proven effective in adapting pre-trained vision-language models to downstream tasks. These methods aim to adapt the pre-trained models by introducing learnable prompts while keeping pre-trained weights frozen. However, learnable prompts can affect the internal representation within the self-attention module, which may negatively impact performance variance and generalization, especially in data-deficient settings. To address these issues, we propose a novel approach, Read-only Prompt Optimization (RPO). RPO leverages masked attention to prevent the internal representation shift in the pre-trained model. Further, to facilitate the optimization of RPO, the read-only prompts are initialized based on special tokens of the pre-trained model. Our extensive experiments demonstrate that RPO outperforms CLIP and CoCoOp in base-to-new generalization and domain generalization while displaying better robustness. Also, the proposed method achieves better generalization on extremely data-deficient settings, while improving parameter efficiency and computational overhead. Code is available at https://github.com/mlvlab/RPO.

* Accepted at ICCV2023

Via

Access Paper or Ask Questions

Self-positioning Point-based Transformer for Point Cloud Understanding

Mar 29, 2023

Jinyoung Park, Sanghyeok Lee, Sihyeon Kim, Yunyang Xiong, Hyunwoo J. Kim

Figure 1 for Self-positioning Point-based Transformer for Point Cloud Understanding

Figure 2 for Self-positioning Point-based Transformer for Point Cloud Understanding

Figure 3 for Self-positioning Point-based Transformer for Point Cloud Understanding

Figure 4 for Self-positioning Point-based Transformer for Point Cloud Understanding

Abstract:Transformers have shown superior performance on various computer vision tasks with their capabilities to capture long-range dependencies. Despite the success, it is challenging to directly apply Transformers on point clouds due to their quadratic cost in the number of points. In this paper, we present a Self-Positioning point-based Transformer (SPoTr), which is designed to capture both local and global shape contexts with reduced complexity. Specifically, this architecture consists of local self-attention and self-positioning point-based global cross-attention. The self-positioning points, adaptively located based on the input shape, consider both spatial and semantic information with disentangled attention to improve expressive power. With the self-positioning points, we propose a novel global cross-attention mechanism for point clouds, which improves the scalability of global self-attention by allowing the attention module to compute attention weights with only a small set of self-positioning points. Experiments show the effectiveness of SPoTr on three point cloud tasks such as shape classification, part segmentation, and scene segmentation. In particular, our proposed model achieves an accuracy gain of 2.6% over the previous best models on shape classification with ScanObjectNN. We also provide qualitative analyses to demonstrate the interpretability of self-positioning points. The code of SPoTr is available at https://github.com/mlvlab/SPoTr.

* Accepted paper at CVPR 2023

Via

Access Paper or Ask Questions

SageMix: Saliency-Guided Mixup for Point Clouds

Oct 13, 2022

Sanghyeok Lee, Minkyu Jeon, Injae Kim, Yunyang Xiong, Hyunwoo J. Kim

Figure 1 for SageMix: Saliency-Guided Mixup for Point Clouds

Figure 2 for SageMix: Saliency-Guided Mixup for Point Clouds

Figure 3 for SageMix: Saliency-Guided Mixup for Point Clouds

Figure 4 for SageMix: Saliency-Guided Mixup for Point Clouds

Abstract:Data augmentation is key to improving the generalization ability of deep learning models. Mixup is a simple and widely-used data augmentation technique that has proven effective in alleviating the problems of overfitting and data scarcity. Also, recent studies of saliency-aware Mixup in the image domain show that preserving discriminative parts is beneficial to improving the generalization performance. However, these Mixup-based data augmentations are underexplored in 3D vision, especially in point clouds. In this paper, we propose SageMix, a saliency-guided Mixup for point clouds to preserve salient local structures. Specifically, we extract salient regions from two point clouds and smoothly combine them into one continuous shape. With a simple sequential sampling by re-weighted saliency scores, SageMix preserves the local structure of salient regions. Extensive experiments demonstrate that the proposed method consistently outperforms existing Mixup methods in various benchmark point cloud datasets. With PointNet++, our method achieves an accuracy gain of 2.6% and 4.0% over standard training in 3D Warehouse dataset (MN40) and ScanObjectNN, respectively. In addition to generalization performance, SageMix improves robustness and uncertainty calibration. Moreover, when adopting our method to various tasks including part segmentation and standard 2D image classification, our method achieves competitive performance.

* Neural Information Processing Systems (NeurIPS), 2022

Via

Access Paper or Ask Questions