Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Minji Kim

ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting

Feb 20, 2025

Abhijit Mishra, Richard Noh, Hsiang Fu, Mingda Li, Minji Kim

Abstract:Efficient and privacy-preserving multimodal interaction is essential as AR, VR, and modern smartphones with powerful cameras become primary interfaces for human-computer communication. Existing powerful large vision-language models (VLMs) enabling multimodal interaction often rely on cloud-based processing, raising significant concerns about (1) visual privacy by transmitting sensitive vision data to servers, and (2) their limited real-time, on-device usability. This paper explores Visual Instruction Rewriting, a novel approach that transforms multimodal instructions into text-only commands, allowing seamless integration of lightweight on-device instruction rewriter VLMs (250M parameters) with existing conversational AI systems, enhancing vision data privacy. To achieve this, we present a dataset of over 39,000 examples across 14 domains and develop a compact VLM, pretrained on image captioning datasets and fine-tuned for instruction rewriting. Experimental results, evaluated through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic parsing analysis, demonstrate that even a quantized version of the model (<500MB storage footprint) can achieve effective instruction rewriting, thus enabling privacy-focused, multimodal AI applications.

* 12 pages, 7 figures, 3 tables

Via

Access Paper or Ask Questions

Rule-based autocorrection of Piping and Instrumentation Diagrams (P&IDs) on graphs

Feb 18, 2025

Lukas Schulze Balhorn, Niels Seijsener, Kevin Dao, Minji Kim, Dominik P. Goldstein, Ge H. M. Driessen, Artur M. Schweidtmann

Abstract:A piping and instrumentation diagram (P&ID) is a central reference document in chemical process engineering. Currently, chemical engineers manually review P&IDs through visual inspection to find and rectify errors. However, engineering projects can involve hundreds to thousands of P&ID pages, creating a significant revision workload. This study proposes a rule-based method to support engineers with error detection and correction in P&IDs. The method is based on a graph representation of P&IDs, enabling automated error detection and correction, i.e., autocorrection, through rule graphs. We use our pyDEXPI Python package to generate P&ID graphs from DEXPI-standard P&IDs. In this study, we developed 33 rules based on chemical engineering knowledge and heuristics, with five selected rules demonstrated as examples. A case study on an illustrative P&ID validates the reliability and effectiveness of the rule-based autocorrection method in revising P&IDs.

Via

Access Paper or Ask Questions

Physics-informed reduced order model with conditional neural fields

Dec 06, 2024

Minji Kim, Tianshu Wen, Kookjin Lee, Youngsoo Choi

Figure 1 for Physics-informed reduced order model with conditional neural fields

Figure 2 for Physics-informed reduced order model with conditional neural fields

Abstract:This study presents the conditional neural fields for reduced-order modeling (CNF-ROM) framework to approximate solutions of parametrized partial differential equations (PDEs). The approach combines a parametric neural ODE (PNODE) for modeling latent dynamics over time with a decoder that reconstructs PDE solutions from the corresponding latent states. We introduce a physics-informed learning objective for CNF-ROM, which includes two key components. First, the framework uses coordinate-based neural networks to calculate and minimize PDE residuals by computing spatial derivatives via automatic differentiation and applying the chain rule for time derivatives. Second, exact initial and boundary conditions (IC/BC) are imposed using approximate distance functions (ADFs) [Sukumar and Srivastava, CMAME, 2022]. However, ADFs introduce a trade-off as their second- or higher-order derivatives become unstable at the joining points of boundaries. To address this, we introduce an auxiliary network inspired by [Gladstone et al., NeurIPS ML4PS workshop, 2022]. Our method is validated through parameter extrapolation and interpolation, temporal extrapolation, and comparisons with analytical solutions.

* 7 pages, 2 figures, NeurIPS 2024 Workshop on Machine Learning and the Physical Sciences

Via

Access Paper or Ask Questions

Leveraging Temporal Contextualization for Video Action Recognition

Apr 15, 2024

Minji Kim, Dongyoon Han, Taekyung Kim, Bohyung Han

Figure 1 for Leveraging Temporal Contextualization for Video Action Recognition

Figure 2 for Leveraging Temporal Contextualization for Video Action Recognition

Figure 3 for Leveraging Temporal Contextualization for Video Action Recognition

Figure 4 for Leveraging Temporal Contextualization for Video Action Recognition

Abstract:Pretrained vision-language models have shown effectiveness in video understanding. However, recent studies have not sufficiently leveraged essential temporal information from videos, simply averaging frame-wise representations or referencing consecutive frames. We introduce Temporally Contextualized CLIP (TC-CLIP), a pioneering framework for video understanding that effectively and efficiently leverages comprehensive video information. We propose Temporal Contextualization (TC), a novel layer-wise temporal information infusion mechanism for video that extracts core information from each frame, interconnects relevant information across the video to summarize into context tokens, and ultimately leverages the context tokens during the feature encoding process. Furthermore, our Video-conditional Prompting (VP) module manufactures context tokens to generate informative prompts in text modality. We conduct extensive experiments in zero-shot, few-shot, base-to-novel, and fully-supervised action recognition to validate the superiority of our TC-CLIP. Ablation studies for TC and VP guarantee our design choices. Code is available at https://github.com/naver-ai/tc-clip

* 24 pages, 10 figures, 12 tables

Via

Access Paper or Ask Questions

Neural Collage Transfer: Artistic Reconstruction via Material Manipulation

Nov 03, 2023

Ganghun Lee, Minji Kim, Yunsu Lee, Minsu Lee, Byoung-Tak Zhang

Abstract:Collage is a creative art form that uses diverse material scraps as a base unit to compose a single image. Although pixel-wise generation techniques can reproduce a target image in collage style, it is not a suitable method due to the solid stroke-by-stroke nature of the collage form. While some previous works for stroke-based rendering produced decent sketches and paintings, collages have received much less attention in research despite their popularity as a style. In this paper, we propose a method for learning to make collages via reinforcement learning without the need for demonstrations or collage artwork data. We design the collage Markov Decision Process (MDP), which allows the agent to handle various materials and propose a model-based soft actor-critic to mitigate the agent's training burden derived from the sophisticated dynamics of collage. Moreover, we devise additional techniques such as active material selection and complexity-based multi-scale collage to handle target images at any size and enhance the results' aesthetics by placing relatively more scraps in areas of high complexity. Experimental results show that the trained agent appropriately selected and pasted materials to regenerate the target image into a collage and obtained a higher evaluation score on content and style than pixel-wise generation methods. Code is available at https://github.com/northadventure/CollageRL.

* ICCV 2023

Via

Access Paper or Ask Questions

Addressing Distribution Shift in RTB Markets via Exponential Tilting

Aug 14, 2023

Minji Kim, Seong Jin Lee, Bumsik Kim

Abstract:Distribution shift in machine learning models can be a primary cause of performance degradation. This paper delves into the characteristics of these shifts, primarily motivated by Real-Time Bidding (RTB) market models. We emphasize the challenges posed by class imbalance and sample selection bias, both potent instigators of distribution shifts. This paper introduces the Exponential Tilt Reweighting Alignment (ExTRA) algorithm, as proposed by Marty et al. (2023), to address distribution shifts in data. The ExTRA method is designed to determine the importance weights on the source data, aiming to minimize the KL divergence between the weighted source and target datasets. A notable advantage of this method is its ability to operate using labeled source data and unlabeled target data. Through simulated real-world data, we investigate the nature of distribution shift and evaluate the applicacy of the proposed model.

Via

Access Paper or Ask Questions

EXOT: Exit-aware Object Tracker for Safe Robotic Manipulation of Moving Object

Jun 08, 2023

Hyunseo Kim, Hye Jung Yoon, Minji Kim, Dong-Sig Han, Byoung-Tak Zhang

Abstract:Current robotic hand manipulation narrowly operates with objects in predictable positions in limited environments. Thus, when the location of the target object deviates severely from the expected location, a robot sometimes responds in an unexpected way, especially when it operates with a human. For safe robot operation, we propose the EXit-aware Object Tracker (EXOT) on a robot hand camera that recognizes an object's absence during manipulation. The robot decides whether to proceed by examining the tracker's bounding box output containing the target object. We adopt an out-of-distribution classifier for more accurate object recognition since trackers can mistrack a background as a target object. To the best of our knowledge, our method is the first approach of applying an out-of-distribution classification technique to a tracker output. We evaluate our method on the first-person video benchmark dataset, TREK-150, and on the custom dataset, RMOT-223, that we collect from the UR5e robot. Then we test our tracker on the UR5e robot in real-time with a conveyor-belt sushi task, to examine the tracker's ability to track target dishes and to determine the exit status. Our tracker shows 38% higher exit-aware performance than a baseline method. The dataset and the code will be released at https://github.com/hskAlena/EXOT.

* 2023 IEEE International Conference on Robotics and Automation (ICRA)

Via

Access Paper or Ask Questions

Towards Sequence-Level Training for Visual Tracking

Aug 11, 2022

Minji Kim, Seungkwan Lee, Jungseul Ok, Bohyung Han, Minsu Cho

Figure 1 for Towards Sequence-Level Training for Visual Tracking

Figure 2 for Towards Sequence-Level Training for Visual Tracking

Figure 3 for Towards Sequence-Level Training for Visual Tracking

Figure 4 for Towards Sequence-Level Training for Visual Tracking

Abstract:Despite the extensive adoption of machine learning on the task of visual object tracking, recent learning-based approaches have largely overlooked the fact that visual tracking is a sequence-level task in its nature; they rely heavily on frame-level training, which inevitably induces inconsistency between training and testing in terms of both data distributions and task objectives. This work introduces a sequence-level training strategy for visual tracking based on reinforcement learning and discusses how a sequence-level design of data sampling, learning objectives, and data augmentation can improve the accuracy and robustness of tracking algorithms. Our experiments on standard benchmarks including LaSOT, TrackingNet, and GOT-10k demonstrate that four representative tracking models, SiamRPN++, SiamAttn, TransT, and TrDiMP, consistently improve by incorporating the proposed methods in training without modifying architectures.

* ECCV 2022

Via

Access Paper or Ask Questions

From Scratch to Sketch: Deep Decoupled Hierarchical Reinforcement Learning for Robotic Sketching Agent

Aug 09, 2022

Ganghun Lee, Minji Kim, Minsu Lee, Byoung-Tak Zhang

Figure 1 for From Scratch to Sketch: Deep Decoupled Hierarchical Reinforcement Learning for Robotic Sketching Agent

Figure 2 for From Scratch to Sketch: Deep Decoupled Hierarchical Reinforcement Learning for Robotic Sketching Agent

Figure 3 for From Scratch to Sketch: Deep Decoupled Hierarchical Reinforcement Learning for Robotic Sketching Agent

Figure 4 for From Scratch to Sketch: Deep Decoupled Hierarchical Reinforcement Learning for Robotic Sketching Agent

Abstract:We present an automated learning framework for a robotic sketching agent that is capable of learning stroke-based rendering and motor control simultaneously. We formulate the robotic sketching problem as a deep decoupled hierarchical reinforcement learning; two policies for stroke-based rendering and motor control are learned independently to achieve sub-tasks for drawing, and form a hierarchy when cooperating for real-world drawing. Without hand-crafted features, drawing sequences or trajectories, and inverse kinematics, the proposed method trains the robotic sketching agent from scratch. We performed experiments with a 6-DoF robot arm with 2F gripper to sketch doodles. Our experimental results show that the two policies successfully learned the sub-tasks and collaborated to sketch the target images. Also, the robustness and flexibility were examined by varying drawing tools and surfaces.

* Accepted to ICRA 2022

Via

Access Paper or Ask Questions

Online Hybrid Lightweight Representations Learning: Its Application to Visual Tracking

May 23, 2022

Ilchae Jung, Minji Kim, Eunhyeok Park, Bohyung Han

Figure 1 for Online Hybrid Lightweight Representations Learning: Its Application to Visual Tracking

Figure 2 for Online Hybrid Lightweight Representations Learning: Its Application to Visual Tracking

Figure 3 for Online Hybrid Lightweight Representations Learning: Its Application to Visual Tracking

Figure 4 for Online Hybrid Lightweight Representations Learning: Its Application to Visual Tracking

Abstract:This paper presents a novel hybrid representation learning framework for streaming data, where an image frame in a video is modeled by an ensemble of two distinct deep neural networks; one is a low-bit quantized network and the other is a lightweight full-precision network. The former learns coarse primary information with low cost while the latter conveys residual information for high fidelity to original representations. The proposed parallel architecture is effective to maintain complementary information since fixed-point arithmetic can be utilized in the quantized network and the lightweight model provides precise representations given by a compact channel-pruned network. We incorporate the hybrid representation technique into an online visual tracking task, where deep neural networks need to handle temporal variations of target appearances in real-time. Compared to the state-of-the-art real-time trackers based on conventional deep neural networks, our tracking algorithm demonstrates competitive accuracy on the standard benchmarks with a small fraction of computational cost and memory footprint.

* 7 pages, 1 figure, accepted at IJCAI2022

Via

Access Paper or Ask Questions