Dong-Ki Kim

Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents

May 19, 2025

Yunseok Jang, Yeda Song, Sungryull Sohn, Lajanugen Logeswaran, Tiange Luo, Dong-Ki Kim, Kyunghoon Bae, Honglak Lee

Abstract:Recent advancements in Large Language Models (LLMs) and Vision-Language Models (VLMs) have sparked significant interest in developing GUI visual agents. We introduce MONDAY (Mobile OS Navigation Task Dataset for Agents from YouTube), a large-scale dataset of 313K annotated frames from 20K instructional videos capturing diverse real-world mobile OS navigation across multiple platforms. Models that include MONDAY in their pre-training phases demonstrate robust cross-platform generalization capabilities, consistently outperforming models trained on existing single OS datasets while achieving an average performance gain of 18.11%p on an unseen mobile OS platform. To enable continuous dataset expansion as mobile platforms evolve, we present an automated framework that leverages publicly available video content to create comprehensive task datasets without manual annotation. Our framework comprises robust OCR-based scene detection (95.04% F1 score), near-perfect UI element detection (99.87% hit ratio), and novel multi-step action identification to extract reliable action sequences across diverse interface configurations. We contribute both the MONDAY dataset and our automated collection framework to facilitate future research in mobile OS navigation.

* CVPR 2025

SayComply: Grounding Field Robotic Tasks in Operational Compliance through Retrieval-Based Language Models

Nov 18, 2024

Muhammad Fadhil Ginting, Dong-Ki Kim, Sung-Kyun Kim, Bandi Jai Krishna, Mykel J. Kochenderfer, Shayegan Omidshafiei, Ali-akbar Agha-mohammadi

Abstract:This paper addresses the problem of task planning for robots that must comply with operational manuals in real-world settings. Task planning under these constraints is essential for enabling autonomous robot operation in domains that require adherence to domain-specific knowledge. Current methods for generating robot goals and plans rely on common sense knowledge encoded in large language models. However, these models lack grounding of robot plans to domain-specific knowledge and are not easily transferable between multiple sites or customers with different compliance needs. In this work, we present SayComply, which enables grounding robotic task planning with operational compliance using retrieval-based language models. We design a hierarchical database of operational, environment, and robot embodiment manuals and procedures to enable efficient retrieval of the relevant context under the limited context length of the LLMs. We then design a task planner using a tree-based retrieval augmented generation (RAG) technique to generate robot tasks that follow user instructions while simultaneously complying with the domain knowledge in the database. We demonstrate the benefits of our approach through simulations and hardware experiments in real-world scenarios that require precise context retrieval across various types of context, outperforming the standard RAG method. Our approach bridges the gap in deploying robots that consistently adhere to operational protocols, offering a scalable and edge-deployable solution for ensuring compliance across varied and complex real-world environments. Project website: saycomply.github.io.
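
The tree-based retrieval idea in the SayComply abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Node` class, the `overlap` scoring function (plain word overlap standing in for whatever relevance model the paper uses), and the manual contents below are all hypothetical. The sketch only shows the general pattern of descending a hierarchy of manuals level by level so that a single small leaf, rather than the whole database, ends up in the LLM context.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One entry in a hypothetical hierarchical manual database."""
    title: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words (an assumption)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(root: Node, query: str) -> tuple[list[str], str]:
    """Walk from the root to a leaf, taking the best-scoring child at each level."""
    node = root
    path = [node.title]
    while node.children:
        node = max(node.children, key=lambda c: overlap(query, c.title + " " + c.text))
        path.append(node.title)
    return path, node.text


# Hypothetical database contents, invented for illustration only.
ROOT = Node("Manuals", children=[
    Node("Operational procedures", children=[
        Node("Gas leak inspection procedure",
             "measure gas concentration before entering the area"),
        Node("Night patrol procedure", "patrol the perimeter once every hour"),
    ]),
    Node("Robot embodiment", children=[
        Node("Battery limits", "return to the dock when battery is below 20 percent"),
    ]),
])

path, context = retrieve(ROOT, "inspect gas leak following operational procedures")
print(" > ".join(path))
print(context)
```

Only the retrieved leaf text would be passed to the planner in this sketch, which is one way to stay within a limited context length.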

Auto-Intent: Automated Intent Discovery and Self-Exploration for Large Language Model Web Agents

Oct 29, 2024

Jaekyeom Kim, Dong-Ki Kim, Lajanugen Logeswaran, Sungryull Sohn, Honglak Lee

Abstract:In this paper, we introduce Auto-Intent, a method to adapt a pre-trained large language model (LLM) as an agent for a target domain without direct fine-tuning, where we empirically focus on web navigation tasks. Our approach first discovers the underlying intents from target domain demonstrations in an unsupervised manner, in a highly compact form (up to three words). With the extracted intents, we train our intent predictor to predict the next intent given the agent's past observations and actions. In particular, we propose a self-exploration approach where top-k probable intent predictions are provided as a hint to the pre-trained LLM agent, which leads to enhanced decision-making capabilities. Auto-Intent substantially improves the performance of GPT-{3.5, 4} and Llama-3.1-{70B, 405B} agents on the large-scale real-website navigation benchmarks from Mind2Web and online navigation tasks from WebArena with its cross-benchmark generalization from Mind2Web.

* EMNLP 2024 Findings
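
The Auto-Intent abstract describes passing the top-k predicted intents to the LLM agent as a hint. A rough sketch of that step follows, under stated assumptions: the intent scores are made-up numbers standing in for the output of a trained intent predictor, and the prompt wording and the function names `top_k_intents` and `build_prompt` are hypothetical, not taken from the paper.

```python
import heapq


def top_k_intents(scores: dict[str, float], k: int) -> list[str]:
    """Return the k intents with the highest predicted score, best first."""
    best = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
    return [intent for intent, _ in best]


def build_prompt(observation: str, intents: list[str]) -> str:
    """Attach the candidate intents to the agent prompt as a hint (hypothetical format)."""
    hint = "; ".join(intents)
    return (
        f"Observation: {observation}\n"
        f"Possible next intents: {hint}\n"
        "Next action:"
    )


# Hypothetical predictor output; each intent is at most three words.
predicted = {
    "search flights": 0.50,
    "select date": 0.30,
    "open menu": 0.15,
    "log in": 0.05,
}

prompt = build_prompt("Airline home page with a search form", top_k_intents(predicted, 3))
print(prompt)
```

Offering several candidates rather than a single prediction leaves the final choice to the LLM agent, which is the role the abstract assigns to the hint.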

AutoGuide: Automated Generation and Selection of State-Aware Guidelines for Large Language Model Agents

Mar 13, 2024

Yao Fu, Dong-Ki Kim, Jaekyeom Kim, Sungryull Sohn, Lajanugen Logeswaran, Kyunghoon Bae, Honglak Lee

Abstract:The primary limitation of large language models (LLMs) is their restricted understanding of the world. This poses significant difficulties for LLM-based agents, particularly in domains where pre-trained LLMs lack sufficient knowledge. In this paper, we introduce a novel framework, called AutoGuide, that bridges the knowledge gap in pre-trained LLMs by leveraging implicit knowledge in offline experiences. Specifically, AutoGuide effectively extracts knowledge embedded in offline data by extracting a set of state-aware guidelines. Importantly, each state-aware guideline is expressed in concise natural language and follows a conditional structure, clearly describing the state where it is applicable. As such, the resulting guidelines enable a principled way to provide helpful knowledge pertinent to an agent's current decision-making process. We show that our approach outperforms competitive LLM-based baselines by a large margin in sequential decision-making benchmarks.

TOD-Flow: Modeling the Structure of Task-Oriented Dialogues

Dec 07, 2023

Sungryull Sohn, Yiwei Lyu, Anthony Liu, Lajanugen Logeswaran, Dong-Ki Kim, Dongsub Shim, Honglak Lee

Abstract:Task-Oriented Dialogue (TOD) systems have become crucial components in interactive artificial intelligence applications. While recent advances have capitalized on pre-trained language models (PLMs), they exhibit limitations regarding transparency and controllability. To address these challenges, we propose a novel approach focusing on inferring the TOD-Flow graph from dialogue data annotated with dialog acts, uncovering the underlying task structure in the form of a graph. The inferred TOD-Flow graph can be easily integrated with any dialogue model to improve its prediction performance, transparency, and controllability. Our TOD-Flow graph learns what a model can, should, and should not predict, effectively reducing the search space and providing a rationale for the model's prediction. We show that the proposed TOD-Flow graph better resembles human-annotated graphs compared to prior approaches. Furthermore, when combined with several dialogue policies and end-to-end dialogue models, we demonstrate that our approach significantly improves dialog act classification and end-to-end response generation performance in the MultiWOZ and SGD benchmarks. Code available at: https://github.com/srsohn/TOD-Flow

Code Models are Zero-shot Precondition Reasoners

Nov 16, 2023

Lajanugen Logeswaran, Sungryull Sohn, Yiwei Lyu, Anthony Zhe Liu, Dong-Ki Kim, Dongsub Shim, Moontae Lee, Honglak Lee

Abstract:One of the fundamental skills required for an agent acting in an environment to complete tasks is the ability to understand what actions are plausible at any given point. This work explores a novel use of code representations to reason about action preconditions for sequential decision making tasks. Code representations offer the flexibility to model procedural activities and associated constraints as well as the ability to execute and verify constraint satisfaction. Leveraging code representations, we extract action preconditions from demonstration trajectories in a zero-shot manner using pre-trained code models. Given these extracted preconditions, we propose a precondition-aware action sampling strategy that ensures actions predicted by a policy are consistent with preconditions. We demonstrate that the proposed approach enhances the performance of few-shot policy learning approaches across task-oriented dialog and embodied textworld benchmarks.

* NeurIPS Foundation Models for Decision Making Workshop 2023
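
To make the idea of executable preconditions concrete, here is a small sketch assuming a toy kitchen environment. In the paper the preconditions are extracted by pre-trained code models; here they are hand-written lambdas, and the state keys, action names, policy scores, and the greedy filtering rule in `select_action` are all hypothetical simplifications rather than the authors' sampling strategy.

```python
from typing import Callable, Optional

State = dict[str, bool]

# Hand-written stand-ins for preconditions a code model might extract.
PRECONDITIONS: dict[str, Callable[[State], bool]] = {
    "open_fridge": lambda s: not s["fridge_open"],
    "take_apple": lambda s: s["fridge_open"] and not s["holding_apple"],
    "eat_apple": lambda s: s["holding_apple"],
}


def select_action(policy_scores: dict[str, float], state: State) -> Optional[str]:
    """Return the highest-scoring action whose precondition holds in `state`."""
    ranked = sorted(policy_scores, key=lambda a: policy_scores[a], reverse=True)
    for action in ranked:
        # Actions without a known precondition are treated as always allowed.
        check = PRECONDITIONS.get(action, lambda s: True)
        if check(state):
            return action
    return None


state = {"fridge_open": False, "holding_apple": False}
scores = {"take_apple": 0.6, "open_fridge": 0.3, "eat_apple": 0.1}

# The policy prefers take_apple, but the fridge is closed, so the check rejects it.
print(select_action(scores, state))
```

Because the preconditions are ordinary code, they can be executed against the current state to verify constraint satisfaction before an action is committed.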

MultiPrompter: Cooperative Prompt Optimization with Multi-Agent Reinforcement Learning

Oct 25, 2023

Dong-Ki Kim, Sungryull Sohn, Lajanugen Logeswaran, Dongsub Shim, Honglak Lee

Abstract:Recently, there has been an increasing interest in automated prompt optimization based on reinforcement learning (RL). This approach offers important advantages, such as generating interpretable prompts and being compatible with black-box foundation models. However, the substantial prompt space size poses challenges for RL-based methods, often leading to suboptimal policy convergence. This paper introduces MultiPrompter, a new framework that views prompt optimization as a cooperative game between prompters which take turns composing a prompt together. Our cooperative prompt optimization effectively reduces the problem size and helps prompters learn optimal prompts. We test our method on the text-to-image task and show its ability to generate higher-quality images than baselines.

Game-Theoretical Perspectives on Active Equilibria: A Preferred Solution Concept over Nash Equilibria

Oct 28, 2022

Dong-Ki Kim, Matthew Riemer, Miao Liu, Jakob N. Foerster, Gerald Tesauro, Jonathan P. How

Abstract:Multiagent learning settings are inherently more difficult than single-agent learning because each agent interacts with other simultaneously learning agents in a shared environment. An effective approach in multiagent reinforcement learning is to consider the learning process of agents and influence their future policies toward desirable behaviors from each agent's perspective. Importantly, if each agent maximizes its long-term rewards by accounting for the impact of its behavior on the set of convergence policies, the resulting multiagent system reaches an active equilibrium. While this new solution concept is general such that standard solution concepts, such as a Nash equilibrium, are special cases of active equilibria, it is unclear when an active equilibrium is a preferred equilibrium over other solution concepts. In this paper, we analyze active equilibria from a game-theoretic perspective by closely studying examples where Nash equilibria are known. By directly comparing active equilibria to Nash equilibria in these examples, we find that active equilibria find more effective solutions than Nash equilibria, concluding that an active equilibrium is the desired solution for multiagent learning settings.

City-wide Street-to-Satellite Image Geolocalization of a Mobile Ground Agent

Mar 10, 2022

Lena M. Downes, Dong-Ki Kim, Ted J. Steiner, Jonathan P. How

Abstract:Cross-view image geolocalization provides an estimate of an agent's global position by matching a local ground image to an overhead satellite image without the need for GPS. It is challenging to reliably match a ground image to the correct satellite image since the images have significant viewpoint differences. Existing works have demonstrated localization in constrained scenarios over small areas but have not demonstrated wider-scale localization. Our approach, called Wide-Area Geolocalization (WAG), combines a neural network with a particle filter to achieve global position estimates for agents moving in GPS-denied environments, scaling efficiently to city-scale regions. WAG introduces a trinomial loss function for a Siamese network to robustly match non-centered image pairs and thus enables the generation of a smaller satellite image database by coarsely discretizing the search area. A modified particle filter weighting scheme is also presented to improve localization accuracy and convergence. Taken together, WAG's network training and particle filter weighting approach achieves city-scale position estimation accuracies on the order of 20 meters, a 98% reduction compared to a baseline training and weighting approach. Applied to a smaller-scale testing area, WAG reduces the final position estimation error by 64% compared to a state-of-the-art baseline from the literature. WAG's search space discretization additionally significantly reduces storage and processing requirements.

* 7 pages, 14 figures. Submitted to IROS 2022. Video highlight available at https://youtu.be/06MOR0ozQeI
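
For readers unfamiliar with how a particle filter combines with an image-matching network, the following sketch shows a textbook weight update. It does not reproduce WAG's modified weighting scheme, whose details are not given in the abstract. The Gaussian likelihood, the `sigma` parameter, and the embedding distances are assumptions: each distance stands for how far the ground-image embedding is from the embedding of the satellite tile at a particle's location.

```python
import math


def update_weights(weights: list[float], distances: list[float],
                   sigma: float = 1.0) -> list[float]:
    """Reweight particles with a Gaussian likelihood of the embedding distance."""
    raw = [w * math.exp(-(d * d) / (2.0 * sigma * sigma))
           for w, d in zip(weights, distances)]
    total = sum(raw)
    if total == 0.0:
        # All likelihoods underflowed; fall back to uniform weights.
        return [1.0 / len(weights)] * len(weights)
    return [r / total for r in raw]


def estimate_position(particles: list[tuple[float, float]],
                      weights: list[float]) -> tuple[float, float]:
    """Weighted mean of the particle positions."""
    x = sum(w * p[0] for p, w in zip(particles, weights))
    y = sum(w * p[1] for p, w in zip(particles, weights))
    return x, y


# Hypothetical particles (meters) and embedding distances for one ground image.
particles = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0)]
weights = [1.0 / 3.0] * 3
distances = [0.1, 2.0, 3.0]

weights = update_weights(weights, distances)
print(weights)
print(estimate_position(particles, weights))
```

Particles whose satellite tile matches the ground image well keep most of the weight, so the position estimate moves toward them; a full filter would also include motion updates and resampling, which are omitted here.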

Influencing Long-Term Behavior in Multiagent Reinforcement Learning

Mar 07, 2022

Dong-Ki Kim, Matthew Riemer, Miao Liu, Jakob N. Foerster, Michael Everett, Chuangchuang Sun, Gerald Tesauro, Jonathan P. How

Abstract:The main challenge of multiagent reinforcement learning is the difficulty of learning useful policies in the presence of other simultaneously learning agents whose changing behaviors jointly affect the environment's transition and reward dynamics. An effective approach that has recently emerged for addressing this non-stationarity is for each agent to anticipate the learning of other interacting agents and influence the evolution of their future policies towards desirable behavior for its own benefit. Unfortunately, all previous approaches for achieving this suffer from myopic evaluation, considering only a few or a finite number of updates to the policies of other agents. In this paper, we propose a principled framework for considering the limiting policies of other agents as the time approaches infinity. Specifically, we develop a new optimization objective that maximizes each agent's average reward by directly accounting for the impact of its behavior on the limiting set of policies that other agents will take on. Thanks to our farsighted evaluation, we demonstrate better long-term performance than state-of-the-art baselines in various domains, including the full spectrum of general-sum, competitive, and cooperative settings.

* Under review as a workshop paper
