Abstract:While various vertical domain large language models (LLMs) have been developed, the challenge of automatically evaluating their performance across different domains remains significant in addressing real-world user needs. Current benchmark-based evaluation methods exhibit rigid, purposeless interactions and rely on pre-collected static datasets that are costly to build, inflexible across domains, and misaligned with practical user needs. To address this, we revisit the evaluation components and introduce two definitions: **Benchmark+**, which extends traditional QA benchmarks into a more flexible ``strategy-criterion'' format; and **Assessment+**, which enhances the interaction process for greater exploration and enables both quantitative metrics and qualitative insights that capture nuanced target LLM behaviors from richer multi-turn interactions. We propose an agent-based evaluation framework called *TestAgent*, which implements these two concepts through retrieval augmented generation and reinforcement learning. Experiments on tasks ranging from building vertical domain evaluation from scratch to activating existing benchmarks demonstrate the effectiveness of *TestAgent* across various scenarios. We believe this work offers an interesting perspective on automatic evaluation for LLMs.
Abstract:Recent work on visual representation learning has shown to be efficient for robotic manipulation tasks. However, most existing works pretrained the visual backbone solely on 2D images or egocentric videos, ignoring the fact that robots learn to act in 3D space, which is hard to learn from 2D observation. In this paper, we examine the effectiveness of pretraining for vision backbone with public-available large-scale 3D data to improve manipulation policy learning. Our method, namely Depth-aware Pretraining for Robotics (DPR), enables an RGB-only backbone to learn 3D scene representations from self-supervised contrastive learning, where depth information serves as auxiliary knowledge. No 3D information is necessary during manipulation policy learning and inference, making our model enjoy both efficiency and effectiveness in 3D space manipulation. Furthermore, we introduce a new way to inject robots' proprioception into the policy networks that makes the manipulation model robust and generalizable. We demonstrate in experiments that our proposed framework improves performance on unseen objects and visual environments for various robotics tasks on both simulated and real robots.
Abstract:Generative Adversarial Imitation Learning (GAIL) stands as a cornerstone approach in imitation learning. This paper investigates the gradient explosion in two types of GAIL: GAIL with deterministic policy (DE-GAIL) and GAIL with stochastic policy (ST-GAIL). We begin with the observation that the training can be highly unstable for DE-GAIL at the beginning of the training phase and end up divergence. Conversely, the ST-GAIL training trajectory remains consistent, reliably converging. To shed light on these disparities, we provide an explanation from a theoretical perspective. By establishing a probabilistic lower bound for GAIL, we demonstrate that gradient explosion is an inevitable outcome for DE-GAIL due to occasionally large expert-imitator policy disparity, whereas ST-GAIL does not have the issue with it. To substantiate our assertion, we illustrate how modifications in the reward function can mitigate the gradient explosion challenge. Finally, we propose CREDO, a simple yet effective strategy that clips the reward function during the training phase, allowing the GAIL to enjoy high data efficiency and stable trainability.