Abstract: Benchmarks are crucial for assessing multi-agent reinforcement learning (MARL) algorithms. While StarCraft II-related environments have driven significant advances in MARL, existing benchmarks like SMAC focus primarily on micromanagement, limiting comprehensive evaluation of high-level strategic intelligence. To address this, we introduce HLSMAC, a new cooperative MARL benchmark with 12 carefully designed StarCraft II scenarios based on classical stratagems from the Thirty-Six Stratagems. Each scenario corresponds to a specific stratagem and is designed to challenge agents with diverse strategic elements, including tactical maneuvering, timing coordination, and deception, thereby opening up avenues for evaluating high-level strategic decision-making capabilities. We also propose novel metrics across multiple dimensions beyond conventional win rate, such as ability utilization and advancement efficiency, to assess agents' overall performance within the HLSMAC environment. We integrate state-of-the-art MARL algorithms and LLM-based agents with our benchmark and conduct comprehensive experiments. The results demonstrate that HLSMAC serves as a robust testbed for advancing multi-agent strategic decision-making.
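To make the metric dimensions named above concrete, here is a minimal Python sketch of how such multi-dimensional evaluation could be computed from episode logs; the `EpisodeLog` fields and the exact formulas are illustrative assumptions, not HLSMAC's actual API.

```python
# Hypothetical sketch of evaluation beyond win rate; field names and
# formulas are assumptions for illustration, not HLSMAC's real schema.
from dataclasses import dataclass

@dataclass
class EpisodeLog:
    won: bool                 # episode outcome
    abilities_used: int       # distinct unit abilities the agents triggered
    abilities_available: int  # distinct abilities the scenario exposes
    steps_to_objective: int   # steps taken to reach the scenario objective
    step_limit: int           # scenario time limit

def evaluate(episodes: list[EpisodeLog]) -> dict[str, float]:
    n = len(episodes)
    return {
        "win_rate": sum(e.won for e in episodes) / n,
        # Fraction of scenario-relevant abilities the agents actually use.
        "ability_utilization": sum(
            e.abilities_used / e.abilities_available for e in episodes) / n,
        # Higher when the objective is reached in fewer steps.
        "advancement_efficiency": sum(
            1.0 - e.steps_to_objective / e.step_limit for e in episodes) / n,
    }
```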
Abstract: People need to internalize the skills of AI agents to improve their own capabilities. Our paper focuses on Mahjong, a multiplayer imperfect-information game that requires effective long-term decision-making amidst randomness and hidden information. Through the efforts of AI researchers, several impressive Mahjong AI agents have already achieved performance levels comparable to those of professional human players; however, these agents are often treated as black boxes from which few insights can be gleaned. This paper introduces Mxplainer, a parameterized search algorithm that can be converted into an equivalent neural network to learn the parameters of black-box agents. Experiments conducted on AI and human player data demonstrate that the learned parameters provide human-understandable insights into these agents' characteristics and play styles. In addition to analyzing the learned parameters, we also showcase how our search-based framework can locally explain the decision-making processes of black-box agents for most Mahjong game states.
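To illustrate the core idea of converting a parameterized search into an equivalent network, here is a hedged PyTorch sketch: if the search scores candidate actions by a weighted sum of hand-crafted features, that scoring rule is exactly a linear layer, so fitting the weights to a black-box agent's choices recovers interpretable parameters. The feature set, dimensions, and training loop are assumptions, not Mxplainer's actual architecture.

```python
# Minimal sketch: a linear-scoring search rewritten as a trainable net.
import torch
import torch.nn as nn

class SearchAsNet(nn.Module):
    def __init__(self, num_features: int):
        super().__init__()
        # The search algorithm's tunable parameters become network weights.
        self.weights = nn.Linear(num_features, 1, bias=False)

    def forward(self, action_features: torch.Tensor) -> torch.Tensor:
        # action_features: (batch, num_actions, num_features)
        return self.weights(action_features).squeeze(-1)  # action scores

model = SearchAsNet(num_features=16)
loss_fn = nn.CrossEntropyLoss()  # imitate the black-box agent's picks
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

feats = torch.randn(32, 34, 16)      # e.g. 34 candidate discards per state
picks = torch.randint(0, 34, (32,))  # the black-box agent's decisions
loss = loss_fn(model(feats), picks)
loss.backward()
opt.step()
# After training, model.weights exposes the relative importance of each
# search feature, i.e. a human-readable summary of the agent's style.
```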
Abstract: As multimodal large language models (MLLMs) continue to advance across challenging tasks, a key question emerges: what essential capabilities are still missing? A critical aspect of human learning is continuous interaction with the environment -- not limited to language, but also involving multimodal understanding and generation. To move closer to human-level intelligence, models must similarly support multi-turn, multimodal interaction. In particular, they should comprehend interleaved multimodal contexts and respond coherently in ongoing exchanges. In this work, we present an initial exploration through InterMT -- the first preference dataset for multi-turn multimodal interaction, grounded in real human feedback. In this exploration, we particularly emphasize the importance of human oversight, introducing expert annotations to guide the process, motivated by the fact that current MLLMs lack such complex interactive capabilities. InterMT captures human preferences at both global and local levels across nine sub-dimensions, and consists of 15.6k prompts, 52.6k multi-turn dialogue instances, and 32.4k human-labeled preference pairs. To compensate for current models' limited capability for multimodal understanding and generation, we introduce an agentic workflow that leverages tool-augmented MLLMs to construct multi-turn QA instances. To further this goal, we introduce InterMT-Bench to assess the ability of MLLMs to assist judges in multi-turn, multimodal tasks. We demonstrate the utility of InterMT through applications such as judge moderation, and further reveal the multi-turn scaling law of judge models. We hope that open-sourcing our data can help facilitate further research on aligning current MLLMs to the next step. Our project website can be found at https://pku-intermt.github.io .
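For readers who want a concrete picture of what a multi-turn multimodal preference pair with global and local judgments might contain, the following schema is an illustrative sketch only; InterMT's real data format and field names may differ.

```python
# Illustrative schema for a multi-turn multimodal preference record;
# all field names here are assumptions, not InterMT's actual format.
from dataclasses import dataclass

@dataclass
class TurnResponse:
    text: str
    image_refs: list[str]        # paths/URLs of generated or cited images

@dataclass
class PreferencePair:
    prompt: str
    history: list[TurnResponse]  # interleaved multimodal context so far
    chosen: TurnResponse
    rejected: TurnResponse
    # A local judgment for this turn and a global one for the dialogue,
    # each broken into sub-dimension scores (nine in the paper).
    local_scores: dict[str, float]
    global_scores: dict[str, float]
```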
Abstract: Business Process Management (BPM) is gaining increasing attention for its potential to cut costs while boosting output and quality. Business process document generation is a crucial stage in BPM. However, due to a shortage of datasets, data-driven deep learning techniques struggle to deliver the expected results. We propose an approach that transforms Conditional Process Trees (CPTs) into Business Process Text Sketches (BPTSs) using Large Language Models (LLMs). The traditional prompting approach (few-shot in-context learning) tries to obtain the correct answer in one pass; it can learn the pattern of transforming simple CPTs into BPTSs, but for closed-domain settings and CPTs with complex hierarchies it performs poorly, with low correctness. Drawing inspiration from the divide-and-conquer strategy, we instead break a difficult CPT down into a number of simple CPTs and then solve each one in turn. We randomly chose 100 process trees with depths ranging from 2 to 5, as well as CPTs with many nodes, high selection degrees, and cyclic nesting. Experiments show that our method achieves a correctness rate of 93.42%, which is 45.17% better than traditional prompting methods. Our proposed method provides a solution for business process document generation in the absence of datasets, and it also becomes potentially possible to supply large numbers of datasets to the process model extraction (PME) domain.
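The divide-and-conquer idea above can be sketched in a few lines. The following Python outline assumes a CPT is a nested tree of labeled nodes and uses a placeholder `call_llm` for whatever LLM client is available; the prompt wording is illustrative, not the paper's actual prompts.

```python
# Sketch of divide-and-conquer CPT-to-BPTS translation under the
# assumptions above; not the paper's exact decomposition rules.
from dataclasses import dataclass, field

@dataclass
class CPTNode:
    label: str                      # activity name or gateway condition
    children: list["CPTNode"] = field(default_factory=list)

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in an LLM client here

def cpt_to_sketch(node: CPTNode) -> str:
    # Base case: a simple (leaf-level) CPT is translated in one shot,
    # the pattern few-shot prompting already handles well.
    if not node.children:
        return call_llm(f"Translate this simple process step: {node.label}")
    # Recursive case: solve each child subtree first, then ask the LLM
    # to compose the partial sketches under the parent's condition.
    parts = [cpt_to_sketch(child) for child in node.children]
    return call_llm(
        f"Combine these process text sketches under '{node.label}':\n"
        + "\n".join(parts)
    )
```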
Abstract: Evolutionary game theory has been a successful tool for combining classical game theory with learning-dynamical descriptions in multiagent systems. Assuming some symmetric structure among interacting players, many studies have focused on using a simplified heuristic payoff table as input to analyse the dynamics of interactions. Nevertheless, even the state-of-the-art method has two limitations. First, its analysis of the simplified payoff table is inaccurate. Second, no existing work is able to deal with 2-population multiplayer asymmetric games. In this paper, we bridge the gap between the heuristic payoff table and dynamical analysis without any inaccuracy. In addition, we propose a general framework for $m$ versus $n$ 2-population multiplayer asymmetric games. We then compare our method with the state-of-the-art on some classic games. Finally, to illustrate our method, we perform empirical game-theoretic analysis on Wolfpack as well as StarCraft II, both of which involve complex multiagent interactions.
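For context, the two-population replicator dynamics below are the standard backbone of such dynamical analyses; the paper's exact $m$ versus $n$ multiplayer formulation may differ. Here $x$ and $y$ are the strategy frequencies of the two populations and $A$, $B$ their respective payoff matrices.

```latex
% Standard two-population (asymmetric) replicator dynamics; shown for
% orientation only, the paper's m-versus-n formulation may differ.
\begin{align}
  \dot{x}_i &= x_i \left[ (A y)_i - x^{\top} A y \right], \\
  \dot{y}_j &= y_j \left[ (B^{\top} x)_j - x^{\top} B y \right].
\end{align}
```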
Abstract: We show that for the problem of minimizing (or maximizing) the ratio of two supermodular functions, no bounded approximation ratio can be achieved with a polynomial number of queries if the two supermodular functions are both monotone non-decreasing or both monotone non-increasing.
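For precision, here is the problem restated in standard notation (a restatement for the reader, with $V$ the ground set and $f$, $g$ accessed through value queries):

```latex
% A set function f : 2^V -> R is supermodular iff, for all
% S \subseteq T \subseteq V and every v \in V \setminus T,
%   f(S \cup \{v\}) - f(S) \le f(T \cup \{v\}) - f(T).
% The inapproximability result concerns the ratio problem
\[
  \min_{\emptyset \neq S \subseteq V} \; \frac{f(S)}{g(S)}
  \qquad (\text{or the corresponding } \max),
\]
% where f and g are supermodular and both monotone non-decreasing or
% both monotone non-increasing: no algorithm making polynomially many
% value queries achieves any bounded approximation ratio.
```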
Abstract: We show that for the cardinality-constrained monotone submodular maximization problem, there exists a $(1-1/e-\varepsilon)$-approximate deterministic algorithm with linear query complexity, which performs $O(n/\varepsilon)$ queries in total.
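The paper's own algorithm is not reproduced here; as a flavor of how threshold-based greedy methods trade an $\varepsilon$ approximation loss for fewer oracle calls, below is a classical descending-thresholds greedy sketch in Python. Note that its query count carries an extra logarithmic factor over the linear bound claimed above.

```python
# Classical descending-thresholds greedy (Badanidiyuru-Vondrak style);
# NOT the paper's algorithm, just the flavor of the eps/query trade-off.
def threshold_greedy(f, ground_set, k, eps):
    """f: monotone submodular value oracle on sets; returns S, |S| <= k."""
    S: set = set()
    d = max(f({e}) for e in ground_set)        # largest singleton value
    tau = d
    while tau >= (eps / len(ground_set)) * d:
        for e in ground_set:
            if len(S) == k:
                return S
            # Add e whenever its marginal gain clears the threshold.
            if e not in S and f(S | {e}) - f(S) >= tau:
                S.add(e)
        tau *= (1 - eps)                       # lower the bar and rescan
    return S
```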
Abstract: In the NeurIPS 2018 Artificial Intelligence for Prosthetics challenge, participants were tasked with building a controller for a musculoskeletal model, with the goal of matching a given time-varying velocity vector. Top participants were invited to describe their algorithms. In this work, we describe the challenge and present thirteen solutions that used deep reinforcement learning approaches. Many solutions use similar relaxations and heuristics, such as reward shaping, frame skipping, discretization of the action space, symmetry, and policy blending. However, each team implemented different modifications of the known algorithms, for example by dividing the task into subtasks, learning low-level control, or incorporating expert knowledge and using imitation learning.
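As an example of one heuristic mentioned above, here is a minimal gym-style frame-skipping wrapper. It is a generic sketch, not any particular team's code; the challenge's actual osim-rl environment and the teams' reward handling differed in details.

```python
# Generic frame-skipping wrapper in the classic gym API (sketch only).
import gym

class FrameSkip(gym.Wrapper):
    """Repeat each action for `skip` simulator steps, summing rewards.

    This shortens the effective horizon and cuts per-step policy
    queries, which helps with slow musculoskeletal simulators."""
    def __init__(self, env, skip: int = 4):
        super().__init__(env)
        self.skip = skip

    def step(self, action):
        total_reward, done, info = 0.0, False, {}
        for _ in range(self.skip):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info
```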
Abstract: With the rapid growth of the express delivery industry, intelligent warehouses that employ autonomous robots to carry parcels have been widely used to handle the vast express volume. For such warehouses, the layout design plays a key role in improving transportation efficiency. However, this work is still done by human experts, which is expensive and leads to suboptimal results. In this paper, we aim to automate the warehouse layout design process. We propose a two-layer evolutionary algorithm to efficiently explore the warehouse layout space, in which an auxiliary-objective fitness approximation model is introduced to predict the outcome of a designed warehouse layout, and a two-layer population structure incorporates the approximation model into the ordinary evolutionary framework. Empirical experiments show that our method can efficiently design effective warehouse layouts that outperform both heuristic-designed and vanilla evolution-designed layouts.
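A minimal sketch of surrogate-assisted evolution in the spirit of the two-layer scheme follows: a cheap fitness-approximation model pre-screens a large candidate pool so that the expensive warehouse simulation is spent only on promising layouts. The function names and selection scheme are assumptions, not the paper's exact algorithm.

```python
# Surrogate-assisted evolutionary loop (sketch; details are assumed).
import random

def evolve(init_pop, mutate, simulate, surrogate_fit, surrogate_predict,
           generations=50, pool_mult=10):
    pop = [(x, simulate(x)) for x in init_pop]       # expensive evaluations
    for _ in range(generations):
        surrogate_fit(pop)                           # refit on ground truth
        # Lower layer: a large mutated pool, ranked by the cheap model.
        pool = [mutate(random.choice(pop)[0])
                for _ in range(pool_mult * len(pop))]
        pool.sort(key=surrogate_predict, reverse=True)
        # Upper layer: spend simulations only on the model's top picks.
        evaluated = [(x, simulate(x)) for x in pool[:len(init_pop)]]
        pop = sorted(pop + evaluated, key=lambda p: p[1],
                     reverse=True)[:len(init_pop)]
    return max(pop, key=lambda p: p[1])
```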
Abstract: In typical reinforcement learning (RL), the environment is assumed given, and the goal of learning is to identify an optimal policy for the agent taking actions through its interactions with that environment. In this paper, we extend this setting by considering an environment that is not given but is instead controllable and learnable through its interaction with the agent. This extension is motivated by real-world environment design scenarios, including game design, shopping space design, and traffic signal design. Theoretically, we identify a Markov decision process (MDP) for the environment that is dual to the agent's MDP, and derive a policy gradient solution for optimizing the parametrized environment. Furthermore, we propose a general generative framework to address discontinuous environments. Our experiments on a Maze game design task show the effectiveness of the proposed algorithms in generating diverse and challenging Mazes against various agent settings.
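The core idea of taking policy gradients with respect to environment parameters can be sketched as a toy REINFORCE loop, shown below. The categorical configuration space and the stand-in design objective are illustrative assumptions, not the paper's actual dual-MDP formulation.

```python
# Toy REINFORCE over environment parameters theta (sketch only).
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(8)  # logits over 8 hypothetical maze configurations

def design_objective(idx: int) -> float:
    # Stand-in for rolling out agents in the sampled environment and
    # scoring how challenging/diverse it is; here config 3 is "best".
    return 1.0 if idx == 3 else 0.0

lr, baseline = 0.1, 0.0
for _ in range(1000):
    probs = np.exp(theta) / np.exp(theta).sum()
    idx = rng.choice(len(theta), p=probs)     # sample an environment
    r = design_objective(idx)
    baseline += 0.05 * (r - baseline)         # running baseline
    grad_logp = -probs
    grad_logp[idx] += 1.0                     # d log pi(idx) / d theta
    theta += lr * (r - baseline) * grad_logp  # gradient ascent step
```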