Michael Pokorny
Abstract:Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
Abstract:The combination of Large Language Models (LLMs), systematic evaluation, and evolutionary algorithms has enabled breakthroughs in combinatorial optimization and scientific discovery. We propose to extend this powerful combination to the control of dynamical systems, generating interpretable control policies capable of complex behaviors. With our novel method, we represent control policies as programs in standard languages like Python. We evaluate candidate controllers in simulation and evolve them using a pre-trained LLM. Unlike conventional learning-based control techniques, which rely on black box neural networks to encode control policies, our approach enhances transparency and interpretability. We still take advantage of the power of large AI models, but leverage it at the policy design phase, ensuring that all system components remain interpretable and easily verifiable at runtime. Additionally, the use of standard programming languages makes it straightforward for humans to finetune or adapt the controllers based on their expertise and intuition. We illustrate our method through its application to the synthesis of an interpretable control policy for the pendulum swing-up and the ball in cup tasks. We make the code available at https://github.com/muellerlab/synthesizing_interpretable_control_policies.git
Abstract:Caves and lava tubes on the Moon and Mars are sites of geological and astrobiological interest but consist of terrain that is inaccessible with traditional robot locomotion. To support the exploration of these sites, we present ReachBot, a robot that uses extendable booms as appendages to manipulate itself with respect to irregular rock surfaces. The booms terminate in grippers equipped with microspines and provide ReachBot with a large workspace, allowing it to achieve force closure in enclosed spaces such as the walls of a lava tube. To propel ReachBot, we present a contact-before-motion planner for non-gaited legged locomotion that utilizes internal force control, similar to a multi-fingered hand, to keep its long, slender booms in tension. Motion planning also depends on finding and executing secure grips on rock features. We use a Monte Carlo simulation to inform gripper design and predict grasp strength and variability. Additionally, we use a two-step perception system to identify possible grasp locations. To validate our approach and mechanisms under realistic conditions, we deployed a single ReachBot arm and gripper in a lava tube in the Mojave Desert. The field test confirmed that ReachBot will find many targets for secure grasps with the proposed kinematic design.
Abstract:We present a novel approach to cooperative aerial transportation through a team of drones, using optimal control theory and a hierarchical control strategy. We assume the drones are connected to the payload through rigid attachments, essentially transforming the whole system into a larger flying object with "thrust modules" at the attachment locations of the drones. We investigate the optimal arrangement of the thrust modules around the payload, so that the resulting system is robust to disturbances. We choose the $\mathcal{H}_2$ norm as a measure of robustness, and propose an iterative optimization routine to compute the optimal layout of the vehicles around the object. We experimentally validate our approach using four drones and comparing the disturbance rejection performances achieved by two different layouts (the optimal one and a sub-optimal one), and observe that the results match our predictions.