Abstract: Real-world decision-making often requires integrating and reasoning over information from multiple modalities. While recent multimodal large language models (MLLMs) have shown promise in such tasks, their ability to perform multi-hop reasoning across diverse sources remains insufficiently evaluated. Existing benchmarks, such as MMQA, suffer from (1) data contamination and (2) a lack of complex queries that require operations across more than two modalities, both of which hinder accurate performance assessment. To address this, we present Financial Cross-Modal Multi-Hop Reasoning (FCMR), a benchmark that analyzes the reasoning capabilities of MLLMs by requiring them to combine information from textual reports, tables, and charts within the financial domain. FCMR is divided into three difficulty levels (Easy, Medium, and Hard), facilitating a step-by-step evaluation. In particular, problems at the Hard level require precise cross-modal three-hop reasoning and are designed so that no modality can be disregarded. Experiments on this new benchmark reveal that even state-of-the-art MLLMs struggle, with the best-performing model (Claude 3.5 Sonnet) achieving only 30.4% accuracy on the most challenging tier. We also analyze the inner workings of the models, which reveals a critical bottleneck in the information retrieval phase.
Abstract: While the introduction of contrastive learning frameworks has significantly advanced sentence representation learning, it remains unclear whether state-of-the-art sentence embeddings can capture the fine-grained semantics of sentences, particularly when conditioned on specific perspectives. In this paper, we introduce Hyper-CL, an efficient methodology that integrates hypernetworks with contrastive learning to compute conditioned sentence representations. In our proposed approach, the hypernetwork transforms pre-computed condition embeddings into corresponding projection layers. This enables the same sentence embeddings to be projected differently according to various conditions. Evaluation on two representative conditioning benchmarks, namely conditional semantic text similarity and knowledge graph completion, demonstrates that Hyper-CL conditions sentence representations flexibly while remaining computationally efficient. We also provide a comprehensive analysis of the inner workings of our approach, leading to a better interpretation of its mechanisms.
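The projection step described in this abstract can be illustrated with a minimal sketch. It is not the Hyper-CL architecture: the hypernetwork here is assumed to be a single untrained linear map with random weights, the contrastive training objective is omitted, and all function names and dimensions (`make_hypernetwork`, `projection_for`, `project`) are hypothetical. It only shows how one fixed sentence embedding can be projected differently once a condition embedding has been turned into a projection matrix.

```python
import math
import random


def make_hypernetwork(cond_dim, sent_dim, out_dim, seed=0):
    """Random weights of a linear hypernetwork (assumption: untrained, illustrative).

    Each row produces one entry of the flattened out_dim x sent_dim projection.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.1) for _ in range(cond_dim)]
            for _ in range(out_dim * sent_dim)]


def projection_for(hyper_w, cond_emb, sent_dim, out_dim):
    """Turn a condition embedding into an out_dim x sent_dim projection matrix."""
    flat = [sum(w * c for w, c in zip(row, cond_emb)) for row in hyper_w]
    return [flat[i * sent_dim:(i + 1) * sent_dim] for i in range(out_dim)]


def project(proj, sent_emb):
    """Apply a condition-specific projection to a sentence embedding."""
    return [sum(p * s for p, s in zip(row, sent_emb)) for row in proj]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# The sentence embeddings are computed once and reused for every condition.
hyper_w = make_hypernetwork(cond_dim=4, sent_dim=6, out_dim=3)
sent_1 = [0.2, -0.1, 0.5, 0.3, -0.4, 0.1]
sent_2 = [0.1, 0.4, -0.2, 0.3, 0.0, -0.5]

# Each condition yields its own projection matrix, which could be cached.
proj_a = projection_for(hyper_w, [1.0, 0.0, 0.0, 0.0], sent_dim=6, out_dim=3)
proj_b = projection_for(hyper_w, [0.0, 1.0, 0.0, 0.0], sent_dim=6, out_dim=3)

sim_under_a = cosine(project(proj_a, sent_1), project(proj_a, sent_2))
sim_under_b = cosine(project(proj_b, sent_1), project(proj_b, sent_2))
```

Because the projection matrix depends only on the condition, it can be computed once per condition and applied to any number of cached sentence embeddings, which is one plausible reading of the efficiency claim above.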
Abstract: This study presents a new hardware design and controller for a minimally actuated quadrotor-based tiltrotor with 5 control degrees of freedom (CDoF). The proposed tiltrotor has several characteristics that distinguish it from existing works: 1) a minimal number of actuators for 5 CDoF, 2) a large margin for generating interaction force during aerial physical interaction (APhI), and 3) no mechanical obstruction in thrust direction rotation. Thanks to these properties, the proposed tiltrotor is suitable for perching-enabled APhI since it can hover parallel to an arbitrarily oriented surface and can freely adjust its thrust direction. To fully control the 5 CDoF of the designed tiltrotor, we construct an asymptotically stabilizing controller and provide its stability analysis. The proposed tiltrotor design and controller are validated in three experiments. The first two, $x,y$ position tracking and pitch tracking, show controllability of the added CDoF compared to a conventional quadrotor. The last experiment, perching and cart pushing, demonstrates the proposed tiltrotor's applicability to perching-enabled APhI.
Abstract: This study aims to design a motion/force controller for an aerial manipulator that guarantees tracking of time-varying motion/force trajectories as well as stability during the transition between free and contact motions. To this end, we model the force exerted on the end-effector with the Kelvin-Voigt linear model and estimate its parameters with a recursive least-squares estimator. Then, the gains of the disturbance-observer (DOB)-based motion/force controller are calculated from stability conditions that account for both the model uncertainties in the dynamic equation and the switching between free and contact motions. To validate the proposed controller, we conducted time-varying motion/force tracking experiments with different approach speeds and surface orientations. The results show that our controller enables the aerial manipulator to track the time-varying motion/force trajectories.
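The parameter estimation step mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' estimator: it assumes a scalar Kelvin-Voigt contact model $f = k x + b \dot{x}$ with stiffness $k$ and damping $b$, noise-free synthetic data, and a standard recursive least-squares update with an optional forgetting factor. The names `rls_kelvin_voigt` and `synthetic_contact`, the initial covariance `p0`, and the numeric values are all hypothetical.

```python
import math


def rls_kelvin_voigt(samples, forgetting=1.0, p0=1e6):
    """Recursively estimate (k, b) in f = k*x + b*x_dot.

    samples: iterable of (x, x_dot, f) tuples.
    forgetting: forgetting factor in (0, 1]; 1.0 weights all samples equally.
    p0: initial covariance scale (assumption: large value, weak prior).
    """
    k_hat, b_hat = 0.0, 0.0
    # Symmetric 2x2 covariance matrix stored as three scalars.
    p11, p12, p22 = p0, 0.0, p0
    for x, x_dot, f in samples:
        # g = P * phi, with regressor phi = [x, x_dot].
        g1 = p11 * x + p12 * x_dot
        g2 = p12 * x + p22 * x_dot
        denom = forgetting + x * g1 + x_dot * g2
        gain1, gain2 = g1 / denom, g2 / denom
        # Prediction error with the current estimate.
        err = f - (k_hat * x + b_hat * x_dot)
        k_hat += gain1 * err
        b_hat += gain2 * err
        # Covariance update: P <- (P - gain * phi^T * P) / forgetting.
        p11 = (p11 - gain1 * g1) / forgetting
        p12 = (p12 - gain1 * g2) / forgetting
        p22 = (p22 - gain2 * g2) / forgetting
    return k_hat, b_hat


def synthetic_contact(k_true, b_true, n=200, dt=0.05, amp=0.01):
    """Noise-free samples from a sinusoidal penetration (illustrative only)."""
    samples = []
    for i in range(n):
        t = i * dt
        x = amp * math.sin(t)
        x_dot = amp * math.cos(t)
        samples.append((x, x_dot, k_true * x + b_true * x_dot))
    return samples


k_est, b_est = rls_kelvin_voigt(synthetic_contact(k_true=500.0, b_true=20.0))
```

A forgetting factor below 1.0 would let the estimate follow slowly changing surface properties, at the cost of more sensitivity to measurement noise.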
Abstract: Rapidly generating an optimal chasing motion for a drone that follows a dynamic target among obstacles is challenging due to numerical issues arising from multiple conflicting objectives and non-convex constraints. This study proposes to resolve these difficulties with a fast and reliable pipeline that incorporates 1) a target movement forecaster and 2) a chasing planner. Both are based on a sample-and-check approach that consists of generating high-quality candidate primitives and running feasibility tests with a light computational load. We forecast the movement of the target by selecting an optimal prediction from a set of candidates built from past observations. Based on the prediction, we construct a set of prospective chasing trajectories that reduce the high-order derivatives while maintaining the desired relative distance from the predicted target movement. Then, the candidate trajectories are tested for safety of the chaser and visibility of the target without loose approximation of the constraints. The proposed algorithm is thoroughly evaluated in challenging scenarios involving dynamic obstacles. In addition, the overall process, from target recognition to chasing motion planning, is implemented fully onboard a drone, demonstrating real-world applicability.
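The sample-and-check idea described in this abstract can be sketched in a heavily simplified form. The sketch below is not the proposed planner: it assumes a static 2D scene with circular obstacles, straight-line candidate paths instead of smooth trajectory primitives, and path length as a stand-in cost for the high-order derivative objective. All names (`seg_point_dist`, `is_feasible`, `plan_chase`) and parameters are hypothetical.

```python
import math


def seg_point_dist(a, b, p):
    """Distance from point p to the 2D segment from a to b."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    len_sq = dx * dx + dy * dy
    if len_sq == 0.0:
        t = 0.0
    else:
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def is_feasible(start, goal, target, obstacles):
    """Check safety (chaser path clear) and visibility (line of sight clear)."""
    for center, radius in obstacles:
        if seg_point_dist(start, goal, center) <= radius:
            return False  # the chaser's path would hit the obstacle
        if seg_point_dist(goal, target, center) <= radius:
            return False  # the obstacle would occlude the target
    return True


def plan_chase(start, target, obstacles, desired_dist=2.0, n_candidates=16):
    """Sample goals at the desired distance from the target, keep feasible
    ones, and return the cheapest (None if every candidate fails)."""
    best, best_cost = None, math.inf
    for i in range(n_candidates):
        ang = 2.0 * math.pi * i / n_candidates
        goal = (target[0] + desired_dist * math.cos(ang),
                target[1] + desired_dist * math.sin(ang))
        if not is_feasible(start, goal, target, obstacles):
            continue
        cost = math.hypot(goal[0] - start[0], goal[1] - start[1])
        if cost < best_cost:
            best, best_cost = goal, cost
    return best


# One obstacle sits directly between the chaser and the target.
scene = [((3.0, 0.0), 0.5)]
goal = plan_chase(start=(0.0, 0.0), target=(5.0, 0.0), obstacles=scene)
```

Each feasibility test here is an exact geometric check on the sampled candidate rather than a relaxed constraint inside an optimizer, which mirrors the motivation stated above for checking candidates instead of solving a non-convex problem directly.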