Abstract:Survival prediction is a critical task in pathology. In clinical practice, pathologists often examine multiple cases, leveraging a broader spectrum of cancer phenotypes to enhance pathological assessment. Despite significant advancements in deep learning, current solutions typically model each slide as a sample, struggling to effectively capture comparable and slide-agnostic pathological features. In this paper, we introduce GroupMIL, a novel framework inspired by the clinical practice of collective analysis, which models multiple slides as a single sample and organizes groups of patches and slides sequentially to capture cross-slide prognostic features. We also present GPAMamba, a model designed to facilitate intra- and inter-slide feature interactions, effectively capturing local micro-environmental characteristics within slide-level graphs while uncovering essential prognostic patterns across an extended patch sequence within the group framework. Furthermore, we develop a dual-head predictor that delivers comprehensive survival risk and probability assessments for each patient. Extensive empirical evaluations demonstrate that our model significantly outperforms state-of-the-art approaches across five datasets from The Cancer Genome Atlas.
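To make the group-as-sample idea above concrete, here is a minimal sketch (assumed shapes and names, not the authors' code) of how patch features from several slides of one patient group could be assembled into a single long sequence, keeping slide ids so that intra- and inter-slide interactions can be modelled downstream:

```python
# Toy sketch of the "group as one sample" idea (assumed data layout).
import numpy as np

def build_group_sample(slide_patch_feats):
    """slide_patch_feats: list of (N_i, C) arrays, one per slide in the group."""
    feats = np.concatenate(slide_patch_feats, axis=0)            # (sum N_i, C) sequence
    slide_ids = np.concatenate([np.full(len(f), i)
                                for i, f in enumerate(slide_patch_feats)])  # (sum N_i,)
    return feats, slide_ids
```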
Abstract:Construction remains one of the most hazardous sectors. Recent advancements in AI, particularly Large Language Models (LLMs), offer promising opportunities for enhancing workplace safety. However, responsible integration of LLMs requires systematic evaluation, as deploying them without understanding their capabilities and limitations risks generating inaccurate information, fostering misplaced confidence, and compromising worker safety. This study evaluates the performance of two widely used LLMs, GPT-3.5 and GPT-4o, across three standardized exams administered by the Board of Certified Safety Professionals (BCSP). Using 385 questions spanning seven safety knowledge areas, the study analyzes the models' accuracy, consistency, and reliability. Results show that both models consistently exceed the BCSP benchmark, with GPT-4o achieving an accuracy rate of 84.6% and GPT-3.5 reaching 73.8%. Both models demonstrate strengths in safety management systems and hazard identification and control, but exhibit weaknesses in science, mathematics, emergency response, and fire prevention. An error analysis identifies four primary limitations affecting LLM performance: lack of knowledge, reasoning flaws, memory issues, and calculation errors. Our study also highlights the impact of prompt engineering strategies, with variations in accuracy reaching 13.5% for GPT-3.5 and 7.9% for GPT-4o. However, no single prompt configuration proves universally effective. This research advances knowledge in three ways: by identifying areas where LLMs can support safety practices and where human oversight remains essential, by offering practical insights into improving LLM implementation through prompt engineering, and by providing evidence-based direction for future research and development. These contributions support the responsible integration of AI in construction safety management toward achieving zero injuries.
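As a rough illustration of the kind of evaluation loop behind these accuracy numbers (the question format, grading rule, and `ask_model` helper are assumptions, not the study's actual tooling):

```python
# Illustrative sketch: score several prompt configurations on a question set.
def ask_model(prompt: str, question: str) -> str:
    raise NotImplementedError("plug in the LLM API call of your choice here")

def accuracy_by_prompt(questions, prompts):
    """questions: list of dicts with 'text' and 'answer'; prompts: dict name -> template."""
    results = {}
    for name, template in prompts.items():
        correct = sum(ask_model(template, q["text"]).strip() == q["answer"]
                      for q in questions)
        results[name] = correct / len(questions)   # accuracy per prompt configuration
    return results
```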
Abstract:Explanation generation plays a more pivotal role than fact verification in producing interpretable results and facilitating comprehensive fact-checking, and it has recently garnered considerable attention. However, previous studies on explanation generation have shown several limitations, such as being confined to English scenarios, involving overly complex inference processes, and not fully unleashing the potential of the mutual feedback between veracity labels and explanation texts. To address these issues, we construct two complex fact-checking datasets in Chinese scenarios: CHEF-EG and TrendFact. These datasets involve complex facts in areas such as health, politics, and society, presenting significant challenges for fact verification methods. In response to these challenges, we propose a unified framework called FactISR (Augmenting Fact-Checking via Iterative Self-Revision) to perform mutual feedback between veracity and explanations by leveraging the capabilities of large language models (LLMs). FactISR uses a single model to address tasks such as fact verification and explanation generation. Its self-revision mechanism can further refine the consistency between veracity labels, explanation texts, and evidence, as well as eliminate irrelevant noise. We conducted extensive experiments with baselines and FactISR on the proposed datasets. The experimental results demonstrate the effectiveness of our method.
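A schematic sketch of an iterative self-revision loop in the spirit described above is shown below; `llm` is a placeholder for any chat-completion call, and the prompts and stopping rule are illustrative rather than those used by FactISR:

```python
# Schematic self-revision loop (illustrative only).
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def fact_check(claim: str, evidence: str, max_rounds: int = 3):
    verdict = llm(f"Claim: {claim}\nEvidence: {evidence}\n"
                  "Give a veracity label (SUPPORTED/REFUTED/NOT ENOUGH INFO) "
                  "and an explanation.")
    for _ in range(max_rounds):
        review = llm("Check whether the label, explanation, and evidence below are "
                     "mutually consistent and free of irrelevant content. Revise them "
                     f"if needed, otherwise reply CONSISTENT.\n{verdict}\n"
                     f"Evidence: {evidence}")
        if review.strip().startswith("CONSISTENT"):
            break
        verdict = review          # keep the revised label + explanation
    return verdict
```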
Abstract:Our goal is to generate realistic human motion from natural language. Modern methods often face a trade-off between model expressiveness and text-to-motion alignment. Some align text and motion latent spaces but sacrifice expressiveness; others rely on diffusion models that produce impressive motions but lack semantic meaning in their latent space. This may compromise realism, diversity, and applicability. Here, we address this with LEAD, which combines latent diffusion with a realignment mechanism, producing a novel, semantically structured space that encodes the semantics of language. Leveraging this capability, we introduce the task of textual motion inversion to capture novel motion concepts from a few examples. For motion synthesis, we evaluate LEAD on HumanML3D and KIT-ML and show comparable performance to the state of the art in terms of realism, diversity, and text-motion consistency. Our qualitative analysis and user study reveal that our synthesized motions are sharper, more human-like, and comply better with the text compared to modern methods. For motion textual inversion, our method demonstrates improved capacity in capturing out-of-distribution characteristics in comparison to traditional VAEs.
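For intuition, the following toy, self-contained sketch shows the general textual-inversion recipe adapted to motion latents; the frozen linear "denoiser" and "text encoder" are stand-ins, not LEAD's implementation, and only the new concept embedding is optimized on a few examples:

```python
# Toy textual-inversion loop (assumed dimensions and frozen stand-in modules).
import torch
import torch.nn as nn

d = 64
denoiser = nn.Linear(d + d + 1, d)       # stand-in: (latent, cond, t) -> noise estimate
text_encoder = nn.Linear(d, d)           # stand-in for the frozen text encoder
for p in list(denoiser.parameters()) + list(text_encoder.parameters()):
    p.requires_grad_(False)              # the pipeline stays frozen

concept = torch.randn(d, requires_grad=True)          # new "<motion>" token embedding
opt = torch.optim.Adam([concept], lr=1e-2)
example_latents = [torch.randn(d) for _ in range(4)]  # a few example motions (toy)

for z0 in example_latents * 50:          # optimise the embedding only
    t = torch.rand(1)
    noise = torch.randn(d)
    z_t = (1 - t) * z0 + t * noise       # toy forward "diffusion" step
    cond = text_encoder(concept)
    pred = denoiser(torch.cat([z_t, cond, t]))
    loss = nn.functional.mse_loss(pred, noise)
    opt.zero_grad()
    loss.backward()
    opt.step()
```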
Abstract:Scene flow estimation aims to generate the 3D motion field of points between two consecutive frames of point clouds, which has wide applications in various fields. Existing point-based methods ignore the irregularity of point clouds and have difficulty capturing long-range dependencies due to the inefficiency of point-level computation, while voxel-based methods suffer from the loss of detailed information. In this paper, we propose a point-voxel fusion method, where we utilize a voxel branch based on sparse grid attention and a shifted-window strategy to capture long-range dependencies, and a point branch to capture fine-grained features that compensate for the information loss in the voxel branch. In addition, since xyz coordinates alone are insufficient to describe the geometric structure of complex 3D objects in the scene, we explicitly encode the local surface information of the point cloud through an umbrella surface feature extraction (USFE) module. We verify the effectiveness of our method through experiments on the FlyingThings3D and KITTI datasets. Our method outperforms all other self-supervised methods and achieves highly competitive results compared to fully supervised methods. We achieve improvements in all metrics, especially EPE, which is reduced by 8.51% and 10.52% on the KITTIo and KITTIs datasets, respectively.
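A minimal sketch of the point-voxel fusion idea follows (illustrative only; the voxel size, mean pooling, and fusion by concatenation are assumptions rather than the paper's exact design): voxel features supply coarse long-range context and are gathered back to the points, where they are fused with the fine-grained point features.

```python
# Minimal point-voxel feature fusion sketch.
import numpy as np

def fuse_point_voxel(xyz, point_feat, voxel_size=0.5):
    """xyz: (N, 3) coordinates; point_feat: (N, C) per-point features."""
    vox_idx = np.floor(xyz / voxel_size).astype(np.int64)         # integer voxel coords
    _, inverse = np.unique(vox_idx, axis=0, return_inverse=True)  # point -> voxel id
    n_vox = inverse.max() + 1
    # Mean-pool point features into their voxels.
    voxel_feat = np.zeros((n_vox, point_feat.shape[1]))
    np.add.at(voxel_feat, inverse, point_feat)
    counts = np.bincount(inverse, minlength=n_vox)[:, None]
    voxel_feat /= counts
    # Gather voxel context back to every point and fuse by concatenation.
    return np.concatenate([point_feat, voxel_feat[inverse]], axis=1)  # (N, 2C)
```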
Abstract:In this paper, we propose Knowledge Base augmented Language Model (KBLaM), a new method for augmenting Large Language Models (LLMs) with external knowledge. KBLaM works with a knowledge base (KB) constructed from a corpus of documents, transforming each piece of knowledge in the KB into continuous key-value vector pairs via pre-trained sentence encoders with linear adapters and integrating them into pre-trained LLMs via a specialized rectangular attention mechanism. Unlike Retrieval-Augmented Generation, KBLaM eliminates external retrieval modules, and unlike in-context learning, its computational overhead scales linearly with KB size rather than quadratically. Our approach enables integrating a large KB of more than 10K triples into an 8B pre-trained LLM with only an 8K context window on a single A100 80GB GPU and allows for dynamic updates without model fine-tuning or retraining. Experiments demonstrate KBLaM's effectiveness in various tasks, including question-answering and open-ended reasoning, while providing interpretable insights into its use of the augmented knowledge.
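The rectangular attention idea can be illustrated with a small sketch (shapes and names are assumptions, not the authors' implementation): prompt tokens attend both over themselves and over N fixed KB key-value vectors, so the extra attention block is T x N and its cost grows linearly with the KB size.

```python
# Rectangular attention sketch: T prompt tokens, N KB key/value pairs.
import torch
import torch.nn.functional as F

T, N, d = 16, 10_000, 64                              # prompt length, KB size, head dim (assumed)
q = torch.randn(T, d)                                 # queries from the LLM's prompt tokens
k_tok, v_tok = torch.randn(T, d), torch.randn(T, d)   # ordinary token keys/values
k_kb, v_kb = torch.randn(N, d), torch.randn(N, d)     # KB keys/values from a sentence
                                                      # encoder + linear adapter (assumed)

scores_tok = q @ k_tok.T / d**0.5                     # T x T block (causal mask omitted)
scores_kb = q @ k_kb.T / d**0.5                       # T x N rectangular block, linear in N
attn = F.softmax(torch.cat([scores_tok, scores_kb], dim=-1), dim=-1)
out = attn @ torch.cat([v_tok, v_kb], dim=0)          # T x d knowledge-augmented output
```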
Abstract:Dielectric elastomer actuators (DEAs), also known as artificial muscles, have been widely developed for soft locomotion robots. With their compliant skeletons and miniaturized dimensions, they are well suited for narrow-space inspection. In this work, we propose a novel low-profile (1.1mm) and lightweight (1.8g) bi-stable in-plane DEA (Bi-DEA) constructed by supporting a dielectric elastomer on a flat bi-stable mechanism. It has an amplified displacement and output force compared with an in-plane DEA (I-DEA) without the bi-stable mechanism. The Bi-DEA is then applied to a thin soft robot, using three electrostatic adhesive pads (EA-Pads) as anchoring elements. This robot is capable of crawling and climbing to access millimetre-scale narrow gaps. A theoretical model of the bi-stable mechanism and the DEA is presented, and the enhanced performance of the Bi-DEA induced by the mechanism is experimentally validated. The EA-Pads provide adhesion between the actuator and the locomotion substrate, allowing crawling and climbing on various surfaces, e.g., paper and acrylic. The thin soft robot has been demonstrated to crawl through a 4mm narrow gap at a speed of up to 3.3mm/s (0.07 body lengths per second and 2.78 body thicknesses per second).
Abstract:Soft parallel robots, with their manipulation safety and low commercial cost, show a promising future for delicate operations and safe human-robot interactions. However, promoting the use of electroactive polymers (EAPs) remains challenging due to the immature quality of the products and the difficulty of dynamically modelling the collaboration between multiple actuators. This article presents the design, fabrication, modelling, and control of a parallel-kinematics Delta robot actuated by dielectric elastomer actuators (DEAs). The trade-off between actuation force and stroke is mitigated by an angular stroke amplification mechanism, and the weight of the robot frame is reduced by utilizing 3D puzzling strip structures. A generic way of constructing a high-stability conductive paint on a silicone-based film has been achieved by laser scanning the DE film and then sandwiching a conductive particle-based electrode with a paint mixed from the particles and a photosensitive resin. Compared to the widely used carbon grease, the fabricated electrode shows higher consistency in its dynamic behaviour before and after the on-stand test. Finally, to predict the output force and inverse motion of the robot end effector, we constructed an inverse dynamic model by introducing an expanded Bergstrom-Boyce model for the constitutive behavior of the dielectric film. The experimental results show a prediction of robot output force with an RMSE of 12.4% when the end effector remains stationary, and a well-followed trajectory with an RMSE of less than 2.5%.
Abstract:This paper reports on the Workshop on Simulations for Information Access (Sim4IA) held at SIGIR 2024. The workshop had two keynotes, a panel discussion, nine lightning talks, and two breakout sessions. Key takeaways were the importance of user simulation in academia and industry, the possible bridging of online and offline evaluation, and the issues involved in organizing a companion shared task around user simulations for information access. We report on how we organized the workshop, provide a brief overview of what happened at the workshop, and summarize its main topics and findings as well as future work.
Abstract:Articulated object manipulation requires precise object interaction, where the object's axis must be carefully considered. Previous research has employed interactive perception for manipulating articulated objects, but these approaches are typically open-loop and often overlook the interaction dynamics. To address this limitation, we present a closed-loop pipeline integrating interactive perception with online axis estimation from segmented 3D point clouds. Our method builds on any interactive perception technique, inducing slight object movement to generate point cloud frames of the evolving dynamic scene. These point clouds are then segmented using Segment Anything Model 2 (SAM2), after which the moving part of the object is masked for accurate online estimation of the motion axis, guiding subsequent robotic actions. Our approach significantly enhances the precision and efficiency of manipulation tasks involving articulated objects. Experiments in simulated environments demonstrate that our method outperforms baseline approaches, especially in tasks that demand precise axis-based control. Project Page: https://hytidel.github.io/video-tracking-for-axis-estimation/.
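As a simple illustration of online axis estimation from the segmented moving part (not the paper's pipeline; point correspondences across frames and a revolute joint are assumed), the axis direction can be recovered from two frames via a best-fit rotation:

```python
# Axis estimation from two frames of the segmented moving part (sketch).
import numpy as np

def estimate_axis(p0: np.ndarray, p1: np.ndarray):
    """p0, p1: (N, 3) corresponding points of the moving part in two frames."""
    c0, c1 = p0.mean(axis=0), p1.mean(axis=0)
    H = (p0 - c0).T @ (p1 - c1)            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)            # Kabsch: best-fit rotation
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T                     # rotation taking frame 0 to frame 1
    w, v = np.linalg.eig(R)                # rotation axis = eigenvector with eigenvalue 1
    axis = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return axis / np.linalg.norm(axis)
```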