Mackenzie W. Mathis

LLaVAction: evaluating and training multi-modal large language models for action recognition

Mar 24, 2025

Shaokai Ye, Haozhe Qi, Alexander Mathis, Mackenzie W. Mathis

Abstract: Understanding human behavior requires measuring behavioral actions. Due to its complexity, behavior is best mapped onto a rich, semantic structure such as language. The recent development of multi-modal large language models (MLLMs) is a promising candidate for a wide range of action understanding tasks. In this work, we focus on evaluating and then improving MLLMs to perform action recognition. We reformulate EPIC-KITCHENS-100, one of the largest and most challenging egocentric action datasets, into the form of video multiple question answering (EPIC-KITCHENS-100-MQA). We show that when we sample difficult incorrect answers as distractors, leading MLLMs struggle to recognize the correct actions. We propose a series of methods that greatly improve the MLLMs' ability to perform action recognition, achieving state-of-the-art results on the EPIC-KITCHENS-100 validation set and outperforming GPT-4o by 21 points in accuracy on EPIC-KITCHENS-100-MQA. Lastly, we show improvements on other action-related video benchmarks such as EgoSchema, PerceptionTest, LongVideoBench, VideoMME and MVBench, suggesting that MLLMs are a promising path forward for complex action tasks. Code and models are available at: https://github.com/AdaptiveMotorControlLab/LLaVAction.

* https://github.com/AdaptiveMotorControlLab/LLaVAction
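
The abstract's idea of building multiple-choice questions with difficult distractors can be sketched as follows. This is only an illustration, not the authors' procedure: it assumes a hypothetical `scores` dictionary of confidences from some reference model, and it approximates "difficult" by taking the highest-scoring wrong labels rather than sampling them. The function name, labels, and numbers are all made up for the example.

```python
import random


def build_mqa_item(true_action, scores, num_distractors=4, seed=0):
    """Build one multiple-choice item with hard distractors.

    scores: hypothetical dict mapping candidate action label -> confidence
    from some reference model (an assumption, not taken from the paper).
    """
    # Rank the incorrect labels by score, highest first: the wrong answers
    # the reference model finds most plausible make the hardest distractors.
    wrong = sorted(
        (label for label in scores if label != true_action),
        key=lambda label: scores[label],
        reverse=True,
    )
    options = [true_action] + wrong[:num_distractors]
    # Shuffle with a fixed seed so the answer position is reproducible.
    random.Random(seed).shuffle(options)
    return {"options": options, "answer_index": options.index(true_action)}


# Illustrative (invented) confidences for one video clip.
scores = {
    "cut onion": 0.41,
    "peel onion": 0.33,
    "cut carrot": 0.15,
    "wash onion": 0.08,
    "open fridge": 0.03,
}
item = build_mqa_item("cut onion", scores, num_distractors=2)
```

Choosing distractors that share a verb or a noun with the correct action (as "peel onion" and "cut carrot" do here) is what makes such an item harder than one padded with unrelated actions.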

A Contrastive Teacher-Student Framework for Novelty Detection under Style Shifts

Jan 28, 2025

Hossein Mirzaei, Mojtaba Nafez, Moein Madadi, Arad Maleki, Mahdi Hajialilue, Zeinab Sadat Taghavi, Sepehr Rezaee, Ali Ansari, Bahar Dibaei Nia, Kian Shamsaie (+5 more)

Abstract: There have been several efforts to improve Novelty Detection (ND) performance. However, ND methods often suffer significant performance drops under minor distribution shifts caused by changes in the environment, known as style shifts. This challenge arises from the ND setup, where the absence of out-of-distribution (OOD) samples during training causes the detector to be biased toward the dominant style features in the in-distribution (ID) data. As a result, the model mistakenly learns to correlate style with core features, using this shortcut for detection. Robust ND is crucial for real-world applications like autonomous driving and medical imaging, where test samples may have different styles than the training data. Motivated by this, we propose a robust ND method that crafts an auxiliary OOD set with style features similar to the ID set but with different core features. Then, a task-based knowledge distillation strategy is utilized to distinguish core features from style features and help our model rely on core features for discriminating crafted OOD and ID sets. We verified the effectiveness of our method through extensive experimental evaluations on several datasets, including synthetic and real-world benchmarks, against nine different ND methods.

* The code repository is available at: https://github.com/rohban-lab/CTS

Adversarially Robust Out-of-Distribution Detection Using Lyapunov-Stabilized Embeddings

Oct 14, 2024

Hossein Mirzaei, Mackenzie W. Mathis

Abstract: Despite significant advancements in out-of-distribution (OOD) detection, existing methods still struggle to maintain robustness against adversarial attacks, compromising their reliability in critical real-world applications. Previous studies have attempted to address this challenge by exposing detectors to auxiliary OOD datasets alongside adversarial training. However, the increased data complexity inherent in adversarial training, and the myriad of ways that OOD samples can arise during testing, often prevent these approaches from establishing robust decision boundaries. To address these limitations, we propose AROS, a novel approach leveraging neural ordinary differential equations (NODEs) with the Lyapunov stability theorem in order to obtain robust embeddings for OOD detection. By incorporating a tailored loss function, we apply Lyapunov stability theory to ensure that both in-distribution (ID) and OOD data converge to stable equilibrium points within the dynamical system. This approach encourages any perturbed input to return to its stable equilibrium, thereby enhancing the model's robustness against adversarial perturbations. To avoid using additional data, we generate fake OOD embeddings by sampling from low-likelihood regions of the ID data feature space, approximating the boundaries where OOD data are likely to reside. To further enhance robustness, we propose the use of an orthogonal binary layer following the stable feature space, which maximizes the separation between the equilibrium points of ID and OOD samples. We validate our method through extensive experiments across several benchmarks, demonstrating superior performance, particularly under adversarial attacks. Notably, our approach improves robust detection performance from 37.8% to 80.1% on CIFAR-10 vs. CIFAR-100 and from 29.0% to 67.0% on CIFAR-100 vs. CIFAR-10.

* Code and pre-trained models are available at https://github.com/AdaptiveMotorControlLab/AROS
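
One step in the abstract, generating fake OOD embeddings by sampling from low-likelihood regions of the ID feature space, can be illustrated with a small sketch. The abstract does not say which density model is used, so the version below makes its own assumptions: it fits a diagonal Gaussian to the ID embeddings and keeps only draws that fall in the tail beyond a chosen quantile. All function names and the `quantile` parameter are hypothetical and do not come from the AROS code.

```python
import math
import random


def fit_diag_gaussian(embeddings):
    """Per-dimension mean and standard deviation of the ID embeddings."""
    n, dims = len(embeddings), len(embeddings[0])
    means = [sum(e[d] for e in embeddings) / n for d in range(dims)]
    stds = [
        # Fall back to a tiny value if a dimension has zero variance.
        math.sqrt(sum((e[d] - means[d]) ** 2 for e in embeddings) / n) or 1e-8
        for d in range(dims)
    ]
    return means, stds


def sq_distance(x, means, stds):
    """Squared standardized distance; larger means lower Gaussian likelihood."""
    return sum(((xi - m) / s) ** 2 for xi, m, s in zip(x, means, stds))


def sample_fake_ood(embeddings, num_samples, quantile=0.95, seed=0):
    """Rejection-sample points from the low-likelihood tail of the fitted Gaussian."""
    rng = random.Random(seed)
    means, stds = fit_diag_gaussian(embeddings)
    # Threshold: the distance that only the least likely ID points exceed.
    dists = sorted(sq_distance(e, means, stds) for e in embeddings)
    threshold = dists[min(int(quantile * len(dists)), len(dists) - 1)]
    fake = []
    while len(fake) < num_samples:
        candidate = [rng.gauss(m, s) for m, s in zip(means, stds)]
        # Keep only draws that are less likely than the threshold.
        if sq_distance(candidate, means, stds) > threshold:
            fake.append(candidate)
    return fake, threshold


# Toy 2-D "ID embeddings" standing in for real network features.
data_rng = random.Random(1)
id_embeddings = [[data_rng.gauss(0, 1), data_rng.gauss(5, 2)] for _ in range(200)]
fake, threshold = sample_fake_ood(id_embeddings, num_samples=10)
```

Rejection sampling is used here only because it is easy to read; it becomes slow when the accepted tail is very small, and a real implementation would likely use a class-conditional or otherwise richer density estimate.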

AmadeusGPT: a natural language interface for interactive animal behavioral analysis

Jul 10, 2023

Shaokai Ye, Jessy Lauer, Mu Zhou, Alexander Mathis, Mackenzie W. Mathis

Abstract: The process of quantifying and analyzing animal behavior involves translating the naturally occurring descriptive language of their actions into machine-readable code. Yet, codifying behavior analysis is often challenging without deep understanding of animal behavior and technical machine learning knowledge. To bridge this gap, we introduce AmadeusGPT: a natural language interface that turns natural language descriptions of behaviors into machine-executable code. Large-language models (LLMs) such as GPT3.5 and GPT4 allow for interactive language-based queries that are potentially well suited for interactive behavior analysis. However, the comprehension capability of these LLMs is limited by the context window size, which prevents them from remembering distant conversations. To overcome the context window limitation, we implement a novel dual-memory mechanism to allow communication between short-term and long-term memory using symbols as context pointers for retrieval and saving. Concretely, users directly use language-based definitions of behavior and our augmented GPT develops code based on the core AmadeusGPT API, which contains machine learning, computer vision, spatio-temporal reasoning, and visualization modules. Users can then interactively refine results, and seamlessly add new behavioral modules as needed. We benchmark AmadeusGPT and show we can produce state-of-the-art performance on the MABE 2022 behavior challenge tasks. Note that an end-user would not need to write any code to achieve this. Thus, collectively AmadeusGPT presents a novel way to merge deep biological knowledge, large-language models, and core computer vision modules into a more naturally intelligent system. Code and demos can be found at: https://github.com/AdaptiveMotorControlLab/AmadeusGPT.

* Demo available at https://github.com/AdaptiveMotorControlLab/AmadeusGPT

Seeing biodiversity: perspectives in machine learning for wildlife conservation

Oct 25, 2021

Devis Tuia, Benjamin Kellenberger, Sara Beery, Blair R. Costelloe, Silvia Zuffi, Benjamin Risse, Alexander Mathis, Mackenzie W. Mathis, Frank van Langevelde, Tilo Burghardt (+8 more)

Abstract: Data acquisition in animal ecology is rapidly accelerating due to inexpensive and accessible sensors such as smartphones, drones, satellites, audio recorders and bio-logging devices. These new technologies and the data they generate hold great potential for large-scale environmental monitoring and understanding, but are limited by current data processing approaches which are inefficient in how they ingest, digest, and distill data into relevant information. We argue that machine learning, and especially deep learning approaches, can meet this analytic challenge to enhance our understanding, monitoring capacity, and conservation of wildlife species. Incorporating machine learning into ecological workflows could improve inputs for population and behavior models and eventually lead to integrated hybrid modeling tools, with ecological models acting as constraints for machine learning models and the latter providing data-supported insights. In essence, by combining new machine learning approaches with ecological domain knowledge, animal ecologists can capitalize on the abundance of data generated by modern sensor technologies in order to reliably estimate population abundances, study animal behavior and mitigate human/wildlife conflicts. To succeed, this approach will require close collaboration and cross-disciplinary education between the computer science and animal ecology communities in order to ensure the quality of machine learning approaches and train a new generation of data scientists in ecology and conservation.

AcinoSet: A 3D Pose Estimation Dataset and Baseline Models for Cheetahs in the Wild

Mar 24, 2021

Daniel Joska, Liam Clark, Naoya Muramatsu, Ricardo Jericevich, Fred Nicolls, Alexander Mathis, Mackenzie W. Mathis, Amir Patel

Abstract: Animals are capable of extreme agility, yet understanding their complex dynamics, which have ecological, biomechanical and evolutionary implications, remains challenging. Being able to study this incredible agility will be critical for the development of next-generation autonomous legged robots. In particular, the cheetah (Acinonyx jubatus) is supremely fast and maneuverable, yet quantifying its whole-body 3D kinematic data during locomotion in the wild remains a challenge, even with new deep learning-based methods. In this work we present an extensive dataset of free-running cheetahs in the wild, called AcinoSet, that contains 119,490 frames of multi-view synchronized high-speed video footage, camera calibration files and 7,588 human-annotated frames. We utilize markerless animal pose estimation to provide 2D keypoints. Then, we use three methods that serve as strong baselines for 3D pose estimation tool development: traditional sparse bundle adjustment, an Extended Kalman Filter, and a trajectory optimization-based method we call Full Trajectory Estimation. The resulting 3D trajectories, human-checked 3D ground truth, and an interactive tool to inspect the data are also provided. We believe this dataset will be useful for a diverse range of fields such as ecology, neuroscience, robotics, biomechanics as well as computer vision.

* Code and data can be found at: https://github.com/African-Robotics-Unit/AcinoSet

Measuring and modeling the motor system with machine learning

Mar 22, 2021

Sébastien B. Hausmann, Alessandro Marin Vargas, Alexander Mathis, Mackenzie W. Mathis

Abstract: The utility of machine learning in understanding the motor system is promising a revolution in how to collect, measure, and analyze data. The field of movement science already elegantly incorporates theory and engineering principles to guide experimental work, and in this review we discuss the growing use of machine learning: from pose estimation, kinematic analyses, dimensionality reduction, and closed-loop feedback, to its use in understanding neural correlates and untangling sensorimotor systems. We also give our perspective on new avenues where markerless motion capture combined with biomechanical modeling and neural networks could be a new platform for hypothesis-driven research.

A Primer on Motion Capture with Deep Learning: Principles, Pitfalls and Perspectives

Sep 02, 2020

Alexander Mathis, Steffen Schneider, Jessy Lauer, Mackenzie W. Mathis

Abstract: Extracting behavioral measurements non-invasively from video is stymied by the fact that it is a hard computational problem. Recent advances in deep learning have tremendously advanced predicting posture from videos directly, which quickly impacted neuroscience and biology more broadly. In this primer we review the budding field of motion capture with deep learning. In particular, we will discuss the principles of those novel algorithms, highlight their potential as well as pitfalls for experimentalists, and provide a glimpse into the future.

* Review, 21 pages, 8 figures and 5 boxes

Deep learning tools for the measurement of animal behavior in neuroscience

Oct 18, 2019

Mackenzie W. Mathis, Alexander Mathis

Abstract: Recent advances in computer vision have made accurate, fast and robust measurement of animal behavior a reality. In the past years powerful tools specifically designed to aid the measurement of behavior have come to fruition. Here we discuss how capturing the postures of animals - pose estimation - has been rapidly advancing with new deep learning methods. While challenges still remain, we envision that the fast-paced development of new deep learning tools will rapidly change the landscape of realizable real-world neuroscience.

* 11 pages, 3 figures, review

Pretraining boosts out-of-domain robustness for pose estimation

Sep 24, 2019

Alexander Mathis, Mert Yüksekgönül, Byron Rogers, Matthias Bethge, Mackenzie W. Mathis

Abstract: Deep neural networks are highly effective tools for human and animal pose estimation. However, robustness to out-of-domain data remains a challenge. Here, we probe the transfer and generalization ability for pose estimation with two architecture classes (MobileNetV2s and ResNets) pretrained on ImageNet. We generated a novel dataset of 30 horses that allowed for both within-domain and out-of-domain (unseen horse) testing. We find that pretraining on ImageNet strongly improves out-of-domain performance. Moreover, we show that for both pretrained networks and networks trained from scratch, better ImageNet-performing architectures perform better for pose estimation, with a substantial improvement on out-of-domain data when pretrained. Collectively, our results demonstrate that transfer learning is particularly beneficial for out-of-domain robustness.
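
The distinction between within-domain and out-of-domain (unseen horse) testing comes down to splitting the data by individual rather than by frame. The sketch below shows one way such a split might look; the `(horse_id, frame_name)` layout, the file names, and the choice of 10 held-out horses are assumptions made for the example and are not taken from the paper.

```python
import random


def split_by_individual(frames, num_held_out, seed=0):
    """Hold out whole individuals so out-of-domain frames show unseen horses.

    frames: list of (horse_id, frame_name) pairs (a hypothetical layout).
    """
    ids = sorted({horse_id for horse_id, _ in frames})
    # Pick the held-out individuals reproducibly.
    held_out = set(random.Random(seed).sample(ids, num_held_out))
    within_domain = [f for f in frames if f[0] not in held_out]
    out_of_domain = [f for f in frames if f[0] in held_out]
    return within_domain, out_of_domain


# Toy index: 30 horses with 3 labeled frames each (invented file names).
frames = [
    (horse, f"horse{horse:02d}_frame{i}.png")
    for horse in range(30)
    for i in range(3)
]
within_domain, out_of_domain = split_by_individual(frames, num_held_out=10)
```

The within-domain pool would then be divided again into training and test frames, so that within-domain accuracy is measured on new frames of horses the network has already seen, while out-of-domain accuracy is measured only on the held-out horses.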
