Abstract: Scientific discovery pipelines typically involve complex, rigid, and time-consuming processes, from data preparation to analyzing and interpreting findings. Recent advances in AI have the potential to transform such pipelines so that domain experts can focus on interpreting and understanding findings rather than debugging rigid pipelines or manually annotating data. As part of an active collaboration between data science/AI researchers and behavioral neuroscientists, we showcase an example AI-enhanced pipeline, specifically designed to transform and accelerate how the domain experts on the team gain insights from experimental data. The application is in behavioral neuroscience: studying fear generalization in mice, an important problem whose progress can advance our understanding of clinically significant and often debilitating conditions such as Post-Traumatic Stress Disorder (PTSD). We identify the emerging paradigm of "In-Context Learning" (ICL) as a suitable interface for domain experts to automate parts of their pipeline without needing, or being familiar with, AI model training and fine-tuning, and we showcase its remarkable efficacy in data preparation and pattern interpretation. We also introduce novel AI enhancements to a tensor decomposition model, enabling more seamless pattern discovery from the heterogeneous data in our application. We thoroughly evaluate the proposed pipeline experimentally, showing superior performance both against standard practice in the domain and against reasonable ML baselines that do not fall under the ICL paradigm, ensuring that our quest for a seamless and easy-to-use interface for domain experts does not compromise performance. Finally, we demonstrate effective discovery, with results validated by the domain experts on the team.
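The ICL interface described above can be illustrated with a minimal sketch: labeled examples are packed into a prompt so a domain expert can automate annotation with no model training. Here `query_llm`, the example formatting, and the task wording are hypothetical placeholders, not the paper's actual implementation.

```python
# Minimal ICL annotation sketch (assumptions: a text-completion LLM is
# available behind the `query_llm` callable; examples are (input, label) pairs).

def build_icl_prompt(examples, query, task="Label the following observation"):
    """Format few-shot (input, label) pairs followed by an unlabeled query."""
    lines = [task + ":"]
    for x, y in examples:
        lines.append(f"Input: {x}\nLabel: {y}")
    lines.append(f"Input: {query}\nLabel:")
    return "\n\n".join(lines)

def annotate(unlabeled, examples, query_llm):
    """Annotate each item by few-shot prompting; no fine-tuning involved."""
    return [query_llm(build_icl_prompt(examples, item)).strip()
            for item in unlabeled]
```

The appeal of this interface is that swapping the task only requires swapping the labeled examples, which is exactly the kind of change a domain expert can make without ML expertise.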
Abstract: Integration of diverse data will be a pivotal step toward improving scientific exploration in many disciplines. This work establishes a vision-language model (VLM) that encodes videos together with text input in order to classify the behaviors of a mouse living in and engaging with its environment. Importantly, the model produces a behavioral vector over time for each subject and for each session the subject undergoes. The output is a valuable dataset that few programs can produce with comparable accuracy and with such minimal user input. Specifically, we use the open-source Qwen2.5-VL model and enhance its performance through prompting, in-context learning (ICL) with labeled examples, and frame-level preprocessing. We found that each of these methods contributes to improved classification, and that combining them yields strong F1 scores across all behaviors, including rare classes such as freezing and fleeing, without any model fine-tuning. Overall, this model will support interdisciplinary researchers studying mouse behavior by enabling them to integrate diverse behavioral features, measured across multiple time points and environments, into a comprehensive dataset that can address complex research questions.
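A sketch of the classification setup, following the standard Hugging Face usage pattern for Qwen2.5-VL. The file path, frame rate, behavior label set, and prompt wording are illustrative assumptions, not the paper's exact configuration.

```python
# Zero-shot behavior classification with Qwen2.5-VL (sketch).
# Assumptions: the 7B instruct checkpoint, a local video clip, and a
# five-class label set chosen here purely for illustration.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

BEHAVIORS = ["freezing", "fleeing", "grooming", "rearing", "walking"]

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///data/session_clip.mp4", "fps": 2.0},
        {"type": "text", "text": (
            "Classify the mouse's behavior in this clip as exactly one of: "
            + ", ".join(BEHAVIORS) + ". Answer with a single label.")},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False,
                                     add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=8)
label = processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                               skip_special_tokens=True)[0].strip()
print(label)  # e.g. "freezing"
```

The ICL variant described in the abstract would prepend labeled example clips as additional turns in `messages`; per-segment labels collected over a session then form the behavioral vector over time.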
Abstract: Producing large, complex simulation datasets is often a time- and resource-consuming task. When such experiments are very expensive, generating synthetic data for downstream tasks becomes increasingly reasonable. Recent approaches use generative machine learning models such as Generative Adversarial Networks (GANs) and diffusion models. As these generative models improve the efficiency of producing useful data, we introduce an internal tensor decomposition into them to reduce costs even further. Specifically, for multidimensional data, or tensors, we generate the smaller tensor factors instead of the full tensor, significantly reducing the model's output size and overall parameter count. This lowers the cost of generating complex simulation data, and our experiments show that the generated data remains useful. As a result, tensor decomposition has the potential to improve the efficiency of generative models, especially when generating multidimensional data.
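A minimal PyTorch sketch of the core idea, assuming a CP (CANDECOMP/PARAFAC) decomposition: the generator emits small factor matrices rather than the full tensor, shrinking the output layer from I*J*K values to rank*(I+J+K). The shapes, rank, and network width are illustrative, not the paper's architecture.

```python
# Generator that outputs CP factors and reconstructs the full tensor.
# For a 32x32x32 tensor at rank 8, the output layer has 8*(32+32+32) = 768
# units instead of 32*32*32 = 32768, which is the claimed parameter saving.
import torch
import torch.nn as nn

class FactorGenerator(nn.Module):
    def __init__(self, latent_dim, shape=(32, 32, 32), rank=8):
        super().__init__()
        self.shape, self.rank = shape, rank
        out_dim = rank * sum(shape)
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim))

    def forward(self, z):
        I, J, K = self.shape
        flat = self.net(z)
        U, V, W = flat.split([self.rank * I, self.rank * J, self.rank * K], dim=-1)
        U = U.view(-1, I, self.rank)
        V = V.view(-1, J, self.rank)
        W = W.view(-1, K, self.rank)
        # CP reconstruction: X[b,i,j,k] = sum_r U[b,i,r] * V[b,j,r] * W[b,k,r]
        return torch.einsum("bir,bjr,bkr->bijk", U, V, W)

gen = FactorGenerator(latent_dim=64)
x = gen(torch.randn(4, 64))  # batch of 4 synthetic 32x32x32 tensors
```

The same factor-output head could sit inside a GAN generator or a diffusion decoder; only the final layers change, so the adversarial or denoising training loop is unaffected.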
Abstract: Large Language Models (LLMs) have demonstrated the ability to solve complex tasks through In-Context Learning (ICL), where models learn from a few input-output pairs without explicit fine-tuning. In this paper, we explore the capacity of LLMs to solve non-linear numerical computations, with specific emphasis on functions of the Singular Value Decomposition (SVD). Our experiments show that while LLMs perform comparably to traditional models such as linear regression trained with Stochastic Gradient Descent (SGD) and Neural Networks (NNs) on simpler tasks, they outperform these models on more complex tasks, particularly the computation of the top-k singular values. Furthermore, LLMs demonstrate strong scalability, maintaining high accuracy even as the matrix size increases. We also found that LLMs can achieve high accuracy with minimal prior examples, converging quickly and avoiding the overfitting seen in classical models. These results suggest that LLMs could provide an efficient alternative to classical methods for solving high-dimensional problems. Future work will focus on extending these findings to larger matrices and more complex matrix operations, while exploring the effect of different numerical representations in ICL.
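One way to set up the top-k singular value task as ICL, sketched below: each in-context example pairs a matrix serialized as text with its singular values computed by NumPy, and the LLM is asked to complete the pattern for a held-out matrix. The serialization format, precision, and the `query_llm` callable are assumptions for illustration.

```python
# ICL prompt construction for top-k singular values (sketch).
# Ground truth comes from NumPy; the LLM call itself is abstracted away.
import numpy as np

def serialize(M, prec=3):
    """Render a matrix as text, one row per semicolon-separated group."""
    return "[" + "; ".join(" ".join(f"{v:.{prec}f}" for v in row) for row in M) + "]"

def topk_svals(M, k):
    return np.linalg.svd(M, compute_uv=False)[:k]  # descending order

def build_prompt(train_mats, test_mat, k=2):
    parts = [f"Given a matrix, output its top-{k} singular values."]
    for M in train_mats:
        vals = ", ".join(f"{s:.3f}" for s in topk_svals(M, k))
        parts.append(f"Matrix: {serialize(M)}\nSingular values: {vals}")
    parts.append(f"Matrix: {serialize(test_mat)}\nSingular values:")
    return "\n\n".join(parts)

rng = np.random.default_rng(0)
mats = [rng.normal(size=(4, 4)) for _ in range(8)]
prompt = build_prompt(mats[:-1], mats[-1])
# answer = query_llm(prompt)                    # hypothetical LLM endpoint
# accuracy: parse `answer` and compare against topk_svals(mats[-1], 2)
```

The choice of numerical representation (decimal precision, delimiter, scientific notation) is exactly the axis the abstract flags for future work, since it changes how the model tokenizes each entry.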