Eric Modesitt

ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study

Dec 19, 2024

Eric Modesitt, Ke Yang, Spencer Hulsey, Chengxiang Zhai, Volodymyr Kindratenko

Abstract: Recent advances in language modeling demonstrate the need for high-quality domain-specific training data, especially for tasks that require specialized knowledge. General-purpose models, while versatile, often lack the depth needed for expert-level tasks because of limited domain-specific information. Domain adaptation training can enhance these models, but it demands substantial, high-quality data. To address this, we propose ORBIT, a cost-efficient methodology for curating massive, high-quality domain-specific datasets from noisy web sources, tailored for training specialist large language models. Using astronomy as a primary case study, we refined the 1.3T-token FineWeb-Edu dataset into a high-quality, 10B-token subset focused on astronomy. Fine-tuning LLaMA-3-8B on a 1B-token astronomy subset improved performance on the MMLU astronomy benchmark from 69% to 76% and achieved top results on AstroBench, an astronomy-specific benchmark. Moreover, our model (Orbit-LLaMA) outperformed LLaMA-3-8B-base, with GPT-4o evaluations preferring it in 73% of cases across 1000 astronomy-specific questions. Additionally, we validated ORBIT's generalizability by applying it to law and medicine, achieving a significant improvement of data quality compared to an unfiltered baseline. We open-source the ORBIT methodology, including the curated datasets, the codebase, and the resulting model at https://github.com/ModeEric/ORBIT-Llama.

ViT2EEG: Leveraging Hybrid Pretrained Vision Transformers for EEG Data

Aug 01, 2023

Ruiqi Yang, Eric Modesitt

Abstract: In this study, we demonstrate the application of a hybrid Vision Transformer (ViT) model, pretrained on ImageNet, on an electroencephalogram (EEG) regression task. Despite being originally trained for image classification tasks, when fine-tuned on EEG data, this model shows a notable increase in performance compared to other models, including a ViT with an identical architecture trained without the ImageNet weights. This discovery challenges the traditional understanding of model generalization, suggesting that Transformer models pretrained on seemingly unrelated image data can provide valuable priors for EEG regression tasks with an appropriate fine-tuning pipeline. The success of this approach suggests that the features extracted by ViT models in the context of visual tasks can be readily transformed for the purpose of EEG predictive modeling. We recommend utilizing this methodology not only in neuroscience and related fields, but generally for any task where data collection is limited by practical, financial, or ethical constraints. Our results illuminate the potential of pretrained models on tasks that are clearly distinct from their original purpose.

* 8 pages, 6 for article, 1 for citation, 1 for appendix. Accepted to KDD-UC 2023

Two Heads are Better than One: A Bio-inspired Method for Improving Classification on EEG-ET Data

Mar 25, 2023

Eric Modesitt, Ruiqi Yang, Qi Liu

Abstract: Classifying EEG data is integral to the performance of Brain Computer Interfaces (BCI) and their applications. However, external noise often obstructs EEG data due to its biological nature and complex data collection process. Especially when dealing with classification tasks, standard EEG preprocessing approaches extract relevant events and features from the entire dataset. However, these approaches treat all relevant cognitive events equally and overlook the dynamic nature of the brain over time. In contrast, we are inspired by neuroscience studies to use a novel approach that integrates feature selection and time segmentation of EEG data. When tested on the EEGEyeNet dataset, our proposed method significantly increases the performance of Machine Learning classifiers while reducing their respective computational complexity.

* 6 pages, 3 figures, HCI International 2023 Poster
