for the Alzheimer's Disease Neuroimaging Initiative
Abstract:Text embeddings are useful features for several NLP applications, such as sentence similarity, text clustering, and semantic search. In this paper, we present a Low-rank Adaptation with a Contrastive objective on top of 8-bit Siamese-BLOOM, a multilingual large language model optimized to produce semantically meaningful word embeddings. The innovation is threefold. First, we cast BLOOM weights to 8-bit values. Second, we fine-tune BLOOM with a scalable adapter (LoRA) and 8-bit Adam optimizer for sentence similarity classification. Third, we apply a Siamese architecture on BLOOM model with a contrastive objective to ease the multi-lingual labeled data scarcity. The experiment results show the quality of learned embeddings from LACoS-BLOOM is proportional to the number of model parameters and the amount of unlabeled training data. With the parameter efficient fine-tuning design, we are able to run BLOOM 7.1 billion parameters end-to-end on a single GPU machine with 32GB memory. Compared to previous solution Sentence-BERT, we achieve significant improvement on both English and multi-lingual STS tasks.
Abstract:Associating genetic markers with a multidimensional phenotype is an important yet challenging problem. In this work, we establish the equivalence between two popular methods: kernel-machine regression (KMR), and kernel distance covariance (KDC). KMR is a semiparametric regression frameworks that models the covariate effects parametrically, while the genetic markers are considered non-parametrically. KDC represents a class of methods that includes distance covariance (DC) and Hilbert-Schmidt Independence Criterion (HSIC), which are nonparametric tests of independence. We show the equivalence between the score test of KMR and the KDC statistic under certain conditions. This result leads to a novel generalization of the KDC test that incorporates the covariates. Our contributions are three-fold: (1) establishing the equivalence between KMR and KDC; (2) showing that the principles of kernel machine regression can be applied to the interpretation of KDC; (3) the development of a broader class of KDC statistics, that the members are the quantities of different kernels. We demonstrate the proposals using simulation studies. Data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) is used to explore the association between the genetic variants on gene \emph{FLJ16124} and phenotypes represented in 3D structural brain MR images adjusting for age and gender. The results suggest that SNPs of \emph{FLJ16124} exhibit strong pairwise interaction effects that are correlated to the changes of brain region volumes.