Abstract:Galaxy morphology analysis involves classifying galaxies by their shapes and structures. For this task, directly training domain-specific models on large, annotated astronomical datasets is effective but costly. In contrast, fine-tuning vision foundation models on a smaller set of astronomical images is more resource-efficient but generally results in lower accuracy. To harness the benefits of both approaches and address their shortcomings, we propose GalaxAlign, a novel method that fine-tunes pre-trained foundation models to achieve high accuracy on astronomical tasks. Specifically, our method extends a contrastive learning architecture to align three types of data in fine-tuning: (1) a set of schematic symbols representing galaxy shapes and structures, (2) textual labels of these symbols, and (3) galaxy images. This way, GalaxAlign not only eliminates the need for expensive pretraining but also enhances the effectiveness of fine-tuning. Extensive experiments on galaxy classification and similarity search demonstrate that our method effectively fine-tunes general pre-trained models for astronomical tasks by incorporating domain-specific multi-modal knowledge.
Abstract:Radio telescopes produce visibility data about celestial objects, but these data are sparse and noisy. As a result, images created on raw visibility data are of low quality. Recent studies have used deep learning models to reconstruct visibility data to get cleaner images. However, these methods rely on a substantial amount of labeled training data, which requires significant labeling effort from radio astronomers. Addressing this challenge, we propose VisRec, a model-agnostic semi-supervised learning approach to the reconstruction of visibility data. Specifically, VisRec consists of both a supervised learning module and an unsupervised learning module. In the supervised learning module, we introduce a set of data augmentation functions to produce diverse training examples. In comparison, the unsupervised learning module in VisRec augments unlabeled data and uses reconstructions from non-augmented visibility data as pseudo-labels for training. This hybrid approach allows VisRec to effectively leverage both labeled and unlabeled data. This way, VisRec performs well even when labeled data is scarce. Our evaluation results show that VisRec outperforms all baseline methods in reconstruction quality, robustness against common observation perturbation, and generalizability to different telescope configurations.
Abstract:The successful integration of large language models (LLMs) into recommendation systems has proven to be a major breakthrough in recent studies, paving the way for more generic and transferable recommendations. However, LLMs struggle to effectively utilize user and item IDs, which are crucial identifiers for successful recommendations. This is mainly due to their distinct representation in a semantic space that is different from the natural language (NL) typically used to train LLMs. To tackle such issue, we introduce ControlRec, an innovative Contrastive prompt learning framework for Recommendation systems. ControlRec treats user IDs and NL as heterogeneous features and encodes them individually. To promote greater alignment and integration between them in the semantic space, we have devised two auxiliary contrastive objectives: (1) Heterogeneous Feature Matching (HFM) aligning item description with the corresponding ID or user's next preferred ID based on their interaction sequence, and (2) Instruction Contrastive Learning (ICL) effectively merging these two crucial data sources by contrasting probability distributions of output sequences generated by diverse tasks. Experimental results on four public real-world datasets demonstrate the effectiveness of the proposed method on improving model performance.
Abstract:We propose a method named AudioFormer,which learns audio feature representations through the acquisition of discrete acoustic codes and subsequently fine-tunes them for audio classification tasks. Initially,we introduce a novel perspective by considering the audio classification task as a form of natural language understanding (NLU). Leveraging an existing neural audio codec model,we generate discrete acoustic codes and utilize them to train a masked language model (MLM),thereby obtaining audio feature representations. Furthermore,we pioneer the integration of a Multi-Positive sample Contrastive (MPC) learning approach. This method enables the learning of joint representations among multiple discrete acoustic codes within the same audio input. In our experiments,we treat discrete acoustic codes as textual data and train a masked language model using a cloze-like methodology,ultimately deriving high-quality audio representations. Notably,the MPC learning technique effectively captures collaborative representations among distinct positive samples. Our research outcomes demonstrate that AudioFormer attains significantly improved performance compared to prevailing monomodal audio classification models across multiple datasets,and even outperforms audio-visual multimodal classification models on select datasets. Specifically,our approach achieves remarkable results on datasets including AudioSet (2M,20K),and FSD50K,with performance scores of 53.9,45.1,and 65.6,respectively. We have openly shared both the code and models: https://github.com/LZH-0225/AudioFormer.git.
Abstract:To solve the problem of poor performance of deep neural network models due to insufficient data, a simple yet effective interpolation-based data augmentation method is proposed: MSMix (Manifold Swap Mixup). This method feeds two different samples to the same deep neural network model, and then randomly select a specific layer and partially replace hidden features at that layer of one of the samples by the counterpart of the other. The mixed hidden features are fed to the model and go through the rest of the network. Two different selection strategies are also proposed to obtain richer hidden representation. Experiments are conducted on three Chinese intention recognition datasets, and the results show that the MSMix method achieves better results than other methods in both full-sample and small-sample configurations.
Abstract:Datasets serve as crucial training resources and model performance trackers. However, existing datasets have exposed a plethora of problems, inducing biased models and unreliable evaluation results. In this paper, we propose a model-agnostic dataset evaluation framework for automatic dataset quality evaluation. We seek the statistical properties of the datasets and address three fundamental dimensions: reliability, difficulty, and validity, following a classical testing theory. Taking the Named Entity Recognition (NER) datasets as a case study, we introduce $9$ statistical metrics for a statistical dataset evaluation framework. Experimental results and human evaluation validate that our evaluation framework effectively assesses various aspects of the dataset quality. Furthermore, we study how the dataset scores on our statistical metrics affect the model performance, and appeal for dataset quality evaluation or targeted dataset improvement before training or testing models.
Abstract:The booming of electric vehicles demands efficient battery disassembly for recycling to be environment-friendly. Currently, battery disassembly is still primarily done by humans, probably assisted by robots, due to the unstructured environment and high uncertainties. It is highly desirable to design autonomous solutions to improve work efficiency and lower human risks in high voltage and toxic environments. This paper proposes a novel neurosymbolic method, which augments the traditional Variational Autoencoder (VAE) model to learn symbolic operators based on raw sensory inputs and their relationships. The symbolic operators include a probabilistic state symbol grounding model and a state transition matrix for predicting states after each execution to enable autonomous task and motion planning. At last, the method's feasibility is verified through test results.
Abstract:The booming of electric vehicles demands efficient battery disassembly for recycling to be environment-friendly. Due to the unstructured environment and high uncertainties, battery disassembly is still primarily done by humans, probably assisted by robots. It is highly desirable to design autonomous solutions to improve work efficiency and lower human risks in high voltage and toxic environments. This paper proposes a novel framework of the NeuroSymbolic task and motion planning method to disassemble batteries in an unstructured environment using robots automatically. It enables robots to independently locate and disassemble battery bolts, with or without obstacles. This study not only provides a solution for intelligently disassembling electric vehicle batteries but also verifies its feasibility through a set of test results with the robot accomplishing the disassembly tasks in a complex and dynamic environment.
Abstract:The use of multi-modal data such as the combination of whole slide images (WSIs) and gene expression data for survival analysis can lead to more accurate survival predictions. Previous multi-modal survival models are not able to efficiently excavate the intrinsic information within each modality. Moreover, despite experimental results show that WSIs provide more effective information than gene expression data, previous methods regard the information from different modalities as similarly important so they cannot flexibly utilize the potential connection between the modalities. To address the above problems, we propose a new asymmetrical multi-modal method, termed as AMMASurv. Specifically, we design an asymmetrical multi-modal attention mechanism (AMMA) in Transformer encoder for multi-modal data to enable a more flexible multi-modal information fusion for survival prediction. Different from previous works, AMMASurv can effectively utilize the intrinsic information within every modality and flexibly adapts to the modalities of different importance. Extensive experiments are conducted to validate the effectiveness of the proposed model. Encouraging results demonstrate the superiority of our method over other state-of-the-art methods.
Abstract:In recent years, distantly-supervised relation extraction has achieved a certain success by using deep neural networks. Distant Supervision (DS) can automatically generate large-scale annotated data by aligning entity pairs from Knowledge Bases (KB) to sentences. However, these DS-generated datasets inevitably have wrong labels that result in incorrect evaluation scores during testing, which may mislead the researchers. To solve this problem, we build a new dataset NYTH, where we use the DS-generated data as training data and hire annotators to label test data. Compared with the previous datasets, NYT-H has a much larger test set and then we can perform more accurate and consistent evaluation. Finally, we present the experimental results of several widely used systems on NYT-H. The experimental results show that the ranking lists of the comparison systems on the DS-labelled test data and human-annotated test data are different. This indicates that our human-annotated data is necessary for evaluation of distantly-supervised relation extraction.