Abstract:Lowering the memory requirement in full-parameter training on large models has become a hot research area. MeZO fine-tunes the large language models (LLMs) by just forward passes in a zeroth-order SGD optimizer (ZO-SGD), demonstrating excellent performance with the same GPU memory usage as inference. However, the simulated perturbation stochastic approximation for gradient estimate in MeZO leads to severe oscillations and incurs a substantial time overhead. Moreover, without momentum regularization, MeZO shows severe over-fitting problems. Lastly, the perturbation-irrelevant momentum on ZO-SGD does not improve the convergence rate. This study proposes ZO-AdaMU to resolve the above problems by adapting the simulated perturbation with momentum in its stochastic approximation. Unlike existing adaptive momentum methods, we relocate momentum on simulated perturbation in stochastic gradient approximation. Our convergence analysis and experiments prove this is a better way to improve convergence stability and rate in ZO-SGD. Extensive experiments demonstrate that ZO-AdaMU yields better generalization for LLMs fine-tuning across various NLP tasks than MeZO and its momentum variants.
Abstract:The diversity of tables makes table detection a great challenge, leading to existing models becoming more tedious and complex. Despite achieving high performance, they often overfit to the table style in training set, and suffer from significant performance degradation when encountering out-of-distribution tables in other domains. To tackle this problem, we start from the essence of the table, which is a set of text arranged in rows and columns. Based on this, we propose a novel, light-weighted and robust Table Detection method based on Learning Text Arrangement, namely TDeLTA. TDeLTA takes the text blocks as input, and then models the arrangement of them with a sequential encoder and an attention module. To locate the tables precisely, we design a text-classification task, classifying the text blocks into 4 categories according to their semantic roles in the tables. Experiments are conducted on both the text blocks parsed from PDF and extracted by open-source OCR tools, respectively. Compared to several state-of-the-art methods, TDeLTA achieves competitive results with only 3.1M model parameters on the large-scale public datasets. Moreover, when faced with the cross-domain data under the 0-shot setting, TDeLTA outperforms baselines by a large margin of nearly 7%, which shows the strong robustness and transferability of the proposed model.
Abstract:The excellent generalization, contextual learning, and emergence abilities in the pre-trained large models (PLMs) handle specific tasks without direct training data, making them the better foundation models in the adversarial domain adaptation (ADA) methods to transfer knowledge learned from the source domain to target domains. However, existing ADA methods fail to account for the confounder properly, which is the root cause of the source data distribution that differs from the target domains. This study proposes an adversarial domain adaptation with confounder balancing for PLMs fine-tuning (ADA-CBF). The ADA-CBF includes a PLM as the foundation model for a feature extractor, a domain classifier and a confounder classifier, and they are jointly trained with an adversarial loss. This loss is designed to improve the domain-invariant representation learning by diluting the discrimination in the domain classifier. At the same time, the adversarial loss also balances the confounder distribution among source and unmeasured domains in training. Compared to existing ADA methods, ADA-CBF can correctly identify confounders in domain-invariant features, thereby eliminating the confounder biases in the extracted features from PLMs. The confounder classifier in ADA-CBF is designed as a plug-and-play and can be applied in the confounder measurable, unmeasurable, or partially measurable environments. Empirical results on natural language processing and computer vision downstream tasks show that ADA-CBF outperforms the newest GPT-4, LLaMA2, ViT and ADA methods.
Abstract:Multi-sense word embeddings have been promising solutions for word sense learning. Nevertheless, building large-scale training corpus and learning appropriate word sense are still open issues. In this paper, we propose a method for Decomposing the word Embedding into context-specific Sense representation, called DecE2S. First, the unsupervised polysemy embedding is fed into capsule network to produce its multiple sememe-like vectors. Second, with attention operations, DecE2S integrates the word context to represent the context-specific sense vector. To train DecE2S, we design a word matching training method for learning the context-specific sense representation. DecE2S was experimentally evaluated on two sense learning tasks, i.e., word in context and word sense disambiguation. Results on two public corpora Word-in-Context and English all-words Word Sense Disambiguation show that, the DesE2S model achieves the new state-of-the-art for the word in context and word sense disambiguation tasks.
Abstract:Music emotion recognition (MER) is usually regarded as a multi-label tagging task, and each segment of music can inspire specific emotion tags. Most researchers extract acoustic features from music and explore the relations between these features and their corresponding emotion tags. Considering the inconsistency of emotions inspired by the same music segment for human beings, seeking for the key acoustic features that really affect on emotions is really a challenging task. In this paper, we propose a novel MER method by using deep convolutional neural network (CNN) on the music spectrograms that contains both the original time and frequency domain information. By the proposed method, no additional effort on extracting specific features required, which is left to the training procedure of the CNN model. Experiments are conducted on the standard CAL500 and CAL500exp dataset. Results show that, for both datasets, the proposed method outperforms state-of-the-art methods.
Abstract:In this paper, we propose a novel model, named Stroke Sequence-dependent Deep Convolutional Neural Network (SSDCNN), using the stroke sequence information and eight-directional features for Online Handwritten Chinese Character Recognition (OLHCCR). On one hand, SSDCNN can learn the representation of Online Handwritten Chinese Character (OLHCC) by incorporating the natural sequence information of the strokes. On the other hand, SSDCNN can incorporate eight-directional features in a natural way. In order to train SSDCNN, we divide the process of training into two stages: 1) The training data is used to pre-train the whole architecture until the performance tends to converge. 2) Fully-connected neural network which is used to combine the stroke sequence-dependent representation with eight-directional features and softmax layer are further trained. Experiments were conducted on the OLHCCR competition tasks of ICDAR 2013. Results show that, SSDCNN can reduce the recognition error by 50\% (5.13\% vs 2.56\%) compared to the model which only use eight-directional features. The proposed SSDCNN achieves 97.44\% accuracy which reduces the recognition error by about 1.9\% compared with the best submitted system on ICDAR2013 competition. These results indicate that SSDCNN can exploit the stroke sequence information to learn high-quality representation of OLHCC. It also shows that the learnt representation and the classical eight-directional features complement each other within the SSDCNN architecture.