Abstract: Large Language Models (LLMs) based on transformers achieve cutting-edge results on a variety of applications. However, their enormous size and processing requirements make deployment on resource-constrained devices extremely difficult. Among various efficiency techniques, model binarization and Early Exit (EE) are common and effective solutions. However, binarization may degrade performance, because the reduced precision affects gradient estimation and parameter updates. Moreover, existing early-exit mechanisms are still at a nascent stage of research. To ameliorate these issues, we propose the Binarized Early Exit Transformer (BEExformer), the first selective-learning transformer architecture to combine early exit with binarization for textual inference. It improves the binarization process through a differentiable second-order approximation to the impulse function, which enables gradient computation with respect to both the sign and the magnitude of the weights. In contrast to absolute-threshold-based EE, the proposed EE mechanism hinges on the fractional reduction in entropy among intermediate transformer blocks, with soft-routing loss estimation. While binarization yields an 18.44-fold reduction in model size, early exit reduces the FLOPs during inference by 54.85% and even improves accuracy by 5.98% by resolving the "overthinking" problem inherent in deep networks. Moreover, the proposed BEExformer simplifies training by not requiring knowledge distillation from a full-precision LLM. Extensive evaluation on the GLUE dataset and comparison with SOTA works showcase its Pareto-optimal performance-efficiency trade-off.
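The idea of exiting on a fractional reduction in entropy, rather than on an absolute entropy threshold, can be illustrated with a minimal sketch. The exact exit rule is not given in the abstract, so the rule below is an assumption: inference stops at the first block whose prediction entropy drops by less than a fraction `delta` relative to the previous block. The function names, the `delta` parameter, and the per-block logits are hypothetical and not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def early_exit_block(per_block_logits, delta=0.1):
    """Return the index of the block at which inference would stop.

    Assumed rule (not confirmed by the abstract): exit at the first block
    whose entropy falls by less than `delta` as a fraction of the previous
    block's entropy. An increase in entropy also triggers an exit.
    """
    prev = None
    for i, logits in enumerate(per_block_logits):
        h = entropy(softmax(logits))
        if prev is not None and prev > 0.0:
            if (prev - h) / prev < delta:
                return i
        prev = h
    # No block met the criterion, so the full depth is used.
    return len(per_block_logits) - 1

# Hypothetical logits from five consecutive blocks for one input.
blocks = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [3.1, 0.0], [5.0, 0.0]]
exit_at = early_exit_block(blocks, delta=0.1)
```

Because the criterion is relative, the same `delta` can be reused across inputs whose absolute entropy levels differ, which is the practical difference from a fixed entropy threshold.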
Abstract: Environmental Sound Classification is an important problem in sound recognition and is more complicated than speech recognition, as environmental sounds are not well structured with respect to time and frequency. Over the past years, researchers have used various CNN models to learn from different audio features generated from the audio files, such as log-mel spectrograms, gammatone spectral coefficients, and mel-frequency spectral coefficients. In this paper, we propose a new methodology, Two-Level Classification: the Level 1 Classifier classifies the audio signal into a broader class, and the Level 2 Classifiers find the actual class to which the audio belongs, based on the output of the Level 1 Classifier. We also show the effects of different audio filters, among which a new method, Audio Crop, is introduced in this paper; it gave the highest accuracies in most cases. We used the ESC-50 dataset for our experiments and obtained a maximum accuracy of 78.75% for Level 1 Classification and 98.04% for Level 2 Classification.
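The routing logic of the two-level scheme described above can be sketched as follows. This is only an illustration of the control flow: the toy rule-based classifiers, the class names, and the two-element feature vector are hypothetical stand-ins for the trained CNN models and audio features, which the abstract does not specify in detail.

```python
from typing import Callable, Dict, Sequence, Tuple

Features = Sequence[float]
Classifier = Callable[[Features], str]

def two_level_predict(
    features: Features,
    level1: Classifier,
    level2: Dict[str, Classifier],
) -> Tuple[str, str]:
    """Predict a broad class first, then route to that class's fine classifier."""
    broad = level1(features)
    fine = level2[broad](features)
    return broad, fine

# Toy stand-ins for trained models (hypothetical rules and labels).
def level1(f: Features) -> str:
    return "animals" if f[0] > 0.5 else "urban"

level2: Dict[str, Classifier] = {
    "animals": lambda f: "dog" if f[1] < 0.5 else "cat",
    "urban": lambda f: "siren" if f[1] < 0.5 else "engine",
}

prediction = two_level_predict([0.9, 0.2], level1, level2)
```

One consequence of this design is that a Level 1 error cannot be corrected at Level 2, so the reported Level 2 accuracy is best read as conditional on the broad class being correct.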
Abstract: A language is made up of a finite or infinite number of sentences, each of which is composed of a number of words. The Electrocardiogram (ECG) is the most popular noninvasive medical tool for studying heart function and diagnosing various irregular cardiac rhythms. Intuitive inspection of the ECG reveals a marked similarity between ECG signals and spoken language. As a result, the ECG signal may be thought of as a series of heartbeats (similar to sentences in a spoken language), with each heartbeat consisting of a collection of waves (similar to words in a sentence) with varying morphologies. Just as natural language processing (NLP) is used to help computers comprehend and interpret human natural language, it is conceivable to create NLP-inspired algorithms to help computers comprehend electrocardiogram data more efficiently. In this study, we propose a novel ECG analysis technique, based on embedding and self-attention, to capture the spatial as well as the temporal dependencies of the ECG data. To generate the embedding, an encoder-decoder network is proposed to capture the temporal dependencies of the ECG signal and perform data compression. The compressed and encoded data is fed to the embedding layer as its weights. Finally, the proposed CNN-LSTM-Self Attention classifier works on the embedding layer and classifies the signal as normal or anomalous. The approach was tested using the PTB-XL dataset, which is severely imbalanced. Our emphasis was on correctly recognising the disease classes present in minority numbers, in order to limit false-negative cases. An accuracy of 91% was achieved with a good F1-score for all the disease classes. Additionally, the size of the model was reduced by 34% due to compression, making it suitable for deployment in real-time applications.
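Why per-class F1-scores matter on an imbalanced dataset can be shown with a small worked sketch. The labels and predictions below are made-up toy data, not results from the study, and the helper name `per_class_f1` is hypothetical.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Compute the F1-score of each class from paired label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = {}
    for c in set(y_true) | set(y_pred):
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores

# Toy data: 9 normal recordings, 1 anomalous; the model predicts "normal" always.
y_true = ["normal"] * 9 + ["anomalous"]
y_pred = ["normal"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = per_class_f1(y_true, y_pred)
```

In this toy case the accuracy is 0.9 even though the minority class is never detected and its F1-score is 0, which is why accuracy alone is not a sufficient measure when false negatives in minority classes are the main concern.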
Abstract: One of the principal objectives of Natural Language Processing (NLP) is to generate meaningful representations from text. Improving the informativeness of these representations has led to a tremendous rise in their dimensionality and memory footprint. This has a cascading effect, amplifying the complexity of the downstream model by increasing its parameters. Moreover, the available techniques cannot be applied to cross-modal applications such as text-to-image. To ameliorate these issues, this paper proposes a novel Text-to-Image methodology for generating fixed-length representations through a self-supervised Variational Auto-Encoder (VAE) for semantic evaluation applying transformers (TexIm FAST). The pictorial representations allow oblivious inference while retaining the linguistic intricacies, and are potent in cross-modal applications. TexIm FAST deals with variable-length sequences and generates fixed-length representations with over 75% reduced memory footprint. It enhances the efficiency of models for downstream tasks by reducing their parameters. The efficacy of TexIm FAST has been extensively analyzed for the task of Semantic Textual Similarity (STS) on the MSRPC, CNN/Daily Mail, and XSum datasets. The results demonstrate a 6% improvement in accuracy compared to the baseline and showcase its exceptional ability to compare sequences of disparate lengths, such as a text and its summary.