Abstract: Multimodal learning has been a popular area of research, yet integrating electroencephalogram (EEG) data poses unique challenges due to its inherent variability and limited availability. In this paper, we introduce a novel multimodal framework that accommodates not only conventional modalities such as video, images, and audio, but also EEG data. Our framework is designed to flexibly handle varying input sizes while dynamically adjusting attention to account for feature importance across modalities. We evaluate our approach on a recently introduced emotion recognition dataset that combines data from three modalities, making it an ideal testbed for multimodal learning. The experimental results provide a benchmark for the dataset and demonstrate the effectiveness of the proposed framework. This work highlights the potential of integrating EEG into multimodal systems, paving the way for more robust and comprehensive applications in emotion recognition and beyond.
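The abstract above does not specify the fusion mechanism, so the following is only a minimal sketch of one way variable-length inputs and attention over modalities could fit together. It assumes each modality is already encoded as a list of feature vectors of a shared dimension, pools each list by averaging, and scores modalities against a hypothetical learned `query` vector; the names `mean_pool`, `fuse`, and `query` are illustrative and not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(frames):
    """Collapse a variable-length list of feature vectors into one vector."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]


def fuse(modalities, query):
    """Attention-weighted sum of pooled modality vectors.

    modalities: dict mapping a modality name to a list of feature vectors
                (any length per modality, shared feature dimension).
    query:      hypothetical learned vector used to score each modality.
    Returns (attention weight per modality, fused feature vector).
    """
    names = sorted(modalities)
    pooled = [mean_pool(modalities[n]) for n in names]
    # Dot-product score between the query and each pooled modality vector.
    scores = [sum(q * p for q, p in zip(query, vec)) for vec in pooled]
    weights = softmax(scores)
    dim = len(query)
    fused = [sum(w * vec[i] for w, vec in zip(weights, pooled)) for i in range(dim)]
    return dict(zip(names, weights)), fused
```

Because pooling happens before scoring, a three-window EEG input and a one-frame audio input can be fused without padding; a real system would likely replace mean pooling with a learned encoder.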
Abstract: In the quest for efficient neural network models for neural data interpretation and user intent classification in brain-computer interfaces (BCIs), learning meaningful sparse representations of the underlying neural subspaces is crucial. The present study introduces a sparse multitask learning framework for motor imagery (MI) and motor execution (ME) tasks, inspired by the natural partitioning of associated neural subspaces observed in the human brain. Given a dual-task CNN model for MI-ME classification, we apply a saliency-based sparsification approach to prune superfluous connections and reinforce those that show high importance in both tasks. Through our approach, we seek to elucidate the distinct and common neural ensembles associated with each task, employing principled sparsification techniques to eliminate redundant connections and boost the fidelity of neural signal decoding. Our results indicate that this tailored sparsity can mitigate the overfitting problem and improve test performance with a small amount of data, suggesting a viable path forward for computationally efficient and robust BCI systems.
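As a rough illustration of saliency-based sparsification across two tasks, the sketch below scores each connection by the common first-order saliency |w · ∂L/∂w| for MI and ME separately and keeps the connections that score highly in both. Combining the two scores with `min` and selecting by a fixed `keep_ratio` are assumptions made for this example; the abstract does not state the exact criterion.

```python
def saliency(weights, grads):
    """First-order saliency |w * dL/dw| for each connection."""
    return [abs(w * g) for w, g in zip(weights, grads)]


def shared_mask(weights, grads_mi, grads_me, keep_ratio):
    """Binary mask keeping connections that are salient in BOTH tasks.

    The joint score is the minimum of the two task saliencies (an assumption),
    so a connection must matter for MI and for ME to survive pruning.
    """
    s_mi = saliency(weights, grads_mi)
    s_me = saliency(weights, grads_me)
    joint = [min(a, b) for a, b in zip(s_mi, s_me)]
    k = max(1, round(keep_ratio * len(weights)))
    order = sorted(range(len(joint)), key=lambda i: joint[i], reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(weights))]


def apply_mask(weights, mask):
    """Zero out pruned connections."""
    return [w * m for w, m in zip(weights, mask)]
```

For example, with `weights = [0.5, -1.0, 0.2, 2.0]`, MI gradients `[0.1, 0.4, 3.0, 0.0]`, and ME gradients `[0.2, 0.5, 0.0, 1.0]`, the last two connections are each salient for only one task and are pruned at `keep_ratio=0.5`.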
Abstract: Increasing the size of embedding layers has been shown to be effective in improving the performance of recommendation models, but it has gradually pushed model sizes beyond terabytes in industrial recommender systems, increasing computing and storage costs. To save resources while maintaining model performance, we propose SHARK, a model compression practice distilled from our experience with industrial recommender systems. SHARK consists of two main components. First, we use the first-order component of the Taylor expansion as a novel importance score to prune the number of embedding tables (feature fields). Second, we introduce a new row-wise quantization method that applies different quantization strategies to each embedding. We conduct extensive experiments on both public and industrial datasets, demonstrating that each component of the proposed SHARK framework outperforms previous approaches. We also conduct A/B tests on multiple models at Kuaishou, such as short-video, e-commerce, and advertising recommendation models. The online A/B test results show that SHARK can effectively reduce the memory footprint of the embedding layer. For the short-video scenario, the compressed model, without any performance drop, saves 70% of storage and thousands of machines, improves queries per second (QPS) by 30%, and has been deployed to serve hundreds of millions of users and process tens of billions of requests every day.
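The two components named in this abstract can be illustrated with a small sketch. It is not SHARK's actual formulation: the importance score here is the generic first-order Taylor estimate of the loss change when a field's embedding is zeroed, |Σ e · ∂L/∂e|, and the quantizer is a plain symmetric uniform scheme with one scale per row. All function names and the 8-bit default are assumptions for illustration.

```python
def taylor_importance(embedding, grad):
    """First-order Taylor estimate of the loss change if this embedding
    were set to zero: |sum(e * dL/de)|."""
    return abs(sum(e * g for e, g in zip(embedding, grad)))


def prune_fields(scores, n_drop):
    """Drop the n_drop least important feature fields; return kept names."""
    ranked = sorted(scores, key=scores.get)  # ascending importance
    return sorted(ranked[n_drop:])


def quantize_row(row, bits=8):
    """Symmetric uniform quantization with a separate scale for this row."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(x) for x in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [round(x / scale) for x in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]
```

Keeping one scale per row means a row with small values is not forced onto the coarse grid needed by a row with large values; a per-row choice of `bits` would be one way to vary the strategy further.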
Abstract: Data augmentation approaches are widely explored to enhance the decoding of electroencephalogram signals. In subject-independent brain-computer interface systems, domain adaptation and generalization are used to shift the source subjects' data distribution to match the target subject as a form of augmentation. However, previous works either introduce noise (e.g., by noise addition or generation from random noise) or modify the target data, and thus cannot depict the target data distribution well and hinder further analysis. In this paper, we propose a target-centered subject transfer framework as a data augmentation approach. A subset of the source data is first constructed to maximize the source-target relevance. Then, a generative model is applied to transfer the data to the target domain. The proposed framework enriches the explainability of the target domain by adding extra real data instead of noise. It shows superior performance compared with other data augmentation methods. Extensive experiments are conducted to verify the effectiveness and robustness of our approach as a promising tool for further research.
Abstract: WiFi-based sensing has attracted immense attention in recent years. The rationale is that the signal fluctuations caused by humans carry information about human behavior, which can be extracted from the channel state information of WiFi. Still, prior studies mainly focus on single-task sensing (STS), e.g., gesture recognition, indoor localization, or user identification. Since the fluctuations caused by gestures are highly coupled with body features and the user's location, we propose a WiFi-based multi-task sensing model (Wimuse) to perform gesture recognition, indoor localization, and user identification simultaneously. However, these tasks have different difficulty levels (i.e., the imbalance issue) and need task-specific information (i.e., the discrepancy issue). To address these issues, Wimuse adopts knowledge distillation and task-specific residual adaptors. We first train an STS model for each task. Then, to solve the imbalance issue, the common feature extracted in Wimuse is encouraged to stay close to the counterpart features of the STS models. Further, for each task, a task-specific residual adaptor extracts a task-specific compensation feature, which is fused with the common feature to address the discrepancy issue. We conduct comprehensive experiments on three public datasets, and the evaluation suggests that Wimuse achieves state-of-the-art performance, with average accuracies of 85.20%, 98.39%, and 98.725% on the joint task of gesture recognition, indoor localization, and user identification, respectively.
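The two mechanisms in this abstract, feature-level distillation from the single-task models and a residual adaptor per task, can be sketched as follows. This is a simplified illustration under stated assumptions: the distillation term is taken to be a mean squared error averaged over the STS teachers, the adaptor is a single linear layer, and fusion is additive. None of these choices is confirmed by the abstract, and the function names are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(common, teacher_feats):
    """Average MSE between the shared feature and each STS teacher feature.

    teacher_feats: dict mapping a task name to that task's STS feature.
    """
    return sum(mse(common, t) for t in teacher_feats.values()) / len(teacher_feats)


def residual_adaptor(common, weight, bias):
    """Task-specific feature = common + compensation(common).

    The compensation feature is a linear map (weight @ common + bias),
    and additive fusion is an assumption made for this sketch.
    """
    compensation = [
        sum(w * x for w, x in zip(row, common)) + b
        for row, b in zip(weight, bias)
    ]
    return [c + r for c, r in zip(common, compensation)]
```

With zero adaptor weights and bias, the task feature reduces to the common feature, so each adaptor only has to learn what the shared representation is missing for its task.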