Abstract:Multimodal Sentiment Analysis (MSA) leverages multiple modals to analyze sentiments. Typically, advanced fusion methods and representation learning-based methods are designed to tackle it. Our proposed GSIFN solves two key problems to be solved in MSA: (i) In multimodal fusion, the decoupling of modal combinations and tremendous parameter redundancy in existing fusion methods, which lead to poor fusion performance and efficiency. (ii) The trade-off between representation capability and computation overhead of the unimodal feature extractors and enhancers. GSIFN incorporates two main components to solve these problems: (i) Graph-Structured and Interlaced-Masked Multimodal Transformer. It adopts the Interlaced Mask mechanism to construct robust multimodal graph embedding, achieve all-modal-in-one Transformer-based fusion, and greatly reduce the computation overhead. (ii) A self-supervised learning framework with low computation overhead and high performance, which utilizes a parallelized LSTM with matrix memory to enhance non-verbal modal feature for unimodal label generation. Evaluated on the MSA datasets CMU-MOSI, CMU-MOSEI, and CH-SIMS, GSIFN demonstrates superior performance with significantly lower computation overhead compared with state-of-the-art methods.