Abstract:In the pathway toward Artificial General Intelligence (AGI), understanding human affect is essential to enhancing machines' cognitive abilities. To achieve more empathetic human-AI interaction, Multimodal Affective Computing (MAC) in human-spoken videos has attracted increasing attention. However, previous methods are mainly devoted to designing multimodal fusion algorithms and suffer from two issues: semantic imbalance caused by diverse pre-processing operations, and semantic mismatch arising when the affective content of individual modalities is inconsistent with the multimodal ground truth. Moreover, their reliance on hand-crafted feature extractors prevents them from forming end-to-end pipelines for multiple MAC downstream tasks. To address these challenges, we propose a novel end-to-end framework named SemanticMAC to compute multimodal semantic-centric affection for human-spoken videos. We first employ pre-trained Transformer models in multimodal data pre-processing and design the Affective Perceiver module to capture unimodal affective information. Moreover, we present a semantic-centric approach that unifies multimodal representation learning in three ways: gated feature interaction, multi-task pseudo-label generation, and intra-/inter-sample contrastive learning. Finally, SemanticMAC effectively learns specific and shared semantic representations under the guidance of semantic-centric labels. Extensive experimental results demonstrate that our approach surpasses state-of-the-art methods on 7 public datasets across four MAC downstream tasks.
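The abstract names gated feature interaction as one of the three semantic-centric mechanisms but does not spell out its form. Below is a minimal PyTorch sketch of a gated interaction layer of that general kind, in which a learned gate controls how much unimodal content flows into a shared representation; the module name GatedInteraction, the dimensions, and the residual formulation are our assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a gated feature-interaction layer:
# a sigmoid gate computed from both streams modulates the unimodal contribution.
import torch
import torch.nn as nn

class GatedInteraction(nn.Module):  # hypothetical module name
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)   # produces per-feature gates
        self.proj = nn.Linear(dim, dim)       # projects the unimodal stream

    def forward(self, shared: torch.Tensor, unimodal: torch.Tensor) -> torch.Tensor:
        # Gate values in [0, 1]; 1 means the unimodal feature passes through fully.
        g = torch.sigmoid(self.gate(torch.cat([shared, unimodal], dim=-1)))
        return shared + g * self.proj(unimodal)

# Usage: fuse shared semantic features with an audio stream (shapes illustrative).
fuse = GatedInteraction(dim=256)
shared, audio = torch.randn(8, 256), torch.randn(8, 256)
fused = fuse(shared, audio)   # shape (8, 256)
```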
Abstract:Sparse array designs have focused mostly on the angular resolution, peak sidelobe level, and directivity factor of virtual arrays for multiple-input multiple-output (MIMO) radar. The notion of the MIMO radar virtual array is based on the direct-path assumption, namely that the direction-of-departure (DOD) and direction-of-arrival (DOA) of a target are equal. However, the DOD and DOA of targets in multipath scenarios are likely to be very different. Identifying multipath targets therefore requires DOD-DOA imaging using the transmit and receive arrays, not the virtual array. To improve the imaging of both direct-path and multipath targets, we introduce several new criteria for MIMO radar sparse linear array (SLA) design in multipath scenarios. Under the new criteria, we adopt a cyclic optimization strategy within a coordinate descent framework to design the MIMO SLAs. We present several numerical examples to demonstrate the effectiveness of the proposed approaches.
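As a rough illustration of the cyclic optimization strategy within a coordinate descent framework mentioned above, the following NumPy sketch relocates one array element at a time whenever doing so reduces a toy criterion, the peak sidelobe level of a single array's beampattern. The paper's actual criteria target DOD-DOA imaging of multipath targets with transmit and receive SLAs; the grid size, element count, and mainlobe mask here are arbitrary assumptions, and only the cyclic update pattern is representative.

```python
# Illustrative sketch (not the paper's algorithm): cyclic coordinate descent
# over a binary element-selection vector for a sparse linear array.
import numpy as np

N, K = 32, 10                              # grid positions (lambda/2 spacing), active elements
theta = np.linspace(-np.pi / 2, np.pi / 2, 721)
steer = np.exp(1j * np.pi * np.outer(np.sin(theta), np.arange(N)))

def peak_sidelobe(x):
    bp = np.abs(steer @ x) / K                       # normalized beampattern magnitude
    return bp[np.abs(theta) > np.deg2rad(8)].max()   # exclude a crude broadside mainlobe

rng = np.random.default_rng(0)
x = np.zeros(N)
x[rng.choice(N, K, replace=False)] = 1.0
for _ in range(20):                        # cyclic passes: move one element at a time
    for i in np.flatnonzero(x == 1):
        best, best_val = x, peak_sidelobe(x)
        for j in np.flatnonzero(x == 0):
            y = x.copy()
            y[i], y[j] = 0.0, 1.0          # relocate element i to empty slot j
            if peak_sidelobe(y) < best_val:
                best, best_val = y, peak_sidelobe(y)
        x = best
print(np.flatnonzero(x))                   # optimized element positions
```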
Abstract:Phase-modulated continuous-wave (PMCW) multiple-input multiple-output (MIMO) radar systems are known to possess excellent mutual interference mitigation capabilities, but require costly and power-hungry high-sampling-rate, high-precision analog-to-digital converters (ADCs). To reduce cost and power consumption, we consider a mixed-ADC architecture, in which most receive antenna outputs are sampled by one-bit ADCs and only one or a few outputs by high-precision ADCs. We first derive the Cramér-Rao bound (CRB) for the mixed-ADC based PMCW MIMO radar to characterize the best achievable performance of an unbiased target parameter estimator. The CRB analysis demonstrates that a mixed-ADC architecture with a relatively small number of high-precision ADCs and a large number of one-bit ADCs allows us to drastically reduce hardware cost and power consumption while still maintaining the high dynamic range needed for autonomous driving applications. We also introduce a two-step estimator to realize computationally efficient maximum likelihood (ML) estimation of the target parameters. We formulate angle-Doppler imaging as a sparse parameter estimation problem and devise a computationally efficient majorization-minimization (MM) based sparse estimator, referred to as mLIKES, for accurate angle-Doppler imaging. This is followed by a relaxation-based approach that cyclically refines the results of mLIKES for accurate off-grid target parameter estimation. Numerical examples are provided to demonstrate the effectiveness of the proposed algorithms for angle-Doppler imaging using mixed-ADC based PMCW MIMO radar.
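mLIKES itself is not specified in the abstract, so the NumPy sketch below shows only the generic majorization-minimization pattern such estimators build on: each iteration majorizes an l1-regularized least-squares objective and solves a weighted ridge problem over a grid of steering vectors. The dictionary, problem sizes, regularization weight, and iteration count are illustrative assumptions, not the paper's estimator.

```python
# Illustrative sketch (not the paper's mLIKES): generic MM / reweighted
# least-squares iteration for min_x ||y - A x||_2^2 + lam * ||x||_1.
import numpy as np

rng = np.random.default_rng(1)
M, G = 16, 90                                  # sensors, angle-grid size (arbitrary)
grid = np.deg2rad(np.linspace(-60, 60, G))
A = np.exp(1j * np.pi * np.outer(np.arange(M), np.sin(grid)))  # steering dictionary

x_true = np.zeros(G, complex)
x_true[[20, 55]] = [1.0, 0.7j]                 # two on-grid targets
y = A @ x_true + 0.05 * (rng.standard_normal(M) + 1j * rng.standard_normal(M))

lam, x = 0.1, np.ones(G, complex)
for _ in range(50):
    # MM step: the l1 term is majorized at the current iterate, so each
    # iteration reduces to a weighted ridge regression with weights lam/|x|.
    w = lam / np.maximum(np.abs(x), 1e-8)
    x = np.linalg.solve(A.conj().T @ A + np.diag(w), A.conj().T @ y)
print(np.argsort(np.abs(x))[-2:])              # should concentrate near indices 20 and 55
```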
Abstract:Multimodal representation learning is a challenging task in which previous work has mostly focused on either uni-modality pre-training or cross-modality fusion. In fact, we regard modeling multimodal representations as building a skyscraper, where laying a stable foundation and designing the main structure are equally essential. The former is like encoding robust uni-modal representations, while the latter is like integrating interactive information across different modalities; both are critical to learning an effective multimodal representation. Recently, contrastive learning has been successfully applied to representation learning; it can serve as the pillar of the skyscraper and help the model extract the most important features contained in multimodal data. In this paper, we propose a novel framework named MultiModal Contrastive Learning (MMCL) for multimodal representation, capturing intra- and inter-modality dynamics simultaneously. Specifically, we devise uni-modal contrastive coding with an efficient uni-modal feature augmentation strategy to filter out the inherent noise in the acoustic and visual modalities and acquire more robust uni-modal representations. Besides, a pseudo-siamese network is presented to predict representations across different modalities, which successfully captures cross-modal dynamics. Moreover, we design two contrastive learning tasks, instance- and sentiment-based contrastive learning, to promote the prediction process and learn more interactive information related to sentiment. Extensive experiments conducted on two public datasets demonstrate that our method surpasses state-of-the-art methods.
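For readers unfamiliar with instance-based contrastive objectives, here is a minimal PyTorch sketch of an InfoNCE-style loss of the kind such a task could use: matched cross-modal pairs within a batch are treated as positives and all other pairings as negatives. The function name, temperature, and dimensions are assumptions rather than MMCL's exact loss.

```python
# Minimal sketch (not the authors' code) of an instance-level InfoNCE loss
# between two modality encoders' outputs.
import torch
import torch.nn.functional as F

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    # Rows of z_a and z_b are paired views of the same sample (e.g. text/audio).
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / tau            # (B, B) cosine similarities
    targets = torch.arange(z_a.size(0))     # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Usage with random features standing in for two modality encoders.
loss = info_nce(torch.randn(16, 128), torch.randn(16, 128))
```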