Abstract:Multi-modal fusion has played a vital role in multi-modal scene understanding. Most existing methods focus on cross-modal fusion involving two modalities, often overlooking more complex multi-modal fusion, which is essential for real-world applications like autonomous driving, where visible, depth, event, LiDAR, etc., are used. Besides, few attempts for multi-modal fusion, \emph{e.g.}, simple concatenation, cross-modal attention, and token selection, cannot well dig into the intrinsic shared and specific details of multiple modalities. To tackle the challenge, in this paper, we propose a Part-Whole Relational Fusion (PWRF) framework. For the first time, this framework treats multi-modal fusion as part-whole relational fusion. It routes multiple individual part-level modalities to a fused whole-level modality using the part-whole relational routing ability of Capsule Networks (CapsNets). Through this part-whole routing, our PWRF generates modal-shared and modal-specific semantics from the whole-level modal capsules and the routing coefficients, respectively. On top of that, modal-shared and modal-specific details can be employed to solve the issue of multi-modal scene understanding, including synthetic multi-modal segmentation and visible-depth-thermal salient object detection in this paper. Experiments on several datasets demonstrate the superiority of the proposed PWRF framework for multi-modal scene understanding. The source code has been released on https://github.com/liuyi1989/PWRF.
Abstract:Co-salient object detection targets at detecting co-existed salient objects among a group of images. Recently, a generalist model for segmenting everything in context, called SegGPT, is gaining public attention. In view of its breakthrough for segmentation, we can hardly wait to probe into its contribution to the task of co-salient object detection. In this report, we first design a framework to enable SegGPT for the problem of co-salient object detection. Proceed to the next step, we evaluate the performance of SegGPT on the problem of co-salient object detection on three available datasets. We achieve a finding that co-saliency scenes challenges SegGPT due to context discrepancy within a group of co-saliency images.