Title: Similarity Guided Multimodal Fusion Transformer for Semantic Location Prediction in Social Media

May 09, 2024

Zhizhen Zhang, Ning Wang, Haojie Li, Zhihui Wang

Abstract: Semantic location prediction aims to extract relevant semantic location information from multimodal social media posts, offering a more contextual understanding of daily activities than GPS coordinates provide. However, the task is challenging because "text-image" pairs contain noise and irrelevant information. Existing methods suffer from insufficient feature representations and do not comprehensively integrate similarity at different granularities, which makes it difficult to filter out noise and irrelevant information. To address these challenges, we propose a Similarity-Guided Multimodal Fusion Transformer (SG-MFT) for predicting social media users' semantic locations. First, we use a pre-trained large-scale vision-language model to extract high-quality feature representations from social media posts. Then, we introduce a Similarity-Guided Interaction Module (SIM) that alleviates modality heterogeneity and noise interference by incorporating coarse-grained and fine-grained similarity guidance into modality interactions. Specifically, at the coarse level we propose a novel similarity-aware feature interpolation attention mechanism, which leverages modality-wise similarity to mitigate heterogeneity and reduce noise within each modality. At the fine level, we employ a similarity-aware feed-forward block, which uses element-wise similarity to further mitigate the impact of modality heterogeneity. Building on these pre-processed features, which carry minimal noise and modal interference, we propose a Similarity-aware Feature Fusion Module (SFM) that fuses the two modalities with a cross-attention mechanism. Comprehensive experimental results demonstrate the superior performance of the proposed method in handling modality imbalance while maintaining efficient and effective fusion.
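The abstract gives no formulas, so the following is only a minimal sketch of what coarse-level, similarity-guided interpolation between two modalities could look like. It is not the SG-MFT implementation. Every function name, the mean pooling, the mapping of cosine similarity to a mixing weight `alpha`, and the projection-free attention are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_pool(tokens):
    """Average a list of token vectors into a single vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys_values):
    """Scaled dot-product cross-attention with no learned projections
    (a simplification; a real model would learn Q/K/V projections)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * k[j] for w, k in zip(weights, keys_values))
                    for j in range(d)])
    return out

def similarity_guided_mix(own, other):
    """Hypothetical coarse-level step: interpolate a modality's own tokens
    with tokens attended from the other modality, weighted by the
    modality-wise (pooled) cosine similarity."""
    sim = cosine(mean_pool(own), mean_pool(other))
    alpha = (sim + 1.0) / 2.0  # assumed mapping from [-1, 1] to [0, 1]
    attended = cross_attend(own, other)
    mixed = [[(1.0 - alpha) * o + alpha * a for o, a in zip(tok, att)]
             for tok, att in zip(own, attended)]
    return mixed, alpha

# Toy features: two text tokens and two image tokens of dimension 2.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
mixed_text, alpha = similarity_guided_mix(text_tokens, image_tokens)
```

Under this assumed weighting, a post whose text and image disagree (low similarity) keeps each modality's features mostly unchanged, while a well-aligned pair lets more of the other modality's information through. That is one plausible reading of "similarity guidance" as a noise filter, not a description of the paper's actual design.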
