Federated Learning (FL) enables decentralized model training across multiple parties while preserving privacy. However, most FL systems assume that clients hold only unimodal data, which limits their real-world applicability because institutions often possess multimodal data. Moreover, the scarcity of labeled data further constrains the performance of most FL methods. In this work, we propose FedEPA, a novel FL framework for multimodal learning. FedEPA employs a personalized local model aggregation strategy that leverages the labeled data on each client to learn personalized aggregation weights, thereby alleviating the impact of data heterogeneity. We also propose an unsupervised modality alignment strategy that works effectively with limited labeled data. Specifically, we decompose multimodal features into aligned features and context features. We then employ contrastive learning to match the aligned features across modalities, ensure independence between the aligned and context features within each modality, and promote diversity among the context features. Finally, a multimodal feature fusion strategy produces a joint embedding. Experimental results show that FedEPA significantly outperforms existing FL methods on multimodal classification tasks under limited labeled data conditions.
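To make two of the ideas above concrete, the following is a minimal sketch rather than the FedEPA implementation. It assumes that personalized aggregation can be written as a softmax-weighted average of client parameter vectors, and that cross-modal alignment can be expressed with an InfoNCE-style contrastive loss. Neither form is specified in the text above. The names `personalized_aggregate`, `info_nce`, `logits`, and `temperature` are hypothetical and introduced only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def personalized_aggregate(client_params, logits):
    """Weighted average of client parameter vectors.

    Assumption: `logits` are per-client scores that the receiving client
    would tune on its own labeled data; softmax turns them into weights.
    """
    weights = softmax(logits)
    dim = len(client_params[0])
    return [
        sum(w * p[i] for w, p in zip(weights, client_params))
        for i in range(dim)
    ]


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(feats_a, feats_b, temperature=0.1):
    """InfoNCE-style loss: row i of feats_a should match row i of feats_b.

    Assumption: feats_a and feats_b hold the aligned features of the same
    samples from two modalities; other rows in the batch act as negatives.
    """
    a = [normalize(v) for v in feats_a]
    b = [normalize(v) for v in feats_b]
    loss = 0.0
    for i, ai in enumerate(a):
        sims = [sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
        m = max(sims)
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims))
        loss += log_denom - sims[i]  # negative log-probability of the true pair
    return loss / len(a)


if __name__ == "__main__":
    params = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(personalized_aggregate(params, [0.0, 0.0, 0.0]))

    image_feats = [[1.0, 0.0], [0.0, 1.0]]
    text_feats = [[1.0, 0.0], [0.0, 1.0]]
    print(info_nce(image_feats, text_feats))
```

In this sketch, equal logits reduce the aggregation to a plain average, and the contrastive loss is lower when matching rows of the two modalities point in the same direction than when they are permuted. The independence and diversity terms for the context features are not shown.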