As vehicle intelligence advances, multi-modal sensing-aided communication is emerging as a key enabler of reliable Vehicle-to-Everything (V2X) connectivity through precise environmental characterization. Because centralized learning can suffer from data privacy, model heterogeneity, and communication overhead issues, federated learning (FL) has been introduced to support V2X. However, the practical deployment of FL faces two critical challenges: model performance degradation caused by label imbalance across vehicles, and training instability induced by modality disparities among sensor-equipped agents. To overcome these limitations, we propose a generative FL approach for beam selection (GFL4BS). Our solution features two core innovations: 1) an adaptive zero-shot multi-modal generator, coupled with spectral-regularized loss functions, that enhances the expressiveness of synthetic data and thereby compensates for both label scarcity and missing modalities; 2) a hybrid training paradigm that integrates feature fusion with decentralized optimization to ensure training resilience while minimizing communication costs. Experimental evaluations demonstrate significant improvements over baselines: GFL4BS achieves 16.2% higher accuracy than the current state of the art under severe label imbalance and maintains a success rate above 70% even when two agents lack both LiDAR and RGB camera inputs.