Authorization systems are increasingly relying on processing radio frequency (RF) waveforms at receivers to fingerprint (i.e., determine the identity) of the corresponding transmitter. Federated learning (FL) has emerged as a popular paradigm to perform RF fingerprinting in networks with multiple access points (APs), as they allow effective deep learning-based device identification without requiring the centralization of locally collected RF signals stored at multiple APs. Yet, FL algorithms that operate merely on in-phase and quadrature (IQ) time samples incur high convergence rates, resulting in excessive training rounds and inefficient training times. In this work, we propose FLAME: an FL approach for multimodal RF fingerprinting. Our framework consists of simultaneously representing received RF waveforms in multiple complimentary modalities beyond IQ samples in an effort to reduce training times. We theoretically demonstrate the feasibility and efficiency of our methodology and derive a convergence bound that incurs lower loss and thus higher accuracies in the same training round in comparison to single-modal FL-based RF fingerprinting. Extensive empirical evaluations validate our theoretical results and demonstrate the superiority of FLAME with with improvements of up to 30% in comparison to multiple considered baselines.