Augmenting federated learning (FL) with direct device-to-device (D2D) communications can help improve convergence speed and reduce model bias through rapid local information exchange. However, data privacy concerns, device trust issues, and unreliable wireless channels each pose challenges to determining an effective yet resource-efficient D2D structure. In this paper, we develop a decentralized reinforcement learning (RL) methodology for D2D graph discovery that promotes communication of non-sensitive yet impactful data points over trusted and reliable links. Each device functions as an RL agent, training a policy to predict the impact of incoming links. Local (device-level) and global rewards are coupled through message passing within and between device clusters. Numerical experiments confirm the advantages offered by our method in terms of convergence speed and straggler resilience across several datasets and FL schemes.
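To make the described mechanism concrete, the following is a minimal sketch (not the authors' implementation) of one cluster of devices acting as RL agents: each agent scores candidate incoming links from hypothetical features (impact, sensitivity, trust, channel reliability), samples accept/reject decisions with a logistic policy, and performs a REINFORCE update on a reward that couples its local term with a cluster-averaged global term shared via message passing. All names, feature choices, and the 0.5/0.5 coupling weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class DeviceAgent:
    """One FL device acting as an RL agent over candidate incoming D2D links."""
    def __init__(self, n_features, lr=0.05):
        self.w = np.zeros(n_features)  # weights of a logistic (Bernoulli) policy
        self.lr = lr

    def select(self, link_features):
        """Sample an accept/reject decision for each candidate incoming link."""
        logits = link_features @ self.w
        probs = 1.0 / (1.0 + np.exp(-logits))
        actions = (rng.random(len(probs)) < probs).astype(float)
        return actions, probs

    def update(self, link_features, actions, probs, reward):
        """REINFORCE update: grad of log-likelihood of a Bernoulli policy."""
        grad = ((actions - probs)[:, None] * link_features).sum(axis=0)
        self.w += self.lr * reward * grad

def local_reward(actions, feats):
    """Hypothetical device-level reward: link impact discounted by trust and
    reliability, minus a privacy cost for accepting sensitive data."""
    impact, sensitivity, trust, reliability = feats.T
    gain = actions * impact * trust * reliability
    cost = actions * sensitivity
    return (gain - cost).sum()

# Toy decentralized training loop over one cluster of five devices,
# each seeing six candidate incoming links per step.
agents = [DeviceAgent(n_features=4) for _ in range(5)]
for step in range(200):
    feats = [rng.random((6, 4)) for _ in agents]
    acts, ps, local_rewards = [], [], []
    for agent, f in zip(agents, feats):
        a, p = agent.select(f)
        acts.append(a)
        ps.append(p)
        local_rewards.append(local_reward(a, f))
    # "Message passing": a cluster head aggregates local rewards into a
    # shared global term that every agent folds into its own update.
    global_term = np.mean(local_rewards)
    for agent, f, a, p, r_loc in zip(agents, feats, acts, ps, local_rewards):
        coupled = 0.5 * r_loc + 0.5 * global_term  # assumed coupling weights
        agent.update(f, a, p, coupled)
```

In this sketch the cluster-level averaging stands in for the intra- and inter-cluster message passing described above; the paper's actual reward design, policy architecture, and graph-discovery procedure are defined in the body of the work.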