Learning-based applications have demonstrated practical use cases in ubiquitous environments and amplified interest in exploiting the data stored on users' mobile devices. Distributed learning algorithms aim to leverage such distributed and diverse data to learn a global phenomena by performing training amongst participating devices and repeatedly aggregating their local models' parameters into a global model. Federated learning is a promising paradigm that allows for extending local training among the participant devices before aggregating the parameters, offering better communication efficiency. However, in the cases where the participants' data are strongly skewed (i.e., non-IID), the model accuracy can significantly drop. To face this challenge, we leverage the edge computing paradigm to design a hierarchical learning system that performs Federated Gradient Descent on the user-edge layer and Federated Averaging on the edge-cloud layer. In this hierarchical architecture, the users are assigned to different edges, such that edge-level data distributions turn to be close to IID. We formalize and optimize this user-edge assignment problem to minimize classes' distribution distance between edge nodes, which enhances the Federated Averaging performance. Our experiments on multiple real-world datasets show that the proposed optimized assignment is tractable and leads to faster convergence of models towards a better accuracy value.