Here, we present IDNet, a user authentication framework from smartphone-acquired motion signals. Its goal is to recognize a target user from their way of walking, using the accelerometer and gyroscope (inertial) signals provided by a commercial smartphone worn in the front pocket of the user's trousers. IDNet features several innovations including: i) a robust and smartphone-orientation-independent walking cycle extraction block, ii) a novel feature extractor based on convolutional neural networks, iii) a one-class support vector machine to classify walking cycles, and the coherent integration of these into iv) a multi-stage authentication technique. IDNet is the first system that exploits a deep learning approach as universal feature extractors for gait recognition, and that combines classification results from subsequent walking cycles into a multi-stage decision making framework. Experimental results show the superiority of our approach against state-of-the-art techniques, leading to misclassification rates (either false negatives or positives) smaller than 0.15% with fewer than five walking cycles. Design choices are discussed and motivated throughout, assessing their impact on the user authentication performance.