A significant increase in the number of reconfigurable intelligent surface (RIS) elements results in a spherical wavefront in the near field of extremely large-scale RIS (XL-RIS). Although the channel matrix of the cascaded two-hop link may become sparse in the polar-domain representation, the accurate estimation of these polar-domain parameters cannot be readily guaranteed. To tackle this challenge, we exploit the sparsity inherent in the cascaded channel. To elaborate, we first estimate the significant path angles and distances corresponding to the common paths between the base station (BS) and the XL-RIS. Then, the individual path parameters associated with different users are recovered. This results in a two-stage channel estimation scheme, in which distinct learning-based networks are used for channel training at each stage. More explicitly, in stage I, a denoising convolutional neural network (DnCNN) is employed to treat the grid mismatches as noise and thereby determine the true grid indices of the angles and distances. By contrast, in stage II, an iterative shrinkage thresholding algorithm (ISTA) based network is proposed for adaptively adjusting the column coherence of the dictionary matrix. Finally, our simulation results demonstrate that the proposed two-stage learning-based channel estimation scheme outperforms the state-of-the-art benchmarks.
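For readers unfamiliar with the ISTA iteration on which the stage-II network is based, the following is a minimal sketch of the classic, non-learned algorithm for sparse recovery, i.e. minimizing 0.5*||y - A x||^2 + lam*||x||_1. It is not the proposed network: it is real-valued rather than complex-valued, it uses a fixed dictionary A, step size and threshold, and the names `soft_threshold`, `ista`, `lam` and `step` are hypothetical choices made purely for illustration.

```python
def soft_threshold(x, tau):
    """Shrink x toward zero by tau (proximal operator of the l1 norm)."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def ista(A, y, lam, step, num_iters=200):
    """Classic ISTA for min_x 0.5*||y - A x||^2 + lam*||x||_1.

    A is a list of rows (m x n) and y a list of length m.
    Assumption: step is at most 1 / (largest eigenvalue of A^T A).
    """
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(num_iters):
        # Residual r = A x - y
        r = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        # Gradient of the data-fit term: g = A^T r
        g = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        # Gradient step followed by element-wise soft thresholding
        x = [soft_threshold(x[j] - step * g[j], step * lam) for j in range(n)]
    return x


if __name__ == "__main__":
    # Toy example with an identity dictionary: the small entry is set to zero
    # and the large entries are shrunk by lam.
    A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    y = [3.0, 0.05, -2.0]
    print(ista(A, y, lam=0.1, step=1.0, num_iters=50))
```

In learned variants of ISTA, quantities such as the step size, the threshold, or the matrices used in the gradient step are typically made trainable per iteration; how the proposed network adapts the dictionary's column coherence is specific to the paper and is not reproduced here.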