Dynamic functional connectivity networks (dFCNs) based on resting-state functional MRI (rs-fMRI) have demonstrated great potential for brain function analysis and brain disease classification. Recent studies have applied deep learning techniques (e.g., convolutional neural networks, CNNs) to dFCN classification and achieved better performance than traditional machine learning methods. Nevertheless, previous deep learning methods usually perform successive convolutional operations on the input dFCNs to obtain high-order brain network aggregation features, extracting them from each sliding window separately, which may neglect non-linear correlations among different brain regions and the sequential order of the information. As a result, these studies ignore important high-order sequence information in dFCNs that could further improve classification performance. More recently, inspired by the success of the Transformer in natural language processing and computer vision, several works have applied the Transformer to brain disease diagnosis based on rs-fMRI data. Although the Transformer can capture non-linear correlations, it is less suited to capturing local spatial feature patterns and to modelling the temporal dimension, because it processes the sequence in parallel, even when equipped with positional encoding. To address these issues, we propose a self-attention (SA) based convolutional recurrent network (SA-CRN) learning framework for brain disease classification with rs-fMRI data. Experimental results on a public dataset (ADNI) demonstrate the effectiveness of the proposed SA-CRN method.
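For readers unfamiliar with the input representation, the following is a minimal sketch of how a dFCN sequence is commonly built from ROI time series with a sliding window and Pearson correlation. It is an illustration under stated assumptions, not the construction used in this work: the function name `sliding_window_dfcn`, the window length, the stride, the number of ROIs, and the synthetic data are all hypothetical.

```python
import numpy as np

def sliding_window_dfcn(ts, window, stride):
    """Build a sequence of correlation matrices from ROI time series.

    ts     : array of shape (T, R), T time points for R regions of interest
    window : number of time points per sliding window (assumed value)
    stride : step between consecutive window starts (assumed value)
    Returns an array of shape (W, R, R), one Pearson correlation matrix
    per window.
    """
    ts = np.asarray(ts, dtype=float)
    n_timepoints = ts.shape[0]
    starts = range(0, n_timepoints - window + 1, stride)
    # np.corrcoef treats rows as variables, so each (window, R) segment
    # is transposed to (R, window) before computing correlations.
    return np.stack([np.corrcoef(ts[s:s + window].T) for s in starts])

# Synthetic stand-in for one subject: 140 volumes, 90 ROIs (hypothetical sizes).
rng = np.random.default_rng(0)
ts = rng.standard_normal((140, 90))
dfcn = sliding_window_dfcn(ts, window=30, stride=10)
```

The resulting array keeps the windows in temporal order, which is the ordering that a sequence model would need to exploit.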