EEG-based emotion recognition holds significant promise for applications in human-computer interaction, medicine, and neuroscience. While deep learning has shown potential in this field, current approaches typically rely on large-scale, high-quality labeled datasets, which are difficult to obtain for EEG and thus limit performance. Self-supervised learning offers a remedy by generating labels automatically from the data, but its inter-subject generalizability remains under-explored. We therefore aim to provide a self-supervised learning paradigm with stronger inter-subject generalizability. Inspired by recent efforts to combine low-level and high-level tasks in deep learning, we propose a cascaded self-supervised architecture for EEG emotion recognition. We further introduce a low-level task, time-to-frequency reconstruction (TFR), which leverages the inherent time-frequency relationship of EEG signals. Our architecture integrates this task with high-level contrastive learning modules to perform self-supervised learning for EEG-based emotion recognition. Experiments on the DEAP and DREAMER datasets demonstrate that our method outperforms comparable approaches. The results also highlight the indispensability of the TFR task and the robustness of our method under label scarcity, validating the effectiveness of the proposed design.
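
To make the two-level objective concrete, below is a minimal, hypothetical PyTorch sketch of one pretraining step; it is not the authors' implementation. The `Encoder`, the linear spectral decoder, the Gaussian-noise augmentations, the unweighted loss sum, and the choice of an rFFT magnitude spectrum as the TFR target are all illustrative assumptions; only the pairing of a low-level reconstruction loss with a high-level contrastive loss follows the abstract.

```python
# Hypothetical sketch of combined TFR + contrastive pretraining (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps a time-domain EEG window (batch, channels, time) to an embedding."""
    def __init__(self, in_ch=32, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, 64, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(128, emb_dim),
        )

    def forward(self, x):
        return self.net(x)

def tfr_loss(decoder, z, x):
    """Low-level task: reconstruct the magnitude spectrum of x from embedding z."""
    target = torch.fft.rfft(x, dim=-1).abs()          # (batch, ch, time//2 + 1)
    pred = decoder(z).view_as(target)
    return F.mse_loss(pred, target)

def nt_xent(z1, z2, tau=0.5):
    """High-level task: standard NT-Xent contrastive loss over two views."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=1)       # (2n, d)
    sim = z @ z.t() / tau                             # pairwise similarities
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float('-inf'))  # drop self-pairs
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])  # positive = other view
    return F.cross_entropy(sim, targets)

# One illustrative training step on a batch of raw EEG windows `x`.
batch, ch, t = 16, 32, 256
x = torch.randn(batch, ch, t)
enc = Encoder(in_ch=ch)
dec = nn.Linear(128, ch * (t // 2 + 1))               # toy spectral decoder
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

v1 = x + 0.1 * torch.randn_like(x)                    # two noise-augmented views
v2 = x + 0.1 * torch.randn_like(x)
z1, z2 = enc(v1), enc(v2)
loss = tfr_loss(dec, z1, x) + nt_xent(z1, z2)         # low-level + high-level objectives
opt.zero_grad(); loss.backward(); opt.step()
```

Any augmentation family, encoder backbone, or loss weighting could replace these placeholders; the sketch only shows how the two self-supervised signals can share a single encoder.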