Temporal graph learning aims to generate high-quality representations for graph-based tasks along with dynamic information, which has recently drawn increasing attention. Unlike the static graph, a temporal graph is usually organized in the form of node interaction sequences over continuous time instead of an adjacency matrix. Most temporal graph learning methods model current interactions by combining historical information over time. However, such methods merely consider the first-order temporal information while ignoring the important high-order structural information, leading to sub-optimal performance. To solve this issue, by extracting both temporal and structural information to learn more informative node representations, we propose a self-supervised method termed S2T for temporal graph learning. Note that the first-order temporal information and the high-order structural information are combined in different ways by the initial node representations to calculate two conditional intensities, respectively. Then the alignment loss is introduced to optimize the node representations to be more informative by narrowing the gap between the two intensities. Concretely, besides modeling temporal information using historical neighbor sequences, we further consider the structural information from both local and global levels. At the local level, we generate structural intensity by aggregating features from the high-order neighbor sequences. At the global level, a global representation is generated based on all nodes to adjust the structural intensity according to the active statuses on different nodes. Extensive experiments demonstrate that the proposed method S2T achieves at most 10.13% performance improvement compared with the state-of-the-art competitors on several datasets.