In this paper, we propose the semantic graph Transformer (SGT) for 3D scene graph generation. The task aims to parse a cloud point-based scene into a semantic structural graph, with the core challenge of modeling the complex global structure. Existing methods based on graph convolutional networks (GCNs) suffer from the over-smoothing dilemma and could only propagate information from limited neighboring nodes. In contrast, our SGT uses Transformer layers as the base building block to allow global information passing, with two types of proposed Transformer layers tailored for the 3D scene graph generation task. Specifically, we introduce the graph embedding layer to best utilize the global information in graph edges while maintaining comparable computation costs. Additionally, we propose the semantic injection layer to leverage categorical text labels and visual object knowledge. We benchmark our SGT on the established 3DSSG benchmark and achieve a 35.9% absolute improvement in relationship prediction's R@50 and an 80.4% boost on the subset with complex scenes over the state-of-the-art. Our analyses further show SGT's superiority in the long-tailed and zero-shot scenarios. We will release the code and model.