Building height retrieval from synthetic aperture radar (SAR) imagery is of great importance for urban applications, yet highly challenging owing to the complexity of SAR data. This paper addresses the issue of building height retrieval in large-scale urban areas from a single TerraSAR-X spotlight or stripmap image. Based on the radar viewing geometry, we propose that this problem can be formulated as a bounding box regression problem and therefore allows for integrating height data from multiple data sources in generating ground truth on a larger scale. We introduce building footprints from geographic information system (GIS) data as complementary information and propose a bounding box regression network that exploits the location relationship between a building's footprint and its bounding box, allowing for fast computation. This is important for large-scale applications. The method is validated on four urban data sets using TerraSAR-X images in both high-resolution spotlight and stripmap modes. Experimental results show that the proposed network can reduce the computation cost significantly while keeping the height accuracy of individual buildings compared to a Faster R-CNN based method. Moreover, we investigate the impact of inaccurate GIS data on our proposed network, and this study shows that the bounding box regression network is robust against positioning errors in GIS data. The proposed method has great potential to be applied to regional or even global scales.