Street Scene Semantic Understanding (denoted as TriSU) is a crucial but complex task for world-wide distributed autonomous driving (AD) vehicles (e.g., Tesla). Its inference model faces poor generalization issue due to inter-city domain-shift. Hierarchical Federated Learning (HFL) offers a potential solution for improving TriSU model generalization, but suffers from slow convergence rate because of vehicles' surrounding heterogeneity across cities. Going beyond existing HFL works that have deficient capabilities in complex tasks, we propose a rapid-converged heterogeneous HFL framework (FedRC) to address the inter-city data heterogeneity and accelerate HFL model convergence rate. In our proposed FedRC framework, both single RGB image and RGB dataset are modelled as Gaussian distributions in HFL aggregation weight design. This approach not only differentiates each RGB sample instead of typically equalizing them, but also considers both data volume and statistical properties rather than simply taking data quantity into consideration. Extensive experiments on the TriSU task using across-city datasets demonstrate that FedRC converges faster than the state-of-the-art benchmark by 38.7%, 37.5%, 35.5%, and 40.6% in terms of mIoU, mPrecision, mRecall, and mF1, respectively. Furthermore, qualitative evaluations in the CARLA simulation environment confirm that the proposed FedRC framework delivers top-tier performance.