In the past decade, using Street View images and machine learning to measure human perception has become a mainstream research approach in urban science. However, this approach using only image-shallow information makes it difficult to comprehensively understand the deep semantic features of human perception of a scene. In this study, we proposed a new framework based on a pre-train natural language model to understand the relationship between human perception and the sense of a scene. Firstly, Place Pulse 2.0 was used as our base dataset, which contains a variety of human-perceived labels, namely, beautiful, safe, wealthy, depressing, boring, and lively. An image captioning network was used to extract the description information of each street view image. Secondly, a pre-trained BERT model was finetuning and added a regression function for six human perceptual dimensions. Furthermore, we compared the performance of five traditional regression methods with our approach and conducted a migration experiment in Hong Kong. Our results show that human perception scoring by deep semantic features performed better than previous studies by machine learning methods with shallow features. The use of deep scene semantic features provides new ideas for subsequent human perception research, as well as better explanatory power in the face of spatial heterogeneity.