Abstract: Parsing road scenes in UAV images presents two challenges. First, the high resolution of UAV images makes processing difficult. Second, supervised deep learning methods require a large amount of manual annotation to train robust and accurate models. In this paper, an unsupervised road parsing framework is introduced that leverages recent advances in vision-language models and vision foundation models. Initially, a vision-language model is employed to efficiently process ultra-high-resolution UAV images and quickly detect road regions of interest. Subsequently, the vision foundation model SAM is utilized to generate masks for the road regions without category information. Following that, a self-supervised representation learning network extracts feature representations from all masked regions. Finally, an unsupervised clustering algorithm groups these feature representations and assigns an ID to each cluster. The masked regions are combined with the corresponding IDs to generate initial pseudo-labels, which initiate an iterative self-training process for regular semantic segmentation. The proposed method achieves 89.96% mIoU on the development dataset without relying on any manual annotation. Particularly noteworthy is the flexibility of the proposed method, which goes beyond the limitations of human-defined categories and can acquire knowledge of new categories from the dataset itself.
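For concreteness, the following is a minimal sketch of the pseudo-label generation stage described above. The wrappers detect_road_rois, generate_masks, and extract_features are hypothetical placeholders standing in for the vision-language model, SAM, and the self-supervised feature extractor; k-means is used here as one possible unsupervised clustering algorithm, and the number of clusters is an arbitrary illustrative choice.

```python
# Sketch of pseudo-label generation: VLM ROIs -> SAM masks -> features -> cluster IDs.
# The three model wrappers are hypothetical placeholders, not the authors' code.
import numpy as np
from sklearn.cluster import KMeans  # one possible unsupervised clustering algorithm


def build_pseudo_labels(image, detect_road_rois, generate_masks,
                        extract_features, num_clusters=8):
    """Return (mask, cluster_id) pairs used as initial pseudo-labels."""
    rois = detect_road_rois(image)                      # 1) VLM proposes road regions of interest
    masks = [m for roi in rois
             for m in generate_masks(image, roi)]       # 2) SAM masks, no category information
    feats = np.stack([extract_features(image, m)        # 3) self-supervised features per mask
                      for m in masks])
    ids = KMeans(n_clusters=num_clusters,
                 n_init=10).fit_predict(feats)          # 4) assign a cluster ID to each mask
    return list(zip(masks, ids))                        # 5) masks + IDs form the pseudo-labels
```

These pseudo-labels would then seed the iterative self-training loop of a standard semantic segmentation network, as stated in the abstract.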
Abstract: Through extensive research on deep learning in recent years and its application in construction, crack detection has evolved rapidly from rough detection at the image and patch levels to fine-grained detection at the pixel level, which better suits the nature of this field. Despite numerous existing studies utilizing off-the-shelf deep learning models or enhancing them, these models are not always effective or efficient in real-world applications. To bridge this gap, we propose a High-resolution model with Semantic guidance, specifically designed for real-time crack segmentation, referred to as HrSegNet. Our model maintains high resolution throughout the entire process, as opposed to recovering high-resolution features from low-resolution ones, thereby maximizing the preservation of crack details. Moreover, to enhance contextual information, we use low-resolution semantic features to guide the reconstruction of high-resolution features. To ensure the efficiency of the algorithm, we design a simple yet effective way to control the computational cost of the entire model by adjusting the capacity of the high-resolution channels, which also gives the model strong scalability. Extensive quantitative and qualitative evaluations demonstrate that our proposed HrSegNet has exceptional crack segmentation capabilities, and that maintaining high resolution and semantic guidance are crucial to the final prediction. Compared to state-of-the-art segmentation models, HrSegNet achieves the best trade-off between efficiency and effectiveness. Specifically, on the crack dataset CrackSeg9k, our fastest model HrSegNet-B16 achieves a speed of 182 FPS with 78.43% mIoU, while our most accurate model HrSegNet-B48 achieves 80.32% mIoU with an inference speed of 140.3 FPS.
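As an illustration of the design described above, the sketch below shows one way a semantic-guidance block could look in PyTorch: a high-resolution path whose channel width (the "base" capacity, 16 or 48 in HrSegNet-B16/B48) fixes the compute budget, and a parallel low-resolution path whose upsampled features guide the high-resolution reconstruction. The exact layer layout is assumed for illustration only and is not taken from the paper.

```python
# Hedged sketch of a high-resolution block with low-resolution semantic guidance.
# Layer configuration is an illustrative assumption, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticGuidedHighResBlock(nn.Module):
    def __init__(self, base=16):
        super().__init__()
        # high-resolution path: keeps the input spatial size end to end
        self.high = nn.Sequential(
            nn.Conv2d(base, base, 3, padding=1, bias=False),
            nn.BatchNorm2d(base), nn.ReLU(inplace=True))
        # low-resolution semantic path: stride-2 conv enlarges the receptive field cheaply
        self.low = nn.Sequential(
            nn.Conv2d(base, 2 * base, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(2 * base), nn.ReLU(inplace=True))
        # 1x1 conv projects semantic features back to the high-resolution channel width
        self.guide = nn.Conv2d(2 * base, base, 1, bias=False)

    def forward(self, x):
        high = self.high(x)
        sem = self.low(x)
        # upsample semantic features and fuse them to guide the high-resolution branch
        guide = F.interpolate(self.guide(sem), size=high.shape[-2:],
                              mode="bilinear", align_corners=False)
        return F.relu(high + guide)


# usage: a 16-channel feature map keeps its 96x96 resolution through the block
feats = torch.randn(1, 16, 96, 96)
out = SemanticGuidedHighResBlock(base=16)(feats)  # shape: (1, 16, 96, 96)
```

In this reading, scaling the base channel width from 16 to 48 is the single knob the abstract describes for trading inference speed against segmentation accuracy.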