Abstract:3D lanes offer a more comprehensive understanding of the road surface geometry than 2D lanes, thereby providing crucial references for driving decisions and trajectory planning. While many efforts aim to improve prediction accuracy, we recognize that an efficient network can bring results closer to lane modeling. However, if the modeling data is imprecise, the results might not accurately capture the real-world scenario. Therefore, accurate lane modeling is essential to align prediction results closely with the environment. This study centers on efficient and accurate lane modeling, proposing a joint modeling approach that combines Bezier curves and interpolation methods. Furthermore, based on this lane modeling approach, we developed a Global2Local Lane Matching method with Bezier Control-Point and Key-Point, which serve as a comprehensive solution that leverages hierarchical features with two mathematical models to ensure a precise match. We also introduce a novel 3D Spatial Constructor, representing an exploration of 3D surround-view lane detection research. The framework is suitable for front-view or surround-view 3D lane detection. By directly outputting the key points of lanes in 3D space, it overcomes the limitations of anchor-based methods, enabling accurate prediction of closed-loop or U-shaped lanes and effective adaptation to complex road conditions. This innovative method establishes a new benchmark in front-view 3D lane detection on the Openlane dataset and achieves competitive performance in surround-view 2D lane detection on the Argoverse2 dataset.
Abstract:Scene text spotting is essential in various computer vision applications, enabling extracting and interpreting textual information from images. However, existing methods often neglect the spatial semantics of word images, leading to suboptimal detection recall rates for long and short words within long-tailed word length distributions that exist prominently in dense scenes. In this paper, we present WordLenSpotter, a novel word length-aware spotter for scene text image detection and recognition, improving the spotting capabilities for long and short words, particularly in the tail data of dense text images. We first design an image encoder equipped with a dilated convolutional fusion module to integrate multiscale text image features effectively. Then, leveraging the Transformer framework, we synergistically optimize text detection and recognition accuracy after iteratively refining text region image features using the word length prior. Specially, we design a Spatial Length Predictor module (SLP) using character count prior tailored to different word lengths to constrain the regions of interest effectively. Furthermore, we introduce a specialized word Length-aware Segmentation (LenSeg) proposal head, enhancing the network's capacity to capture the distinctive features of long and short terms within categories characterized by long-tailed distributions. Comprehensive experiments on public datasets and our dense text spotting dataset DSTD1500 demonstrate the superiority of our proposed methods, particularly in dense text image detection and recognition tasks involving long-tailed word length distributions encompassing a range of long and short words.