Abstract:Conventional domain adaptation typically transfers knowledge from a source domain to a stationary target domain. However, in many real-world cases, target data usually emerge sequentially and have continuously evolving distributions. Restoring and adapting to such target data results in escalating computational and resource consumption over time. Hence, it is vital to devise algorithms to address the evolving domain adaptation (EDA) problem, \emph{i.e.,} adapting models to evolving target domains without access to historic target domains. To achieve this goal, we propose a simple yet effective approach, termed progressive conservative adaptation (PCAda). To manage new target data that diverges from previous distributions, we fine-tune the classifier head based on the progressively updated class prototypes. Moreover, as adjusting to the most recent target domain can interfere with the features learned from previous target domains, we develop a conservative sparse attention mechanism. This mechanism restricts feature adaptation within essential dimensions, thus easing the inference related to historical knowledge. The proposed PCAda is implemented with a meta-learning framework, which achieves the fast adaptation of the classifier with the help of the progressively updated class prototypes in the inner loop and learns a generalized feature without severely interfering with the historic knowledge via the conservative sparse attention in the outer loop. Experiments on Rotated MNIST, Caltran, and Portraits datasets demonstrate the effectiveness of our method.
Abstract:Object pose estimation is an important topic in 3D vision. Though most current state-of-the-art method that trains on real-world pose annotations achieve good results, the cost of such real-world training data is too high. In this paper, we propose a novel method for sim-to-real pose estimation, which is effective on both instance-level and category-level settings. The proposed method is based on the point-pair voting scheme from CPPF to vote for object centers, orientations, and scales. Unlike naive point pairs, to enrich the context provided by each voting unit, we introduce N-point tuples to fuse features from more than two points. Besides, a novel vote selection module is leveraged in order to discard those `bad' votes. Experiments show that our proposed method greatly advances the performance on both instance-level and category-level scenarios. Our method further narrows the gap between sim-to-real and real-training methods by generating synthetic training data online efficiently, while all previous sim-to-real methods need to generate data offline, because of their complex background synthesizing or photo-realistic rendering. Code repository: https://github.com/qq456cvb/BeyondPPF.
Abstract:In this paper, we first provide a new perspective to divide existing high performance object detection methods into direct and indirect regressions. Direct regression performs boundary regression by predicting the offsets from a given point, while indirect regression predicts the offsets from some bounding box proposals. Then we analyze the drawbacks of the indirect regression, which the recent state-of-the-art detection structures like Faster-RCNN and SSD follows, for multi-oriented scene text detection, and point out the potential superiority of direct regression. To verify this point of view, we propose a deep direct regression based method for multi-oriented scene text detection. Our detection framework is simple and effective with a fully convolutional network and one-step post processing. The fully convolutional network is optimized in an end-to-end way and has bi-task outputs where one is pixel-wise classification between text and non-text, and the other is direct regression to determine the vertex coordinates of quadrilateral text boundaries. The proposed method is particularly beneficial for localizing incidental scene texts. On the ICDAR2015 Incidental Scene Text benchmark, our method achieves the F1-measure of 81%, which is a new state-of-the-art and significantly outperforms previous approaches. On other standard datasets with focused scene texts, our method also reaches the state-of-the-art performance.