Abstract:Oriented object detection in remote sensing images is a challenging task due to objects being distributed in multi-orientation. Recently, end-to-end transformer-based methods have achieved success by eliminating the need for post-processing operators compared to traditional CNN-based methods. However, directly extending transformers to oriented object detection presents three main issues: 1) objects rotate arbitrarily, necessitating the encoding of angles along with position and size; 2) the geometric relations of oriented objects are lacking in self-attention, due to the absence of interaction between content and positional queries; and 3) oriented objects cause misalignment, mainly between values and positional queries in cross-attention, making accurate classification and localization difficult. In this paper, we propose an end-to-end transformer-based oriented object detector, consisting of three dedicated modules to address these issues. First, Gaussian positional encoding is proposed to encode the angle, position, and size of oriented boxes using Gaussian distributions. Second, Wasserstein self-attention is proposed to introduce geometric relations and facilitate interaction between content and positional queries by utilizing Gaussian Wasserstein distance scores. Third, oriented cross-attention is proposed to align values and positional queries by rotating sampling points around the positional query according to their angles. Experiments on six datasets DIOR-R, a series of DOTA, HRSC2016 and ICDAR2015 show the effectiveness of our approach. Compared with previous end-to-end detectors, the OrientedFormer gains 1.16 and 1.21 AP$_{50}$ on DIOR-R and DOTA-v1.0 respectively, while reducing training epochs from 3$\times$ to 1$\times$. The codes are available at https://github.com/wokaikaixinxin/OrientedFormer.
Abstract:Object instances in remote sensing images often distribute with multi-orientations, varying scales, and dense distribution. These issues bring challenges to end-to-end oriented object detectors including multi-scale features alignment and a large number of queries. To address these limitations, we propose an end-to-end oriented detector equipped with an efficient decoder, which incorporates two technologies, Rotated RoI attention (RRoI attention) and Selective Distinct Queries (SDQ). Specifically, RRoI attention effectively focuses on oriented regions of interest through a cross-attention mechanism and aligns multi-scale features. SDQ collects queries from intermediate decoder layers and then filters similar queries to obtain distinct queries. The proposed SDQ can facilitate the optimization of one-to-one label assignment, without introducing redundant initial queries or extra auxiliary branches. Extensive experiments on five datasets demonstrate the effectiveness of our method. Notably, our method achieves state-of-the-art performance on DIOR-R (67.31% mAP), DOTA-v1.5 (67.43% mAP), and DOTA-v2.0 (53.28% mAP) with the ResNet50 backbone.
Abstract:In this work, we present our solution to the vehicle re-identification (vehicle Re-ID) track in AI City Challenge 2020 (AIC2020). The purpose of vehicle Re-ID is to retrieve the same vehicle appeared across multiple cameras, and it could make a great contribution to the Intelligent Traffic System(ITS) and smart city. Due to the vehicle's orientation, lighting and inter-class similarity, it is difficult to achieve robust and discriminative representation feature. For the vehicle Re-ID track in AIC2020, our method is to fuse features extracted from different networks in order to take advantages of these networks and achieve complementary features. For each single model, several methods such as multi-loss, filter grafting, semi-supervised are used to increase the representation ability as better as possible. Top performance in City-Scale Multi-Camera Vehicle Re-Identification demonstrated the advantage of our methods, and we got 5-th place in the vehicle Re-ID track of AIC2020. The codes are available at https://github.com/gggcy/AIC2020_ReID.
Abstract:Object segmentation and object tracking are fundamental research area in the computer vision community. These two topics are diffcult to handle some common challenges, such as occlusion, deformation, motion blur, and scale variation. The former contains heterogeneous object, interacting object, edge ambiguity, and shape complexity. And the latter suffers from difficulties in handling fast motion, out-of-view, and real-time processing. Combining the two problems of video object segmentation and tracking (VOST) can overcome their respective difficulties and improve their performance. VOST can be widely applied to many practical applications such as video summarization, high definition video compression, human computer interaction, and autonomous vehicles. This article aims to provide a comprehensive review of the state-of-the-art tracking methods, and classify these methods into different categories, and identify new trends. First, we provide a hierarchical categorization existing approaches, including unsupervised VOS, semi-supervised VOS, interactive VOS, weakly supervised VOS, and segmentation-based tracking methods. Second, we provide a detailed discussion and overview of the technical characteristics of the different methods. Third, we summarize the characteristics of the related video dataset, and provide a variety of evaluation metrics. Finally, we point out a set of interesting future works and draw our own conclusions.
Abstract:Finding a good classifier is a multiobjective optimization problem with different error rates and the costs to be minimized. The receiver operating characteristic is widely used in the machine learning community to analyze the performance of parametric classifiers or sets of Pareto optimal classifiers. In order to directly compare two sets of classifiers the area (or volume) under the convex hull can be used as a scalar indicator for the performance of a set of classifiers in receiver operating characteristic space. Recently, the convex hull based multiobjective genetic programming algorithm was proposed and successfully applied to maximize the convex hull area for binary classification problems. The contribution of this paper is to extend this algorithm for dealing with higher dimensional problem formulations. In particular, we discuss problems where parsimony (or classifier complexity) is stated as a third objective and multi-class classification with three different true classification rates to be maximized. The design of the algorithm proposed in this paper is inspired by indicator-based evolutionary algorithms, where first a performance indicator for a solution set is established and then a selection operator is designed that complies with the performance indicator. In this case, the performance indicator will be the volume under the convex hull. The algorithm is tested and analyzed in a proof of concept study on different benchmarks that are designed for measuring its capability to capture relevant parts of a convex hull. Further benchmark and application studies on email classification and feature selection round up the analysis and assess robustness and usefulness of the new algorithm in real world settings.