Abstract:Most existing Human-Object Interaction~(HOI) Detection methods rely heavily on full annotations with predefined HOI categories, which is limited in diversity and costly to scale further. We aim at advancing zero-shot HOI detection to detect both seen and unseen HOIs simultaneously. The fundamental challenges are to discover potential human-object pairs and identify novel HOI categories. To overcome the above challenges, we propose a novel end-to-end zero-shot HOI Detection (EoID) framework via vision-language knowledge distillation. We first design an Interactive Score module combined with a Two-stage Bipartite Matching algorithm to achieve interaction distinguishment for human-object pairs in an action-agnostic manner. Then we transfer the distribution of action probability from the pretrained vision-language teacher as well as the seen ground truth to the HOI model to attain zero-shot HOI classification. Extensive experiments on HICO-Det dataset demonstrate that our model discovers potential interactive pairs and enables the recognition of unseen HOIs. Finally, our method outperforms the previous SOTA by 8.92% on unseen mAP and 10.18% on overall mAP under UA setting, by 6.02% on unseen mAP and 9.1% on overall mAP under UC setting. Moreover, our method is generalizable to large-scale object detection data to further scale up the action sets. The source code will be available at: https://github.com/mrwu-mac/EoID.
Abstract:The rapidly decreasing computation and memory cost has recently driven the success of many applications in the field of deep learning. Practical applications of deep learning in resource-limited hardware, such as embedded devices and smart phones, however, remain challenging. For binary convolutional networks, the reason lies in the degraded representation caused by binarizing full-precision filters. To address this problem, we propose new circulant filters (CiFs) and a circulant binary convolution (CBConv) to enhance the capacity of binarized convolutional features via our circulant back propagation (CBP). The CiFs can be easily incorporated into existing deep convolutional neural networks (DCNNs), which leads to new Circulant Binary Convolutional Networks (CBCNs). Extensive experiments confirm that the performance gap between the 1-bit and full-precision DCNNs is minimized by increasing the filter diversity, which further increases the representational ability in our networks. Our experiments on ImageNet show that CBCNs achieve 61.4% top-1 accuracy with ResNet18. Compared to the state-of-the-art such as XNOR, CBCNs can achieve up to 10% higher top-1 accuracy with more powerful representational ability.
Abstract:Deep convolutional neural networks (DCNNs) have dominated the recent developments in computer vision through making various record-breaking models. However, it is still a great challenge to achieve powerful DCNNs in resource-limited environments, such as on embedded devices and smart phones. Researchers have realized that 1-bit CNNs can be one feasible solution to resolve the issue; however, they are baffled by the inferior performance compared to the full-precision DCNNs. In this paper, we propose a novel approach, called Bayesian optimized 1-bit CNNs (denoted as BONNs), taking the advantage of Bayesian learning, a well-established strategy for hard problems, to significantly improve the performance of extreme 1-bit CNNs. We incorporate the prior distributions of full-precision kernels and features into the Bayesian framework to construct 1-bit CNNs in an end-to-end manner, which have not been considered in any previous related methods. The Bayesian losses are achieved with a theoretical support to optimize the network simultaneously in both continuous and discrete spaces, aggregating different losses jointly to improve the model capacity. Extensive experiments on the ImageNet and CIFAR datasets show that BONNs achieve the best classification performance compared to state-of-the-art 1-bit CNNs.
Abstract:The advancement of deep convolutional neural networks (DCNNs) has driven significant improvement in the accuracy of recognition systems for many computer vision tasks. However, their practical applications are often restricted in resource-constrained environments. In this paper, we introduce projection convolutional neural networks (PCNNs) with a discrete back propagation via projection (DBPP) to improve the performance of binarized neural networks (BNNs). The contributions of our paper include: 1) for the first time, the projection function is exploited to efficiently solve the discrete back propagation problem, which leads to a new highly compressed CNNs (termed PCNNs); 2) by exploiting multiple projections, we learn a set of diverse quantized kernels that compress the full-precision kernels in a more efficient way than those proposed previously; 3) PCNNs achieve the best classification performance compared to other state-of-the-art BNNs on the ImageNet and CIFAR datasets.
Abstract:Compression artifacts reduction (CAR) is a challenging problem in the field of remote sensing. Most recent deep learning based methods have demonstrated superior performance over the previous hand-crafted methods. In this paper, we propose an end-to-end one-two-one (OTO) network, to combine different deep models, i.e., summation and difference models, to solve the CAR problem. Particularly, the difference model motivated by the Laplacian pyramid is designed to obtain the high frequency information, while the summation model aggregates the low frequency information. We provide an in-depth investigation into our OTO architecture based on the Taylor expansion, which shows that these two kinds of information can be fused in a nonlinear scheme to gain more capacity of handling complicated image compression artifacts, especially the blocking effect in compression. Extensive experiments are conducted to demonstrate the superior performance of the OTO networks, as compared to the state-of-the-arts on remote sensing datasets and other benchmark datasets.