Alexey Bochkovskiy

YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

Jul 06, 2022

Chien-Yao Wang, Alexey Bochkovskiy, Hong-Yuan Mark Liao

Abstract: YOLOv7 surpasses all known object detectors in both speed and accuracy in the range from 5 FPS to 160 FPS and has the highest accuracy, 56.8% AP, among all known real-time object detectors running at 30 FPS or higher on a V100 GPU. The YOLOv7-E6 object detector (56 FPS on V100, 55.9% AP) outperforms both the transformer-based detector SWIN-L Cascade-Mask R-CNN (9.2 FPS on A100, 53.9% AP) by 509% in speed and 2% AP in accuracy, and the convolution-based detector ConvNeXt-XL Cascade-Mask R-CNN (8.6 FPS on A100, 55.2% AP) by 551% in speed and 0.7% AP in accuracy. YOLOv7 also outperforms YOLOR, YOLOX, Scaled-YOLOv4, YOLOv5, DETR, Deformable DETR, DINO-5scale-R50, ViT-Adapter-B, and many other object detectors in speed and accuracy. Moreover, we train YOLOv7 only on the MS COCO dataset from scratch, without using any other datasets or pre-trained weights. Source code is released at https://github.com/WongKinYiu/yolov7.
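
The relative figures quoted in this abstract can be checked with a little arithmetic. The sketch below is not taken from the paper or its repository. It assumes the speed advantage is computed as (FPS_new - FPS_baseline) / FPS_baseline and the accuracy advantage as a difference in AP points; the helper names are hypothetical. Note that the FPS values being compared were measured on different GPUs (V100 versus A100), as the abstract itself states.

```python
def relative_speedup_percent(fps_new: float, fps_baseline: float) -> float:
    """Speed advantage of fps_new over fps_baseline, in percent (assumed definition)."""
    return (fps_new - fps_baseline) / fps_baseline * 100.0


def ap_gain_points(ap_new: float, ap_baseline: float) -> float:
    """Accuracy advantage as a plain difference in AP percentage points."""
    return ap_new - ap_baseline


# Numbers copied from the abstract.
yolov7_e6 = {"fps": 56.0, "ap": 55.9}
baselines = {
    "SWIN-L Cascade-Mask R-CNN": {"fps": 9.2, "ap": 53.9},
    "ConvNeXt-XL Cascade-Mask R-CNN": {"fps": 8.6, "ap": 55.2},
}

for name, base in baselines.items():
    speed = relative_speedup_percent(yolov7_e6["fps"], base["fps"])
    gain = ap_gain_points(yolov7_e6["ap"], base["ap"])
    print(f"vs {name}: {speed:.0f}% faster, {gain:.1f} AP points higher")
```

Under these assumptions the computed values round to the 509% and 551% speed figures and the 2.0 and 0.7 AP differences given in the abstract.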

Non-deep Networks

Oct 14, 2021

Ankit Goyal, Alexey Bochkovskiy, Jia Deng, Vladlen Koltun

Abstract: Depth is the hallmark of deep neural networks. But more depth means more sequential computation and higher latency. This raises the question: is it possible to build high-performing "non-deep" neural networks? We show that it is. To do so, we use parallel subnetworks instead of stacking one layer after another. This effectively reduces depth while maintaining high performance. By utilizing parallel substructures, we show, for the first time, that a network with a depth of just 12 can achieve top-1 accuracy over 80% on ImageNet, 96% on CIFAR10, and 81% on CIFAR100. We also show that a network with a low-depth (12) backbone can achieve an AP of 48% on MS-COCO. We analyze the scaling rules for our design and show how to increase performance without changing the network's depth. Finally, we provide a proof of concept for how non-deep networks could be used to build low-latency recognition systems. Code is available at https://github.com/imankgoyal/NonDeepNetworks.
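
The idea that parallel subnetworks reduce depth can be illustrated with a toy model. The sketch below is not the architecture from the paper. It assumes "depth" means the longest chain of layers that must run one after another, and the classes `Layer`, `Sequential`, and `Parallel` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Layer:
    name: str


@dataclass
class Sequential:
    parts: List["Node"]      # parts run one after another


@dataclass
class Parallel:
    branches: List["Node"]   # branches run side by side


Node = Union[Layer, Sequential, Parallel]


def depth(node: Node) -> int:
    """Longest chain of layers that must be evaluated sequentially."""
    if isinstance(node, Layer):
        return 1
    if isinstance(node, Sequential):
        return sum(depth(p) for p in node.parts)
    return max(depth(b) for b in node.branches)


def num_layers(node: Node) -> int:
    """Total number of layers, regardless of arrangement."""
    if isinstance(node, Layer):
        return 1
    children = node.parts if isinstance(node, Sequential) else node.branches
    return sum(num_layers(c) for c in children)


def block(tag: str, n: int) -> Sequential:
    """A chain of n layers (hypothetical helper)."""
    return Sequential([Layer(f"{tag}{i}") for i in range(n)])


# The same 15 layers, arranged two ways.
stacked = Sequential([block("a", 5), block("b", 5), block("c", 5)])
parallel = Parallel([block("a", 5), block("b", 5), block("c", 5)])

print(num_layers(stacked), depth(stacked))
print(num_layers(parallel), depth(parallel))
```

Both arrangements contain the same number of layers, but in this toy model the parallel one has one third of the sequential depth, which is the property the abstract links to lower latency.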

Vision Transformers for Dense Prediction

Mar 24, 2021

René Ranftl, Alexey Bochkovskiy, Vladlen Koltun

Abstract: We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions than fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance compared with a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context, where it also sets the new state of the art. Our models are available at https://github.com/intel-isl/DPT.
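
The step of assembling tokens into an image-like representation can be pictured as a reshape. The sketch below is a simplified illustration, not the code from the DPT repository. It assumes one token per image patch, stored in row-major patch order, with any extra non-patch token already dropped or merged; the function name `tokens_to_feature_map` is hypothetical.

```python
from typing import List

Token = List[float]                   # one embedding vector of length D
FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def tokens_to_feature_map(tokens: List[Token], grid_h: int, grid_w: int) -> FeatureMap:
    """Rearrange grid_h * grid_w tokens of size D into a D x grid_h x grid_w map.

    Assumes tokens are listed in row-major patch order (an assumption of this sketch).
    """
    if not tokens or len(tokens) != grid_h * grid_w:
        raise ValueError("expected exactly grid_h * grid_w tokens")
    dim = len(tokens[0])
    return [
        [[tokens[r * grid_w + c][d] for c in range(grid_w)] for r in range(grid_h)]
        for d in range(dim)
    ]


# Six 2-dimensional tokens from a 2 x 3 patch grid; token i is [i, 10 * i].
tokens = [[float(i), float(10 * i)] for i in range(6)]
fmap = tokens_to_feature_map(tokens, grid_h=2, grid_w=3)
print(fmap[0])  # channel 0 laid out on the patch grid
print(fmap[1])  # channel 1 laid out on the patch grid
```

Each channel of the result is a small image whose pixel at (row, col) comes from the token of the patch at that grid position, which is the kind of layout a convolutional decoder can consume.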

* 15 pages

Scaled-YOLOv4: Scaling Cross Stage Partial Network

Nov 16, 2020

Chien-Yao Wang, Alexey Bochkovskiy, Hong-Yuan Mark Liao

Abstract: We show that the YOLOv4 object detection neural network based on the CSP approach scales both up and down and is applicable to small and large networks while maintaining optimal speed and accuracy. We propose a network scaling approach that modifies not only the depth, width, and resolution, but also the structure of the network. The YOLOv4-large model achieves state-of-the-art results: 55.4% AP (73.3% AP50) for the MS COCO dataset at a speed of 15 FPS on Tesla V100, while with test-time augmentation YOLOv4-large achieves 55.8% AP (73.2% AP50). To the best of our knowledge, this is currently the highest accuracy on the COCO dataset among any published work. The YOLOv4-tiny model achieves 22.0% AP (42.0% AP50) at a speed of 443 FPS on RTX 2080Ti, while with TensorRT, batch size = 4, and FP16 precision, YOLOv4-tiny achieves 1774 FPS.

YOLOv4: Optimal Speed and Accuracy of Object Detection

Apr 23, 2020

Alexey Bochkovskiy, Chien-Yao Wang, Hong-Yuan Mark Liao

Abstract: There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch-normalization and residual-connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT), and Mish-activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) for the MS COCO dataset at a real-time speed of ~65 FPS on Tesla V100. Source code is at https://github.com/AlexeyAB/darknet.
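
Two of the listed features, Mish activation and CIoU loss, have compact closed forms. The sketch below follows the commonly cited definitions, Mish(x) = x * tanh(softplus(x)) and CIoU loss = 1 - IoU + rho^2 / c^2 + alpha * v, and is not taken from the darknet source. It assumes boxes are given as (x1, y1, x2, y2) with positive width and height, and the small `eps` terms are an implementation convenience added here to avoid division by zero.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x)), with a numerically stable softplus."""
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return x * math.tanh(softplus)


def ciou_loss(box: Box, gt: Box, eps: float = 1e-9) -> float:
    """Complete-IoU loss between a predicted box and a ground-truth box."""
    bx1, by1, bx2, by2 = box
    gx1, gy1, gx2, gy2 = gt
    bw, bh = bx2 - bx1, by2 - by1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Intersection over union.
    inter_w = max(0.0, min(bx2, gx2) - max(bx1, gx1))
    inter_h = max(0.0, min(by2, gy2) - max(by1, gy1))
    inter = inter_w * inter_h
    union = bw * bh + gw * gh - inter
    iou = inter / (union + eps)

    # Squared distance between box centers (rho^2).
    rho2 = ((bx1 + bx2) - (gx1 + gx2)) ** 2 / 4.0 + ((by1 + by2) - (gy1 + gy2)) ** 2 / 4.0

    # Squared diagonal of the smallest box enclosing both (c^2).
    cw = max(bx2, gx2) - min(bx1, gx1)
    ch = max(by2, gy2) - min(by1, gy1)
    c2 = cw ** 2 + ch ** 2

    # Aspect-ratio consistency term v and its trade-off weight alpha.
    v = (4.0 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(bw / (bh + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)

    return 1.0 - iou + rho2 / (c2 + eps) + alpha * v


print(mish(1.0))
print(ciou_loss((0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)))  # identical boxes
print(ciou_loss((0.0, 0.0, 1.0, 1.0), (2.0, 2.0, 3.0, 3.0)))  # disjoint boxes
```

Unlike a plain IoU loss, the center-distance term keeps the loss informative even when the boxes do not overlap, so the disjoint pair above is penalized more than an IoU of zero alone would indicate.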
