Nisarg Kothari

YouTube-8M: A Large-Scale Video Classification Benchmark

Sep 27, 2016

Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, Sudheendra Vijayanarasimhan

Abstract: Many recent advancements in Computer Vision are attributed to large datasets. Open-source software packages for Machine Learning and inexpensive commodity hardware have lowered the barrier to entry for exploring novel approaches at scale. It is possible to train models over millions of examples within a few days. Although large-scale datasets exist for image understanding, such as ImageNet, there are no video classification datasets of comparable size. In this paper, we introduce YouTube-8M, the largest multi-label video classification dataset, composed of ~8 million videos (500K hours of video), annotated with a vocabulary of 4800 visual entities. To get the videos and their labels, we used a YouTube video annotation system, which labels videos with their main topics. While the labels are machine-generated, they have high precision and are derived from a variety of human-based signals including metadata and query click signals. We filtered the video labels (Knowledge Graph entities) using both automated and manual curation strategies, including asking human raters if the labels are visually recognizable. Then, we decoded each video at one frame per second, and used a Deep CNN pre-trained on ImageNet to extract the hidden representation immediately prior to the classification layer. Finally, we compressed the frame features and made both the features and video-level labels available for download. We trained various (modest) classification models on the dataset, evaluated them using popular evaluation metrics, and report them as baselines. Despite the size of the dataset, some of our models train to convergence in less than a day on a single machine using TensorFlow. We plan to release code for training a TensorFlow model and for computing metrics.

* 10 pages
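
The abstract above describes decoding each video at one frame per second and extracting a feature vector per frame. As a rough illustration only, the sketch below picks one frame index per second and averages per-frame vectors into a single video-level vector. The function names `sample_frame_indices` and `mean_pool` are hypothetical, and mean pooling is an assumption made here for illustration; the abstract does not say how frame features are aggregated.

```python
from typing import List, Sequence


def sample_frame_indices(duration_s: float, native_fps: float) -> List[int]:
    """Return the index of the frame nearest to each whole second.

    Hypothetical helper: approximates "one frame per second" decoding
    for a video with a constant native frame rate.
    """
    n_seconds = int(duration_s)  # only whole seconds are sampled
    return [int(round(t * native_fps)) for t in range(n_seconds)]


def mean_pool(frame_features: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame feature vectors into one video-level vector.

    Assumption: mean pooling is used purely as an example aggregation.
    """
    if not frame_features:
        raise ValueError("need at least one frame feature vector")
    dim = len(frame_features[0])
    totals = [0.0] * dim
    for feat in frame_features:
        if len(feat) != dim:
            raise ValueError("all frame features must have the same length")
        for i, value in enumerate(feat):
            totals[i] += value
    return [total / len(frame_features) for total in totals]


if __name__ == "__main__":
    # A 3.5-second clip at 30 fps yields three sampled frames.
    print(sample_frame_indices(3.5, 30.0))
    # Two toy 2-dimensional "frame features" pooled into one vector.
    print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))
```

In practice the per-frame vectors would come from the pre-trained CNN mentioned in the abstract; the toy lists here only stand in for them.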

Pose Embeddings: A Deep Architecture for Learning to Match Human Poses

Jul 01, 2015

Greg Mori, Caroline Pantofaru, Nisarg Kothari, Thomas Leung, George Toderici, Alexander Toshev, Weilong Yang

Abstract: We present a method for learning an embedding that places images of humans in similar poses near one another. This embedding can be used to compare images directly by human pose, avoiding the potential challenges of estimating body joint positions. Pose embedding learning is formulated under a triplet-based distance criterion. A deep architecture is used to learn a representation capable of distinguishing between different poses. Experiments on human pose matching and retrieval from video data demonstrate the potential of the method.
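
The abstract does not give the exact form of its triplet-based distance criterion. A common formulation, assumed here for illustration, is a hinge loss that asks the anchor to be closer to a similar-pose image than to a dissimilar-pose image by at least a margin. The names `euclidean` and `triplet_loss`, the Euclidean distance, and the default margin are all assumptions, not details taken from the paper.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,  # assumed value, chosen only for the example
) -> float:
    """Hinge-style triplet loss (an assumed form of the criterion).

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin`; otherwise it grows linearly.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    anchor = [0.0, 0.0]
    # Similar pose is close, dissimilar pose is far: constraint satisfied.
    print(triplet_loss(anchor, [0.0, 1.0], [3.0, 4.0]))
    # Similar pose is far, dissimilar pose is close: constraint violated.
    print(triplet_loss(anchor, [3.0, 4.0], [0.0, 1.0]))
```

During training, the embeddings would be outputs of the deep network and the loss would be minimized over many such triplets; here the vectors are hand-written toy values.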
