Joseph Redmon

OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation

Nov 17, 2025

Henry Herzog, Favyen Bastani, Yawen Zhang, Gabriel Tseng, Joseph Redmon, Hadrien Sablon, Ryan Park, Jacob Morrison, Alexandra Buraczynski, Karen Farley(+16 more)

Abstract:Earth observation data presents a unique challenge: it is spatial like images, sequential like video or text, and highly multimodal. We present OlmoEarth: a multimodal, spatio-temporal foundation model that employs a novel self-supervised learning formulation, masking strategy, and loss all designed for the Earth observation domain. OlmoEarth achieves state-of-the-art performance compared to 12 other foundation models across a variety of research benchmarks and real-world tasks from external partners. When evaluating embeddings OlmoEarth achieves the best performance on 15 out of 24 tasks, and with full fine-tuning it is the best on 19 of 29 tasks. We deploy OlmoEarth as the backbone of an end-to-end platform for data collection, labeling, training, and inference of Earth observation models. The OlmoEarth Platform puts frontier foundation models and powerful data management tools into the hands of non-profits and NGOs working to solve the world's biggest problems. OlmoEarth source code, training data, and pre-trained weights are available at https://github.com/allenai/olmoearth_pretrain.

IQA: Visual Question Answering in Interactive Environments

Sep 06, 2018

Daniel Gordon, Aniruddha Kembhavi, Mohammad Rastegari, Joseph Redmon, Dieter Fox, Ali Farhadi

Abstract:We introduce Interactive Question Answering (IQA), the task of answering questions that require an autonomous agent to interact with a dynamic visual environment. IQA presents the agent with a scene and a question, like: "Are there any apples in the fridge?" The agent must navigate around the scene, acquire visual understanding of scene elements, interact with objects (e.g. open refrigerators) and plan for a series of actions conditioned on the question. Popular reinforcement learning approaches with a single controller perform poorly on IQA owing to the large and diverse state space. We propose the Hierarchical Interactive Memory Network (HIMN), consisting of a factorized set of controllers, allowing the system to operate at multiple levels of temporal abstraction. To evaluate HIMN, we introduce IQUAD V1, a new dataset built upon AI2-THOR, a simulated photo-realistic environment of configurable indoor scenes with interactive objects (code and dataset available at https://github.com/danielgordon10/thor-iqa-cvpr-2018). IQUAD V1 has 75,000 questions, each paired with a unique scene configuration. Our experiments show that our proposed model outperforms popular single controller based methods on IQUAD V1. For sample questions and results, please view our video: https://youtu.be/pXd3C-1jr98

* Published in CVPR 2018

Who Let The Dogs Out? Modeling Dog Behavior From Visual Data

May 17, 2018

Kiana Ehsani, Hessam Bagherinezhad, Joseph Redmon, Roozbeh Mottaghi, Ali Farhadi

Abstract:We introduce the task of directly modeling a visually intelligent agent. Computer vision typically focuses on solving various subtasks related to visual intelligence. We depart from this standard approach to computer vision; instead we directly model a visually intelligent agent. Our model takes visual information as input and directly predicts the actions of the agent. Toward this end we introduce DECADE, a large-scale dataset of ego-centric videos from a dog's perspective as well as her corresponding movements. Using this data we model how the dog acts and how the dog plans her movements. We show under a variety of metrics that given just visual input we can successfully model this intelligent agent in many situations. Moreover, the representation learned by our model encodes distinct information compared to representations trained on image classification, and our learned representation can generalize to other domains. In particular, we show strong results on the task of walkable surface estimation by using this dog modeling task as representation learning.

* Accepted to CVPR18

YOLOv3: An Incremental Improvement

Apr 08, 2018

Joseph Redmon, Ali Farhadi

Abstract:We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that's pretty swell. It's a little bigger than last time but more accurate. It's still fast though, don't worry. At 320x320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 mAP@50 in 51 ms on a Titan X, compared to 57.5 mAP@50 in 198 ms by RetinaNet, similar performance but 3.8x faster. As always, all the code is online at https://pjreddie.com/yolo/

* Tech Report
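
The YOLOv3 abstract quotes mAP@50, where a predicted box is typically counted as correct if its intersection over union (IoU) with a ground-truth box is at least 0.5. The following is a minimal sketch of that overlap test, not the authors' evaluation code. The `(x1, y1, x2, y2)` box layout and the function names `iou` and `is_match_at_50` are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2 (an assumed layout).
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; clamp to zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


def is_match_at_50(pred_box, truth_box, threshold=0.5):
    """True when the prediction overlaps the ground truth enough to count at IoU 0.5."""
    return iou(pred_box, truth_box) >= threshold
```

A full mAP computation also ranks detections by confidence and averages precision over classes; this sketch covers only the per-box overlap criterion.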

YOLO9000: Better, Faster, Stronger

Dec 25, 2016

Joseph Redmon, Ali Farhadi

Abstract:We introduce YOLO9000, a state-of-the-art, real-time object detection system that can detect over 9000 object categories. First we propose various improvements to the YOLO detection method, both novel and drawn from prior work. The improved model, YOLOv2, is state-of-the-art on standard detection tasks like PASCAL VOC and COCO. At 67 FPS, YOLOv2 gets 76.8 mAP on VOC 2007. At 40 FPS, YOLOv2 gets 78.6 mAP, outperforming state-of-the-art methods like Faster RCNN with ResNet and SSD while still running significantly faster. Finally we propose a method to jointly train on object detection and classification. Using this method we train YOLO9000 simultaneously on the COCO detection dataset and the ImageNet classification dataset. Our joint training allows YOLO9000 to predict detections for object classes that don't have labelled detection data. We validate our approach on the ImageNet detection task. YOLO9000 gets 19.7 mAP on the ImageNet detection validation set despite only having detection data for 44 of the 200 classes. On the 156 classes not in COCO, YOLO9000 gets 16.0 mAP. But YOLO can detect more than just 200 classes; it predicts detections for more than 9000 different object categories. And it still runs in real-time.

XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks

Aug 02, 2016

Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi

Abstract:We propose two efficient approximations to standard convolutional neural networks: Binary-Weight-Networks and XNOR-Networks. In Binary-Weight-Networks, the filters are approximated with binary values resulting in 32x memory saving. In XNOR-Networks, both the filters and the input to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations. This results in 58x faster convolutional operations and 32x memory savings. XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real-time. Our binary networks are simple, accurate, efficient, and work on challenging visual tasks. We evaluate our approach on the ImageNet classification task. The classification accuracy with a Binary-Weight-Network version of AlexNet is only 2.9% less than the full-precision AlexNet (in top-1 measure). We compare our method with recent network binarization methods, BinaryConnect and BinaryNets, and outperform these methods by large margins on ImageNet, more than 16% in top-1 accuracy.
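
To illustrate why binary filters and inputs allow convolutions to be approximated "using primarily binary operations", here is a minimal sketch of a dot product between two vectors of +1/-1 values computed with XNOR and a bit count. It is an illustration of the general idea only, not the authors' implementation, and it omits the scaling factors a real network would need. The helper names `pack_signs` and `xnor_dot` are hypothetical.

```python
def pack_signs(values):
    """Pack a sequence of +1/-1 values into an int; bit i is set when values[i] is +1."""
    bits = 0
    for i, v in enumerate(values):
        if v > 0:
            bits |= 1 << i
    return bits


def xnor_dot(a_bits, b_bits, n):
    """Dot product of two length-n +1/-1 vectors stored as packed bits.

    Positions where the bits agree contribute +1 and positions where they
    differ contribute -1, so the result is 2 * (number of agreements) - n.
    """
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask  # XNOR, restricted to the low n bits
    matches = bin(agree).count("1")    # population count
    return 2 * matches - n


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
packed = xnor_dot(pack_signs(a), pack_signs(b), len(a))
plain = sum(x * y for x, y in zip(a, b))
```

Here `packed` and `plain` are equal, and each value occupies one bit rather than a 32-bit float, which is consistent with the 32x memory saving quoted in the abstract.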

You Only Look Once: Unified, Real-Time Object Detection

May 09, 2016

Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi

Abstract:We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is far less likely to predict false detections where nothing exists. Finally, YOLO learns very general representations of objects. It outperforms all other detection methods, including DPM and R-CNN, by a wide margin when generalizing from natural images to artwork on both the Picasso Dataset and the People-Art Dataset.

Real-Time Grasp Detection Using Convolutional Neural Networks

Feb 28, 2015

Joseph Redmon, Anelia Angelova

Abstract:We present an accurate, real-time approach to robotic grasp detection based on convolutional neural networks. Our network performs single-stage regression to graspable bounding boxes without using standard sliding window or region proposal techniques. The model outperforms state-of-the-art approaches by 14 percentage points and runs at 13 frames per second on a GPU. Our network can simultaneously perform classification so that in a single step it recognizes the object and finds a good grasp rectangle. A modification to this model predicts multiple grasps per object by using a locally constrained prediction mechanism. The locally constrained model performs significantly better, especially on objects that can be grasped in a variety of ways.

* Accepted to ICRA 2015
