Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jyh-Jing Hwang

S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Modelwith Spatio-Temporal Visual Representation

May 30, 2025

Yichen Xie, Runsheng Xu, Tong He, Jyh-Jing Hwang, Katie Luo, Jingwei Ji, Hubert Lin, Letian Chen, Yiren Lu, Zhaoqi Leng(+2 more)

Abstract:The latest advancements in multi-modal large language models (MLLMs) have spurred a strong renewed interest in end-to-end motion planning approaches for autonomous driving. Many end-to-end approaches rely on human annotations to learn intermediate perception and prediction tasks, while purely self-supervised approaches--which directly learn from sensor inputs to generate planning trajectories without human annotations often underperform the state of the art. We observe a key gap in the input representation space: end-to-end approaches built on MLLMs are often pretrained with reasoning tasks in 2D image space rather than the native 3D space in which autonomous vehicles plan. To this end, we propose S4-Driver, a scalable self-supervised motion planning algorithm with spatio-temporal visual representation, based on the popular PaLI multimodal large language model. S4-Driver uses a novel sparse volume strategy to seamlessly transform the strong visual representation of MLLMs from perspective view to 3D space without the need to finetune the vision encoder. This representation aggregates multi-view and multi-frame visual inputs and enables better prediction of planning trajectories in 3D space. To validate our method, we run experiments on both nuScenes and Waymo Open Motion Dataset (with in-house camera data). Results show that S4-Driver performs favorably against existing supervised multi-task approaches while requiring no human annotations. It also demonstrates great scalability when pretrained on large volumes of unannotated driving logs.

* Accepted by CVPR2025

Via

Access Paper or Ask Questions

EMMA: End-to-End Multimodal Model for Autonomous Driving

Oct 30, 2024

Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp(+3 more)

Figure 1 for EMMA: End-to-End Multimodal Model for Autonomous Driving

Figure 2 for EMMA: End-to-End Multimodal Model for Autonomous Driving

Figure 3 for EMMA: End-to-End Multimodal Model for Autonomous Driving

Figure 4 for EMMA: End-to-End Multimodal Model for Autonomous Driving

Abstract:We introduce EMMA, an End-to-end Multimodal Model for Autonomous driving. Built on a multi-modal large language model foundation, EMMA directly maps raw camera sensor data into various driving-specific outputs, including planner trajectories, perception objects, and road graph elements. EMMA maximizes the utility of world knowledge from the pre-trained large language models, by representing all non-sensor inputs (e.g. navigation instructions and ego vehicle status) and outputs (e.g. trajectories and 3D locations) as natural language text. This approach allows EMMA to jointly process various driving tasks in a unified language space, and generate the outputs for each task using task-specific prompts. Empirically, we demonstrate EMMA's effectiveness by achieving state-of-the-art performance in motion planning on nuScenes as well as competitive results on the Waymo Open Motion Dataset (WOMD). EMMA also yields competitive results for camera-primary 3D object detection on the Waymo Open Dataset (WOD). We show that co-training EMMA with planner trajectories, object detection, and road graph tasks yields improvements across all three domains, highlighting EMMA's potential as a generalist model for autonomous driving applications. However, EMMA also exhibits certain limitations: it can process only a small amount of image frames, does not incorporate accurate 3D sensing modalities like LiDAR or radar and is computationally expensive. We hope that our results will inspire further research to mitigate these issues and to further evolve the state of the art in autonomous driving model architectures.

* Blog post: https://waymo.com/blog/2024/10/introducing-emma/

Via

Access Paper or Ask Questions

CramNet: Camera-Radar Fusion with Ray-Constrained Cross-Attention for Robust 3D Object Detection

Oct 18, 2022

Jyh-Jing Hwang, Henrik Kretzschmar, Joshua Manela, Sean Rafferty, Nicholas Armstrong-Crews, Tiffany Chen, Dragomir Anguelov

Figure 1 for CramNet: Camera-Radar Fusion with Ray-Constrained Cross-Attention for Robust 3D Object Detection

Figure 2 for CramNet: Camera-Radar Fusion with Ray-Constrained Cross-Attention for Robust 3D Object Detection

Figure 3 for CramNet: Camera-Radar Fusion with Ray-Constrained Cross-Attention for Robust 3D Object Detection

Figure 4 for CramNet: Camera-Radar Fusion with Ray-Constrained Cross-Attention for Robust 3D Object Detection

Abstract:Robust 3D object detection is critical for safe autonomous driving. Camera and radar sensors are synergistic as they capture complementary information and work well under different environmental conditions. Fusing camera and radar data is challenging, however, as each of the sensors lacks information along a perpendicular axis, that is, depth is unknown to camera and elevation is unknown to radar. We propose the camera-radar matching network CramNet, an efficient approach to fuse the sensor readings from camera and radar in a joint 3D space. To leverage radar range measurements for better camera depth predictions, we propose a novel ray-constrained cross-attention mechanism that resolves the ambiguity in the geometric correspondences between camera features and radar features. Our method supports training with sensor modality dropout, which leads to robust 3D object detection, even when a camera or radar sensor suddenly malfunctions on a vehicle. We demonstrate the effectiveness of our fusion approach through extensive experiments on the RADIATE dataset, one of the few large-scale datasets that provide radar radio frequency imagery. A camera-only variant of our method achieves competitive performance in monocular 3D object detection on the Waymo Open Dataset.

* ECCV 2022

Via

Access Paper or Ask Questions

CAST: Concurrent Recognition and Segmentation with Adaptive Segment Tokens

Oct 04, 2022

Tsung-Wei Ke, Jyh-Jing Hwang, Stella X. Yu

Figure 1 for CAST: Concurrent Recognition and Segmentation with Adaptive Segment Tokens

Figure 2 for CAST: Concurrent Recognition and Segmentation with Adaptive Segment Tokens

Figure 3 for CAST: Concurrent Recognition and Segmentation with Adaptive Segment Tokens

Figure 4 for CAST: Concurrent Recognition and Segmentation with Adaptive Segment Tokens

Abstract:Recognizing an image and segmenting it into coherent regions are often treated as separate tasks. Human vision, however, has a general sense of segmentation hierarchy before recognition occurs. We are thus inspired to learn image recognition with hierarchical image segmentation based entirely on unlabeled images. Our insight is to learn fine-to-coarse features concurrently at superpixels, segments, and full image levels, enforcing consistency and goodness of feature induced segmentations while maximizing discrimination among image instances. Our model innovates vision transformers on three aspects. 1) We use adaptive segment tokens instead of fixed-shape patch tokens. 2) We create a token hierarchy by inserting graph pooling between transformer blocks, naturally producing consistent multi-scale segmentations while increasing the segment size and reducing the number of tokens. 3) We produce hierarchical image segmentation for free while training for recognition by maximizing image-wise discrimination. Our work delivers the first concurrent recognition and hierarchical segmentation model without any supervision. Validated on ImageNet and PASCAL VOC, it achieves better recognition and segmentation with higher computational efficiency.

Via

Access Paper or Ask Questions

LET-3D-AP: Longitudinal Error Tolerant 3D Average Precision for Camera-Only 3D Detection

Jun 15, 2022

Wei-Chih Hung, Henrik Kretzschmar, Vincent Casser, Jyh-Jing Hwang, Dragomir Anguelov

Figure 1 for LET-3D-AP: Longitudinal Error Tolerant 3D Average Precision for Camera-Only 3D Detection

Figure 2 for LET-3D-AP: Longitudinal Error Tolerant 3D Average Precision for Camera-Only 3D Detection

Figure 3 for LET-3D-AP: Longitudinal Error Tolerant 3D Average Precision for Camera-Only 3D Detection

Figure 4 for LET-3D-AP: Longitudinal Error Tolerant 3D Average Precision for Camera-Only 3D Detection

Abstract:The popular object detection metric 3D Average Precision (3D AP) relies on the intersection over union between predicted bounding boxes and ground truth bounding boxes. However, depth estimation based on cameras has limited accuracy, which may cause otherwise reasonable predictions that suffer from such longitudinal localization errors to be treated as false positives and false negatives. We therefore propose variants of the popular 3D AP metric that are designed to be more permissive with respect to depth estimation errors. Specifically, our novel longitudinal error tolerant metrics, LET-3D-AP and LET-3D-APL, allow longitudinal localization errors of the predicted bounding boxes up to a given tolerance. The proposed metrics have been used in the Waymo Open Dataset 3D Camera-Only Detection Challenge. We believe that they will facilitate advances in the field of camera-only 3D detection by providing more informative performance signals.

* Find the primary metrics for the 2022 Waymo Open Dataset 3D Camera-Only Detection Challenge at https://waymo.com/open/challenges/2022/3d-camera-only-detection/ . Find the code at https://github.com/waymo-research/waymo-open-dataset

Via

Access Paper or Ask Questions

Unsupervised Hierarchical Semantic Segmentation with Multiview Cosegmentation and Clustering Transformers

Apr 25, 2022

Tsung-Wei Ke, Jyh-Jing Hwang, Yunhui Guo, Xudong Wang, Stella X. Yu

Figure 1 for Unsupervised Hierarchical Semantic Segmentation with Multiview Cosegmentation and Clustering Transformers

Figure 2 for Unsupervised Hierarchical Semantic Segmentation with Multiview Cosegmentation and Clustering Transformers

Figure 3 for Unsupervised Hierarchical Semantic Segmentation with Multiview Cosegmentation and Clustering Transformers

Figure 4 for Unsupervised Hierarchical Semantic Segmentation with Multiview Cosegmentation and Clustering Transformers

Abstract:Unsupervised semantic segmentation aims to discover groupings within and across images that capture object and view-invariance of a category without external supervision. Grouping naturally has levels of granularity, creating ambiguity in unsupervised segmentation. Existing methods avoid this ambiguity and treat it as a factor outside modeling, whereas we embrace it and desire hierarchical grouping consistency for unsupervised segmentation. We approach unsupervised segmentation as a pixel-wise feature learning problem. Our idea is that a good representation shall reveal not just a particular level of grouping, but any level of grouping in a consistent and predictable manner. We enforce spatial consistency of grouping and bootstrap feature learning with co-segmentation among multiple views of the same image, and enforce semantic consistency across the grouping hierarchy with clustering transformers between coarse- and fine-grained features. We deliver the first data-driven unsupervised hierarchical semantic segmentation method called Hierarchical Segment Grouping (HSG). Capturing visual similarity and statistical co-occurrences, HSG also outperforms existing unsupervised segmentation methods by a large margin on five major object- and scene-centric benchmarks. Our code is publicly available at https://github.com/twke18/HSG .

* In CVPR 2022. Webpage & Code: https://twke18.github.io/projects/hsg.html

Via

Access Paper or Ask Questions

Universal Weakly Supervised Segmentation by Pixel-to-Segment Contrastive Learning

May 11, 2021

Tsung-Wei Ke, Jyh-Jing Hwang, Stella X. Yu

Figure 1 for Universal Weakly Supervised Segmentation by Pixel-to-Segment Contrastive Learning

Figure 2 for Universal Weakly Supervised Segmentation by Pixel-to-Segment Contrastive Learning

Figure 3 for Universal Weakly Supervised Segmentation by Pixel-to-Segment Contrastive Learning

Figure 4 for Universal Weakly Supervised Segmentation by Pixel-to-Segment Contrastive Learning

Abstract:Weakly supervised segmentation requires assigning a label to every pixel based on training instances with partial annotations such as image-level tags, object bounding boxes, labeled points and scribbles. This task is challenging, as coarse annotations (tags, boxes) lack precise pixel localization whereas sparse annotations (points, scribbles) lack broad region coverage. Existing methods tackle these two types of weak supervision differently: Class activation maps are used to localize coarse labels and iteratively refine the segmentation model, whereas conditional random fields are used to propagate sparse labels to the entire image. We formulate weakly supervised segmentation as a semi-supervised metric learning problem, where pixels of the same (different) semantics need to be mapped to the same (distinctive) features. We propose 4 types of contrastive relationships between pixels and segments in the feature space, capturing low-level image similarity, semantic annotation, co-occurrence, and feature affinity They act as priors; the pixel-wise feature can be learned from training images with any partial annotations in a data-driven fashion. In particular, unlabeled pixels in training images participate not only in data-driven grouping within each image, but also in discriminative feature learning within and across images. We deliver a universal weakly supervised segmenter with significant gains on Pascal VOC and DensePose. Our code is publicly available at https://github.com/twke18/SPML.

* In ICLR 2021. Webpage & Code: https://twke18.github.io/projects/spml.html

Via

Access Paper or Ask Questions

SegSort: Segmentation by Discriminative Sorting of Segments

Oct 30, 2019

Jyh-Jing Hwang, Stella X. Yu, Jianbo Shi, Maxwell D. Collins, Tien-Ju Yang, Xiao Zhang, Liang-Chieh Chen

Figure 1 for SegSort: Segmentation by Discriminative Sorting of Segments

Figure 2 for SegSort: Segmentation by Discriminative Sorting of Segments

Figure 3 for SegSort: Segmentation by Discriminative Sorting of Segments

Figure 4 for SegSort: Segmentation by Discriminative Sorting of Segments

Abstract:Almost all existing deep learning approaches for semantic segmentation tackle this task as a pixel-wise classification problem. Yet humans understand a scene not in terms of pixels, but by decomposing it into perceptual groups and structures that are the basic building blocks of recognition. This motivates us to propose an end-to-end pixel-wise metric learning approach that mimics this process. In our approach, the optimal visual representation determines the right segmentation within individual images and associates segments with the same semantic classes across images. The core visual learning problem is therefore to maximize the similarity within segments and minimize the similarity between segments. Given a model trained this way, inference is performed consistently by extracting pixel-wise embeddings and clustering, with the semantic label determined by the majority vote of its nearest neighbors from an annotated set. As a result, we present the SegSort, as a first attempt using deep learning for unsupervised semantic segmentation, achieving $76\%$ performance of its supervised counterpart. When supervision is available, SegSort shows consistent improvements over conventional approaches based on pixel-wise softmax training. Additionally, our approach produces more precise boundaries and consistent region predictions. The proposed SegSort further produces an interpretable result, as each choice of label can be easily understood from the retrieved nearest segments.

* In ICCV 2019. Webpage & Code: https://jyhjinghwang.github.io/projects/segsort.html

Via

Access Paper or Ask Questions

DeeperLab: Single-Shot Image Parser

Mar 12, 2019

Tien-Ju Yang, Maxwell D. Collins, Yukun Zhu, Jyh-Jing Hwang, Ting Liu, Xiao Zhang, Vivienne Sze, George Papandreou, Liang-Chieh Chen

Figure 1 for DeeperLab: Single-Shot Image Parser

Figure 2 for DeeperLab: Single-Shot Image Parser

Figure 3 for DeeperLab: Single-Shot Image Parser

Figure 4 for DeeperLab: Single-Shot Image Parser

Abstract:We present a single-shot, bottom-up approach for whole image parsing. Whole image parsing, also known as Panoptic Segmentation, generalizes the tasks of semantic segmentation for 'stuff' classes and instance segmentation for 'thing' classes, assigning both semantic and instance labels to every pixel in an image. Recent approaches to whole image parsing typically employ separate standalone modules for the constituent semantic and instance segmentation tasks and require multiple passes of inference. Instead, the proposed DeeperLab image parser performs whole image parsing with a significantly simpler, fully convolutional approach that jointly addresses the semantic and instance segmentation tasks in a single-shot manner, resulting in a streamlined system that better lends itself to fast processing. For quantitative evaluation, we use both the instance-based Panoptic Quality (PQ) metric and the proposed region-based Parsing Covering (PC) metric, which better captures the image parsing quality on 'stuff' classes and larger object instances. We report experimental results on the challenging Mapillary Vistas dataset, in which our single model achieves 31.95% (val) / 31.6% PQ (test) and 55.26% PC (val) with 3 frames per second (fps) on GPU or near real-time speed (22.6 fps on GPU) with reduced accuracy.

* 20 pages. The code of the proposed Parsing Covering metric is available at http://deeperlab.mit.edu

Via

Access Paper or Ask Questions

Adaptive Affinity Fields for Semantic Segmentation

Aug 21, 2018

Tsung-Wei Ke, Jyh-Jing Hwang, Ziwei Liu, Stella X. Yu

Figure 1 for Adaptive Affinity Fields for Semantic Segmentation

Figure 2 for Adaptive Affinity Fields for Semantic Segmentation

Figure 3 for Adaptive Affinity Fields for Semantic Segmentation

Figure 4 for Adaptive Affinity Fields for Semantic Segmentation

Abstract:Semantic segmentation has made much progress with increasingly powerful pixel-wise classifiers and incorporating structural priors via Conditional Random Fields (CRF) or Generative Adversarial Networks (GAN). We propose a simpler alternative that learns to verify the spatial structure of segmentation during training only. Unlike existing approaches that enforce semantic labels on individual pixels and match labels between neighbouring pixels, we propose the concept of Adaptive Affinity Fields (AAF) to capture and match the semantic relations between neighbouring pixels in the label space. We use adversarial learning to select the optimal affinity field size for each semantic category. It is formulated as a minimax problem, optimizing our segmentation neural network in a best worst-case learning scenario. AAF is versatile for representing structures as a collection of pixel-centric relations, easier to train than GAN and more efficient than CRF without run-time inference. Our extensive evaluations on PASCAL VOC 2012, Cityscapes, and GTA5 datasets demonstrate its above-par segmentation performance and robust generalization across domains.

* To appear in European Conference on Computer Vision (ECCV) 2018

Via

Access Paper or Ask Questions