Bryan Bo Cao

StatsMerging: Statistics-Guided Model Merging via Task-Specific Teacher Distillation

Jun 05, 2025

Ranjith Merugu, Bryan Bo Cao, Shubham Jain

Abstract: Model merging has emerged as a promising solution to accommodate multiple large models within constrained memory budgets. We present StatsMerging, a novel lightweight learning-based model merging method guided by weight distribution statistics without requiring ground truth labels or test samples. StatsMerging offers three key advantages: (1) It uniquely leverages singular values from singular value decomposition (SVD) to capture task-specific weight distributions, serving as a proxy for task importance to guide task coefficient prediction; (2) It employs a lightweight learner, StatsMergeLearner, to model the weight distributions of task-specific pre-trained models, improving generalization and enhancing adaptation to unseen samples; (3) It introduces Task-Specific Teacher Distillation for merging vision models with heterogeneous architectures, a merging learning paradigm that avoids costly ground-truth labels by task-specific teacher distillation. Notably, we present two types of knowledge distillation: (a) distilling knowledge from task-specific models to StatsMergeLearner; and (b) distilling knowledge from models with heterogeneous architectures prior to merging. Extensive experiments across eight tasks demonstrate the effectiveness of StatsMerging. Our results show that StatsMerging outperforms state-of-the-art techniques in terms of overall accuracy, generalization to unseen tasks, and robustness to image quality variations.

* 14 pages, 4 figures, 7 tables

Memory Proxy Maps for Visual Navigation

Nov 15, 2024

Faith Johnson, Bryan Bo Cao, Ashwin Ashok, Shubham Jain, Kristin Dana

Abstract: Visual navigation takes inspiration from humans, who navigate in previously unseen environments using vision without detailed environment maps. Inspired by this, we introduce a novel no-RL, no-graph, no-odometry approach to visual navigation using feudal learning to build a three-tiered agent. Key to our approach is a memory proxy map (MPM), an intermediate representation of the environment learned in a self-supervised manner by the high-level manager agent that serves as a simplified memory, approximating what the agent has seen. We demonstrate that recording observations in this learned latent space is an effective and efficient memory proxy that can remove the need for graphs and odometry in visual navigation tasks. For the mid-level manager agent, we develop a waypoint network (WayNet) that outputs intermediate subgoals, or waypoints, imitating human waypoint selection during local navigation. For the low-level worker agent, we learn a classifier over a discrete action space that avoids local obstacles and moves the agent towards the WayNet waypoint. The resulting feudal navigation network offers a novel approach with no RL, no graph, no odometry, and no metric map, all while achieving SOTA results on the image goal navigation task.

* arXiv admin note: substantial text overlap with arXiv:2402.12498

Few-Class Arena: A Benchmark for Efficient Selection of Vision Models and Dataset Difficulty Measurement

Nov 02, 2024

Bryan Bo Cao, Lawrence O'Gorman, Michael Coss, Shubham Jain

Abstract: We propose Few-Class Arena (FCA), a unified benchmark focused on testing efficient image classification models for few classes. A wide variety of benchmark datasets with many classes (80-1000) have been created to assist Computer Vision architectural evolution. An increasing number of vision models are evaluated with these many-class datasets. However, real-world applications often involve substantially fewer classes of interest (2-10). This gap between many and few classes makes it difficult to predict the performance of few-class applications using models trained on the available many-class datasets. To date, little has been offered to evaluate models in this Few-Class Regime. We conduct a systematic evaluation of the ResNet family trained on ImageNet subsets from 2 to 1000 classes, and test a wide spectrum of Convolutional Neural Networks and Transformer architectures over ten datasets by using our newly proposed FCA tool. Furthermore, to aid an up-front assessment of dataset difficulty and a more efficient selection of models, we incorporate a difficulty measure as a function of class similarity. FCA offers a new tool for efficient machine learning in the Few-Class Regime, with goals ranging from a new efficient class similarity proposal, to lightweight model architecture design, to a new scaling law. FCA is user-friendly and can be easily extended to new models and datasets, facilitating future research work. Our benchmark is available at https://github.com/fewclassarena/fca.

* 9 pages, 27 pages including References and Appendix, 20 figures, 5 tables

Representation Similarity: A Better Guidance of DNN Layer Sharing for Edge Computing without Training

Oct 15, 2024

Bryan Bo Cao, Abhinav Sharma, Manavjeet Singh, Anshul Gandhi, Samir Das, Shubham Jain

Abstract: Edge computing has emerged as an alternative to reduce transmission and processing delay and preserve the privacy of video streams. However, the ever-increasing complexity of Deep Neural Networks (DNNs) used in video-based applications (e.g. object detection) exerts pressure on memory-constrained edge devices. Model merging has been proposed to reduce the DNNs' memory footprint by keeping only one copy of merged layers' weights in memory. Existing model merging techniques (i) can share only architecturally identical layers; (ii) require computationally expensive retraining in the cloud; and (iii) assume the availability of ground truth for retraining. Re-evaluating a merged model's performance, however, requires a validation dataset with ground truth and typically runs in the cloud. Common metrics to guide the selection of shared layers include the size or computational cost of shared layers or representation size. We propose a new model merging scheme by sharing representations (i.e., outputs of layers) at the edge, guided by representation similarity S. We show that S is more highly correlated with the merged model's accuracy than other metrics, with a Pearson correlation coefficient |r| > 0.94, demonstrating that representation similarity can serve as a strong validation accuracy indicator without ground truth. We present our preliminary results of the newly proposed model merging scheme with identified challenges, demonstrating a promising future research direction.

* 3 pages, 4 figures, ACM MobiCom '24, November 18-22, 2024, Washington D.C., DC, USA
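
For readers unfamiliar with the statistic quoted in the abstract above, here is a minimal sketch of how a Pearson correlation coefficient between a similarity score and a merged model's accuracy could be computed. The function name `pearson_r` and the numbers are illustrative assumptions, not values or code from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: one (similarity, accuracy) pair per candidate shared layer.
similarity = [0.91, 0.84, 0.77, 0.62, 0.55]
accuracy = [0.88, 0.83, 0.74, 0.65, 0.52]

r = pearson_r(similarity, accuracy)
print(f"|r| = {abs(r):.3f}")
```

A value of |r| close to 1 means the similarity score ranks candidate layers almost the same way the measured accuracy does, which is why such a score could stand in for a ground-truth validation pass.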

A Lightweight Measure of Classification Difficulty from Application Dataset Characteristics

Apr 09, 2024

Bryan Bo Cao, Abhinav Sharma, Lawrence O'Gorman, Michael Coss, Shubham Jain

Abstract: Despite accuracy and computation benchmarks being widely available to help choose among neural network models, these are usually trained on datasets with many classes, and do not give a precise idea of performance for applications with few (< 10) classes. The conventional procedure to predict performance is to train and test repeatedly on the different models and dataset variations of interest. However, this is computationally expensive. We propose an efficient classification difficulty measure that is calculated from the number of classes and intra- and inter-class similarity metrics of the dataset. After a single stage of training and testing per model family, relative performance for different datasets and models of the same family can be predicted by comparing difficulty measures, without further training and testing. We show how this measure can help a practitioner select a computationally efficient model for a small dataset 6 to 29x faster than through repeated training and testing. We give an example of use of the measure for an industrial application in which options are identified to select a model 42% smaller than the baseline YOLOv5-nano model, and if class merging from 3 to 2 classes meets requirements, 85% smaller.

* 13 pages, 3 figures
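
The abstract above does not give the formula for the difficulty measure, so the sketch below only illustrates its two similarity inputs: mean intra-class and mean inter-class similarity. Cosine similarity over toy 2-D embeddings, and the names `cosine` and `class_similarities`, are assumptions made for illustration and are not taken from the paper.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_similarities(embeddings_by_class):
    """Return (mean intra-class, mean inter-class) cosine similarity."""
    labeled = [
        (label, vec)
        for label, vecs in embeddings_by_class.items()
        for vec in vecs
    ]
    intra, inter = [], []
    # Every unordered pair of samples is either same-class or cross-class.
    for (label_a, vec_a), (label_b, vec_b) in combinations(labeled, 2):
        bucket = intra if label_a == label_b else inter
        bucket.append(cosine(vec_a, vec_b))
    if not intra or not inter:
        raise ValueError("need >= 2 classes and >= 2 samples in some class")
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Hypothetical embeddings for a two-class dataset.
toy = {
    "cat": [[1.0, 0.1], [0.9, 0.2]],
    "dog": [[0.1, 1.0], [0.2, 0.9]],
}
intra_sim, inter_sim = class_similarities(toy)
print(f"intra = {intra_sim:.3f}, inter = {inter_sim:.3f}")
```

Intuitively, a dataset whose classes are tight (high intra-class similarity) and well separated (low inter-class similarity) should be easier to classify; how these quantities and the class count combine into a single score is defined in the paper itself.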

A Landmark-Aware Visual Navigation Dataset

Feb 22, 2024

Faith Johnson, Bryan Bo Cao, Kristin Dana, Shubham Jain, Ashwin Ashok

Abstract: Map representation learned by expert demonstrations has shown promising research value. However, recent advancements in the visual navigation field face challenges due to the lack of human datasets in the real world for efficient supervised representation learning of the environments. We present a Landmark-Aware Visual Navigation (LAVN) dataset to allow for supervised learning of human-centric exploration policies and map building. We collect RGB observation and human point-click pairs as a human annotator explores virtual and real-world environments with the goal of full coverage exploration of the space. The human annotators also provide distinct landmark examples along each trajectory, which we intuit will simplify the task of map or graph building and localization. These human point-clicks serve as direct supervision for waypoint prediction when learning to explore in environments. Our dataset covers a wide spectrum of scenes, including rooms in indoor environments, as well as walkways outdoors. The dataset is available at DOI: 10.5281/zenodo.10608067.

Feudal Networks for Visual Navigation

Feb 19, 2024

Faith Johnson, Bryan Bo Cao, Kristin Dana, Shubham Jain, Ashwin Ashok

Abstract: Visual navigation follows the intuition that humans can navigate without detailed maps. A common approach is interactive exploration while building a topological graph with images at nodes that can be used for planning. Recent variations learn from passive videos and can navigate using complex social and semantic cues. However, a significant number of training videos are needed, large graphs are utilized, and scenes are not unseen since odometry is utilized. We introduce a new approach to visual navigation using feudal learning, which employs a hierarchical structure consisting of a worker agent, a mid-level manager, and a high-level manager. Key to the feudal learning paradigm, agents at each level see a different aspect of the task and operate at different spatial and temporal scales. Two unique modules are developed in this framework. For the high-level manager, we learn a memory proxy map in a self-supervised manner to record prior observations in a learned latent space and avoid the use of graphs and odometry. For the mid-level manager, we develop a waypoint network that outputs intermediate subgoals imitating human waypoint selection during local navigation. This waypoint network is pre-trained using a new, small set of teleoperation videos that we make publicly available, with training environments different from testing environments. The resulting feudal navigation network achieves near-SOTA performance, while providing a novel no-RL, no-graph, no-odometry, no-metric-map approach to the image goal navigation task.

ViFiT: Reconstructing Vision Trajectories from IMU and Wi-Fi Fine Time Measurements

Oct 04, 2023

Bryan Bo Cao, Abrar Alali, Hansi Liu, Nicholas Meegan, Marco Gruteser, Kristin Dana, Ashwin Ashok, Shubham Jain

Abstract: Tracking subjects in videos is one of the most widely used functions in camera-based IoT applications such as security surveillance, smart city traffic safety enhancement, vehicle-to-pedestrian communication, and so on. In the computer vision domain, tracking is usually achieved by first detecting subjects with bounding boxes, then associating detected bounding boxes across video frames. For many IoT systems, images captured by cameras are usually sent over the network to be processed at a different site that has more powerful computing resources than edge devices. However, sending entire frames through the network causes significant bandwidth consumption that may exceed the system bandwidth constraints. To tackle this problem, we propose ViFiT, a transformer-based model that reconstructs vision bounding box trajectories from phone data (IMU and Fine Time Measurements). It leverages the transformer's ability to better model long-term time-series data. ViFiT is evaluated on the Vi-Fi Dataset, a large-scale multimodal dataset in 5 diverse real-world scenes, including indoor and outdoor environments. To fill the gap of proper metrics that jointly capture the system characteristics of both tracking quality and video bandwidth reduction, we propose a novel evaluation framework dubbed Minimum Required Frames (MRF) and Minimum Required Frames Ratio (MRFR). ViFiT achieves an MRFR of 0.65, outperforming the 0.98 of X-Translator, the state-of-the-art LSTM Encoder-Decoder approach for cross-modal reconstruction, and resulting in a frame reduction rate as high as 97.76%.

* 22 pages, 12 figures, 9 tables. MobiCom 2023 ISACom

Data-Side Efficiencies for Lightweight Convolutional Neural Networks

Aug 24, 2023

Bryan Bo Cao, Lawrence O'Gorman, Michael Coss, Shubham Jain

Abstract: We examine how the choice of data-side attributes for two important visual tasks of image classification and object detection can aid in the choice or design of lightweight convolutional neural networks. We show by experimentation how four data attributes (number of classes, object color, image resolution, and object scale) affect neural network model size and efficiency. Intra- and inter-class similarity metrics, based on metric learning, are defined to guide the evaluation of these attributes toward achieving lightweight models. Evaluations made using these metrics are shown to require 30x less computation than running full inference tests. As an example, we apply the metrics and methods to choose a lightweight model for a robot path planning application and achieve a computation reduction of 66% and an accuracy gain of 3.5% over the pre-method model.

* 10 pages, 5 figures, 6 tables
