Abstract:We present an optimization study of the Vision-Language Frontier Maps (VLFM) applied to the Object Goal Navigation task in robotics. Our work evaluates the efficiency and performance of various vision-language models, object detectors, segmentation models, and multi-modal comprehension and Visual Question Answering modules. Using the $\textit{val-mini}$ and $\textit{val}$ splits of Habitat-Matterport 3D dataset, we conduct experiments on a desktop with limited VRAM. We propose a solution that achieves a higher success rate (+1.55%) improving over the VLFM BLIP-2 baseline without substantial success-weighted path length loss while requiring $\textbf{2.3 times}$ less video memory. Our findings provide insights into balancing model performance and computational efficiency, suggesting effective deployment strategies for resource-limited environments.
Abstract:Human-robot walking with prosthetic legs and exoskeletons, especially over complex terrains such as stairs, remains a significant challenge. Egocentric vision has the unique potential to detect the walking environment prior to physical interactions, which can improve transitions to and from stairs. This motivated us to create the StairNet initiative to support the development of new deep learning models for visual sensing and recognition of stairs, with an emphasis on lightweight and efficient neural networks for onboard real-time inference. In this study, we present an overview of the development of our large-scale dataset with over 515,000 manually labeled images, as well as our development of different deep learning models (e.g., 2D and 3D CNN, hybrid CNN and LSTM, and ViT networks) and training methods (e.g., supervised learning with temporal data and semi-supervised learning with unlabeled images) using our new dataset. We consistently achieved high classification accuracy (i.e., up to 98.8%) with different designs, offering trade-offs between model accuracy and size. When deployed on mobile devices with GPU and NPU accelerators, our deep learning models achieved inference speeds up to 2.8 ms. We also deployed our models on custom-designed CPU-powered smart glasses. However, limitations in the embedded hardware yielded slower inference speeds of 1.5 seconds, presenting a trade-off between human-centered design and performance. Overall, we showed that StairNet can be an effective platform to develop and study new visual perception systems for human-robot locomotion with applications in exoskeleton and prosthetic leg control.