Abstract: The recent surge in high-quality visual instruction tuning samples from closed-source vision-language models (VLMs) such as GPT-4V has accelerated the release of open-source VLMs across various model sizes. However, scaling VLMs to improve performance using larger models brings significant computational challenges, especially for deployment on resource-constrained devices like mobile platforms and robots. To address this, we propose VLsI: Verbalized Layers-to-Interactions, a new VLM family in 2B and 7B model sizes, which prioritizes efficiency without compromising accuracy. VLsI leverages a unique, layer-wise distillation process, introducing intermediate "verbalizers" that map features from each layer to natural language space, allowing smaller VLMs to flexibly align with the reasoning processes of larger VLMs. This approach mitigates the training instability often encountered in output imitation and goes beyond typical final-layer tuning by aligning the small VLMs' layer-wise progression with that of the large ones. We validate VLsI across ten challenging vision-language benchmarks, achieving notable performance gains (11.0% for 2B and 17.4% for 7B) over GPT-4V without the need for model scaling, merging, or architectural changes.
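As a rough illustration of the layer-wise idea, the sketch below (PyTorch) shows how an intermediate verbalizer could project a hidden state into vocabulary space and how matched student/teacher layers could be aligned with a KL objective. The module and loss names, the layer matching, and the temperature handling are assumptions for illustration, not the released VLsI implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Verbalizer(nn.Module):
    """Maps an intermediate hidden state to a distribution over the LM vocabulary."""
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_dim) -> (batch, seq_len, vocab_size)
        return self.proj(hidden)

def layerwise_distill_loss(student_hiddens, teacher_hiddens,
                           student_verbalizers, teacher_verbalizers, tau=1.0):
    """KL divergence between verbalized (vocabulary-space) outputs of matched layers."""
    loss = 0.0
    for s_h, t_h, s_v, t_v in zip(student_hiddens, teacher_hiddens,
                                  student_verbalizers, teacher_verbalizers):
        s_logits = s_v(s_h) / tau
        with torch.no_grad():
            t_logits = t_v(t_h) / tau
        loss = loss + F.kl_div(F.log_softmax(s_logits, dim=-1),
                               F.softmax(t_logits, dim=-1),
                               reduction="batchmean") * tau ** 2
    return loss / len(student_hiddens)
```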
Abstract: This paper jointly addresses three key limitations in conventional pedestrian trajectory forecasting: pedestrian perception errors, real-world data collection costs, and person ID annotation costs. We propose a novel framework, RealTraj, that enhances the real-world applicability of trajectory forecasting. Our approach includes two training phases--self-supervised pretraining on synthetic data and weakly-supervised fine-tuning with limited real-world data--to minimize data collection efforts. To improve robustness to real-world errors, we focus on both model design and training objectives. Specifically, we present Det2TrajFormer, a trajectory forecasting model that is robust to tracking noise by using past detections as inputs. Additionally, we pretrain the model using multiple pretext tasks, which enhance robustness and improve forecasting performance based solely on detection data. Unlike previous trajectory forecasting methods, our approach fine-tunes the model using only ground-truth detections, significantly reducing the need for costly person ID annotations. In the experiments, we comprehensively verify the effectiveness of the proposed method against these limitations and show that it outperforms state-of-the-art trajectory forecasting methods on multiple datasets.
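The sketch below illustrates, under our own assumptions, the core idea of forecasting directly from per-frame detections without person IDs: detections from the observed window are encoded as an unordered token set and pooled before regressing future positions. The architecture, dimensions, and pooling are placeholders and do not reproduce Det2TrajFormer.

```python
import torch
import torch.nn as nn

class DetectionTrajectoryForecaster(nn.Module):
    """Toy forecaster that consumes per-frame detections (no person IDs)."""
    def __init__(self, d_model=64, n_heads=4, n_layers=2, horizon=12):
        super().__init__()
        self.embed = nn.Linear(2, d_model)            # (x, y) detection -> token
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.head = nn.Linear(d_model, 2 * horizon)   # future (x, y) positions
        self.horizon = horizon

    def forward(self, detections, pad_mask):
        # detections: (batch, tokens, 2), flattened over the observed frames;
        # pad_mask: (batch, tokens), True marks padding entries.
        tokens = self.embed(detections)
        feats = self.encoder(tokens, src_key_padding_mask=pad_mask)
        pooled = feats.masked_fill(pad_mask.unsqueeze(-1), 0).sum(1) / \
                 (~pad_mask).sum(1, keepdim=True).clamp(min=1)
        return self.head(pooled).view(-1, self.horizon, 2)
```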
Abstract: Large-scale vision-language models, such as CLIP, are known to contain harmful societal bias regarding protected attributes (e.g., gender and age). In this paper, we aim to address the problem of societal bias in CLIP. Although previous studies have proposed to mitigate societal bias through adversarial learning or test-time projection, our comprehensive study of these works identifies two critical limitations: 1) loss of attribute information when it is explicitly disclosed in the input and 2) use of attribute annotations during the debiasing process. To mitigate societal bias in CLIP and overcome these limitations simultaneously, we introduce a simple-yet-effective debiasing method called SANER (societal attribute neutralizer) that eliminates attribute information from CLIP text features only for attribute-neutral descriptions. Experimental results show that SANER, which does not require attribute annotations and preserves original information for attribute-specific descriptions, demonstrates superior debiasing ability compared to existing methods.
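A minimal sketch of the general idea of removing attribute information from CLIP text features of attribute-neutral descriptions is shown below. It uses a hard projection of an estimated gender direction, which is closer to the test-time projection baselines discussed above than to SANER's learned neutralizer; the prompts and the Hugging Face CLIP checkpoint are illustrative choices.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def text_features(texts):
    inputs = tokenizer(texts, padding=True, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Estimate a gender direction from attribute-specific prompt pairs (illustrative prompts).
male = text_features(["a photo of a man", "a photo of a male doctor"])
female = text_features(["a photo of a woman", "a photo of a female doctor"])
gender_dir = (male - female).mean(0)
gender_dir = gender_dir / gender_dir.norm()

def neutralize(texts):
    """Project the attribute component out of features of attribute-neutral texts."""
    feats = text_features(texts)
    feats = feats - (feats @ gender_dir).unsqueeze(-1) * gender_dir
    return feats / feats.norm(dim=-1, keepdim=True)

debiased = neutralize(["a photo of a doctor"])
```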
Abstract: The crowd density forecasting task aims to predict how a crowd density map will change in the future from observed past crowd density maps. However, past crowd density maps are often incomplete due to miss-detection of pedestrians, so it is crucial to develop a crowd density forecasting model that is robust to such miss-detections. This paper presents a MAsked crowd density Completion framework for crowd density forecasting (CrowdMAC), which is simultaneously trained to forecast future crowd density maps from partially masked past crowd density maps (i.e., forecasting from past maps with miss-detections) while reconstructing the masked observation maps (i.e., imputing past maps with miss-detections). Additionally, we propose Temporal-Density-aware Masking (TDM), which non-uniformly masks tokens in the observed crowd density maps, considering the sparsity of the crowd density maps and the informativeness of the subsequent frames for the forecasting task. Moreover, we introduce multi-task masking to enhance training efficiency. In the experiments, CrowdMAC achieves state-of-the-art performance on seven large-scale datasets, including SDD, ETH-UCY, inD, JRDB, VSCrowd, FDST, and croHD. We also demonstrate the robustness of the proposed method against both synthetic and realistic miss-detections.
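The snippet below sketches what a non-uniform, density- and time-aware token masking step could look like. The weighting of denser tokens versus later frames, and the exact sampling form, are assumptions for illustration rather than the TDM formulation used in CrowdMAC.

```python
import torch

def temporal_density_aware_mask(density, mask_ratio=0.5, alpha=1.0, beta=1.0):
    """
    density: (T, N) crowd density summed per token, for T observed frames of N tokens.
    Returns a bool mask of shape (T, N), True = masked.
    Sampling weights combine a per-token density term and a per-frame temporal term;
    whether denser or later tokens should be masked more is a design choice that
    this sketch does not reproduce exactly.
    """
    T, N = density.shape
    density_w = density / (density.sum() + 1e-6)                     # density term
    time_w = torch.linspace(1.0, 2.0, T).unsqueeze(1).expand(T, N)   # temporal term
    weights = (alpha * density_w + 1e-6) * (beta * time_w)
    num_mask = int(mask_ratio * T * N)
    idx = torch.multinomial(weights.flatten(), num_mask, replacement=False)
    mask = torch.zeros(T * N, dtype=torch.bool)
    mask[idx] = True
    return mask.view(T, N)
```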
Abstract: Large language models (LLMs) have enhanced the capacity of vision-language models to caption visual text. This generative approach to image caption enrichment makes textual captions more descriptive, improving alignment with the visual context. However, while many studies focus on the benefits of generative caption enrichment (GCE), are there any negative side effects? We compare standard-format captions and recent GCE processes from the perspectives of "gender bias" and "hallucination", showing that enriched captions suffer from increased gender bias and hallucination. Furthermore, models trained on these enriched captions amplify gender bias by an average of 30.9% and increase hallucination by 59.5%. This study serves as a caution against the trend of making captions more descriptive.
Abstract: We address a novel cross-domain few-shot learning (CD-FSL) task with multimodal input and unlabeled target data for egocentric action recognition. This paper simultaneously tackles two critical challenges associated with egocentric action recognition in CD-FSL settings: (1) the extreme domain gap in egocentric videos (e.g., daily life vs. industrial domains) and (2) the computational cost of real-world applications. We propose MM-CDFSL, a domain-adaptive and computationally efficient approach designed to enhance adaptability to the target domain and improve inference speed. To address the first challenge, we propose incorporating multimodal distillation into the student RGB model using teacher models. Each teacher model is trained independently on source and target data for its respective modality. Leveraging only unlabeled target data during multimodal distillation enhances the student model's adaptability to the target domain. We further introduce ensemble masked inference, a technique that reduces the number of input tokens through masking. In this approach, ensemble prediction mitigates the performance degradation caused by masking, effectively addressing the second issue. Our approach outperformed the state-of-the-art CD-FSL approaches by a substantial margin on multiple egocentric datasets, improving by an average of 6.12/6.10 points in the 1-shot/5-shot settings while achieving 2.2 times faster inference speed. Project page: https://masashi-hatano.github.io/MM-CDFSL/
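A minimal sketch of ensemble masked inference, assuming a model that accepts an arbitrary subset of input tokens: predictions over several random token subsets are averaged, so masking reduces per-pass cost while the ensemble recovers accuracy. The keep ratio and number of masks below are illustrative.

```python
import torch

@torch.no_grad()
def ensemble_masked_inference(model, tokens, keep_ratio=0.5, n_masks=4):
    """
    tokens: (batch, num_tokens, dim) patch/tube tokens of a video clip.
    Run the model on several random token subsets and average the logits,
    trading a shorter token sequence per pass for a small ensemble.
    """
    B, N, D = tokens.shape
    n_keep = max(1, int(keep_ratio * N))
    logits = []
    for _ in range(n_masks):
        keep = torch.randperm(N, device=tokens.device)[:n_keep]
        logits.append(model(tokens[:, keep]))   # model is assumed to accept subsampled tokens
    return torch.stack(logits).mean(0)
```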
Abstract: Predicting future human behavior from egocentric videos is a challenging but critical task for understanding human intention. Existing methods for forecasting 2D hand positions rely on visual representations and mainly focus on hand-object interactions. In this paper, we investigate the hand forecasting task and tackle two significant issues that persist in existing methods: (1) 2D hand positions in future frames are severely affected by ego-motion in egocentric videos; (2) prediction based on visual information tends to overfit to background or scene textures, posing a challenge for generalization to novel scenes or human behaviors. To solve these problems, we propose EMAG, an ego-motion-aware and generalizable 2D hand forecasting method. In response to the first problem, we propose a method that considers ego-motion, represented by a sequence of homography matrices between consecutive frames. We further leverage modalities such as optical flow, trajectories of hands and interacting objects, and ego-motion, thereby alleviating the second issue. Extensive experiments on two large-scale egocentric video datasets, Ego4D and EPIC-Kitchens 55, verify the effectiveness of the proposed method. In particular, our model outperforms prior methods by 7.0% on cross-dataset evaluation. Project page: https://masashi-hatano.github.io/EMAG/
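As an illustration of the ego-motion representation, the sketch below estimates a homography between two consecutive grayscale frames with a generic ORB + RANSAC pipeline in OpenCV; applying it over the observation window yields the sequence of matrices described above. This is a standard estimate under our own assumptions, not necessarily the authors' pipeline.

```python
import cv2
import numpy as np

def frame_homography(prev_gray, curr_gray):
    """Estimate the homography between two consecutive frames as an ego-motion cue."""
    orb = cv2.ORB_create(nfeatures=1000)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    if des1 is None or des2 is None:
        return np.eye(3, dtype=np.float64)          # fall back to identity motion
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    if len(matches) < 4:
        return np.eye(3, dtype=np.float64)
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H if H is not None else np.eye(3, dtype=np.float64)
```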
Abstract: Surgical tool detection is essential for analyzing and evaluating minimally invasive surgery videos. Current approaches are mostly based on supervised methods that require large datasets with full instance-level labels (i.e., bounding boxes). However, large image datasets with instance-level labels are often limited because of the burden of annotation. Thus, detecting surgical tools from image-level labels instead of instance-level labels is important, since image-level annotations are considerably more time-efficient than instance-level annotations. In this work, we propose to strike a balance between the extremely costly annotation burden and detection performance. We further propose a co-occurrence loss, which leverages image-level labels by exploiting the characteristic that some tool pairs often co-occur in an image. Encapsulating this knowledge of co-occurrence in the loss helps overcome the difficulty in classification that arises because some tools have similar shapes and textures. Extensive experiments conducted on the Endovis2018 dataset under various data settings show the effectiveness of our method.
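The sketch below gives one possible form of a co-occurrence loss for image-level multi-label tool predictions: predicted probabilities are pulled toward soft targets derived from an empirical co-occurrence prior. This is an illustrative variant; the exact loss in the paper may differ.

```python
import torch
import torch.nn.functional as F

def cooccurrence_loss(logits, labels, cooc_prior, weight=0.1):
    """
    logits: (B, C) image-level tool logits; labels: (B, C) binary image-level labels.
    cooc_prior: (C, C) empirical P(tool j present | tool i present) from training labels.
    For each image, the predicted probability of every tool is pushed toward its average
    conditional co-occurrence frequency given the labeled-present tools.
    """
    probs = torch.sigmoid(logits)                                # (B, C)
    present_counts = labels.sum(dim=1, keepdim=True).clamp(min=1)
    soft_targets = (labels @ cooc_prior) / present_counts        # (B, C)
    soft_targets = soft_targets.clamp(0.0, 1.0)
    return weight * F.binary_cross_entropy(probs, soft_targets)
```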
Abstract: The ability to automatically detect and track surgical instruments in endoscopic videos can enable transformational interventions. Assessing surgical performance and efficiency, identifying skilled tool use and choreography, and planning operational and logistical aspects of OR resources are just a few of the applications that could benefit. Unfortunately, obtaining the annotations needed to train machine learning models to identify and localize surgical tools is a difficult task. Annotating bounding boxes frame-by-frame is tedious and time-consuming, yet large amounts of data with a wide variety of surgical tools and surgeries must be captured for robust training. Moreover, ongoing annotator training is needed to stay up to date with surgical instrument innovation. In robotic-assisted surgery, however, potentially informative data like timestamps of instrument installation and removal can be programmatically harvested. The ability to rely on tool installation data alone would significantly reduce the workload to train robust tool-tracking models. With this motivation in mind, we invited the surgical data science community to participate in the SurgToolLoc 2022 challenge. The goal was to leverage tool presence data as weak labels for machine learning models trained to detect tools and localize them in video frames with bounding boxes. We present the results of this challenge along with the efforts of many participating teams. We conclude by discussing these results in the broader context of machine learning and surgical data science. The training data used for this challenge, consisting of 24,695 video clips with tool presence labels, is also being released publicly and can be accessed at https://console.cloud.google.com/storage/browser/isi-surgtoolloc-2022.
Abstract: In this paper, we present an end-to-end unsupervised anomaly detection framework for 3D point clouds. To the best of our knowledge, this is the first work to tackle the anomaly detection task on a general object represented by a 3D point cloud. We propose a deep variational autoencoder-based unsupervised anomaly detection network adapted to the 3D point cloud and an anomaly score specifically for 3D point clouds. To verify the effectiveness of the model, we conducted extensive experiments on the ShapeNet dataset. Through quantitative and qualitative evaluation, we demonstrate that the proposed method outperforms the baseline method. Our code is available at https://github.com/llien30/point_cloud_anomaly_detection.
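A minimal sketch of a reconstruction-based anomaly score for point clouds, assuming a VAE that returns a reconstruction together with its latent mean and log-variance: the score combines a symmetric Chamfer distance with a small KL term. The interface and the weighting are assumptions, not the released implementation.

```python
import torch

def chamfer_distance(x, y):
    """Symmetric Chamfer distance between point sets x: (B, N, 3) and y: (B, M, 3)."""
    d = torch.cdist(x, y)                      # (B, N, M) pairwise distances
    return d.min(dim=2).values.mean(dim=1) + d.min(dim=1).values.mean(dim=1)

@torch.no_grad()
def anomaly_score(vae, points):
    """Reconstruction-based score: higher = more anomalous (illustrative weighting)."""
    recon, mu, logvar = vae(points)            # assumed VAE interface
    rec = chamfer_distance(points, recon)
    kld = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)
    return rec + 1e-3 * kld
```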