Osaka University
Abstract:Large language models (LLMs) have recently showcased remarkable capabilities, spanning a wide range of tasks and applications, including those in the medical domain. Models like GPT-4 excel in medical question answering but may lack interpretability when handling complex tasks in real clinical settings. We thus introduce the diagnostic reasoning dataset for clinical notes (DiReCT), which aims to evaluate the reasoning ability and interpretability of LLMs compared to human doctors. It contains 511 clinical notes, each meticulously annotated by physicians, detailing the diagnostic reasoning process from observations in a clinical note to the final diagnosis. Additionally, a diagnostic knowledge graph is provided to offer essential knowledge for reasoning that may not be covered in the training data of existing LLMs. Evaluations of leading LLMs on DiReCT reveal a significant gap between their reasoning ability and that of human doctors, highlighting the critical need for models that can reason effectively in real-world clinical scenarios.
Abstract:Facial expression spotting, identifying periods where facial expressions occur in a video, is a significant yet challenging task in facial expression analysis. The issues of irrelevant facial movements and the challenge of detecting subtle motions in micro-expressions remain unresolved, hindering accurate expression spotting. In this paper, we propose an efficient framework for facial expression spotting. First, we propose a Sliding Window-based Multi-Resolution Optical flow (SW-MRO) feature, which calculates multi-resolution optical flow of the input image sequence within compact sliding windows. The window length is tailored to perceive complete micro-expressions and distinguish between general macro- and micro-expressions. SW-MRO can effectively reveal subtle motions while avoiding severe head movement problems. Second, we propose SpotFormer, a multi-scale spatio-temporal Transformer that simultaneously encodes spatio-temporal relationships of the SW-MRO features for accurate frame-level probability estimation. In SpotFormer, our proposed Facial Local Graph Pooling (FLGP) and convolutional layers are applied for multi-scale spatio-temporal feature extraction. We validate the architecture of SpotFormer by comparing it with several model variants. Third, we introduce supervised contrastive learning into SpotFormer to enhance the discriminability between different types of expressions. Extensive experiments on SAMM-LV and CAS(ME)^2 show that our method outperforms state-of-the-art models, particularly in micro-expression spotting.
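As a rough illustration of the supervised contrastive learning component mentioned above, the sketch below implements the standard supervised contrastive (SupCon) loss over frame embeddings. The temperature value and the assumption that per-frame expression-type labels serve as the supervision signal are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.07):
    """Standard SupCon loss: pull together embeddings sharing a label.

    features: (N, D) frame embeddings; labels: (N,) integer expression types.
    """
    feats = F.normalize(features, dim=1)
    sim = feats @ feats.t() / temperature                      # pairwise similarities
    n = labels.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=feats.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask

    sim = sim.masked_fill(self_mask, float('-inf'))            # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)            # avoid -inf * 0

    pos_count = pos_mask.sum(dim=1)
    loss = -(log_prob * pos_mask.float()).sum(dim=1) / pos_count.clamp(min=1)
    return loss[pos_count > 0].mean()                          # anchors with positives only
```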
Abstract:In this work, we propose a novel learning-based method to jointly estimate the shape and subsurface scattering (SSS) parameters of translucent objects by utilizing polarization cues. Although polarization cues have been used in various applications, such as shape from polarization (SfP), BRDF estimation, and reflection removal, their application in SSS estimation has not yet been explored. Our observations indicate that the SSS affects not only the light intensity but also the polarization signal. Hence, the polarization signal can provide additional cues for SSS estimation. We also introduce the first large-scale synthetic dataset of polarized translucent objects for training our model. Our method outperforms several baselines from the SfP and inverse rendering realms on both synthetic and real data, as demonstrated by qualitative and quantitative results.
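For background on the polarization cues used here, a common way to recover the linear Stokes parameters, and hence the degree and angle of linear polarization, is from four intensity images captured behind a polarizer at 0°, 45°, 90°, and 135°. The sketch below shows that standard computation; it is general polarimetry background, not the proposed network itself.

```python
import numpy as np

def linear_stokes(i0, i45, i90, i135):
    """Recover linear Stokes parameters from four polarizer-angle images."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)                       # total intensity
    s1 = i0 - i90                                             # horizontal vs. vertical
    s2 = i45 - i135                                           # +45 deg vs. -45 deg
    dolp = np.sqrt(s1**2 + s2**2) / np.maximum(s0, 1e-8)     # degree of linear polarization
    aolp = 0.5 * np.arctan2(s2, s1)                           # angle of linear polarization
    return s0, s1, s2, dolp, aolp
```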
Abstract:Understanding the behavior of deep learning models is of utmost importance. In this realm, Explainable Artificial Intelligence (XAI) has emerged as a promising avenue, garnering increasing interest in recent years. Despite this, most existing methods primarily depend on gradients or input perturbation and often fail to embed explanations directly within the model's decision-making process. Addressing this gap, we introduce ESCOUTER, a visually explainable classifier based on the modified slot attention mechanism. ESCOUTER distinguishes itself by not only delivering high classification accuracy but also offering more transparent insights into the reasoning behind its decisions. It differs from prior approaches in two significant aspects: (a) ESCOUTER incorporates explanations into the final confidence scores for each category, providing a more intuitive interpretation, and (b) it offers positive or negative explanations for all categories, elucidating "why an image belongs to a certain category" or "why it does not." A novel loss function is designed specifically for ESCOUTER to fine-tune the model's behavior, enabling it to toggle between positive and negative explanations. Moreover, an area loss is also designed to adjust the size of the explanatory regions for a more precise explanation. Our method, rigorously tested across various datasets and XAI metrics, outperformed previous state-of-the-art methods, solidifying its effectiveness as an explanatory tool.
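The abstract mentions an area loss that constrains the size of the explanatory regions. One plausible form, sketched below under the assumption that the slot-attention explanations are per-class spatial maps in [0, 1], simply penalizes their mean activation so that only compact, decision-relevant regions remain; the exact formulation in ESCOUTER may differ.

```python
import torch

def area_loss(explanations, weight=1.0):
    """Penalize the spatial extent of slot-attention explanation maps.

    explanations: (B, num_classes, H, W) maps in [0, 1]; a smaller mean
    activation encourages compact, focused explanatory regions.
    """
    return weight * explanations.mean()
```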
Abstract:The growing use of deep learning in safety-critical applications, such as medical imaging, has raised concerns about the limited availability of labeled data; the demand for annotations grows as model complexity increases, placing a heavy burden on domain experts. In response, active learning (AL) is used to train models efficiently with limited annotation costs. In the context of deep neural networks (DNNs), AL often uses confidence or probability outputs as a score for selecting the most informative samples. However, modern DNNs exhibit unreliable confidence outputs, making calibration essential. We propose an AL framework that self-calibrates the confidence used for sample selection during the training process, referred to as Confident Active Learning with Integrated CalibratiOn (CALICO). CALICO incorporates the joint training of a classifier and an energy-based model, instead of the standard softmax-based classifier. This approach allows for simultaneous estimation of the input data distribution and the class probabilities during training, improving calibration without needing an additional labeled dataset. Experimental results showcase improved classification performance compared to a softmax-based classifier with fewer labeled samples. Furthermore, the calibration stability of the model is observed to depend on the prior class distribution of the data.
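To make the energy-based view concrete: in a JEM-style joint model, the same classifier logits yield both the class posterior and an unnormalized input density, and the (better calibrated) confidence can then drive sample selection. The sketch below is a minimal illustration of that readout; the least-confidence acquisition rule shown is one common choice, not necessarily the one used by CALICO.

```python
import torch
import torch.nn.functional as F

def jem_readouts(logits):
    """JEM-style readouts from classifier logits f(x).

    p(y|x) = softmax(f(x))               -- standard classifier head
    log p(x) = logsumexp_y f(x)[y] + C   -- unnormalized input density
    """
    probs = F.softmax(logits, dim=1)
    log_px = torch.logsumexp(logits, dim=1)
    return probs, log_px

def least_confidence_query(probs, k):
    """Pick the k unlabeled samples whose top-class probability is lowest."""
    confidence, _ = probs.max(dim=1)
    return torch.topk(-confidence, k).indices
```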
Abstract:Facial expression spotting is a significant but challenging task in facial expression analysis. The accuracy of expression spotting is affected not only by irrelevant facial movements but also by the difficulty of perceiving subtle motions in micro-expressions. In this paper, we propose a Multi-Scale Spatio-Temporal Graph Convolutional Network (SpoT-GCN) for facial expression spotting. To extract more robust motion features, we track both short- and long-term motion of facial muscles in compact sliding windows whose window length adapts to the temporal receptive field of the network. This strategy, termed the receptive field adaptive sliding window strategy, effectively magnifies the motion features while alleviating the problem of severe head movement. The subtle motion features are then converted to a facial graph representation, whose spatio-temporal graph patterns are learned by a graph convolutional network. This network learns both local and global features from multiple scales of facial graph structures using our proposed facial local graph pooling (FLGP). Furthermore, we introduce supervised contrastive learning to enhance the discriminative capability of our model for difficult-to-classify frames. The experimental results on the SAMM-LV and CAS(ME)^2 datasets demonstrate that our method achieves state-of-the-art performance, particularly in micro-expression spotting. Ablation studies further verify the effectiveness of our proposed modules.
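As a rough sketch of the facial local graph pooling (FLGP) idea, node features of landmarks belonging to the same facial region can be max-pooled into a single coarser node, producing the multiple graph scales mentioned above. The grouping below is purely illustrative; the paper's actual landmark-to-region assignment may differ.

```python
import torch

# illustrative landmark groups; the paper's actual region definition may differ
FACIAL_REGIONS = [
    [0, 1, 2, 3, 4],      # e.g. left eyebrow
    [5, 6, 7, 8, 9],      # right eyebrow
    [10, 11, 12, 13],     # nose
    [14, 15, 16, 17],     # mouth
]

def facial_local_graph_pooling(x, regions=FACIAL_REGIONS):
    """Max-pool landmark node features within each facial region.

    x: (batch, time, num_nodes, channels) -> (batch, time, num_regions, channels)
    """
    pooled = [x[:, :, idx, :].max(dim=2).values for idx in regions]
    return torch.stack(pooled, dim=2)
```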
Abstract:We propose a computational imaging method for time-efficient light-field acquisition that combines a coded aperture with an event-based camera. Unlike conventional coded-aperture imaging, our method applies a sequence of coding patterns during a single exposure of an image frame. The parallax information, which is related to the differences in coding patterns, is recorded as events. The image frame and events, all of which are measured in a single exposure, are jointly used to computationally reconstruct a light field. We also designed an algorithm pipeline for our method that is end-to-end trainable on the basis of deep optics and compatible with real camera hardware. We experimentally showed that our method can achieve more accurate reconstruction than several other imaging methods with a single exposure. We also developed a hardware prototype with the potential to complete the measurement on the camera within 22 msec and demonstrated that light fields from real 3-D scenes can be obtained with convincing visual quality. Our software and supplementary video are available from our project website.
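To illustrate how the differences between coding patterns end up as events: an event camera asynchronously reports pixels whose log intensity changes by more than a contrast threshold. The sketch below is a simplified simulation of that measurement model applied to a sequence of coded frames; the threshold value and the per-pixel reference update are idealized assumptions, not the paper's calibrated sensor model.

```python
import numpy as np

def simulate_events(coded_frames, threshold=0.2, eps=1e-6):
    """Simplified event-camera model over frames captured under successive codes.

    coded_frames: (T, H, W) intensities; returns (t, y, x, polarity) tuples.
    """
    ref = np.log(coded_frames[0] + eps)        # per-pixel reference log intensity
    events = []
    for t in range(1, len(coded_frames)):
        cur = np.log(coded_frames[t] + eps)
        diff = cur - ref
        fired = np.abs(diff) >= threshold
        ys, xs = np.nonzero(fired)
        for y, x in zip(ys, xs):
            events.append((t, y, x, int(np.sign(diff[y, x]))))
        ref[fired] = cur[fired]                # reset reference where events fired
    return events
```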
Abstract:Multi-view inverse rendering is the problem of estimating scene parameters such as shapes, materials, or illuminations from a sequence of images captured under different viewpoints. Many approaches, however, assume a single light bounce and thus fail to recover challenging scenarios like inter-reflections. On the other hand, simply extending those methods to consider multi-bounced light requires more assumptions to alleviate the ambiguity. To address this problem, we propose Neural Incident Stokes Fields (NeISF), a multi-view inverse rendering framework that reduces ambiguities using polarization cues. The primary motivation for using polarization cues is that the polarization signal accumulates over multiple light bounces, providing rich information about geometry and material. Based on this knowledge, the proposed incident Stokes field efficiently models the accumulated polarization effect with the aid of an original physically-based differentiable polarimetric renderer. Experimental results show that our method outperforms existing works in both synthetic and real scenarios.
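For readers unfamiliar with the polarimetric formulation: light is described by a Stokes vector and optical interactions by Mueller matrices, and the intensity measured behind a linear polarizer is the first component of the transformed Stokes vector. The sketch below shows this standard relation, which underlies any physically-based polarimetric renderer; it is textbook background rather than the paper's specific implementation.

```python
import numpy as np

def linear_polarizer_mueller(theta):
    """Mueller matrix of an ideal linear polarizer at angle theta (radians)."""
    c, s = np.cos(2 * theta), np.sin(2 * theta)
    return 0.5 * np.array([
        [1.0,   c,     s,   0.0],
        [c,   c * c, c * s, 0.0],
        [s,   c * s, s * s, 0.0],
        [0.0, 0.0,   0.0,   0.0],
    ])

def measured_intensity(stokes, theta):
    """Intensity recorded behind a polarizer at angle theta: S0 of M @ s."""
    return (linear_polarizer_mueller(theta) @ stokes)[0]
```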
Abstract:Large-scale models trained on extensive datasets have emerged as the preferred approach due to their high generalizability across various tasks. In-context learning (ICL), a popular strategy in natural language processing, uses such models for different tasks by providing instructive prompts but without updating model parameters. This idea is now being explored in computer vision, where an input-output image pair (called an in-context pair) is supplied to the model with a query image as a prompt to exemplify the desired output. The efficacy of visual ICL often depends on the quality of the prompts. We thus introduce a method coined Instruct Me More (InMeMo), which augments in-context pairs with a learnable perturbation (prompt) to explore its potential. Our experiments on mainstream tasks reveal that InMeMo surpasses the current state-of-the-art performance. Specifically, compared to the baseline without a learnable prompt, InMeMo boosts mIoU scores by 7.35 and 15.13 for foreground segmentation and single object detection tasks, respectively. Our findings suggest that InMeMo offers a versatile and efficient way to enhance the performance of visual ICL with lightweight training. Code is available at https://github.com/Jackieam/InMeMo.
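One way to realize the learnable perturbation described here is as a trainable border (a padding-style visual prompt) added to the in-context pair images before they are fed to the frozen model. The sketch below follows that common design; the pad width, image size, and exact placement are illustrative assumptions rather than InMeMo's confirmed configuration.

```python
import torch
import torch.nn as nn

class LearnableBorderPrompt(nn.Module):
    """Trainable border perturbation added to an in-context pair image.

    Inputs are assumed to be (B, 3, image_size, image_size) tensors;
    image_size and pad are illustrative values.
    """
    def __init__(self, image_size=224, pad=16):
        super().__init__()
        self.pad = pad
        self.top = nn.Parameter(torch.zeros(3, pad, image_size))
        self.bottom = nn.Parameter(torch.zeros(3, pad, image_size))
        self.left = nn.Parameter(torch.zeros(3, image_size - 2 * pad, pad))
        self.right = nn.Parameter(torch.zeros(3, image_size - 2 * pad, pad))

    def forward(self, x):
        p = self.pad
        prompt = torch.zeros_like(x[0])
        prompt[:, :p, :] = self.top          # learnable top border
        prompt[:, -p:, :] = self.bottom      # learnable bottom border
        prompt[:, p:-p, :p] = self.left      # learnable left border
        prompt[:, p:-p, -p:] = self.right    # learnable right border
        return x + prompt.unsqueeze(0)       # only the border is trained
```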
Abstract:In this work, we propose an inverse rendering model that estimates 3D shape, spatially-varying reflectance, homogeneous subsurface scattering parameters, and environment illumination jointly from only a pair of captured images of a translucent object. To resolve the ambiguity of inverse rendering, we use a physically-based renderer and a neural renderer for scene reconstruction and material editing. Because both renderers are differentiable, we can compute a reconstruction loss to assist parameter estimation. To enhance the supervision of the proposed neural renderer, we also propose an augmented loss. In addition, we use a flash and no-flash image pair as the input. To supervise the training, we constructed a large-scale synthetic dataset of translucent objects, which consists of 117K scenes. Qualitative and quantitative results on both synthetic and real-world datasets demonstrate the effectiveness of the proposed model.