Shigeo Morishima

Understanding and Supporting Formal Email Exchange by Answering AI-Generated Questions

Feb 06, 2025

Yusuke Miura, Chi-Lan Yang, Masaki Kuribayashi, Keigo Matsumoto, Hideaki Kuzuoka, Shigeo Morishima

Abstract:Replying to formal emails is time-consuming and cognitively demanding, as it requires polite phrasing and ensuring an adequate response to the sender's demands. Although systems with Large Language Models (LLMs) were designed to simplify the email replying process, users still needed to provide detailed prompts to obtain the expected output. Therefore, we proposed and evaluated an LLM-powered question-and-answer (QA)-based approach for users to reply to emails by answering a set of simple and short questions generated from the incoming email. We developed a prototype system, ResQ, and conducted controlled and field experiments with 12 and 8 participants, respectively. Our results demonstrated that the QA-based approach improves the efficiency of replying to emails and reduces workload while maintaining email quality compared to a conventional prompt-based approach that requires users to craft appropriate prompts to obtain email drafts. We discuss how the QA-based approach influences the email reply process and interpersonal relationship dynamics, as well as the opportunities and challenges associated with using a QA-based approach in AI-mediated communication.

SyncViolinist: Music-Oriented Violin Motion Generation Based on Bowing and Fingering

Dec 11, 2024

Hiroki Nishizawa, Keitaro Tanaka, Asuka Hirata, Shugo Yamaguchi, Qi Feng, Masatoshi Hamanaka, Shigeo Morishima

Abstract:Automatically generating realistic musical performance motion can greatly enhance digital media production, often involving collaboration between professionals and musicians. However, capturing the intricate body, hand, and finger movements required for accurate musical performances is challenging. Existing methods often fall short due to the complex mapping between audio and motion, typically requiring additional inputs like scores or MIDI data. In this work, we present SyncViolinist, a multi-stage end-to-end framework that generates synchronized violin performance motion solely from audio input. Our method overcomes the challenge of capturing both global and fine-grained performance features through two key modules: a bowing/fingering module and a motion generation module. The bowing/fingering module extracts detailed playing information from the audio, which the motion generation module uses to create precise, coordinated body motions reflecting the temporal granularity and nature of the violin performance. We demonstrate the effectiveness of SyncViolinist with significantly improved qualitative and quantitative results from unseen violin performance audio, outperforming state-of-the-art methods. Extensive subjective evaluations involving professional violinists further validate our approach. The code and dataset are available at https://github.com/Kakanat/SyncViolinist.

* 10 pages, 7 figures, 6 tables, WACV 2025

Detect Fake with Fake: Leveraging Synthetic Data-driven Representation for Synthetic Image Detection

Sep 13, 2024

Hina Otake, Yoshihiro Fukuhara, Yoshiki Kubotani, Shigeo Morishima

Abstract:Are general-purpose visual representations acquired solely from synthetic data useful for detecting fake images? In this work, we show the effectiveness of synthetic data-driven representations for synthetic image detection. Upon analysis, we find that vision transformers trained by the latest visual representation learners with synthetic data can effectively distinguish fake from real images without seeing any real images during pre-training. Notably, using SynCLR as the backbone in a state-of-the-art detection method demonstrates a performance improvement of +10.32 mAP and +4.73% accuracy over the widely used CLIP, when tested on previously unseen GAN models. Code is available at https://github.com/cvpaperchallenge/detect-fake-with-fake.

* Accepted to TWYN workshop at ECCV 2024

Monte Carlo Path Tracing and Statistical Event Detection for Event Camera Simulation

Aug 15, 2024

Yuichiro Manabe, Tatsuya Yatagawa, Shigeo Morishima, Hiroyuki Kubo

Abstract:This paper presents a novel event camera simulation system fully based on physically based Monte Carlo path tracing with adaptive path sampling. The adaptive sampling in the proposed method is based on a statistical technique: hypothesis testing of whether the difference of logarithmic luminances at two distant periods is significantly larger than a predefined event threshold. To this end, our rendering system collects logarithmic luminances rather than raw luminance, in contrast to conventional rendering systems that imitate conventional RGB cameras. Then, based on the central limit theorem, we reasonably assume that the distribution of the population mean of logarithmic luminance can be modeled as a normal distribution, allowing us to model the distribution of the difference of logarithmic luminance as a normal distribution. Then, using Student's t-test, we can test the hypothesis and determine whether to reject the null hypothesis of event non-occurrence. When we sample a sufficiently large number of path samples to satisfy the central limit theorem and obtain a clean set of events, our method achieves a significant speedup compared to a simple approach of sampling paths uniformly at every pixel. To our knowledge, we are the first to simulate the behavior of event cameras in a physically accurate manner using an adaptive sampling technique in Monte Carlo path tracing, and we believe this study will contribute to the development of computer vision applications using event cameras.

* 10 pages, 7 figures, Presented at ICCP 2024
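
The per-pixel decision described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `detect_event`, the significance level `alpha`, the synthetic samples, and the use of a normal approximation in place of Student's t distribution (reasonable only for large sample counts) are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, fmean, variance


def detect_event(log_lum_prev, log_lum_curr, threshold, alpha=0.05):
    """Test H0: |difference of mean log luminance| <= threshold (no event).

    Returns +1 or -1 for a positive or negative event, 0 if H0 is not rejected.
    Hypothetical helper: uses a normal approximation instead of a t distribution.
    """
    n_prev, n_curr = len(log_lum_prev), len(log_lum_curr)
    diff = fmean(log_lum_curr) - fmean(log_lum_prev)
    # Standard error of the difference of two independent sample means.
    std_err = math.sqrt(
        variance(log_lum_prev) / n_prev + variance(log_lum_curr) / n_curr
    )
    if std_err == 0.0:
        # Degenerate case: no sample noise, compare the difference directly.
        exceeds = abs(diff) > threshold
    else:
        z = (abs(diff) - threshold) / std_err
        p_value = 1.0 - NormalDist().cdf(z)  # one-sided p-value
        exceeds = p_value < alpha
    if not exceeds:
        return 0
    return 1 if diff > 0 else -1


rng = random.Random(0)
# Synthetic per-path log-luminance samples for one pixel (assumed values).
prev = [rng.gauss(0.0, 0.3) for _ in range(512)]
bright = [rng.gauss(0.5, 0.3) for _ in range(512)]
same = [rng.gauss(0.0, 0.3) for _ in range(512)]

print(detect_event(prev, bright, threshold=0.2))  # brightness rose well past the threshold
print(detect_event(prev, same, threshold=0.2))    # no meaningful change
```

In an adaptive sampler, a pixel whose test is still inconclusive would presumably receive more path samples before the decision is made; that loop is omitted here.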

Memory-Maze: Scenario Driven Benchmark and Visual Language Navigation Model for Guiding Blind People

May 11, 2024

Masaki Kuribayashi, Kohei Uehara, Allan Wang, Daisuke Sato, Simon Chu, Shigeo Morishima

Abstract:Visual Language Navigation (VLN) powered navigation robots have the potential to guide blind people by understanding and executing route instructions provided by sighted passersby. This capability allows robots to operate in environments that are often unknown a priori. Existing VLN models are insufficient for the scenario of navigation guidance for blind people, as they need to understand routes described from human memory, which frequently contain stutters, errors, and omissions of details, as opposed to those obtained by thinking out loud, such as in the Room-to-Room dataset. However, there is currently no benchmark that simulates instructions obtained from human memory in environments where blind people navigate. To this end, we present our benchmark, Memory-Maze, which simulates the scenario of seeking route instructions for guiding blind people. Our benchmark contains a maze-like structured virtual environment and novel route instruction data from human memory. To collect natural language instructions, we conducted two studies, one with sighted passersby onsite and one with annotators online. Our analysis demonstrates that instruction data collected onsite were longer and contained more varied wording. Alongside our benchmark, we propose a VLN model better equipped to handle the scenario. Our proposed VLN model uses Large Language Models (LLMs) to parse instructions and generate Python code for robot control. We further show that the existing state-of-the-art model performed suboptimally on our benchmark. In contrast, our proposed method outperformed the state-of-the-art model by a fair margin. We found that future research should exercise caution when considering VLN technology for practical applications, as real-world scenarios have different characteristics than ones collected in traditional settings.

Gaze-Driven Sentence Simplification for Language Learners: Enhancing Comprehension and Readability

Sep 30, 2023

Taichi Higasa, Keitaro Tanaka, Qi Feng, Shigeo Morishima

Abstract:Language learners should regularly engage in reading challenging materials as part of their study routine. Nevertheless, constantly referring to dictionaries is time-consuming and distracting. This paper presents a novel gaze-driven sentence simplification system designed to enhance reading comprehension while maintaining learners' focus on the content. Our system incorporates machine learning models tailored to individual learners, combining eye gaze features and linguistic features to assess sentence comprehension. When the system identifies comprehension difficulties, it provides simplified versions by replacing complex vocabulary and grammar with simpler alternatives via GPT-3.5. We conducted an experiment with 19 English learners, collecting data on their eye movements while reading English text. The results demonstrated that our system is capable of accurately estimating sentence-level comprehension. Additionally, we found that GPT-3.5 simplification improved readability in terms of traditional readability metrics and individual word difficulty, paraphrasing across different linguistic levels.

* Accepted by ACM ICMI 2023 workshops (Multimodal, Interactive Interfaces for Education)

Pointing out Human Answer Mistakes in a Goal-Oriented Visual Dialogue

Sep 19, 2023

Ryosuke Oshima, Seitaro Shinagawa, Hideki Tsunashima, Qi Feng, Shigeo Morishima

Abstract:Effective communication between humans and intelligent agents has promising applications for solving complex problems. One such approach is visual dialogue, which leverages multimodal context to assist humans. However, real-world scenarios occasionally involve human mistakes, which can cause intelligent agents to fail. While most prior research assumes perfect answers from human interlocutors, we focus on a setting where the agent points out unintentional mistakes for the interlocutor to review, better reflecting real-world situations. In this paper, we show that human answer mistakes depend on question type and QA turn in the visual dialogue by analyzing a previously unused data collection of human mistakes. We demonstrate the effectiveness of those factors for the model's accuracy in a pointing-human-mistake task through experiments using a simple MLP model and a Visual Language Model.

* Accepted at ICCVW 2023

Enhancing Perception and Immersion in Pre-Captured Environments through Learning-Based Eye Height Adaptation

Aug 24, 2023

Qi Feng, Hubert P. H. Shum, Shigeo Morishima

Abstract:Pre-captured immersive environments using omnidirectional cameras provide a wide range of virtual reality applications. Previous research has shown that manipulating the eye height in egocentric virtual environments can significantly affect distance perception and immersion. However, the influence of eye height in pre-captured real environments has received less attention due to the difficulty of altering the perspective after finishing the capture process. To explore this influence, we first propose a pilot study that captures real environments with multiple eye heights and asks participants to judge the egocentric distances and immersion. If a significant influence is confirmed, an effective image-based approach to adapt pre-captured real-world environments to the user's eye height would be desirable. Motivated by the study, we propose a learning-based approach for synthesizing novel views for omnidirectional images with altered eye heights. This approach employs a multitask architecture that learns depth and semantic segmentation in two formats, and generates high-quality depth and semantic segmentation to facilitate the inpainting stage. With the improved omnidirectional-aware layered depth image, our approach synthesizes natural and realistic visuals for eye height adaptation. Quantitative and qualitative evaluation shows favorable results against state-of-the-art methods, and an extensive user study verifies improved perception and immersion for pre-captured real-world environments.

* 10 pages, 13 figures, 3 tables, submitted to ISMAR 2023

Audio-Visual Speech Enhancement With Selective Off-Screen Speech Extraction

Jun 10, 2023

Tomoya Yoshinaga, Keitaro Tanaka, Shigeo Morishima

Abstract:This paper describes an audio-visual speech enhancement (AV-SE) method that estimates from noisy input audio a mixture of the speech of the speaker appearing in an input video (on-screen target speech) and of a selected speaker not appearing in the video (off-screen target speech). Although conventional AV-SE methods have suppressed all off-screen sounds, it is necessary to listen to a specific pre-known speaker's speech (e.g., family member's voice and announcements in stations) in future applications of AV-SE (e.g., hearing aids), even when users' sight does not capture the speaker. To overcome this limitation, we extract a visual clue for the on-screen target speech from the input video and a voiceprint clue for the off-screen one from a pre-recorded speech of the speaker. Two clues from different domains are integrated as an audio-visual clue, and the proposed model directly estimates the target mixture. To improve the estimation accuracy, we introduce a temporal attention mechanism for the voiceprint clue and propose a training strategy called the muting strategy. Experimental results show that our method outperforms a baseline method that uses the state-of-the-art AV-SE and speaker extraction methods individually in terms of estimation accuracy and computational efficiency.

* Accepted by EUSIPCO 2023

Improving the Gap in Visual Speech Recognition Between Normal and Silent Speech Based on Metric Learning

May 23, 2023

Sara Kashiwagi, Keitaro Tanaka, Qi Feng, Shigeo Morishima

Abstract:This paper presents a novel metric learning approach to address the performance gap between normal and silent speech in visual speech recognition (VSR). The difference in lip movements between the two poses a challenge for existing VSR models, which exhibit degraded accuracy when applied to silent speech. To solve this issue and tackle the scarcity of training data for silent speech, we propose to leverage the shared literal content between normal and silent speech and present a metric learning approach based on visemes. Specifically, we aim to map the input of two speech types close to each other in a latent space if they have similar viseme representations. By minimizing the Kullback-Leibler divergence of the predicted viseme probability distributions between and within the two speech types, our model effectively learns and predicts viseme identities. Our evaluation demonstrates that our method improves the accuracy of silent VSR, even when limited training data is available.

* Accepted by INTERSPEECH 2023
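
The idea of pulling viseme predictions for normal and silent speech together by minimizing the Kullback-Leibler divergence can be sketched as follows. This is only an illustration under stated assumptions: the logits are made-up values, the helper names are hypothetical, and the symmetrised form of the loss is a choice made here, not something the abstract specifies.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same viseme classes."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


# Hypothetical viseme logits for the same utterance frame, spoken normally and silently.
normal_probs = softmax([2.0, 0.5, -1.0, 0.1])
silent_probs = softmax([1.2, 0.9, -0.5, 0.3])

# Symmetrised KL as an illustrative loss term (an assumption, not the paper's exact loss).
loss = 0.5 * (
    kl_divergence(normal_probs, silent_probs)
    + kl_divergence(silent_probs, normal_probs)
)
print(f"symmetric KL loss: {loss:.4f}")
```

The loss is zero when the two predicted distributions coincide and grows as they diverge, which is the property a metric learning objective of this kind relies on.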
