Abstract:In-car Voice Assistants (VAs) play an increasingly critical role in automotive user interface design. However, existing VAs primarily perform simple 'query-answer' tasks, limiting their ability to sustain drivers' long-term attention. In this study, we investigate the effectiveness of an in-car Robot Assistant (RA) that offers functionalities beyond voice interaction. We aim to answer the question: How does the presence of a social robot impact user experience in real driving scenarios? Our study begins with a user survey to understand perspectives on in-car VAs and their influence on driving experiences. We then conduct non-driving and on-road experiments with selected participants to assess user experiences with an RA. Additionally, we conduct subjective ratings to evaluate user perceptions of the RA's personality, which is crucial for robot design. We also explore potential concerns regarding ethical risks. Finally, we provide a comprehensive discussion and recommendations for the future development of in-car RAs.
Abstract:Voice conversion systems have made significant advancements in terms of naturalness and similarity in common voice conversion tasks. However, their performance in more complex tasks such as cross-lingual voice conversion and expressive voice conversion remains imperfect. In this study, we propose a novel approach that combines a jointly trained speaker encoder and content features extracted from the cross-lingual speech recognition model Whisper to achieve high-quality cross-lingual voice conversion. Additionally, we introduce a speaker consistency loss to the joint encoder, which improves the similarity between the converted speech and the reference speech. To further explore the capabilities of the joint speaker encoder, we use the phonetic posteriorgram as the content feature, which enables the model to effectively reproduce both the speaker characteristics and the emotional aspects of the reference speech.
Abstract:Current Spoken Dialogue Systems (SDSs) often serve as passive listeners that respond only after receiving user speech. To achieve human-like dialogue, we propose a novel future prediction architecture that allows an SDS to anticipate future affective reactions based on its current behaviors before the user speaks. In this work, we investigate two scenarios: speech and laughter. In speech, we propose to predict the user's future emotion based on its temporal relationship with the system's current emotion and its causal relationship with the system's current Dialogue Act (DA). In laughter, we propose to predict the occurrence and type of the user's laughter using the system's laughter behaviors in the current turn. Preliminary analysis of human-robot dialogue demonstrated synchronicity in the emotions and laughter displayed by the human and robot, as well as DA-emotion causality in their dialogue. This verifies that our architecture can contribute to the development of an anticipatory SDS.
Abstract:With the development of automatic speech recognition (ASR) and text-to-speech (TTS) technology, high-quality voice conversion (VC) can be achieved by extracting source content information and target speaker information to reconstruct waveforms. However, current methods still require improvement in terms of inference speed. In this study, we propose a lightweight VITS-based VC model that uses the HuBERT-Soft model to extract content information features without speaker information. Through subjective and objective experiments on synthesized speech, the proposed model demonstrates competitive results in terms of naturalness and similarity. Importantly, unlike the original VITS model, we use the inverse short-time Fourier transform (iSTFT) to replace the most computationally expensive part. Experimental results show that our model can generate samples at over 5000 kHz on the 3090 GPU and over 250 kHz on the i9-10900K CPU, achieving competitive speed for the same hardware configuration.
Abstract:In recent years, various service robots have been introduced in stores as recommendation systems. Previous studies attempted to increase the influence of these robots by improving their social acceptance and trust. However, when such service robots recommend a product to customers in real environments, the effect on the customers is influenced not only by the robot itself, but also by the social influence of the surrounding people such as store clerks. Therefore, leveraging the social influence of the clerks may increase the influence of the robots on the customers. Hence, we compared the influence of robots with and without collaborative customer service between the robots and clerks in two bakery stores. The experimental results showed that collaborative customer service increased the purchase rate of the recommended bread and improved the impression regarding the robot and store experience of the customers. Because the results also showed that the workload required for the clerks to collaborate with the robot was not high, this study suggests that all stores with service robots may show high effectiveness in introducing collaborative customer service.
Abstract:In this paper, we report on a field study in which we employed two service robots in a bakery store as a sales promotion. Previous studies have explored public applications of service robots public such as shopping malls. However, more evidence is needed that service robots can contribute to sales in real stores. Moreover, the behaviors of customers and service robots in the context of sales promotions have not been examined well. Hence, the types of robot behavior that can be considered effective and the customers' responses to these robots remain unclear. To address these issues, we installed two tele-operated service robots in a bakery store for nearly 2 weeks, one at the entrance as a greeter and the other one inside the store to recommend products. The results show a dramatic increase in sales during the days when the robots were applied. Furthermore, we annotated the video recordings of both the robots' and customers' behavior. We found that although the robot placed at the entrance successfully attracted the interest of the passersby, no apparent increase in the number of customers visiting the store was observed. However, we confirmed that the recommendations of the robot operating inside the store did have a positive impact. We discuss our findings in detail and provide both theoretical and practical recommendations for future research and applications.
Abstract:Humans communicate non-verbally by sharing physical rhythms, such as nodding and gestures, to involve each other. This sharing of physicality creates a sense of unity and makes humans feel involved with others. In this paper, we developed a new body motion generation system based on the free-energy principle (FEP), which not only responds passively but also prompts human actions. The proposed system consists of two modules, the sampling module, and the motion selection module. We conducted a subjective experiment to evaluate the "feeling of interacting with the agent" of the FEP based behavior. The results suggested that FEP based behaviors show more "feeling of interacting with the agent". Furthermore, we confirmed that the agent's gestures elicited subject gestures. This result not only reinforces the impression of feeling interaction but could also realization of agents that encourage people to change their behavior.
Abstract:In this study, we explore the transformer's ability to capture intra-relations among frames by augmenting the receptive field of models. Concretely, we propose a CycleGAN-based model with the transformer and investigate its ability in the emotional voice conversion task. In the training procedure, we adopt curriculum learning to gradually increase the frame length so that the model can see from the short segment till the entire speech. The proposed method was evaluated on the Japanese emotional speech dataset and compared to several baselines (ACVAE, CycleGAN) with objective and subjective evaluations. The results show that our proposed model is able to convert emotion with higher strength and quality.
Abstract:Research currently being conducted on the use of robots as human labor support technology. In particular, the service industry needs to allocate more manpower, and it will be important for robots to support people. This study focuses on using a humanoid robot as a social service robot to convey information in a shopping mall, and the robot's behavioral concepts were analyzed. In order to convey the information, two processes must occur. Pedestrians must stop in front of the robot, and the robot must continue the engagement with them. For the purpose of this study, three types of autonomous behavioral concepts of the robot for the general use were analyzed and compared in these processes in the experiment: active, passive-negative, and passive-positive concepts. After interactions were attempted with 65,000+ pedestrians, this study revealed that the passive-negative concept can make pedestrians stop more and stay longer. In order to evaluate the effectiveness of the robot in a real environment, the comparative results between three behaviors and human advertisers revealed that (1) the results of the active and passive-positive concepts of the robot are comparable to those of the humans, and (2) the performance of the passive-negative concept is higher than that of all participants. These findings demonstrate that the performance of robots is comparable to that of humans in providing information tasks in a limited environment; therefore, it is expected that service robots as a labor support technology will be able to perform well in the real world.
Abstract:Text categorization is the task of assigning labels to documents written in a natural language, and it has numerous real-world applications including sentiment analysis as well as traditional topic assignment tasks. In this paper, we propose 5 different configurations for the semantic matrix-based memory neural network with end-to-end learning manner and evaluate our proposed method on two corpora of news articles (AG news, Sogou news). The best performance of our proposed method outperforms the baseline VDCNN models on the text classification task and gives a faster speed for learning semantics. Moreover, we also evaluate our model on small scale datasets. The results show that our proposed method can still achieve better results in comparison to VDCNN on the small scale dataset. This paper is to appear in the Proceedings of the 2020 IEEE 14th International Conference on Semantic Computing (ICSC 2020), San Diego, California, 2020.