Abstract: Spoken language serves as an accessible and efficient interface, enabling non-experts and disabled users to interact with complex assistant robots. However, accurately grounding spoken utterances poses a significant challenge due to the acoustic variability in speakers' voices and environmental noise. In this work, we propose a novel speech-scene graph grounding network (SGGNet$^2$) that robustly grounds spoken utterances by leveraging the acoustic similarity between correctly recognized and misrecognized words obtained from automatic speech recognition (ASR) systems. To incorporate the acoustic similarity, we extend our previous grounding model, the scene-graph-based grounding network (SGGNet), with the ASR model from NVIDIA NeMo. We accomplish this by feeding the latent vector of speech pronunciations into the BERT-based grounding network within SGGNet. We evaluate the effectiveness of using latent vectors of speech commands in grounding through qualitative and quantitative studies. We also demonstrate the capability of SGGNet$^2$ in a speech-based navigation task using a real quadruped robot, RBQ-3, from Rainbow Robotics.
Abstract: Delaunay triangulation is a well-known geometric combinatorial optimization problem with various applications. Many algorithms can generate a Delaunay triangulation for a given input point set, but most are nontrivial, requiring an understanding of geometry or the execution of additional geometric operations, such as the edge flip. Deep learning has been used to solve various combinatorial optimization problems; however, generating a Delaunay triangulation with deep learning remains difficult, and very little research has been conducted due to its complexity. In this paper, we propose a novel deep-learning-based approach for learning Delaunay triangulation using a new attention mechanism based on self-attention and domain knowledge. The encoder uses self-attention so that the model efficiently learns point-to-point relationships. In the decoder, a new attention score function based on domain knowledge imposes a high penalty when the geometric requirement is not satisfied. The strength of the proposed attention score function lies in its applicability to other combinatorial optimization problems involving geometry. Once well trained, the proposed neural network model is simple and efficient because it automatically predicts the Delaunay triangulation for an input point set without requiring any additional geometric operations. We conduct experiments to demonstrate the effectiveness of the proposed model and conclude that it outperforms other deep-learning-based approaches.
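The abstract does not spell out the geometric requirement it penalizes. As a minimal sketch, the code below assumes it is the standard empty-circumcircle property of a Delaunay triangulation: no input point lies strictly inside the circumcircle of any triangle. The function names (`orient`, `in_circumcircle`, `delaunay_violations`) are hypothetical and chosen for illustration; this is not the paper's attention score function, only a plain check that such a penalty could plausibly be built on.

```python
def orient(a, b, c):
    """Twice the signed area of triangle abc; positive if counter-clockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def in_circumcircle(a, b, c, d):
    """Return True if point d lies strictly inside the circumcircle of abc."""
    # The determinant test below assumes a, b, c are counter-clockwise.
    if orient(a, b, c) < 0:
        b, c = c, b
    ax, ay = a[0] - d[0], a[1] - d[1]
    bx, by = b[0] - d[0], b[1] - d[1]
    cx, cy = c[0] - d[0], c[1] - d[1]
    det = (
        (ax * ax + ay * ay) * (bx * cy - by * cx)
        - (bx * bx + by * by) * (ax * cy - ay * cx)
        + (cx * cx + cy * cy) * (ax * by - ay * bx)
    )
    return det > 0


def delaunay_violations(points, triangles):
    """Count (triangle, point) pairs that break the empty-circumcircle property.

    `triangles` is a list of index triples into `points`. A count of zero
    means the triangulation satisfies the assumed geometric requirement.
    """
    count = 0
    for i, j, k in triangles:
        for m, p in enumerate(points):
            if m in (i, j, k):
                continue
            if in_circumcircle(points[i], points[j], points[k], p):
                count += 1
    return count


# Four points forming a kite; the two possible triangulations differ
# only in which diagonal is used.
points = [(0, 0), (4, 0), (2, 1), (2, -1)]
long_diagonal = [(0, 1, 2), (0, 1, 3)]   # shares the edge between points 0 and 1
short_diagonal = [(0, 2, 3), (1, 2, 3)]  # shares the edge between points 2 and 3

violations_long = delaunay_violations(points, long_diagonal)
violations_short = delaunay_violations(points, short_diagonal)
```

In this example only the triangulation that uses the short diagonal has zero violations, which is the outcome an edge flip would reach in a classical algorithm. A learned model could, under the same assumption, use such a violation count as a penalty signal.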