Abstract:Task oriented dialogue (TOD) requires the complex interleaving of a number of individually controllable components with strong guarantees for explainability and verifiability. This has made it difficult to adopt the multi-turn multi-domain dialogue generation capabilities of streamlined end-to-end open-domain dialogue systems. In this paper, we present a new framework, DLGNet-Task, a unified task-oriented dialogue system which employs autoregressive transformer networks such as DLGNet and GPT-2/3 to complete user tasks in multi-turn multi-domain conversations. Our framework enjoys the controllable, verifiable, and explainable outputs of modular approaches, and the low development, deployment and maintenance cost of end-to-end systems. Treating open-domain system components as additional TOD system modules allows DLGNet-Task to learn the joint distribution of the inputs and outputs of all the functional blocks of existing modular approaches such as, natural language understanding (NLU), state tracking, action policy, as well as natural language generation (NLG). Rather than training the modules individually, as is common in real-world systems, we trained them jointly with appropriate module separations. When evaluated on the MultiWOZ2.1 dataset, DLGNet-Task shows comparable performance to the existing state-of-the-art approaches. Furthermore, using DLGNet-Task in conversational AI systems reduces the level of effort required for developing, deploying, and maintaining intelligent assistants at scale.
Abstract:Speech processing systems rely on robust feature extraction to handle phonetic and semantic variations found in natural language. While techniques exist for desensitizing features to common noise patterns produced by Speech-to-Text (STT) and Text-to-Speech (TTS) systems, the question remains how to best leverage state-of-the-art language models (which capture rich semantic features, but are trained on only written text) on inputs with ASR errors. In this paper, we present Telephonetic, a data augmentation framework that helps robustify language model features to ASR corrupted inputs. To capture phonetic alterations, we employ a character-level language model trained using probabilistic masking. Phonetic augmentations are generated in two stages: a TTS encoder (Tacotron 2, WaveGlow) and a STT decoder (DeepSpeech). Similarly, semantic perturbations are produced by sampling from nearby words in an embedding space, which is computed using the BERT language model. Words are selected for augmentation according to a hierarchical grammar sampling strategy. Telephonetic is evaluated on the Penn Treebank (PTB) corpus, and demonstrates its effectiveness as a bootstrapping technique for transferring neural language models to the speech domain. Notably, our language model achieves a test perplexity of 37.49 on PTB, which to our knowledge is state-of-the-art among models trained only on PTB.