Abstract: We study self-supervised video representation learning, which seeks to learn video features from unlabeled videos and is widely used for video analysis because labeling videos is labor-intensive. Current methods often mask some video regions and then train a model to reconstruct spatial information in these regions (e.g., the original pixels). However, the model can easily reconstruct this information from the content of a single frame. As a result, it may neglect to learn the interactions between frames, which are critical for video analysis. In this paper, we present a new self-supervised learning task, called Masked Motion Modeling (M$^3$Video), which learns representations by requiring the model to predict the motion of moving objects in the masked regions. To generate motion targets for this task, we track the objects using optical flow. The motion targets consist of the position transitions and shape changes of the tracked objects, so the model has to consider multiple frames comprehensively. In addition, to help the model capture fine-grained motion details, we require it to predict trajectory motion targets at high temporal resolution from a video at low temporal resolution. After pre-training with our M$^3$Video task, the model is able to anticipate fine-grained motion details even when taking a sparsely sampled video as input. We conduct extensive experiments on four benchmark datasets. Remarkably, when pre-training for 400 epochs, we improve the accuracy from 67.6\% to 69.2\% and from 78.8\% to 79.7\% on the Something-Something V2 and Kinetics-400 datasets, respectively.
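As a rough illustration of how trajectory targets could be derived from optical flow, the sketch below follows a set of points through a sequence of dense flow fields and takes per-step displacements as a stand-in for the position transitions. This is a minimal sketch under stated assumptions, not the method of the paper: the function names `track_points` and `position_targets`, the `(dx, dy)` flow layout, and the nearest-neighbour flow lookup are all hypothetical choices, and the shape-change targets are not covered.

```python
import numpy as np


def track_points(flows, points):
    """Follow points through a sequence of dense optical-flow fields.

    flows:  array of shape (T, H, W, 2); flows[t][y, x] is the assumed
            (dx, dy) displacement of pixel (x, y) from frame t to t + 1.
    points: array of shape (N, 2) with (x, y) positions in frame 0.
    Returns an array of shape (T + 1, N, 2) with the tracked positions.
    """
    flows = np.asarray(flows, dtype=np.float64)
    pos = np.asarray(points, dtype=np.float64).copy()
    _, height, width, _ = flows.shape
    trajectory = [pos.copy()]
    for flow in flows:
        # Nearest-neighbour lookup of the flow at each current position,
        # clipped so that points leaving the frame reuse the border flow.
        xi = np.clip(np.rint(pos[:, 0]).astype(int), 0, width - 1)
        yi = np.clip(np.rint(pos[:, 1]).astype(int), 0, height - 1)
        pos = pos + flow[yi, xi]
        trajectory.append(pos.copy())
    return np.stack(trajectory)


def position_targets(trajectory):
    """Per-step displacements of shape (T, N, 2), used here as an
    illustrative proxy for the position transitions."""
    return np.diff(trajectory, axis=0)
```

In a setup like the one described, the flow fields would come from the densely sampled video, while the model only sees a sparser subset of frames; that pairing is what would force it to predict motion at a higher temporal resolution than its input.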
Abstract: A dialogue system for disease diagnosis aims to make a diagnosis by conversing with patients. Existing disease diagnosis dialogue systems rely heavily on data-driven methods and statistical features and lack a deep comprehension of medical knowledge, such as symptom-disease relations. In addition, previous work pays little attention to the demographic attributes of a patient, which are important factors in clinical diagnosis. To tackle these issues, this work presents a graph-based and demographic-attribute-aware dialogue system for disease diagnosis. Specifically, we first build a weighted bidirectional graph from clinical dialogues to depict the relationship between symptoms and diseases, and then present a bidirectional graph-based deep Q-network (BG-DQN) for dialogue management. By extending the Graph Convolutional Network (GCN) to learn the embeddings of diseases and symptoms from both the structural and attribute information in the graph, BG-DQN can better capture the relations between diseases and symptoms. Moreover, BG-DQN also encodes the demographic attributes of a patient to assist the disease diagnosis process. Experimental results show that the proposed dialogue system outperforms several competitive methods in terms of diagnostic accuracy. More importantly, our method can complete the task with fewer dialogue turns and is better at distinguishing diseases with similar symptoms.
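One way to picture a weighted bidirectional symptom-disease graph is as two sets of directed edges, one in each direction, with weights estimated from dialogue records. The sketch below is only an illustration: the abstract does not say how the edges are weighted, so the use of conditional co-occurrence frequencies, the `(disease, symptoms)` record format, and the name `build_graph` are all assumptions.

```python
from collections import Counter, defaultdict


def build_graph(records):
    """Build two directed, weighted edge maps from dialogue records.

    records: iterable of (disease, symptoms) pairs, one per dialogue
             (a hypothetical input format).
    Returns (d2s, s2d), where d2s[d][s] estimates P(symptom | disease)
    and s2d[s][d] estimates P(disease | symptom). These weights are an
    assumed choice, not taken from the source.
    """
    disease_count = Counter()
    symptom_count = Counter()
    pair_count = Counter()
    for disease, symptoms in records:
        disease_count[disease] += 1
        # Count each symptom at most once per dialogue.
        for symptom in set(symptoms):
            symptom_count[symptom] += 1
            pair_count[(disease, symptom)] += 1

    d2s = defaultdict(dict)
    s2d = defaultdict(dict)
    for (disease, symptom), n in pair_count.items():
        d2s[disease][symptom] = n / disease_count[disease]
        s2d[symptom][disease] = n / symptom_count[symptom]
    return d2s, s2d
```

Keeping the two directions separate means the weight from a symptom to a disease can differ from the weight back, which is what makes the graph bidirectional rather than simply undirected.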
Abstract: Natural language generation (NLG) is an essential component of task-oriented dialog systems. Despite the recent success of neural approaches to NLG, they are typically developed offline for particular domains. To better fit real-life applications where new data come in a stream, we study NLG in a "continual learning" setting, in which the model incrementally expands its knowledge to new domains or functionalities. The major challenge towards this goal is catastrophic forgetting, meaning that a continually trained model tends to forget the knowledge it has learned before. To address this challenge, we propose a method called ARPER (Adaptively Regularized Prioritized Exemplar Replay), which replays prioritized historical exemplars together with an adaptive regularization technique based on Elastic Weight Consolidation. We conduct extensive experiments on continually learning new domains and intents on MultiWoZ-2.0 to benchmark ARPER against a wide range of techniques. Empirical results demonstrate that ARPER significantly outperforms other methods by effectively mitigating catastrophic forgetting.
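For readers unfamiliar with Elastic Weight Consolidation, the sketch below shows the standard quadratic penalty, $\frac{\lambda}{2} \sum_i F_i (\theta_i - \theta^*_i)^2$, which discourages parameters that were important for earlier tasks from drifting. It illustrates plain EWC with a fixed weight `lam` only; how ARPER adapts the regularization and how it prioritizes exemplars are not specified in the abstract and are not shown here. The function names and the flat parameter vectors are assumptions made for the example.

```python
import numpy as np


def ewc_penalty(theta, theta_old, fisher, lam):
    """Standard EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2.

    theta:     current parameters (flat vector).
    theta_old: parameters after training on earlier data.
    fisher:    diagonal Fisher information estimates, one per parameter.
    lam:       regularization strength (fixed here; an adaptive scheme
               would change it over time).
    """
    theta = np.asarray(theta, dtype=np.float64)
    theta_old = np.asarray(theta_old, dtype=np.float64)
    fisher = np.asarray(fisher, dtype=np.float64)
    return 0.5 * lam * float(np.sum(fisher * (theta - theta_old) ** 2))


def total_loss(task_loss, theta, theta_old, fisher, lam):
    """Loss on the new data (which may include replayed exemplars)
    plus the EWC penalty."""
    return task_loss + ewc_penalty(theta, theta_old, fisher, lam)
```

A parameter with a large Fisher value is penalized heavily for moving away from its old value, while a parameter with a value near zero is free to change to fit the new domain.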