Abstract:Passage retrieval is a fundamental task in information retrieval (IR) research, which has drawn much attention recently. In English field, the availability of large-scale annotated dataset (e.g, MS MARCO) and the emergence of deep pre-trained language models (e.g, BERT) have resulted in a substantial improvement of existing passage retrieval systems. However, in Chinese field, especially for specific domain, passage retrieval systems are still immature due to quality-annotated dataset being limited by scale. Therefore, in this paper, we present a novel multi-domain Chinese dataset for passage retrieval (Multi-CPR). The dataset is collected from three different domains, including E-commerce, Entertainment video and Medical. Each dataset contains millions of passages and a certain amount of human annotated query-passage related pairs. We implement various representative passage retrieval methods as baselines. We find that the performance of retrieval models trained on dataset from general domain will inevitably decrease on specific domain. Nevertheless, passage retrieval system built on in-domain annotated dataset can achieve significant improvement, which indeed demonstrates the necessity of domain labeled data for further optimization. We hope the release of the Multi-CPR dataset could benchmark Chinese passage retrieval task in specific domain and also make advances for future studies.
Abstract:How the human vision system addresses the object identity-preserving recognition problem is largely unknown. Here, we use a vision recognition-reconstruction network (RRN) to investigate the development, recognition, learning and forgetting mechanisms, and achieve similar characteristics to electrophysiological measurements in monkeys. First, in network development study, the RRN also experiences critical developmental stages characterized by specificities in neuron types, synapse and activation patterns, and visual task performance from the early stage of coarse salience map recognition to mature stage of fine structure recognition. In digit recognition study, we witness that the RRN could maintain object invariance representation under various viewing conditions by coordinated adjustment of responses of population neurons. And such concerted population responses contained untangled object identity and properties information that could be accurately extracted via high-level cortices or even a simple weighted summation decoder. In the learning and forgetting study, novel structure recognition is implemented by adjusting entire synapses in low magnitude while pattern specificities of original synaptic connectivity are preserved, which guaranteed a learning process without disrupting the existing functionalities. This work benefits the understanding of the human visual processing mechanism and the development of human-like machine intelligence.
Abstract:Using language makes human beings surpass animals in wisdom. To let machines understand, learn, and use language flexibly, we propose a human-like general language processing (HGLP) architecture, which contains sensorimotor, association, and cognitive systems. The HGLP network learns from easy to hard like a child, understands word meaning by coactivating multimodal neurons, comprehends and generates sentences by real-time constructing a virtual world model, and can express the whole thinking process verbally. HGLP rapidly learned 10+ different tasks including object recognition, sentence comprehension, imagination, attention control, query, inference, motion judgement, mixed arithmetic operation, digit tracing and writing, and human-like iterative thinking process guided by language. Language in the HGLP framework is not matching nor correlation statistics, but a script that can describe and control the imagination.