Abstract:Enlarging the context window of large language models (LLMs) has become a crucial research area, particularly for applications involving extremely long texts. In this work, we propose a novel training-free framework for processing long texts, utilizing a divide-and-conquer strategy to achieve comprehensive document understanding. The proposed LLM$\times$MapReduce framework splits the entire document into several chunks for LLMs to read and then aggregates the intermediate answers to produce the final output. The main challenge for divide-and-conquer long text processing frameworks lies in the risk of losing essential long-range information when splitting the document, which can lead the model to produce incomplete or incorrect answers based on the segmented texts. Disrupted long-range information can be classified into two categories: inter-chunk dependency and inter-chunk conflict. We design a structured information protocol to better cope with inter-chunk dependency and an in-context confidence calibration mechanism to resolve inter-chunk conflicts. Experimental results demonstrate that LLM$\times$MapReduce can outperform representative open-source and commercial long-context LLMs, and is applicable to several different models.
Abstract:Large language models respond well in high-resource languages like English but struggle in low-resource languages. It may arise from the lack of high-quality instruction following data in these languages. Directly translating English samples into these languages can be a solution but unreliable, leading to responses with translation errors and lacking language-specific or cultural knowledge. To address this issue, we propose a novel method to construct cross-lingual instruction following samples with instruction in English and response in low-resource languages. Specifically, the language model first learns to generate appropriate English instructions according to the natural web texts in other languages as responses. The candidate cross-lingual instruction tuning samples are further refined and diversified. We have employed this method to build a large-scale cross-lingual instruction tuning dataset on 10 languages, namely X-Instruction. The instruction data built using our method incorporate more language-specific knowledge compared with the naive translation method. Experimental results have shown that the response quality of the model tuned on X-Instruction greatly exceeds the model distilled from a powerful teacher model, reaching or even surpassing the ones of ChatGPT. In addition, we find that models tuned on cross-lingual instruction following samples can follow the instruction in the output language without further tuning.
Abstract:We introduce VASA, a framework for generating lifelike talking faces with appealing visual affective skills (VAS) given a single static image and a speech audio clip. Our premiere model, VASA-1, is capable of not only producing lip movements that are exquisitely synchronized with the audio, but also capturing a large spectrum of facial nuances and natural head motions that contribute to the perception of authenticity and liveliness. The core innovations include a holistic facial dynamics and head movement generation model that works in a face latent space, and the development of such an expressive and disentangled face latent space using videos. Through extensive experiments including evaluation on a set of new metrics, we show that our method significantly outperforms previous methods along various dimensions comprehensively. Our method not only delivers high video quality with realistic facial and head dynamics but also supports the online generation of 512x512 videos at up to 40 FPS with negligible starting latency. It paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors.
Abstract:Recent fMRI-to-image approaches mainly focused on associating fMRI signals with specific conditions of pre-trained diffusion models. These approaches, while producing high-quality images, capture only a limited aspect of the complex information in fMRI signals and offer little detailed control over image creation. In contrast, this paper proposes to directly modulate the generation process of diffusion models using fMRI signals. Our approach, NeuroPictor, divides the fMRI-to-image process into three steps: i) fMRI calibrated-encoding, to tackle multi-individual pre-training for a shared latent space to minimize individual difference and enable the subsequent cross-subject training; ii) fMRI-to-image cross-subject pre-training, perceptually learning to guide diffusion model with high- and low-level conditions across different individuals; iii) fMRI-to-image single-subject refining, similar with step ii but focus on adapting to particular individual. NeuroPictor extracts high-level semantic features from fMRI signals that characterizing the visual stimulus and incrementally fine-tunes the diffusion model with a low-level manipulation network to provide precise structural instructions. By training with over 60,000 fMRI-image pairs from various individuals, our model enjoys superior fMRI-to-image decoding capacity, particularly in the within-subject setting, as evidenced in benchmark datasets. Project page: https://jingyanghuo.github.io/neuropictor/.
Abstract:Pre-trained computational language models have recently made remarkable progress in harnessing the language abilities which were considered unique to humans. Their success has raised interest in whether these models represent and process language like humans. To answer this question, this paper proposes MulCogBench, a multi-modal cognitive benchmark dataset collected from native Chinese and English participants. It encompasses a variety of cognitive data, including subjective semantic ratings, eye-tracking, functional magnetic resonance imaging (fMRI), and magnetoencephalography (MEG). To assess the relationship between language models and cognitive data, we conducted a similarity-encoding analysis which decodes cognitive data based on its pattern similarity with textual embeddings. Results show that language models share significant similarities with human cognitive data and the similarity patterns are modulated by the data modality and stimuli complexity. Specifically, context-aware models outperform context-independent models as language stimulus complexity increases. The shallow layers of context-aware models are better aligned with the high-temporal-resolution MEG signals whereas the deeper layers show more similarity with the high-spatial-resolution fMRI. These results indicate that language models have a delicate relationship with brain language representations. Moreover, the results between Chinese and English are highly consistent, suggesting the generalizability of these findings across languages.
Abstract:Automatic pavement crack detection is an important task to ensure the functional performances of pavements during their service life. Inspired by deep learning (DL), the encoder-decoder framework is a powerful tool for crack detection. However, these models are usually open-loop (OL) systems that tend to treat thin cracks as the background. Meanwhile, these models can not automatically correct errors in the prediction, nor can it adapt to the changes of the environment to automatically extract and detect thin cracks. To tackle this problem, we embed closed-loop feedback (CLF) into the neural network so that the model could learn to correct errors on its own, based on generative adversarial networks (GAN). The resulting model is called CrackCLF and includes the front and back ends, i.e. segmentation and adversarial network. The front end with U-shape framework is employed to generate crack maps, and the back end with a multi-scale loss function is used to correct higher-order inconsistencies between labels and crack maps (generated by the front end) to address open-loop system issues. Empirical results show that the proposed CrackCLF outperforms others methods on three public datasets. Moreover, the proposed CLF can be defined as a plug and play module, which can be embedded into different neural network models to improve their performances.
Abstract:Multilingual generative models obtain remarkable cross-lingual capabilities through pre-training on large-scale corpora. However, they still exhibit a performance bias toward high-resource languages, and learn isolated distributions of sentence representations across languages. To bridge this gap, we propose a simple yet effective alignment framework exploiting pairs of translation sentences. It aligns the internal sentence representations across different languages via multilingual contrastive learning and aligns model outputs by answering prompts in different languages. Experimental results demonstrate that even with less than 0.1 {\textperthousand} of pre-training tokens, our alignment framework significantly boosts the cross-lingual abilities of generative models and mitigates the performance gap. Further analysis reveals that it results in a better internal multilingual representation distribution of multilingual models.
Abstract:Transformer-based models, even though achieving super-human performance on several downstream tasks, are often regarded as a black box and used as a whole. It is still unclear what mechanisms they have learned, especially their core module: multi-head attention. Inspired by functional specialization in the human brain, which helps to efficiently handle multiple tasks, this work attempts to figure out whether the multi-head attention module will evolve similar function separation under multi-tasking training. If it is, can this mechanism further improve the model performance? To investigate these questions, we introduce an interpreting method to quantify the degree of functional specialization in multi-head attention. We further propose a simple multi-task training method to increase functional specialization and mitigate negative information transfer in multi-task learning. Experimental results on seven pre-trained transformer models have demonstrated that multi-head attention does evolve functional specialization phenomenon after multi-task training which is affected by the similarity of tasks. Moreover, the multi-task training strategy based on functional specialization boosts performance in both multi-task learning and transfer learning without adding any parameters.
Abstract:Rotation group $\mathcal{SO}(d)$ synchronization is an important inverse problem and has attracted intense attention from numerous application fields such as graph realization, computer vision, and robotics. In this paper, we focus on the least-squares estimator of rotation group synchronization with general additive noise models, which is a nonconvex optimization problem with manifold constraints. Unlike the phase/orthogonal group synchronization, there are limited provable approaches for solving rotation group synchronization. First, we derive improved estimation results of the least-squares/spectral estimator, illustrating the tightness and validating the existing relaxation methods of solving rotation group synchronization through the optimum of relaxed orthogonal group version under near-optimal noise level for exact recovery. Moreover, departing from the standard approach of utilizing the geometry of the ambient Euclidean space, we adopt an intrinsic Riemannian approach to study orthogonal/rotation group synchronization. Benefiting from a quotient geometric view, we prove the positive definite condition of quotient Riemannian Hessian around the optimum of orthogonal group synchronization problem, and consequently the Riemannian local error bound property is established to analyze the convergence rate properties of various Riemannian algorithms. As a simple and feasible method, the sequential convergence guarantee of the (quotient) Riemannian gradient method for solving orthogonal/rotation group synchronization problem is studied, and we derive its global linear convergence rate to the optimum with the spectral initialization. All results are deterministic without any probabilistic model.
Abstract:Large language models (LLMs) demonstrate promising translation performance among various natural languages. However, many LLMs especially the open-sourced ones, such as BLOOM and LLaMA, are English-dominant and support only dozens of natural languages, making the potential of LLMs on language translation less explored. In this work, we present BigTrans which adapts LLaMA that covers only 20 languages and enhances it with multilingual translation capability on more than 100 languages. BigTrans is built upon LLaMA-13B and it is optimized in three steps. First, we continue training LLaMA with massive Chinese monolingual data. Second, we continue training the model with a large-scale parallel dataset that covers 102 natural languages. Third, we instruct-tune the foundation model with multilingual translation instructions, leading to our BigTrans model. The preliminary experiments on multilingual translation show that BigTrans performs comparably with ChatGPT and Google Translate in many languages and even outperforms ChatGPT in 8 language pairs. We release the BigTrans model and hope it can advance the research progress.