Chinese Academy of Sciences
Abstract:Simultaneous Speech Translation (SimulST) involves generating target language text while continuously processing streaming speech input, presenting significant real-time challenges. Multi-task learning is often employed to enhance SimulST performance but introduces optimization conflicts between primary and auxiliary tasks, potentially compromising overall efficiency. The existing model-level conflict resolution methods are not well-suited for this task which exacerbates inefficiencies and leads to high GPU memory consumption. To address these challenges, we propose a Modular Gradient Conflict Mitigation (MGCM) strategy that detects conflicts at a finer-grained modular level and resolves them utilizing gradient projection. Experimental results demonstrate that MGCM significantly improves SimulST performance, particularly under medium and high latency conditions, achieving a 0.68 BLEU score gain in offline tasks. Additionally, MGCM reduces GPU memory consumption by over 95\% compared to other conflict mitigation methods, establishing it as a robust solution for SimulST tasks.
Abstract:Speech-to-text (S2T) generation systems frequently face challenges in low-resource scenarios, primarily due to the lack of extensive labeled datasets. One emerging solution is constructing virtual training samples by interpolating inputs and labels, which has notably enhanced system generalization in other domains. Despite its potential, this technique's application in S2T tasks has remained under-explored. In this paper, we delve into the utility of interpolation augmentation, guided by several pivotal questions. Our findings reveal that employing an appropriate strategy in interpolation augmentation significantly enhances performance across diverse tasks, architectures, and data scales, offering a promising avenue for more robust S2T systems in resource-constrained settings.
Abstract:Decision making demands intricate interplay between perception, memory, and reasoning to discern optimal policies. Conventional approaches to decision making face challenges related to low sample efficiency and poor generalization. In contrast, foundation models in language and vision have showcased rapid adaptation to diverse new tasks. Therefore, we advocate for the construction of foundation agents as a transformative shift in the learning paradigm of agents. This proposal is underpinned by the formulation of foundation agents with their fundamental characteristics and challenges motivated by the success of large language models (LLMs). Moreover, we specify the roadmap of foundation agents from large interactive data collection or generation, to self-supervised pretraining and adaptation, and knowledge and value alignment with LLMs. Lastly, we pinpoint critical research questions derived from the formulation and delineate trends for foundation agents supported by real-world use cases, addressing both technical and theoretical aspects to propel the field towards a more comprehensive and impactful future.
Abstract:Decision-making is a dynamic process requiring perception, memory, and reasoning to make choices and find optimal policies. Traditional approaches to decision-making suffer from sample efficiency and generalization, while large-scale self-supervised pretraining has enabled fast adaptation with fine-tuning or few-shot learning in language and vision. We thus argue to integrate knowledge acquired from generic large-scale self-supervised pretraining into downstream decision-making problems. We propose Pretrain-Then-Adapt pipeline and survey recent work on data collection, pretraining objectives and adaptation strategies for decision-making pretraining and downstream inference. Finally, we identify critical challenges and future directions for developing decision foundation model with the help of generic and flexible self-supervised pretraining.
Abstract:Continual learning addresses the problem of continuously acquiring and transferring knowledge without catastrophic forgetting of old concepts. While humans achieve continual learning via diverse neurocognitive mechanisms, there is a mismatch between cognitive properties and evaluation methods of continual learning models. First, the measurement of continual learning models mostly relies on evaluation metrics at a micro-level, which cannot characterize cognitive capacities of the model. Second, the measurement is method-specific, emphasizing model strengths in one aspect while obscuring potential weaknesses in other respects. To address these issues, we propose to integrate model cognitive capacities and evaluation metrics into a unified evaluation paradigm. We first characterize model capacities via desiderata derived from cognitive properties supporting human continual learning. The desiderata concern (1) adaptability in varying lengths of task sequence; (2) sensitivity to dynamic task variations; and (3) efficiency in memory usage and training time consumption. Then we design evaluation protocols for each desideratum to assess cognitive capacities of recent continual learning models. Experimental results show that no method we consider has satisfied all the desiderata and is still far away from realizing truly continual learning. Although some methods exhibit some degree of adaptability and efficiency, no method is able to identify task relationships when encountering dynamic task variations, or achieve a trade-off in learning similarities and differences between tasks. Inspired by these results, we discuss possible factors that influence model performance in these desiderata and provide guidance for the improvement of continual learning models.
Abstract:With the increasing demands in communication, Wi-Fi technology is advancing towards its next generation. Building on the foundation of Wi-Fi 7, millimeter-wave technology is anticipated to converge with Wi-Fi 8 in the near future. In this paper, we look into the millimeter-wave technology and other potential feasible features, providing a comprehensive perspective on the future of Wi-Fi 8. Our simulation results demonstrate that significant performance gains can be achieved, even in the presence of hardware impairments.
Abstract:While the pace of commercial scale application of Wi-Fi 6 accelerates, the IEEE 802.11 Working Group is about to complete the development of a new amendment standard IEEE 802.11be -- Extremely High Throughput (EHT), also known as Wi-Fi 7, which can be used to meet the demand for the throughput of 4K/8K videos up to tens of Gbps and low-latency video applications such as virtual reality (VR) and augmented reality (AR). Wi-Fi 7 not only scales Wi-Fi 6 with doubled bandwidth, but also supports real-time applications, which brings revolutionary changes to Wi-Fi. In this article, we start by introducing the main objectives and timeline of Wi-Fi 7 and then list the latest key techniques which promote the performance improvement of Wi-Fi 7. Finally, we validate the most critical objectives of Wi-Fi 7 -- the potential up to 30 Gbps throughput and lower latency. System-level simulation results suggest that by combining the new techniques, Wi-Fi 7 achieves 30 Gbps throughput and lower latency than Wi-Fi 6.
Abstract:In this study, we present synchronous bilingual Connectionist Temporal Classification (CTC), an innovative framework that leverages dual CTC to bridge the gaps of both modality and language in the speech translation (ST) task. Utilizing transcript and translation as concurrent objectives for CTC, our model bridges the gap between audio and text as well as between source and target languages. Building upon the recent advances in CTC application, we develop an enhanced variant, BiL-CTC+, that establishes new state-of-the-art performances on the MuST-C ST benchmarks under resource-constrained scenarios. Intriguingly, our method also yields significant improvements in speech recognition performance, revealing the effect of cross-lingual learning on transcription and demonstrating its broad applicability. The source code is available at https://github.com/xuchennlp/S2T.
Abstract:Combining end-to-end speech translation (ST) and non-autoregressive (NAR) generation is promising in language and speech processing for their advantages of less error propagation and low latency. In this paper, we investigate the potential of connectionist temporal classification (CTC) for non-autoregressive speech translation (NAST). In particular, we develop a model consisting of two encoders that are guided by CTC to predict the source and target texts, respectively. Introducing CTC into NAST on both language sides has obvious challenges: 1) the conditional independent generation somewhat breaks the interdependency among tokens, and 2) the monotonic alignment assumption in standard CTC does not hold in translation tasks. In response, we develop a prediction-aware encoding approach and a cross-layer attention approach to address these issues. We also use curriculum learning to improve convergence of training. Experiments on the MuST-C ST benchmarks show that our NAST model achieves an average BLEU score of 29.5 with a speed-up of 5.67$\times$, which is comparable to the autoregressive counterpart and even outperforms the previous best result of 0.9 BLEU points.
Abstract:While Transformer has become the de-facto standard for speech, modeling upon the fine-grained frame-level features remains an open challenge of capturing long-distance dependencies and distributing the attention weights. We propose \textit{Progressive Down-Sampling} (PDS) which gradually compresses the acoustic features into coarser-grained units containing more complete semantic information, like text-level representation. In addition, we develop a representation fusion method to alleviate information loss that occurs inevitably during high compression. In this way, we compress the acoustic features into 1/32 of the initial length while achieving better or comparable performances on the speech recognition task. And as a bonus, it yields inference speedups ranging from 1.20$\times$ to 1.47$\times$. By reducing the modeling burden, we also achieve competitive results when training on the more challenging speech translation task.