Abstract:Although the performance of Temporal Action Segmentation (TAS) has improved in recent years, achieving promising results often comes with a high computational cost due to dense inputs, complex model structures, and resource-intensive post-processing requirements. To improve efficiency while maintaining performance, we present a novel perspective centered on per-segment classification. By harnessing the capabilities of Transformers, we tokenize each video segment as an instance token, endowed with intrinsic instance segmentation. To realize efficient action segmentation, we introduce BaFormer, a boundary-aware Transformer network. It employs instance queries for instance segmentation and a global query for class-agnostic boundary prediction, yielding continuous segment proposals. During inference, BaFormer employs a simple yet effective voting strategy to classify boundary-wise segments based on instance segmentation. Remarkably, as a single-stage approach, BaFormer significantly reduces computational costs, using only 6% of the running time of the state-of-the-art method DiffAct, while producing better or comparable accuracy on several popular benchmarks. The code for this project is publicly available at https://github.com/peiyao-w/BaFormer.
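The voting step described above can be illustrated with a minimal sketch. It assumes the instance segmentation has already been reduced to one class id per frame and that the predicted boundaries are given as frame indices; both are simplifications, and the function name `vote_segments` is hypothetical rather than taken from the BaFormer code.

```python
from collections import Counter


def vote_segments(frame_labels, boundaries):
    """Give every boundary-delimited segment the majority label of its frames.

    frame_labels: per-frame class ids (assumed to come from instance masks).
    boundaries:   frame indices at which a new segment starts (assumption).
    """
    n = len(frame_labels)
    # Keep only valid, unique cut points and add the sequence start and end.
    cuts = [0] + sorted({b for b in boundaries if 0 < b < n}) + [n]
    voted = []
    for start, end in zip(cuts[:-1], cuts[1:]):
        # Majority vote over the frames inside this segment.
        winner = Counter(frame_labels[start:end]).most_common(1)[0][0]
        voted.extend([winner] * (end - start))
    return voted


# Toy input: noisy frame labels with one predicted boundary at frame 4.
print(vote_segments([0, 0, 1, 0, 2, 2, 1, 2], [4]))
```

In this toy input the isolated frames labeled 1 are outvoted inside their segments, so each segment ends up with a single label and no over-segmentation remains.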
Abstract:Deploying end-to-end speech recognition models with limited computing resources remains challenging, despite their impressive performance. Given the gradual increase in model size and the wide range of model applications, selectively executing model components for different inputs to improve the inference efficiency is of great interest. In this paper, we propose a dynamic layer-skipping method that leverages the CTC blank output from intermediate layers to trigger the skipping of the last few encoder layers for frames with high blank probabilities. Furthermore, we factorize the CTC output distribution and perform knowledge distillation on intermediate layers to reduce computation and improve recognition accuracy. Experimental results show that by utilizing the CTC blank, the encoder layer depth can be adjusted dynamically, resulting in 29% acceleration of the CTC model inference with minor performance degradation.
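The skipping rule can be sketched as follows. This is a toy illustration under strong assumptions: layers are treated as independent per-frame functions (real encoder layers attend across frames), the blank probability is supplied by a caller-provided function, and the threshold value of 0.9 is arbitrary. The name `encode_with_blank_skipping` and all of its parameters are hypothetical.

```python
def encode_with_blank_skipping(frames, lower_layers, upper_layers,
                               blank_prob, threshold=0.9):
    """Run the lower layers on all frames, then the upper layers only on
    frames whose intermediate CTC blank probability is below the threshold."""
    hidden = list(frames)
    for layer in lower_layers:
        hidden = [layer(h) for h in hidden]

    # Frames that the intermediate CTC output marks as confidently blank.
    skipped = [blank_prob(h) >= threshold for h in hidden]

    outputs = []
    for h, skip in zip(hidden, skipped):
        if not skip:
            for layer in upper_layers:
                h = layer(h)
        outputs.append(h)
    return outputs, skipped


# Toy usage: scalars stand in for frame features and double as blank scores.
outputs, skipped = encode_with_blank_skipping(
    frames=[0.25, 1.0, 0.5],
    lower_layers=[lambda h: h],
    upper_layers=[lambda h: h + 1.0],
    blank_prob=lambda h: h,
)
print(outputs, skipped)
```

The saving comes from the frames flagged in `skipped`, which never enter the upper layers.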
Abstract:Fully supervised action segmentation works on frame-wise action recognition with dense annotations and often suffers from the over-segmentation issue. Existing works have proposed a variety of solutions such as boundary-aware networks, multi-stage refinement, and temporal smoothness losses. However, most of them take advantage of frame-wise supervision, which cannot effectively tackle the evaluation metrics with different granularities. In this paper, for the desirable large receptive field, we first develop a novel local-global attention mechanism with temporal pyramid dilation and temporal pyramid pooling for efficient multi-scale attention. Then we decouple two inherent goals in action segmentation, i.e., (1) individual identification solved by frame-wise supervision, and (2) temporal reasoning tackled by action set prediction. Afterward, an action alignment module fuses these different granularity predictions, leading to more accurate and smoother action segmentation. We achieve state-of-the-art accuracy, e.g., 82.8% (+2.6%) on GTEA and 74.7% (+1.2%) on Breakfast, which demonstrates the effectiveness of our proposed method, accompanied by extensive ablation studies. The code will be made available later.
Abstract:Retrieval-based conversational systems learn to rank response candidates for a given dialogue context by computing the similarity between their vector representations. However, training on a single textual form of the multi-turn context limits the ability of a model to learn representations that generalize to natural perturbations seen during inference. In this paper, we propose a framework that incorporates augmented versions of a dialogue context into the learning objective. We utilize contrastive learning as an auxiliary objective to learn robust dialogue context representations that are invariant to perturbations injected through the augmentation method. We experiment with four benchmark dialogue datasets and demonstrate that our framework combines well with existing augmentation methods and can significantly improve over baseline BERT-based ranking architectures. Furthermore, we propose a novel data augmentation method, ConMix, that adds token-level perturbations through stochastic mixing of tokens from other contexts in the batch. We show that our proposed augmentation method outperforms previous data augmentation approaches, and provides dialogue representations that are more robust to common perturbations seen during inference.
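A rough sketch of in-batch token mixing in the spirit of ConMix follows. The abstract does not specify whether mixing happens on token ids or embeddings, nor the mixing rate, so this version simply replaces token strings with probability `mix_prob`; the function name `mix_tokens`, the default rate, and the seed argument are all assumptions made for illustration.

```python
import random


def mix_tokens(batch, mix_prob=0.15, seed=None):
    """Return a perturbed copy of `batch` (a list of token lists) in which
    each token is replaced, with probability `mix_prob`, by a token drawn
    from the other contexts in the same batch."""
    rng = random.Random(seed)
    mixed = []
    for i, context in enumerate(batch):
        # Candidate replacement tokens: everything outside the current context.
        donors = [tok for j, other in enumerate(batch) if j != i for tok in other]
        mixed.append([
            rng.choice(donors) if donors and rng.random() < mix_prob else tok
            for tok in context
        ])
    return mixed


batch = [["how", "are", "you"], ["book", "a", "flight"]]
augmented = mix_tokens(batch, mix_prob=0.3, seed=0)
```

The original and augmented contexts would then form the positive pair for the auxiliary contrastive objective.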
Abstract:This paper describes our RoyalFlush system for the multi-speaker automatic speech recognition (ASR) track of the M2MeT challenge. We adopted a serialized output training (SOT) based multi-speaker ASR system with large-scale simulation data. First, we investigated a set of front-end methods, including multi-channel weighted prediction error (WPE), beamforming, speech separation, and speech enhancement, to process the training, validation and test sets, but selected only WPE and beamforming as our front-end methods based on the experimental results. Second, we put considerable effort into data augmentation for multi-speaker ASR, mainly including adding noise and reverberation, overlapped speech simulation, multi-channel speech simulation, speed perturbation, and front-end processing, which brought a large performance improvement. Finally, to make full use of the complementarity of different model architectures, we trained a standard Conformer-based joint CTC/attention model (Conformer) and a U2++ ASR model with a bidirectional attention decoder, a modification of Conformer, and fused their results. Compared with the official baseline system, our system achieved a 12.22% absolute Character Error Rate (CER) reduction on the validation set and 12.11% on the test set.
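For readers unfamiliar with the metric, the sketch below computes a plain character error rate from the Levenshtein distance and shows what an absolute reduction means (a difference in percentage points). It handles a single reference and hypothesis and ignores the speaker-permutation handling that a multi-speaker evaluation may require; the function names are illustrative and not taken from the challenge toolkit.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def absolute_reduction(baseline_cer, system_cer):
    """Absolute CER reduction in percentage points."""
    return baseline_cer - system_cer


print(cer("abcd", "abed"))  # one substitution over four characters
```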
Abstract:Spatial Description Resolution, as a language-guided localization task, is proposed for target location in a panoramic street view, given corresponding language descriptions. Explicitly characterizing object-level relationships while distilling spatial relationships is currently absent from existing methods but crucial to this task. Mimicking humans, who sequentially traverse spatial relationship words and objects with a first-person view to locate their target, we propose a novel spatial relationship induced (SIRI) network. Specifically, visual features are first correlated at an implicit object level in a projected latent space; they are then distilled by each spatial relationship word, yielding a differently activated feature for each spatial relationship. Further, we introduce global position priors to compensate for the absence of positional information, which may otherwise result in ambiguities in global positional reasoning. The linguistic and visual features are then concatenated to finalize the target localization. Experimental results on Touchdown show that our method is around 24% better than the state-of-the-art method in terms of accuracy, measured by an 80-pixel radius. Our method also generalizes well on our proposed extended dataset collected using the same settings as Touchdown.
Abstract:Compressive sensing (CS), which aims to reconstruct an image/signal from a small set of random measurements, has attracted considerable attention in recent years. Due to the high dimensionality of images, previous CS methods mainly work on image blocks to avoid the huge requirements of memory and computation, i.e., image blocks are measured with Gaussian random matrices, and the whole images are recovered from the reconstructed image blocks. Though efficient, such methods suffer from serious blocking artifacts. In this paper, we propose a convolutional CS framework that senses the whole image using a set of convolutional filters. Instead of reconstructing individual blocks, the whole image is reconstructed from the linear convolutional measurements. Specifically, the convolutional CS is implemented based on a convolutional neural network (CNN), which performs both the convolutional CS and nonlinear reconstruction. Through end-to-end training, the sensing filters and the reconstruction network can be jointly optimized. To facilitate the design of the CS reconstruction network, a novel two-branch CNN inspired by a sparsity-based CS reconstruction model is developed. Experimental results show that the proposed method substantially outperforms previous state-of-the-art CS methods in terms of both PSNR and visual quality.
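The sensing step alone, that is, linear measurements obtained by sliding a few filters over the whole image with a stride, can be sketched in plain Python. The filter size, stride, and filter values here are arbitrary assumptions (in the paper the filters are learned end-to-end), and `conv_measure` is a hypothetical name; reconstruction is not shown.

```python
def conv_measure(image, filters, stride):
    """Strided 2D correlation of `image` with each square filter.

    image:   list of rows (H x W) of floats.
    filters: list of k x k filters (lists of rows).
    Returns one measurement map per filter.
    """
    h, w = len(image), len(image[0])
    k = len(filters[0])
    maps = []
    for f in filters:
        fmap = []
        for top in range(0, h - k + 1, stride):
            row = []
            for left in range(0, w - k + 1, stride):
                # One linear measurement: inner product of filter and patch.
                row.append(sum(f[i][j] * image[top + i][left + j]
                               for i in range(k) for j in range(k)))
            fmap.append(row)
        maps.append(fmap)
    return maps


image = [[1.0] * 8 for _ in range(8)]            # toy 8x8 image
filters = [[[1.0] * 4 for _ in range(4)],        # two toy 4x4 filters
           [[0.5] * 4 for _ in range(4)]]
maps = conv_measure(image, filters, stride=2)
num_measurements = sum(len(r) for m in maps for r in m)
print(num_measurements)  # 2 filters x 3 x 3 positions
```

With these toy settings, 64 pixels are reduced to 18 measurements. Because overlapping patches share filters across the whole image, there are no independent blocks whose seams could produce blocking artifacts.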
Abstract:Deep neural networks (DNNs) have shown very promising results for various image restoration (IR) tasks. However, the design of network architectures remains a major challenge for achieving further improvements. While most existing DNN-based methods solve the IR problems by directly mapping low-quality images to desirable high-quality images, the observation models characterizing the image degradation processes have been largely ignored. In this paper, we first propose a denoising-based IR algorithm, whose iterative steps can be computed efficiently. Then, the iterative process is unfolded into a deep neural network, which is composed of multiple denoiser modules interleaved with back-projection (BP) modules that ensure the observation consistencies. A convolutional neural network (CNN) based denoiser that can exploit the multi-scale redundancies of natural images is proposed. As such, the proposed network not only exploits the powerful denoising ability of DNNs, but also leverages the prior of the observation model. Through end-to-end training, both the denoisers and the BP modules can be jointly optimized. Experimental results on several IR tasks, including image denoising, super-resolution and deblurring, show that the proposed method achieves very competitive and often state-of-the-art results.
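The alternation of denoising and back-projection can be illustrated on a 1D toy problem. Everything specific here is an assumption made for illustration: the observation model is a sampling mask (an inpainting-style degradation), the "denoiser" is a fixed 3-tap average standing in for the learned CNN, and the stage order and count are arbitrary. The function names are hypothetical.

```python
def back_project(x, y, mask, step=1.0):
    """Pull the estimate toward the observations where they exist.

    For a masking operator A, the update x + step * A^T (y - A x) only
    changes the observed entries.
    """
    return [xi + step * (yi - xi) if m else xi
            for xi, yi, m in zip(x, y, mask)]


def smooth(x):
    """Stand-in denoiser: 3-tap moving average with edge replication."""
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3.0
            for i in range(n)]


def unfolded_restore(y, mask, num_stages=5):
    """Unfold a fixed number of denoise + back-projection stages."""
    x = [yi if m else 0.0 for yi, m in zip(y, mask)]
    for _ in range(num_stages):
        x = smooth(x)                  # denoiser module
        x = back_project(x, y, mask)   # BP module: observation consistency
    return x


y = [1.0, 0.0, 3.0, 0.0, 5.0]
mask = [True, False, True, False, True]
restored = unfolded_restore(y, mask)
```

After each stage the observed entries agree with `y` again, while the missing entries are filled in by the denoiser, which is the division of labor the abstract describes.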