Abstract:Speech enhancement aims to improve speech quality and intelligibility in noisy environments. Recent advancements have concentrated on deep neural networks, particularly employing the Two-Stage (TS) architecture to enhance feature extraction. However, the complexity and size of these models remain significant, which limits their applicability in resource-constrained scenarios. Designing models suitable for edge devices presents its own set of challenges. Narrow lightweight models often encounter performance bottlenecks due to uneven loss landscapes. Additionally, advanced operators such as Transformers or Mamba may lack the practical adaptability and efficiency that convolutional neural networks (CNNs) offer in real-world deployments. To address these challenges, we propose Dense-TSNet, an ultra-lightweight speech enhancement network. Our approach employs a novel Dense Two-Stage (Dense-TS) architecture, which, compared to the classic Two-Stage architecture, ensures more robust refinement of the objective function in the later training stages. This leads to improved final performance, addressing the early convergence limitations of the baseline model. We also introduce the Multi-View Gaze Block (MVGB), which enhances feature extraction by incorporating global, channel, and local perspectives through CNNs. Furthermore, we discuss how the choice of loss function impacts perceptual quality. Dense-TSNet demonstrates promising performance with a compact model size of around 14K parameters, making it particularly well-suited for deployment in resource-constrained environments.
Abstract:Purpose: To develop a simultaneous multislice (SMS) first-pass perfusion technique that can achieve whole heart coverage with high multiband factors, while avoiding the issue of slice leakage. Methods: The proposed Simultaneous Multislice Imaging via Linear phase modulated Extended field of view (SMILE) treats the SMS acquisition and reconstruction within an extended field of view framework, allowing arbitrary under-sampling of phase encoding lines of the extended k-space matrix and enabling the direct application of 2D parallel imaging reconstruction techniques. We present a theoretical framework that offers insights into the performance of SMILE. We performed a retrospective comparison on 28 subjects and prospective perfusion experiments on 49 patients undergoing routine clinical CMR studies with SMILE at multiband (MB) factors of 3-5, with a total acceleration factor ($R$) of 8 and 10 respectively, and compared SMILE to conventional SMS techniques using standard FOV 2D CAIPI acquisition and standard 2D slice separation techniques including split-slice GRAPPA and ROCK-SPIRiT. Results: Retrospective studies demonstrated a 5.2 to 8.0 dB improvement in signal to error ratio (SER) of SMILE over CAIPI perfusion. Prospective studies showed good image quality with grades of 4.5 $\pm$ 0.5 for MB=3, $R$=8 and 3.6 $\pm$ 0.8 for MB=5, $R$=10 on a 5-point Likert scale. Conclusion: The theoretical derivation and experimental results validate SMILE's improved performance at high acceleration and MB factors compared to the existing 2D CAIPI SMS acquisition and reconstruction techniques for first-pass myocardial perfusion imaging.
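The extended field of view idea can be illustrated with a toy 1D example. This is a minimal sketch under stated assumptions, not the authors' implementation: it is fully sampled, noise-free, uses a single coil, and the function names `sms_extended_kspace` and `inverse_dft` are hypothetical. Each slice receives a linear phase ramp across the phase encoding lines, and the summed signal is then identical to the k-space of one image in which the slices sit side by side, so an ordinary inverse DFT separates them.

```python
import cmath

def sms_extended_kspace(slices):
    """Simulate phase-modulated SMS samples on the extended k-space grid.

    slices: list of MB equally long 1D profiles (one per excited slice).
    """
    mb, n = len(slices), len(slices[0])
    ext = mb * n  # number of phase encoding lines in the extended matrix
    signal = []
    for k in range(ext):
        total = 0j
        for b, profile in enumerate(slices):
            # Linear phase modulation: slice b advances by 2*pi*b/MB per line.
            modulation = cmath.exp(-2j * cmath.pi * k * b / mb)
            for pos, value in enumerate(profile):
                # Gradient encoding on the finer k-space spacing of the extended FOV.
                encoding = cmath.exp(-2j * cmath.pi * k * pos / ext)
                total += value * modulation * encoding
        signal.append(total)
    return signal

def inverse_dft(kspace):
    """Plain inverse DFT, enough for a small demonstration."""
    n = len(kspace)
    return [
        sum(kspace[k] * cmath.exp(2j * cmath.pi * k * p / n) for k in range(n)) / n
        for p in range(n)
    ]

if __name__ == "__main__":
    two_slices = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    recovered = inverse_dft(sms_extended_kspace(two_slices))
    # The slices appear concatenated along the extended field of view.
    print([round(v.real, 6) for v in recovered])
```

In practice the extended k-space is under-sampled and the missing lines are recovered with parallel imaging, which this sketch does not attempt.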
Abstract:The behavior decision-making subsystem is a key component of the autonomous driving system: it reflects the decision-making ability of the vehicle and the driver and is an important indicator of the vehicle's high-level intelligence. However, existing rule-based decision-making schemes are limited by the prior knowledge of their designers and struggle to cope with complex and changeable traffic scenarios. In this work, we adopt a deep reinforcement learning model that autonomously learns and optimizes driving strategies in complex and changeable traffic environments by modeling the driving decision-making process as a reinforcement learning problem. Specifically, we use Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) for comparative experiments. DQN guides the agent to choose the best action by approximating the state-action value function, while PPO improves decision-making quality by optimizing the policy function. We also introduce improvements in the design of the reward function to promote the robustness and adaptability of the model in real-world driving situations. Experimental results show that the decision-making strategy based on deep reinforcement learning performs better than the traditional rule-based method in a variety of driving tasks.
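The two learning signals contrasted above can be written down compactly. The following is a minimal sketch of the standard textbook forms, not the configuration used in this work: the discount factor `gamma`, the clipping range `clip_eps`, and the function names are illustrative assumptions.

```python
def dqn_td_target(reward, next_q_values, done, gamma=0.99):
    """One-step target for Q(s, a): r + gamma * max over a' of Q(s', a')."""
    if done:
        # No bootstrapping from a terminal state.
        return reward
    return reward + gamma * max(next_q_values)

def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one sample.

    ratio is pi_new(a|s) / pi_old(a|s); the objective is
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

if __name__ == "__main__":
    print(dqn_td_target(1.0, [0.5, 2.0], done=False, gamma=0.9))
    print(ppo_clipped_objective(1.5, 2.0))
```

DQN regresses its value estimate toward the target, while PPO maximizes the clipped objective averaged over sampled transitions; the clipping limits how far a single update can move the policy.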
Abstract:Achieving a balance between lightweight design and high performance remains a challenging task for speech enhancement. In this paper, we introduce Multi-path Enhanced Taylor (MET) Transformer based U-net for Speech Enhancement (MUSE), a lightweight speech enhancement network built upon the U-net architecture. Our approach incorporates a novel MET Transformer block, which integrates Deformable Embedding (DE) to enable flexible receptive fields for voiceprints. The MET Transformer is designed to fuse Channel and Spatial Attention (CSA) branches, facilitating channel information exchange and addressing spatial attention deficits within the Taylor-Transformer framework. Through extensive experiments conducted on the VoiceBank+DEMAND dataset, we demonstrate that MUSE achieves competitive performance while significantly reducing both training and deployment costs, with only 0.51M parameters.
Abstract:Current speech enhancement (SE) research has largely neglected channel attention and spatial attention, and encoder-decoder architecture-based networks have not adequately considered how to provide efficient inputs to the intermediate enhancement layer. To address these issues, this paper proposes a time-frequency (T-F) domain SE network (DPCFCS-Net) that incorporates improved densely connected blocks, dual-path modules, convolution-augmented transformers (conformers), channel attention, and spatial attention. Compared with previous models, our proposed model has a more efficient encoder-decoder and can learn comprehensive features. Experimental results on the VCTK+DEMAND dataset demonstrate that our method outperforms existing techniques in SE performance. Furthermore, the improved densely connected block and the two-dimensional attention module developed in this work are highly adaptable and easily integrated into existing networks.
Abstract:We present an efficient speech separation neural network, ARFDCN, which combines dilated convolutions, multi-scale fusion (MSF), and channel attention to overcome the limited receptive field of convolution-based networks and the high computational cost of transformer-based networks. The proposed network uses an encoder-decoder architecture. By using dilated convolutions with gradually increasing dilation values to learn local and global features and fusing them at adjacent stages, the model can learn rich feature content. Meanwhile, by adding channel attention modules to the network, the model can extract channel weights, learn more important features, and thus improve its expressive power and robustness. Experimental results indicate that the model achieves a good balance between performance and computational efficiency, making it a promising alternative to current mainstream models for practical applications.
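To see why gradually increasing dilation widens the receptive field, consider a stack of stride-1 1D convolutions, where each layer adds (kernel size - 1) x dilation samples of context. The sketch below applies that standard formula; the kernel size of 3 and the dilation schedule 1, 2, 4, 8 are illustrative assumptions, not the ARFDCN configuration.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of stacked stride-1 dilated 1D convolutions."""
    field = 1
    for dilation in dilations:
        # Each layer extends the context by (kernel_size - 1) * dilation samples.
        field += (kernel_size - 1) * dilation
    return field

if __name__ == "__main__":
    print(receptive_field(3, [1, 2, 4, 8]))  # growing dilation
    print(receptive_field(3, [1, 1, 1, 1]))  # same depth, no dilation
```

With the same four layers and the same number of weights, the dilated stack covers 31 samples against 9 for the undilated one, which is the trade-off the abstract relies on.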
Abstract:We present RND-SCI, a novel framework for compressive hyperspectral image (HSI) reconstruction. Our framework decomposes the reconstructed object into range-space and null-space components, where the range-space part ensures the solution conforms to the compression process, and the null-space term introduces a deep HSI prior to constrain the output to have satisfactory properties. RND-SCI is not only simple in design and highly interpretable but can also be easily adapted to various HSI reconstruction networks, improving the quality of HSIs with minimal computational overhead. RND-SCI significantly boosts the performance of HSI reconstruction networks, whether by retraining, fine-tuning, or plugging into a pre-trained off-the-shelf model. Based on the framework and SAUNet, we design an extremely fast HSI reconstruction network, RND-SAUNet, which achieves 91 frames per second while maintaining superior reconstruction accuracy compared to other less time-consuming methods. Code and models are available at https://github.com/hustvl/RND-SCI.
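The range-null space decomposition can be checked on a toy linear system. The sketch below uses the generic identity x = A⁺y + (I − A⁺A)x_prior, so the result reproduces the measurement y regardless of what the prior estimate is. It is an illustration under stated assumptions, not the RND-SCI code: the matrix `A` is a made-up operator whose rows are mutually orthogonal (so that A Aᵀ is diagonal and the pseudo-inverse is cheap), and `x_prior` stands in for the output of a reconstruction network.

```python
def matvec(matrix, vector):
    """Matrix-vector product for lists of lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def range_null_combine(A, y, x_prior):
    """Return A^+ y + (I - A^+ A) x_prior.

    Assumes the rows of A are mutually orthogonal and non-zero,
    so A A^T is diagonal and A^+ = A^T (A A^T)^-1.
    """
    A_t = transpose(A)
    row_energy = [sum(a * a for a in row) for row in A]  # diagonal of A A^T

    def pinv_apply(v):
        return matvec(A_t, [vi / e for vi, e in zip(v, row_energy)])

    range_part = pinv_apply(y)  # consistent with the measurement
    projected = pinv_apply(matvec(A, x_prior))
    null_part = [xp - p for xp, p in zip(x_prior, projected)]  # invisible to A
    return [r + n for r, n in zip(range_part, null_part)]

if __name__ == "__main__":
    A = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]  # toy compression operator
    y = [4.0, 10.0]                                    # measurement
    x_prior = [1.0, 2.0, 3.0, 4.0]                     # stand-in for a network output
    x = range_null_combine(A, y, x_prior)
    print(x, matvec(A, x))
```

The null-space term changes the reconstruction without changing A x, which is why a learned prior can be added there while data consistency is kept exactly.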
Abstract:We present a simple, efficient, and scalable unfolding network, SAUNet, to simplify the network design with an adaptive alternate optimization framework for hyperspectral image (HSI) reconstruction. SAUNet customizes a Residual Adaptive ADMM Framework (R2ADMM) to connect each stage of the network via a group of learnable parameters to promote the usage of mask prior, which greatly stabilizes training and solves the accuracy degradation issue. Additionally, we introduce a simple convolutional modulation block (CMB), which leads to efficient training, easy scale-up, and less computation. Coupling these two designs, SAUNet can be scaled to non-trivial 13 stages with continuous improvement. Without bells and whistles, SAUNet improves both performance and speed compared with the previous state-of-the-art counterparts, which makes it feasible for practical high-resolution HSI reconstruction scenarios. We set new records on CAVE and KAIST HSI reconstruction benchmarks. Code and models are available at https://github.com/hustvl/SAUNet.