Abstract:The Agriculture-Vision Challenge at CVPR 2024 aims at leveraging semantic segmentation models to produce pixel-level semantic segmentation labels within regions of interest for multi-modality satellite images. It is one of the most prominent and competitive challenges bridging the computer vision and agriculture communities. However, the Agriculture-Vision dataset suffers from a severe class imbalance, which hinders semantic segmentation performance. To address this problem, we first propose a mosaic data augmentation with a rare-class sampling strategy to enrich long-tail class samples. Second, we employ an adaptive class-weight scheme to suppress the contribution of the common classes while boosting that of the rare classes. Third, we propose a probability post-processing step to increase the predicted values of the rare classes. Our method achieved a mean Intersection over Union (mIoU) of 0.547 on the test set, securing second place in the challenge.
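The abstract does not specify the exact adaptive class-weight scheme; one common realization for class-imbalanced segmentation is inverse log-frequency weighting, sketched below. The `smooth` constant and the toy pixel counts are illustrative assumptions, not values from the paper.

```python
import numpy as np

def adaptive_class_weights(pixel_counts, smooth=1.02):
    """Inverse log-frequency weights (illustrative): common classes are
    down-weighted, rare classes are up-weighted."""
    freqs = pixel_counts / pixel_counts.sum()
    weights = 1.0 / np.log(smooth + freqs)
    return weights / weights.mean()  # normalize so the average weight is 1

counts = np.array([9_000_000, 500_000, 20_000, 5_000])  # toy per-class pixel counts
w = adaptive_class_weights(counts)
```

Such a weight vector would typically be passed to a weighted cross-entropy loss so that gradients from rare-class pixels carry more influence.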
Abstract:Combining millimetre-wave (mmWave) communications with an extremely large-scale antenna array (ELAA) presents a promising avenue for meeting the spectral efficiency demands of future sixth-generation (6G) mobile communications. However, beam training for mmWave ELAA systems suffers from excessive pilot overhead as well as insufficient accuracy, since the huge near-field codebook has to be accounted for. In this paper, inspired by the similarity between far-field sub-6 GHz channels and near-field mmWave channels, we propose to leverage sub-6 GHz uplink pilot signals to directly estimate the optimal near-field mmWave codeword, thereby reducing the pilot overhead and bypassing explicit channel estimation. Moreover, we adopt deep learning to perform this dual mapping, i.e., from sub-6 GHz to mmWave and from far field to near field, and design a novel neural network structure called NMBEnet to enhance the precision of beam training. Specifically, in orthogonal frequency division multiplexing (OFDM) scenarios with high user density, correlations arise both between signals from different users and between signals from different subcarriers. The convolutional neural network (CNN) module and the graph neural network (GNN) module of the proposed NMBEnet exploit these two correlations to further enhance the precision of beam training.
Abstract:Extremely large-scale multiple-input multiple-output (XL-MIMO) systems are capable of improving spectral efficiency by employing far more antennas than conventional massive MIMO at the base station (BS). However, beam training in multiuser XL-MIMO systems is challenging. To tackle this issue, we conceive a three-phase graph neural network (GNN)-based beam training scheme for multiuser XL-MIMO systems. In the first phase, only far-field wide beams have to be tested for each user, and the GNN is utilized to map the beamforming gain information of the far-field wide beams to the optimal near-field beam for each user. In addition, the proposed GNN-based scheme can exploit the position correlation between adjacent users to further improve the accuracy of beam training. In the second phase, a beam allocation scheme based on the probability vectors produced at the outputs of the GNNs is proposed to resolve potential beam-direction conflicts between users. In the third phase, the hybrid transmit beamforming (TBF) is designed to further reduce the inter-user interference. Our simulation results show that the proposed scheme outperforms the benchmarks in beam training performance. Moreover, the performance of the proposed scheme approaches that of an exhaustive search, despite requiring only about 7% of the pilot overhead.
Abstract:This paper presents a French text-to-speech synthesis system for the Blizzard Challenge 2023. The challenge consists of two tasks: generating high-quality speech from female speakers (the Hub task) and generating speech that closely resembles specific individuals (the Spoke task). Regarding the competition data, we conducted a screening process to remove missing or erroneous text data. We normalized all symbols except for phonemes and eliminated symbols that had no pronunciation or zero duration. Additionally, we added word-boundary and start/end symbols to the text, which we have found to improve speech quality in our previous experience. For the Spoke task, we performed data augmentation in accordance with the competition rules. We used an open-source G2P model to transcribe the French texts into phonemes. As the G2P model uses the International Phonetic Alphabet (IPA), we applied the same transcription process to the provided competition data for standardization. However, because our compiler could not recognize certain special symbols from the IPA chart, we converted all phonemes into the phonetic scheme used in the competition data according to the rules. Finally, we resampled all competition audio to a uniform sampling rate of 16 kHz. We employed a VITS-based acoustic model with a HiFi-GAN vocoder. For the Spoke task, we trained a multi-speaker model and incorporated speaker information into the duration predictor, vocoder, and flow layers of the model. The evaluation results showed a quality MOS of 3.6 for the Hub task and 3.4 for the Spoke task, placing our system at an average level among all participating teams.
Abstract:Most existing studies on massive grant-free access, proposed to support massive machine-type communications (mMTC) for the Internet of things (IoT), assume Rayleigh fading and perfect synchronization for simplicity. However, in practice, line-of-sight (LoS) components generally exist, and time and frequency synchronization are usually imperfect. This paper systematically investigates maximum likelihood estimation (MLE)-based device activity detection under Rician fading for massive grant-free access with perfect and imperfect synchronization. Specifically, we formulate device activity detection in the synchronous case and joint device activity and offset detection in three asynchronous cases (i.e., time, frequency, and time and frequency asynchronous cases) as MLE problems. In the synchronous case, we propose an iterative algorithm to obtain a stationary point of the MLE problem. In each asynchronous case, we propose two iterative algorithms with identical detection performance but different computational complexities. In particular, one is computationally efficient for small ranges of offsets, whereas the other one, relying on fast Fourier transform (FFT) and inverse FFT, is computationally efficient for large ranges of offsets. The proposed algorithms generalize the existing MLE-based methods for Rayleigh fading and perfect synchronization. Numerical results show the notable gains of the proposed algorithms over existing methods in detection accuracy and computation time.
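The FFT/IFFT-based search mentioned in this abstract can be illustrated with a minimal sketch (not the paper's exact algorithm): a circular cross-correlation computed via FFT locates a time offset in O(N log N) regardless of the offset range, which is why the FFT-based variant wins for large offset ranges. The pilot length, offset, and noise level below are illustrative assumptions.

```python
import numpy as np

def estimate_time_offset(pilot, received):
    """Circular cross-correlation via FFT/IFFT; the lag that maximizes the
    correlation magnitude is the time-offset estimate."""
    corr = np.fft.ifft(np.fft.fft(received) * np.conj(np.fft.fft(pilot)))
    return int(np.argmax(np.abs(corr)))

rng = np.random.default_rng(0)
pilot = rng.standard_normal(64) + 1j * rng.standard_normal(64)
# Received signal: pilot circularly delayed by 7 samples plus weak noise
rx = np.roll(pilot, 7) + 0.01 * (rng.standard_normal(64) + 1j * rng.standard_normal(64))
offset = estimate_time_offset(pilot, rx)  # -> 7
```

A brute-force search over candidate offsets would instead cost O(N) per candidate, which is the trade-off the abstract refers to.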
Abstract:Extremely large-scale reconfigurable intelligent surface (XL-RIS) has recently been proposed and is recognized as a promising technology that can further enhance the capacity of communication systems and compensate for severe path loss. However, the pilot overhead of beam training in XL-RIS-assisted wireless communication systems is enormous because the near-field channel model needs to be taken into account, and the number of candidate codewords in the codebook increases dramatically as a result. To tackle this problem, we propose two deep learning-based near-field beam training schemes for XL-RIS-assisted communication systems, in which deep residual networks are employed to determine the optimal near-field RIS codeword. Specifically, we first propose a far-field beam-based beam training (FBT) scheme in which the received signals of all far-field RIS codewords are fed into the neural network to estimate the optimal near-field RIS codeword. To further reduce the pilot overhead, a partial near-field beam-based beam training (PNBT) scheme is proposed, where only the received signals corresponding to a subset of the near-field XL-RIS codewords serve as input to the neural network. Moreover, we propose an improved PNBT scheme that enhances beam training performance by fully exploiting the neural network's output. Finally, simulation results show that the proposed schemes outperform existing beam training schemes and can reduce the beam sweeping overhead by approximately 95%.
Abstract:Extremely large-scale massive multiple-input multiple-output (XL-MIMO) is regarded as a promising technology for next-generation communication systems. To enhance beamforming gains, codebook-based beam training is widely adopted in XL-MIMO systems. However, in XL-MIMO systems the near-field region expands, so a near-field codebook has to be adopted for beam training, which significantly increases the pilot overhead. To tackle this problem, we propose a deep learning-based beam training scheme in which the near-field channel model and the near-field codebook are considered. Specifically, we first utilize the received signals corresponding to far-field wide beams to estimate the optimal near-field beam. Two training schemes are proposed, namely an original scheme and an improved scheme. The original scheme estimates the optimal near-field codeword directly from the output of the neural network. By contrast, the improved scheme performs additional beam testing, which significantly improves the beam training performance. Finally, simulation results show that the proposed schemes significantly reduce the training overhead in the near-field domain while still achieving beamforming gains.
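The role of the "additional beam testing" in the improved scheme can be sketched as follows. This is a hedged illustration, not the paper's implementation: the codebook size is arbitrary, `scores` stands in for the neural network's per-codeword output, and `gains` stands in for beamforming gains that would in practice be measured over the air with extra pilots.

```python
import numpy as np

def select_beam(scores, measured_gains, k=4):
    """Instead of trusting the network's top-1 codeword, take its top-k
    candidates and test them, keeping the one with the highest measured
    beamforming gain."""
    topk = np.argsort(scores)[-k:]                    # k most likely near-field codewords
    return int(topk[np.argmax(measured_gains[topk])]) # extra beam test among candidates

rng = np.random.default_rng(1)
scores = rng.random(256)  # hypothetical network outputs over a 256-codeword codebook
gains = rng.random(256)   # hypothetical gains obtained from additional pilot tests
best = select_beam(scores, gains)
```

The trade-off is k extra pilot transmissions in exchange for robustness against the network mis-ranking the single best codeword.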
Abstract:Visual Odometry (VO), which estimates the position and orientation of a moving object by analyzing the image sequences captured by on-board cameras, has been well investigated with the rising interest in autonomous driving. This paper studies monocular VO from the perspective of Deep Learning (DL). Unlike most current learning-based methods, our approach, called DeepAVO, is built on the intuition that features contribute discriminatively to different motion patterns. Specifically, we present a novel four-branch network that learns rotation and translation by leveraging Convolutional Neural Networks (CNNs) to focus on different quadrants of the optical flow input. To enhance feature selection, we further introduce an effective channel-spatial attention mechanism that forces each branch to explicitly distill the information relevant to its specific Frame-to-Frame (F2F) motion estimate. Experiments on various datasets covering outdoor driving and indoor walking scenarios show that the proposed DeepAVO outperforms state-of-the-art monocular methods by a large margin, demonstrates performance competitive with stereo VO algorithms, and shows promising potential for generalization.
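The quadrant-wise split of the optical-flow input that feeds the four branches can be sketched as below. The array shape is illustrative; DeepAVO's exact preprocessing and input resolution may differ.

```python
import numpy as np

def split_quadrants(flow):
    """Split an (H, W, 2) optical-flow field into four quadrant crops,
    one per branch of a four-branch network."""
    h, w = flow.shape[0] // 2, flow.shape[1] // 2
    return [flow[:h, :w], flow[:h, w:], flow[h:, :w], flow[h:, w:]]

flow = np.zeros((128, 384, 2), dtype=np.float32)  # toy dense flow map (u, v per pixel)
branches = split_quadrants(flow)                   # four (64, 192, 2) crops
```

Each crop would then pass through its own CNN branch, so that motion cues that manifest differently across quadrants (e.g. rotation vs. forward translation) are learned separately.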
Abstract:This paper reviews the first NTIRE challenge on quality enhancement of compressed video, with a focus on the proposed methods and results. In this challenge, the new Large-scale Diverse Video (LDV) dataset is employed. The challenge has three tracks. Tracks 1 and 2 aim at enhancing videos compressed by HEVC at a fixed QP, while Track 3 is designed for enhancing videos compressed by x265 at a fixed bit-rate. In addition, Tracks 1 and 3 target improving the fidelity (PSNR), while Track 2 targets enhancing the perceptual quality. The three tracks attracted a total of 482 registrations. In the test phase, 12, 8, and 11 teams submitted final results for Tracks 1, 2, and 3, respectively. The proposed methods and solutions gauge the state of the art of video quality enhancement. The homepage of the challenge: https://github.com/RenYang-home/NTIRE21_VEnh
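The fidelity metric used by Tracks 1 and 3, PSNR, is standard and can be computed as follows; the tiny 4x4 frames are illustrative inputs.

```python
import numpy as np

def psnr(ref, dist, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference frame and a
    distorted/enhanced frame, for 8-bit imagery (peak = 255)."""
    mse = np.mean((ref.astype(np.float64) - dist.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

ref = np.full((4, 4), 100, dtype=np.uint8)
dist = np.full((4, 4), 110, dtype=np.uint8)
val = psnr(ref, dist)  # MSE = 100 -> 10*log10(255^2 / 100) ~ 28.13 dB
```

Enhancement methods in the fidelity tracks are ranked by the PSNR gain of the enhanced frames over the compressed input, averaged over the test videos.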