Abstract:Most current federated learning frameworks are modeled as static processes, ignoring the dynamic characteristics of the learning system. Under the limited communication budget of the central server, the flexible model architecture of a large number of clients participating in knowledge transfer requires a lower participation rate, active clients have uneven contributions, and the client scale seriously hinders the performance of FL. We consider a more general and practical federation scenario and propose a system heterogeneous federation method based on data-free knowledge distillation and two-way contrast (HFedCKD). We apply the Inverse Probability Weighted Distillation (IPWD) strategy to the data-free knowledge transfer framework. The generator completes the data features of the nonparticipating clients. IPWD implements a dynamic evaluation of the prediction contribution of each client under different data distributions. Based on the antibiased weighting of its prediction loss, the weight distribution of each client is effectively adjusted to fairly integrate the knowledge of participating clients. At the same time, the local model is split into a feature extractor and a classifier. Through differential contrast learning, the feature extractor is aligned with the global model in the feature space, while the classifier maintains personalized decision-making capabilities. HFedCKD effectively alleviates the knowledge offset caused by a low participation rate under data-free knowledge distillation and improves the performance and stability of the model. We conduct extensive experiments on image and IoT datasets to comprehensively evaluate and verify the generalization and robustness of the proposed HFedCKD framework.
Abstract:The intersection of medical imaging and artificial intelligence has become an important research direction in intelligent medical treatment, particularly in the analysis of medical images using deep learning for clinical diagnosis. Despite the advances, existing keyframe classification methods lack extraction of time series features, while ultrasonic video classification based on three-dimensional convolution requires uniform frame numbers across patients, resulting in poor feature extraction efficiency and model classification performance. This study proposes a novel video classification method based on CNN and LSTM, introducing NLP's long and short sentence processing scheme into video classification for the first time. The method reduces CNN-extracted image features to 1x512 dimension, followed by sorting and compressing feature vectors for LSTM training. Specifically, feature vectors are sorted by patient video frame numbers and populated with padding value 0 to form variable batches, with invalid padding values compressed before LSTM training to conserve computing resources. Experimental results demonstrate that our variable-frame CNNLSTM method outperforms other approaches across all metrics, showing improvements of 3-6% in F1 score and 1.5% in specificity compared to keyframe methods. The variable-frame CNNLSTM also achieves better accuracy and precision than equal-frame CNNLSTM. These findings validate the effectiveness of our approach in classifying variable-frame ultrasound videos and suggest potential applications in other medical imaging modalities.
Abstract:Understanding and analyzing the spatial semantics and structure of forests is essential for accurate forest resource monitoring and ecosystem research. However, the lack of large-scale and annotated datasets has limited the widespread use of advanced intelligent techniques in this field. To address this challenge, a fully automated synthetic data generation and processing framework based on the concepts of Digital Cousins and Simulation-to-Reality (Sim2Real) is proposed, offering versatility and scalability to any size and platform. Using this process, we created the Boreal3D, the world's largest forest point cloud dataset. It includes 1000 highly realistic and structurally diverse forest plots across four different platforms, totaling 48,403 trees and over 35.3 billion points. Each point is labeled with semantic, instance, and viewpoint information, while each tree is described with structural parameters such as diameter, crown width, leaf area, and total volume. We designed and conducted extensive experiments to evaluate the potential of Boreal3D in advancing fine-grained 3D forest structure analysis in real-world applications. The results demonstrate that with certain strategies, models pre-trained on synthetic data can significantly improve performance when applied to real forest datasets. Especially, the findings reveal that fine-tuning with only 20% of real-world data enables the model to achieve performance comparable to models trained exclusively on entire real-world data, highlighting the value and potential of our proposed framework. The Boreal3D dataset, and more broadly, the synthetic data augmentation framework, is poised to become a critical resource for advancing research in large-scale 3D forest scene understanding and structural parameter estimation.
Abstract:Singing voice conversion aims to transform a source singing voice into that of a target singer while preserving the original lyrics, melody, and various vocal techniques. In this paper, we propose a high-fidelity singing voice conversion system. Our system builds upon the SVCC T02 framework and consists of three key components: a feature extractor, a voice converter, and a post-processor. The feature extractor utilizes the ContentVec and Whisper models to derive F0 contours and extract speaker-independent linguistic features from the input singing voice. The voice converter then integrates the extracted timbre, F0, and linguistic content to synthesize the target speaker's waveform. The post-processor augments high-frequency information directly from the source through simple and effective signal processing to enhance audio quality. Due to the lack of a standardized professional dataset for evaluating expressive singing conversion systems, we have created and made publicly available a specialized test set. Comparative evaluations demonstrate that our system achieves a remarkably high level of naturalness, and further analysis confirms the efficacy of our proposed system design.
Abstract:Federated learning (FL) enables collaborative model training across distributed clients while preserving data privacy. Despite its widespread adoption, most FL approaches focusing solely on privacy protection fall short in scenarios where trustworthiness is crucial, necessitating advancements in secure training, dependable decision-making mechanisms, robustness on corruptions, and enhanced performance with Non-IID data. To bridge this gap, we introduce Trustworthy Personalized Federated Learning (TPFL) framework designed for classification tasks via subjective logic in this paper. Specifically, TPFL adopts a unique approach by employing subjective logic to construct federated models, providing probabilistic decisions coupled with an assessment of uncertainty rather than mere probability assignments. By incorporating a trainable heterogeneity prior to the local training phase, TPFL effectively mitigates the adverse effects of data heterogeneity. Model uncertainty and instance uncertainty are further utilized to ensure the safety and reliability of the training and inference stages. Through extensive experiments on widely recognized federated learning benchmarks, we demonstrate that TPFL not only achieves competitive performance compared with advanced methods but also exhibits resilience against prevalent malicious attacks, robustness on domain shifts, and reliability in high-stake scenarios.
Abstract:This paper presents the T031 team's approach to the StutteringSpeech Challenge in SLT2024. Mandarin Stuttering Event Detection (MSED) aims to detect instances of stuttering events in Mandarin speech. We propose a detailed acoustic analysis method to improve the accuracy of stutter detection by capturing subtle nuances that previous Stuttering Event Detection (SED) techniques have overlooked. To this end, we introduce the Fine-Grained Contrastive Learning (FGCL) framework for MSED. Specifically, we model the frame-level probabilities of stuttering events and introduce a mining algorithm to identify both easy and confusing frames. Then, we propose a stutter contrast loss to enhance the distinction between stuttered and fluent speech frames, thereby improving the discriminative capability of stuttered feature embeddings. Extensive evaluations on English and Mandarin datasets demonstrate the effectiveness of FGCL, achieving a significant increase of over 5.0% in F1 score on Mandarin data.
Abstract:3D contrastive representation learning has exhibited remarkable efficacy across various downstream tasks. However, existing contrastive learning paradigms based on cosine similarity fail to deeply explore the potential intra-modal hierarchical and cross-modal semantic correlations about multi-modal data in Euclidean space. In response, we seek solutions in hyperbolic space and propose a hyperbolic image-and-pointcloud contrastive learning method (HyperIPC). For the intra-modal branch, we rely on the intrinsic geometric structure to explore the hyperbolic embedding representation of point cloud to capture invariant features. For the cross-modal branch, we leverage images to guide the point cloud in establishing strong semantic hierarchical correlations. Empirical experiments underscore the outstanding classification performance of HyperIPC. Notably, HyperIPC enhances object classification results by 2.8% and few-shot classification outcomes by 5.9% on ScanObjectNN compared to the baseline. Furthermore, ablation studies and confirmatory testing validate the rationality of HyperIPC's parameter settings and the effectiveness of its submodules.
Abstract:Invariance-based and generative methods have shown a conspicuous performance for 3D self-supervised representation learning (SSRL). However, the former relies on hand-crafted data augmentations that introduce bias not universally applicable to all downstream tasks, and the latter indiscriminately reconstructs masked regions, resulting in irrelevant details being saved in the representation space. To solve the problem above, we introduce 3D-JEPA, a novel non-generative 3D SSRL framework. Specifically, we propose a multi-block sampling strategy that produces a sufficiently informative context block and several representative target blocks. We present the context-aware decoder to enhance the reconstruction of the target blocks. Concretely, the context information is fed to the decoder continuously, facilitating the encoder in learning semantic modeling rather than memorizing the context information related to target blocks. Overall, 3D-JEPA predicts the representation of target blocks from a context block using the encoder and context-aware decoder architecture. Various downstream tasks on different datasets demonstrate 3D-JEPA's effectiveness and efficiency, achieving higher accuracy with fewer pretraining epochs, e.g., 88.65% accuracy on PB_T50_RS with 150 pretraining epochs.
Abstract:Multiple rotation averaging plays a crucial role in computer vision and robotics domains. The conventional optimization-based methods optimize a nonlinear cost function based on certain noise assumptions, while most previous learning-based methods require ground truth labels in the supervised training process. Recognizing the handcrafted noise assumption may not be reasonable in all real-world scenarios, this paper proposes an effective rotation averaging method for mining data patterns in a learning manner while avoiding the requirement of labels. Specifically, we apply deep matrix factorization to directly solve the multiple rotation averaging problem in unconstrained linear space. For deep matrix factorization, we design a neural network model, which is explicitly low-rank and symmetric to better suit the background of multiple rotation averaging. Meanwhile, we utilize a spanning tree-based edge filtering to suppress the influence of rotation outliers. What's more, we also adopt a reweighting scheme and dynamic depth selection strategy to further improve the robustness. Our method synthesizes the merit of both optimization-based and learning-based methods. Experimental results on various datasets validate the effectiveness of our proposed method.
Abstract:Multiview point cloud registration serves as a cornerstone of various computer vision tasks. Previous approaches typically adhere to a global paradigm, where a pose graph is initially constructed followed by motion synchronization to determine the absolute pose. However, this separated approach may not fully leverage the characteristics of multiview registration and might struggle with low-overlap scenarios. In this paper, we propose an incremental multiview point cloud registration method that progressively registers all scans to a growing meta-shape. To determine the incremental ordering, we employ a two-stage coarse-to-fine strategy for point cloud candidate retrieval. The first stage involves the coarse selection of scans based on neighbor fusion-enhanced global aggregation features, while the second stage further reranks candidates through geometric-based matching. Additionally, we apply a transformation averaging technique to mitigate accumulated errors during the registration process. Finally, we utilize a Reservoir sampling-based technique to address density variance issues while reducing computational load. Comprehensive experimental results across various benchmarks validate the effectiveness and generalization of our approach.