Abstract:Semantic segmentation in videos has been a focal point of recent research. However, existing models encounter challenges when faced with unfamiliar categories. To address this, we introduce the Open Vocabulary Video Semantic Segmentation (OV-VSS) task, designed to accurately segment every pixel across a wide range of open-vocabulary categories, including those that are novel or previously unexplored. To enhance OV-VSS performance, we propose a robust baseline, OV2VSS, which integrates a spatial-temporal fusion module, allowing the model to utilize temporal relationships across consecutive frames. Additionally, we incorporate a random frame enhancement module, broadening the model's understanding of semantic context throughout the entire video sequence. Our approach also includes video text encoding, which strengthens the model's capability to interpret textual information within the video context. Comprehensive evaluations on benchmark datasets such as VSPW and Cityscapes highlight OV2VSS's zero-shot generalization capabilities, especially in handling novel categories. The results validate OV2VSS's effectiveness, demonstrating improved performance in semantic segmentation tasks across diverse video datasets.
Abstract:How well are unimodal vision and language models aligned? Although prior work has approached this question, its assessment methods do not directly translate to how these models are used in practical vision-language tasks. In this paper, we propose a direct assessment method, inspired by linear probing, to assess vision-language alignment. We identify that the degree of alignment of the SSL vision models depends on their SSL training objective, and we find that the clustering quality of SSL representations has a stronger impact on alignment performance than their linear separability. Next, we introduce Swift Alignment of Image and Language (SAIL), an efficient transfer learning framework that aligns pretrained unimodal vision and language models for downstream vision-language tasks. Since SAIL leverages the strengths of pretrained unimodal models, it requires significantly less (6%) paired image-text data for multimodal alignment than models like CLIP, which are trained from scratch. SAIL training requires only a single A100 GPU and 5 hours of training, and can accommodate a batch size of up to 32,768. SAIL achieves 73.4% zero-shot accuracy on ImageNet (vs. CLIP's 72.7%) and excels in zero-shot retrieval, complex reasoning, and semantic segmentation. Additionally, SAIL improves the language-compatibility of vision encoders that in turn enhance the performance of multimodal large language models. The entire codebase and model weights are open-source: https://lezhang7.github.io/sail.github.io/
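To make the zero-shot evaluation mentioned above concrete, the following is a minimal sketch of how zero-shot classification is commonly done once image and text embeddings share a space: the image is assigned to the class whose text embedding has the highest cosine similarity. The function names and the toy 2-D embeddings are hypothetical, and this is not SAIL's actual code.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb: Sequence[float],
                       class_text_embs: Dict[str, Sequence[float]]) -> str:
    """Return the class name whose text embedding is closest to the image.

    Assumes image and text embeddings already live in a shared space
    (an assumption of this sketch, produced by some alignment step).
    """
    return max(class_text_embs,
               key=lambda name: cosine(image_emb, class_text_embs[name]))


# Toy, hand-made embeddings purely for illustration.
classes = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
prediction = zero_shot_classify([0.9, 0.2], classes)
```

In practice the class text embeddings would come from encoding prompts such as class names with the language model, which this sketch leaves out.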
Abstract:AI workloads, particularly those driven by deep learning, are introducing novel usage patterns to high-performance computing (HPC) systems that are not comprehensively captured by standard HPC benchmarks. As one of the largest academic research centers dedicated to deep learning, Mila identified the need to develop a custom benchmarking suite to address the diverse requirements of its community, which consists of over 1,000 researchers. This report introduces Milabench, the resulting benchmarking suite. Its design was informed by an extensive literature review encompassing 867 papers, as well as surveys conducted with Mila researchers. This rigorous process led to the selection of 26 primary benchmarks tailored for procurement evaluations, alongside 16 optional benchmarks for in-depth analysis. We detail the design methodology and the structure of the benchmarking suite, and provide performance evaluations using GPUs from NVIDIA, AMD, and Intel. The Milabench suite is open source and can be accessed at github.com/mila-iqia/milabench.
Abstract:Camouflaged Object Segmentation (COS) faces significant challenges due to the scarcity of annotated data, where meticulous pixel-level annotation is both labor-intensive and costly, primarily due to the intricate object-background boundaries. To the core question, "Can COS be effectively achieved in a zero-shot manner without manual annotations for any camouflaged object?", we answer affirmatively and introduce a robust zero-shot COS framework. This framework leverages the inherent local pattern bias of COS and employs a broad semantic feature space derived from salient object segmentation (SOS) for efficient zero-shot transfer. We incorporate a Masked Image Modeling (MIM) based image encoder optimized for Parameter-Efficient Fine-Tuning (PEFT), a Multimodal Large Language Model (M-LLM), and a Multi-scale Fine-grained Alignment (MFA) mechanism. The MIM pre-trained image encoder focuses on capturing essential low-level features, while the M-LLM generates caption embeddings processed alongside these visual cues. These embeddings are precisely aligned using MFA, enabling our framework to accurately interpret and navigate complex semantic contexts. To optimize operational efficiency, we introduce a learnable codebook that represents the M-LLM during inference, significantly reducing computational overhead. Our framework demonstrates its versatility and efficacy through rigorous experimentation, achieving state-of-the-art performance in zero-shot COS with $F_{\beta}^w$ scores of 72.9\% on CAMO and 71.7\% on COD10K. By removing the M-LLM during inference, we achieve an inference speed comparable to that of traditional end-to-end models, reaching 18.1 FPS. Code: https://github.com/R-LEI360725/ZSCOS-CaMF
Abstract:Multi-channel EEG signals are commonly used for the diagnosis and assessment of diseases such as epilepsy. Various EEG diagnostic algorithms based on deep learning have been developed. However, most research efforts focus solely on diagnosing and classifying current signal data and do not consider predicting future trends for early warning. Additionally, since multi-channel EEG can essentially be regarded as spatio-temporal signal data received by detectors at different locations in the brain, constructing spatio-temporal representations of EEG signals that facilitate future trend prediction becomes an important problem. This study proposes a multi-signal prediction algorithm based on generative diffusion models (EEG-DIF), which transforms the multi-signal forecasting task into an image completion task, allowing for comprehensive representation and learning of the spatio-temporal correlations and future developmental patterns of multi-channel EEG signals. We employ a publicly available epilepsy EEG dataset to construct and validate EEG-DIF. The results demonstrate that our method can accurately predict future trends for multiple EEG channels simultaneously. Furthermore, the early warning accuracy for epileptic seizures based on the generated EEG data reaches 0.89. In general, EEG-DIF provides a novel approach for characterizing multi-channel EEG signals and an innovative early warning algorithm for epileptic seizures, aiding in optimizing and enhancing the clinical diagnosis process. The code is available at https://github.com/JZK00/EEG-DIF.
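The idea of recasting multi-signal forecasting as image completion can be illustrated with a small sketch. It assumes, hypothetically, that channels are stacked as rows and time steps as columns, and that the future columns are masked out for a generative model to fill in. The layout, the function name `make_completion_task`, and the zero fill value are assumptions of this sketch; the actual EEG-DIF representation may differ.

```python
from typing import List, Tuple


def make_completion_task(
    signals: List[List[float]], horizon: int
) -> Tuple[List[List[float]], List[List[int]]]:
    """Build a (channels x time) 'image' whose last `horizon` columns are hidden.

    Returns the masked image and a binary mask (1 = observed, 0 = to predict).
    """
    if not signals:
        raise ValueError("need at least one channel")
    n_steps = len(signals[0])
    if any(len(row) != n_steps for row in signals):
        raise ValueError("all channels must have the same length")
    if not 0 < horizon < n_steps:
        raise ValueError("horizon must be between 1 and n_steps - 1")

    known = n_steps - horizon
    # Keep the observed past, blank out the future region to be completed.
    image = [row[:known] + [0.0] * horizon for row in signals]
    mask = [[1] * known + [0] * horizon for _ in signals]
    return image, mask


# Two toy channels with five time steps each; the last two steps are hidden.
eeg = [[0.1, 0.2, 0.3, 0.4, 0.5],
       [1.0, 0.9, 0.8, 0.7, 0.6]]
image, mask = make_completion_task(eeg, horizon=2)
```

A completion model would then be trained to reconstruct the masked columns for all channels jointly, which is what lets it exploit cross-channel (spatial) and temporal structure at once.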
Abstract:Automatic medical report generation (MRG), which aims to produce detailed text reports from medical images, has emerged as a critical task in this domain. MRG systems can enhance radiological workflows by reducing the time and effort required for report writing, thereby improving diagnostic efficiency. In this work, we present a novel approach for automatic MRG utilizing a multimodal large language model. Specifically, we employ the 3D Vision Transformer (ViT3D) image encoder introduced in M3D-CLIP to process 3D scans and use Asclepius-Llama3-8B as the language model to generate the text reports by auto-regressive decoding. Experiments show that our model achieves an average Green score of 0.3 on the MRG task validation set and an average accuracy of 0.61 on the visual question answering (VQA) task validation set, outperforming the baseline model. Our approach demonstrates the effectiveness of aligning ViT3D with LLaMA3 for automatic MRG and VQA tasks by tuning the model on a small dataset.
Abstract:Diabetic macular edema (DME) is a severe complication of diabetes, characterized by thickening of the central portion of the retina due to accumulation of fluid. DME is a significant and common cause of visual impairment in diabetic patients. Center-involved DME (ci-DME) is the highest-risk form of the disease, as fluid extends close to the fovea, which is responsible for sharp central vision. Earlier diagnosis or prediction of ci-DME may improve treatment outcomes. Here, we propose an ensemble method to predict ci-DME onset within a year using ultra-wide-field color fundus photography (UWF-CFP) images provided by the DIAMOND Challenge. We adopted a variety of baseline state-of-the-art classification networks, including ResNet, DenseNet, EfficientNet, and VGG, with the aim of enhancing model robustness. The best-performing models were DenseNet-121, ResNet-152, and EfficientNet-B7, and these were assembled into a definitive predictive model. The final ensemble model demonstrates strong performance, with an Area Under the Curve (AUC) of 0.7017, an F1 score of 0.6512, and an Expected Calibration Error (ECE) of 0.2057 when deployed on a synthetic dataset. The performance of this ensemble model is comparable to previous studies despite training and testing in a more realistic setting, indicating the potential of UWF-CFP combined with a deep learning classification system to facilitate earlier diagnosis, better treatment decisions, and improved prognostication in ci-DME.
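As an illustration of the ensembling and calibration metric named above, the sketch below averages per-model positive-class probabilities and computes a binned Expected Calibration Error for binary predictions. Simple averaging, the 0.5 decision threshold, and 10 equal-width bins are assumptions of this sketch; the abstract does not state how the models were combined or how ECE was binned.

```python
from typing import List, Sequence


def ensemble_average(prob_lists: Sequence[Sequence[float]]) -> List[float]:
    """Average positive-class probabilities across models, sample by sample."""
    n_models = len(prob_lists)
    return [sum(ps) / n_models for ps in zip(*prob_lists)]


def expected_calibration_error(probs: Sequence[float],
                               labels: Sequence[int],
                               n_bins: int = 10) -> float:
    """Binary ECE: sample-weighted mean |confidence - accuracy| over bins."""
    n = len(probs)
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p  # confidence in the predicted class
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Hypothetical outputs of three models on four images (not real results).
model_probs = [[0.9, 0.2, 0.6, 0.4],
               [0.8, 0.1, 0.7, 0.3],
               [0.7, 0.3, 0.5, 0.5]]
ensemble_probs = ensemble_average(model_probs)
ece = expected_calibration_error(ensemble_probs, labels=[1, 0, 0, 1])
```

A lower ECE means the predicted confidences track the observed accuracy more closely, which matters when predictions inform clinical decisions.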
Abstract:Accurate and timely modeling of labor migration is crucial for various urban governance and commercial tasks, such as local policy-making and business site selection. However, existing studies on labor migration largely rely on limited survey data with statistical methods, which fail to deliver timely and fine-grained insights for time-varying regional trends. To this end, we propose a deep learning-based spatial-temporal labor migration analysis framework, DHG-SIL, by leveraging large-scale job query data. Specifically, we first acquire labor migration intention as a proxy of labor migration via job queries from one of the world's largest search engines. Then, a Discrepant Homophily co-preserved Graph Convolutional Network (DH-GCN) and an interpretable temporal module are proposed to capture cross-city and sequential labor migration dependencies, respectively. In addition, we introduce four interpretable variables to quantify city migration properties, which are co-optimized with city representations via tailor-designed contrastive losses. Extensive experiments on three real-world datasets demonstrate the superiority of DHG-SIL. Notably, DHG-SIL has been deployed as a core component of a cooperative partner's intelligent human resource system, and the system has supported a series of city talent attraction reports.
Abstract:Labor market forecasting on talent demand and supply is essential for business management and economic development. With accurate and timely forecasts, employers can adapt their recruitment strategies to align with the evolving labor market, and employees can plan their career paths proactively according to future demand and supply. However, previous studies ignore the interconnection between demand-supply sequences among different companies and positions for predicting variations. Moreover, companies are reluctant to share their private human resource data for global labor market analysis due to concerns over jeopardizing competitive advantage, security threats, and potential ethical or legal violations. To this end, in this paper, we formulate the Federated Labor Market Forecasting (FedLMF) problem and propose a Meta-personalized Convergence-aware Clustered Federated Learning (MPCAC-FL) framework to provide accurate and timely collaborative talent demand and supply prediction in a privacy-preserving way. First, we design a graph-based sequential model to capture the inherent correlation between demand and supply sequences and company-position pairs. Second, we adopt meta-learning techniques to learn effective initial model parameters that can be shared across companies, allowing personalized models to be optimized for forecasting company-specific demand and supply, even when companies have heterogeneous data. Third, we devise a Convergence-aware Clustering algorithm to dynamically divide companies into groups according to model similarity and apply federated aggregation in each group, which alleviates heterogeneity for more stable convergence and better performance. Extensive experiments demonstrate that MPCAC-FL outperforms the compared baselines on three real-world datasets and achieves over 97% of the performance of the state-of-the-art model, i.e., DH-GEM, without exposing private company data.
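The third step, federated aggregation within each group of similar companies, can be sketched as a sample-weighted parameter average computed separately per cluster. This is a minimal FedAvg-style illustration that takes the cluster assignments as given; the Convergence-aware Clustering algorithm itself is not reproduced, and the function name `clustered_fedavg` and the flat parameter vectors are assumptions of this sketch.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def clustered_fedavg(client_params: Sequence[Sequence[float]],
                     cluster_ids: Sequence[int],
                     n_samples: Sequence[int]) -> Dict[int, List[float]]:
    """Return one aggregated parameter vector per cluster.

    Each client's parameters are weighted by its local sample count, and
    averaging happens only among clients assigned to the same cluster.
    """
    weighted_sums: Dict[int, List[float]] = {}
    totals: Dict[int, float] = defaultdict(float)
    for params, cluster, count in zip(client_params, cluster_ids, n_samples):
        if cluster not in weighted_sums:
            weighted_sums[cluster] = [0.0] * len(params)
        for i, value in enumerate(params):
            weighted_sums[cluster][i] += count * value
        totals[cluster] += count
    return {cluster: [s / totals[cluster] for s in sums]
            for cluster, sums in weighted_sums.items()}


# Three hypothetical companies: the first two share a cluster.
aggregated = clustered_fedavg(
    client_params=[[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]],
    cluster_ids=[0, 0, 1],
    n_samples=[1, 3, 5],
)
```

Only model parameters and sample counts cross company boundaries here, which is the property that lets aggregation proceed without sharing raw human resource data.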
Abstract:Atrial fibrillation (AF) is the most prevalent form of cardiac arrhythmia and is associated with increased morbidity and mortality. The effectiveness of current clinical interventions for AF is often limited by an incomplete understanding of the atrial anatomical structures that sustain this arrhythmia. Late Gadolinium-Enhanced MRI (LGE-MRI) has emerged as a critical imaging modality for assessing atrial fibrosis and scarring, which are essential markers for predicting the success of ablation procedures in AF patients. The Multi-class Bi-Atrial Segmentation (MBAS) challenge at MICCAI 2024 aims to enhance the segmentation of both left and right atria and their walls using a comprehensive dataset of 200 multi-center 3D LGE-MRIs, labelled by experts. This work presents an ensemble approach that integrates multiple machine learning models, including U-Net, ResNet, EfficientNet, and VGG, to perform automatic bi-atrial segmentation from LGE-MRI data. The ensemble model was evaluated using the Dice Similarity Coefficient (DSC) and 95% Hausdorff distance (HD95) on the left and right atrium wall, right atrium cavity, and left atrium cavity. On the internal testing dataset, the model achieved DSCs of 88.41%, 98.48%, and 98.45% and HD95 values of 1.07, 0.95, and 0.64, respectively. This demonstrates the effectiveness of the ensemble model in improving segmentation accuracy. The approach contributes to advancing the understanding of AF and supports the development of more targeted and effective ablation strategies.
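For reference, the Dice Similarity Coefficient used above is DSC = 2|P ∩ T| / (|P| + |T|), computed per class between the predicted and reference label maps. The sketch below works on flattened label lists; the integer label values and the convention of returning 1.0 when a class is absent from both maps are assumptions of this sketch, not details taken from the challenge.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int],
                     target: Sequence[int],
                     label: int) -> float:
    """Dice score for one class over flattened voxel label maps."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    pred_mask = [v == label for v in pred]
    target_mask = [v == label for v in target]
    intersection = sum(1 for p, t in zip(pred_mask, target_mask) if p and t)
    denominator = sum(pred_mask) + sum(target_mask)
    if denominator == 0:
        return 1.0  # class absent in both maps: treated as perfect agreement
    return 2.0 * intersection / denominator


# Toy example with hypothetical labels: 0 = background, 1 and 2 = structures.
prediction = [0, 1, 1, 2, 2, 2]
reference = [0, 1, 2, 2, 2, 2]
dsc_label_1 = dice_coefficient(prediction, reference, label=1)
dsc_label_2 = dice_coefficient(prediction, reference, label=2)
```

Because DSC measures volume overlap only, it is usually reported together with a boundary metric such as HD95, as the abstract does.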