Abstract:The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, and meanwhile has an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary followed by bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities of multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupt in MLLM. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While there is still lots of work to be done on VITA to get close to close-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research. Project Page: https://vita-home.github.io.
Abstract:Atherosclerosis is a chronic, progressive disease that primarily affects the arterial walls. It is one of the major causes of cardiovascular disease. Magnetic Resonance (MR) black-blood vessel wall imaging (BB-VWI) offers crucial insights into vascular disease diagnosis by clearly visualizing vascular structures. However, the complex anatomy of the neck poses challenges in distinguishing the carotid artery (CA) from surrounding structures, especially with changes like atherosclerosis. In order to address these issues, we propose GAPNet, which is a consisting of a novel geometric prior deduced from.
Abstract:Cerebrovascular diseases (CVDs) remain a leading cause of global disability and mortality. Digital Subtraction Angiography (DSA) sequences, recognized as the golden standard for diagnosing CVDs, can clearly visualize the dynamic flow and reveal pathological conditions within the cerebrovasculature. Therefore, precise segmentation of cerebral arteries (CAs) and classification between their main trunks and branches are crucial for physicians to accurately quantify diseases. However, achieving accurate CA segmentation in DSA sequences remains a challenging task due to small vessels with low contrast, and ambiguity between vessels and residual skull structures. Moreover, the lack of publicly available datasets limits exploration in the field. In this paper, we introduce a DSA Sequence-based Cerebral Artery segmentation dataset (DSCA), the first publicly accessible dataset designed specifically for pixel-level semantic segmentation of CAs. Additionally, we propose DSANet, a spatio-temporal network for CA segmentation in DSA sequences. Unlike existing DSA segmentation methods that focus only on a single frame, the proposed DSANet introduces a separate temporal encoding branch to capture dynamic vessel details across multiple frames. To enhance small vessel segmentation and improve vessel connectivity, we design a novel TemporalFormer module to capture global context and correlations among sequential frames. Furthermore, we develop a Spatio-Temporal Fusion (STF) module to effectively integrate spatial and temporal features from the encoder. Extensive experiments demonstrate that DSANet outperforms other state-of-the-art methods in CA segmentation, achieving a Dice of 0.9033.
Abstract:Medical Image Synthesis (MIS) plays an important role in the intelligent medical field, which greatly saves the economic and time costs of medical diagnosis. However, due to the complexity of medical images and similar characteristics of different tissue cells, existing methods face great challenges in meeting their biological consistency. To this end, we propose the Hybrid Augmented Generative Adversarial Network (HAGAN) to maintain the authenticity of structural texture and tissue cells. HAGAN contains Attention Mixed (AttnMix) Generator, Hierarchical Discriminator and Reverse Skip Connection between Discriminator and Generator. The AttnMix consistency differentiable regularization encourages the perception in structural and textural variations between real and fake images, which improves the pathological integrity of synthetic images and the accuracy of features in local areas. The Hierarchical Discriminator introduces pixel-by-pixel discriminant feedback to generator for enhancing the saliency and discriminance of global and local details simultaneously. The Reverse Skip Connection further improves the accuracy for fine details by fusing real and synthetic distribution features. Our experimental evaluations on three datasets of different scales, i.e., COVID-CT, ACDC and BraTS2018, demonstrate that HAGAN outperforms the existing methods and achieves state-of-the-art performance in both high-resolution and low-resolution.
Abstract:In recent years, significant progress has been achieved in biphasic face photo-sketch synthesis with the development of Generative Adversarial Network (GAN). Biphasic face photo-sketch synthesis could be applied in wide-ranging fields such as digital entertainment and law enforcement. However, generating realistic photos and distinct sketches suffers from great challenges due to the low quality of sketches and complex photo variations in the real scenes. To this end, we propose a novel Semantic-Driven Generative Adversarial Network to address the above issues, cooperating with the Graph Representation Learning. Specifically, we inject class-wise semantic layouts into the generator to provide style-based spatial supervision for synthesized face photos and sketches. In addition, to improve the fidelity of the generated results, we leverage the semantic layouts to construct two types of Representational Graphs which indicate the intra-class semantic features and inter-class structural features of the synthesized images. Furthermore, we design two types of constraints based on the proposed Representational Graphs which facilitate the preservation of the details in generated face photos and sketches. Moreover, to further enhance the perceptual quality of synthesized images, we propose a novel biphasic training strategy which is dedicated to refine the generated results through Iterative Cycle Training. Extensive experiments are conducted on CUFS and CUFSF datasets to demonstrate the prominent ability of our proposed method which achieves the state-of-the-art performance.
Abstract:Medical instrument segmentation in 3D ultrasound is essential for image-guided intervention. However, to train a successful deep neural network for instrument segmentation, a large number of labeled images are required, which is expensive and time-consuming to obtain. In this article, we propose a semi-supervised learning (SSL) framework for instrument segmentation in 3D US, which requires much less annotation effort than the existing methods. To achieve the SSL learning, a Dual-UNet is proposed to segment the instrument. The Dual-UNet leverages unlabeled data using a novel hybrid loss function, consisting of uncertainty and contextual constraints. Specifically, the uncertainty constraints leverage the uncertainty estimation of the predictions of the UNet, and therefore improve the unlabeled information for SSL training. In addition, contextual constraints exploit the contextual information of the training images, which are used as the complementary information for voxel-wise uncertainty estimation. Extensive experiments on multiple ex-vivo and in-vivo datasets show that our proposed method achieves Dice score of about 68.6%-69.1% and the inference time of about 1 sec. per volume. These results are better than the state-of-the-art SSL methods and the inference time is comparable to the supervised approaches.
Abstract:One essential problem in skeleton-based action recognition is how to extract discriminative features over all skeleton joints. However, the complexity of the recent State-Of-The-Art (SOTA) models for this task tends to be exceedingly sophisticated and over-parameterized. The low efficiency in model training and inference has increased the validation costs of model architectures in large-scale datasets. To address the above issue, recent advanced separable convolutional layers are embedded into an early fused Multiple Input Branches (MIB) network, constructing an efficient Graph Convolutional Network (GCN) baseline for skeleton-based action recognition. In addition, based on such the baseline, we design a compound scaling strategy to expand the model's width and depth synchronously, and eventually obtain a family of efficient GCN baselines with high accuracies and small amounts of trainable parameters, termed EfficientGCN-Bx, where ''x'' denotes the scaling coefficient. On two large-scale datasets, i.e., NTU RGB+D 60 and 120, the proposed EfficientGCN-B4 baseline outperforms other SOTA methods, e.g., achieving 91.7% accuracy on the cross-subject benchmark of NTU 60 dataset, while being 3.15x smaller and 3.21x faster than MS-G3D, which is one of the best SOTA methods. The source code in PyTorch version and the pretrained models are available at https://github.com/yfsong0709/EfficientGCNv1.
Abstract:Face sketch synthesis has made significant progress with the development of deep neural networks in these years. The delicate depiction of sketch portraits facilitates a wide range of applications like digital entertainment and law enforcement. However, accurate and realistic face sketch generation is still a challenging task due to the illumination variations and complex backgrounds in the real scenes. To tackle these challenges, we propose a novel Semantic-Driven Generative Adversarial Network (SDGAN) which embeds global structure-level style injection and local class-level knowledge re-weighting. Specifically, we conduct facial saliency detection on the input face photos to provide overall facial texture structure, which could be used as a global type of prior information. In addition, we exploit face parsing layouts as the semantic-level spatial prior to enforce globally structural style injection in the generator of SDGAN. Furthermore, to enhance the realistic effect of the details, we propose a novel Adaptive Re-weighting Loss (ARLoss) which dedicates to balance the contributions of different semantic classes. Experimentally, our extensive experiments on CUFS and CUFSF datasets show that our proposed algorithm achieves state-of-the-art performance.
Abstract:One essential problem in skeleton-based action recognition is how to extract discriminative features over all skeleton joints. However, the complexity of the State-Of-The-Art (SOTA) models of this task tends to be exceedingly sophisticated and over-parameterized, where the low efficiency in model training and inference has obstructed the development in the field, especially for large-scale action datasets. In this work, we propose an efficient but strong baseline based on Graph Convolutional Network (GCN), where three main improvements are aggregated, i.e., early fused Multiple Input Branches (MIB), Residual GCN (ResGCN) with bottleneck structure and Part-wise Attention (PartAtt) block. Firstly, an MIB is designed to enrich informative skeleton features and remain compact representations at an early fusion stage. Then, inspired by the success of the ResNet architecture in Convolutional Neural Network (CNN), a ResGCN module is introduced in GCN to alleviate computational costs and reduce learning difficulties in model training while maintain the model accuracy. Finally, a PartAtt block is proposed to discover the most essential body parts over a whole action sequence and obtain more explainable representations for different skeleton action sequences. Extensive experiments on two large-scale datasets, i.e., NTU RGB+D 60 and 120, validate that the proposed baseline slightly outperforms other SOTA models and meanwhile requires much fewer parameters during training and inference procedures, e.g., at most 34 times less than DGNN, which is one of the best SOTA methods.
Abstract:Accurate and efficient catheter segmentation in 3D ultrasound (US) is essential for cardiac intervention. Currently, the state-of-the-art segmentation algorithms are based on convolutional neural networks (CNNs), which achieved remarkable performances in a standard Cartesian volumetric data. Nevertheless, these approaches suffer the challenges of low efficiency and GPU unfriendly image size. Therefore, such difficulties and expensive hardware requirements become a bottleneck to build accurate and efficient segmentation models for real clinical application. In this paper, we propose a novel Frustum ultrasound based catheter segmentation method. Specifically, Frustum ultrasound is a polar coordinate based image, which includes same information of standard Cartesian image but has much smaller size, which overcomes the bottleneck of efficiency than conventional Cartesian images. Nevertheless, the irregular and deformed Frustum images lead to more efforts for accurate voxel-level annotation. To address this limitation, a weakly supervised learning framework is proposed, which only needs 3D bounding box annotations overlaying the region-of-interest to training the CNNs. Although the bounding box annotation includes noise and inaccurate annotation to mislead to model, it is addressed by the proposed pseudo label generated scheme. The labels of training voxels are generated by incorporating class activation maps with line filtering, which is iteratively updated during the training. Our experimental results show the proposed method achieved the state-of-the-art performance with an efficiency of 0.25 second per volume. More crucially, the Frustum image segmentation provides a much faster and cheaper solution for segmentation in 3D US image, which meet the demands of clinical applications.