Abstract:Visual Language Models have demonstrated remarkable capabilities across tasks, including visual question answering and image captioning. However, most models rely on text-based instructions, limiting their effectiveness in human-machine interactions. Moreover, the quality of language models depends on reasoning and prompting techniques, such as COT, which remain underexplored when using speech instructions. To address these challenges, we propose SilVar, a novel end-to-end multimodal model that uses speech instructions for reasoning in visual question answering. In addition, we investigate reasoning techniques with levels including conversational, simple, and complex speech instruction. SilVar is built upon CLIP, Whisper, and LLaMA 3.1-8B, enabling intuitive interactions by allowing users to provide verbal or text instructions. To this end, we introduce a dataset designed to challenge models with speech-based reasoning tasks for object localization. This dataset enhances the model ability to process and explain visual scenes from spoken input, moving beyond object recognition to reasoning-based interactions. The experiments show that SilVar achieves SOTA performance on the MMMU and ScienceQA benchmarks despite the challenge of speech-based instructions. We believe SilVar will inspire next-generation multimodal reasoning models, toward expert artificial general intelligence. Our code and dataset are available here.
Abstract:Robotic manipulators are widely used in various industries for complex and repetitive tasks. However, they remain vulnerable to unexpected hardware failures. In this study, we address the challenge of enabling a robotic manipulator to complete tasks despite joint malfunctions. Specifically, we develop a reinforcement learning (RL) framework to adaptively compensate for a non-functional joint during task execution. Our experimental platform is the Franka robot with 7 degrees of freedom (DOFs). We formulate the problem as a partially observable Markov decision process (POMDP), where the robot is trained under various joint failure conditions and tested in both seen and unseen scenarios. We consider scenarios where a joint is permanently broken and where it functions intermittently. Additionally, we demonstrate the effectiveness of our approach by comparing it with traditional inverse kinematics-based control methods. The results show that the RL algorithm enables the robot to successfully complete tasks even with joint failures, achieving a high success rate with an average rate of 93.6%. This showcases its robustness and adaptability. Our findings highlight the potential of RL to enhance the resilience and reliability of robotic systems, making them better suited for unpredictable environments. All related codes and models are published online.
Abstract:Multilingual automatic speech recognition (ASR) in the medical domain serves as a foundational task for various downstream applications such as speech translation, spoken language understanding, and voice-activated assistants. This technology enhances patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we introduce MultiMed, a collection of small-to-large end-to-end ASR models for the medical domain, spanning five languages: Vietnamese, English, German, French, and Mandarin Chinese, together with the corresponding real-world ASR dataset. To our best knowledge, MultiMed stands as the largest and the first multilingual medical ASR dataset, in terms of total duration, number of speakers, diversity of diseases, recording conditions, speaker roles, unique medical terms, accents, and ICD-10 codes. Secondly, we establish the empirical baselines, present the first reproducible study of multilinguality in medical ASR, conduct a layer-wise ablation study for end-to-end ASR training, and provide the first linguistic analysis for multilingual medical ASR. All code, data, and models are available online https://github.com/leduckhai/MultiMed/tree/master/MultiMed
Abstract:Knowledge graphs (KGs) enhance the performance of large language models (LLMs) and search engines by providing structured, interconnected data that improves reasoning and context-awareness. However, KGs only focus on text data, thereby neglecting other modalities such as speech. In this work, we introduce wav2graph, the first framework for supervised learning knowledge graph from speech data. Our pipeline are straightforward: (1) constructing a KG based on transcribed spoken utterances and a named entity database, (2) converting KG into embedding vectors, and (3) training graph neural networks (GNNs) for node classification and link prediction tasks. Through extensive experiments conducted in inductive and transductive learning contexts using state-of-the-art GNN models, we provide baseline results and error analysis for node classification and link prediction tasks on human transcripts and automatic speech recognition (ASR) transcripts, including evaluations using both encoder-based and decoder-based node embeddings, as well as monolingual and multilingual acoustic pre-trained models. All related code, data, and models are published online.
Abstract:Vision-language models have been extensively explored across a wide range of tasks, achieving satisfactory performance; however, their application in medical imaging remains underexplored. In this work, we propose a unified framework - LiteGPT - for the medical imaging. We leverage multiple pre-trained visual encoders to enrich information and enhance the performance of vision-language models. To the best of our knowledge, this is the first study to utilize vision-language models for the novel task of joint localization and classification in medical images. Besides, we are pioneers in providing baselines for disease localization in chest X-rays. Finally, we set new state-of-the-art performance in the image classification task on the well-benchmarked VinDr-CXR dataset. All code and models are publicly available online: https://github.com/leduckhai/LiteGPT
Abstract:Precision devices play an important role in enhancing production quality and productivity in agricultural systems. Therefore, the optimization of these devices is essential in precision agriculture. Recently, with the advancements of deep learning, there have been several studies aiming to harness its capabilities for improving spray system performance. However, the effectiveness of these methods heavily depends on the size of the training dataset, which is expensive and time-consuming to collect. To address the challenge of insufficient training samples, this paper proposes an alternative solution by generating artificial images of droplets using generative adversarial networks (GAN). The GAN model is trained by using a small dataset captured by a high-speed camera and capable of generating images with progressively increasing resolution. The results demonstrate that the model can generate high-quality images with the size of $1024\times1024$. Furthermore, this research leverages recent advancements in computer vision and deep learning to develop a light droplet detector using the synthetic dataset. As a result, the detection model achieves a 16.06\% increase in mean average precision (mAP) when utilizing the synthetic dataset. To the best of our knowledge, this work stands as the first to employ a generative model for augmenting droplet detection. Its significance lies not only in optimizing nozzle design for constructing efficient spray systems but also in addressing the common challenge of insufficient data in various precision agriculture tasks. This work offers a critical contribution to conserving resources while striving for optimal and sustainable agricultural practices.
Abstract:Automated medical image segmentation is becoming increasingly crucial in modern clinical practice, driven by the growing demand for precise diagnoses, the push towards personalized treatment plans, and advancements in machine learning algorithms, especially the incorporation of deep learning methods. While convolutional neural networks (CNNs) have been prevalent among these methods, the remarkable potential of Transformer-based models for computer vision tasks is gaining more acknowledgment. To harness the advantages of both CNN-based and Transformer-based models, we propose a simple yet effective UNet-Transformer (seUNet-Trans) model for medical image segmentation. In our approach, the UNet model is designed as a feature extractor to generate multiple feature maps from the input images, and these maps are propagated into a bridge layer, which sequentially connects the UNet and the Transformer. In this stage, we employ the pixel-level embedding technique without position embedding vectors to make the model more efficient. Moreover, we applied spatial-reduction attention in the Transformer to reduce the computational/memory overhead. By leveraging the UNet architecture and the self-attention mechanism, our model not only preserves both local and global context information but also captures long-range dependencies between input elements. The proposed model is extensively experimented on five medical image segmentation datasets, including polyp segmentation, to demonstrate its efficacy. A comparison with several state-of-the-art segmentation models on these datasets shows the superior performance of seUNet-Trans.
Abstract:This work leverages the recent advancements of deep learning in image processing to find optimal locations that present the important characteristics of a field. The data for training are collected at different fields in local farms with five features: aspect, flow accumulation, slope, NDVI (normalized difference vegetation index), and yield. The soil sampling dataset is challenging because the ground truth is highly imbalanced binary images. Therefore, we approached the problem with two methods, the first approach involves utilizing a state-of-the-art model with the convolutional neural network (CNN) backbone, while the second is to innovate a deep-learning design grounded in the concepts of transformer and self-attention. Our framework is constructed with an encoder-decoder architecture with the self-attention mechanism as the backbone. In the encoder, the self-attention mechanism is the key feature extractor, which produces feature maps. In the decoder, we introduce atrous convolution networks to concatenate, fuse the extracted features, and then export the optimal locations for soil sampling. Currently, the model has achieved impressive results on the testing dataset, with a mean accuracy of 99.52%, a mean Intersection over Union (IoU) of 57.35%, and a mean Dice Coefficient of 71.47%, while the performance metrics of the state-of-the-art CNN-based model are 66.08%, 3.85%, and 1.98%, respectively. This indicates that our proposed model outperforms the CNN-based method on the soil-sampling dataset. To the best of our knowledge, our work is the first to provide a soil-sampling dataset with multiple attributes and leverage deep learning techniques to enable the automatic selection of soil-sampling sites. This work lays a foundation for novel applications of data science and machine-learning technologies to solve other emerging agricultural problems.