Abstract: Captioning images is a challenging scene-understanding task that connects computer vision and natural language processing. While image captioning models have been successful in producing excellent descriptions, the field has primarily focused on generating a single sentence for 2D images. This paper investigates whether integrating depth information with RGB images can enhance the captioning task and yield better descriptions. To this end, we propose a Transformer-based encoder-decoder framework for generating a multi-sentence description of a 3D scene. The RGB image and its corresponding depth map are provided as inputs to our framework, which combines them to build a richer understanding of the input scene. The depth maps can be either ground truth or estimated, which makes our framework applicable to any RGB captioning dataset. We explore different approaches to fusing the RGB and depth images. Experiments are performed on the NYU-v2 dataset and the Stanford image paragraph captioning dataset. While working with the NYU-v2 dataset, we found inconsistent labeling that prevents depth information from benefiting the captioning task; the results were even worse than using RGB images alone. We therefore propose a cleaned version of the NYU-v2 dataset that is more consistent and informative. Our results on both datasets demonstrate that the proposed framework effectively benefits from depth information, whether ground truth or estimated, and generates better captions. Code, pre-trained models, and the cleaned version of the NYU-v2 dataset will be made publicly available.
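To illustrate the fusion idea described in this abstract, the following is a minimal PyTorch sketch (not the authors' released code) of one possible early-fusion scheme: RGB and depth patch embeddings are concatenated and projected before a standard Transformer encoder, whose output would serve as memory for a caption decoder. The class name `RGBDFusionEncoder` and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch, assuming a generic PyTorch setup; not the authors' code.
import torch
import torch.nn as nn

class RGBDFusionEncoder(nn.Module):
    """Hypothetical early-fusion encoder: RGB and depth patch embeddings are
    concatenated along the feature axis, projected, and passed to a
    standard Transformer encoder."""
    def __init__(self, patch=16, dim=512, layers=4, heads=8):
        super().__init__()
        # Separate patch embeddings for the RGB image (3 ch) and the depth map (1 ch).
        self.rgb_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.depth_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.fuse = nn.Linear(2 * dim, dim)  # fuse the concatenated modality features
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, rgb, depth):
        # rgb: (B, 3, H, W), depth: (B, 1, H, W)
        r = self.rgb_embed(rgb).flatten(2).transpose(1, 2)      # (B, N, dim)
        d = self.depth_embed(depth).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = self.fuse(torch.cat([r, d], dim=-1))           # (B, N, dim)
        return self.encoder(tokens)  # memory tokens for a caption decoder

if __name__ == "__main__":
    enc = RGBDFusionEncoder()
    rgb = torch.randn(2, 3, 224, 224)
    depth = torch.randn(2, 1, 224, 224)
    print(enc(rgb, depth).shape)  # torch.Size([2, 196, 512])
```

Other fusion strategies mentioned in the abstract (e.g., late fusion of separately encoded modalities) would follow the same interface but merge the token streams at a different stage.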
Abstract: Deep Neural Networks (DNNs) have had a significant impact on medical imaging. One major problem with adopting DNNs for skin cancer classification is that the class frequencies in the existing datasets are imbalanced. This problem hinders the training of robust and well-generalizing models. Data augmentation addresses this by using existing data more effectively; however, standard data augmentation pipelines are manually designed and produce only a limited set of plausible alternative samples. Instead, Generative Adversarial Networks (GANs) are utilized to generate a much broader set of augmentations. This paper proposes a novel enhancement of the progressive generative adversarial network (PGAN) using a self-attention mechanism. Self-attention directly models the long-range dependencies in the feature maps. Accordingly, it complements PGAN in generating fine-grained samples that contain clinically meaningful information. Moreover, a stabilization technique is applied to the enhanced generative model. The generative models are trained on the ISIC 2018 skin lesion challenge dataset to synthesize highly realistic skin lesion samples that further boost the classification results. We achieve an accuracy of 70.1%, which is 2.8% better than the non-augmented baseline of 67.3%.
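To illustrate the kind of self-attention block this abstract refers to, here is a minimal, hypothetical SAGAN-style module (not the paper's implementation) that models long-range spatial dependencies in a feature map and could be inserted between PGAN generator stages. The class name `SelfAttention2d`, the channel reduction factor, and the chosen resolution are assumptions.

```python
# Minimal sketch, assuming a SAGAN-style formulation; not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """Self-attention over spatial positions of a 2D feature map."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight, starts at 0

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, C/8)
        k = self.key(x).flatten(2)                      # (B, C/8, HW)
        attn = F.softmax(torch.bmm(q, k), dim=-1)       # (B, HW, HW) attention map
        v = self.value(x).flatten(2)                    # (B, C, HW)
        out = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                     # residual connection

if __name__ == "__main__":
    feat = torch.randn(2, 256, 32, 32)  # feature map at an intermediate generator resolution
    print(SelfAttention2d(256)(feat).shape)  # torch.Size([2, 256, 32, 32])
```

Because the attention map is quadratic in the number of spatial positions, such a block is typically placed at intermediate resolutions of the progressively grown generator rather than at the final output scale.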