Abstract: Image captioning is a challenging task that involves generating a textual description of an image by combining computer vision and natural language processing techniques. This paper proposes a deep neural framework for image caption generation built on a GRU-based attention mechanism. Our approach employs multiple pre-trained convolutional neural networks as the encoder to extract image features and a GRU-based language model as the decoder to generate descriptive sentences. To improve performance, we integrate the Bahdanau attention model with the GRU decoder so that the model learns to focus on specific parts of the image. We evaluate our approach on the MSCOCO and Flickr30k datasets and show that it achieves competitive scores compared to state-of-the-art methods. The proposed framework bridges the gap between computer vision and natural language and can be extended to specific domains.
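A minimal sketch of the attention-augmented decoder step this abstract describes, assuming PyTorch; the dimensions, class names, and the single-step GRUCell formulation are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class BahdanauAttention(nn.Module):
    """Additive attention: score(h, a_i) = v^T tanh(W1 h + W2 a_i)."""

    def __init__(self, hidden_dim, feat_dim, attn_dim):
        super().__init__()
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, hidden, features):
        # hidden: (batch, hidden_dim); features: (batch, num_regions, feat_dim)
        scores = self.v(torch.tanh(
            self.w_hidden(hidden).unsqueeze(1) + self.w_feat(features)
        ))                                          # (batch, num_regions, 1)
        weights = torch.softmax(scores, dim=1)      # attention over image regions
        context = (weights * features).sum(dim=1)   # (batch, feat_dim)
        return context, weights.squeeze(-1)


class AttentionGRUDecoder(nn.Module):
    """One decoding step: attend to image features, then update the GRU state."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, feat_dim, attn_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attention = BahdanauAttention(hidden_dim, feat_dim, attn_dim)
        self.gru = nn.GRUCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token, hidden, features):
        # token: (batch,) previous word ids; features: encoder region features
        context, weights = self.attention(hidden, features)
        x = torch.cat([self.embed(token), context], dim=-1)
        hidden = self.gru(x, hidden)
        return self.out(hidden), hidden, weights
```

At each step the decoder re-attends to the image conditioned on its current hidden state, so the returned weights can also be visualized to show which regions influenced each generated word.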
Abstract: Self-supervised learning (SSL) models have achieved considerable improvements in automatic speech recognition (ASR). ASR performance could be improved further if the model were dedicated to learning audio content information. To this end, we propose a progressive multi-scale self-supervised learning (PMS-SSL) method, which uses fine-grained target sets to compute the SSL loss at the top layer and coarse-grained target sets at intermediate layers. Furthermore, PMS-SSL introduces a multi-scale structure into multi-head self-attention for better speech representation: the attention area is restricted to a large scope at higher layers and to a small scope at lower layers. Experiments on the LibriSpeech dataset demonstrate the effectiveness of the proposed method. Compared with HuBERT, PMS-SSL achieves 13.7% / 12.7% relative WER reductions on the test-other evaluation subset when fine-tuned on the 10-hour / 100-hour subsets, respectively.
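A minimal sketch of the layer-dependent attention restriction described above, assuming PyTorch; the window sizes, the linear layer schedule, and the function names are hypothetical stand-ins for whatever schedule the paper actually uses.

```python
import torch


def banded_attention_mask(seq_len, window):
    """Boolean mask letting each frame attend only within +/- window positions."""
    idx = torch.arange(seq_len)
    dist = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs()
    return dist <= window  # True = attend, False = masked out


def layer_windows(num_layers, min_window=8, max_window=128):
    """Small attention scope at lower layers, large scope at higher layers."""
    return [
        int(min_window + (max_window - min_window) * l / (num_layers - 1))
        for l in range(num_layers)
    ]


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention with the banded restriction applied."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


# Example: each transformer layer gets a progressively wider band.
seq_len, num_layers = 200, 12
masks = [banded_attention_mask(seq_len, w) for w in layer_windows(num_layers)]
```

The intuition matches the abstract: lower layers model local acoustic detail within a narrow band, while higher layers integrate context across a wide band.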
Abstract: Image captioning is a fast-growing research field at the intersection of computer vision and natural language processing that involves generating textual explanations for images. This study develops a system that uses pre-trained convolutional neural networks (CNNs) to extract features from an image, integrates those features with an attention mechanism, and generates captions using a recurrent neural network (RNN). To encode an image into feature vectors representing its visual attributes, we employ multiple pre-trained CNNs. A GRU-based language model is then chosen as the decoder to construct the descriptive sentence. To increase performance, we integrate the Bahdanau attention model with the GRU so that learning can be focused on specific portions of the image. On the MSCOCO dataset, our method achieves competitive performance against state-of-the-art approaches.
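A minimal sketch of the encoder side this abstract describes, assuming torchvision; the paper uses multiple pre-trained CNNs, and here a single ResNet-50 (an assumed choice) stands in for one of them, with its classification head removed to expose per-region features for the attention mechanism.

```python
import torch
import torchvision.models as models


class CNNEncoder(torch.nn.Module):
    """Extract spatial feature maps from a pre-trained CNN (head removed)."""

    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the average pool and classifier to keep the 7x7 feature grid.
        self.backbone = torch.nn.Sequential(*list(resnet.children())[:-2])
        for p in self.backbone.parameters():
            p.requires_grad = False  # freeze the pre-trained weights

    def forward(self, images):
        # images: (batch, 3, 224, 224) -> (batch, 49, 2048) region features
        feats = self.backbone(images)               # (batch, 2048, 7, 7)
        return feats.flatten(2).transpose(1, 2)     # one vector per region
```

Keeping the spatial grid rather than a single pooled vector is what allows the Bahdanau attention in the decoder to weight individual image regions at each generation step.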