Haojun Ai

Joint Automatic Speech Recognition And Structure Learning For Better Speech Understanding

Jan 13, 2025

Jiliang Hu, Zuchao Li, Mengjia Shen, Haojun Ai, Sheng Li, Jun Zhang

Abstract: Spoken language understanding (SLU) is a structure prediction task in the field of speech. Recently, many works that treat SLU as a sequence-to-sequence task have achieved great success. However, this approach is not suitable for simultaneous speech recognition and understanding. In this paper, we propose a joint speech recognition and structure learning framework (JSRSL), a span-based end-to-end SLU model that can accurately transcribe speech and extract structured content simultaneously. We conduct experiments on named entity recognition and intent classification using the Chinese dataset AISHELL-NER and the English dataset SLURP. The results show that our proposed method not only outperforms the traditional sequence-to-sequence method in both transcription and extraction capabilities but also achieves state-of-the-art performance on the two datasets.

* 5 pages, 2 figures, accepted by ICASSP 2025
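
The abstract describes the model only as "based on span". As a rough illustration of what a span-based output can look like in general (not the JSRSL architecture itself), the sketch below represents each extracted item as a start offset, an end offset, and a label over the transcript. The `Span` class, the `extract_spans` helper, the example sentence, and the labels are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """A labeled character span over a transcript (hypothetical structure)."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # illustrative label, not taken from any dataset


def extract_spans(transcript, spans):
    """Return (text, label) pairs by slicing the transcript at each span."""
    return [(transcript[s.start:s.end], s.label) for s in spans]


# Made-up transcript and spans, for illustration only.
transcript = "wake me up at seven in london"
spans = [Span(14, 19, "time"), Span(23, 29, "place")]
entities = extract_spans(transcript, spans)
print(entities)
```

Because the spans index into the transcript rather than replacing it, the transcription and the structured content stay aligned, which is the general appeal of span-style outputs over rewriting the transcript with inline tags.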

VHASR: A Multimodal Speech Recognition System With Vision Hotwords

Oct 01, 2024

Jiliang Hu, Zuchao Li, Ping Wang, Haojun Ai, Lefei Zhang, Hai Zhao

Abstract: Image-based multimodal automatic speech recognition (ASR) models enhance speech recognition performance by incorporating audio-related images. However, some works suggest that introducing image information into the model does not help improve ASR performance. In this paper, we propose a novel approach that effectively utilizes audio-related image information and build VHASR, a multimodal speech recognition system that uses vision as hotwords to strengthen the model's speech recognition capability. Our system uses a dual-stream architecture, which first transcribes the text on the two streams separately and then combines the outputs. We evaluate the proposed model on four datasets: Flickr8k, ADE20k, COCO, and OpenImages. The experimental results show that VHASR can effectively utilize key information in images to enhance the model's speech recognition ability. Its performance not only surpasses unimodal ASR but also achieves SOTA among existing image-based multimodal ASR models.

* 14 pages, 6 figures, accepted by EMNLP 2024

Hypergraph based Understanding for Document Semantic Entity Recognition

Jul 09, 2024

Qiwei Li, Zuchao Li, Ping Wang, Haojun Ai, Hai Zhao

Abstract: Semantic entity recognition is an important task in the field of visually-rich document understanding. It distinguishes the semantic types of text by analyzing the positional relationships between text nodes and the relations between text content. Existing document understanding models mainly focus on entity categories while ignoring the extraction of entity boundaries. We build a novel hypergraph attention document semantic entity recognition framework, HGA, which uses hypergraph attention to focus on entity boundaries and entity categories at the same time. It can conduct a more detailed analysis of the document text representation produced by the upstream model and better captures semantic information. We apply this method on top of GraphLayoutLM to construct a new semantic entity recognition model, HGALayoutLM. Our experimental results on FUNSD, CORD, XFUND and SROIE show that our method effectively improves semantic entity recognition performance over the original model. HGALayoutLM achieves new state-of-the-art results on FUNSD and XFUND.

Error Model of Radio Fingerprint and PDR Fusion Indoor Localization

Jan 05, 2020

Haojun Ai, Kaifeng Tang, Sheng Zhang, Yuhong Yang

Abstract: Multi-source fusion positioning is one of the technical frameworks for obtaining sufficient indoor positioning accuracy. To evaluate the effect of multi-source fusion positioning, it is necessary to establish a fusion error model. In this paper, we first use the least squares method to fuse the radio fingerprint and PDR positioning, and then apply the variance propagation law to calculate the error distribution of indoor multi-source localization methods. Based on the fusion error model, we develop an indoor positioning simulation system. The system can suggest a better layout of positioning sources under given conditions and can evaluate the signal strength distribution and the error distribution.
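
The fusion step described above can be sketched in a few lines. This is a minimal illustration, assuming the two position estimates are independent, that their errors are uncorrelated across the x and y axes, and that the least squares weights are the inverse variances; the abstract does not state these details, so they are assumptions. The function names and the numeric values are hypothetical.

```python
def fuse_axis(est_a, var_a, est_b, var_b):
    """Weighted least-squares fusion of two independent estimates on one axis.

    Weights are inverse variances (an assumption). The fused variance follows
    from propagating the input variances through the linear combination.
    """
    w_a = 1.0 / var_a
    w_b = 1.0 / var_b
    fused_var = 1.0 / (w_a + w_b)
    fused_est = fused_var * (w_a * est_a + w_b * est_b)
    return fused_est, fused_var


def fuse_position(pos_a, var_a, pos_b, var_b):
    """Fuse two 2-D positions axis by axis (assumes diagonal covariances)."""
    fused = [fuse_axis(a, va, b, vb)
             for a, va, b, vb in zip(pos_a, var_a, pos_b, var_b)]
    return [f[0] for f in fused], [f[1] for f in fused]


# Made-up numbers: a radio fingerprint fix with variance 4 m^2 per axis
# and a PDR fix with variance 1 m^2 per axis.
fused_xy, fused_var = fuse_position([10.0, 5.0], [4.0, 4.0],
                                    [12.0, 5.0], [1.0, 1.0])
print(fused_xy, fused_var)
```

Under these assumptions the fused variance is never larger than the smaller of the two input variances, which is the property a fusion error model can use to compare candidate layouts of positioning sources.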
