Title: Spontaneous Informal Speech Dataset for Punctuation Restoration

Sep 17, 2024

Xing Yi Liu, Homayoon Beigi

Abstract: Presently, punctuation restoration models are evaluated almost solely on well-structured, scripted corpora. Real-world ASR systems and post-processing pipelines, on the other hand, are typically applied to spontaneous speech with significant irregularities, stutters, and deviations from perfect grammar. To address this discrepancy, we introduce SponSpeech, a punctuation restoration dataset derived from informal speech sources, which includes punctuation and casing information. In addition to publicly releasing the dataset, we contribute a filtering pipeline that can be used to generate more data. Our filtering pipeline examines the quality of both speech audio and transcription text. We also carefully construct a "challenging" test set, aimed at evaluating models' ability to leverage audio information to predict otherwise grammatically ambiguous punctuation. SponSpeech is available at https://github.com/GitHubAccountAnonymous/PR, along with all code for dataset building and model runs.

* Recognition Technologies, Inc. Technical Report, 2024 * 8 pages, 7 tables, 1 figure
