Automatic detection and correction of disfluencies in English spontaneous speech
Spontaneous speech is optimized for human-human communication, but it presents major challenges to intelligent systems interacting with people. The lack of sequential turns, the presence of non-verbal cues, and disfluencies (i.e., disruptions to fluent speech) are some of the properties of spontaneous speech that intelligent systems must handle in order to achieve a human-like understanding of speech. Of the challenges associated with modeling spontaneous speech, this thesis focuses on detecting and correcting speech disfluencies.

I begin by considering the challenge of learning disfluency-related features in recurrent and convolutional neural networks. Although state-of-the-art disfluency detection results have been reported for deep neural networks, their performance still depends heavily on hand-crafted features. My first contribution is a novel neural layer that rectifies the architectural limitations preventing pre-Transformer models from learning appropriate features from words alone. Extensive evaluations indicate that the proposed model learns these features automatically and performs competitively with more complex approaches in the literature.

Disfluencies are also problematic for conventional syntactic parsers, which typically fail to find any disfluency nodes at all. I address this problem by investigating a multi-task learning model for joint disfluency detection and constituency parsing. I show that modern deep neural syntactic parsers, unlike conventional parsers, can not only detect disfluencies as part of the parsing task but also outperform specialized disfluency detection systems. I further demonstrate that syntactic information helps the neural parser detect disfluencies more accurately.

A lack of large-scale labeled data is a bottleneck for improving the performance of disfluency detection models. Existing data augmentation techniques usually generate artificial synthetic disfluencies that bear little resemblance to gold natural data. I address the scarcity of human-labeled data by exploring a self-training technique, and show that conventional methods for improving system performance (self-training and ensembling) produce results competitive with or better than all of the specialized methods for exploiting unlabeled data reported in prior work.

Despite these advances, disfluency detection models do not generalize well to ASR outputs, a limitation that severely hinders their use in real applications. To address this problem, I propose the end-to-end task of speech recognition and disfluency removal. I show that end-to-end ASR models can learn to generate fluent transcripts directly from disfluent speech. I also introduce two new metrics for evaluating the performance of end-to-end models. The findings of this research serve as a starting point for future research into end-to-end ASR and disfluency removal systems.
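To make the detection-and-correction task concrete: disfluency removal is commonly framed as token-level tagging, where each word is marked as fluent or as part of a disfluency (the reparandum plus any filler material), and the fluent transcript is obtained by dropping the marked tokens. The following sketch illustrates this framing on a classic example; the label names here are illustrative, not the thesis's annotation scheme.

```python
# Minimal sketch of disfluency removal as token-level tagging.
# Labels (illustrative only): "F" = fluent, "RM" = reparandum,
# "IM" = interregnum (filler material such as "uh, I mean").

def remove_disfluencies(tokens, labels):
    """Return the fluent transcript: keep only tokens tagged fluent."""
    return [tok for tok, lab in zip(tokens, labels) if lab == "F"]

# Classic example: the reparandum "to Boston" is replaced by the
# repair "to Denver" after the interregnum "uh I mean".
tokens = "I want a flight to Boston uh I mean to Denver".split()
labels = ["F", "F", "F", "F", "RM", "RM", "IM", "IM", "IM", "F", "F"]

print(" ".join(remove_disfluencies(tokens, labels)))
# -> I want a flight to Denver
```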
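For the joint parsing contribution, the key observation is that treebanks of conversational speech such as Switchboard annotate reparanda under EDITED constituents, so a parser that recovers these nodes performs disfluency detection as a by-product of parsing. A minimal sketch of reading disfluent tokens off a simplified, hand-written parse tree, assuming nltk is available:

```python
from nltk.tree import Tree

# Simplified Switchboard-style parse: the reparandum "to Boston"
# sits under an EDITED node; tree detail is abbreviated for clarity.
parse = Tree.fromstring(
    "(S (NP (PRP I)) (VP (VBP want) (NP (DT a) (NN flight))"
    " (EDITED (PP (TO to) (NNP Boston)))"
    " (PP (TO to) (NNP Denver))))"
)

def disfluent_tokens(tree):
    """Collect all leaves dominated by an EDITED constituent."""
    spans = []
    for subtree in tree.subtrees(lambda t: t.label() == "EDITED"):
        spans.extend(subtree.leaves())
    return spans

print(disfluent_tokens(parse))   # -> ['to', 'Boston']
```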
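The self-training recipe referred to above is the conventional one: train a model on the human-labeled data, pseudo-label the unlabeled pool with the current model, keep only high-confidence predictions, and retrain on the enlarged set. A minimal, model-agnostic sketch; the confidence threshold and round count are placeholder values, and a classifier with a fit/predict/predict_proba interface is assumed:

```python
import numpy as np

def self_train(model, X_lab, y_lab, X_unlab, threshold=0.95, rounds=3):
    """Generic self-training loop: iteratively grow the training set
    with confidently pseudo-labeled examples from the unlabeled pool."""
    X_train, y_train = X_lab, y_lab
    pool = X_unlab
    for _ in range(rounds):
        model.fit(X_train, y_train)
        if len(pool) == 0:
            break
        probs = model.predict_proba(pool)   # shape (n_pool, n_classes)
        keep = probs.max(axis=1) >= threshold
        if not keep.any():                  # nothing confident enough
            break
        pseudo_y = model.predict(pool[keep])
        X_train = np.concatenate([X_train, pool[keep]])
        y_train = np.concatenate([y_train, pseudo_y])
        pool = pool[~keep]                  # shrink the unlabeled pool
    return model

# Example usage with any scikit-learn-style classifier:
# from sklearn.linear_model import LogisticRegression
# model = self_train(LogisticRegression(), X_lab, y_lab, X_unlab)
```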
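The two evaluation metrics for end-to-end models are defined in the thesis itself and are not reproduced here; a natural building block for any such metric, though, is word error rate computed against a fluent (disfluency-removed) reference rather than the verbatim transcript. The sketch below implements only that building block, as an illustration rather than the thesis's metrics:

```python
def word_error_rate(hyp, ref):
    """Word-level Levenshtein distance, normalized by reference length."""
    h, r = hyp.split(), ref.split()
    # dp[i][j] = edit distance between first i hyp words, first j ref words
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            sub = dp[i - 1][j - 1] + (h[i - 1] != r[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(h)][len(r)] / max(len(r), 1)

# Score an end-to-end model's output against the *fluent* reference:
fluent_ref = "I want a flight to Denver"
model_out = "I want a flight to Denver"
print(word_error_rate(model_out, fluent_ref))   # -> 0.0
```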