Friendly VOiCES: A Kinder Experience

John Berkowitz
Gab41
May 19, 2020


Over the past two years we’ve worked in collaboration with SRI International to build the Voices Obscured in Complex Environmental Settings (VOiCES) dataset. See here, here, here, and here for previous Gab41 posts on VOiCES. Most available speech audio datasets focus on recordings made with close-range microphones and without background noise. However, real-world applications rarely conform to such ideal conditions. The goal behind the VOiCES dataset is to provide a large corpus of speech recorded in realistic acoustic conditions, across a variety of microphone distances, and in the presence of varying background distractor noise. We believe such a dataset will be invaluable for researchers working on automatic speech recognition (ASR), speaker verification, and audio denoising. For example, it is important for designers of ASR systems to quantify how modest amounts of reverberation and background noise affect the performance of a system that was trained on “clean” audio before deployment.

We are proud to announce the final release of VOiCES, including recordings from all four rooms. In addition to increasing the number of recordings included in the dataset, we have also made several modifications and added features to make it easier to use. In this post we will review some of these added features and walk through an example of using VOiCES to improve robustness in an ASR system.

Reorganized Files, Metadata, and Starter Code

In addition to reorganizing the data directory structure into train and test sets that avoid overlap in speakers and transcripts, we have also restructured directories to separate recordings by speaker, distractor type, and room. As the full dataset is around 420 GiB in size, we also provide a 28 GiB subset, VOiCES_devkit, designed to maximize diversity in the speaker set while retaining all variation in room and distractor type.

To avoid forcing each user to write their own custom scripts for parsing and looping over files, we include index files that aggregate useful information about each recording: file paths to the recording audio and the original LibriSpeech source audio; metadata such as mic number, speaker ID, and distractor type; and the orthographic transcript.

Example rows of an index file, after loading into a Pandas DataFrame

These prebuilt index files make it easy to create sub-slices of the dataset and/or convert it into the format required by other machine learning pipelines, as we demonstrate later in this post. Further information on these index files, as well as other precomputed data, can be found on the VOiCES Readme page.
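As a quick illustration, a few lines of pandas are enough to load an index file and slice out a particular recording condition. This is only a hedged sketch: the file path and column names below (room, distractor, filename, transcript) are assumptions, so check the index file headers described in the Readme for the actual schema.

```python
import pandas as pd

# Load one of the prebuilt index files (path is illustrative; see the
# VOiCES Readme for the actual file locations and column names).
index = pd.read_csv("VOiCES_devkit/references/train_index.csv")

# Example slice: all room-1 recordings with a "babble" distractor,
# assuming columns named "room" and "distractor" exist in the index.
babble_rm1 = index[(index["room"] == "rm1") & (index["distractor"] == "babb")]

# Keep only the fields a downstream pipeline needs, e.g. a manifest of
# audio paths and transcripts.
manifest = babble_rm1[["filename", "transcript"]]
print(manifest.head())
```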

We have also set up a repository containing useful scripts, utilities, and starter code for working with VOiCES. It includes scripts for rebuilding the aforementioned index files and transforming an index file into other formats, as well as an example implementation of a PyTorch DataLoader for speaker verification with VOiCES.
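To give a flavor of what such starter code looks like, here is a minimal sketch (not the repository’s implementation) of an index-file-driven PyTorch Dataset for speaker identification; the “filename” and “speaker” column names and the devkit paths are assumptions.

```python
import pandas as pd
import torch
import torchaudio
from torch.utils.data import Dataset, DataLoader

class VoicesSpeakerDataset(Dataset):
    """Minimal sketch of an index-file-driven dataset for speaker ID.

    Column names ("filename", "speaker") are assumptions; adjust them
    to match the actual VOiCES index file schema.
    """

    def __init__(self, index_path, root_dir):
        self.index = pd.read_csv(index_path)
        self.root_dir = root_dir
        # Map speaker IDs to contiguous integer labels.
        speakers = sorted(self.index["speaker"].unique())
        self.label_map = {spk: i for i, spk in enumerate(speakers)}

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i):
        row = self.index.iloc[i]
        waveform, sample_rate = torchaudio.load(f"{self.root_dir}/{row['filename']}")
        label = self.label_map[row["speaker"]]
        return waveform, label

# Usage sketch: wrap the dataset in a DataLoader for training.
# dataset = VoicesSpeakerDataset("train_index.csv", "VOiCES_devkit")
# loader = DataLoader(dataset, batch_size=8, shuffle=True)
```

In practice, batching variable-length waveforms also requires a custom collate function or fixed-length cropping, which the commented DataLoader above glosses over.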

Evaluating and Fine-tuning an End-to-End ASR Model on VOiCES

As an example of the utility of the dataset, we evaluated an ASR model on VOiCES and explored several approaches to fine-tuning the model to increase robustness. The model we focus on is QuartzNet. Several factors make QuartzNet useful for this demonstration:

  1. QuartzNet is trained end-to-end from the raw waveform to a prediction over the transcript. Thus, we do not need to compute alignments between the waveform and the transcript, a difficult task for noisy audio. Despite this, QuartzNet still achieves close to state-of-the-art performance on the subset of LibriSpeech that is used for VOiCES’s source audio.
  2. QuartzNet’s acoustic model is trained directly with CTC loss, separately from any kind of autoregressive language model such as an n-gram model or transformer (though such models may be integrated to improve performance at inference time). Thus, we can attribute changes in performance on VOiCES to improvements in the acoustic model rather than to overfitting to the distribution of transcripts.
  3. Training and running inference with QuartzNet is made easier by NVIDIA NeMo. NeMo provides scripts for training and fine-tuning QuartzNet on new speech audio, supplies pretrained QuartzNet checkpoints, and facilitates building a custom pipeline for programmatic inference.

Using NeMo, we put together a pipeline for batch inference on VOiCES that aggregates predictions and word error rate (WER) for each file (a sketch of this step appears after the list below). We compared the performance of three QuartzNet 15x5 models on the test sets of LibriSpeech-Clean and VOiCES:

  1. The checkpoint pre-trained on LibriSpeech.
  2. The pre-trained checkpoint fine-tuned for 5 epochs on VOiCES-train alone.
  3. The pre-trained checkpoint fine-tuned for 5 epochs on a mixture of VOiCES-train and LibriSpeech train-clean-100.

All fine-tuning runs used SpecAugment data augmentation and the Novograd optimizer.
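For reference, batch transcription with a pretrained QuartzNet checkpoint can be scripted in a few lines. This is a hedged sketch rather than our exact pipeline: NeMo’s Python API has changed across releases, the index path and column names are assumptions, and the per-file WER here is computed with the jiwer package.

```python
# Hedged sketch of batch inference on VOiCES with a pretrained QuartzNet
# checkpoint via NeMo, scoring each file with jiwer. NeMo's transcribe()
# signature differs between releases; this follows the NeMo 1.x API.
import pandas as pd
import jiwer
import nemo.collections.asr as nemo_asr

index = pd.read_csv("test_index.csv")                   # prebuilt VOiCES index (assumed path)
audio_paths = index["filename"].tolist()                # assumed column names
references = index["transcript"].str.lower().tolist()

model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="QuartzNet15x5Base-En"
)
hypotheses = model.transcribe(paths2audio_files=audio_paths, batch_size=32)

# Per-file WER, kept alongside the metadata for later slicing.
index["wer"] = [jiwer.wer(ref, hyp) for ref, hyp in zip(references, hypotheses)]
print("Overall WER:", index["wer"].mean())
```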

Results of fine-tuning QuartzNet 15x5 (WER %)

Unsurprisingly, a model trained only on clean audio performs much worse on VOiCES. Additionally, fine-tuning only on VOiCES leads to a dramatic drop in performance on clean Librispeech audio in exchange for only a modest improvement on VOiCES. Finally, we note that fine-tuning with a mixture of VOiCES and Librispeech avoids this type of overfitting and achieves roughly the same increase in robustness.

One of the main benefits of the VOiCES dataset is that each “clean” audio sample is used as source audio for every combination of room, distractor sound, and microphone. This makes it easier to isolate the effects of different variables on the performance of a speech processing system. For instance, we can plot the average WER as a function of distractor type.
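Because the index files carry the recording metadata next to each file path, this kind of breakdown reduces to a groupby over the per-file WER column built in the earlier sketch (the “distractor” column name is again an assumption).

```python
import matplotlib.pyplot as plt

# Average WER per distractor condition, using the per-file "wer" column
# from the inference sketch above and the (assumed) "distractor" column.
mean_wer = index.groupby("distractor")["wer"].mean().sort_values()

mean_wer.plot(kind="bar")
plt.xlabel("Distractor type")
plt.ylabel("Average WER")
plt.tight_layout()
plt.show()
```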

WER as a function of distractor type and fine-tuning

As the figure shows, the most harmful type of distractor sound is “babble”, followed by “music”, “television”, and then “none.” While it is unsurprising that babble is the most harmful distractor, since the ASR system cannot tell the “source” speech from the “distractor” speech, it is notable that QuartzNet only achieves ~30% WER with no added distractors, even after fine-tuning. In this setting, the only distortion comes from the reverberation added by the room and low-level incidental noise such as HVAC. In previous competitions using VOiCES, the most successful approaches relied on sophisticated pre-processing techniques such as explicit de-reverberation, and the results above support the necessity of doing so.

On a related note, another benefit of the VOiCES dataset is the broad distribution of speech quality present in the recordings. Depending on the presence and type of distractor sounds, the distance from the mic to the source and distractor speakers, and the microphone type, the intelligibility of the original audio can range from very clear to very poor. This makes it possible to quantify the correlation between objective measures of speech quality and the performance of a speech processing system. Because we have access to the original source audio for every recording in VOiCES, we can compute intrusive as well as non-intrusive measures of speech quality. We computed five popular objective measures for each recording: Narrowband Perceptual Evaluation of Speech Quality (PESQ-NB), Wideband Perceptual Evaluation of Speech Quality (PESQ-WB), Short-Time Objective Intelligibility (STOI), Speech Intelligibility in Bits (SIIB), and Normalized Speech-to-Reverberation Modulation Energy Ratio (SRMR). More details about how we computed these measures and how to download the results are available here. We then computed two non-parametric measures of correlation, Kendall’s tau and Spearman’s rank correlation, between the QuartzNet WER and each of these measures.
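As a hedged sketch of this last step, the snippet below computes one intrusive measure (STOI, via the pystoi package) from each source/recording pair and correlates it with the per-file WER from the earlier sketch using scipy. The “source” and “filename” column names are assumptions about the index schema, and the other four measures come from separate packages.

```python
import soundfile as sf
from pystoi import stoi                      # STOI; PESQ/SIIB/SRMR come from other packages
from scipy.stats import kendalltau, spearmanr

# Compute STOI for each source/recording pair, then correlate it with
# the per-file WER column computed in the inference sketch above.
stoi_scores = []
for _, row in index.iterrows():
    clean, fs = sf.read(row["source"])       # original LibriSpeech audio (assumed column)
    degraded, _ = sf.read(row["filename"])   # far-field VOiCES recording
    n = min(len(clean), len(degraded))       # crude alignment by truncation
    stoi_scores.append(stoi(clean[:n], degraded[:n], fs))

index["stoi"] = stoi_scores
tau, _ = kendalltau(index["stoi"], index["wer"])
rho, _ = spearmanr(index["stoi"], index["wer"])
print(f"Kendall tau: {tau:.3f}, Spearman rho: {rho:.3f}")
```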

Kendall-Tau correlation between WER and Speech Quality Measures
Spearman’s rank correlation between WER and Speech Quality Measures

We note that all five measures are highly predictive of WER, particularly SRMR and SIIB, and that fine-tuning QuartzNet slightly decreases its sensitivity to speech quality. In the future, we hope that VOiCES can serve as a useful resource for developing and testing new measures of speech quality and intelligibility. If you have a new metric or measure you’d like to benchmark on VOiCES, let us know!

VOiCES Looking Forward

The final release of VOiCES is available on the AWS Registry of Open Data (see here for download instructions). We’re currently working with SRI International to build VOiCES II, a dataset that will feature real conversations between multiple speakers with multi-modal recordings. We encourage anyone interested in collaborating to visit our repository of starter code and add feature requests, or to help us add VOiCES to your speech audio toolkit.

Lab41 is a Silicon Valley challenge lab where experts from the U.S. Intelligence Community (IC), academia, industry, and In-Q-Tel come together to gain a better understanding of how to work with — and ultimately use — big data.

Learn more at lab41.org and follow us on Twitter: @_lab41
