Kaldi recipe for Mozilla's Common Voice corpus

Recently, Mozilla published the first version of their Common Voice corpus. It consists of recordings of read prompts from an unknown number of speakers (no speaker IDs are provided due to privacy concerns; see this forum thread). About 254 hours of recordings have been validated by multiple listeners. The data is split into pre-defined training, development, and test sets.

I created an initial Kaldi recipe for the corpus as a research exercise; it has since been merged into the Kaldi master branch. Using the pre-defined split into training, development, and test sets, and without any parameter tuning, a system with a time-delay neural network (TDNN) acoustic model and a 3-gram language model achieved a WER of 4.27% on the test set. However, since the spoken sentences overlap almost completely between the training, development, and test sets, this result says little about how well the models would generalize to other data. See this issue for an ongoing discussion on how the data collection effort could be improved.
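The overlap is easy to check from the metadata that ships with the corpus. Below is a minimal sketch, assuming the first-release CSV files cv-valid-train.csv, cv-valid-dev.csv, and cv-valid-test.csv, each with a text column holding the prompt for each clip (the file names and column name are assumptions about the release layout, not part of the recipe):

```python
import csv

def sentences(path):
    """Return the set of unique prompt texts listed in a Common Voice metadata CSV."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row["text"].strip().lower() for row in csv.DictReader(f)}

train = sentences("cv-valid-train.csv")

# For each evaluation split, report what fraction of its unique
# sentences also occur in the training set.
for name in ("cv-valid-dev.csv", "cv-valid-test.csv"):
    split = sentences(name)
    overlap = len(split & train) / len(split)
    print(f"{name}: {overlap:.1%} of unique sentences also occur in train")
```

A high overlap here means the language model has effectively seen the test sentences during training, which is why the WER above should not be read as a measure of generalization.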


Interspeech 2016 special event "Speaker Comparison for Forensic and Investigative Applications II"

I will be contributing to the special event titled "Speaker Comparison for Forensic and Investigative Applications II" at Interspeech 2016, held on September 10 at 10:00 am in the Grand Ballroom of the Hyatt Regency, San Francisco.

I am also presenting a paper on likelihood ratio calculation in acoustic-phonetic forensic voice comparison at Interspeech, on Friday, September 9 at 2:00 pm in Room Seacliff BCD.
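For readers unfamiliar with the framework: a forensic likelihood ratio expresses how much better the evidence is explained by a model of the known speaker than by a model of the relevant population. A toy univariate-Gaussian illustration follows; this is a generic sketch, not the method from the paper, and all models and numbers are made up:

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a univariate Gaussian at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Hypothetical formant-like measurements (Hz); purely illustrative values.
known_mean, known_sd = 500.0, 30.0   # model of the known speaker
pop_mean, pop_sd = 550.0, 80.0       # model of the relevant population
x = 510.0                            # questioned-speaker measurement

# LR > 1 supports the same-speaker hypothesis, LR < 1 the different-speaker one.
lr = gaussian_pdf(x, known_mean, known_sd) / gaussian_pdf(x, pop_mean, pop_sd)
print(f"LR = {lr:.2f}  (log10 LR = {math.log10(lr):.2f})")
```

Real acoustic-phonetic systems model multivariate measurements and account for within-speaker variation, but the numerator/denominator structure is the same.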


Multi-laboratory evaluation of forensic voice comparison systems under conditions reflecting those of a real forensic case (forensic_eval_01)

Geoff Morrison and I are running a multi-laboratory evaluation of forensic voice comparison systems under conditions reflecting those of a real forensic case (forensic_eval_01).

There is increasing pressure on forensic laboratories to validate the performance of forensic analysis systems before they are used to assess strength of evidence for presentation in court. Different forensic voice comparison systems may use different approaches, and even among systems using the same general approach there can be substantial differences in operational details. From case to case, the relevant population, speaking styles, and recording conditions can vary widely, but relatively poor recording conditions and mismatches between the known- and questioned-speaker recordings are common.

To validate a system intended for use in casework, a forensic laboratory needs to evaluate the degree of validity and reliability of the system under forensically realistic conditions. We have released a set of training and test data representative of the relevant population and reflecting the conditions of an actual forensic voice comparison case, and we invite operational forensic laboratories and research laboratories to use these data to train and test their systems. The details below include the rules for the evaluation, a description of the data, and a description of the evaluation metrics and graphics. The name of the evaluation is forensic_eval_01.
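As a concrete example of the kind of metric involved: the log-likelihood-ratio cost (Cllr) is a standard validity metric in forensic voice comparison. Whether forensic_eval_01 uses exactly this metric is not stated above, so treat the following as a sketch of common practice rather than the evaluation's specification:

```python
import math

def cllr(same_speaker_lrs, diff_speaker_lrs):
    """Log-likelihood-ratio cost:
    0.5 * (mean log2(1 + 1/LR) over same-speaker trials
         + mean log2(1 + LR)   over different-speaker trials)."""
    ss = sum(math.log2(1 + 1 / lr) for lr in same_speaker_lrs) / len(same_speaker_lrs)
    ds = sum(math.log2(1 + lr) for lr in diff_speaker_lrs) / len(diff_speaker_lrs)
    return 0.5 * (ss + ds)

# Made-up LRs: a well-calibrated system gives large LRs on same-speaker
# trials and small LRs on different-speaker trials, driving Cllr toward 0.
print(f"Cllr = {cllr([20.0, 8.0, 50.0], [0.05, 0.2, 0.01]):.3f}")
```

Lower is better: a system that always outputs LR = 1 (no information) scores Cllr = 1, while a perfectly informative, well-calibrated system approaches 0.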

Papers reporting on the results of the evaluation of each system will be published in a Virtual Special Issue (VSI) of Speech Communication.

Details (draft of introductory paper for the VSI).
