Tuesday, February 28, 2006

Some resources for noise-robust and channel-robust speech processing

The original of this page can be accessed here. I have copied it to my site for easy reference.

This page collects links to software and data resources related to research on automatic speech recognition (ASR) that is robust to background noise and convolutional distortions such as reverberation. Some of the links pointed to by this page are also relevant to research on enhancing speech for human listening. If you would like to suggest more links for this page, you are invited to contact the page's maintainer, David Gelbart at ICSI.

If you use software or other resources pointed to by this page, please respect the license terms (and, when applicable, patent rights). If the use contributes to an academic publication, the maintainer suggests mentioning this in the publication and referencing the original source (not this page of pointers) by giving a publication or URL reference that will allow others to obtain the resource. This serves four purposes: (1) it alerts readers of the publication to the availability of the resource, (2) it provides a precise specification of that aspect of the work described in the publication, (3) it assigns credit where credit is due, and (4) it shows readers of the publication that sharing resources with others leads to public recognition, encouraging future sharing.

Successful approaches to robust ASR may combine more than one robustness technique. Because of the simple data flow of much signal processing code, different tools can often be used together simply by running them in sequence, using pipes or intermediate files. Two convenient choices for intermediate file formats are HTK feature files and waveforms. Many of the tools linked here operate on HTK feature files, or can output HTK feature files. The HTK format is a useful intermediate file format for feature files because it is simple to read, write, and convert to other formats, and because of the popularity of HTK.

Also, some algorithms can be used with other tools without any modification to those other tools by having the algorithms run speech-enhancement-style, outputting processed waveforms which the other tools treat as they would any other audio input file. Using processed waveforms as an intermediate format also allows listening, waveform plotting, and spectrogram plotting, which may lead to useful insights. (If using processed waveforms as an intermediate format, it may be worthwhile to store these processed waveforms in floating point, rather than the usual 16-bit integer storage format, to reduce roundoff error and eliminate the risk of overflow/saturation error. Since algorithms may change the scale of waveforms, there is a risk of overflow or underflow with a 16-bit integer format even if the original waveforms were well scaled for that format.)
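To illustrate why the HTK format is easy to use as an intermediate format, here is a minimal sketch of reading and writing uncompressed HTK feature files in Python. It assumes the standard layout (a 12-byte big-endian header holding the frame count, the frame period in 100 ns units, the bytes per frame, and the parameter kind, followed by 32-bit floats) and does not handle compressed or CRC-checked files. The function names write_htk and read_htk are made up for this example and are not part of HTK.

```python
import struct

# HTK base parameter kind for user-defined features (USER = 9).
HTK_USER = 9


def write_htk(path, frames, frame_period_s=0.010, parm_kind=HTK_USER):
    """Write a list of equal-length float vectors as an uncompressed HTK file."""
    n_frames = len(frames)
    dim = len(frames[0]) if n_frames else 0
    samp_period = int(round(frame_period_s * 1e7))  # HTK time unit is 100 ns
    samp_size = dim * 4                             # bytes per frame (float32)
    with open(path, "wb") as f:
        # Header fields: nSamples, sampPeriod, sampSize, parmKind (big-endian).
        f.write(struct.pack(">iihh", n_frames, samp_period, samp_size, parm_kind))
        for frame in frames:
            f.write(struct.pack(">%df" % dim, *frame))


def read_htk(path):
    """Return (frames, frame_period_in_seconds, parm_kind) from an uncompressed HTK file."""
    with open(path, "rb") as f:
        n_frames, samp_period, samp_size, parm_kind = struct.unpack(">iihh", f.read(12))
        dim = samp_size // 4
        frames = [
            list(struct.unpack(">%df" % dim, f.read(samp_size)))
            for _ in range(n_frames)
        ]
    return frames, samp_period * 1e-7, parm_kind
```

A tool that writes its output this way can hand features to any other tool that accepts HTK feature files, which is the chaining pattern described above.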

Enhancement/compensation software for ASR and human listening:

Software for ASR:

Software for signal quality measurement:

Software and data for reproducing or simulating acoustic conditions:

Other:


VOICEBOX

The VOICEBOX Matlab toolbox for audio processing includes a noise reduction routine (specsubm), routines to read and write audio files from Matlab, and many other things.

Beamforming Toolkit

The Karlsruhe beamforming toolkit: "btk is a toolkit that provides a basis for the implementation of powerful beamforming algorithms. btk uses Python as a scripting language for ease of control and modification. The capacity to efficiently perform advanced numerical computations is provided by Numeric Python (NumPy), the GNU Scientific Library (GSL), as well as a few extra algorithms we've implemented ourselves."

Qualcomm-ICSI-OGI front end, speech detection, and noise reduction

This archive contains source code and documentation for the Qualcomm-ICSI-OGI noise-robust front end described in the ICSLP 2002 paper by Adami et al. The archive also contains tools for using the speech detection, Wiener filter noise reduction, or nonspeech frame dropping features of the front end independently of other features. The noise reduction can be used independently of other components to produce noise-reduced waveforms.

Matlab noise reduction tools by Patrick Wolfe

Matlab source code for various noise reduction algorithms is available here.

Trausti Kristjansson

Trausti Kristjansson created this page (while at the University of Toronto) which provides Matlab source code for (1) spectral subtraction noise removal, (2) the Algonquin variational inference algorithm for removing noise and channel effects, and (3) the Recognition Analyzer diagnostic tool which displays features, HTK log likelihoods, and HTK state sequences and can create resynthesized audio from MFCC features.

Marc Ferras' code for multi-microphone speech enhancement

This page provides source code for several blind multi-microphone speech enhancement techniques. These were implemented by Marc Ferras while pursuing his master's thesis on multi-microphone signal processing for automatic speech recognition in meeting rooms.

The RESPITE CASA Toolkit

The RESPITE CASA Toolkit is a toolkit for Computational Auditory Scene Analysis (CASA). This includes a tutorial on using the toolkit for missing data speech recognition.

Seneff auditory model

This page has source code for an implementation of Stephanie Seneff's auditory model front end for ASR.

RASTA and MSG

C/C++ implementations of the RASTA and MSG (modulation-filtered spectrogram) algorithms for robust feature extraction are available as part of this ICSI speech software package. There is also this older page for RASTA at ICSI. A Matlab implementation of RASTA is available at Dan Ellis' Matlab page.

MVA (Mean, Variance, ARMA)

This page provides source code for the MVA technique, proposed by Chia-Ping Chen and Jeff Bilmes, which post-processes noisy cepstra by doing mean and variance normalization (M for mean, V for variance) and bandpass modulation filtering (A for ARMA).
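As a rough illustration of this kind of post-processing (not the authors' code), the sketch below applies per-utterance mean and variance normalization to a sequence of cepstral frames and then smooths each dimension over time with a simple ARMA filter. The particular filter form used here, which averages the previous `order` outputs with the current and next `order` inputs, and the choice to leave the first and last `order` frames unfiltered are assumptions made for this example; consult the linked page for the actual definition.

```python
import math


def mean_variance_normalize(frames):
    """Give each feature dimension zero mean and unit variance over the utterance."""
    n, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = []
    for d in range(dim):
        var = sum((f[d] - means[d]) ** 2 for f in frames) / n
        stds.append(math.sqrt(var) or 1.0)  # avoid dividing by zero for constant dimensions
    return [[(f[d] - means[d]) / stds[d] for d in range(dim)] for f in frames]


def arma_smooth(frames, order=2):
    """Smooth each dimension over time (assumed ARMA form, see the text above).

    Output frame t is the average of the previous `order` outputs and the
    current plus next `order` inputs. Edge frames are copied unchanged.
    """
    n, dim = len(frames), len(frames[0])
    out = [list(f) for f in frames]
    for t in range(order, n - order):
        for d in range(dim):
            past = sum(out[t - i][d] for i in range(1, order + 1))
            future = sum(frames[t + j][d] for j in range(order + 1))
            out[t][d] = (past + future) / (2 * order + 1)
    return out


def mva(frames, order=2):
    """Mean/variance normalization followed by ARMA smoothing."""
    return arma_smooth(mean_variance_normalize(frames), order)
```

The input is a list of frames, each a list of cepstral coefficients, so the output could be written to an HTK feature file and passed on to a recognizer.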

Gabor filter analysis for speech recognition

This page provides articles, filter definitions, software tools, and discussion related to automatic speech recognition (ASR) with Gabor filters. A Matlab package for feature selection using the Feature Finding Neural Networks (FFNN) approach proposed by Tino Gramß (Gramss) is available as well. (This FFNN package was used to select Gabor filters for ASR.)

Objective Speech Quality Assessment

The CSLU Robust Speech Processing Laboratory software repository page hosts the Objective Speech Quality Assessment package (developed by Bryan Pellom, referred to in an ICSLP 98 paper by Hansen and Pellom) which calculates various metrics of speech quality based on comparing clean audio with noisy or noise-reduced audio.

NIST Speech Quality Assurance Package (SPQA)

The SPQA package includes SNR measurement tools which do not require a clean audio reference.

FaNT tool for adding noise or telephone characteristics to speech

FaNT (Filtering and Noise-adding Tool) can be used to add noise to speech recordings at a desired SNR (signal-to-noise ratio). The SNR can be calculated using frequency weighting (G.712 or A-weighting), and the speech energy is calculated following ITU recommendation P.56. The tool can also be used to filter speech with certain frequency characteristics defined by the ITU for telephone equipment. This tool was used to create the noisy data for the popular AURORA 2 speech recognition corpus.
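The basic idea of adding noise at a desired SNR can be sketched in a few lines of Python. This is a simplified illustration, not FaNT's method: it measures plain mean power over the whole signal, with no frequency weighting and no P.56 active speech level, so the SNRs it produces will generally differ from FaNT's. The function names are made up for this example.

```python
import math


def mean_power(x):
    """Average squared sample value."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that speech power / noise power equals snr_db, then add it."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_noise == 0.0:
        raise ValueError("noise is silent; cannot scale it to a target SNR")
    # Choose gain so that p_speech / (gain**2 * p_noise) == 10 ** (snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Samples are kept as floats throughout, which avoids the 16-bit overflow risk when the scaled noise pushes the sum beyond the integer range.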

Acoustic impulse responses

This page, created by James Hopgood, is a collection of acoustic impulse responses for simulating convolutional distortion. The focus is on hands-free / far-field acoustic conditions.
Some past speech recognition work (by Shire, Kingsbury, Avendano, Palomaki, Morgan, Chen, Gelbart, possibly others) has been done using a set of impulse responses collected using the varechoic chamber at Bell Labs. It is planned to make these available on Hopgood's web page. Until they are available there, a download link has been placed here.

More acoustic impulse responses are available as part of the Sound Scene Database in Real Acoustical Environments from the Real World Computing Partnership, here. The site noisevault.com has acoustic impulse responses as well as links to software and documents regarding impulse response measurement and acoustic simulation; it seems aimed at audio engineers and audio engineering hobbyists.
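Simulating convolutional distortion with one of these impulse responses amounts to convolving the clean speech signal with the impulse response. The sketch below shows the operation directly in plain Python; it is only meant to make the computation concrete, and for real recordings an FFT-based convolution from a numerical library would be much faster. The function name is made up for this example.

```python
def convolve(signal, impulse_response):
    """Full linear convolution of a signal with an impulse response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            # Each input sample contributes a scaled, delayed copy of the response.
            out[i + j] += s * h
    return out


# A direct path plus one echo at half amplitude, two samples later.
reverberant = convolve([1.0, 2.0, 3.0], [1.0, 0.0, 0.5])
```

The output is longer than the input by the length of the impulse response minus one sample, which corresponds to the reverberation tail.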

Room acoustics simulator

The AudioGroup at the University of Patras have placed public domain room acoustics simulators online here.

Additive Noise Sources

The CSLU Robust Speech Processing Laboratory software repository page hosts a package named Additive Noise Sources which contains noise files for use in simulating additive noise.

NOISEX noises

This page at Rice has a set of downloadable noises. I think these are from the NOISEX-92 collection, but I don't know if this is the complete collection. I am not trying to give a comprehensive list of corpora on this page, but this page in the comp.speech FAQ has some links.

ShATR multiple simultaneous speaker corpus

Here. "ShATR is a corpus of overlapped speech collected by the University of Sheffield Speech and Hearing Research Group in collaboration with ATR in order to support research into computational auditory scene analysis. The task involved four participants working in pairs to solve two crosswords. A fifth participant acted as a hint-giver. Eight channels of audio data were recorded from the following sensors: one close microphone per speaker, one omnidirectional microphone, and the two channels of a binaurally-wired mannekin. Around 41% of the corpus contains overlapped speakers. In addition, a variety of other audio data was collected from each participant. The entire corpus, which has a duration of around 37 minutes, has been segmented and transcribed at 5 levels, from subtasks down to phones. In addition, all nonspeech sounds have been marked."

A brief list of resources that are not specific to noise and channel robustness

WaveSurfer speech visualization tool (view waveforms, spectrograms, formant tracks, pitch tracks) and other KTH-hosted software, HTK recognizer, SPHINX recognizer, ICSI speech software package (link above), ISIP recognizer and ISIP Foundation Classes for speech processing, CSLR SONIC recognizer, CMU-Cambridge Statistical Language Modeling toolkit, SRILM - The SRI Language Modeling Toolkit, some more links to tools here.

A list of phonetics tutorials and speech processing tutorials and software