Some resources for noise-robust and channel-robust speech processing
The original of this page can be accessed here. I have copied it to my site for easy reference.
This page collects links to software and data resources related to research on automatic speech recognition (ASR) that is robust to background noise and convolutional distortions such as reverberation. Some of the links pointed to by this page are also relevant to research on enhancing speech for human listening. If you would like to suggest more links for this page, you are invited to contact the page's maintainer, David Gelbart at ICSI.
If you use software or other resources pointed to by this page, please respect the license terms (and, when applicable, patent rights). If the use contributes to an academic publication, the maintainer suggests mentioning this in the publication and referencing the original source (not this page of pointers) by giving a publication or URL reference that will allow others to obtain the resource. This serves four purposes: (1) it alerts readers of the publication to the availability of the resource, (2) it provides a precise specification of that aspect of the work described in the publication, (3) it assigns credit where credit is due, and (4) it shows readers of the publication that sharing resources with others leads to public recognition, encouraging future sharing.
Successful approaches to robust ASR may combine more than one robustness technique. Because of the simple data flow of much signal processing code, different tools can often be used together simply by running them in sequence, using pipes or intermediate files. Two convenient choices for intermediate file formats are HTK feature files and waveforms. Many of the tools linked here operate on HTK feature files or can output them. The HTK format is a useful intermediate format for feature files because it is simple to read, write, and convert to other formats, and because HTK itself is popular. Also, some algorithms can be used with other tools, without any modification to those tools, by running them speech-enhancement-style: they output processed waveforms which the other tools treat as they would any other audio input file. Using processed waveforms as an intermediate format also allows listening, waveform plotting, and spectrogram plotting, which may lead to useful insights. (If using processed waveforms as an intermediate format, it may be worthwhile to store them in floating point, rather than the usual 16-bit integer format, to reduce roundoff error and eliminate the risk of overflow/saturation error. Since algorithms may change the scale of a waveform, there is a risk of overflow or underflow with a 16-bit integer format even if the original waveforms were well scaled for that format.)
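As a concrete illustration of why the HTK format is easy to use as an intermediate format: it is just a 12-byte big-endian header (nSamples and sampPeriod as 32-bit integers, sampSize and parmKind as 16-bit integers, with sampPeriod in units of 100 ns) followed by the feature vectors, normally as 32-bit floats. The sketch below reads and writes such files; the function names and the use of NumPy are my own choices, not part of any HTK tool.

```python
import struct
import numpy as np

# HTK's base parmKind code for MFCC features is 6 (qualifier bits omitted here).
MFCC = 6

def write_htk(path, feats, samp_period=100000, parm_kind=MFCC):
    """Write a (frames x dims) array as a big-endian HTK feature file.
    samp_period is in 100 ns units, so 100000 means a 10 ms frame shift."""
    feats = np.asarray(feats, dtype=">f4")  # HTK stores big-endian float32
    n_samples, n_dims = feats.shape
    samp_size = 4 * n_dims                  # bytes per feature vector
    with open(path, "wb") as f:
        f.write(struct.pack(">iihh", n_samples, samp_period, samp_size, parm_kind))
        f.write(feats.tobytes())

def read_htk(path):
    """Read an HTK feature file; returns (features, samp_period, parm_kind)."""
    with open(path, "rb") as f:
        n_samples, samp_period, samp_size, parm_kind = struct.unpack(">iihh", f.read(12))
        data = np.frombuffer(f.read(n_samples * samp_size), dtype=">f4")
    return data.reshape(n_samples, samp_size // 4), samp_period, parm_kind
```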
Enhancement/compensation software for ASR and human listening:
- Voicebox Matlab toolbox
- Karlsruhe Beamforming Toolkit
- Qualcomm-ICSI-OGI front end, speech detection, and noise reduction
- Patrick Wolfe's Matlab noise reduction code
- Trausti Kristjansson's page
- Marc Ferras' code for multi-microphone speech enhancement
Software for ASR:
- The RESPITE CASA Toolkit
- Seneff auditory model
- RASTA and MSG
- MVA (Mean-Variance-ARMA)
- Gabor filter analysis
Software for signal quality measurement:
- Objective Speech Quality Assessment
- NIST Speech Quality Assurance Package (SPQA)
Software and data for reproducing or simulating acoustic conditions:
- FaNT tool for adding noise or telephone characteristics to speech
- Acoustic impulse responses
- Univ. of Patras room acoustics simulators
- Additive Noise Sources
- NOISEX noises
- ShATR recordings
Other:
VOICEBOX
The VOICEBOX Matlab toolbox for audio processing includes a noise reduction routine (specsubm), routines to read and write audio files from Matlab, and many other things.
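For readers who want to see the basic idea behind routines like specsubm, here is a minimal magnitude spectral subtraction sketch (my own illustration, not a port of the VOICEBOX code, which uses a more refined noise estimate): a noise spectrum is estimated from leading frames assumed to be speech-free, subtracted from each frame's magnitude spectrum with a spectral floor, and the result is resynthesized by overlap-add using the noisy phase.

```python
import numpy as np

def spectral_subtract(noisy, frame_len=256, hop=128, noise_frames=6, floor=0.002):
    """Minimal magnitude spectral subtraction with overlap-add resynthesis.
    The noise spectrum is estimated from the first `noise_frames` frames,
    which are assumed to contain no speech."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(noisy) - frame_len) // hop
    frames = np.stack([noisy[i*hop:i*hop+frame_len] * window for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    noise_mag = mag[:noise_frames].mean(axis=0)        # average noise magnitude
    clean_mag = np.maximum(mag - noise_mag, floor * noise_mag)  # subtract, keep a floor
    enhanced = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    out = np.zeros(len(noisy))
    for i in range(n_frames):
        out[i*hop:i*hop+frame_len] += enhanced[i]      # overlap-add
    return out
```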
Beamforming Toolkit
The Karlsruhe beamforming toolkit: "btk is a toolkit that provides a basis for the implementation of powerful beamforming algorithms. btk uses Python as a scripting language for ease of control and modification. The capacity to efficiently perform advanced numerical computations is provided by Numeric Python (NumPy), the GNU Scientific Library (GSL), as well as a few extra algorithms we've implemented ourselves."
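The simplest beamformer of the kind such toolkits build on is delay-and-sum: delay each microphone channel so that the desired source arrives time-aligned on every channel, then average. The sketch below is illustrative only (not taken from btk) and uses integer-sample delays; real implementations handle fractional delays by interpolation.

```python
import numpy as np

def delay_and_sum(channels, delays_samples):
    """Steer a microphone array by delaying each channel an integer number of
    samples, then averaging. `channels` is (n_mics, n_samples); a positive
    delay shifts a channel later in time."""
    n_mics, n = channels.shape
    out = np.zeros(n)
    for ch, d in zip(channels, delays_samples):
        shifted = np.roll(ch, d)
        # zero out the samples that np.roll wrapped around the edges
        if d > 0:
            shifted[:d] = 0.0
        elif d < 0:
            shifted[d:] = 0.0
        out += shifted
    return out / n_mics
```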
Qualcomm-ICSI-OGI front end, speech detection, and noise reduction
This archive contains source code and documentation for the Qualcomm-ICSI-OGI noise-robust front end described in the ICSLP 2002 paper by Adami et al. The archive also contains tools for using the speech detection, Wiener filter noise reduction, or nonspeech frame dropping features of the front end independently of other features. The noise reduction can be used independently of other components to produce noise-reduced waveforms.
Matlab noise reduction tools by Patrick Wolfe
Matlab source code for various noise reduction algorithms is available here.
Trausti Kristjansson
Trausti Kristjansson created this page (while at the University of Toronto), which provides Matlab source code for (1) spectral subtraction noise removal, (2) the Algonquin variational inference algorithm for removing noise and channel effects, and (3) the Recognition Analyzer diagnostic tool, which displays features, HTK log likelihoods, and HTK state sequences and can create resynthesized audio from MFCC features.
Marc Ferras' code for multi-microphone speech enhancement
This page provides source code for several blind multi-microphone speech enhancement techniques. These were implemented by Marc Ferras while pursuing his master's thesis on multi-microphone signal processing for automatic speech recognition in meeting rooms.
The RESPITE CASA Toolkit
The RESPITE CASA Toolkit is a toolkit for Computational Auditory Scene Analysis (CASA). The page includes a tutorial on using the toolkit for missing data speech recognition.
Seneff auditory model
This page has source code for an implementation of Stephanie Seneff's auditory model front end for ASR.
RASTA and MSG
C/C++ implementations of the RASTA and MSG (modulation-filtered spectrogram) algorithms for robust feature extraction are available as part of this ICSI speech software package. There is also this older page for RASTA at ICSI. There is a Matlab implementation of RASTA at Dan Ellis' Matlab page.
MVA (Mean, Variance, ARMA)
This page provides source code for the MVA technique proposed by Chia-Ping Chen and Jeff Bilmes, which post-processes noisy cepstra by doing mean and variance normalization (M for mean, V for variance) and bandpass modulation filtering (A for ARMA).
Gabor filter analysis for speech recognition
This page provides articles, filter definitions, software tools, and discussion related to automatic speech recognition (ASR) with Gabor filters. A Matlab package for feature selection using the Feature Finding Neural Networks (FFNN) approach proposed by Tino Gramß (Gramss) is available as well. (This FFNN package was used to select Gabor filters for ASR.)
Objective Speech Quality Assessment
The CSLU Robust Speech Processing Laboratory software repository page hosts the Objective Speech Quality Assessment package (developed by Bryan Pellom, referred to in an ICSLP 98 paper by Hansen and Pellom) which calculates various metrics of speech quality based on comparing clean audio with noisy or noise-reduced audio.
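A representative metric of this kind, comparing degraded or enhanced audio against a clean reference, is segmental SNR: per-frame SNR averaged in dB. The sketch below shows the basic computation; the exact variant in Pellom's package (for example, clamping of each frame's SNR to a fixed range) may differ, so treat this as an illustration rather than a reimplementation.

```python
import numpy as np

def segmental_snr(clean, processed, frame_len=256, eps=1e-10):
    """Average per-frame SNR in dB between a clean reference signal and a
    degraded or enhanced version of it. Frame clamping, used by some
    implementations to limit outlier frames, is omitted for brevity."""
    n_frames = len(clean) // frame_len
    snrs = []
    for i in range(n_frames):
        c = clean[i*frame_len:(i+1)*frame_len]
        e = c - processed[i*frame_len:(i+1)*frame_len]   # frame error signal
        snrs.append(10 * np.log10((np.sum(c**2) + eps) / (np.sum(e**2) + eps)))
    return float(np.mean(snrs))
```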
NIST Speech Quality Assurance Package (SPQA)
The SPQA package includes SNR measurement tools which do not require a clean audio reference.
FaNT tool for adding noise or telephone characteristics to speech
The FaNT (Filtering and Noise-adding Tool) can be used to add noise to speech recordings at a desired SNR (signal-to-noise ratio). The SNR can be calculated using frequency weighting (G.712 or A-weighting), and the speech energy is calculated following ITU recommendation P.56. The tool can also be used to filter speech with certain frequency characteristics defined by the ITU for telephone equipment. This tool was used to create the noisy data for the popular AURORA 2 speech recognition corpus.
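The core operation FaNT performs, stripped of the P.56 speech-level measurement and frequency weighting, is scaling the noise so that the mixture hits a target SNR. A minimal unweighted version, using plain wideband energies (my own sketch, not FaNT's code):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise ratio of the mix is `snr_db` using
    plain wideband energies, then add it to the speech. FaNT itself measures
    the speech level per ITU-T P.56 and can apply frequency weighting."""
    if len(noise) < len(speech):
        # tile the noise if it is shorter than the speech
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    p_speech = np.mean(speech**2)
    p_noise = np.mean(noise**2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + gain * noise
```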
Acoustic impulse responses
This page, created by James Hopgood, is a collection of acoustic impulse responses for simulating convolutional distortion. The focus is on hands-free / far-field acoustic conditions.
Some past speech recognition work (by Shire, Kingsbury, Avendano, Palomaki, Morgan, Chen, Gelbart, possibly others) has been done using a set of impulse responses collected using the varechoic chamber at Bell Labs. It is planned to make these available on Hopgood's web page. Until they are available there, a download link has been placed here.
More acoustic impulse responses are available as part of the Sound Scene Database in Real Acoustical Environments from the Real World Computing Partnership, here. The site noisevault.com has acoustic impulse responses as well as links to software and documents regarding impulse response measurement and acoustic simulation; it seems aimed at audio engineers and audio engineering hobbyists.
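Given any of these impulse responses, simulating the corresponding acoustic condition is a single convolution of the dry speech with the response. A minimal sketch:

```python
import numpy as np

def apply_rir(speech, rir):
    """Simulate a room by convolving dry speech with a room impulse response.
    np.convolve is O(N*M); for long responses an FFT-based convolution
    (e.g. scipy.signal.fftconvolve) is much faster."""
    reverberant = np.convolve(speech, rir)
    # truncate the convolution tail so output length matches the input
    return reverberant[:len(speech)]
```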
Room acoustics simulator
The AudioGroup at the University of Patras has placed public domain room acoustics simulators online here.
Additive Noise Sources
The CSLU Robust Speech Processing Laboratory software repository page hosts a package named Additive Noise Sources which contains noise files for use in simulating additive noise.
NOISEX noises
This page at Rice has a set of downloadable noises. I think these are from the NOISEX-92 collection, but I don't know if this is the complete collection. I am not trying to give a comprehensive list of corpora on this page, but this page in the comp.speech FAQ has some links.
ShATR multiple simultaneous speaker corpus
Here. "ShATR is a corpus of overlapped speech collected by the University of Sheffield Speech and Hearing Research Group in collaboration with ATR in order to support research into computational auditory scene analysis. The task involved four participants working in pairs to solve two crosswords. A fifth participant acted as a hint-giver. Eight channels of audio data were recorded from the following sensors: one close microphone per speaker, one omnidirectional microphone, and the two channels of a binaurally-wired mannekin. Around 41% of the corpus contains overlapped speakers. In addition, a variety of other audio data was collected from each participant. The entire corpus, which has a duration of around 37 minutes, has been segmented and transcribed at 5 levels, from subtasks down to phones. In addition, all nonspeech sounds have been marked."
A brief list of resources that are not specific to noise and channel robustness
WaveSurfer speech visualization tool (view waveforms, spectrograms, formant tracks, pitch tracks) and other KTH-hosted software, HTK recognizer, SPHINX recognizer, ICSI speech software package (link above), ISIP recognizer and ISIP Foundation Classes for speech processing, CSLR SONIC recognizer, CMU-Cambridge Statistical Language Modeling toolkit, SRILM - The SRI Language Modeling Toolkit, some more links to tools here.
A list of phonetics tutorials and speech processing tutorials and software


1 Comment:
Hello, I'm glad you like the list. You'll probably also like the Resource listings at www.isca-students.org.