Voice Activity Detection - Implementations

Implementations

One early standard VAD is that developed by British Telecom for use in the Pan-European digital cellular mobile telephone service in 1991. It uses inverse filtering trained on non-speech segments to filter out background noise, so that it can then more reliably use a simple power-threshold to decide if a voice is present.
The G.729 standard calculates the following features for its VAD: line spectral frequencies, full-band energy, low-band energy (<1 kHz), and zero-crossing rate. It applies a simple classification using a fixed decision boundary in the space defined by these features, and then applies smoothing and adaptive correction to improve the estimate.
The GSM standard includes two VAD options developed by ETSI. Option 1 computes the SNR in nine bands and applies a threshold to these values. Option 2 calculates different parameters: channel power, voice metrics, and noise power. It then thresholds the voice metrics using a threshold that varies according to the estimated SNR.
The Speex audio compression library uses a procedure named Improved Minima Controlled Recursive Averaging, which uses a smoothed representation of spectral power and then looks at the minima of a smoothed periodogram. From version 1.2 it was replaced by a kludge in words of the author.

Read more about this topic: Voice Activity Detection