Speech/non-speech segmentation

The decoder can only handle audio that contains solely speech. A few silence models may be trained (such as silence or lip smack), but when the audio contains other sources, for example music, jingles or Formula One cars, it is a good idea to segment the audio before feeding it to the decoder. Shout provides this functionality with the shout_segment application, which can cluster audio into three categories: "speech", "silence" and "audible non-speech".

Refer to my thesis for detailed information on how the application works (see menu). In short, this is what happens:

  • Using a speech/silence AM, an initial segmentation is created.
  • Using the high-confidence fragments of the initial segmentation, three new GMMs are trained iteratively: "speech", "silence" and "audible non-speech".
  • Using BIC, the application checks whether the "speech" and "audible non-speech" models are actually different from each other (see the sketch after this list). If they turn out to be identical, all three models are discarded and a new set of GMMs, containing only "speech" and "silence", is trained iteratively.
  • A final Viterbi run determines the final alignment and clustering.
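
The BIC check in the third step can be illustrated with a small sketch. The code below is not part of Shout; it is a minimal Python illustration, assuming scikit-learn's GaussianMixture and hypothetical feature matrices speech_feats and nonspeech_feats, of how one can decide whether two clusters are better modelled by two separate GMMs or by a single merged one.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def bic_says_models_differ(speech_feats, nonspeech_feats, n_components=4):
        """Return True if the "speech" and "audible non-speech" data are
        modelled better by two separate GMMs than by one merged GMM."""
        merged = np.vstack([speech_feats, nonspeech_feats])

        # One GMM on the pooled data (hypothesis: the two clusters are identical).
        gmm_merged = GaussianMixture(n_components=n_components).fit(merged)

        # Separate GMMs per cluster (hypothesis: the two clusters differ).
        gmm_speech = GaussianMixture(n_components=n_components).fit(speech_feats)
        gmm_nonspeech = GaussianMixture(n_components=n_components).fit(nonspeech_feats)

        # GaussianMixture.bic() returns the Bayesian Information Criterion
        # (lower is better). Keeping separate models wins when their summed
        # BIC is lower than the BIC of the merged model.
        bic_merged = gmm_merged.bic(merged)
        bic_split = gmm_speech.bic(speech_feats) + gmm_nonspeech.bic(nonspeech_feats)
        return bic_split < bic_merged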

Training the GMMs for the initial segmentation

In a future implementation I am planning to replace the first step, where models are needed to create the initial segmentation, with a method that does not require models at all. For now, however, the speech and non-speech models need to be trained. Training these GMMs is done the same way as training phones, except that it is not necessary to perform multiple training iterations (especially when the initial alignment is the final alignment of the phone training step).
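
As a rough illustration of what such a GMM training step amounts to (this is not the actual Shout training code), the Python sketch below collects the feature vectors of each class from a given alignment and fits one GMM per class in a single EM run. The alignment format and the helper names are assumptions made for the example.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_sad_gmms(feats, alignment, n_components=8):
        """Fit one GMM per SAD class from an existing alignment.

        feats     -- (n_frames, n_dims) array of SAD feature vectors
        alignment -- list of (start_frame, end_frame, label) tuples with
                     label in {"SPEECH", "SIL"} (hypothetical format)
        """
        frames_per_class = {"SPEECH": [], "SIL": []}
        for start, end, label in alignment:
            frames_per_class[label].append(feats[start:end])

        models = {}
        for label, chunks in frames_per_class.items():
            # A single EM run suffices when the alignment is already the
            # final alignment of the phone training step.
            models[label] = GaussianMixture(n_components=n_components).fit(np.vstack(chunks))
        return models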

Training new speech/non-speech models is done in three steps:

  • Create a hypothesis file in the native Shout format. Alignments in other formats need to be re-formatted.
  • Make a training set using the alignment.
  • Run shout_train_am and shout_train_finish. The resulting AM can be used in the first step to create a better alignment.

Create an alignment of the training audio

For creating the alignment, see the section on training acoustic models.

shout_maketrainset

You can create a training set with the shout_maketrainset application. Be sure to choose SAD as the type of training set you want to create. Run shout_maketrainset with the -h option for usage information.

Note 1: The feature vectors used for creating these models are different from the phone feature vectors. That is why it is not possible to re-use the phone training directory.

Note 2: Not all phone occurrences are used for training; only the ones that are not directly next to a silence (non-speech) phone are used.
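
The selection rule of Note 2 can be sketched as follows. This is an illustration of the idea only, not the shout_maketrainset implementation, and the representation of the alignment as a flat list of phone labels (with "SIL" marking silence/non-speech) is an assumption.

    def usable_for_sad_training(phones, silence_label="SIL"):
        """Mark which phone occurrences may be used for SAD training:
        a phone is skipped when it is directly next to a silence
        (non-speech) phone."""
        usable = []
        for i, phone in enumerate(phones):
            if phone == silence_label:
                usable.append(False)
                continue
            prev_is_sil = i > 0 and phones[i - 1] == silence_label
            next_is_sil = i < len(phones) - 1 and phones[i + 1] == silence_label
            usable.append(not (prev_is_sil or next_is_sil))
        return usable

For example, in the label sequence SIL a b c SIL, only the occurrences of b would be selected.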

shout_train_am and shout_train_finish

Once you have a training set for your SAD AM, you can run shout_train_am to train your SIL and SPEECH models (run the application twice, once per model), and after that you can run shout_train_finish to generate a binary AM file from the two models.
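
Conceptually, this last step packs the two separately trained models into a single file. The sketch below is only a Python analogue of that idea; it does not reproduce the actual binary AM format of shout_train_finish, and it uses pickle purely for illustration.

    import pickle

    def write_sad_am(sil_model, speech_model, path="sad.am"):
        """Bundle the SIL and SPEECH models into one file, analogous to
        how shout_train_finish combines the two models into a binary AM
        (the real file format is not reproduced here)."""
        with open(path, "wb") as f:
            pickle.dump({"SIL": sil_model, "SPEECH": speech_model}, f)

    def read_sad_am(path="sad.am"):
        with open(path, "rb") as f:
            return pickle.load(f)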