Refer to my thesis for detailed information on how the application works (see menu). In short, this is what happens:
- Using a speech/silence acoustic model (AM), an initial segmentation is created.
- Using the high-confidence fragments of the initial segmentation, three new GMMs are trained iteratively: "speech", "silence" and "audible non-speech".
- Using BIC, the application checks whether the "speech" and "audible non-speech" models are actually different from each other (a sketch of such a BIC comparison follows this list). If they are identical, all three models are discarded and a new set of GMMs, containing only "speech" and "silence", is trained iteratively.
- A final Viterbi run determines the final alignment and clustering.
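My thesis contains the exact formulation of this BIC check. As a rough illustration only, the sketch below computes the classic delta-BIC between two sets of feature vectors, assuming a single full-covariance Gaussian per class; that assumption and the function name are mine for the example, the application itself compares GMMs.

```python
# Minimal delta-BIC sketch, assuming one full-covariance Gaussian per class.
# The real application compares GMMs, so treat this purely as an illustration.
import numpy as np

def delta_bic(x_speech: np.ndarray, x_sound: np.ndarray, lam: float = 1.0) -> float:
    """Delta-BIC between two feature sets of shape (frames, dimensions).

    Positive result: the two sets are better modelled separately
    ("speech" and "audible non-speech" are different).
    Result <= 0: one shared model suffices (the models are "identical").
    """
    n1, n2 = len(x_speech), len(x_sound)
    d = x_speech.shape[1]
    both = np.vstack([x_speech, x_sound])

    def logdet_cov(x):
        # Log-determinant of the sample covariance matrix.
        return np.linalg.slogdet(np.cov(x, rowvar=False))[1]

    # Penalty: extra free parameters when using two Gaussians instead of one.
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n1 + n2)

    return (0.5 * (n1 + n2) * logdet_cov(both)
            - 0.5 * n1 * logdet_cov(x_speech)
            - 0.5 * n2 * logdet_cov(x_sound)
            - lam * penalty)
```

With this sign convention, a threshold of zero directly implements the check described above: a positive value keeps the separate "speech" and "audible non-speech" models, a non-positive value triggers the fallback to speech/silence only.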
Training these speech/non-speech models is similar to training phones, except that it is not necessary to perform multiple training iterations (especially when the initial alignment is the final alignment of the phone training step).
Training new speech/non-speech models is done in three steps:
- Create a hypothesis file in native shout format. Alignments in other formats need to be re-formatted first.
- Make a training set using the alignment.
- Run shout_train_am and shout_train_finish. The resulting AM can be used in the first step to create a better alignment.
Running these two tools works the same way as when training acoustic models.
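Conceptually, this is a bootstrap loop: align with the current models, keep only the data you trust, and retrain. The sketch below mimics that loop with scikit-learn GMMs on synthetic features; the feature dimension, the confidence threshold and the frame-based (rather than segment-based) selection are all assumptions made for the example and are not how shout_train_am itself works.

```python
# Illustrative bootstrap loop: classify, keep high-confidence frames, retrain.
# All numbers and the use of scikit-learn are assumptions for this sketch only.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Fake 12-dimensional "silence" and "speech" features, for demonstration only.
features = np.vstack([rng.normal(0.0, 1.0, (500, 12)),
                      rng.normal(3.0, 1.0, (500, 12))])

# Rough initial labelling, standing in for the initial speech/silence alignment.
labels = (features.mean(axis=1) > 1.5).astype(int)    # 0 = silence, 1 = speech
confident = np.ones(len(features), dtype=bool)

for _ in range(3):                                     # a few retraining rounds
    # Train one GMM per class on the frames we currently trust.
    models = [GaussianMixture(n_components=4, covariance_type="diag", random_state=0)
              .fit(features[(labels == cls) & confident]) for cls in (0, 1)]

    # Re-align: log-likelihood of every frame under both models.
    ll = np.column_stack([m.score_samples(features) for m in models])
    labels = ll.argmax(axis=1)

    # Only clearly separated frames feed the next training round.
    confident = np.abs(ll[:, 1] - ll[:, 0]) > 2.0
```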
Note 1: The feature vectors used for creating these models are different from the phone feature vectors. That's why it is not possible to use the training directory for phones.
Note 2: Not all phone occurrences are used for training, only the ones that are not directly next to a silence (non-speech) phone.
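As an illustration of note 2, the sketch below filters a toy alignment so that only phones with non-silence neighbours remain; the (phone, start, end) tuples and the "SIL" label are assumptions for the example, not the native shout format.

```python
# Toy example of the selection rule from note 2.
# The tuple format and the "SIL" label are assumptions for this sketch.
from typing import List, Tuple

Alignment = List[Tuple[str, float, float]]   # (phone, start_time, end_time)

def usable_for_training(alignment: Alignment, silence: str = "SIL") -> Alignment:
    """Keep only phone occurrences that are not directly next to a silence phone."""
    keep = []
    for i, (phone, start, end) in enumerate(alignment):
        if phone == silence:
            continue
        prev_is_sil = i > 0 and alignment[i - 1][0] == silence
        next_is_sil = i + 1 < len(alignment) and alignment[i + 1][0] == silence
        if not prev_is_sil and not next_is_sil:
            keep.append((phone, start, end))
    return keep

example = [("SIL", 0.0, 0.5), ("k", 0.5, 0.6), ("a", 0.6, 0.8),
           ("t", 0.8, 0.9), ("SIL", 0.9, 1.4)]
print(usable_for_training(example))   # only ("a", 0.6, 0.8) survives
```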