- Create a hypothesis file in native Shout format. Alignments in other formats need to be converted first.
- Make a training set using the alignment file.
- Run shout_train_am for each phone and run shout_train_finish to combine all phones in a single file. The resulting AM can be used in the first step to create a better alignment.
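The per-phone loop in the steps above can be sketched as a small shell script. The exact argument lists of shout_train_am and shout_train_finish are assumptions here (phone name, training-set file, output model), so the sketch only assembles and prints the commands instead of running them:

```shell
# Sketch of the per-phone training loop; the shout_train_am and
# shout_train_finish arguments are illustrative assumptions, so the
# commands are collected into a string and printed, not executed.
PHONES="sil aa b d"                 # illustrative phone set
CMDS=""
for PHONE in $PHONES; do
  # one training run per phone (hypothetical argument order)
  CMDS="${CMDS}shout_train_am ${PHONE} trainset.dat
"
done
# combine all per-phone models into one binary AM file (hypothetical arguments)
CMDS="${CMDS}shout_train_finish phones/*.am acoustic_models.am"
printf '%s\n' "$CMDS"
```

The resulting acoustic model file can then be fed back into the alignment step to start another iteration.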
A Shout alignment can be created by calling shout with a meta-data file in the following format:
- SPEAKER [label] [VTLN factor] [begin time] [length] <NA> <NA> [SPK ID] <NA> <s> <s> [word-based transcription]

Note that this file format is basically the RTTM format from NIST.
The <s> symbol represents silence. The first two symbols in the transcription are not used to align the audio, but only to create a language model history. When two <s> symbols are added, the language model history is reset to 'start of sentence'.
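Putting the fields together, a single meta-data line might look like the following (all field values are illustrative, only the layout follows the format above):

```
SPEAKER rec01 1.00 0.00 5.30 <NA> <NA> spk01 <NA> <s> <s> this is a test
```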
Alternatively, shout_maketrainset can use a phone-set file and a file listing the audio, meta-data, and hypothesis files.
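Such a list file could look like the fragment below. The column layout (one audio/meta-data/hypothesis triple per line, with hypothetical paths) is an assumption for illustration, not taken from the Shout documentation:

```
/data/rec01.raw /data/rec01.rttm /data/rec01.hyp
/data/rec02.raw /data/rec02.rttm /data/rec02.hyp
```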
In order to create acoustic phone models, make sure to set the type to PHONE in shout_maketrainset.
Once all phones are trained, use shout_train_finish to combine the per-phone models into a single binary file.