The one and only reason all the other applications are developed is to make decoding possible! That's why the decoder, the heart of the toolkit, is simply called... Shout!
Run ./shout with the output meta-data file of shout_vtln (or of shout_cluster if no VTLN is needed). Below is a short description of the most important parameters; please run ./shout -h for more help.
The decoder needs a language model file (lm), an acoustic model file (amp) and a lexical tree file (dct). All three should be binary files created by the Shout toolkit.
The search space of the decoder is restricted using five parameters. If these parameters are not assigned a value, the default values (shown when shout is started with -cc) will be used.
The five search restriction parameters:
- BEAM (floating point number)
- STATE_BEAM (floating point number)
- END_STATE_BEAM (floating point number)
- HISTOGRAM_STATE_PRUNING (positive number)
- HISTOGRAM_PRUNING (positive number)
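As a sketch only, a configuration file that tightens the search space could set the five pruning parameters along these lines. The exact syntax and the real default values should be taken from the output of ./shout -cc; the values below are invented for illustration (smaller beams and histogram limits prune more aggressively, trading accuracy for speed):

```
BEAM                    = 300.0
STATE_BEAM              = 150.0
END_STATE_BEAM          = 150.0
HISTOGRAM_STATE_PRUNING = 100
HISTOGRAM_PRUNING       = 30000
```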
The most likely paths through the jungle of feature vectors are calculated using a language model and acoustic models. The scaling between these two types of models influences the outcome of the trip through this jungle. The scaling is set using three parameters in the formula:
Score(LM_SCALE,TRANS_PENALTY,SIL_PENALTY) = ln(AMSCORE) + LM_SCALE*ln(LMSCORE) + TRANS_PENALTY*NR_WORDS + SIL_PENALTY*NR_SIL
Shout implements an efficient method of incorporating the LM score into the search. This method, Language Model Look-Ahead, is switched on by default, but it can be toggled on or off in the configuration file.
- LM_SCALE (floating point number)
- TRANS_PENALTY (floating point number)
- SIL_PENALTY (floating point number)
- LMLA (1=on, 0=off)
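The effect of the three scaling parameters can be illustrated with a small Python sketch of the formula above. The function name and the parameter values are made up for illustration; they are not the toolkit's defaults:

```python
import math

def combined_score(am_score, lm_score, nr_words, nr_sil,
                   lm_scale=30.0, trans_penalty=-10.0, sil_penalty=0.0):
    """Illustration of the decoder's scoring formula:
    ln(AMSCORE) + LM_SCALE*ln(LMSCORE)
    + TRANS_PENALTY*NR_WORDS + SIL_PENALTY*NR_SIL.
    Parameter values here are invented examples."""
    return (math.log(am_score)
            + lm_scale * math.log(lm_score)
            + trans_penalty * nr_words
            + sil_penalty * nr_sil)

# Raising LM_SCALE weights the (negative) LM log-probability more
# heavily, so the same hypothesis scores lower overall:
weak_lm   = combined_score(0.5, 0.01, nr_words=5, nr_sil=1, lm_scale=10.0)
strong_lm = combined_score(0.5, 0.01, nr_words=5, nr_sil=1, lm_scale=40.0)
print(weak_lm, strong_lm)
```

Because both scores are log-probabilities, the total is a sum of (scaled) logs; TRANS_PENALTY and SIL_PENALTY then act as per-word and per-silence offsets on that sum.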
You can specify a special background dictionary if you want to perform alignment with OOV marking. To perform alignment instead of ASR, simply set the forced-alignment parameter (see ./shout -h). Make sure to add the utterance to align in the meta-data file, starting with <s> <s>. See the training use-case for more information.
- XML (output will be in XML format)
Shout can generate lattices and output them in PSFG format. If this is wanted (decoding will be a bit slower), use the lattice parameter with a path to the directory where the lattice files should be written. One lattice is created for each line in the meta-data file, and each lattice file is named after the label of its audio file (defined in the meta-data file).