Preparing your dictionary and language model for Shout

For decoding, three binary files are needed: a lexical tree file, a language model file and an acoustic model file. The acoustic models need to be trained by Shout itself. The lexical tree is created with the application shout_dct2lextree, and the language model with shout_lm2bin.

Preparing a dictionary

The application shout_dct2lextree needs two input files, a phone list and a pronunciation dictionary, and will output a binary 'lexical tree' file. The phone list file contains the entire list of phone and non-speech models. Its first two lines define the total number of phone models and non-speech models. The file uses the following syntax:
  • "Number of phones:" [number of phone models]
  • "Number of SIL's:" [number of non-speech models]
  • One non-speech model name per line (times the specified number of non-speech models)
  • One phone model name per line (times the specified number of phone models)
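As a minimal illustration of the syntax above, a phone list with one non-speech model and three phone models could look as follows (the phone names "a", "b" and "k" are hypothetical and only serve as an example):

```
Number of phones: 3
Number of SIL's: 1
SIL
a
b
k
```

Note that the non-speech model names (here only SIL) come first, followed by the phone model names, matching the counts declared on the first two lines.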

The pronunciation dictionary contains one word per line, followed by its pronunciation as a sequence of phone or non-speech model names separated by one or more spaces. Make sure that the first two lines of your dictionary (DCT) file are as follows:

  • <s> SIL
  • </s> SIL
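A minimal dictionary consistent with the rules above could look like this (the words "back" and "cab" and their pronunciations are illustrative, using the hypothetical phones from the phone list example):

```
<s> SIL
</s> SIL
back b a k
cab k a b
```

Every model name used in a pronunciation must of course appear in the phone list file given to shout_dct2lextree.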

Preparing a language model

The application 'shout_lm2bin' needs a lexical tree file (the output of shout_dct2lextree) and a language model in ARPA format. It will create a binary language model file suitable for Shout. Currently, unigram, bigram, trigram and four-gram language models are supported. At compile time the system can be optimized by setting the maximum n-gram depth (from bigrams up to four-grams) with the parameter LM_NGRAM_DEPTH in standard.h. The default setting of LM_NGRAM_DEPTH is trigrams.
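For reference, the standard ARPA backoff format expected as input looks like the following sketch of a small bigram model (the words match the dictionary example above; the log probabilities and back-off weights are purely illustrative):

```
\data\
ngram 1=4
ngram 2=2

\1-grams:
-0.60206 <s>    -0.30103
-0.60206 </s>
-0.60206 back   -0.30103
-0.60206 cab

\2-grams:
-0.30103 <s> back
-0.30103 back </s>

\end\
```

Such a file is typically produced by a language-model toolkit; every word in it should also be present in the dictionary used to build the lexical tree, including the sentence markers <s> and </s>.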