Speech Recognition

Automatic Speech Recognition (ASR) systems offer a natural user interface to digital content. An ASR engine consists of several key components that together convert speech into text. The acoustic model serves as a reference model for the target language. It is trained on speech data from a large number of speakers, covering variation in age, gender and accent so that the engine can accommodate a wide range of potential end-users. In the recognition phase, the incoming speech is continuously compared with the reference model and the supported vocabulary, and the most likely hypothesis is returned as the recognition result.
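The final step of the recognition phase, returning the most likely hypothesis from the supported vocabulary, can be illustrated with a minimal sketch. It assumes the acoustic model has already produced a log-likelihood score per candidate word; the function name `pick_hypothesis`, the scores and the word list are all hypothetical, and a real engine searches over phone sequences rather than whole-word scores.

```python
import math


def pick_hypothesis(log_scores, vocabulary):
    """Return (word, confidence) for the best-scoring in-vocabulary word.

    log_scores: hypothetical mapping of word -> acoustic log-likelihood.
    vocabulary: set of words the application currently supports.
    """
    # Only words in the supported vocabulary can be returned as a result.
    candidates = {w: s for w, s in log_scores.items() if w in vocabulary}
    if not candidates:
        return None, 0.0

    best = max(candidates, key=candidates.get)

    # Softmax over the candidates, shifted by the top score for numerical
    # stability; the best word's share of the total is its confidence.
    top = candidates[best]
    total = sum(math.exp(s - top) for s in candidates.values())
    return best, 1.0 / total


# Made-up scores for illustration only.
scores = {"play": -12.0, "pause": -15.5, "stop": -20.0, "plate": -11.0}
word, confidence = pick_hypothesis(scores, {"play", "pause", "stop"})
print(word, round(confidence, 3))
```

In this example "plate" has the highest acoustic score but is not part of the supported vocabulary, so "play" is returned instead.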

The pronunciation dictionary links the supported vocabulary to the acoustic model. When a speech application supports a dynamically changing vocabulary, which is often not covered by the engine's pronunciation dictionary, the success of the speech recognition process depends on the quality of the pronunciations generated automatically by the Grapheme-to-Phoneme (G2P) tool. The G2P tool often makes mistakes in the pronunciations it generates, and this has a direct impact on the user experience. At best, erroneous G2P output makes it more difficult for users to get speech commands recognized, but it often means that recognition of specific words becomes impossible.
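The lookup order described above, dictionary first and G2P only for words the dictionary does not cover, might look roughly like the following sketch. The names `naive_g2p` and `pronunciation_for`, the tiny lexicon and the phone symbols are assumptions made for illustration; the letter-by-letter fallback is a deliberately crude stand-in for a real G2P tool, chosen to show how a generated pronunciation can miss the way a word is actually spoken.

```python
def naive_g2p(word):
    """Hypothetical fallback: treat each letter as its own pseudo-phone.

    Real G2P tools are rule-based or statistical; this stand-in only
    shows where automatically generated pronunciations enter the process.
    """
    return [ch.upper() for ch in word.lower() if ch.isalpha()]


def pronunciation_for(word, lexicon, g2p=naive_g2p):
    """Return (phones, source): the dictionary entry if present, else G2P output."""
    key = word.lower()
    if key in lexicon:
        return lexicon[key], "lexicon"
    return g2p(key), "g2p"


# Tiny illustrative lexicon with ARPAbet-style phone symbols.
lexicon = {
    "play": ["P", "L", "EY"],
    "colonel": ["K", "ER", "N", "AH", "L"],
}

for entry in ["Play", "colonel", "Sade"]:
    phones, source = pronunciation_for(entry, lexicon)
    print(f"{entry}: {' '.join(phones)} ({source})")
```

Removing "colonel" from the lexicon makes the fallback produce C O L O N E L, which is far from the dictionary pronunciation. The recognizer would then be matching incoming speech against the wrong phone sequence, which is the failure mode described above.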

Phonetic Labs provides the solution.

Phonetic Data in the cloud - always available, always up-to-date