Aeiouadô is a machine-readable pronunciation dictionary for Brazilian Portuguese, based on the dialect of the city of São Paulo. It was designed primarily for speech technologies, such as automatic speech recognition systems and speech synthesizers. However, it may also be used by linguists, speech therapists, lexicographers, students of Brazilian Portuguese as a second language, and anyone interested in the sound structure of Brazilian Portuguese.
The dictionary uses a hybrid approach to convert graphemes into phonemes, combining manual transcription rules with machine learning algorithms. Its word list was compiled from the Portuguese Wikipedia dump: Wikipedia articles were converted to plain text and tokenized, and word types were extracted. A language identification tool was developed to detect loanwords in the data. Syllable boundaries and stress were then identified for each word. Transcription was carried out in two steps: i) words are submitted to a set of transcription rules, which transcribe the predictable graphemes (mostly consonants); ii) a machine learning classifier predicts the transcription of the remaining graphemes (mostly vowels). The method was evaluated through 5-fold cross-validation; results show an F1-score of 0.98.
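The two-step transcription described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the dictionary itself: the rule table `RULES`, the context features, the `MajorityClassifier` stand-in (a most-frequent-phone lookup replacing whatever classifier the authors actually trained), and the toy aligned training words. It also assumes one phone per grapheme, which real Portuguese orthography does not guarantee.

```python
from collections import Counter, defaultdict

# Step i: toy rules for graphemes treated as predictable.
# Illustrative only; this is NOT the dictionary's actual rule set.
RULES = {"p": "p", "b": "b", "t": "t", "d": "d", "m": "m", "l": "l"}


def apply_rules(word):
    """Return one slot per grapheme: a phone if a rule fires, else None."""
    return [RULES.get(g) for g in word]


def features(word, i):
    """Classifier context: previous, current and next grapheme ('#' = word edge)."""
    prev_g = word[i - 1] if i > 0 else "#"
    next_g = word[i + 1] if i < len(word) - 1 else "#"
    return (prev_g, word[i], next_g)


class MajorityClassifier:
    """Hypothetical stand-in for step ii: most frequent phone per context."""

    def __init__(self):
        self.by_context = defaultdict(Counter)
        self.by_grapheme = defaultdict(Counter)

    def fit(self, examples):
        # examples: (word, phones) pairs, phones aligned one-to-one with graphemes.
        for word, phones in examples:
            for i, phone in enumerate(phones):
                if RULES.get(word[i]) is None:  # learn only what the rules skip
                    self.by_context[features(word, i)][phone] += 1
                    self.by_grapheme[word[i]][phone] += 1
        return self

    def predict(self, word, i):
        counts = self.by_context.get(features(word, i)) or self.by_grapheme.get(word[i])
        # Unseen grapheme: fall back to the grapheme itself.
        return counts.most_common(1)[0][0] if counts else word[i]


def transcribe(word, clf):
    """Apply the rules first, then let the classifier fill the remaining slots."""
    slots = apply_rules(word)
    return [p if p is not None else clf.predict(word, i) for i, p in enumerate(slots)]


# Toy aligned training data (made up for this sketch).
TRAIN = [
    ("pato", ["p", "a", "t", "ʊ"]),
    ("bota", ["b", "ɔ", "t", "ɐ"]),
    ("dedo", ["d", "e", "d", "ʊ"]),
    ("mala", ["m", "a", "l", "ɐ"]),
]

clf = MajorityClassifier().fit(TRAIN)
print(transcribe("pata", clf))  # ['p', 'a', 't', 'ɐ']
```

The split mirrors the design choice in the text: consonants handled by the rule table never reach the classifier, so the learned model only has to resolve the graphemes whose pronunciation depends on context.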
System Architecture for Building the Dictionary
More information can be found in the paper to appear in the Proceedings of INTERSPEECH 2014:
Gustavo Mendonca, Sandra Aluisio (2014). Using a hybrid approach to build a pronunciation dictionary for Brazilian Portuguese. To appear in: Proceedings of INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association. ISCA. Singapore, August 25-29, 2014.