Development, training and evaluation of part-of-speech taggers for Portuguese

An Interinstitutional Center for Research and Development in Computational Linguistics

My MSc project

Starting Time: 1998

Status: concluded in 2000

Goals
The main purpose of this project was to study different methods for tagging Brazilian Portuguese texts with the NILC tagset, with the aim of choosing the best method for different kinds of text written in Brazilian Portuguese.

Results

1) A 100,000-word corpus of Brazilian Portuguese, composed of texts from the journalistic, literary and didactic genres, has been tagged and manually corrected for use as a training-test corpus.

download the full corpus

download the journalistic texts

download the literary texts

download the didactic texts

download the tagset

2) Four taggers available on the Web have been trained on this 100,000-word corpus, namely Unigram (TreeTagger), N-gram (TreeTagger), transformation-based (TBL) and Maximum-Entropy tagging (MXPOST). MXPOST displayed the best accuracy (88.73%), which is still much lower than the state-of-the-art accuracy for English. The low accuracy is attributed to the small size of the training corpus.

3) A symbolic tagger, PoSiTagger (Aires, 2000), has been developed. download rules files

4) Twelve methods of combination were used, four of which led to an improvement over the MXPOST accuracy. The best result (89.42%) was obtained with a majority-wins voting strategy.
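The majority-wins voting idea mentioned in item 4 can be illustrated with a minimal sketch. It is not the project's own combination code: the function names, the tie-breaking rule and the toy tags below are assumptions made for illustration, and the tags are not taken from the NILC tagset.

```python
from collections import Counter


def majority_vote(tag_sequences, tie_breaker=0):
    """Combine per-token tags from several taggers by majority vote.

    tag_sequences: list of equal-length tag lists, one list per tagger.
    tie_breaker: index of the tagger preferred on a tie (an assumed rule;
    the source does not say how ties were resolved).
    """
    if len({len(seq) for seq in tag_sequences}) != 1:
        raise ValueError("all taggers must tag the same tokens")
    combined = []
    for votes in zip(*tag_sequences):
        counts = Counter(votes)
        best = max(counts.values())
        winners = [tag for tag, n in counts.items() if n == best]
        if len(winners) == 1:
            combined.append(winners[0])
        elif votes[tie_breaker] in winners:
            # Tie: prefer the designated tagger's tag when it is tied for first
            combined.append(votes[tie_breaker])
        else:
            # Tie that excludes the designated tagger: take the first tied tag
            combined.append(winners[0])
    return combined


def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Toy example: three hypothetical taggers, each wrong on a different token
gold = ["ART", "N", "V", "ADJ"]
tagger_a = ["ART", "N", "V", "N"]
tagger_b = ["ART", "V", "V", "ADJ"]
tagger_c = ["PREP", "N", "N", "ADJ"]
combined = majority_vote([tagger_a, tagger_b, tagger_c])
```

Voting helps only when the taggers make different errors, as in the toy data here; when they agree on a wrong tag, the combination repeats the mistake.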

Team
Rachel Virgínia Xavier Aires - MSc Student

Sandra Maria Aluísio - Supervisor

Marcio Luis Barse Andreeta - a student who worked on the tagging of the training-test corpus and on the coding of several tools for combining the taggers and evaluating them, [1998-2000]

Ronaldo Teixeira Martins - The linguist who wrote the version of the NILC tagset used in this project

Denise Kuhn - The linguist who wrote most of the PoSiTagger rules, [June 2000]

Ana Raquel Marchi - The linguist who worked on the correction of the training-test corpus, [June 2000]

Financial Support
CNPQ

Intelligenesis/Webmind: 1999-2000

Finep (PADCT-CE, Proc. 88-98-059100-02-01): 1999-2000

After my MSc Project

Since November 2000, I have made available a Portuguese model for MXPOST that was trained with a much simpler tagset than the one used in my MSc project. We considered only 27 tags plus punctuation tags and achieved 97% accuracy. Even though we used a 10-fold cross-validation test strategy, this accuracy should not be generalized to texts in general. The training corpus is small (about 100,000 words), so the model is not representative of the Portuguese language as a whole. The MSc project showed that precision differs for each of the three genres studied, and that the journalistic genre, the one used in most tests in the literature, has the least ambiguity and is the easiest to tag. You can download the model, the tagset and the evaluation results per tag.
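The cross-validation test strategy mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the source does not say how the folds were drawn, so the interleaved split is one possible choice, and the most-frequent-tag baseline (`train_unigram`, `tag_unigram`) is a hypothetical stand-in for a real tagger such as MXPOST. The sketch assumes every test fold contains at least one token.

```python
from collections import Counter, defaultdict


def k_fold_splits(sentences, k=10):
    """Yield (train, test) pairs; each sentence is in exactly one test fold."""
    if not 2 <= k <= len(sentences):
        raise ValueError("k must be between 2 and the number of sentences")
    for i in range(k):
        test = sentences[i::k]  # every k-th sentence, starting at offset i
        train = [s for j, s in enumerate(sentences) if j % k != i]
        yield train, test


def train_unigram(train):
    """Hypothetical baseline: remember each word's most frequent tag."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sentence in train:
        for word, tag in sentence:
            counts[word][tag] += 1
            all_tags[tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    default = all_tags.most_common(1)[0][0]  # tag used for unknown words
    return lexicon, default


def tag_unigram(model, words):
    lexicon, default = model
    return [lexicon.get(w, default) for w in words]


def cross_validate(sentences, train_fn, tag_fn, k=10):
    """Return the mean token accuracy over k train/test folds."""
    scores = []
    for train, test in k_fold_splits(sentences, k):
        model = train_fn(train)
        correct = total = 0
        for sentence in test:
            words = [w for w, _ in sentence]
            gold = [t for _, t in sentence]
            predicted = tag_fn(model, words)
            correct += sum(p == g for p, g in zip(predicted, gold))
            total += len(gold)
        scores.append(correct / total)
    return sum(scores) / len(scores)


# Toy corpus: 20 copies of one tagged sentence (tags are illustrative only)
toy_corpus = [[("o", "ART"), ("gato", "N")] for _ in range(20)]
mean_accuracy = cross_validate(toy_corpus, train_unigram, tag_unigram, k=10)
```

Because every fold is drawn from the same corpus, a score obtained this way measures performance on similar text only, which is why the figure above should not be read as an estimate for other genres.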

In another project (LacioWeb), MXPOST, TreeTagger and TBL have been trained on the MAC-MORPHO corpus, which has 1,221,468 words. The best precision achieved for MXPOST was 96.98%, with 22 tags plus contraction and punctuation tags.

Contact
Rachel Aires: raires@icmc.sc.usp.br

Related Publications

Aires, R. V. X.; Aluísio, S. M. (2001). Implementação, Adaptação, Combinação e Avaliação de Etiquetadores para o Português do Brasil. In VI Workshop de Teses e Dissertações defendidas do ICMC/USP. 2001.

Aires, R. V. X.; Aluísio, S. M.; Kuhn, D. C. S.; Andreeta, M. L. B.; Oliveira Jr., O. N. (2000). Combining Multiple Classifiers to Improve Part of Speech Tagging: A Case Study for Brazilian Portuguese. (SBIA'2000) Atibaia, SP, November, 20-22. download ps file

Aires, R. V. X. (2000). Implementação, Adaptação, Combinação e Avaliação de Etiquetadores para o Português do Brasil. MSc Thesis. October, 2000. download ps file

Aires, R. V. X.; Aluísio, S. M. (2000). Implementação, Adaptação e Avaliação de Etiquetadores para o Português do Brasil. In V Workshop de Teses e Dissertações em Andamento do ICMC/USP. São Carlos, 2000. p. 109-110.

Aluísio, S. M.; Aires, R. V. (2000). Etiquetação de um Corpus e Construção de um Etiquetador de Português. Relatórios Técnicos do ICMC-USP, 107 (NILC-TR-00-2). March, 2000, 18 p.

Aires, R. V. X.; Aluísio, S. M. (1999). Um Etiquetador para o Português do Brasil. In IV Workshop de Teses e Dissertações em Andamento do ICMC/USP. São Carlos, 1999. p. 57-58.