Núcleo Interinstitucional de Lingüística Computacional

An Interinstitutional Center for Research and Development in Computational Linguistics

My MSc project

Development, training and evaluation of part-of-speech taggers for Portuguese 

 

Starting Time: 1998

Status: concluded in 2000

Goals
The main purpose of this project was to study different methods for tagging Brazilian Portuguese texts with the NILC tagset, with the aim of choosing the best method for each kind of text.


Results

1) A 100,000-word corpus of Brazilian Portuguese, composed of journalistic, literary and didactic texts, was tagged and manually corrected for use as a training and test corpus.

download the full corpus

download the journalistic texts     download the literary texts     download the didactic texts

download the tagset

2) Four taggers available on the Web were trained on this 100,000-word corpus: Unigram (TreeTagger), N-gram (TreeTagger), transformation-based (TBL) and maximum-entropy (MXPOST). The last of these achieved the best accuracy (88.73%), which is still much lower than the state-of-the-art accuracy for English. The low accuracy is attributed to the small size of the training corpus.

3) A symbolic tagger, PoSiTagger, has been developed (Aires, 2000). download rules files

4) Twelve methods for combining the taggers were tried, four of which improved on the MXPOST accuracy. The best result (89.42%) was obtained with a majority-wins voting strategy.
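
As an illustration of the majority-wins idea, here is a minimal sketch in Python. It is not the code used in the project: the function names, the toy tags and the tie-breaking rule (on a tie, prefer the tag proposed by the tagger listed first) are all assumptions made for the example, and the tags are not taken from the NILC tagset.

```python
from collections import Counter


def majority_vote(tag_sequences):
    """Combine the outputs of several taggers token by token.

    tag_sequences: one list of tags per tagger, all for the same tokens.
    Taggers are assumed to be listed in priority order; on a tie, the tag
    proposed by the earliest tagger among the tied tags wins (assumption).
    """
    if len({len(seq) for seq in tag_sequences}) != 1:
        raise ValueError("all taggers must tag the same number of tokens")
    combined = []
    for votes in zip(*tag_sequences):
        counts = Counter(votes)
        best = max(counts.values())
        # First vote (in tagger priority order) that reaches the top count.
        combined.append(next(tag for tag in votes if counts[tag] == best))
    return combined


def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Toy data: hypothetical tags for a four-token sentence.
gold = ["ART", "N", "V", "ADJ"]
tagger_a = ["ART", "N", "V", "N"]
tagger_b = ["ART", "V", "V", "ADJ"]
tagger_c = ["PREP", "N", "N", "ADJ"]

combined = majority_vote([tagger_a, tagger_b, tagger_c])
print(combined, accuracy(combined, gold))
```

In this toy case each tagger makes at least one error, but the errors fall on different tokens, so the vote recovers the gold tag everywhere. Real taggers often make the same mistakes on the same tokens, which would limit how much voting can gain.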

Team
Rachel Virgínia Xavier Aires - MSc Student

Sandra Maria Aluísio - Supervisor

Marcio Luis Barse Andreeta - a student who worked on tagging the training-test corpus and on implementing several tools for combining and evaluating the taggers, [1998-2000]

Ronaldo Teixeira Martins - The linguist who wrote the version of the NILC tagset used in this project

Denise Kuhn - The linguist who wrote most of the PoSiTagger rules, [June 2000]

Ana Raquel Marchi - The linguist who worked on the correction of the training-test corpus, [June 2000]

Financial Support
CNPq

Intelligenesis/Webmind: 1999-2000

Finep (PADCT-CE, Proc. 88-98-059100-02-01): 1999-2000

 

After my MSc Project


Contact
Rachel Aires: raires@icmc.sc.usp.br


Related Publications

Aires, R. V. X.; Aluísio, S. M. (2001). Implementação, Adaptação, Combinação e Avaliação de Etiquetadores para o Português do Brasil. In VI Workshop de Teses e Dissertações defendidas do ICMC/USP.

Aires, R. V. X.; Aluísio, S. M.; Kuhn, D. C. S.; Andreeta, M. L. B.; Oliveira Jr., O. N. (2000). Combining Multiple Classifiers to Improve Part of Speech Tagging: A Case Study for Brazilian Portuguese. In SBIA'2000. Atibaia, SP, November 20-22. download ps file

Aires, R. V. X. (2000). Implementação, Adaptação, Combinação e Avaliação de Etiquetadores para o Português do Brasil. MSc Thesis. October 2000. download ps file

Aires, R. V. X.; Aluísio, S. M. (2000). Implementação, Adaptação e Avaliação de Etiquetadores para o Português do Brasil. In V Workshop de Teses e Dissertações em Andamento do ICMC/USP. São Carlos, 2000. p. 109-110.

Aluísio, S. M.; Aires, R. V. (2000). Etiquetação de um Corpus e Construção de um Etiquetador de Português. Relatórios Técnicos do ICMC-USP, 107 (NILC-TR-00-2). March 2000, 18 p.

Aires, R. V. X.; Aluísio, S. M. (1999). Um Etiquetador para o Português do Brasil. In IV Workshop de Teses e Dissertações em Andamento do ICMC/USP. São Carlos, 1999. p. 57-58.