My MSc project
Development, training and evaluation of part-of-speech taggers for Portuguese
Starting Time: 1998
Status: concluded in 2000
Goals
The main purpose of this project was to study different
methods for tagging Brazilian Portuguese texts with the NILC tagset, with the aim of
choosing the best method for different kinds of text written in Brazilian
Portuguese.
Results
1) A 100,000-word corpus of Brazilian Portuguese, composed of texts from the journalistic, literary and didactic genres, has been tagged and manually corrected for use as a training-test corpus.
download the journalistic texts download the literary texts download the didactic texts
2) Four taggers available on the Web have been trained on this 100,000-word corpus, namely Unigram (TreeTagger), N-gram (TreeTagger), transformation-based (TBL) and Maximum-Entropy tagging (MXPOST). The last of these displayed the best accuracy (88.73%), which is still much lower than the state-of-the-art accuracy for English. The low accuracy is attributed to the small size of the training corpus.
3) A symbolic tagger, PoSiTagger, has been developed (Aires, 2000).
download rules files
4) Twelve methods of combination were used, four of which led to an improvement over the MXPOST accuracy. The best result (89.42%) was obtained with a majority-wins voting strategy.
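The majority-wins voting strategy mentioned in result 4 can be illustrated with a minimal sketch. It is not the project's actual code: the tag names, the toy sentences and the tie-breaking rule (fall back to one designated tagger, for example the most accurate one) are assumptions made for the example, since this page does not describe how ties were resolved.

```python
from collections import Counter


def majority_vote(tags, fallback_index=0):
    """Pick the tag proposed by the most taggers for a single token.

    Assumption: on a tie, prefer the tag of the tagger at fallback_index
    if it is among the tied tags, otherwise the first tied tag.
    """
    counts = Counter(tags)
    best_count = max(counts.values())
    winners = [tag for tag, count in counts.items() if count == best_count]
    if len(winners) == 1:
        return winners[0]
    fallback = tags[fallback_index]
    return fallback if fallback in winners else winners[0]


def combine(tagger_outputs, fallback_index=0):
    """Combine several tag sequences (one per tagger, equal length) token by token."""
    return [
        majority_vote(list(token_tags), fallback_index)
        for token_tags in zip(*tagger_outputs)
    ]


def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag matches the manually corrected tag."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical toy data: the tags are illustrative, not the NILC tagset.
gold = ["ART", "N", "V", "ART", "N"]
tagger_a = ["ART", "N", "V", "ART", "V"]
tagger_b = ["ART", "V", "V", "ART", "N"]
tagger_c = ["ART", "N", "N", "ART", "N"]

combined = combine([tagger_a, tagger_b, tagger_c])
print("combined tags:", combined)
print("combined accuracy:", accuracy(combined, gold))
```

In this toy case each tagger makes one error on a different token, so the vote corrects all three. On real data the gain is smaller, because taggers trained on the same corpus tend to make some of the same errors.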
Team
Rachel Virgínia Xavier Aires - MSc Student
Sandra Maria Aluísio - Supervisor
Marcio Luis Barse Andreeta - a student who worked on the tagging of the training-test corpus and on the implementation of several tools for combining the taggers and evaluating them, [1998-2000]
Ronaldo Teixeira Martins - The linguist who wrote the version of the NILC tagset used in this project
Denise Kuhn - The linguist who wrote most of the PoSiTagger rules, [June 2000]
Ana Raquel Marchi - The linguist who worked on the correction of the training-test corpus, [June 2000]
Financial Support
CNPQ
Intelligenesis/Webmind: 1999-2000
Finep (PADCT-CE, Proc. 88-98-059100-02-01): 1999-2000
After my MSc Project
Contact
Rachel Aires: raires@icmc.sc.usp.br
Related Publications
Aires, R. V. X.; Aluísio, S. M.; Kuhn, D. C. S.; Andreeta, M. L. B.; Oliveira Jr., O. N. (2000). Combining Multiple Classifiers to Improve Part of Speech Tagging: A Case Study for Brazilian Portuguese. (SBIA'2000) Atibaia, SP, November, 20-22. download ps file
Aires, R. V. X. (2000). Implementação, Adaptação, Combinação e Avaliação de Etiquetadores para o Português do Brasil. MSc Thesis. October, 2000. download ps file
Aires, R. V. X.; Aluísio, S. M. (2000). Implementação, Adaptação e Avaliação de Etiquetadores para o Português do Brasil. In V Workshop de Teses e Dissertações em Andamento do ICMC/USP. São Carlos, 2000. p.109-110.
Aluísio, S. M.; Aires, R. V. (2000). Etiquetação de um Corpus e Construção de um Etiquetador de Português. Relatórios Técnicos do ICMC-USP, 107 (NILC-TR-00-2). March, 2000, 18p.
Aires, R. V. X.; Aluísio, S. M. (1999). Um Etiquetador para o Português do Brasil. In IV Workshop de Teses e Dissertações em Andamento do ICMC/USP. São Carlos, 1999. p.57-58.