NILC's Taggers



Related Project: Lácio-Web

Starting Time: 1998 (as Rachel Aires’ MSc project)


The main purpose of this project is to study and evaluate several state-of-the-art taggers available on the WWW, with the aim of choosing the best one (or the best combination of them) for tagging corpora of Brazilian texts with the NILC tagset.

A further goal is to implement Part of Speech (POS) taggers for Brazilian Portuguese that use empirical and symbolic methods and whose performance is comparable to the state of the art in this area.

Current Status

A 104,966-word corpus has been tagged by the current versions of the taggers and manually corrected, in order to incrementally produce larger training corpora and improved taggers.

download the full corpus

download the journalistic texts     download the literary texts     download the didactic texts

download the tagset


- Three different POS taggers available on the WWW (TreeTagger (Schmid, 1995), MXPOST (Ratnaparkhi, 1996), and TBL Tagger (Brill, 1995)) have been trained with a 104,966-word corpus of Brazilian Portuguese texts, and a symbolic rule-based tagger (PoSiTagger), derived from ReGra's lexicon and disambiguation rules, has been developed at NILC.

- The NILC corpus has been tagged with both the full and the partial NILC tagset.

- Since November 2000, Rachel Aires has made available a Portuguese model for MXPOST that was trained with a much simpler tagset than the one used in her MSc project. Only 27 tags plus the punctuation tags were considered, and the model achieved 97% accuracy. Although a 10-fold cross-validation test strategy was used, this accuracy should not be generalized to texts in general: the training corpus is small (about 100,000 words) and is therefore not a representative model of the Portuguese language. The MSc project showed that precision differs for each of the three genres studied, and that the journalistic genre has the least ambiguity and is the easiest one to tag. You can download the trained tagger, the tagset and the evaluation results per tag.

You can also download three trained taggers resulting from the Lácio-Web project:
    Trained MACMORPHO files for MXPOST
    Trained MACMORPHO files for TreeTagger
    Trained MACMORPHO files for Brill Tagger (TBL)
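The 10-fold cross-validation and the per-tag evaluation mentioned above can be illustrated with a minimal Python sketch. It is not the evaluation code used in the project: the most-frequent-tag baseline below is only a stand-in for a real tagger such as MXPOST, and the function names (k_folds, train_baseline, cross_validate) and the toy sentences are hypothetical. The corpus is assumed to be a list of sentences, each a list of (word, tag) pairs, with at least as many sentences as folds.

```python
from collections import Counter, defaultdict


def k_folds(sentences, k=10):
    """Yield k (train, test) splits; fold i tests on every k-th sentence starting at i."""
    for i in range(k):
        test = sentences[i::k]
        train = [s for j, s in enumerate(sentences) if j % k != i]
        yield train, test


def train_baseline(train):
    """Most-frequent-tag baseline (a stand-in, NOT one of the project's taggers)."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sent in train:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
            all_tags[tag] += 1
    # Unknown words receive the most frequent tag of the training data.
    default = all_tags.most_common(1)[0][0]
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lambda word: lexicon.get(word.lower(), default)


def cross_validate(sentences, k=10):
    """Return (overall accuracy, accuracy per gold tag), pooled over all k folds."""
    correct, total = Counter(), Counter()
    for train, test in k_folds(sentences, k):
        tagger = train_baseline(train)
        for sent in test:
            for word, gold in sent:
                total[gold] += 1
                if tagger(word) == gold:
                    correct[gold] += 1
    overall = sum(correct.values()) / sum(total.values())
    per_tag = {tag: correct[tag] / total[tag] for tag in total}
    return overall, per_tag


if __name__ == "__main__":
    # Toy data with made-up tags, only to show the call shape.
    toy = [[("o", "ART"), ("gato", "N"), ("dorme", "V")]] * 20
    overall, per_tag = cross_validate(toy, k=10)
    print(overall, per_tag)
```

Because every sentence is used for testing exactly once, the pooled accuracy covers the whole corpus, but it still only describes texts similar to that corpus, which is the caveat raised above about the 100,000-word training set. Running the same function separately on each genre's sentences would give per-genre figures.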


Rachel Virgínia Xavier Aires - MSc Student

Sandra Maria Aluísio - Supervisor

Marcio Luis Barse Andreeta - A student who worked on the tagging of the training-test corpus and on the implementation of several tools for combining and evaluating the taggers, [1998-2000]

Ronaldo Teixeira Martins - The linguist who wrote the version of the NILC tagset used on this project

Denise Khun - The linguist who wrote most of the PoSiTagger rules, [June 2000]

Ana Raquel Marchi - The linguist who worked on the correction of the training-test corpus, [June 2000]

Financial Support

Itautec-Philco S.A.: 1999-2000

Finep (PADCT-CE, Proc. 88-98-059100-02-01): 1999-2000

PADCT/Finep - Itautec-Philco (2000-2001)

FAPESP (2001-2003)

Sandra Maria Aluísio: sandra@icmc.usp.br

