Núcleo Interinstitucional de Lingüística Computacional



An Interinstitutional Center for Research and Development in Computational Linguistics


Compilation of Brazilian Portuguese Corpora and Implementation of Corpora Tools for Linguistic Analysis

Period: Jan. 2002 to May 2004


To create basic linguistic and computational resources (such as corpora and associated tools) for implementing Brazilian Portuguese processing applications, which are needed to expand, organize, manipulate and search information on the Web.


The Lacio-Web (LW) project aims to compile corpora that are freely accessible both to non-expert users interested in the Brazilian Portuguese language and to expert users who pursue theoretical and practical linguistic studies or develop computational linguistics tools (e.g. taggers, parsers, sentence and word aligners, automatic term extraction tools, and automatic summarizers) and applications such as natural language information retrieval, machine translation and grammar checking systems.

The LW project comprises six corpora: 1) a reference corpus called Lacio-Ref; 2) Mac-Morpho, a gold-standard portion of Lacio-Ref comprising 1.1 million words, whose morphosyntactic tags were manually validated; 3) a portion of Lacio-Ref automatically annotated with the lemmas, POS tags and syntactic tags used by the Curupira parser developed at NILC; 4) a deviation corpus composed of non-revised texts (Lacio-Dev); and 5) parallel and 6) comparable Portuguese-English corpora called, respectively, Par-C and Comp_C.


Sandra Maria Aluísio - ICMC-USP (coordinator)
Marcelo Finger - IME-USP (vice-coordinator)
Stella Tagnin - FFLCH-USP
Cláudia Monteiro Peixoto - IME-SP
Rachel Xavier Aires - ICMC-USP
Maria das Graças Volpe Nunes - ICMC-USP
Osvaldo Novais de Oliveira Jr. - IFSC-USP
Bento Carlos Dias-da-Silva - FCL-Unesp
Jorge Augusto Teles - FATEC-TQ
Jorge Marques Pelizzoni - ICMC-USP
Ana Raquel Marchi - Unesp/IBILCE - SP
Lucélia Helena de Oliveira - FCL - Unesp
Regiana Manenti - UFSCar-SP
Vanessa Marquiafável - UFSCar - SP
Gisele Montilha - FCL - Unesp
Leandro Henrique Mendonça de Oliveira - ICMC-USP
Luiz Carlos Genoves Junior - ICMC-USP
Aline Maria Pacífico Manfrin - UFSCar - SP
Betânia Carvalho de Morais - FFLCH - USP
Edvan Pereira de Brito - FFLCH - USP

Financial Support
CNPq, AI modality, call for proposals "Tecnologias para Desenvolvimento e Pesquisa em Conteúdos Digitais" (Technologies for Development and Research in Digital Content), 2001.

Sandra Maria Aluísio:

Related Publications

Aluísio, S.M., Pinheiro, G.M., Finger, M., Nunes, M.G.V., Tagnin, S.E. The Lacio-Web Project: overview and issues in Brazilian Portuguese corpora creation. In: Proceedings of Corpus Linguistics 2003, Dawn Archer, Paul Rayson, Andrew Wilson and Tony McEnery (eds.), UCREL Technical Papers, Vol. 16, Part 1, Special Issue (2003), 14-21. Also published in Cadernos de Computação, Vol. 4, No. 1, May 2003, Volpe Nunes, M. das G. and Carvalho, A.C.P.L.F. (eds.), University of São Paulo, ICMC.

Pinheiro, G. M.; Aluísio, S. M. (2003). Corpus Nilc: descrição e análise crítica com vistas ao projeto Lacio-Web. Série de Relatórios do NILC.