An Interinstitutional Center for Research and Development in Computational Linguistics
Lacio-Web
Compilation of Brazilian Portuguese Corpora and Implementation of Corpora Tools for Linguistics Analysis
Period: Jan. 2002 to May 2004
Goals
To create basic linguistic and computational
resources (such as corpora and associated tools) for the implementation of
Brazilian Portuguese processing applications, which are necessary for
increasing, organizing, manipulating and searching information on the Web.
Features
The Lacio-Web (LW) project aims to compile corpora
that are freely accessible to both non-expert users interested in the
Brazilian Portuguese language and expert users who pursue theoretical and
practical linguistic studies or develop computational linguistics tools (e.g.
taggers, parsers, sentence and word aligners, automatic term extraction tools,
and automatic summarizers) and applications such as computer systems for
natural language information retrieval, machine translation and grammar
checking.
The LW project comprises six corpora: 1) a reference
corpus called Lacio-Ref; 2) Mac-Morpho, a gold-standard portion of Lacio-Ref
comprising 1.1 million words, which was manually validated for morphosyntactic tags;
3) a portion of Lacio-Ref automatically annotated with lemmas, POS and
syntactic tags, which are used by the parser Curupira
developed at NILC; 4) a deviation corpus composed of non-revised texts (Lacio-Dev);
and 5) parallel and 6) comparable Portuguese-English corpora called, respectively,
Par-C and Comp_C.
Team
Sandra Maria Aluísio - ICMC-USP (coordinator)
Marcelo Finger - IME-USP (vice-coordinator)
Stella Tagnin - FFLCH-USP
Cláudia Monteiro Peixoto - IME-SP
Rachel Xavier Aires - ICMC-USP
Maria das Graças Volpe Nunes - ICMC-USP
Osvaldo Novais de Oliveira Jr. - IFSC-USP
Bento Carlos Dias-da-Silva - FCL-Unesp
Jorge Augusto Teles - FATEC-TQ
Jorge Marques Pelizzoni - ICMC-USP
Ana Raquel Marchi - Unesp/IBILCE - SP
Lucélia Helena de Oliveira - FCL - Unesp
Regiana Manenti - UFSCar-SP
Vanessa Marquiafável - UFSCar - SP
Gisele Montilha - FCL - Unesp
Leandro Henrique Mendonça de Oliveira - ICMC-USP
Luiz Carlos Genoves Junior - ICMC-USP
Aline Maria Pacífico Manfrin - UFSCar - SP
Betânia Carvalho de Morais - FFLCH - USP
Edvan Pereira de Brito - FFLCH - USP
Financial Support
CNPq, AI modality, call for proposals "Tecnologias para Desenvolvimento e Pesquisa em
Conteúdos Digitais" (Technologies for Development and Research in Digital Content), 2001.
Contact
Sandra Maria Aluísio: sandra@icmc.usp.br
Related Publications
Aluisio, S.M., Pinheiro, G.M., Finger, M., Nunes, M.G.V., Tagnin, S.E. The Lacio-Web Project: overview and issues in Brazilian
Portuguese corpora creation. In: Proceedings of Corpus Linguistics 2003,
Dawn Archer, Paul Rayson, Andrew Wilson and Tony McEnery (eds.), UCREL Technical Papers, Vol. 16,
Part 1, Special Issue (2003) 14-21. Also published in Cadernos de Computação, Vol. 4,
Number 1, May 2003, Volpe Nunes, M. das G. and Carvalho, A.C.P.L.F. (eds.).
Pinheiro, G.M., Aluísio, S.M. (2003). Corpus Nilc: descrição e análise crítica com vistas ao projeto Lacio-Web. Série de Relatórios do NILC.