CORPORA

An Interinstitutional Center for Research and Development in Computational Linguistics

NILC's CORPORA

Start Date: 1993

Goals
To build corpora that support NLP research, especially on Brazilian Portuguese.

Corpus NILC: a 40-million-word corpus of prose texts in Brazilian Portuguese, divided into corrected, uncorrected and semi-corrected texts. This corpus mainly supports the ReGra project and is available at http://acdc.linguateca.pt/acesso/ in both plain-text and POS-tagged (by Bick's tagger) form.
A list of words and their frequencies, collected from the corrected texts, is also available.

Corpus NILC annotated by the PALAVRAS syntactic parser: composed of the corrected texts of Corpus NILC, syntactically annotated by Eckhard Bick's PALAVRAS parser, amounting to 24 million words (33 million tokens). It can be consulted and downloaded at Linguateca (in the AC/DC project).

CorpusDT: a corpus of scientific texts in Brazilian Portuguese built to support the SciPo project. It consists of authentic theses and dissertations in Computer Science and is being annotated using XML.

Parallel Corpora: bilingual (Brazilian Portuguese and English) corpora of parallel texts from different domains (scientific, legal and journalistic), developed to support the PESA project.

Parallel Corpora from Revista Pesquisa FAPESP: Portuguese-English and Portuguese-Spanish bilingual collections of the online issues of the Brazilian scientific news magazine Revista Pesquisa FAPESP.

CorpusGIS: a corpus of grammatically inadequate sentences in Brazilian Portuguese, built to support testing of the grammar checker ReGra. Syntactic tags are used to enable an automatic testing process.

RHETALHO: a corpus rhetorically annotated according to RST (Rhetorical Structure Theory) (Mann and Thompson, 1987), including scientific and news texts.

CSTNews: a discourse-annotated (RST and CST) corpus of news texts. Built for summarization purposes, the corpus comes with single-document and multi-document summaries, among other annotations.

CorpusTéMario: a corpus of 100 newspaper texts, along with their manual summaries and ideal extracts (the latter automatically generated).

Brazilian Portuguese Treebank: a corpus consisting of a part of Corpus NILC that was parsed by Eckhard Bick's parser, PALAVRAS, and is being used as the corpus of examples for the empirical parsing methods in the PAPO project. Another treebank will also be produced in the near future as a result of the implementation of a robust Portuguese parser (derived from ReGra's parser).

Lácio-Web: a corpus freely available on the Web, containing texts of contemporary written Brazilian Portuguese together with a set of computational tools. The texts have been collected, selected and marked up in a way that allows for easy interchange, navigation and analysis of their content. The tools comprise word-counting programs, concordancers and part-of-speech taggers.
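
As an illustration of the kinds of tools mentioned above (word counting and concordancing), here is a minimal Python sketch. It is not the Lácio-Web or NILC implementation. The sample sentence, the function names and the simple letters-only tokenization are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical sample text; it is not taken from any NILC corpus.
SAMPLE = "O corpus contém textos. Os textos do corpus são em português."


def tokenize(text):
    """Lowercase the text and extract runs of letters (accented letters kept)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def word_frequencies(tokens):
    """Count how many times each token occurs."""
    return Counter(tokens)


def concordance(tokens, keyword, width=2):
    """Return keyword-in-context lines with up to `width` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


tokens = tokenize(SAMPLE)
freqs = word_frequencies(tokens)
kwic = concordance(tokens, "corpus")

# Print the most frequent words, then each occurrence of "corpus" in context.
for word, count in freqs.most_common(3):
    print(word, count)
for line in kwic:
    print(line)
```

A real word list or concordancer for a corpus of this size would likely need a more careful tokenizer (hyphenated forms, clitics, abbreviations) and would stream the files from disk rather than hold the text in memory.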

Team
Sandra Maria Aluísio

Maria das Graças Volpe Nunes

Rachel Aires

Andreia G. Bonfante

Ronaldo Teixeira Martins

Osvaldo N. Oliveira Jr.

Lucia H. Machado Rino

Valéria D. Feltrim

Thiago A. S. Pardo

Wilker Aziz

Financial Support
Itautec-Philco S.A.

PADCT/Finep - Itautec-Philco (2000-2001)

FAPESP (2000-2002)

CNPq

Contact
Sandra Maria Aluísio: sandra@icmc.usp.br

Related Publications

Aires, R.V.X.; Aluísio, S.M. Criação de um corpus com 1.000.000 de palavras etiquetado morfossintaticamente. Série de Relatórios do NILC. NILC-TR-01-8, Outubro 2001, 14p.

Martins, M.S.; Caseli, H.M.; Nunes, M.G.V. A construção de um corpus de textos paralelos inglês-português. Série de Relatórios do NILC. NILC-TR-01-5, Setembro 2001.

Feltrim, V.D.; Nunes, M.G.V.; Aluísio, S.M. Um corpus de textos científicos em Português para a análise da Estrutura Esquemática. Série de Relatórios do NILC. NILC-TR-01-4, Julho 2001.

Kuhn, D.; Abarca, E.; Nunes, M.G.V. Corpus NILC - Situação em Maio/2000. (NILC-TR-00-7). Junho 2000, 32p.

Aluísio, S.M.; Aires, R.V. Etiquetação de um Corpus e Construção de um Etiquetador de Português. Relatórios Técnicos do ICMC-USP, 107 (NILC-TR-00-2). Março 2000, 18p.