PESA

An Interinstitutional Center for Research and Development in Computational Linguistics

Parallel Corpora

Corpora of Brazilian Portuguese and English Parallel Texts

Description
Bilingual (Brazilian Portuguese and English) corpora of parallel texts from three domains (scientific, legal and journalistic), developed to support the PESA project. They are:

CorpusPE: composed of 65 pairs of digitized parallel academic texts (abstracts) on Computer Science. It comes in two versions: one with the 65 pairs of authentic (non-revised) texts, and another with the same 65 pairs revised by a human translator (pre-edited corpus) to remove grammatical and translation errors.
CorpusALCA: composed of 4 pairs of digitized parallel official documents of the Free Trade Area of the Americas (FTAA), available on the Web.
CorpusNYT: composed of 7 pairs of digitized parallel articles from "The New York Times", available on the Web in English and Brazilian Portuguese (BP).

Status

Number of Words	Authentic CorpusPE	Pre-edited CorpusPE	CorpusALCA	CorpusNYT
Brazilian Portuguese	11349	11306	11217	5410
English	10083	10186	10852	5185
Total	21432	21492	22069	10595

Features
The corpora above were processed and divided into three classes: test corpora, POS-tagged corpora and reference corpora.

Test Corpora
Texts in the test corpora were tagged with text (<text> and </text>), paragraph ( and ) and sentence (<s> and </s>) boundaries. They were used as input for the automatic sentence alignment methods. An extract of the parallel texts used for testing is:

Brazilian Portuguese

<text lang=pt id=quali3R>
<s>Este trabalho propõe uma modelagem lingüística dos itens lexicais do português do Brasil, uma modelagem relacional e sua implementação na forma de uma Base de Dados Lexicais.</s><s>O recurso de PLN resultante favorece padronização, centralização e reutilização dos dados, facilitando o que é considerado uma das etapas mais difíceis no processo de desenvolvimento: a aquisição de conhecimento lingüístico necessário.</s>
</text>

English

<text lang=en id=quali3A>
<s>This dissertation proposes a linguistic modeling of lexical items of Brazilian Portuguese, a relational modeling and its implementation in the form of a Lexical Database.</s><s>The resulting NLP resource favors the standardization, centralization, and reuse of data, aiming at facilitating one of the most difficult stages in the development process: the linguistic knowledge acquisition.</s>
</text>
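A minimal Python sketch of how a test-corpus file in this format could be read into sentences before alignment. It is not part of the PESA tools; the function name read_texts and the regular expressions are assumptions made for illustration, and the sample is the English extract above.

```python
import re

# Attributes in the extract are unquoted (lang=pt id=quali3R), so a strict XML
# parser would reject the markup; regular expressions are used instead.
TEXT_RE = re.compile(r"<text\s+([^>]*)>(.*?)</text>", re.DOTALL)
ATTR_RE = re.compile(r"(\w+)=([^\s>]+)")
SENT_RE = re.compile(r"<s(?:\s[^>]*)?>(.*?)</s>", re.DOTALL)


def read_texts(markup):
    """Return one dict per <text> element with its lang, id and sentences.

    Hypothetical helper: the name and the returned layout are illustrative.
    """
    texts = []
    for attr_str, body in TEXT_RE.findall(markup):
        attrs = dict(ATTR_RE.findall(attr_str))
        sentences = [s.strip() for s in SENT_RE.findall(body)]
        texts.append({
            "lang": attrs.get("lang"),
            "id": attrs.get("id"),
            "sentences": sentences,
        })
    return texts


SAMPLE = (
    "<text lang=en id=quali3A>\n"
    "<s>This dissertation proposes a linguistic modeling of lexical items of "
    "Brazilian Portuguese, a relational modeling and its implementation in "
    "the form of a Lexical Database.</s>"
    "<s>The resulting NLP resource favors the standardization, centralization, "
    "and reuse of data, aiming at facilitating one of the most difficult "
    "stages in the development process: the linguistic knowledge "
    "acquisition.</s>\n"
    "</text>"
)

for text in read_texts(SAMPLE):
    print(text["id"], text["lang"], len(text["sentences"]))
```

Regular expressions are chosen here only because the unquoted attribute values are not well-formed XML; a file with quoted attributes could be handled by a standard XML parser instead.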

POS-Tagged Corpora
Some of these texts were also POS-tagged, as shown below.

Brazilian Portuguese

<text lang=pt id=quali3R>
<s>Este PRON trabalho N propõe VERB uma ART modelagem N lingüística N dos PREP+ART itens ADJ lexicais N do PREP+ART português N do PREP+ART Brasil NP, uma ART modelagem N relacional ADJ e CONJ sua PRON implementação N na PREP+ART forma N de PREP uma ART Base N de PREP Dados N Lexicais ADJ.</s><s>O ART recurso N de PREP PLN NP resultante ADJ favorece VERB padronização N, centralização N e CONJ reutilização N dos PREP+ART dados N, facilitando VERB o ART que PRON é VERB considerado VERB uma ART das PREP+ART etapas N mais ADV difíceis ADJ no PREP+ART processo N de PREP desenvolvimento N: a ART aquisição N de PREP conhecimento N lingüístico N necessário ADJ.</s>
</text>

English

<text lang=en id=quali3A>
<s>This DT dissertation NN proposes VBZ a DT linguistic JJ modeling NN of IN lexical JJ items NNS of IN Brazilian JJ Portuguese NP, a DT relational JJ modeling NN and CC its PP$ implementation NN in IN the DT form NN of IN a DT Lexical JJ Database NN.</s><s>The DT resulting VBG NLP NN resource NN favors VBZ the DT standardization NN, centralization NN, and CC reuse NN of IN data NNS, aiming VBG at IN facilitating VBG one CD of IN the DT most RBS difficult JJ stages NNS in IN the DT development NN process NN: the DT linguistic JJ knowledge NN acquisition NN.</s>
</text>
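In the extract above, each word is followed by its tag, and punctuation is attached to the end of the preceding tag (for example "NP," or "NN:"). The following Python sketch splits the content of one <s> element into (word, tag) pairs under that assumption. The function name parse_tagged and the "PUNCT" label given to detached punctuation are hypothetical and do not come from the corpus.

```python
# Punctuation marks assumed to be glued to the end of a tag, as in "NP," or "NN:".
PUNCT_CHARS = ",.:;!?"


def parse_tagged(sentence):
    """Split 'word TAG word TAG ...' into a list of (word, tag) pairs.

    Assumes every word is a single whitespace-free token followed by exactly
    one tag, which holds for the extract shown but may not hold corpus-wide.
    """
    tokens = sentence.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating word/tag tokens")
    pairs = []
    for word, raw_tag in zip(tokens[0::2], tokens[1::2]):
        tag = raw_tag.rstrip(PUNCT_CHARS)
        pairs.append((word, tag))
        # Emit each trailing punctuation mark as its own token; "PUNCT" is a
        # label invented for this sketch, not a tag from the corpus tagsets.
        for mark in raw_tag[len(tag):]:
            pairs.append((mark, "PUNCT"))
    return pairs


# A fragment of the first English sentence of the extract.
FRAGMENT = (
    "This DT dissertation NN proposes VBZ a DT linguistic JJ modeling NN "
    "of IN lexical JJ items NNS of IN Brazilian JJ Portuguese NP, "
    "a DT relational JJ modeling NN"
)

for word, tag in parse_tagged(FRAGMENT):
    print(word, tag)
```

Tags such as PREP+ART and PP$ are left untouched because only the punctuation marks listed in PUNCT_CHARS are stripped from the end of a tag.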

Reference Corpora
Finally, all texts in the test corpora were manually aligned to serve as a reference for evaluating the automatic sentence alignment methods. An extract of sentence-aligned parallel texts is shown below. Attributes were added to the opening sentence tag (<s>) to indicate the alignment between source and target sentences: id identifies the sentence, and corresp lists the ids (possibly none) of all sentences that are its translation.

Brazilian Portuguese

<text lang=pt id=quali3R>
<s id=quali3R.1.s1 corresp=quali3A.1.s1>Este trabalho propõe uma modelagem lingüística dos itens lexicais do português do Brasil, uma modelagem relacional e sua implementação na forma de uma Base de Dados Lexicais.</s><s id=quali3R.1.s2 corresp=quali3A.1.s2>O recurso de PLN resultante favorece padronização, centralização e reutilização dos dados, facilitando o que é considerado uma das etapas mais difíceis no processo de desenvolvimento: a aquisição de conhecimento lingüístico necessário.</s>
</text>

English

<text lang=en id=quali3A>
<s id=quali3A.1.s1 corresp=quali3R.1.s1>This dissertation proposes a linguistic modeling of lexical items of Brazilian Portuguese, a relational modeling and its implementation in the form of a Lexical Database.</s><s id=quali3A.1.s2 corresp=quali3R.1.s2>The resulting NLP resource favors the standardization, centralization, and reuse of data, aiming at facilitating one of the most difficult stages in the development process: the linguistic knowledge acquisition.</s>
</text>
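As an illustration of how such a reference could be used, the Python sketch below reads the id and corresp attributes into a set of alignment links and scores a hypothetical automatic alignment against it. Precision and recall are assumed here as the evaluation measures; the page does not state which measures were used. The function names are invented for this sketch, and corresp is treated as holding a single id because the extract does not show how several ids are written.

```python
import re

S_TAG_RE = re.compile(r"<s\s+([^>]*)>")
ATTR_RE = re.compile(r"(\w+)=([^\s>]+)")


def alignment_links(markup):
    """Collect (id, corresp) pairs from every <s> tag that carries both.

    Assumption: corresp holds one id. Sentences without a corresp attribute
    contribute no link.
    """
    links = set()
    for attr_str in S_TAG_RE.findall(markup):
        attrs = dict(ATTR_RE.findall(attr_str))
        if "id" in attrs and "corresp" in attrs:
            links.add((attrs["id"], attrs["corresp"]))
    return links


def precision_recall(proposed, reference):
    """Score proposed links against reference links (assumed measures)."""
    correct = len(proposed & reference)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Source side of the extract; sentence text is shortened for brevity.
REFERENCE_PT = (
    "<text lang=pt id=quali3R>"
    "<s id=quali3R.1.s1 corresp=quali3A.1.s1>Este trabalho propõe uma modelagem</s>"
    "<s id=quali3R.1.s2 corresp=quali3A.1.s2>O recurso de PLN resultante</s>"
    "</text>"
)

reference = alignment_links(REFERENCE_PT)

# Hypothetical aligner output: the first link is right, the second is wrong.
proposed = {
    ("quali3R.1.s1", "quali3A.1.s1"),
    ("quali3R.1.s2", "quali3A.1.s1"),
}

print(precision_recall(proposed, reference))
```

Links are read from the source side only, so each aligned pair is counted once; reading both sides would yield every link twice, once in each direction.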

Future Work
Future work includes compiling more texts for CorpusALCA and CorpusNYT, as well as texts from other domains and languages (possibly Spanish). These corpora will benefit research in Machine Translation involving Brazilian Portuguese, English and Spanish.

Team

(2001)
Helena de Medeiros Caseli (MSc Student)
Maria das Graças Volpe Nunes (Supervisor)
Monica Saddy Martins (Translator)

(2002-2003)
Helena de Medeiros Caseli (MSc Student)
Maria das Graças Volpe Nunes (Supervisor)

Contact
Helena de Medeiros Caseli helename@icmc.usp.br

Related Publications

Caseli, H.M. Alinhamento sentencial de textos paralelos português-inglês. Dissertação de Mestrado. ICMC-USP, Abril, 2003.

Caseli, H.M.; Feltrim, V.D.; Nunes, M.G.V. TagAlign: Uma ferramenta de pré-processamento de textos. Série de Relatórios do NILC. NILC-TR-02-09, Junho, 2002.

Caseli, H.M.; Nunes, M.G.V. A construção dos recursos lingüísticos do projeto PESA. Série de Relatórios do NILC. NILC-TR-02-07, Junho, 2002.

Caseli, H.M. Alinhamento sentencial de textos paralelos Português-Inglês. Monografia de Qualificação. ICMC-USP, Fevereiro, 2002.

Martins, M.S.; Caseli, H.M.; Nunes, M.G.V. A construção de um corpus de textos paralelos inglês-português. Série de Relatórios do NILC. NILC-TR-01-05, Setembro, 2001.

Oliveira Jr., O. N.; Marchi, A. R.; Martins, M. S.; Martins, R. T. A Critical Analysis of the Performance of English-Portuguese-English MT Systems. V Encontro para o processamento computacional da Língua Portuguesa Escrita e Falada (PROPOR'2000), Atibaia, SP, 20 a 22 Novembro 2000.
