Núcleo Interinstitucional de Lingüística Computacional
An Interinstitutional Center for Research and Development in Computational Linguistics


A Functional Parser for Brazilian Portuguese

Starting Time: 2002

Current Status
Version 1.0 of CURUPIRA is currently being tested. A new version (2.0) is under development.

CURUPIRA aims at providing the set of all possible syntactic analyses for any sentence written in Brazilian Portuguese.

Project's Features
CURUPIRA is a general-purpose robust parser for Brazilian Portuguese. It parses sentences top-down and left to right, using a context-free constrained-relaxed functional grammar for standard written Brazilian Portuguese and a broad-coverage lexicon for Brazilian Portuguese. The lexicon is a set of 1.5 million free forms (including inflected and derived forms) annotated with morpho-syntactic information (such as part of speech, number, person, gender, tense, aspect and transitivity). The grammar is hand-made and can be defined by the 5-tuple <S, V, T, P, W>, where 'S' stands for the initial symbol (i.e., any sequence of words between two sentence boundaries, mainly punctuation marks); 'V' stands for the non-terminal vocabulary (a tag set of syntactic functions as close as possible to NGB, the official Brazilian grammar terminology); 'T' stands for the terminal vocabulary (a tag set of morpho-syntactic information ascribed to entries in the lexicon); 'P' stands for the set of production rules, written in a special formalism; and 'W' stands for the priority of application of the production rules.
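As an illustration of the <S, V, T, P, W> definition, the following Python sketch models a toy grammar and a top-down, left-to-right parser that returns every analysis of a tag sequence. It is not CURUPIRA's implementation or rule formalism: the class names, the tags (SENT, SUBJ, PRED, ART, N, V), the rules and the convention that a lower number means higher priority are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    lhs: str        # non-terminal being expanded (member of V)
    rhs: tuple      # sequence of symbols from V or T
    priority: int   # W: assumed convention, lower number = tried first


@dataclass(frozen=True)
class Grammar:
    start: str               # S
    nonterminals: frozenset  # V
    terminals: frozenset     # T
    rules: tuple             # P, with W stored on each rule

    def rules_for(self, symbol):
        """Rules that expand `symbol`, ordered by priority."""
        matching = (r for r in self.rules if r.lhs == symbol)
        return sorted(matching, key=lambda r: r.priority)


# Hypothetical toy grammar; the tags are illustrative, not NILC's tag set.
TOY = Grammar(
    start="SENT",
    nonterminals=frozenset({"SENT", "SUBJ", "PRED"}),
    terminals=frozenset({"ART", "N", "V"}),
    rules=(
        Rule("SENT", ("SUBJ", "PRED"), 1),
        Rule("SUBJ", ("ART", "N"), 1),
        Rule("SUBJ", ("N",), 2),
        Rule("PRED", ("V", "SUBJ"), 1),
        Rule("PRED", ("V",), 2),
    ),
)


def parse(grammar, symbol, tags, pos):
    """Yield (tree, next_pos) for every way `symbol` covers tags[pos:next_pos].

    Plain recursive descent: it would not terminate on a left-recursive
    grammar, which the toy grammar above avoids.
    """
    if symbol in grammar.terminals:
        if pos < len(tags) and tags[pos] == symbol:
            yield symbol, pos + 1
        return
    for rule in grammar.rules_for(symbol):
        # Expand the right-hand side left to right, keeping all alternatives.
        partial = [((), pos)]
        for child in rule.rhs:
            partial = [
                (kids + (tree,), nxt)
                for kids, p in partial
                for tree, nxt in parse(grammar, child, tags, p)
            ]
        for kids, end in partial:
            yield (symbol, kids), end


def analyses(grammar, tags):
    """All trees rooted in the start symbol that span the whole input."""
    return [
        tree
        for tree, end in parse(grammar, grammar.start, tags, 0)
        if end == len(tags)
    ]
```

Calling analyses(TOY, ["ART", "N", "V", "N"]) returns the analyses of that tag sequence as nested tuples, generated in the order in which the prioritised rules were tried.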

CURUPIRA was formerly part of ReGra, the grammar and style checker developed by NILC, and was thus designed primarily to parse strings of words irrespective of their grammaticality. No government, agreement or other dependency relations are checked. Except for function words (such as articles and prepositions), no lexical disambiguation is carried out either. Decisions on the best part-of-speech candidate are taken by the parser itself, as it tries to satisfy the highest-ranked syntactic structures first.
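The idea that the parser itself settles part-of-speech ambiguity can be sketched as follows. This is a simplified illustration, not the tool's actual mechanism: the lexicon fragment, the tag names and the flat tag patterns (standing in for ranked syntactic structures) are hypothetical.

```python
from itertools import product

# Hypothetical lexicon fragment: each form lists all its candidate tags.
# The article "o" is assumed to have been disambiguated already, since
# function words are the one case handled before parsing.
LEXICON = {
    "o": ["ART"],
    "canto": ["N", "V"],
    "encanta": ["V"],
}

# Hypothetical tag patterns standing in for syntactic structures,
# listed from highest to lowest priority.
PATTERNS = [
    ("ART", "N", "V"),
    ("ART", "V", "V"),
]


def choose_tags(words):
    """Return the tag sequence licensed by the highest-ranked pattern.

    No separate tagging step runs for content words: the first pattern
    that matches some combination of candidate tags decides them.
    """
    candidates = set(product(*(LEXICON[w] for w in words)))
    for pattern in PATTERNS:
        if pattern in candidates:
            return pattern
    return None
```

Here "canto" is ambiguous between noun and verb, and the choice falls out of which pattern is tried first rather than from a dedicated disambiguation module.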

The input to CURUPIRA can be either an isolated sentence or a text (which is split into sentences) and should follow standard written Brazilian Portuguese syntax for better results. Topicalizations, clefts and syntactic inversions cannot be handled by the tool in this first version. The output of CURUPIRA follows a special notation developed by NILC.
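A minimal sketch of how a text might be split into sentences before parsing is given below. It assumes that sentence boundaries are the punctuation marks '.', '!' and '?' followed by whitespace; the function name is hypothetical, and the rule CURUPIRA actually uses (including its treatment of abbreviations, which this sketch would split incorrectly) is not described here.

```python
import re

# Assumed boundary: whitespace that follows '.', '!' or '?'.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split a text into sentences, keeping the final punctuation mark."""
    return [s for s in _BOUNDARY.split(text.strip()) if s]
```

An isolated sentence passes through unchanged as a one-element list, so the same entry point can serve both kinds of input.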

Expected Results
CURUPIRA is not committed to generating the right syntactic tree for a given sentence, but rather the most common surface combinations for any sequence of morpho-syntactic classes. No semantic interpretation or syntactic disambiguation is carried out, and parse results are expected to be ranked solely according to the priority of application of the rules (which is supposed to provide the most appropriate tree for checking purposes).
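Ranking by rule priority alone could look like the sketch below. The representation is an assumption: each analysis is paired with the tuple of priorities of the rules used to build it, in order of application, with lower numbers meaning higher priority, and analyses are compared lexicographically on that tuple. How CURUPIRA actually combines priorities is not specified here.

```python
def rank(analyses):
    """Order analyses by the priorities of the rules that built them.

    `analyses` is a list of (label, priorities) pairs, where `priorities`
    is a tuple of rule priorities in application order (assumed format).
    Tuples compare lexicographically, so earlier rule choices dominate.
    """
    return sorted(analyses, key=lambda analysis: analysis[1])


# Hypothetical analyses of one sentence, identified by a label.
candidates = [
    ("tree-B", (1, 2, 1)),
    ("tree-A", (1, 1, 2)),
    ("tree-C", (2, 1)),
]
ranked = rank(candidates)
```

No semantic or lexical score enters the comparison, which mirrors the statement that rule priority is the only ranking criterion.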

Team (2004)
Maria Graças Volpe Nunes (coordinator)
Ricardo Hasegawa
Ronaldo Teixeira Martins

Ricardo Hasegawa: rh@icmc.usp.br
Ronaldo Martins: rtmartin@.uol.com.br

Related Publications
Martins, R. T.; Hasegawa, R.; Nunes, M.G.V. Curupira: um parser funcional para o português. NILC-TR-02-26, December 2002.

Martins, R. T.; Hasegawa, R.; Nunes, M.G.V. Curupira: a functional parser for Brazilian Portuguese. In Nuno J. Mamede, Jorge Baptista, Isabel Trancoso, Maria das Graças Volpe Nunes (Eds.): Computational Processing of the Portuguese Language, 6th International Workshop, PROPOR 2003, Faro, Portugal, June 26-27, 2003. Proceedings. Lecture Notes in Computer Science 2721 Springer 2003, ISBN 3-540-40436-8