Núcleo Interinstitucional de Lingüística Computacional
An Interinstitutional Center for Research and Development in Computational Linguistics

LINGUARUDO

Using stylistic features of pages in presenting Web search results according to user search intention –

an instantiation for the Portuguese language

Starting Time: 2001

Current Status

Ongoing project, part of a PhD program (2001-2004), that was carried out partially (from January 2002 to December 2003) at the Oslo node of Linguateca at SINTEF and is now being developed at NILC.

Goals
To define an approach for presenting the results of IR systems that takes into account not only the document topic but also the focus the user expects from the results. Here, the expected focus is selected from a taxonomy of user needs, three content types, and four criteria (axes), all explained below.

This approach will be evaluated with real users in a prototype of a desktop search application.

Project's Features

This approach explores the use of stylistic features of pages in Portuguese to present the results of IR systems according to the user's needs, and to rank the results either by the content type of the web pages or along four axes, with all of these options selected by the user. We also use text segmentation and automatic summarization to improve the presentation.

 

Our taxonomy of user needs is composed of seven categories, which are based on what the user wants:

1 - A definition of something or to learn how or why something happens. For example, “what are the northern lights?”

2 - To learn how to do something or how something is usually done. For example, “find a recipe for his/her favourite cake”, “learn how to make gift boxes”, or “how to install Linux on his/her computer”.

3 - A comprehensive presentation about a given topic, such as “a panorama of 20th century American literature”.

4 - To read news about a specific subject. For example, “what is the current news about the situation in Israel?”

5 - To find information about a person, company, or organization. For example, the user wants “to know more about his/her blind date” or “to find the contact information of someone he/she met at a conference”.

6 - To find a specific web page that he/she wants to visit, but does not remember its URL.

7 - To find URLs where he/she can access a given online service. For example, “he/she wants to buy new clothes” or “he/she wants to download a new version of some software”.

 

In the classification by page content, we consider that pages can have three types of content: informative, transactional, or navigational.

 

With regard to other style criteria for pages, we offer four axes: formal/informal, short/elaborated, contextualized/not contextualized, and involved/detached.
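The three classification schemes described above can be pictured as a small data model. The sketch below is illustrative only: the class names, field names, shortened category labels, and the example URL are assumptions made for this example, not identifiers from the Linguarudo prototype. Treating each axis as a simple yes/no choice is also an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class UserNeed(Enum):
    """The seven user-need categories (labels shortened for this sketch)."""
    DEFINITION = 1      # a definition, or how/why something happens
    HOW_TO = 2          # how to do something
    OVERVIEW = 3        # comprehensive presentation of a topic
    NEWS = 4            # news about a specific subject
    PERSON_OR_ORG = 5   # information about a person or organization
    KNOWN_PAGE = 6      # a specific page whose URL was forgotten
    ONLINE_SERVICE = 7  # URLs giving access to an online service


class ContentType(Enum):
    """The three content types a page can have."""
    INFORMATIVE = "informative"
    TRANSACTIONAL = "transactional"
    NAVIGATIONAL = "navigational"


@dataclass(frozen=True)
class StyleAxes:
    """The four style axes; True selects the first pole (an assumption)."""
    formal: bool          # formal / informal
    short: bool           # short / elaborated
    contextualized: bool  # contextualized or not
    involved: bool        # involved / detached


@dataclass(frozen=True)
class PageLabels:
    """Labels that a classifier might attach to one result page."""
    url: str
    need: UserNeed
    content: ContentType
    axes: StyleAxes


# Hypothetical example: a short, informal recipe page.
page = PageLabels(
    url="http://example.org/receita-de-bolo",
    need=UserNeed.HOW_TO,
    content=ContentType.INFORMATIVE,
    axes=StyleAxes(formal=False, short=True, contextualized=False, involved=True),
)
```

Keeping the three schemes as separate fields mirrors the text: the user picks a need, a content type, or an axis, and each can be used independently to organize the results.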

Expected Results
1) More accurate results for the user. We expect to provide an alternative way of showing the results of IR systems that spares the user from looking at many results that are relevant to the query topic but not relevant to the user at the moment he/she poses the query. This should shorten the time spent finding answers and make the system's operation and the relations among the results clearer.

2) Because we apply the classification schemes at the end of the search process (in the presentation of results that have already been generated), other systems can use the results of this project without major modifications.

3) A prototype of a desktop search tool for Portuguese. Since our corpus contains only texts in Portuguese, we cannot guarantee that it will work well for other languages.

4) A review of IR from the point of view of NLP, illustrating our solutions for the Portuguese language.
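Item 2 above says the classification is applied to results that have already been generated. A minimal sketch of that idea follows, assuming a hypothetical `classify` function that maps a URL to a need label; the function name `present_by_need`, the label strings, and the toy data are all invented for illustration and do not come from the project.

```python
from typing import Callable, Iterable


def present_by_need(
    results: Iterable[str],
    classify: Callable[[str], str],
    wanted_need: str,
) -> list[str]:
    """Reorder an existing result list so matching pages come first.

    sorted() is stable, so the engine's original ranking is preserved
    inside the matching group and inside the non-matching group.
    """
    return sorted(results, key=lambda url: classify(url) != wanted_need)


# Toy stand-in for a real classifier: a fixed lookup table (hypothetical data).
toy_labels = {
    "a.example/news": "news",
    "b.example/howto": "how-to",
    "c.example/shop": "online-service",
    "d.example/howto2": "how-to",
}

# Iterating the dict yields the URLs in their original (insertion) order.
ranked = present_by_need(toy_labels, toy_labels.get, "how-to")
print(ranked)
```

Because the function only consumes a finished result list, it could sit behind any existing search engine without touching indexing or matching, which is the point the item makes.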

Why have the project objectives changed?

At the beginning of this project, its goal was to develop a linguistically motivated approach to information retrieval for Portuguese. This approach was supposed to allow queries in natural language instead of keywords, and to use NLP tools and techniques to exploit features of the Portuguese language during query interpretation, matching, and the presentation of results.

To achieve that goal, we studied IR in general, not only the work that has used NLP techniques or tools. Even when considering only the resources and tools we knew had already been developed for Portuguese, we concluded that, in each part, we would have to apply a set of tools and techniques in an isolated fashion. To interpret queries, we would use a morphological analyser, a tagger, a parser, a spell checker, a multi-word extraction algorithm, extraction of relationships (such as "kind of" and "part of"), a thesaurus, and patterns of questions in Portuguese. To match queries with indexed pages, we would use a morphological analyser, a tagger, a parser, a multi-word extraction algorithm, extraction of relationships (such as "kind of" and "part of"), a stemmer, and a thesaurus. Finally, to present the results, we would use stylistic features, an automatic text classification algorithm, a text segmentation algorithm, and automatic summarization methods.
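The tool lists in the paragraph above can be summarized as a simple mapping from phase to tools. The snippet below is only a restatement of that paragraph; the dictionary keys and the shortened tool labels are our own naming for this sketch.

```python
# Tools planned for each phase, as listed in the paragraph above.
tools_by_phase = {
    "query interpretation": {
        "morphological analyser", "tagger", "parser", "spell checker",
        "multi-word extraction", "relationship extraction",
        "thesaurus", "question patterns",
    },
    "matching": {
        "morphological analyser", "tagger", "parser",
        "multi-word extraction", "relationship extraction",
        "stemmer", "thesaurus",
    },
    "presentation": {
        "stylistic features", "text classification",
        "text segmentation", "automatic summarization",
    },
}

# Tools that the first two phases have in common.
shared = tools_by_phase["query interpretation"] & tools_by_phase["matching"]

# Tools used only when presenting results.
presentation_only = tools_by_phase["presentation"] - (
    tools_by_phase["query interpretation"] | tools_by_phase["matching"]
)

print(sorted(shared))
print(sorted(presentation_only))
```

Laid out this way, query interpretation and matching share most of their tools, while the presentation phase relies on a separate set.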

While we were collecting the materials and methods we would use in our experiments, we realized that exploring IR as a whole was even more ambitious than we first thought, and that doing so would not allow us to explore and evaluate in depth all the techniques and resources for each part. We chose to explore the presentation of results because:

- Much work has already been done to develop efficient indexing techniques. The search engines are able to index many pages, building large databases fast.

- While we believe that queries in natural language are also ambiguous, they can give us much more information about the user's real needs than keywords can. However, we do not believe that the systems already in use would change their modus operandi to support natural language queries.

- Much effort is being put into the interpretation of queries, for example, into query expansion. For this phase there are several works by NLP researchers, even for Portuguese.

- The search engines are good at finding relevant results and the matching process works well; however, not all the web pages that are about a topic give the focus the user wants. Given the large number of results, poor queries, and the fact that users do not use many of the resources available to them in the search engine's interface (feedback, for example), it is important to find a clear and easy way to present the focus of each result for each query topic.

- The presentation of results is also the easiest phase to modify in systems already in use, making the results of this project useful to systems already developed.

Team
Rachel Virgínia Xavier Aires (PhD Student)

Sandra Maria Aluísio (supervisor)

Diana Santos (supervisor)

Financial Support
Fundação para Computação Científica Nacional (FCCN) through Fundação para a Ciência e Tecnologia with the grant POSI/PLP/43931/2001 and co-financed by POSI: since September 2001.

Contact
Rachel Aires: raires@icmc.usp.br

Related Publications

Aires, R.; Manfrin, A.; Aluísio, S.; Santos, D. (2004). What is my Style? Using Stylistic Features of Portuguese Web Texts to classify Web pages according to Users' Needs. To appear in Proceedings of LREC 2004. Lisbon - Portugal.

Aires, R.V.X.; Aluísio, S. M.; Quaresma, P.; Santos, D.; Silva, M. (2003). An initial proposal for cooperative evaluation on information retrieval in Portuguese. In PROPOR 2003 – 6th Workshop on Computational Processing of the Portuguese Language, Faro - Portugal, June 2003, p. 227-234. (c) Springer-Verlag.

Aires, R. V. X. (2003). Linguarudo – Uma arquitetura lingüisticamente motivada para recuperação de informação de textos em português. Qualificação de Doutorado. ICMC-USP, March, 2003, 86p.

Aires, R. V. X; Aluísio, S. M. (2003). Como incrementar a qualidade dos resultados das máquinas de busca: da análise de logs à interação em português. Revista Ciência da Informação, vol 32, n. 1, p. 5-16, jan./abr. 2003.

Aires, R. V. X; Santos, D. (2002). Measuring the Web in Portuguese. In Euroweb 2002 conference, Oxford, UK, p. 198-199, December 2002.

Aires, R. V. X; Aluísio, S. M. (2002). Eu falo português. E daí? Poster in IHC 2002 – 5th Symposium on Human Factors in Computer Systems, Fortaleza - CE, October, 2002.

Links

Resources used on this project

This page in Portuguese
Linguateca
SINTEF Information & Communication Technology
NILC - Núcleo Interinstitucional de Lingüística Computacional
ICMC - USP

Last Update:  15/05/2004