School of Computing Research Colloquia

Automatic genre identification on the web

Noushin Rezapour Asheghi (Artificial Intelligence group, School of Computing)

Abstract: Writers vary their style, structure and vocabulary significantly depending on the genre of the document they write. For example, an interview, a news article and an editorial for a newspaper on the same topic will be presented very differently because they have different purposes or structures or both. The purpose of news articles is to inform people, so they are written in an informative style, whereas the main purpose of editorials is to express opinion, so they are presented in an argumentative style. Meanwhile, the structure of interviews, which are usually in the form of face-to-face questions and answers, distinguishes them from both news articles and editorials.

Automatic Genre Identification (AGI) classifies documents into genres that encapsulate the main communicative function of a text. AGI is important for at least two reasons:
(i) Genres differ widely with regard to their style, structure and vocabulary, and therefore other Natural Language Processing tools need to be modified when carried over from one genre to another. Thus, AGI can help to choose more appropriate language models for a variety of tasks such as part-of-speech tagging or word sense disambiguation.
(ii) Many web-based applications are limited to particular genres and therefore presuppose (automatic) identification of these genres. Examples are genre-specific search engines (for example Google Scholar, which only retrieves academic papers) or genre-specific summarization (such as news summarization).

My work focuses specifically on genres on the Web. In my talk I will present two results of my work. First, I will present the first reliably annotated web genre corpus, which was created via crowdsourcing. Second, I will discuss experiments in AGI on this corpus, comparing a range of surface and structural features. The results show that surface features such as bag of words and character n-grams have more discriminative power than structural features such as part-of-speech tags and part-of-speech n-grams.
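
To make the two surface feature types named above concrete, here is a minimal sketch of how bag-of-words and character n-gram counts could be extracted from a text. It is an illustration only, not the feature extraction used in the experiments: the function names, the regex tokenizer and the choice of n = 3 are all assumptions.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercased word tokens (simple regex tokenizer, an assumption)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def char_ngrams(text, n=3):
    """Count overlapping character n-grams, spaces and punctuation included."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Hypothetical interview-style snippet, used only for illustration.
sample = "Q: Why genres? A: Genres shape style."
words = bag_of_words(sample)
trigrams = char_ngrams(sample.lower(), n=3)
```

Counts like these would typically be turned into feature vectors for a classifier. Note that character n-grams keep punctuation such as the "q:" and "a:" markers of an interview, which a word-only tokenizer discards.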