Electronic encyclopedia for enriching document representation for information retrieval tasks

Researcher:
Prof. Shaul Markovitch | Computer Science

Categories:

Information and Computer Science

The Technology

Computerized categorization of text documents has many real-world applications. One example is email filtering, in which a computer detects the messages that are relevant to the categories of interest to the receiver. Another is news or message routing, in which a computer routes messages and documents to the recipients who deal with the subjects they cover. Other applications include automatic document organization and automatic information retrieval. Search engines can use computerized categorization to interpret a query and find the most relevant responses.

The standard approach to computerized categorization is to build a classifier from a large set of documents, referred to as a training set. The training set contains documents that were previously categorized, for example by human reviewers. Typically, a set of categories is defined and the reviewers determine which category or categories each document belongs to. The categories may be distinct or interconnected; for example, they may form a hierarchy in which categories are subdivided into subcategories. A well-known example of such a training set is Reuters-21578.

We apply machine learning techniques to Wikipedia, the largest encyclopedia to date, which surpasses many conventional encyclopedias in scope and provides a cornucopia of world knowledge. Each Wikipedia article represents a concept, and documents to be categorized are represented in the rich feature space of words and relevant Wikipedia concepts. Empirical results confirm that this knowledge-intensive representation brings text categorization to a qualitatively new level of performance across a diverse collection of datasets.
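The idea of representing a document by both its words and its relevant encyclopedia concepts can be illustrated with a minimal Python sketch. It is not the researchers' implementation: the three-entry `CONCEPT_ARTICLES` dictionary is a hypothetical stand-in for Wikipedia, the function names are invented for illustration, and relevance is assumed here to be a simple TF-IDF overlap between the document's words and each concept's article text.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stand-in for encyclopedia articles: concept title -> article text.
CONCEPT_ARTICLES = {
    "Banking": "bank loan interest money deposit credit account",
    "River": "river water bank stream flow flood",
    "Computer virus": "virus computer software infection malware program",
}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_word_to_concepts(articles):
    """Map each word to {concept: weight}, using TF-IDF over the articles."""
    doc_freq = Counter()
    term_counts = {}
    for concept, text in articles.items():
        counts = Counter(tokenize(text))
        term_counts[concept] = counts
        doc_freq.update(counts.keys())
    n_articles = len(articles)
    index = defaultdict(dict)
    for concept, counts in term_counts.items():
        for word, tf in counts.items():
            # Assumed weighting: add 1 so words shared by all articles keep some weight.
            idf = math.log(n_articles / doc_freq[word]) + 1.0
            index[word][concept] = tf * idf
    return index


def enriched_features(document, index, top_k=2):
    """Return bag-of-words features plus the top_k best-matching concepts."""
    words = Counter(tokenize(document))
    scores = defaultdict(float)
    for word, count in words.items():
        for concept, weight in index.get(word, {}).items():
            scores[concept] += count * weight
    # Sort by descending score, breaking ties alphabetically for determinism.
    top = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    features = {f"word:{w}": float(c) for w, c in words.items()}
    features.update({f"concept:{c}": s for c, s in top})
    return features


if __name__ == "__main__":
    index = build_word_to_concepts(CONCEPT_ARTICLES)
    for name, value in enriched_features(
        "The bank raised the interest on my loan", index
    ).items():
        print(name, round(value, 3))
```

In this toy setting the word "bank" alone is ambiguous, but the added concept features weight "Banking" above "River" because "interest" and "loan" also match the banking article. The enriched feature dictionary could then be passed to any standard classifier trained on a labeled collection.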

Advantages

  • Knowledge-intensive representation that brings text categorization to a qualitatively new level of performance across a diverse collection of datasets

Applications and Opportunities

  • Text categorization (such as email filtering and news routing), information retrieval (such as web search), and question answering
Business Development Contacts
Dr. Arkadiy Morgenshtein
Director of Business Development, ICT