Automatic Text Categorization (TC) is a complex and useful task for many natural language applications, and is usually performed using a set of manually classified documents, called a training collection. Term-based representation of documents has found widespread use in TC. However, one of the main shortcomings of such methods is that they largely disregard lexical semantics and, as a consequence, are not sufficiently robust with respect to variations in word usage. We design, implement, and evaluate a new text classification algorithm. Our main idea is to find a series of projections of the training data by using a new modified LSI algorithm, project all training instances to the low-dimensional subspace found in the previous step, and induce a binary search on the projected low-dimensional data. Our conclusion is that, for all its simplicity and efficiency, our approach is comparable (and sometimes superior) to SVM in terms of accuracy.
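
The pipeline outlined in the abstract (find a projection, project the training instances, classify in the low-dimensional space) can be illustrated with a minimal Python sketch. It rests on several assumptions: the abstract does not specify the modified LSI algorithm or the binary search step, so the sketch uses standard LSI (a truncated SVD of the document-term matrix) and a nearest-centroid rule as stand-ins. The toy corpus, the labels, and all function names (`build_vocab`, `to_counts`, `fit_lsi`, `fit_centroids`, `classify`) are hypothetical and do not come from the paper.

```python
import numpy as np

def build_vocab(docs):
    """Map each distinct whitespace-separated token to a column index."""
    terms = sorted({t for d in docs for t in d.lower().split()})
    return {t: i for i, t in enumerate(terms)}

def to_counts(docs, vocab):
    """Document-term count matrix; out-of-vocabulary tokens are ignored."""
    X = np.zeros((len(docs), len(vocab)))
    for row, d in enumerate(docs):
        for t in d.lower().split():
            if t in vocab:
                X[row, vocab[t]] += 1.0
    return X

def fit_lsi(X, k):
    """Standard LSI (assumption: not the paper's modified variant).

    Returns the top-k right singular vectors as a (n_terms, k) basis.
    """
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:k].T

def fit_centroids(Z, labels):
    """Mean projected vector per class (stand-in for the paper's classifier)."""
    labels = np.array(labels)
    return {c: Z[labels == c].mean(axis=0) for c in sorted(set(labels.tolist()))}

def classify(text, vocab, basis, centroids):
    """Project a new document and return the label of the nearest centroid."""
    z = (to_counts([text], vocab) @ basis)[0]
    return min(centroids, key=lambda c: np.linalg.norm(z - centroids[c]))

# Hypothetical toy training collection (manually labelled documents).
train_docs = [
    "goal match team",
    "match team striker",
    "goal striker match",
    "bank market stocks",
    "market stocks rates",
    "bank rates market",
]
train_labels = ["sport", "sport", "sport", "finance", "finance", "finance"]

vocab = build_vocab(train_docs)
X = to_counts(train_docs, vocab)
basis = fit_lsi(X, k=2)                       # step 1: find the projection
centroids = fit_centroids(X @ basis, train_labels)  # step 2: project training data
print(classify("striker goal", vocab, basis, centroids))  # step 3: classify
```

In this sketch the only tunable choice is `k`, the dimension of the subspace. Replacing `fit_lsi` or `fit_centroids` with other projection or classification steps leaves the rest of the pipeline unchanged.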

Biancalana, C., Micarelli, A. (2004). Text Categorization with Modified LSI.

Text Categorization with Modified LSI

MICARELLI, Alessandro
2004-01-01


Use this identifier to cite or link to this document: https://hdl.handle.net/11590/272278