Automatic Text Categorization (TC) is a complex and useful task for many natural language applications, and is usually performed using a set of manually classified documents, known as a training collection. Term-based representation of documents has found widespread use in TC. However, one of the main shortcomings of such methods is that they largely disregard lexical semantics and, as a consequence, are not sufficiently robust with respect to variations in word usage. We design, implement, and evaluate a new text classification algorithm. Our main idea is to find a series of projections of the training data by using a new modified LSI algorithm, project all training instances onto the low-dimensional subspace found in the previous step, and induce a binary search on the projected low-dimensional data. Our conclusion is that, for all its simplicity and efficiency, our approach is comparable (and sometimes superior) to SVM in terms of accuracy.
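The pipeline outlined in the abstract (find a low-dimensional projection, project the training documents, then classify in the projected space) can be illustrated with a minimal sketch. The abstract does not specify the modification to LSI or the details of the search step, so the sketch below assumes standard LSI via truncated SVD and substitutes a nearest-centroid rule with cosine similarity for the final step. All function names and the toy documents are hypothetical and serve only as illustration.

```python
import numpy as np

def build_vocab(docs):
    # Map each distinct lowercase token to a column index.
    vocab = sorted({w for d in docs for w in d.lower().split()})
    return {w: i for i, w in enumerate(vocab)}

def term_doc_matrix(docs, vocab):
    # Rows are documents, columns are terms, entries are raw counts.
    X = np.zeros((len(docs), len(vocab)))
    for r, d in enumerate(docs):
        for w in d.lower().split():
            if w in vocab:  # out-of-vocabulary words are ignored
                X[r, vocab[w]] += 1.0
    return X

def fit_lsi(X, k):
    # Standard LSI (assumption: NOT the paper's modified variant).
    # The top-k right singular vectors span the latent subspace.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:k].T  # shape: (n_terms, k)

def project(X, basis):
    # Project documents into the latent space and L2-normalise each row.
    Z = X @ basis
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.where(norms == 0, 1.0, norms)

def fit_centroids(Z, labels):
    # Stand-in for the search step: one mean vector per class.
    classes = sorted(set(labels))
    y = np.array(labels)
    cents = np.vstack([Z[y == c].mean(axis=0) for c in classes])
    return classes, cents

def predict(Z, classes, cents):
    # Assign each document to the class whose centroid is most similar.
    sims = Z @ cents.T
    return [classes[i] for i in sims.argmax(axis=1)]

# Toy training collection (hypothetical data).
train_docs = [
    "stock market shares rise",
    "market investors buy shares",
    "football team wins match",
    "team scores goal in match",
]
train_labels = ["finance", "finance", "sport", "sport"]

vocab = build_vocab(train_docs)
basis = fit_lsi(term_doc_matrix(train_docs, vocab), k=2)
classes, cents = fit_centroids(
    project(term_doc_matrix(train_docs, vocab), basis), train_labels
)

test_docs = ["investors buy stock", "football goal"]
preds = predict(project(term_doc_matrix(test_docs, vocab), basis), classes, cents)
print(preds)
```

Because classification happens in the latent space, a test document can match a class even when it shares no words with a particular training document, which is the robustness to variations in word usage that the abstract points to. Replacing `fit_lsi` or the centroid rule is where an actual modified LSI or a different search procedure would plug in.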
Biancalana, C., Micarelli, A. (2004). Text Categorization with Modified LSI.
Text Categorization with Modified LSI
MICARELLI, Alessandro
2004-01-01