Text Categorization Using Only Fragments of Documents

Pilászy, István – Dobrowiecki, Tadeusz

Keywords: machine learning, text categorization, classifier ensembles

In this paper we presented a series of experiments examining how particular parts of documents contribute to the performance of a classifier. We evaluated text classifiers on two very different text corpora. We conclude that some parts of the text are more important than others for classification performance, and that giving higher weights to the more important parts can improve the classifier. Which parts are more or less important depends on the nature of the documents in the corpus. Some tasks remain to be done:
- More text corpora should be investigated.
- In Section 6.4 we optimized the number of features to keep independently of the section; it could instead be optimized separately for each section.
- Documents could be split into parts of 50 words each, to examine what happens when the parts are of equal size not only within a document but also across documents.
- When splitting documents into k equal parts, the classifiers obtained for different values of k could be combined.
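
The two splitting schemes mentioned above, together with the idea of weighting parts by importance, can be illustrated with a minimal Python sketch. This is not the implementation used in the experiments: the function names (split_into_k_parts, split_into_fixed_chunks, combine_part_scores), the representation of a document as a list of words, and the use of a normalized weighted average of per-part class scores are all assumptions made for illustration.

```python
from typing import Dict, List


def split_into_k_parts(words: List[str], k: int) -> List[List[str]]:
    """Split a word list into k contiguous parts whose sizes differ by at most one."""
    if k <= 0:
        raise ValueError("k must be positive")
    base, extra = divmod(len(words), k)
    parts, start = [], 0
    for i in range(k):
        # The first `extra` parts receive one additional word.
        size = base + (1 if i < extra else 0)
        parts.append(words[start:start + size])
        start += size
    return parts


def split_into_fixed_chunks(words: List[str], chunk_size: int = 50) -> List[List[str]]:
    """Split a word list into consecutive chunks of chunk_size words (the last may be shorter)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]


def combine_part_scores(part_scores: List[Dict[str, float]],
                        weights: List[float]) -> Dict[str, float]:
    """Combine per-part class scores with a normalized weighted average (an assumed scheme)."""
    if len(part_scores) != len(weights):
        raise ValueError("need exactly one weight per part")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    combined: Dict[str, float] = {}
    for scores, weight in zip(part_scores, weights):
        for label, score in scores.items():
            combined[label] = combined.get(label, 0.0) + weight * score / total
    return combined


if __name__ == "__main__":
    doc = ("w%d" % i for i in range(120))
    words = list(doc)
    print([len(p) for p in split_into_k_parts(words, 3)])
    print([len(c) for c in split_into_fixed_chunks(words, 50)])
    # Hypothetical scores from two per-part classifiers; the second part is weighted higher.
    print(combine_part_scores([{"a": 0.2, "b": 0.8}, {"a": 0.9, "b": 0.1}], [1.0, 3.0]))
```

With fixed 50-word chunks, the number of parts varies with document length, whereas splitting into k parts fixes the number of parts and lets their size vary; the weights passed to combine_part_scores would have to be chosen per corpus, for example on held-out data.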