Towards a Universal Text Classifier: Transfer Learning from Encyclopedic Knowledge

12:00pm, Tuesday, March 31, 2009, ST2, 430

Speaker

Pu Wang
PhD student
Department of Computer Science
GMU

Abstract

Document classification is a key task for many text mining applications. However, traditional text classification requires labeled data to construct reliable and accurate classifiers. Unfortunately, labeled data are seldom available and are often too expensive to obtain. In this work, we propose a universal text classifier, which does not require any labeled training documents. Our approach simulates the ability of people to classify documents based on background knowledge. As such, we build a classifier that can effectively group documents based on their content, guided by a few words describing the classes of interest. Background knowledge is modeled using encyclopedic knowledge, namely Wikipedia. Wikipedia articles related to the specific problem domain at hand are selected and used during the learning process to predict the labels of test documents. The universal text classifier can also be used to perform document retrieval, in which the pool of test documents may or may not be relevant to the user's topics of interest. In our experiments with real data, we test the feasibility of our approach for both the classification and retrieval tasks. The results demonstrate the advantage of incorporating background knowledge through Wikipedia, and the effectiveness of modeling such knowledge via probabilistic topic modeling. The accuracy achieved by the universal text classifier is comparable to that of a supervised learning technique for transfer learning.

Short Bio

Pu Wang is a PhD student in the Department of Computer Science at George Mason University. He received a Master's degree in Computer Science from Beijing University in 2007, and was an intern at Microsoft Research Asia from May 2006 to July 2007. His research interests are in machine learning, with a current focus on graph learning.