Classification, Clustering and Data Mining of Biological Data

GRAND Seminar 12:00 noon, April 21, Thur., 2011, ENGR 4201

Peter Revesz
http://www.cse.unl.edu/~revesz/
Department of Computer Science and Engineering
University of Nebraska-Lincoln
and
Jefferson Science Fellow
U.S. Department of State

Host:

Alexander Brodsky

Abstract:

The proliferation of biological databases and the easy access enabled by the Internet are having a beneficial impact on the biological sciences and transforming the way research is conducted. There are currently over 1100 molecular biology databases dispersed throughout the Internet. However, very few of them integrate data from multiple sources. To assist in the functional and evolutionary analysis of the large number of novel proteins, we introduce the PROFESS (PROtein Function, Evolution, Structure and Sequence) database, which integrates data from various biological sources. PROFESS is freely available at http://cse.unl.edu/~profess/. Our database is designed to be versatile and expandable and will not confine analysis to a pre-existing set of data relationships. Using PROFESS, we were able to quantify homologous protein evolution and determine whether bacterial protein structures are subject to random drift after divergence from a common ancestor.

After relevant data have been mined, they may be classified or clustered for further analysis. Data classification is usually achieved using machine-learning techniques. However, in many problems the raw data are already classified according to a set of features but need to be reclassified. Data reclassification is usually achieved using data integration methods that require the raw data, which may not be available or sharable because of privacy and legal concerns. We introduce general classification integration and reclassification methods that create new classes by flexibly combining the existing classes without requiring access to the raw data. The flexibility is achieved by representing any linear classification in a constraint database. We also considered temporal data classification, where the input is a temporal database that describes measurements over a past period of time while the predicted class is expected to occur in the future. We experimented with the proposed classification methods on five datasets covering the automobile, meteorological and medical areas and showed significant improvements over existing methods.
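
As a rough illustration of the reclassification idea in the abstract, the sketch below stores each existing linear classifier as a linear constraint and forms new classes from the combination of existing class labels, using only the classifier definitions and no training data. This is not the speaker's implementation: the names LinearConstraint and reclassify, the feature names, and the weights are all hypothetical, chosen only to make the example concrete.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class LinearConstraint:
    """A linear classifier kept as the constraint sum(w_i * x_i) + bias >= 0."""
    weights: Dict[str, float]
    bias: float

    def holds(self, point: Dict[str, float]) -> bool:
        # Evaluate the linear form on the features this classifier uses.
        total = sum(w * point[name] for name, w in self.weights.items())
        return total + self.bias >= 0


def reclassify(point: Dict[str, float],
               constraints: Sequence[LinearConstraint]) -> Tuple[bool, ...]:
    """Return the combined class: one truth value per existing classifier.

    Each distinct tuple is a new class, i.e. a conjunction of the existing
    constraints or their negations. Only the stored constraints are needed,
    not the raw data they were originally learned from.
    """
    return tuple(c.holds(point) for c in constraints)


# Hypothetical classifiers from two separate sources (made-up weights).
weight_risk = LinearConstraint({"age": 1.0, "bmi": 2.0}, -100.0)
pressure_risk = LinearConstraint({"systolic": 1.0}, -140.0)

patient = {"age": 50.0, "bmi": 30.0, "systolic": 120.0}
combined = reclassify(patient, [weight_risk, pressure_risk])
print(combined)  # (True, False)
```

With two existing binary classifiers, the combined scheme yields up to four new classes, each describable as a conjunction of linear constraints.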

Bio:

Peter Revesz holds a Ph.D. degree in Computer Science from Brown University. He was a postdoctoral fellow at the University of Toronto before joining the University of Nebraska-Lincoln, where he is a professor in the Department of Computer Science and Engineering. His current research interests are bioinformatics, geoinformatics and databases, in particular constraint, genome, spatial and temporal databases, and data mining. He is the author of the textbook Introduction to Databases: From Biological to Spatio-Temporal (Springer, 2010). He held visiting appointments at the IBM T.J. Watson Research Center, INRIA, the University of Hasselt, the Max Planck Institute for Computer Science, the University of Athens, and the U.S. Department of State. He is a recipient of an Alexander von Humboldt, a J. William Fulbright, and a Jefferson Science Fellowship.