From recognizing biological sequences, to identifying search keywords: A feature generation framework

12:00 noon, Tuesday, September 22, 2009, ENGR 4201

Host

Amarda Shehu

Speaker

Rezarta Islamaj
Research Fellow
National Center for Biotechnology Information (NCBI)
NIH

Abstract

The set of attributes or features selected to model an entity is very important for correct classification. In this talk I will present an integrated process, which I refer to as feature generation. This method allows the user to construct informative features based on domain knowledge, and to search a large space of potential features effectively.

I applied this approach to the problem of splice-site prediction and obtained new predictive models for these biological signals for two different organisms. These models have achieved significant improvements in accuracy over existing, state-of-the-art approaches. In each case, the identified sets of features were used to discover biologically interesting motifs. The models and feature sets are available to the public through an easy-to-use website, SplicePort (http://www.spliceport.org). SplicePort can be used to predict new splice sites from user-input sequences, and to browse the whole collection of features for biologically significant signals.

I also applied this approach to the problem of keyword identification for effective document retrieval. The automatic identification of "clickable" words in the title and abstract of articles is of central importance in improving the retrieval quality of the search engine. It is also important to authors as it increases the chances that their article will get better visibility. PubMed (http://www.ncbi.nlm.nih.gov/PubMed), a free Web service provided by the U.S. National Library of Medicine, provides daily access to over 19 million biomedical citations for millions of users. The current retrieval algorithm in PubMed finds all the articles that match the terms in the user query and presents them in reverse chronological order. I studied PubMed log data for the clickthrough activities of users after they have issued a query. Linking the query terms to the clicked articles, I built a novel machine learning model that identifies "keywords" that are preferred by users to access a particular article.

Short Bio

Dr. Rezarta Islamaj received her Ph.D. degree in Computer Science from the University of Maryland at College Park in 2007. Her research focused on applying machine learning and data mining approaches to computational biology problems. Specifically, she worked on the construction, selection, and discovery of appropriate motifs to model biological signals for accurate classification and prediction.

Currently, she is a Research Fellow at the Computational Biology Branch, National Center for Biotechnology Information (NCBI). NCBI is part of the National Library of Medicine at NIH. Her current research focuses on understanding user search behaviours when using NCBI databases, specifically PubMed, in order to improve retrieval quality and efficiency.