
|
CS/ISE Seminar
Monday, April 2, 2007 Optimal K-mer Superstrings for Protein Identification and DNA Assay DesignNathan Edwards, Ph.D.Center for Bioinformatics and Computational Biology University of Maryland, College Park AbstractThe enumeration and counting of short, fixed-length words (k-mers) from very large sets of sequences is a fundamental algorithmic component of many bioinformatic analyses. We use de Bruijn graphs and a variant of the classical Chinese Postman Problem to construct minimum length k-mer superstrings that are compatible with existing bioinformatics tools. Using these superstrings, we achieve substantial running time improvements for search engines that identify proteins from tandem mass spectrometry data, enabling the discovery of novel protein variants in cancer cell-lines. The superstring construction can be used as a compact representation of k-mer sets and, when combined with de Bruijn graph edge counts, can be used to represent unique or non-unique k-mers. The extension to inexact k-mer counting makes it possible to provide an oracle for inexact k-mer uniqueness, a critical computational bottleneck in DNA assay design. Correctly counting inexact k-mer matches has required the development of new string matching techniques using a form of locality sensitive hashing, called gapped seeds. The lossless gapped seed set design problem has connections to coding theory and statistical designs, and has many natural combinatorial formulations. I'll introduce these formulations, their application to inexact string matching, and algorithms for finding good solutions.Speaker BioDr. Edwards received his Ph.D. in Operations Research from Cornell University in 2000, studying approximation algorithms for the multi-level facility location problem with Dr. David Shmoys. Upon graduation, Dr. Edwards joined Celera Genomics' Informatics Research group, where he worked on the tandem mass spectrometry search engine, SCOPE, used to identify proteins from millions of spectra generated by Celera's high-throughput proteomics laboratory. Moving to Appied Biosystems in 2002, he worked on tools for DNA assay design, the statistical analysis of proteomics profiles, and the construction of protein sequence databases for peptide identification from tandem mass spectra. At the University of Maryland, College Park, since 2004, Dr. Edwards works on the identification of novel peptide biomarkers in cancer by tandem mass spectrometry, the rapid identification of microorganisms by mass spectrometry, and uniqueness oracles for DNA assay design. |