BMC Bioinformatics

The latest research articles published by BMC Bioinformatics

QCScreen: a software tool for data quality control in LC-HRMS based metabolomics

Fri, 10/23/2015 - 19:00
Background: Metabolomics experiments often comprise large numbers of biological samples resulting in huge amounts of data. This data needs to be inspected for plausibility before data evaluation to detect putative sources of error e.g. retention time or mass accuracy shifts. Especially in liquid chromatography-high resolution mass spectrometry (LC-HRMS) based metabolomics research, proper quality control checks (e.g. for precision, signal drifts or offsets) are crucial prerequisites to achieve reliable and comparable results within and across experimental measurement sequences. Software tools can support this process. Results: The software tool QCScreen was developed to offer a quick and easy data quality check of LC-HRMS derived data. It allows a flexible investigation and comparison of basic quality-related parameters within user-defined target features and the possibility to automatically evaluate multiple sample types within or across different measurement sequences in a short time. It offers a user-friendly interface that allows an easy selection of processing steps and parameter settings. The generated results include a coloured overview plot of data quality across all analysed samples and targets and, in addition, detailed illustrations of the stability and precision of the chromatographic separation, the mass accuracy and the detector sensitivity. The use of QCScreen is demonstrated with experimental data from metabolomics experiments using selected standard compounds in pure solvent. The application of the software identified problematic features, samples and analytical parameters and suggested which data files or compounds required closer manual inspection. Conclusions: QCScreen is an open source software tool which provides a useful basis for assessing the suitability of LC-HRMS data prior to time consuming, detailed data processing and subsequent statistical analysis. 
It accepts the generic mzXML format and thus can be used with many different LC-HRMS platforms to process both multiple quality control sample types as well as experimental samples in one or more measurement sequences.
Categories: Journal Articles

Whole genome SNP genotype piecemeal imputation

Thu, 10/22/2015 - 19:00
Background: Despite ongoing reductions in the cost of sequencing technologies, whole genome SNP genotype imputation is often used as an alternative for obtaining abundant SNP genotypes for genome wide association studies. Several existing genotype imputation methods can be efficient for this purpose, while achieving various levels of imputation accuracy. Recent empirical results have shown that two-step imputation may improve accuracy by imputing the low density genotyped study animals to a medium density array first and then to the target density. We are interested in building a series of staircase arrays that lead the low density array to the high density array or even the whole genome, such that genotype imputation along these staircases can achieve the highest accuracy. Results: For genotype imputation from a lower density to a higher density, we first show how to select untyped SNPs to construct a medium density array. Subsequently, we determine for each selected SNP those untyped SNPs to be imputed in the add-one two-step imputation, and lastly how the clusters of imputed genotypes are pieced together as the final imputation result. We design extensive empirical experiments using several hundred sequenced and genotyped animals to demonstrate that our novel two-step piecemeal imputation always achieves an improvement compared to one-step imputation by the state-of-the-art methods Beagle and FImpute. Using the two-step piecemeal imputation, we present some preliminary success on whole genome SNP genotype imputation for genotyped animals via a series of staircase arrays. Conclusions: From a low SNP density to the whole genome, intermediate pseudo-arrays can be computationally constructed by selecting the most informative SNPs for untyped SNP genotype imputation. Such pseudo-array staircases are able to impute more accurately than the classic one-step imputation.
Categories: Journal Articles

PDB-Explorer: a web-based interactive map of the protein data bank in shape space

Thu, 10/22/2015 - 19:00
Background: The RCSB Protein Data Bank (PDB) provides public access to experimentally determined 3D-structures of biological macromolecules (proteins, peptides and nucleic acids). While various tools are available to explore the PDB, options to access the global structural diversity of the entire PDB and to perceive relationships between PDB structures remain very limited. Methods: A 136-dimensional atom pair 3D-fingerprint for proteins (3DP) counting categorized atom pairs at increasing through-space distances was designed to represent the molecular shape of PDB-entries. Nearest neighbor search examples were reported, exemplifying the ability of 3DP-similarity to identify closely related biomolecules from small peptides to enzymes and large multiprotein complexes such as virus particles. Principal component analysis was used to visualize the PDB in 3DP-space. Results: The 3DP property space groups proteins and protein assemblies according to their 3D-shape similarity, yet shows exquisite ability to distinguish between closely related structures. An interactive website called PDB-Explorer is presented featuring a color-coded interactive map of PDB in 3DP-space. Each pixel of the map contains one or more PDB-entries which are directly visualized as ribbon diagrams when the pixel is selected. The PDB-Explorer website allows performing 3DP-nearest neighbor searches of any PDB-entry or of any structure uploaded as protein-type PDB file. All functionalities on the website are implemented in JavaScript in a platform-independent manner and draw data from a server that is updated daily with the latest PDB additions, ensuring complete and up-to-date coverage. The essentially instantaneous 3DP-similarity search with the PDB-Explorer provides results comparable to those of much slower 3D-alignment algorithms, and automatically clusters proteins from the same superfamilies in tight groups. 
Conclusion: A chemical space classification of PDB based on molecular shape was obtained using a new atom-pair 3D-fingerprint for proteins and implemented in a web-based database exploration tool comprising an interactive color-coded map of the PDB chemical space and a nearest neighbor search tool. The PDB-Explorer website is freely available at www.cheminfo.org/pdbexplorer and represents an unprecedented opportunity to interactively visualize and explore the structural diversity of the PDB. Graphical abstract: Maps of PDB in 3DP-space color-coded by heavy atom count and shape.
Categories: Journal Articles

Iterative reconstruction of three-dimensional models of human chromosomes from chromosomal contact data

Thu, 10/22/2015 - 19:00
Background: The entire collection of genetic information resides within the chromosomes, which themselves reside within almost every cell nucleus of eukaryotic organisms. Each individual chromosome is found to have its own preferred three-dimensional (3D) structure independent of the other chromosomes. The structure of each chromosome plays vital roles in controlling certain genome operations, including gene interaction and gene regulation. As a result, knowing the structure of chromosomes assists in the understanding of how the genome functions. Fortunately, the 3D structure of chromosomes proves possible to construct through computational methods via contact data recorded from the chromosome. We developed a unique computational approach based on optimization procedures known as adaptation, simulated annealing, and genetic algorithm to construct 3D models of human chromosomes, using chromosomal contact data. Results: Our models were evaluated using a percentage-based scoring function. Analysis of the scores of the final 3D models demonstrated their effective construction from our computational approach. Specifically, the models resulting from our approach yielded an average score of 80.41 %, with a high of 91 %, across models for all chromosomes of a normal human B-cell. Comparisons made with other methods affirmed the effectiveness of our strategy. Particularly, juxtaposition with models generated through the publicly available method Markov chain Monte Carlo 5C (MCMC5C) illustrated the outperformance of our approach, as seen through a higher average score for all chromosomes. Our methodology was further validated using two consistency checking techniques known as convergence testing and robustness checking, which both proved successful. Conclusions: The pursuit of constructing accurate 3D chromosomal structures is fueled by the benefits revealed by the findings as well as any possible future areas of study that arise. 
This motivation has led to the development of our computational methodology. The implementation of our approach proved effective in constructing 3D chromosome models and proved consistent with, and more effective than, some other methods thereby achieving our goal of creating a tool to help advance certain research efforts. The source code, test data, test results, and documentation of our method, Gen3D, are available at our sourceforge site at: http://sourceforge.net/projects/gen3d/.
Categories: Journal Articles

A large-scale conformation sampling and evaluation server for protein tertiary structure prediction and its assessment in CASP11

Thu, 10/22/2015 - 19:00
Background: With more and more protein sequences produced in the genomic era, predicting protein structures from sequences becomes very important for elucidating the molecular details and functions of these proteins for biomedical research. Traditional template-based protein structure prediction methods tend to focus on identifying the best templates, generating the best alignments, and applying the best energy function to rank models, which often cannot achieve the best performance because of the difficulty of obtaining best templates, alignments, and models. Methods: We developed a large-scale conformation sampling and evaluation method and its servers to improve the reliability and robustness of protein structure prediction. In the first step, our method used a variety of alignment methods to sample relevant and complementary templates and to generate alternative and diverse target-template alignments, used a template and alignment combination protocol to combine alignments, and used template-based and template-free modeling methods to generate a pool of conformations for a target protein. In the second step, it used a large number of protein model quality assessment methods to evaluate and rank the models in the protein model pool, in conjunction with an exception handling strategy to deal with any additional failure in model ranking. Results: The method was implemented as two protein structure prediction servers: MULTICOM-CONSTRUCT and MULTICOM-CLUSTER that participated in the 11th Critical Assessment of Techniques for Protein Structure Prediction (CASP11) in 2014. The two servers were ranked among the best 10 server predictors. Conclusions: The good performance of our servers in CASP11 demonstrates the effectiveness and robustness of the large-scale conformation sampling and evaluation. The MULTICOM server is available at: http://sysbio.rnet.missouri.edu/multicom_cluster/.
Categories: Journal Articles

Phylogenomics and sequence-structure-function relationships in the GmrSD family of Type IV restriction enzymes

Thu, 10/22/2015 - 19:00
Background: GmrSD is a modification-dependent restriction endonuclease that specifically targets and cleaves glucosylated hydroxymethylcytosine (glc-HMC) modified DNA. It is encoded either as two separate single-domain GmrS and GmrD proteins or as a single protein carrying both domains. Previous studies suggested that GmrS acts as an endonuclease and NTPase whereas GmrD binds DNA. Methods: In this work we applied homology detection, sequence conservation analysis, fold recognition and homology modeling methods to study sequence-structure-function relationships in the GmrSD restriction endonuclease family. We also analyzed the phylogeny and genomic context of the family members. Results: Results of our comparative genomics study show that GmrS exhibits similarity to proteins from the ParB/Srx fold which can have both NTPase and nuclease activity. In contrast to the previous studies though, we attribute the nuclease activity also to GmrD as we found it to contain the HNH endonuclease motif. We revealed residues potentially important for structure and function in both domains. Moreover, we found that GmrSD systems exist predominantly as a fused, double-domain form rather than as a heterodimer and that their homologs are often encoded in regions enriched in defense and gene mobility-related elements. Finally, phylogenetic reconstructions of GmrS and GmrD domains revealed that they coevolved and only a few GmrSD systems appear to be assembled from distantly related GmrS and GmrD components. Conclusions: Our study provides insight into sequence-structure-function relationships in the yet poorly characterized family of Type IV restriction enzymes. Comparative genomics allowed us to propose a possible role of the GmrD domain in the function of the GmrSD enzyme and possible active sites of both GmrS and GmrD domains. The presented results can guide further experimental characterization of these enzymes.
Categories: Journal Articles

AlloPred: prediction of allosteric pockets on proteins using normal mode perturbation analysis

Thu, 10/22/2015 - 19:00
Background: Despite being hugely important in biological processes, allostery is poorly understood and no universal mechanism has been discovered. Allosteric drugs are a largely unexplored prospect with many potential advantages over orthosteric drugs. Computational methods to predict allosteric sites on proteins are needed to aid the discovery of allosteric drugs, as well as to advance our fundamental understanding of allostery. Results: AlloPred, a novel method to predict allosteric pockets on proteins, was developed. AlloPred uses perturbation of normal modes alongside pocket descriptors in a machine learning approach that ranks the pockets on a protein. AlloPred ranked an allosteric pocket top for 23 out of 40 known allosteric proteins, showing comparable and complementary performance to two existing methods. In 28 of 40 cases an allosteric pocket was ranked first or second. The AlloPred web server, freely available at http://www.sbg.bio.ic.ac.uk/allopred/home, allows visualisation and analysis of predictions. The source code and dataset information are also available from this site. Conclusions: Perturbation of normal modes can enhance our ability to predict allosteric sites on proteins. Computational methods such as AlloPred assist drug discovery efforts by suggesting sites on proteins for further experimental study.
Categories: Journal Articles

VEGAWES: variational segmentation on whole exome sequencing for copy number detection

Mon, 09/28/2015 - 19:00
Background: Copy number variations are important in the detection and progression of significant tumors and diseases. Whole exome sequencing (WES) has recently gained popularity for copy number variation detection due to its low cost and better efficiency. In this work, we developed VEGAWES for accurate and robust detection of copy number variations on WES data. VEGAWES is an extension of a variational segmentation algorithm, VEGA (Variational estimator for genomic aberrations), which has previously outperformed several algorithms on segmenting array comparative genomic hybridization data. Results: We tested this algorithm on synthetic data and 100 Glioblastoma Multiforme primary tumor samples. The results on the real data were analyzed with segmentation obtained from single-nucleotide polymorphism data as ground truth. We compared our results with two other segmentation algorithms and assessed the performance based on accuracy and time. Conclusions: In terms of both accuracy and time, VEGAWES provided better results on the synthetic data and tumor samples, demonstrating its potential in robust detection of aberrant regions in the genome.
Categories: Journal Articles

Label noise in subtype discrimination of class C G protein-coupled receptors: A systematic approach to the analysis of classification errors

Mon, 09/28/2015 - 19:00
Background: The characterization of proteins in families and subfamilies, at different levels, entails the definition and use of class labels. When the assignment of a protein to a family is uncertain, or even wrong, this becomes an instance of what has come to be known as a label noise problem. Label noise has a potentially negative effect on any quantitative analysis of proteins that depends on label information. This study investigates class C of G protein-coupled receptors, which are cell membrane proteins of relevance both to biology in general and pharmacology in particular. Their supervised classification into different known subtypes, based on primary sequence data, is hampered by label noise. The latter may stem from a combination of expert knowledge limitations and the lack of a clear correspondence between labels that mostly reflect GPCR functionality and the different representations of the protein primary sequences. Results: In this study, we describe a systematic approach, using Support Vector Machine classifiers, to the analysis of G protein-coupled receptor misclassifications. As a proof of concept, this approach is used to assist the discovery of labeling quality problems in a curated, publicly accessible database of this type of protein. We also investigate the extent to which physico-chemical transformations of the protein sequences reflect G protein-coupled receptor subtype labeling. The candidate mislabeled cases detected with this approach are externally validated with phylogenetic trees and against further trusted sources such as the National Center for Biotechnology Information, Universal Protein Resource, European Bioinformatics Institute and Ensembl Genome Browser information repositories. Conclusions: In quantitative classification problems, class labels are often by default assumed to be correct. 
Label noise, though, is bound to be a pervasive problem in bioinformatics, where labels may be obtained indirectly through complex, many-step similarity modelling processes. In the case of G protein-coupled receptors, methods capable of singling out and characterizing those sequences with consistent misclassification behaviour are required to minimize this problem. A systematic, Support Vector Machine-based method has been proposed in this study for such purpose. The proposed method enables a filtering approach to the label noise problem and might become a support tool for database curators in proteomics.
Categories: Journal Articles

methylPipe and compEpiTools: a suite of R packages for the integrative analysis of epigenomics data

Mon, 09/28/2015 - 19:00
Background: Numerous methods are available to profile several epigenetic marks, providing data with different genome coverage and resolution. Large epigenomic datasets are then generated, and often combined with other high-throughput data, including RNA-seq, ChIP-seq for transcription factors (TFs) binding and DNase-seq experiments. Despite the numerous computational tools covering specific steps in the analysis of large-scale epigenomics data, comprehensive software solutions for their integrative analysis are still missing. Multiple tools must be identified and combined to jointly analyze histone marks, TFs binding and other -omics data together with DNA methylation data, complicating the analysis of these data and their integration with publicly available datasets. Results: To overcome the burden of integrating various data types with multiple tools, we developed two companion R/Bioconductor packages. The former, methylPipe, is tailored to the analysis of high- or low-resolution DNA methylomes in several species, accommodating (hydroxy-)methyl-cytosines in both CpG and non-CpG sequence context. The analysis of multiple whole-genome bisulfite sequencing experiments is supported, while maintaining the ability of integrating targeted genomic data. The latter, compEpiTools, seamlessly incorporates the results obtained with methylPipe and supports their integration with other epigenomics data. It provides a number of methods to score these data in regions of interest, leading to the identification of enhancers, lncRNAs, and RNAPII stalling/elongation dynamics. Moreover, it allows a fast and comprehensive annotation of the resulting genomic regions, and the association of the corresponding genes with non-redundant GeneOntology terms. Finally, the package includes a flexible method based on heatmaps for the integration of various data types, combining annotation tracks with continuous or categorical data tracks. 
Conclusions: methylPipe and compEpiTools provide a comprehensive Bioconductor-compliant solution for the integrative analysis of heterogeneous epigenomics data. These packages are instrumental in providing biologists with minimal R skills a complete toolkit facilitating the analysis of their own data, or in accelerating the analyses performed by more experienced bioinformaticians.
Categories: Journal Articles

NetBenchmark: a bioconductor package for reproducible benchmarks of gene regulatory network inference

Mon, 09/28/2015 - 19:00
Background: In the last decade, a great number of methods for reconstructing gene regulatory networks from expression data have been proposed. However, very few tools and datasets allow those methods to be evaluated accurately and reproducibly. Hence, we propose here a new tool, able to perform a systematic, yet fully reproducible, evaluation of transcriptional network inference methods. Results: Our open-source and freely available Bioconductor package aggregates a large set of tools to assess the robustness of network inference algorithms against different simulators, topologies, sample sizes and noise intensities. Conclusions: The benchmarking framework that uses various datasets highlights the specialization of some methods toward network types and data. As a result, it is possible to identify the techniques that have broad overall performance.
Categories: Journal Articles

PROKARYO: an illustrative and interactive computational model of the lactose operon in the bacterium Escherichia coli

Mon, 09/28/2015 - 19:00
Background: We are creating software for agent-based simulation and visualization of bio-molecular processes in bacterial and eukaryotic cells. As a first example, we have built a 3-dimensional, interactive computer model of an Escherichia coli bacterium and its associated biomolecular processes. Our illustrative model focuses on the gene regulatory processes that control the expression of genes involved in the lactose operon. Prokaryo, our agent-based cell simulator, incorporates cellular structures, such as plasma membranes and cytoplasm, as well as elements of the molecular machinery, including RNA polymerase, messenger RNA, lactose permease, and ribosomes. Results: The dynamics of cellular ‘agents’ are defined by their rules of interaction, implemented as finite state machines. The agents are embedded within a 3-dimensional virtual environment with simulated physical and electrochemical properties. The hybrid model is driven by a combination of (1) mathematical equations (DEQs) to capture higher-scale phenomena and (2) agent-based rules to implement localized interactions among a small number of molecular elements. Consequently, our model is able to capture phenomena across multiple spatial scales, from changing concentration gradients to one-on-one molecular interactions. We use the classic gene regulatory mechanism of the lactose operon to demonstrate our model’s resolution, visual presentation, and real-time interactivity. Our agent-based model expands on a sophisticated mathematical E. coli metabolism model, through which we highlight our model’s scientific validity. Conclusion: We believe that through illustration and interactive exploratory learning a model system like Prokaryo can enhance the general understanding and perception of biomolecular processes. Our agent-DEQ hybrid modeling approach can also be of value to conceptualize, illustrate, and, eventually, validate cell experiments in the wet lab.
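The idea of defining an agent's behaviour as a finite state machine can be sketched as follows. This is a minimal illustration only, not Prokaryo's implementation: the `RepressorAgent` class, its three states, the event names and the transition table are all hypothetical, loosely modelled on how the lac repressor leaves the operator when it binds an inducer.

```python
from enum import Enum, auto


class RepressorState(Enum):
    BOUND_TO_OPERATOR = auto()  # sitting on the operator, blocking transcription
    FREE = auto()               # diffusing, able to rebind the operator
    INACTIVATED = auto()        # bound to inducer, unable to bind DNA


# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    (RepressorState.BOUND_TO_OPERATOR, "inducer_contact"): RepressorState.INACTIVATED,
    (RepressorState.FREE, "inducer_contact"): RepressorState.INACTIVATED,
    (RepressorState.FREE, "operator_contact"): RepressorState.BOUND_TO_OPERATOR,
    (RepressorState.INACTIVATED, "inducer_release"): RepressorState.FREE,
}


class RepressorAgent:
    """Toy agent whose behaviour is fully described by TRANSITIONS."""

    def __init__(self):
        self.state = RepressorState.BOUND_TO_OPERATOR

    def handle(self, event):
        # Events with no matching rule leave the state unchanged.
        self.state = TRANSITIONS.get((self.state, event), self.state)
        return self.state

    @property
    def transcription_blocked(self):
        return self.state is RepressorState.BOUND_TO_OPERATOR


agent = RepressorAgent()
events = ["inducer_contact", "operator_contact", "inducer_release", "operator_contact"]
history = [agent.handle(event) for event in events]
print([state.name for state in history])
```

In a hybrid model of the kind described, events such as `inducer_contact` would presumably be raised by collisions in the simulated 3D environment, while bulk concentrations would be updated by the differential equations.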
Categories: Journal Articles

CATCHing putative causative variants in consanguineous families

Mon, 09/28/2015 - 07:00
Background: Consanguinity is an important risk factor for autosomal recessive (AR) disorders. Extended genomic regions identical by descent (IBD) in the offspring of consanguineous parents give rise to recessive disorders with identical (homozygous) pathogenic variants in both alleles. However, many clinical phenotypes presenting in the offspring of consanguineous couples are still of unknown etiology. Advances in High Throughput Sequencing now provide an excellent opportunity to achieve a molecular diagnosis or to identify novel candidate genes. Results: To exploit all available information from the family structure we developed CATCH, an algorithm that combines genotyped SNPs of all family members for the optimal detection of Runs Of Homozygosity (ROH) and exome sequencing data from one affected individual to identify putative causative variants in consanguineous families. Conclusions: CATCH proved to be effective in discovering known or putative new causative variants in 43 out of 50 consanguineous families. Among them, novel variants causative of familial thrombocytopenia, sclerosis bone dysplasia and the first homozygous loss-of-function mutation in FGFR3 in humans causing severe skeletal deformities, tall stature and hearing impairment were identified.
Categories: Journal Articles

Systematic noise degrades gene co-expression signals but can be corrected

Thu, 09/24/2015 - 07:00
Background: In the past decade, the identification of gene co-expression has become a routine part of the analysis of high-dimensional microarray data. Gene co-expression, which is mostly detected via the Pearson correlation coefficient, has played an important role in the discovery of molecular pathways and networks. Unfortunately, the presence of systematic noise in high-dimensional microarray datasets corrupts estimates of gene co-expression. Removing systematic noise from microarray data is therefore crucial. Many cleaning approaches for microarray data exist; however, these methods are aimed at improving differential expression analysis and their performance has been primarily tested for this application. To our knowledge, the performance of these approaches has never been systematically compared in the context of gene co-expression estimation. Results: Using simulations we demonstrate that standard cleaning procedures, such as background correction and quantile normalization, fail to adequately remove systematic noise that affects gene co-expression and at times further degrade true gene co-expression. Instead we show that a global version of removal of unwanted variation (RUV), a data-driven approach, not only removes systematic noise but also allows the estimation of the true underlying gene-gene correlations. We compare the performance of all noise removal methods when applied to five large published datasets on gene expression in the human brain. RUV not only retrieves the highest gene co-expression values for sets of genes known to interact, but also provides the greatest consistency across all five datasets. We apply the method to prioritize epileptic encephalopathy candidate genes. Conclusions: Our work raises serious concerns about the quality of many published gene co-expression analyses. RUV provides an efficient and flexible way to remove systematic noise from high-dimensional microarray datasets when the objective is gene co-expression analysis. 
The RUV method, as applicable in the context of gene-gene correlation estimation, is available as the Bioconductor package RUVcorr.
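A small simulation can illustrate why shared systematic noise corrupts Pearson-based co-expression estimates. The sketch below is not RUV and does not use RUVcorr: it assumes the nuisance factor (`batch`, a hypothetical name) is known and simply regresses it out, whereas RUV has to estimate the unwanted factors from the data. The effect size of 1.5 and the sample count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 2000

# Two genes with independent biological signal: their true correlation is ~0.
gene_a = rng.normal(size=n_samples)
gene_b = rng.normal(size=n_samples)

# Hypothetical systematic noise (e.g. a batch effect) shared by both genes.
batch = rng.normal(size=n_samples)
obs_a = gene_a + 1.5 * batch
obs_b = gene_b + 1.5 * batch


def pearson(x, y):
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def residualize(y, z):
    """Remove the least-squares fit of y on [1, z] from y."""
    design = np.column_stack([np.ones_like(z), z])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef


r_clean = pearson(gene_a, gene_b)    # correlation of the noise-free signals
r_noisy = pearson(obs_a, obs_b)      # inflated by the shared nuisance factor
r_corrected = pearson(residualize(obs_a, batch), residualize(obs_b, batch))

print(f"clean={r_clean:.3f} noisy={r_noisy:.3f} corrected={r_corrected:.3f}")
```

Under these assumptions the observed correlation is driven almost entirely by the shared factor, and removing that factor brings the estimate back toward the value computed from the noise-free signals.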
Categories: Journal Articles

2D and 3D similarity landscape analysis identifies PARP as a novel off-target for the drug Vatalanib

Thu, 09/24/2015 - 07:00
Background: Searching for two-dimensional (2D) structural similarities is a useful tool to identify new active compounds in drug-discovery programs. However, as 2D similarity measures neglect important structural and functional features, similarity by 2D might be underestimated. In the present study, we used combined 2D and three-dimensional (3D) similarity comparisons to reveal possible new functions and/or side-effects of known bioactive compounds. Results: We utilised more than 10,000 compounds from the SuperTarget database with known inhibition values for twelve different anti-cancer targets. We performed all-against-all comparisons resulting in 2D similarity landscapes. Among the regions with low 2D similarity scores are inhibitors of vascular endothelial growth factor receptor (VEGFR) and inhibitors of poly ADP-ribose polymerase (PARP). To demonstrate that 3D landscape comparison can identify similarities, which are untraceable in 2D similarity comparisons, we analysed this region in more detail. This 3D analysis showed the unexpected structural similarity between inhibitors of VEGFR and inhibitors of PARP. Among the VEGFR inhibitors that show similarities to PARP inhibitors was Vatalanib, an oral “multi-targeted” small molecule protein kinase inhibitor being studied in phase-III clinical trials in cancer therapy. An in silico docking simulation and an in vitro HT universal colorimetric PARP assay confirmed that the VEGFR inhibitor Vatalanib exhibits off-target activity as a PARP inhibitor, broadening its mode of action. Conclusion: In contrast to the 2D-similarity search, the 3D-similarity landscape comparison identifies new functions and side effects of the known VEGFR inhibitor Vatalanib.
Categories: Journal Articles

htsint: a Python library for sequencing pipelines that combines data through gene set generation

Wed, 09/23/2015 - 19:00
Background: Sequencing technologies provide a wealth of details in terms of genes, expression, splice variants, polymorphisms, and other features. A standard for sequencing analysis pipelines is to put genomic or transcriptomic features into a context of known functional information, but the relationships between ontology terms are often ignored. For RNA-Seq, considering genes and their genetic variants at the group level enables a convenient way to both integrate annotation data and detect small coordinated changes between experimental conditions, a known caveat of gene level analyses. Results: We introduce the high throughput data integration tool, htsint, as an extension to the commonly used gene set enrichment frameworks. The central aim of htsint is to compile annotation information from one or more taxa in order to calculate functional distances among all genes in a specified gene space. Spectral clustering is then used to partition the genes, thereby generating functional modules. The gene space can range from a targeted list of genes, like a specific pathway, all the way to an ensemble of genomes. Given a collection of gene sets and a count matrix of transcriptomic features (e.g. expression, polymorphisms), the gene sets produced by htsint can be tested for ‘enrichment’ or conditional differences using one of a number of commonly available packages. Conclusion: The database and bundled tools to generate functional modules were designed with sequencing pipelines in mind, but the toolkit nature of htsint allows it to also be used in other areas of genomics. The software is freely available as a Python library through GitHub at https://github.com/ajrichards/htsint.
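The step from functional distances to functional modules can be sketched with a bare-bones spectral bipartition. This is an illustrative sketch under stated assumptions, not the htsint API: the distance matrix is invented, the Gaussian kernel width `sigma` is arbitrary, and only a two-way split via the second eigenvector of the unnormalized graph Laplacian is shown.

```python
import numpy as np

# Hypothetical functional distances among six genes (symmetric, zero diagonal).
# Genes 0-2 are functionally close to each other, as are genes 3-5.
dist = np.array([
    [0.0, 0.1, 0.2, 0.9, 0.8, 0.9],
    [0.1, 0.0, 0.1, 0.8, 0.9, 0.9],
    [0.2, 0.1, 0.0, 0.9, 0.9, 0.8],
    [0.9, 0.8, 0.9, 0.0, 0.1, 0.2],
    [0.8, 0.9, 0.9, 0.1, 0.0, 0.1],
    [0.9, 0.9, 0.8, 0.2, 0.1, 0.0],
])

# Turn distances into affinities with a Gaussian kernel (sigma is an assumption).
sigma = 0.3
affinity = np.exp(-dist**2 / (2 * sigma**2))
np.fill_diagonal(affinity, 0.0)

# Unnormalized graph Laplacian L = D - W.
degree = affinity.sum(axis=1)
laplacian = np.diag(degree) - affinity

# eigh returns eigenvalues in ascending order; the eigenvector of the second
# smallest eigenvalue (the Fiedler vector) splits the graph into two modules.
eigvals, eigvecs = np.linalg.eigh(laplacian)
fiedler = eigvecs[:, 1]
labels = (fiedler > 0).astype(int)

print(labels)  # which module is labelled 0 or 1 depends on the eigenvector sign
```

For more than two modules, the usual extension is to run k-means on the first k eigenvectors; the resulting gene sets could then be passed to an enrichment test as the abstract describes.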
Categories: Journal Articles

Subtype prediction in pediatric acute myeloid leukemia: classification using differential network rank conservation revisited

Wed, 09/23/2015 - 07:00
Background: One of the most important applications of transcriptomic data is cancer phenotype classification. Many characteristics of transcriptomic data, such as redundant features and technical artifacts, make over-fitting commonplace. Promising classification results often fail to generalize across datasets with different sources, platforms, or preprocessing. Recently, a novel differential network rank conservation (DIRAC) algorithm was introduced to characterize cancer phenotypes using transcriptomic data. DIRAC is a member of a family of algorithms that have proven useful for disease classification based on the relative expression of genes. Combining the robustness of this family’s simple decision rules with known biological relationships, this systems approach identifies interpretable, yet highly discriminative networks. While DIRAC has been briefly employed for several classification problems in the original paper, the potential of DIRAC in cancer phenotype classification, and especially its robustness against artifacts in transcriptomic data, has not been fully characterized yet. Results: In this study we thoroughly investigate the potential of DIRAC by applying it to multiple datasets, and examine the variations in classification performance when datasets are (i) treated and untreated for batch effect; (ii) preprocessed with different techniques. We also propose the first DIRAC-based classifier to integrate multiple networks. We show that the DIRAC-based classifier is very robust in the examined scenarios. To our surprise, the trained DIRAC-based classifier even translated well to a dataset with different biological characteristics in the presence of substantial batch effects that, as shown here, plagued the standard expression value based classifier. 
In addition, the DIRAC-based classifier, because of the integrated biological information, also suggests pathways to target in specific subtypes, which may enhance the establishment of personalized therapy in diseases such as pediatric AML. In order to better comprehend the prediction power of the DIRAC-based classifier in general, we also performed classifications using publicly available datasets from breast and lung cancer. Furthermore, multiple well-known classification algorithms were utilized to create an ideal test bed for comparing the DIRAC-based classifier with the standard gene expression value based classifier. We observed that the DIRAC-based classifier greatly outperforms its rival. Conclusions: Based on our experiments with multiple datasets, we propose that DIRAC is a promising solution to the lack of generalizability in classification efforts that use transcriptomic data. We believe that the superior performance presented in this study may motivate others to initiate a new line of research to explore the untapped power of DIRAC in a broad range of cancer types.
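Classification based on the relative expression of genes can be sketched as follows. This is a simplified, single-network reading of the rank conservation idea, not the authors' implementation: the function names, the four-gene network and all expression values are hypothetical, and the multi-network integration described above is not shown.

```python
from itertools import combinations

import numpy as np


def pair_orderings(sample):
    """Binary vector: 1 where gene i is expressed below gene j, for all i < j."""
    pairs = combinations(range(len(sample)), 2)
    return np.array([sample[i] < sample[j] for i, j in pairs], dtype=int)


def rank_template(samples):
    """Majority ordering of each gene pair across the samples of one phenotype."""
    orderings = np.array([pair_orderings(s) for s in samples])
    return (orderings.mean(axis=0) > 0.5).astype(int)


def matching_score(sample, template):
    """Fraction of gene pairs in the sample whose ordering agrees with the template."""
    return float((pair_orderings(sample) == template).mean())


# Hypothetical expression values: rows are samples, columns are the 4 genes of one network.
phenotype_a = np.array([[1.0, 2.0, 3.0, 4.0],
                        [1.5, 2.5, 3.5, 5.0],
                        [0.5, 3.0, 2.0, 6.0]])
phenotype_b = np.array([[4.0, 3.0, 2.0, 1.0],
                        [5.0, 3.5, 2.5, 1.0],
                        [6.0, 2.0, 3.0, 0.5]])

template_a = rank_template(phenotype_a)
template_b = rank_template(phenotype_b)


def classify(sample):
    """Assign the phenotype whose template the sample's orderings match better."""
    diff = matching_score(sample, template_a) - matching_score(sample, template_b)
    return "A" if diff > 0 else "B"


# A sample measured on a very different scale keeps the same within-sample ordering.
new_sample = np.array([100.0, 250.0, 300.0, 900.0])
print(classify(new_sample))
```

Because only within-sample orderings are used, any monotone rescaling of a sample leaves its score unchanged, which is one plausible reason for the robustness to batch effects and preprocessing reported above.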
Categories: Journal Articles