The latest research articles published by BMC Bioinformatics
Updated: 3 weeks 2 days ago
Background: Numerous tools have been developed to predict the fitness effects (i.e., neutral, deleterious, or beneficial) of genetic variants on corresponding proteins. However, prediction in terms of whether a variant causes the variant bearing protein to lose the original function or gain new function is also needed for better understanding of how the variant contributes to disease/cancer. To address this problem, the present work introduces and computationally defines four types of functional outcome of a variant: gain, loss, switch, and conservation of function. The deployment of multiple hidden Markov models is proposed to computationally classify mutations by the four functional impact types. Results: The functional outcome is predicted for over a hundred thyroid stimulating hormone receptor (TSHR) mutations, as well as cancer related mutations in oncogenes or tumor suppressor genes. The results show that the proposed computational method is effective in fine grained prediction of the functional outcome of a mutation, and can be used to help elucidate the molecular mechanism of disease/cancer causing mutations. The program is freely available at http://bioinformatics.cs.vt.edu/zhanglab/HMMvar/download.php. Conclusion: This work is the first to computationally define and predict functional impact of mutations, loss, switch, gain, or conservation of function. These fine grained predictions can be especially useful for identifying mutations that cause or are linked to cancer.
Background: Pathway analysis methods, in which differentially expressed genes are mapped to databases of reference pathways and relative enrichment is assessed, help investigators to propose biologically relevant hypotheses. The last generation of pathway analysis methods takes into account the topological structure of a pathway, which helps to increase both specificity and sensitivity of the findings. Simultaneously, the RNA-Seq technology is gaining popularity and becomes widely used for gene expression profiling. Unfortunately, majority of topological pathway analysis methods remains without implementation and if an implementation exists, it is limited in various factors. Results: We developed a new R/Bioconductor package ToPASeq offering uniform interface to seven distinct topology-based pathway analysis methods, of which three we implemented de-novo and four were adjusted from existing implementations. Apart this, ToPASeq offers a set of tailored visualization functions and functions for importing and manipulating pathways and their topologies, facilitating the application of the methods on different species. The package can be used to compare the differential expression of pathways between two conditions on both gene expression microarray and RNA-Seq data. The package is written in R and is available from Bioconductor 3.2 using AGPL-3 license. Conclusion: ToPASeq is a novel package that offers seven distinct methods for topology-based pathway analysis, which are easily applicable on microarray as well as RNA-Seq data, both in human and other species. At the same time, it provides specific tools for visualization of the results.
Background: ChIP-seq experiments are widely used to detect and study DNA-protein interactions, such as transcription factor binding and chromatin modifications. However, downstream analysis of ChIP-seq data is currently restricted to the evaluation of signal intensity and the detection of enriched regions (peaks) in the genome. Other features of peak shape are almost always neglected, despite the remarkable differences shown by ChIP-seq for different proteins, as well as by distinct regions in a single experiment. Results: We hypothesize that statistically significant differences in peak shape might have a functional role and a biological meaning. Thus, we design five indices able to summarize peak shapes and we employ multivariate clustering techniques to divide peaks into groups according to both their complexity and the intensity of their coverage function. In addition, our novel analysis pipeline employs a range of statistical and bioinformatics techniques to relate the obtained peak shapes to several independent genomic datasets, including other genome-wide protein-DNA maps and gene expression experiments. To clarify the meaning of peak shape, we apply our methodology to the study of the erythroid transcription factor GATA-1 in K562 cell line and in megakaryocytes. Conclusions: Our study demonstrates that ChIP-seq profiles include information regarding the binding of other proteins beside the one used for precipitation. In particular, peak shape provides new insights into cooperative transcriptional regulation and is correlated to gene expression.
Background: Set-level classification of gene expression data has received significant attention recently. In this setting, high-dimensional vectors of features corresponding to genes are converted into lower-dimensional vectors of features corresponding to biologically interpretable gene sets. The dimensionality reduction brings the promise of a decreased risk of overfitting, potentially resulting in improved accuracy of the learned classifiers. However, recent empirical research has not confirmed this expectation. Here we hypothesize that the reported unfavorable classification results in the set-level framework were due to the adoption of unsuitable gene sets defined typically on the basis of the Gene ontology and the KEGG database of metabolic networks. We explore an alternative approach to defining gene sets, based on regulatory interactions, which we expect to collect genes with more correlated expression. We hypothesize that such more correlated gene sets will enable to learn more accurate classifiers. Methods: We define two families of gene sets using information on regulatory interactions, and evaluate them on phenotype-classification tasks using public prokaryotic gene expression data sets. From each of the two gene-set families, we first select the best-performing subtype. The two selected subtypes are then evaluated on independent (testing) data sets against state-of-the-art gene sets and against the conventional gene-level approach. Results: The novel gene sets are indeed more correlated than the conventional ones, and lead to significantly more accurate classifiers. The novel gene sets are indeed more correlated than the conventional ones, and lead to significantly more accurate classifiers. Conclusion: Novel gene sets defined on the basis of regulatory interactions improve set-level classification of gene expression data. The experimental scripts and other material needed to reproduce the experiments are available at http://ida.felk.cvut.cz/novelgenesets.tar.gz.
Comparing the normalization methods for the differential analysis of Illumina high-throughput RNA-Seq data
Background: Recently, rapid improvements in technology and decrease in sequencing costs have made RNA-Seq a widely used technique to quantify gene expression levels. Various normalization approaches have been proposed, owing to the importance of normalization in the analysis of RNA-Seq data. A comparison of recently proposed normalization methods is required to generate suitable guidelines for the selection of the most appropriate approach for future experiments. Results: In this paper, we compared eight non-abundance (RC, UQ, Med, TMM, DESeq, Q, RPKM, and ERPKM) and two abundance estimation normalization methods (RSEM and Sailfish). The experiments were based on real Illumina high-throughput RNA-Seq of 35- and 76-nucleotide sequences produced in the MAQC project and simulation reads. Reads were mapped with human genome obtained from UCSC Genome Browser Database. For precise evaluation, we investigated Spearman correlation between the normalization results from RNA-Seq and MAQC qRT-PCR values for 996 genes. Based on this work, we showed that out of the eight non-abundance estimation normalization methods, RC, UQ, Med, TMM, DESeq, and Q gave similar normalization results for all data sets. For RNA-Seq of a 35-nucleotide sequence, RPKM showed the highest correlation results, but for RNA-Seq of a 76-nucleotide sequence, least correlation was observed than the other methods. ERPKM did not improve results than RPKM. Between two abundance estimation normalization methods, for RNA-Seq of a 35-nucleotide sequence, higher correlation was obtained with Sailfish than that with RSEM, which was better than without using abundance estimation methods. However, for RNA-Seq of a 76-nucleotide sequence, the results achieved by RSEM were similar to without applying abundance estimation methods, and were much better than with Sailfish. Furthermore, we found that adding a poly-A tail increased alignment numbers, but did not improve normalization results. Conclusion: Spearman correlation analysis revealed that RC, UQ, Med, TMM, DESeq, and Q did not noticeably improve gene expression normalization, regardless of read length. Other normalization methods were more efficient when alignment accuracy was low; Sailfish with RPKM gave the best normalization results. When alignment accuracy was high, RC was sufficient for gene expression calculation. And we suggest ignoring poly-A tail during differential gene expression analysis.
Background: Functional annotation of genes and gene products is a major challenge in the post-genomic era. Nowadays, gene function curation is largely based on manual assignment of Gene Ontology (GO) annotations to genes by using published literature. The annotation task is extremely time-consuming, therefore there is an increasing interest in automated tools that can assist human experts. Results: Here we introduce GOTA, a GO term annotator for biomedical literature. The proposed approach makes use only of information that is readily available from public repositories and it is easily expandable to handle novel sources of information. We assess the classification capabilities of GOTA on a large benchmark set of publications. The overall performances are encouraging in comparison to the state of the art in multi-label classification over large taxonomies. Furthermore, the experimental tests provide some interesting insights into the potential improvement of automated annotation tools. Conclusions: GOTA implements a flexible and expandable model for GO annotation of biomedical literature. The current version of the GOTA tool is freely available at http://gota.apice.unibo.it.
Extended notions of sign consistency to relate experimental data to signaling and regulatory network topologies
Background: A rapidly growing amount of knowledge about signaling and gene regulatory networks is available in databases such as KEGG, Reactome, or RegulonDB. There is an increasing need to relate this knowledge to high-throughput data in order to (in)validate network topologies or to decide which interactions are present or inactive in a given cell type under a particular environmental condition. Interaction graphs provide a suitable representation of cellular networks with information flows and methods based on sign consistency approaches have been shown to be valuable tools to (i) predict qualitative responses, (ii) to test the consistency of network topologies and experimental data, and (iii) to apply repair operations to the network model suggesting missing or wrong interactions. Results: We present a framework to unify different notions of sign consistency and propose a refined method for data discretization that considers uncertainties in experimental profiles. We furthermore introduce a new constraint to filter undesired model behaviors induced by positive feedback loops. Finally, we generalize the way predictions can be made by the sign consistency approach. In particular, we distinguish strong predictions (e.g. increase of a node level) and weak predictions (e.g., node level increases or remains unchanged) enlarging the overall predictive power of the approach. We then demonstrate the applicability of our framework by confronting a large-scale gene regulatory network model of Escherichia coli with high-throughput transcriptomic measurements. Conclusion: Overall, our work enhances the flexibility and power of the sign consistency approach for the prediction of the behavior of signaling and gene regulatory networks and, more generally, for the validation and inference of these networks
Background: Orientation and the degree of isotropy are important in many biological systems such as the sarcomeres of cardiomyocytes and other fibrillar structures of the cytoskeleton. Image based analysis of such structures is often limited to qualitative evaluation by human experts, hampering the throughput, repeatability and reliability of the analyses. Software tools are not readily available for this purpose and the existing methods typically rely at least partly on manual operation. Results: We developed CytoSpectre, an automated tool based on spectral analysis, allowing the quantification of orientation and also size distributions of structures in microscopy images. CytoSpectre utilizes the Fourier transform to estimate the power spectrum of an image and based on the spectrum, computes parameter values describing, among others, the mean orientation, isotropy and size of target structures. The analysis can be further tuned to focus on targets of particular size at cellular or subcellular scales. The software can be operated via a graphical user interface without any programming expertise. We analyzed the performance of CytoSpectre by extensive simulations using artificial images, by benchmarking against FibrilTool and by comparisons with manual measurements performed for real images by a panel of human experts. The software was found to be tolerant against noise and blurring and superior to FibrilTool when analyzing realistic targets with degraded image quality. The analysis of real images indicated general good agreement between computational and manual results while also revealing notable expert-to-expert variation. Moreover, the experiment showed that CytoSpectre can handle images obtained of different cell types using different microscopy techniques. Finally, we studied the effect of mechanical stretching on cardiomyocytes to demonstrate the software in an actual experiment and observed changes in cellular orientation in response to stretching. Conclusions: CytoSpectre, a versatile, easy-to-use software tool for spectral analysis of microscopy images was developed. The tool is compatible with most 2D images and can be used to analyze targets at different scales. We expect the tool to be useful in diverse applications dealing with structures whose orientation and size distributions are of interest. While designed for the biological field, the software could also be useful in non-biological applications.
DiSNPindel: improved intra-individual SNP and InDel detection in direct amplicon sequencing of a diploid
Background: Amplicon re-sequencing based on the automated Sanger method remains popular for detection of single nucleotide polymorphisms (SNPs) and insertion-deletion polymorphisms (InDels) for a spectrum of genetics applications. However, existing software tools for detecting intra-individual SNPs and InDels in direct amplicon sequencing of diploid samples are insufficient in analyzing single traces and their accuracy is still limited. Results: We developed a novel computation tool, named DiSNPindel, to improve the detection of intra-individual SNPs and InDels in direct amplicon sequencing of a diploid. Neither reference sequence nor additional sample was required. Using two real datasets, we demonstrated the usefulness of DiSNPindel in its ability to improve largely the true SNP and InDel discovery rates and reduce largely the missed and false positive rates as compared with existing detection methods. Conclusions: The software DiSNPindel presented here provides an efficient tool for intra-individual SNP and InDel detection in diploid amplicon sequencing. It will also be useful for identification of DNA variations in expressed sequence tag (EST) re-sequencing.
Background: Reconstruction of neuron anatomy structure is a challenging and important task in neuroscience. However, few algorithms can automatically reconstruct the full structure well without manual assistance, making it essential to develop new methods for this task. Methods: This paper introduces a new pipeline for reconstructing neuron anatomy structure from 3-D microscopy image stacks. This pipeline is initialized with a set of seeds that were detected by our proposed Sliding Volume Filter (SVF), given a non-circular cross-section of a neuron cell. Then, an improved open curve snake model combined with a SVF external force is applied to trace the full skeleton of the neuron cell. A radius estimation method based on a 2D sliding band filter is developed to fit the real edge of the cross-section of the neuron cell. Finally, a surface reconstruction method based on non-parallel curve networks is used to generate the neuron cell surface to finish this pipeline. Results: The proposed pipeline has been evaluated using publicly available datasets. The results show that the proposed method achieves promising results in some datasets from the DIgital reconstruction of Axonal and DEndritic Morphology (DIADEM) challenge and new BigNeuron project. Conclusion: The new pipeline works well in neuron tracing and reconstruction. It can achieve higher efficiency, stability and robustness in neuron skeleton tracing. Furthermore, the proposed radius estimation method and applied surface reconstruction method can obtain more accurate neuron anatomy structures.
Background: Metabolomics experiments often comprise large numbers of biological samples resulting in huge amounts of data. This data needs to be inspected for plausibility before data evaluation to detect putative sources of error e.g. retention time or mass accuracy shifts. Especially in liquid chromatography-high resolution mass spectrometry (LC-HRMS) based metabolomics research, proper quality control checks (e.g. for precision, signal drifts or offsets) are crucial prerequisites to achieve reliable and comparable results within and across experimental measurement sequences. Software tools can support this process. Results: The software tool QCScreen was developed to offer a quick and easy data quality check of LC-HRMS derived data. It allows a flexible investigation and comparison of basic quality-related parameters within user-defined target features and the possibility to automatically evaluate multiple sample types within or across different measurement sequences in a short time. It offers a user-friendly interface that allows an easy selection of processing steps and parameter settings. The generated results include a coloured overview plot of data quality across all analysed samples and targets and, in addition, detailed illustrations of the stability and precision of the chromatographic separation, the mass accuracy and the detector sensitivity. The use of QCScreen is demonstrated with experimental data from metabolomics experiments using selected standard compounds in pure solvent. The application of the software identified problematic features, samples and analytical parameters and suggested which data files or compounds required closer manual inspection. Conclusions: QCScreen is an open source software tool which provides a useful basis for assessing the suitability of LC-HRMS data prior to time consuming, detailed data processing and subsequent statistical analysis. It accepts the generic mzXML format and thus can be used with many different LC-HRMS platforms to process both multiple quality control sample types as well as experimental samples in one or more measurement sequences.
Background: Despite ongoing reductions in the cost of sequencing technologies, whole genome SNP genotype imputation is often used as an alternative for obtaining abundant SNP genotypes for genome wide association studies. Several existing genotype imputation methods can be efficient for this purpose, while achieving various levels of imputation accuracy. Recent empirical results have shown that the two-step imputation may improve accuracy by imputing the low density genotyped study animals to a medium density array first and then to the target density. We are interested in building a series of staircase arrays that lead the low density array to the high density array or even the whole genome, such that genotype imputation along these staircases can achieve the highest accuracy. Results: For genotype imputation from a lower density to a higher density, we first show how to select untyped SNPs to construct a medium density array. Subsequently, we determine for each selected SNP those untyped SNPs to be imputed in the add-one two-step imputation, and lastly how the clusters of imputed genotype are pieced together as the final imputation result. We design extensive empirical experiments using several hundred sequenced and genotyped animals to demonstrate that our novel two-step piecemeal imputation always achieves an improvement compared to the one-step imputation by the state-of-the-art methods Beagle and FImpute. Using the two-step piecemeal imputation, we present some preliminary success on whole genome SNP genotype imputation for genotyped animals via a series of staircase arrays. Conclusions: From a low SNP density to the whole genome, intermediate pseudo-arrays can be computationally constructed by selecting the most informative SNPs for untyped SNP genotype imputation. Such pseudo-array staircases are able to impute more accurately than the classic one-step imputation.
Iterative reconstruction of three-dimensional models of human chromosomes from chromosomal contact data
Background: The entire collection of genetic information resides within the chromosomes, which themselves reside within almost every cell nucleus of eukaryotic organisms. Each individual chromosome is found to have its own preferred three-dimensional (3D) structure independent of the other chromosomes. The structure of each chromosome plays vital roles in controlling certain genome operations, including gene interaction and gene regulation. As a result, knowing the structure of chromosomes assists in the understanding of how the genome functions. Fortunately, the 3D structure of chromosomes proves possible to construct through computational methods via contact data recorded from the chromosome. We developed a unique computational approach based on optimization procedures known as adaptation, simulated annealing, and genetic algorithm to construct 3D models of human chromosomes, using chromosomal contact data. Results: Our models were evaluated using a percentage-based scoring function. Analysis of the scores of the final 3D models demonstrated their effective construction from our computational approach. Specifically, the models resulting from our approach yielded an average score of 80.41 %, with a high of 91 %, across models for all chromosomes of a normal human B-cell. Comparisons made with other methods affirmed the effectiveness of our strategy. Particularly, juxtaposition with models generated through the publicly available method Markov chain Monte Carlo 5C (MCMC5C) illustrated the outperformance of our approach, as seen through a higher average score for all chromosomes. Our methodology was further validated using two consistency checking techniques known as convergence testing and robustness checking, which both proved successful. Conclusions: The pursuit of constructing accurate 3D chromosomal structures is fueled by the benefits revealed by the findings as well as any possible future areas of study that arise. This motivation has led to the development of our computational methodology. The implementation of our approach proved effective in constructing 3D chromosome models and proved consistent with, and more effective than, some other methods thereby achieving our goal of creating a tool to help advance certain research efforts. The source code, test data, test results, and documentation of our method, Gen3D, are available at our sourceforge site at: http://sourceforge.net/projects/gen3d/.
A large-scale conformation sampling and evaluation server for protein tertiary structure prediction and its assessment in CASP11
Background: With more and more protein sequences produced in the genomic era, predicting protein structures from sequences becomes very important for elucidating the molecular details and functions of these proteins for biomedical research. Traditional template-based protein structure prediction methods tend to focus on identifying the best templates, generating the best alignments, and applying the best energy function to rank models, which often cannot achieve the best performance because of the difficulty of obtaining best templates, alignments, and models. Methods: We developed a large-scale conformation sampling and evaluation method and its servers to improve the reliability and robustness of protein structure prediction. In the first step, our method used a variety of alignment methods to sample relevant and complementary templates and to generate alternative and diverse target-template alignments, used a template and alignment combination protocol to combine alignments, and used template-based and template-free modeling methods to generate a pool of conformations for a target protein. In the second step, it used a large number of protein model quality assessment methods to evaluate and rank the models in the protein model pool, in conjunction with an exception handling strategy to deal with any additional failure in model ranking. Results: The method was implemented as two protein structure prediction servers: MULTICOM-CONSTRUCT and MULTICOM-CLUSTER that participated in the 11th Critical Assessment of Techniques for Protein Structure Prediction (CASP11) in 2014. The two servers were ranked among the best 10 server predictors. Conclusions: The good performance of our servers in CASP11 demonstrates the effectiveness and robustness of the large-scale conformation sampling and evaluation. The MULTICOM server is available at: http://sysbio.rnet.missouri.edu/multicom_cluster/.
Phylogenomics and sequence-structure-function relationships in the GmrSD family of Type IV restriction enzymes
Background: GmrSD is a modification-dependent restriction endonuclease that specifically targets and cleaves glucosylated hydroxymethylcytosine (glc-HMC) modified DNA. It is encoded either as two separate single-domain GmrS and GmrD proteins or as a single protein carrying both domains. Previous studies suggested that GmrS acts as endonuclease and NTPase whereas GmrD binds DNA. Methods: In this work we applied homology detection, sequence conservation analysis, fold recognition and homology modeling methods to study sequence-structure-function relationships in the GmrSD restriction endonucleases family. We also analyzed the phylogeny and genomic context of the family members. Results: Results of our comparative genomics study show that GmrS exhibits similarity to proteins from the ParB/Srx fold which can have both NTPase and nuclease activity. In contrast to the previous studies though, we attribute the nuclease activity also to GmrD as we found it to contain the HNH endonuclease motif. We revealed residues potentially important for structure and function in both domains. Moreover, we found that GmrSD systems exist predominantly as a fused, double-domain form rather than as a heterodimer and that their homologs are often encoded in regions enriched in defense and gene mobility-related elements. Finally, phylogenetic reconstructions of GmrS and GmrD domains revealed that they coevolved and only few GmrSD systems appear to be assembled from distantly related GmrS and GmrD components. Conclusions: Our study provides insight into sequence-structure-function relationships in the yet poorly characterized family of Type IV restriction enzymes. Comparative genomics allowed to propose possible role of GmrD domain in the function of the GmrSD enzyme and possible active sites of both GmrS and GmrD domains. Presented results can guide further experimental characterization of these enzymes.
Background: Despite being hugely important in biological processes, allostery is poorly understood and no universal mechanism has been discovered. Allosteric drugs are a largely unexplored prospect with many potential advantages over orthosteric drugs. Computational methods to predict allosteric sites on proteins are needed to aid the discovery of allosteric drugs, as well as to advance our fundamental understanding of allostery. Results: AlloPred, a novel method to predict allosteric pockets on proteins, was developed. AlloPred uses perturbation of normal modes alongside pocket descriptors in a machine learning approach that ranks the pockets on a protein. AlloPred ranked an allosteric pocket top for 23 out of 40 known allosteric proteins, showing comparable and complementary performance to two existing methods. In 28 of 40 cases an allosteric pocket was ranked first or second. The AlloPred web server, freely available at http://www.sbg.bio.ic.ac.uk/allopred/home, allows visualisation and analysis of predictions. The source code and dataset information are also available from this site. Conclusions: Perturbation of normal modes can enhance our ability to predict allosteric sites on proteins. Computational methods such as AlloPred assist drug discovery efforts by suggesting sites on proteins for further experimental study.
Background: Copy number variations are important in the detection and progression of significant tumors and diseases. Recently, Whole Exome Sequencing is gaining popularity with copy number variations detection due to low cost and better efficiency. In this work, we developed VEGAWES for accurate and robust detection of copy number variations on WES data. VEGAWES is an extension to a variational based segmentation algorithm, VEGA: Variational estimator for genomic aberrations, which has previously outperformed several algorithms on segmenting array comparative genomic hybridization data. Results: We tested this algorithm on synthetic data and 100 Glioblastoma Multiforme primary tumor samples. The results on the real data were analyzed with segmentation obtained from Single-nucleotide polymorphism data as ground truth. We compared our results with two other segmentation algorithms and assessed the performance based on accuracy and time. Conclusions: In terms of both accuracy and time, VEGAWES provided better results on the synthetic data and tumor samples demonstrating its potential in robust detection of aberrant regions in the genome.
Label noise in subtype discrimination of class C G protein-coupled receptors: A systematic approach to the analysis of classification errors
Background: The characterization of proteins in families and subfamilies, at different levels, entails the definition and use of class labels. When the adscription of a protein to a family is uncertain, or even wrong, this becomes an instance of what has come to be known as a label noise problem. Label noise has a potentially negative effect on any quantitative analysis of proteins that depends on label information. This study investigates class C of G protein-coupled receptors, which are cell membrane proteins of relevance both to biology in general and pharmacology in particular. Their supervised classification into different known subtypes, based on primary sequence data, is hampered by label noise. The latter may stem from a combination of expert knowledge limitations and the lack of a clear correspondence between labels that mostly reflect GPCR functionality and the different representations of the protein primary sequences. Results: In this study, we describe a systematic approach, using Support Vector Machine classifiers, to the analysis of G protein-coupled receptor misclassifications. As a proof of concept, this approach is used to assist the discovery of labeling quality problems in a curated, publicly accessible database of this type of proteins. We also investigate the extent to which physico-chemical transformations of the protein sequences reflect G protein-coupled receptor subtype labeling. The candidate mislabeled cases detected with this approach are externally validated with phylogenetic trees and against further trusted sources such as the National Center for Biotechnology Information, Universal Protein Resource, European Bioinformatics Institute and Ensembl Genome Browser information repositories. Conclusions: In quantitative classification problems, class labels are often by default assumed to be correct. Label noise, though, is bound to be a pervasive problem in bioinformatics, where labels may be obtained indirectly through complex, many-step similarity modelling processes. In the case of G protein-coupled receptors, methods capable of singling out and characterizing those sequences with consistent misclassification behaviour are required to minimize this problem. A systematic, Support Vector Machine-based method has been proposed in this study for such purpose. The proposed method enables a filtering approach to the label noise problem and might become a support tool for database curators in proteomics.
Background: Numerous methods are available to profile several epigenetic marks, providing data with different genome coverage and resolution. Large epigenomic datasets are then generated, and often combined with other high-throughput data, including RNA-seq, ChIP-seq for transcription factors (TFs) binding and DNase-seq experiments. Despite the numerous computational tools covering specific steps in the analysis of large-scale epigenomics data, comprehensive software solutions for their integrative analysis are still missing. Multiple tools must be identified and combined to jointly analyze histone marks, TFs binding and other -omics data together with DNA methylation data, complicating the analysis of these data and their integration with publicly available datasets. Results: To overcome the burden of integrating various data types with multiple tools, we developed two companion R/Bioconductor packages. The former, methylPipe, is tailored to the analysis of high- or low-resolution DNA methylomes in several species, accommodating (hydroxy-)methyl-cytosines in both CpG and non-CpG sequence context. The analysis of multiple whole-genome bisulfite sequencing experiments is supported, while maintaining the ability of integrating targeted genomic data. The latter, compEpiTools, seamlessly incorporates the results obtained with methylPipe and supports their integration with other epigenomics data. It provides a number of methods to score these data in regions of interest, leading to the identification of enhancers, lncRNAs, and RNAPII stalling/elongation dynamics. Moreover, it allows a fast and comprehensive annotation of the resulting genomic regions, and the association of the corresponding genes with non-redundant GeneOntology terms. Finally, the package includes a flexible method based on heatmaps for the integration of various data types, combining annotation tracks with continuous or categorical data tracks. Conclusions: methylPipe and compEpiTools provide a comprehensive Bioconductor-compliant solution for the integrative analysis of heterogeneous epigenomics data. These packages are instrumental in providing biologists with minimal R skills a complete toolkit facilitating the analysis of their own data, or in accelerating the analyses performed by more experienced bioinformaticians.