Bioinformatics Journal

Syndicate content
Bioinformatics - RSS feed of current issue
Updated: 8 years 17 weeks ago

oposSOM: R-package for high-dimensional portraying of genome-wide expression landscapes on bioconductor

Mon, 09/21/2015 - 07:37

Motivation: Comprehensive analysis of genome-wide molecular data challenges bioinformatics methodology in terms of intuitive visualization with single-sample resolution, biomarker selection, functional information mining and highly granular stratification of sample classes. oposSOM combines those functionalities making use of a comprehensive analysis and visualization strategy based on self-organizing maps (SOM) machine learning which we call ‘high-dimensional data portraying’. The method was successfully applied in a series of studies using mostly transcriptome data but also data of other OMICs realms.

Availability and implementation: oposSOM is now publicly available as Bioconductor R package.

Contact: wirth@izbi.uni-leipzig.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

A web application for the unspecific detection of differentially expressed DNA regions in strand-specific expression data

Mon, 09/21/2015 - 07:37

Genomic technologies allow laboratories to produce large-scale data sets, either through the use of next-generation sequencing or microarray platforms. To explore these data sets and obtain maximum value from the data, researchers view their results alongside all the known features of a given reference genome. To study transcriptional changes that occur under a given condition, researchers search for regions of the genome that are differentially expressed between different experimental conditions. In order to identify these regions several algorithms have been developed over the years, along with some bioinformatic platforms that enable their use. However, currently available applications for comparative microarray analysis exclusively focus on changes in gene expression within known transcribed regions of predicted protein-coding genes, the changes that occur in non-predictable genetic elements, such as non-coding RNAs. Here, we present a web application for the visualization of strand-specific tiling microarray or next-generation sequencing data that allows customized detection of differentially expressed regions all along the genome in an unspecific manner, that allows identification of all RNA sequences, predictable or not.

Availability and implementation: The web application is freely accessible at http://tilingscan.uv.es/. TilingScan is implemented in PHP and JavaScript.

Contact: vicente.arnau@uv.es

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

lpNet: a linear programming approach to reconstruct signal transduction networks

Mon, 09/21/2015 - 07:37

Summary: With the widespread availability of high-throughput experimental technologies it has become possible to study hundreds to thousands of cellular factors simultaneously, such as coding- or non-coding mRNA or protein concentrations. Still, extracting information about the underlying regulatory or signaling interactions from these data remains a difficult challenge. We present a flexible approach towards network inference based on linear programming. Our method reconstructs the interactions of factors from a combination of perturbation/non-perturbation and steady-state/time-series data. We show both on simulated and real data that our methods are able to reconstruct the underlying networks fast and efficiently, thus shedding new light on biological processes and, in particular, into disease’s mechanisms of action. We have implemented the approach as an R package available through bioconductor.

Availability and implementation: This R package is freely available under the Gnu Public License (GPL-3) from bioconductor.org (http://bioconductor.org/packages/release/bioc/html/lpNet.html) and is compatible with most operating systems (Windows, Linux, Mac OS) and hardware architectures.

Contact: bettina.knapp@helmholtz-muenchen.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

TiQuant: software for tissue analysis, quantification and surface reconstruction

Mon, 09/21/2015 - 07:37

Motivation: TiQuant is a modular software tool for efficient quantification of biological tissues based on volume data obtained by biomedical image modalities. It includes a number of versatile image and volume processing chains tailored to the analysis of different tissue types which have been experimentally verified. TiQuant implements a novel method for the reconstruction of three-dimensional surfaces of biological systems, data that often cannot be obtained experimentally but which is of utmost importance for tissue modelling in systems biology.

Availability and implementation: TiQuant is freely available for non-commercial use at msysbio.com/tiquant. Windows, OSX and Linux are supported.

Contact: hoehme@uni-leipzig.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

RPdb: a database of experimentally verified cellular reprogramming records

Mon, 09/21/2015 - 07:37

Summary: Many cell lines can be reprogrammed to other cell lines by forced expression of a few transcription factors or by specifically designed culture methods, which have attracted a great interest in the field of regenerative medicine and stem cell research. Plenty of cell lines have been used to generate induced pluripotent stem cells (IPSCs) by expressing a group of genes and microRNAs. These IPSCs can differentiate into somatic cells to promote tissue regeneration. Similarly, many somatic cells can be directly reprogrammed to other cells without a stem cell state. All these findings are helpful in searching for new reprogramming methods and understanding the biological mechanism inside. However, to the best of our knowledge, there is still no database dedicated to integrating the reprogramming records. We built RPdb (cellular reprogramming database) to collect cellular reprogramming information and make it easy to access. All entries in RPdb are manually extracted from more than 2000 published articles, which is helpful for researchers in regenerative medicine and cell biology.

Availability and Implementation: RPdb is freely available on the web at http://bioinformatics.ustc.edu.cn/rpdb with all major browsers supported.

Contact: aoli@ustc.edu.cn

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

Using combined evidence from replicates to evaluate ChIP-seq peaks

Mon, 08/24/2015 - 09:21

Motivation: Chromatin Immunoprecipitation followed by sequencing (ChIP-seq) detects genome-wide DNA–protein interactions and chromatin modifications, returning enriched regions (ERs), usually associated with a significance score. Moderately significant interactions can correspond to true, weak interactions, or to false positives; replicates of a ChIP-seq experiment can provide co-localised evidence to decide between the two cases. We designed a general methodological framework to rigorously combine the evidence of ERs in ChIP-seq replicates, with the option to set a significance threshold on the repeated evidence and a minimum number of samples bearing this evidence.

Results: We applied our method to Myc transcription factor ChIP-seq datasets in K562 cells available in the ENCODE project. Using replicates, we could extend up to 3 times the ER number with respect to single-sample analysis with equivalent significance threshold. We validated the ‘rescued’ ERs by checking for the overlap with open chromatin regions and for the enrichment of the motif that Myc binds with strongest affinity; we compared our results with alternative methods (IDR and jMOSAiCS), obtaining more validated peaks than the former and less peaks than latter, but with a better validation.

Availability and implementation: An implementation of the proposed method and its source code under GPLv3 license are freely available at http://www.bioinformatics.deib.polimi.it/MSPC/ and http://mspc.codeplex.com/, respectively.

Contact: marco.morelli@iit.it

Supplementary information: Supplementary Material are available at Bioinformatics online.

Categories: Journal Articles

Data-dependent bucketing improves reference-free compression of sequencing reads

Mon, 08/24/2015 - 09:21

Motivation: The storage and transmission of high-throughput sequencing data consumes significant resources. As our capacity to produce such data continues to increase, this burden will only grow. One approach to reduce storage and transmission requirements is to compress this sequencing data.

Results: We present a novel technique to boost the compression of sequencing that is based on the concept of bucketing similar reads so that they appear nearby in the file. We demonstrate that, by adopting a data-dependent bucketing scheme and employing a number of encoding ideas, we can achieve substantially better compression ratios than existing de novo sequence compression tools, including other bucketing and reordering schemes. Our method, Mince, achieves up to a 45% reduction in file sizes (28% on average) compared with existing state-of-the-art de novo compression schemes.

Availability and implementation: Mince is written in C++11, is open source and has been made available under the GPLv3 license. It is available at http://www.cs.cmu.edu/~ckingsf/software/mince.

Contact: carlk@cs.cmu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

Polyester: simulating RNA-seq datasets with differential transcript expression

Mon, 08/24/2015 - 09:21

Motivation: Statistical methods development for differential expression analysis of RNA sequencing (RNA-seq) requires software tools to assess accuracy and error rate control. Since true differential expression status is often unknown in experimental datasets, artificially constructed datasets must be utilized, either by generating costly spike-in experiments or by simulating RNA-seq data.

Results: Polyester is an R package designed to simulate RNA-seq data, beginning with an experimental design and ending with collections of RNA-seq reads. Its main advantage is the ability to simulate reads indicating isoform-level differential expression across biological replicates for a variety of experimental designs. Data generated by Polyester is a reasonable approximation to real RNA-seq data and standard differential expression workflows can recover differential expression set in the simulation by the user.

Availability and implementation: Polyester is freely available from Bioconductor (http://bioconductor.org/).

Contact: jtleek@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

RVD2: an ultra-sensitive variant detection model for low-depth heterogeneous next-generation sequencing data

Mon, 08/24/2015 - 09:21

Motivation: Next-generation sequencing technology is increasingly being used for clinical diagnostic tests. Clinical samples are often genomically heterogeneous due to low sample purity or the presence of genetic subpopulations. Therefore, a variant calling algorithm for calling low-frequency polymorphisms in heterogeneous samples is needed.

Results: We present a novel variant calling algorithm that uses a hierarchical Bayesian model to estimate allele frequency and call variants in heterogeneous samples. We show that our algorithm improves upon current classifiers and has higher sensitivity and specificity over a wide range of median read depth and minor allele fraction. We apply our model and identify 15 mutated loci in the PAXP1 gene in a matched clinical breast ductal carcinoma tumor sample; two of which are likely loss-of-heterozygosity events.

Availability and implementation: http://genomics.wpi.edu/rvd2/.

Contact: pjflaherty@wpi.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

Phylesystem: a git-based data store for community-curated phylogenetic estimates

Mon, 08/24/2015 - 09:21

Motivation: Phylogenetic estimates from published studies can be archived using general platforms like Dryad (Vision, 2010) or TreeBASE (Sanderson et al., 1994). Such services fulfill a crucial role in ensuring transparency and reproducibility in phylogenetic research. However, digital tree data files often require some editing (e.g. rerooting) to improve the accuracy and reusability of the phylogenetic statements. Furthermore, establishing the mapping between tip labels used in a tree and taxa in a single common taxonomy dramatically improves the ability of other researchers to reuse phylogenetic estimates. As the process of curating a published phylogenetic estimate is not error-free, retaining a full record of the provenance of edits to a tree is crucial for openness, allowing editors to receive credit for their work and making errors introduced during curation easier to correct.

Results: Here, we report the development of software infrastructure to support the open curation of phylogenetic data by the community of biologists. The backend of the system provides an interface for the standard database operations of creating, reading, updating and deleting records by making commits to a git repository. The record of the history of edits to a tree is preserved by git’s version control features. Hosting this data store on GitHub (http://github.com/) provides open access to the data store using tools familiar to many developers. We have deployed a server running the ‘phylesystem-api’, which wraps the interactions with git and GitHub. The Open Tree of Life project has also developed and deployed a JavaScript application that uses the phylesystem-api and other web services to enable input and curation of published phylogenetic statements.

Availability and implementation: Source code for the web service layer is available at https://github.com/OpenTreeOfLife/phylesystem-api. The data store can be cloned from: https://github.com/OpenTreeOfLife/phylesystem. A web application that uses the phylesystem web services is deployed at http://tree.opentreeoflife.org/curator. Code for that tool is available from https://github.com/OpenTreeOfLife/opentree.

Contact: mtholder@gmail.com

Categories: Journal Articles

DockStar: a novel ILP-based integrative method for structural modeling of multimolecular protein complexes

Mon, 08/24/2015 - 09:21

Motivation: Atomic resolution modeling of large multimolecular assemblies is a key task in Structural Cell Biology. Experimental techniques can provide atomic resolution structures of single proteins and small complexes, or low resolution data of large multimolecular complexes.

Results: We present a novel integrative computational modeling method, which integrates both low and high resolution experimental data. The algorithm accepts as input atomic resolution structures of the individual subunits obtained from X-ray, NMR or homology modeling, and interaction data between the subunits obtained from mass spectrometry. The optimal assembly of the individual subunits is formulated as an Integer Linear Programming task. The method was tested on several representative complexes, both in the bound and unbound cases. It placed correctly most of the subunits of multimolecular complexes of up to 16 subunits and significantly outperformed the CombDock and Haddock multimolecular docking methods.

Availability and implementation: http://bioinfo3d.cs.tau.ac.il/DockStar

Contact: naamaamir@mail.tau.ac.il or wolfson@tau.ac.il

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

Automated band annotation for RNA structure probing experiments with numerous capillary electrophoresis profiles

Mon, 08/24/2015 - 09:21

Motivation: Capillary electrophoresis (CE) is a powerful approach for structural analysis of nucleic acids, with recent high-throughput variants enabling three-dimensional RNA modeling and the discovery of new rules for RNA structure design. Among the steps composing CE analysis, the process of finding each band in an electrophoretic trace and mapping it to a position in the nucleic acid sequence has required significant manual inspection and remains the most time-consuming and error-prone step. The few available tools seeking to automate this band annotation have achieved limited accuracy and have not taken advantage of information across dozens of profiles routinely acquired in high-throughput measurements.

Results: We present a dynamic-programming-based approach to automate band annotation for high-throughput capillary electrophoresis. The approach is uniquely able to define and optimize a robust target function that takes into account multiple CE profiles (sequencing ladders, different chemical probes, different mutants) collected for the RNA. Over a large benchmark of multi-profile datasets for biological RNAs and designed RNAs from the EteRNA project, the method outperforms prior tools (QuSHAPE and FAST) significantly in terms of accuracy compared with gold-standard manual annotations. The amount of computation required is reasonable at a few seconds per dataset. We also introduce an ‘E-score’ metric to automatically assess the reliability of the band annotation and show it to be practically useful in flagging uncertainties in band annotation for further inspection.

Availability and implementation: The implementation of the proposed algorithm is included in the HiTRACE software, freely available as an online server and for download at http://hitrace.stanford.edu.

Contact: sryoon@snu.ac.kr or rhiju@stanford.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

INPS: predicting the impact of non-synonymous variations on protein stability from sequence

Mon, 08/24/2015 - 09:21

Motivation: A tool for reliably predicting the impact of variations on protein stability is extremely important for both protein engineering and for understanding the effects of Mendelian and somatic mutations in the genome. Next Generation Sequencing studies are constantly increasing the number of protein sequences. Given the huge disproportion between protein sequences and structures, there is a need for tools suited to annotate the effect of mutations starting from protein sequence without relying on the structure. Here, we describe INPS, a novel approach for annotating the effect of non-synonymous mutations on the protein stability from its sequence. INPS is based on SVM regression and it is trained to predict the thermodynamic free energy change upon single-point variations in protein sequences.

Results: We show that INPS performs similarly to the state-of-the-art methods based on protein structure when tested in cross-validation on a non-redundant dataset. INPS performs very well also on a newly generated dataset consisting of a number of variations occurring in the tumor suppressor protein p53. Our results suggest that INPS is a tool suited for computing the effect of non-synonymous polymorphisms on protein stability when the protein structure is not available. We also show that INPS predictions are complementary to those of the state-of-the-art, structure-based method mCSM. When the two methods are combined, the overall prediction on the p53 set scores significantly higher than those of the single methods.

Availability and implementation: The presented method is available as web server at http://inps.biocomp.unibo.it.

Contact: piero.fariselli@unibo.it

Supplementary information: Supplementary Materials are available at Bioinformatics online.

Categories: Journal Articles

Inferring data-specific micro-RNA function through the joint ranking of micro-RNA and pathways from matched micro-RNA and gene expression data

Mon, 08/24/2015 - 09:21

Motivation: In practice, identifying and interpreting the functional impacts of the regulatory relationships between micro-RNA and messenger-RNA is non-trivial. The sheer scale of possible micro-RNA and messenger-RNA interactions can make the interpretation of results difficult.

Results: We propose a supervised framework, pMim, built upon concepts of significance combination, for jointly ranking regulatory micro-RNA and their potential functional impacts with respect to a condition of interest. Here, pMim directly tests if a micro-RNA is differentially expressed and if its predicted targets, which lie in a common biological pathway, have changed in the opposite direction. We leverage the information within existing micro-RNA target and pathway databases to stabilize the estimation and annotation of micro-RNA regulation making our approach suitable for datasets with small sample sizes. In addition to outputting meaningful and interpretable results, we demonstrate in a variety of datasets that the micro-RNA identified by pMim, in comparison to simpler existing approaches, are also more concordant with what is described in the literature.

Availability and implementation: This framework is implemented as an R function, pMim, in the package sydSeq available from http://www.ellispatrick.com/r-packages.

Contact: jean.yang@sydney.edu.au

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

INSPEcT: a computational tool to infer mRNA synthesis, processing and degradation dynamics from RNA- and 4sU-seq time course experiments

Mon, 08/24/2015 - 09:21

Motivation: Cellular mRNA levels originate from the combined action of multiple regulatory processes, which can be recapitulated by the rates of pre-mRNA synthesis, pre-mRNA processing and mRNA degradation. Recent experimental and computational advances set the basis to study these intertwined levels of regulation. Nevertheless, software for the comprehensive quantification of RNA dynamics is still lacking.

Results: INSPEcT is an R package for the integrative analysis of RNA- and 4sU-seq data to study the dynamics of transcriptional regulation. INSPEcT provides gene-level quantification of these rates, and a modeling framework to identify which of these regulatory processes are most likely to explain the observed mRNA and pre-mRNA concentrations. Software performance is tested on a synthetic dataset, instrumental to guide the choice of the modeling parameters and the experimental design.

Availability and implementation: INSPEcT is submitted to Bioconductor and is currently available as Supplementary Additional File S1.

Contact: mattia.pelizzola@iit.it

Supplementary Information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

Addressing false discoveries in network inference

Mon, 08/24/2015 - 09:21

Motivation: Experimentally determined gene regulatory networks can be enriched by computational inference from high-throughput expression profiles. However, the prediction of regulatory interactions is severely impaired by indirect and spurious effects, particularly for eukaryotes. Recently, published methods report improved predictions by exploiting the a priori known targets of a regulator (its local topology) in addition to expression profiles.

Results: We find that methods exploiting known targets show an unexpectedly high rate of false discoveries. This leads to inflated performance estimates and the prediction of an excessive number of new interactions for regulators with many known targets. These issues are hidden from common evaluation and cross-validation setups, which is due to Simpson’s paradox. We suggest a confidence score recalibration method (CoRe) that reduces the false discovery rate and enables a reliable performance estimation.

Conclusions: CoRe considerably improves the results of network inference methods that exploit known targets. Predictions then display the biological process specificity of regulators more correctly and enable the inference of accurate genome-wide regulatory networks in eukaryotes. For yeast, we propose a network with more than 22 000 confident interactions. We point out that machine learning approaches outside of the area of network inference may be affected as well.

Availability and implementation: Results, executable code and networks are available via our website http://www.bio.ifi.lmu.de/forschung/CoRe.

Contact: robert.kueffner@helmholtz-muenchen.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

Genome-scale strain designs based on regulatory minimal cut sets

Mon, 08/24/2015 - 09:21

Motivation: Stoichiometric and constraint-based methods of computational strain design have become an important tool for rational metabolic engineering. One of those relies on the concept of constrained minimal cut sets (cMCSs). However, as most other techniques, cMCSs may consider only reaction (or gene) knockouts to achieve a desired phenotype.

Results: We generalize the cMCSs approach to constrained regulatory MCSs (cRegMCSs), where up/downregulation of reaction rates can be combined along with reaction deletions. We show that flux up/downregulations can virtually be treated as cuts allowing their direct integration into the algorithmic framework of cMCSs. Because of vastly enlarged search spaces in genome-scale networks, we developed strategies to (optionally) preselect suitable candidates for flux regulation and novel algorithmic techniques to further enhance efficiency and speed of cMCSs calculation. We illustrate the cRegMCSs approach by a simple example network and apply it then by identifying strain designs for ethanol production in a genome-scale metabolic model of Escherichia coli. The results clearly show that cRegMCSs combining reaction deletions and flux regulations provide a much larger number of suitable strain designs, many of which are significantly smaller relative to cMCSs involving only knockouts. Furthermore, with cRegMCSs, one may also enable the fine tuning of desired behaviours in a narrower range. The new cRegMCSs approach may thus accelerate the implementation of model-based strain designs for the bio-based production of fuels and chemicals.

Availability and implementation: MATLAB code and the examples can be downloaded at http://www.mpi-magdeburg.mpg.de/projects/cna/etcdownloads.html.

Contact: krishna.mahadevan@utoronto.ca or klamt@mpi-magdeburg.mpg.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

BinDNase: a discriminatory approach for transcription factor binding prediction using DNase I hypersensitivity data

Mon, 08/24/2015 - 09:21

Motivation: Transcription factors (TFs) are a class of DNA-binding proteins that have a central role in regulating gene expression. To reveal mechanisms of transcriptional regulation, a number of computational tools have been proposed for predicting TF-DNA interaction sites. Recent studies have shown that genome-wide sequencing data on open chromatin sites from a DNase I hypersensitivity experiments (DNase-seq) has a great potential to map putative binding sites of all transcription factors in a single experiment. Thus, computational methods for analysing DNase-seq to accurately map TF-DNA interaction sites are highly needed.

Results: Here, we introduce a novel discriminative algorithm, BinDNase, for predicting TF-DNA interaction sites using DNase-seq data. BinDNase implements an efficient method for selecting and extracting informative features from DNase I signal for each TF, either at single nucleotide resolution or for larger regions. The method is applied to 57 transcription factors in cell line K562 and 31 transcription factors in cell line HepG2 using data from the ENCODE project. First, we show that BinDNase compares favourably to other supervised and unsupervised methods developed for TF-DNA interaction prediction using DNase-seq data. We demonstrate the importance to model each TF with a separate prediction model, reflecting TF-specific DNA accessibility around the TF-DNA interaction site. We also show that a highly standardised DNase-seq data (pre)processing is a requisite for accurate TF binding predictions and that sequencing depth has on average only a moderate effect on prediction accuracy. Finally, BinDNase’s binding predictions generalise to other cell types, thus making BinDNase a versatile tool for accurate TF binding prediction.

Availability and implementation: R implementation of the algorithm is available in: http://research.ics.aalto.fi/csb/software/bindnase/.

Contact: juhani.kahara@aalto.fi

Supplementary information: Supplemental data are available at Bioinformatics online.

Categories: Journal Articles

The SwissLipids knowledgebase for lipid biology

Mon, 08/24/2015 - 09:21

Motivation: Lipids are a large and diverse group of biological molecules with roles in membrane formation, energy storage and signaling. Cellular lipidomes may contain tens of thousands of structures, a staggering degree of complexity whose significance is not yet fully understood. High-throughput mass spectrometry-based platforms provide a means to study this complexity, but the interpretation of lipidomic data and its integration with prior knowledge of lipid biology suffers from a lack of appropriate tools to manage the data and extract knowledge from it.

Results: To facilitate the description and exploration of lipidomic data and its integration with prior biological knowledge, we have developed a knowledge resource for lipids and their biology—SwissLipids. SwissLipids provides curated knowledge of lipid structures and metabolism which is used to generate an in silico library of feasible lipid structures. These are arranged in a hierarchical classification that links mass spectrometry analytical outputs to all possible lipid structures, metabolic reactions and enzymes. SwissLipids provides a reference namespace for lipidomic data publication, data exploration and hypothesis generation. The current version of SwissLipids includes over 244 000 known and theoretically possible lipid structures, over 800 proteins, and curated links to published knowledge from over 620 peer-reviewed publications. We are continually updating the SwissLipids hierarchy with new lipid categories and new expert curated knowledge.

Availability: SwissLipids is freely available at http://www.swisslipids.org/.

Contact: alan.bridge@isb-sib.ch

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles