Feed aggregator

Phylesystem: a git-based data store for community-curated phylogenetic estimates

Bioinformatics Journal - Mon, 08/24/2015 - 09:21

Motivation: Phylogenetic estimates from published studies can be archived using general platforms like Dryad (Vision, 2010) or TreeBASE (Sanderson et al., 1994). Such services fulfill a crucial role in ensuring transparency and reproducibility in phylogenetic research. However, digital tree data files often require some editing (e.g. rerooting) to improve the accuracy and reusability of the phylogenetic statements. Furthermore, establishing the mapping between tip labels used in a tree and taxa in a single common taxonomy dramatically improves the ability of other researchers to reuse phylogenetic estimates. As the process of curating a published phylogenetic estimate is not error-free, retaining a full record of the provenance of edits to a tree is crucial for openness, allowing editors to receive credit for their work and making errors introduced during curation easier to correct.

Results: Here, we report the development of software infrastructure to support the open curation of phylogenetic data by the community of biologists. The backend of the system provides an interface for the standard database operations of creating, reading, updating and deleting records by making commits to a git repository. The record of the history of edits to a tree is preserved by git’s version control features. Hosting this data store on GitHub (http://github.com/) provides open access to the data store using tools familiar to many developers. We have deployed a server running the ‘phylesystem-api’, which wraps the interactions with git and GitHub. The Open Tree of Life project has also developed and deployed a JavaScript application that uses the phylesystem-api and other web services to enable input and curation of published phylogenetic statements.

Availability and implementation: Source code for the web service layer is available at https://github.com/OpenTreeOfLife/phylesystem-api. The data store can be cloned from: https://github.com/OpenTreeOfLife/phylesystem. A web application that uses the phylesystem web services is deployed at http://tree.opentreeoflife.org/curator. Code for that tool is available from https://github.com/OpenTreeOfLife/opentree.

Contact: mtholder@gmail.com

Categories: Journal Articles

DockStar: a novel ILP-based integrative method for structural modeling of multimolecular protein complexes

Bioinformatics Journal - Mon, 08/24/2015 - 09:21

Motivation: Atomic resolution modeling of large multimolecular assemblies is a key task in Structural Cell Biology. Experimental techniques can provide atomic resolution structures of single proteins and small complexes, or low resolution data of large multimolecular complexes.

Results: We present a novel integrative computational modeling method, which integrates both low and high resolution experimental data. The algorithm accepts as input atomic resolution structures of the individual subunits obtained from X-ray, NMR or homology modeling, and interaction data between the subunits obtained from mass spectrometry. The optimal assembly of the individual subunits is formulated as an Integer Linear Programming task. The method was tested on several representative complexes, both in the bound and unbound cases. It correctly placed most of the subunits of multimolecular complexes of up to 16 subunits and significantly outperformed the CombDock and Haddock multimolecular docking methods.
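
To make the phrase "formulated as an Integer Linear Programming task" concrete, here is a generic sketch of how choosing one candidate placement per subunit can be written as an ILP. It is not taken from the paper: the candidate sets P_i, the per-placement scores s_{i,p}, the pairwise scores w_{ip,jq} and the auxiliary variables y are all assumed notation for illustration.

```latex
\begin{aligned}
\max_{x,\,y}\quad & \sum_{i}\sum_{p \in P_i} s_{i,p}\, x_{i,p} \;+\; \sum_{i<j}\sum_{p \in P_i}\sum_{q \in P_j} w_{ip,jq}\, y_{ip,jq} \\
\text{s.t.}\quad & \sum_{p \in P_i} x_{i,p} = 1 \quad \forall i, \\
& y_{ip,jq} \le x_{i,p}, \qquad y_{ip,jq} \le x_{j,q}, \qquad y_{ip,jq} \ge x_{i,p} + x_{j,q} - 1, \\
& x_{i,p},\; y_{ip,jq} \in \{0, 1\}.
\end{aligned}
```

Here x_{i,p} = 1 means subunit i takes candidate placement p, and y_{ip,jq} is forced to equal the product x_{i,p} x_{j,q}, which keeps the pairwise term linear. Interaction data of the kind mentioned above could enter through the weights w or as additional linear constraints.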

Availability and implementation: http://bioinfo3d.cs.tau.ac.il/DockStar

Contact: naamaamir@mail.tau.ac.il or wolfson@tau.ac.il

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

Automated band annotation for RNA structure probing experiments with numerous capillary electrophoresis profiles

Bioinformatics Journal - Mon, 08/24/2015 - 09:21

Motivation: Capillary electrophoresis (CE) is a powerful approach for structural analysis of nucleic acids, with recent high-throughput variants enabling three-dimensional RNA modeling and the discovery of new rules for RNA structure design. Among the steps composing CE analysis, the process of finding each band in an electrophoretic trace and mapping it to a position in the nucleic acid sequence has required significant manual inspection and remains the most time-consuming and error-prone step. The few available tools seeking to automate this band annotation have achieved limited accuracy and have not taken advantage of information across dozens of profiles routinely acquired in high-throughput measurements.

Results: We present a dynamic-programming-based approach to automate band annotation for high-throughput capillary electrophoresis. The approach is uniquely able to define and optimize a robust target function that takes into account multiple CE profiles (sequencing ladders, different chemical probes, different mutants) collected for the RNA. Over a large benchmark of multi-profile datasets for biological RNAs and designed RNAs from the EteRNA project, the method outperforms prior tools (QuSHAPE and FAST) significantly in terms of accuracy compared with gold-standard manual annotations. The amount of computation required is reasonable at a few seconds per dataset. We also introduce an ‘E-score’ metric to automatically assess the reliability of the band annotation and show it to be practically useful in flagging uncertainties in band annotation for further inspection.
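
As a rough illustration of dynamic programming for band annotation, the sketch below assigns a fixed number of bands to candidate peaks in a single trace. It is a simplified stand-in, not the HiTRACE algorithm: the scoring (peak intensity minus a penalty for deviating from an expected band spacing), the function name `annotate_bands` and all parameters are assumptions, and the multi-profile target function described above is not modelled.

```python
def annotate_bands(peaks, n_bands, expected_spacing, spacing_weight=1.0):
    """Pick n_bands peaks, in order, maximising a simple hypothetical score.

    peaks: list of (position, intensity) tuples sorted by position.
    Returns the indices of the chosen peaks, one per band.
    """
    m = len(peaks)
    if n_bands < 1 or n_bands > m:
        raise ValueError("need between 1 and len(peaks) bands")
    neg_inf = float("-inf")
    # score[i][j]: best total score when band i is assigned to peak j.
    score = [[neg_inf] * m for _ in range(n_bands)]
    back = [[-1] * m for _ in range(n_bands)]
    for j in range(m):
        score[0][j] = peaks[j][1]
    for i in range(1, n_bands):
        for j in range(i, m):
            for k in range(i - 1, j):
                gap = peaks[j][0] - peaks[k][0]
                penalty = spacing_weight * abs(gap - expected_spacing)
                candidate = score[i - 1][k] + peaks[j][1] - penalty
                if candidate > score[i][j]:
                    score[i][j] = candidate
                    back[i][j] = k
    # Trace back from the best-scoring peak for the last band.
    j = max(range(n_bands - 1, m), key=lambda idx: score[n_bands - 1][idx])
    chosen = []
    for i in range(n_bands - 1, -1, -1):
        chosen.append(j)
        j = back[i][j]
    return chosen[::-1]


# Made-up trace: four strong peaks about 10 units apart plus two weak shoulders.
example_peaks = [(10, 5.0), (14, 0.5), (20, 4.0), (30, 6.0), (33, 0.4), (40, 5.5)]
example_assignment = annotate_bands(example_peaks, n_bands=4, expected_spacing=10)
```

The table has one row per band and one column per peak, so the cost grows with the number of bands times the square of the number of peaks. Extending the score to sum evidence over several aligned profiles is the natural next step, but how that is done in the published method is not shown here.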

Availability and implementation: The implementation of the proposed algorithm is included in the HiTRACE software, freely available as an online server and for download at http://hitrace.stanford.edu.

Contact: sryoon@snu.ac.kr or rhiju@stanford.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

INPS: predicting the impact of non-synonymous variations on protein stability from sequence

Bioinformatics Journal - Mon, 08/24/2015 - 09:21

Motivation: A tool for reliably predicting the impact of variations on protein stability is extremely important both for protein engineering and for understanding the effects of Mendelian and somatic mutations in the genome. Next Generation Sequencing studies are constantly increasing the number of protein sequences. Given the huge disproportion between protein sequences and structures, there is a need for tools suited to annotate the effect of mutations starting from protein sequence without relying on the structure. Here, we describe INPS, a novel approach for annotating the effect of non-synonymous mutations on protein stability from sequence. INPS is based on SVM regression and is trained to predict the thermodynamic free energy change upon single-point variations in protein sequences.

Results: We show that INPS performs similarly to the state-of-the-art methods based on protein structure when tested in cross-validation on a non-redundant dataset. INPS also performs very well on a newly generated dataset consisting of a number of variations occurring in the tumor suppressor protein p53. Our results suggest that INPS is a tool suited for computing the effect of non-synonymous polymorphisms on protein stability when the protein structure is not available. We also show that INPS predictions are complementary to those of the state-of-the-art, structure-based method mCSM. When the two methods are combined, the overall prediction on the p53 set scores significantly higher than those of the single methods.

Availability and implementation: The presented method is available as a web server at http://inps.biocomp.unibo.it.

Contact: piero.fariselli@unibo.it

Supplementary information: Supplementary Materials are available at Bioinformatics online.

Categories: Journal Articles

Inferring data-specific micro-RNA function through the joint ranking of micro-RNA and pathways from matched micro-RNA and gene expression data

Bioinformatics Journal - Mon, 08/24/2015 - 09:21

Motivation: In practice, identifying and interpreting the functional impacts of the regulatory relationships between micro-RNA and messenger-RNA is non-trivial. The sheer scale of possible micro-RNA and messenger-RNA interactions can make the interpretation of results difficult.

Results: We propose a supervised framework, pMim, built upon concepts of significance combination, for jointly ranking regulatory micro-RNA and their potential functional impacts with respect to a condition of interest. Here, pMim directly tests if a micro-RNA is differentially expressed and if its predicted targets, which lie in a common biological pathway, have changed in the opposite direction. We leverage the information within existing micro-RNA target and pathway databases to stabilize the estimation and annotation of micro-RNA regulation, making our approach suitable for datasets with small sample sizes. In addition to outputting meaningful and interpretable results, we demonstrate in a variety of datasets that the micro-RNA identified by pMim, in comparison to simpler existing approaches, are also more concordant with what is described in the literature.
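
The idea of "significance combination" can be illustrated with Fisher's method, one standard way to merge several p-values into one. The abstract does not say which combination rule pMim uses, so the choice of Fisher's method, the function name `fisher_combine` and the two example p-values are assumptions made purely for illustration.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    Under the null hypothesis, X = -2 * sum(ln p_i) follows a chi-square
    distribution with 2k degrees of freedom, where k = len(p_values).
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # equals X / 2
    # For an even number of degrees of freedom the chi-square survival
    # function is exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total


# Hypothetical inputs: one p-value for differential expression of a micro-RNA,
# one for its pathway targets shifting in the opposite direction.
p_mirna = 0.05
p_targets = 0.05
p_joint = fisher_combine([p_mirna, p_targets])
```

Fisher's method assumes the combined tests are independent; when they are not, the joint p-value is only approximate. Ranking micro-RNA and pathway pairs by such a joint value is one plausible reading of "joint ranking", not a description of the package internals.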

Availability and implementation: This framework is implemented as an R function, pMim, in the package sydSeq available from http://www.ellispatrick.com/r-packages.

Contact: jean.yang@sydney.edu.au

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

INSPEcT: a computational tool to infer mRNA synthesis, processing and degradation dynamics from RNA- and 4sU-seq time course experiments

Bioinformatics Journal - Mon, 08/24/2015 - 09:21

Motivation: Cellular mRNA levels originate from the combined action of multiple regulatory processes, which can be recapitulated by the rates of pre-mRNA synthesis, pre-mRNA processing and mRNA degradation. Recent experimental and computational advances set the basis to study these intertwined levels of regulation. Nevertheless, software for the comprehensive quantification of RNA dynamics is still lacking.

Results: INSPEcT is an R package for the integrative analysis of RNA- and 4sU-seq data to study the dynamics of transcriptional regulation. INSPEcT provides gene-level quantification of these rates, and a modeling framework to identify which of these regulatory processes are most likely to explain the observed mRNA and pre-mRNA concentrations. Software performance is tested on a synthetic dataset, instrumental to guide the choice of the modeling parameters and the experimental design.
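
The three rates named above are commonly related to pre-mRNA and mature mRNA levels through a pair of first-order differential equations. The form below is a standard textbook-style sketch under that assumption; the symbols P, M, k_syn, k_proc and k_deg are notation introduced here, not taken from the abstract.

```latex
\begin{aligned}
\frac{dP(t)}{dt} &= k_{\mathrm{syn}}(t) - k_{\mathrm{proc}}(t)\,P(t), \\
\frac{dM(t)}{dt} &= k_{\mathrm{proc}}(t)\,P(t) - k_{\mathrm{deg}}(t)\,M(t).
\end{aligned}
```

Here P is the pre-mRNA concentration and M the mature mRNA concentration. With constant rates the steady state is P = k_syn / k_proc and M = k_syn / k_deg, which shows why a change in mature mRNA alone cannot distinguish altered synthesis from altered degradation.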

Availability and implementation: INSPEcT is submitted to Bioconductor and is currently available as Supplementary Additional File S1.

Contact: mattia.pelizzola@iit.it

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

Addressing false discoveries in network inference

Bioinformatics Journal - Mon, 08/24/2015 - 09:21

Motivation: Experimentally determined gene regulatory networks can be enriched by computational inference from high-throughput expression profiles. However, the prediction of regulatory interactions is severely impaired by indirect and spurious effects, particularly for eukaryotes. Recently published methods report improved predictions by exploiting the a priori known targets of a regulator (its local topology) in addition to expression profiles.

Results: We find that methods exploiting known targets show an unexpectedly high rate of false discoveries. This leads to inflated performance estimates and the prediction of an excessive number of new interactions for regulators with many known targets. Because of Simpson’s paradox, these issues are hidden from common evaluation and cross-validation setups. We suggest a confidence score recalibration method (CoRe) that reduces the false discovery rate and enables a reliable performance estimation.
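
Simpson's paradox is easy to reproduce with a small table of counts. The sketch below uses made-up numbers, chosen only to exhibit the effect: a hypothetical `method_a` is more accurate than `method_b` within each group of regulators, yet looks worse once the groups are pooled. None of the counts, method names or group names come from the paper.

```python
from fractions import Fraction

# Hypothetical (correct predictions, total predictions) per group of regulators.
counts = {
    "method_a": {"few_known_targets": (81, 87), "many_known_targets": (192, 263)},
    "method_b": {"few_known_targets": (234, 270), "many_known_targets": (55, 80)},
}


def stratum_rate(method, stratum):
    """Fraction of correct predictions within one group."""
    correct, total = counts[method][stratum]
    return Fraction(correct, total)


def pooled_rate(method):
    """Fraction of correct predictions after pooling all groups."""
    correct = sum(c for c, _ in counts[method].values())
    total = sum(t for _, t in counts[method].values())
    return Fraction(correct, total)


for stratum in ("few_known_targets", "many_known_targets"):
    print(stratum,
          float(stratum_rate("method_a", stratum)),
          float(stratum_rate("method_b", stratum)))
print("pooled", float(pooled_rate("method_a")), float(pooled_rate("method_b")))
```

The reversal arises because the two methods contribute very different numbers of predictions to each group, so the pooled rate is dominated by different groups for each method. An evaluation that only reports the pooled figure can therefore hide, or even invert, the per-group picture.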

Conclusions: CoRe considerably improves the results of network inference methods that exploit known targets. Predictions then display the biological process specificity of regulators more correctly and enable the inference of accurate genome-wide regulatory networks in eukaryotes. For yeast, we propose a network with more than 22 000 confident interactions. We point out that machine learning approaches outside of the area of network inference may be affected as well.

Availability and implementation: Results, executable code and networks are available via our website http://www.bio.ifi.lmu.de/forschung/CoRe.

Contact: robert.kueffner@helmholtz-muenchen.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

Genome-scale strain designs based on regulatory minimal cut sets

Bioinformatics Journal - Mon, 08/24/2015 - 09:21

Motivation: Stoichiometric and constraint-based methods of computational strain design have become an important tool for rational metabolic engineering. One of those relies on the concept of constrained minimal cut sets (cMCSs). However, like most other techniques, cMCSs may consider only reaction (or gene) knockouts to achieve a desired phenotype.

Results: We generalize the cMCSs approach to constrained regulatory MCSs (cRegMCSs), where up/downregulation of reaction rates can be combined with reaction deletions. We show that flux up/downregulations can virtually be treated as cuts, allowing their direct integration into the algorithmic framework of cMCSs. Because of vastly enlarged search spaces in genome-scale networks, we developed strategies to (optionally) preselect suitable candidates for flux regulation and novel algorithmic techniques to further enhance efficiency and speed of cMCSs calculation. We illustrate the cRegMCSs approach with a simple example network and then apply it to identify strain designs for ethanol production in a genome-scale metabolic model of Escherichia coli. The results clearly show that cRegMCSs combining reaction deletions and flux regulations provide a much larger number of suitable strain designs, many of which are significantly smaller relative to cMCSs involving only knockouts. Furthermore, with cRegMCSs, one may also enable the fine tuning of desired behaviours in a narrower range. The new cRegMCSs approach may thus accelerate the implementation of model-based strain designs for the bio-based production of fuels and chemicals.

Availability and implementation: MATLAB code and the examples can be downloaded at http://www.mpi-magdeburg.mpg.de/projects/cna/etcdownloads.html.

Contact: krishna.mahadevan@utoronto.ca or klamt@mpi-magdeburg.mpg.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

BinDNase: a discriminatory approach for transcription factor binding prediction using DNase I hypersensitivity data

Bioinformatics Journal - Mon, 08/24/2015 - 09:21

Motivation: Transcription factors (TFs) are a class of DNA-binding proteins that have a central role in regulating gene expression. To reveal mechanisms of transcriptional regulation, a number of computational tools have been proposed for predicting TF-DNA interaction sites. Recent studies have shown that genome-wide sequencing data on open chromatin sites from DNase I hypersensitivity experiments (DNase-seq) has great potential to map putative binding sites of all transcription factors in a single experiment. Thus, computational methods for analysing DNase-seq to accurately map TF-DNA interaction sites are highly needed.

Results: Here, we introduce a novel discriminative algorithm, BinDNase, for predicting TF-DNA interaction sites using DNase-seq data. BinDNase implements an efficient method for selecting and extracting informative features from DNase I signal for each TF, either at single nucleotide resolution or for larger regions. The method is applied to 57 transcription factors in cell line K562 and 31 transcription factors in cell line HepG2 using data from the ENCODE project. First, we show that BinDNase compares favourably to other supervised and unsupervised methods developed for TF-DNA interaction prediction using DNase-seq data. We demonstrate the importance of modelling each TF with a separate prediction model, reflecting TF-specific DNA accessibility around the TF-DNA interaction site. We also show that highly standardised DNase-seq data (pre)processing is a requisite for accurate TF binding predictions and that sequencing depth has on average only a moderate effect on prediction accuracy. Finally, BinDNase’s binding predictions generalise to other cell types, thus making BinDNase a versatile tool for accurate TF binding prediction.

Availability and implementation: An R implementation of the algorithm is available at http://research.ics.aalto.fi/csb/software/bindnase/.

Contact: juhani.kahara@aalto.fi

Supplementary information: Supplemental data are available at Bioinformatics online.

Categories: Journal Articles

The SwissLipids knowledgebase for lipid biology

Bioinformatics Journal - Mon, 08/24/2015 - 09:21

Motivation: Lipids are a large and diverse group of biological molecules with roles in membrane formation, energy storage and signaling. Cellular lipidomes may contain tens of thousands of structures, a staggering degree of complexity whose significance is not yet fully understood. High-throughput mass spectrometry-based platforms provide a means to study this complexity, but the interpretation of lipidomic data and its integration with prior knowledge of lipid biology suffers from a lack of appropriate tools to manage the data and extract knowledge from it.

Results: To facilitate the description and exploration of lipidomic data and its integration with prior biological knowledge, we have developed a knowledge resource for lipids and their biology—SwissLipids. SwissLipids provides curated knowledge of lipid structures and metabolism which is used to generate an in silico library of feasible lipid structures. These are arranged in a hierarchical classification that links mass spectrometry analytical outputs to all possible lipid structures, metabolic reactions and enzymes. SwissLipids provides a reference namespace for lipidomic data publication, data exploration and hypothesis generation. The current version of SwissLipids includes over 244 000 known and theoretically possible lipid structures, over 800 proteins, and curated links to published knowledge from over 620 peer-reviewed publications. We are continually updating the SwissLipids hierarchy with new lipid categories and new expert curated knowledge.

Availability: SwissLipids is freely available at http://www.swisslipids.org/.

Contact: alan.bridge@isb-sib.ch

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

CiVi: circular genome visualization with unique features to analyze sequence elements

Bioinformatics Journal - Mon, 08/24/2015 - 09:21

Summary: We have developed CiVi, a user-friendly web-based tool to create custom circular maps to aid the analysis of microbial genomes and sequence elements. Sequence-related data such as gene name, COG class, PFAM domain, GC% and subcellular location can be comprehensively viewed. Quantitative gene-related data (e.g. expression ratios or read counts) as well as predicted sequence elements (e.g. regulatory sequences) can be uploaded and visualized. CiVi accommodates the analysis of genomic elements by allowing a visual interpretation in the context of: (i) their genome-wide distribution, (ii) provided experimental data and (iii) the local orientation and location with respect to neighboring genes. CiVi thus enables both experts and non-experts to conveniently integrate public genome data with the results of genome analyses in circular genome maps suitable for publication.

Contact: L.Overmars@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.

Availability and implementation: CiVi is freely available at http://civi.cmbi.ru.nl

Categories: Journal Articles

IonGAP: integrative bacterial genome analysis for Ion Torrent sequence data

Bioinformatics Journal - Mon, 08/24/2015 - 09:21

Summary: We introduce IonGAP, a publicly available Web platform designed for the analysis of whole bacterial genomes using Ion Torrent sequence data. Besides assembly, it integrates a variety of comparative genomics, annotation and bacterial classification routines, based on the widely used FASTQ, BAM and SRA file formats. Benchmarking with different datasets showed that IonGAP is a fast, powerful and simple-to-use bioinformatics tool. By releasing this platform, we aim to translate low-cost bacterial genome analysis for microbiological prevention and control in healthcare, agroalimentary and pharmaceutical industry applications.

Availability and implementation: IonGAP is hosted by the ITER’s Teide-HPC supercomputer and is freely available on the Web for non-commercial use at http://iongap.hpc.iter.es.

Contact: mcolesan@ull.edu.es or cflores@ull.edu.es

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

Interactive analysis of large cancer copy number studies with Copy Number Explorer

Bioinformatics Journal - Mon, 08/24/2015 - 09:21

Summary: Copy number abnormalities (CNAs) such as somatically-acquired chromosomal deletions and duplications drive the development of cancer. As individual tumor genomes can contain tens or even hundreds of large and/or focal CNAs, a major difficulty is differentiating between important, recurrent pathogenic changes and benign changes unrelated to the subject’s phenotype. Here we present Copy Number Explorer, an interactive tool for mining large copy number datasets. Copy Number Explorer facilitates rapid visual and statistical identification of recurrent regions of gain or loss, identifies the genes most likely to drive CNA formation using the cghMCR method and identifies recurrently broken genes that may be disrupted or fused. The software also allows users to identify recurrent CNA regions that may be associated with differential survival.

Availability and implementation: Copy Number Explorer is available under the GNU public license (GPL-3). Source code is available at: https://sourceforge.net/projects/copynumberexplorer/

Contact: scott.newman@emory.edu

Categories: Journal Articles

kSNP3.0: SNP detection and phylogenetic analysis of genomes without genome alignment or reference genome

Bioinformatics Journal - Mon, 08/24/2015 - 09:21

Summary: We announce the release of kSNP3.0, a program for SNP identification and phylogenetic analysis without genome alignment or the requirement for reference genomes. kSNP3.0 is a significantly improved version of kSNP v2.

Availability and implementation: kSNP3.0 is implemented as a package of stand-alone executables for Linux and Mac OS X under the open-source BSD license. The executable packages, source code and a full User Guide are freely available at https://sourceforge.net/projects/ksnp/files/

Contact: barryghall@gmail.com

Categories: Journal Articles

Identification of C2H2-ZF binding preferences from ChIP-seq data using RCADE

Bioinformatics Journal - Mon, 08/24/2015 - 09:21

Summary: Current methods for motif discovery from chromatin immunoprecipitation followed by sequencing (ChIP-seq) data often identify non-targeted transcription factor (TF) motifs, and are even further limited when peak sequences are similar due to common ancestry rather than common binding factors. The latter aspect particularly affects a large number of proteins from the Cys2His2 zinc finger (C2H2-ZF) class of TFs, as their binding sites are often dominated by endogenous retroelements that have highly similar sequences. Here, we present recognition code-assisted discovery of regulatory elements (RCADE) for motif discovery from C2H2-ZF ChIP-seq data. RCADE combines predictions from a DNA recognition code of C2H2-ZFs with ChIP-seq data to identify models that represent the genuine DNA binding preferences of C2H2-ZF proteins. We show that RCADE is able to identify generalizable binding models even from peaks that are exclusively located within the repeat regions of the genome, where state-of-the-art motif finding approaches largely fail.

Availability and implementation: RCADE is available as a webserver and also for download at http://rcade.ccbr.utoronto.ca/.

Supplementary information: Supplementary data are available at Bioinformatics online.

Contact: t.hughes@utoronto.ca

Categories: Journal Articles

Tax4Fun: predicting functional profiles from metagenomic 16S rRNA data

Bioinformatics Journal - Mon, 08/24/2015 - 09:21

Motivation: The characterization of phylogenetic and functional diversity is a key element in the analysis of microbial communities. Amplicon-based sequencing of marker genes, such as 16S rRNA, is a powerful tool for assessing and comparing the structure of microbial communities at a high phylogenetic resolution. Because 16S rRNA sequencing is more cost-effective than whole metagenome shotgun sequencing, marker gene analysis is frequently used for broad studies that involve a large number of different samples. However, in comparison to shotgun sequencing approaches, insights into the functional capabilities of the community get lost when restricting the analysis to taxonomic assignment of 16S rRNA data.

Results: Tax4Fun is a software package that predicts the functional capabilities of microbial communities based on 16S rRNA datasets. We evaluated Tax4Fun on a range of paired metagenome/16S rRNA datasets to assess its performance. Our results indicate that Tax4Fun provides a good approximation to functional profiles obtained from metagenomic shotgun sequencing approaches.
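
One simple way to turn a taxonomic profile into a functional one is to average per-taxon reference profiles, weighted by taxon abundance. The sketch below illustrates that general idea only; it is not presented as the Tax4Fun implementation, and the function name, the taxon labels and the function identifiers (`K1`, `K2`) are hypothetical.

```python
def predict_functional_profile(taxon_abundance, taxon_function_profiles):
    """Abundance-weighted average of per-taxon functional profiles.

    taxon_abundance: dict mapping taxon -> count (e.g. 16S read counts).
    taxon_function_profiles: dict mapping taxon -> {function: relative frequency}.
    Taxa without a reference profile contribute nothing to the result.
    """
    total = sum(taxon_abundance.values())
    if total <= 0:
        raise ValueError("total abundance must be positive")
    profile = {}
    for taxon, abundance in taxon_abundance.items():
        weight = abundance / total
        for function, frequency in taxon_function_profiles.get(taxon, {}).items():
            profile[function] = profile.get(function, 0.0) + weight * frequency
    return profile


# Made-up example: two taxa with hypothetical reference profiles.
abundance = {"taxon_a": 3, "taxon_b": 1}
reference = {
    "taxon_a": {"K1": 0.5, "K2": 0.5},
    "taxon_b": {"K2": 1.0},
}
predicted = predict_functional_profile(abundance, reference)
```

A prediction of this kind can only be as good as the reference profiles behind it, which is why the evaluation against paired metagenome and 16S rRNA datasets described above matters.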

Availability and implementation: Tax4Fun is an open-source R package and applicable to output as obtained from the SILVAngs web server or the application of QIIME with a SILVA database extension. Tax4Fun is freely available for download at http://tax4fun.gobics.de/.

Contact: kasshau@gwdg.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

BFC: correcting Illumina sequencing errors

Bioinformatics Journal - Mon, 08/24/2015 - 09:21

Summary: BFC is a free, fast and easy-to-use sequencing error corrector designed for Illumina short reads. It uses a non-greedy algorithm but still maintains a speed comparable to implementations based on greedy methods. In evaluations on real data, BFC appears to correct more errors with fewer overcorrections in comparison to existing tools. It does particularly well in suppressing systematic sequencing errors, which helps to improve the base accuracy of de novo assemblies.

Availability and implementation: https://github.com/lh3/bfc

Contact: hengli@broadinstitute.org

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

pez: phylogenetics for the environmental sciences

Bioinformatics Journal - Mon, 08/24/2015 - 09:21

Summary: pez is an R package that permits measurement, modelling and simulation of phylogenetic structure in ecological data. pez contains the first implementation of many methods in R, and aggregates existing data structures and methods into a single, coherent package.

Availability and implementation: pez is released under the GPL v3 open-source license, available on the Internet from CRAN (http://cran.r-project.org). The package is under active development, and the authors welcome contributions (see http://github.com/willpearse/pez).

Contact: will.pearse@gmail.com

Categories: Journal Articles

iFoldRNA v2: folding RNA with constraints

Bioinformatics Journal - Mon, 08/24/2015 - 09:21

Summary: A key to understanding RNA function is to uncover its complex 3D structure. Experimental methods used for determining RNA 3D structures are technologically challenging and laborious, which makes the development of computational prediction methods of substantial interest. Previously, we developed the iFoldRNA server that allows accurate prediction of short (<50 nt) tertiary RNA structures starting from primary sequences. Here, we present a new version of the iFoldRNA server that permits the prediction of tertiary structure of RNAs as long as a few hundred nucleotides. This substantial increase in the server capacity is achieved by utilization of experimental information such as base-pairing and hydroxyl-radical probing. We demonstrate a significant benefit provided by integration of experimental data and computational methods.

Availability and implementation: http://ifoldrna.dokhlab.org

Contact: dokh@unc.eu

Categories: Journal Articles

PDBest: a user-friendly platform for manipulating and enhancing protein structures

Bioinformatics Journal - Mon, 08/24/2015 - 09:21

Summary: PDBest (PDB Enhanced Structures Toolkit) is a user-friendly, freely available platform for acquiring, manipulating and normalizing protein structures in a high-throughput and seamless fashion. With an intuitive graphical interface it allows users with no programming background to download and manipulate their files. The platform also exports protocols, enabling users to easily share PDB searching and filtering criteria, enhancing analysis reproducibility.

Availability and implementation: PDBest installation packages are freely available for several platforms at http://www.pdbest.dcc.ufmg.br

Contact: wellisson@dcc.ufmg.br, dpires@dcc.ufmg.br, raquelcm@dcc.ufmg.br

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles