Bioinformatics Journal

Bioinformatics - RSS feed of current issue

ERC analysis: web-based inference of gene function via evolutionary rate covariation

Fri, 11/20/2015 - 02:05

Summary: The recent explosion of comparative genomics data presents an unprecedented opportunity to construct gene networks via the evolutionary rate covariation (ERC) signature. ERC is used to identify genes that experienced similar evolutionary histories, and thereby draws functional associations between them. The ERC Analysis website allows researchers to exploit genome-wide datasets to infer novel genes in any biological function and to explore deep evolutionary connections between distinct pathways and complexes. The website provides five analytical methods, graphical output, statistical support and access to an increasing number of taxonomic groups.
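
The covariation idea can be illustrated with a small sketch. It assumes, purely for illustration, that each gene is described by a relative evolutionary rate on each branch of a shared species tree and that covariation is scored as a Pearson correlation between those rate vectors; the summary does not state the exact statistic, and the gene names and rate values below are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length vectors with at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


# Hypothetical relative rates for three genes on the same five branches.
gene_a = [0.8, 1.2, 1.0, 1.5, 0.5]
gene_b = [0.9, 1.1, 1.0, 1.4, 0.6]  # speeds up and slows down with gene_a
gene_c = [1.2, 0.8, 1.0, 0.5, 1.5]  # changes in the opposite direction

score_ab = pearson(gene_a, gene_b)  # strongly positive
score_ac = pearson(gene_a, gene_c)  # strongly negative
```

Under this assumption, a high positive score between two genes would be read as a hint of shared evolutionary pressure and therefore of a possible functional link.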

Availability and implementation: Analyses and data at http://csb.pitt.edu/erc_analysis/

Contact: nclark@pitt.edu

Categories: Journal Articles

Correcting systematic bias and instrument measurement drift with mzRefinery

Fri, 11/20/2015 - 02:05

Motivation: Systematic bias in mass measurement adversely affects data quality and negates the advantages of high precision instruments.

Results: We introduce the mzRefinery tool for calibration of mass spectrometry data files. Using confident peptide spectrum matches, three different calibration methods are explored and the optimal transform function is chosen. After calibration, systematic bias is removed and the mass measurement errors are centered at 0 ppm. Because it is part of the ProteoWizard package, mzRefinery can read and write a wide variety of file formats.
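
As a rough illustration of what centering mass errors at 0 ppm means, the sketch below estimates one global shift from confident matches and removes it. This is an assumption chosen for simplicity, not one of the three calibration methods the abstract mentions (which are not described here); the function names and the example values are hypothetical.

```python
from statistics import median


def ppm_error(observed_mz, theoretical_mz):
    """Mass measurement error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def fit_global_shift(matches):
    """Estimate a single ppm shift from (observed, theoretical) m/z pairs.

    Returns the median ppm error and a function that removes it from an m/z.
    """
    shift = median(ppm_error(obs, theo) for obs, theo in matches)

    def correct(mz):
        # Invert observed = true * (1 + shift * 1e-6).
        return mz / (1.0 + shift * 1e-6)

    return shift, correct


# Hypothetical confident matches carrying a systematic bias of about +5 ppm.
psms = [(theo * (1.0 + err * 1e-6), theo)
        for theo, err in ((500.0, 4.0), (800.0, 5.0), (1200.0, 6.0))]

shift, correct = fit_global_shift(psms)
residuals = [ppm_error(correct(obs), theo) for obs, theo in psms]
```

After the correction, the residual errors in this toy example are spread around zero instead of around the original bias.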

Availability and implementation: The mzRefinery tool is part of msConvert, available with the ProteoWizard open source package at http://proteowizard.sourceforge.net/

Contact: samuel.payne@pnnl.gov

Supplementary information: Supplementary data are available at Bioinformatics online.

AlignBucket: a tool to speed up 'all-against-all' protein sequence alignments optimizing length constraints

Fri, 11/20/2015 - 02:05

Motivation: The next-generation sequencing era requires reliable, fast and efficient approaches for the accurate annotation of the ever-increasing number of biological sequences and their variations. Transfer of annotation upon similarity search is a standard approach. The procedure of all-against-all protein comparison is a preliminary step of different available methods that annotate sequences based on information already present in databases. Given the current volume of sequences, methods are necessary to pre-process data to reduce the time of sequence comparison.

Results: We present an algorithm that optimizes the partition of a large volume of sequences (the whole database) into sets where sequence length values (in residues) are constrained depending on a bounded minimal and expected alignment coverage. The idea is to optimally group protein sequences according to their length, and then compute the all-against-all sequence alignments among sequences that fall in a selected length range. We describe a mathematically optimal solution and we show that our method leads to a 5-fold speed-up in real-world cases.
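
The length constraint can be illustrated with a simple pair filter. The sketch assumes that coverage is measured on the longer sequence, so a pair can only reach a minimal coverage c when shorter/longer >= c. It is not the optimal partitioning algorithm described above, only a demonstration of why length bounds shrink the number of alignments; the function name and example lengths are hypothetical.

```python
from bisect import bisect_right


def candidate_pairs(lengths, min_coverage):
    """Return index pairs whose length ratio (shorter / longer) can reach
    min_coverage. Indices refer to positions in the input list."""
    if not 0 < min_coverage <= 1:
        raise ValueError("min_coverage must be in (0, 1]")
    # Sort indices by sequence length so partners form a contiguous range.
    order = sorted(range(len(lengths)), key=lengths.__getitem__)
    sorted_lengths = [lengths[i] for i in order]
    pairs = []
    for pos, short in enumerate(sorted_lengths):
        # Longest partner length that still satisfies the coverage bound.
        limit = short / min_coverage
        end = bisect_right(sorted_lengths, limit)
        for other in range(pos + 1, end):
            i, j = order[pos], order[other]
            pairs.append((min(i, j), max(i, j)))
    return pairs


# Hypothetical sequence lengths in residues.
lengths = [100, 120, 300, 310, 1000]
pairs = candidate_pairs(lengths, min_coverage=0.8)
all_pairs = len(lengths) * (len(lengths) - 1) // 2
```

In this toy case only 2 of the 10 possible pairs survive the filter, which is the kind of saving that motivates grouping sequences by length before aligning them.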

Availability and implementation: The software is available for downloading at http://www.biocomp.unibo.it/~giuseppe/partitioning.html.

Contact: giuseppe.profiti2@unibo.it

Supplementary information: Supplementary data are available at Bioinformatics online.

ARResT/AssignSubsets: a novel application for robust subclassification of chronic lymphocytic leukemia based on B cell receptor IG stereotypy

Fri, 11/20/2015 - 02:05

Motivation: An ever-increasing body of evidence supports the importance of B cell receptor immunoglobulin (BcR IG) sequence restriction, alias stereotypy, in chronic lymphocytic leukemia (CLL). This phenomenon accounts for ~30% of studied cases, one in eight of which belong to major subsets, and extends beyond restricted sequence patterns to shared biologic and clinical characteristics and, generally, outcome. Thus, the robust assignment of new cases to major CLL subsets is a critical, and yet unmet, requirement.

Results: We introduce a novel application, ARResT/AssignSubsets, which enables the robust assignment of BcR IG sequences from CLL patients to major stereotyped subsets. ARResT/AssignSubsets uniquely combines expert immunogenetic sequence annotation from IMGT/V-QUEST with curation to safeguard quality, statistical modeling of sequence features from more than 7500 CLL patients, and results from multiple perspectives to allow for both objective and subjective assessment. We validated our approach on the learning set, and evaluated its real-world applicability on a new representative dataset comprising 459 sequences from a single institution.

Availability and implementation: ARResT/AssignSubsets is freely available on the web at http://bat.infspire.org/arrest/assignsubsets/

Contact: nikos.darzentas@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.

motifbreakR: an R/Bioconductor package for predicting variant effects at transcription factor binding sites

Fri, 11/20/2015 - 02:05

Summary: Functional annotation represents a key step toward the understanding and interpretation of germline and somatic variation as revealed by genome-wide association studies (GWAS) and The Cancer Genome Atlas (TCGA), respectively. GWAS have revealed numerous genetic risk variants residing in non-coding DNA associated with complex diseases. For sequences that lie within enhancers or promoters of transcription, it is not straightforward to assess the effects of variants on likely transcription factor binding sites. Consequently, we introduce motifbreakR, which allows the biologist to judge whether the sequence surrounding a polymorphism or mutation is a good match, and how much information is gained or lost in one allele of the polymorphism or mutation relative to the other. MotifbreakR is flexible, giving users a choice of algorithms for interrogating genomes with motifs from many public sources. MotifbreakR can predict effects for novel or previously described variants in public databases, making it suitable for tasks beyond the scope of its original design. Lastly, it can be used to interrogate any genome curated within Bioconductor.
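
The notion of information gained or lost between two alleles can be sketched with a position weight matrix. The example below is written in Python rather than R and does not use the motifbreakR API; the matrix, the uniform background and the log-odds scoring are all assumptions made for illustration, and the package offers its own choice of algorithms.

```python
from math import log2

BACKGROUND = 0.25  # assumed uniform base frequency

# Hypothetical 4-position motif; each column maps a base to its probability.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]


def window_score(pwm, seq):
    """Log-odds score (in bits) of a sequence window against the motif."""
    if len(seq) != len(pwm):
        raise ValueError("window length must equal motif length")
    return sum(log2(col[base] / BACKGROUND) for col, base in zip(pwm, seq))


def allele_effect(pwm, ref_window, alt_window):
    """Score change caused by the variant: alternate minus reference."""
    return window_score(pwm, alt_window) - window_score(pwm, ref_window)


# Hypothetical variant: G>T at the third position of the window.
effect = allele_effect(PWM, "ACGT", "ACTT")
```

A negative value means the alternate allele matches the motif less well than the reference allele, which is one simple way to flag a variant as potentially disrupting a binding site.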

Availability and implementation: https://github.com/Simon-Coetzee/MotifBreakR, www.bioconductor.org.

Contact: dennis.hazelett@cshs.org

HHalign-Kbest: exploring sub-optimal alignments for remote homology comparative modeling

Fri, 11/20/2015 - 02:05

Motivation: The HHsearch algorithm, implementing a hidden Markov model (HMM)-HMM alignment method, has shown excellent alignment performance in the so-called twilight zone (target-template sequence identity of ~20%). However, an optimal alignment by HHsearch may contain small to large errors, leading to poor structure prediction if these errors are located in important structural elements.

Results: The HHalign-Kbest server runs a full pipeline, from the generation of suboptimal HMM-HMM alignments to the evaluation of the best structural models. In the HHsearch framework, it implements a novel algorithm capable of generating k-best HMM-HMM suboptimal alignments rather than only the optimal one. For large proteins, a directed acyclic graph-based implementation drastically reduces memory usage. Improved alignments were systematically generated among the top k suboptimal alignments. To recognize them, corresponding structural models were systematically generated and evaluated with the Qmean score. The method was benchmarked over 420 targets from the SCOP30 database. In the range of HHsearch probability of 20–99%, the average quality of the models (TM-score) rose by 4.1–16.3% and 8.0–21.0% considering the top 1 and top 10 best models, respectively.

Availability and implementation: http://bioserv.rpbs.univ-paris-diderot.fr/services/HHalign-Kbest/ (source code and server).

Contact: guerois@cea.fr

Supplementary information: Supplementary data are available at Bioinformatics online.

MEPSA: minimum energy pathway analysis for energy landscapes

Fri, 11/20/2015 - 02:05

Summary: From conformational studies to atomistic descriptions of enzymatic reactions, potential and free energy landscapes can be used to describe biomolecular systems in detail. However, extracting the relevant data of complex 3D energy surfaces can sometimes be laborious. In this article, we present MEPSA (Minimum Energy Path Surface Analysis), a cross-platform user friendly tool for the analysis of energy landscapes from a transition state theory perspective. Some of its most relevant features are: identification of all the barriers and minima of the landscape at once, description of maxima edge profiles, detection of the lowest energy path connecting two minima and generation of transition state theory diagrams along these paths. In addition to a built-in plotting system, MEPSA can save most of the generated data into easily parseable text files, allowing more versatile uses of MEPSA’s output such as the generation of molecular dynamics restraints from a calculated path.
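
One way to picture the search for a lowest energy path between two minima is a bottleneck search on a gridded surface. The sketch below finds the path whose highest energy point is as low as possible, using a Dijkstra-style search with 4-neighbour moves. The summary does not describe MEPSA's actual algorithm, so the criterion, the function name and the toy grid are assumptions.

```python
import heapq


def lowest_barrier_path(grid, start, goal):
    """Find a path from start to goal that minimizes the highest energy
    visited. Cells are (row, col) tuples inside a rectangular grid.
    Returns (barrier_energy, path)."""
    rows, cols = len(grid), len(grid[0])
    best = {start: grid[start[0]][start[1]]}  # lowest known barrier per cell
    prev = {}
    heap = [(best[start], start)]
    while heap:
        barrier, node = heapq.heappop(heap)
        if node == goal:
            break
        if barrier > best[node]:
            continue  # stale heap entry
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                new_barrier = max(barrier, grid[nr][nc])
                if new_barrier < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_barrier
                    prev[(nr, nc)] = node
                    heapq.heappush(heap, (new_barrier, (nr, nc)))
    # Walk back from the goal to recover the path.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    path.reverse()
    return best[goal], path


# Toy surface: two minima in the top corners separated by a ridge
# (middle column) that is lowest at the bottom row.
surface = [
    [0, 9, 0],
    [2, 7, 2],
    [3, 4, 3],
]
barrier, path = lowest_barrier_path(surface, (0, 0), (0, 2))
```

In this example the path goes around the high part of the ridge and crosses it at its lowest point, so the reported barrier is the energy of that saddle-like cell.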

Availability and implementation: MEPSA is freely available (under GPLv3 license) at: http://bioweb.cbm.uam.es/software/MEPSA/

Contact: pagomez@cbm.csic.es

Supplementary information: Supplementary data are available at Bioinformatics online.

NRGsuite: a PyMOL plugin to perform docking simulations in real time using FlexAID

Fri, 11/20/2015 - 02:05

Ligand-protein docking simulations play a fundamental role in understanding molecular recognition. Herein we introduce the NRGsuite, a PyMOL plugin that permits the detection of surface cavities in proteins, their refinements, calculation of volume and use, individually or jointly, as target binding-sites for docking simulations with FlexAID. The NRGsuite offers users control over a large number of important parameters in docking simulations, including the assignment of flexible side-chains and definition of geometric constraints. Furthermore, the NRGsuite permits the visualization of the docking simulation in real time. The NRGsuite gives access to powerful docking simulations that can be used in structure-guided drug design and as an educational tool. The NRGsuite is implemented in Python and C/C++ with an easy-to-use package installer. The NRGsuite is available for Windows, Linux and MacOS.

Availability and implementation: http://bcb.med.usherbrooke.ca/flexaid.

Contact: rafael.najmanovich@usherbrooke.ca

Supplementary information: Supplementary data are available at Bioinformatics online.

aRrayLasso: a network-based approach to microarray interconversion

Fri, 11/20/2015 - 02:05

Summary: Robust conversion between microarray platforms is needed to leverage the wide variety of microarray expression studies that have been conducted to date. Currently available conversion methods rely on manufacturer annotations, which are often incomplete, or on direct alignment of probes from different platforms, which often fails to yield acceptable genewise correlation. Here, we describe aRrayLasso, which uses the Lasso-penalized generalized linear model to model the relationships between individual probes in different probe sets. We have implemented aRrayLasso in a set of five open-source R functions that allow the user to acquire data from public sources such as Gene Expression Omnibus, train a set of Lasso models on that data and directly map one microarray platform to another. aRrayLasso significantly predicts expression levels, with fidelity similar to that of technical replicates of the same RNA pool, demonstrating its utility in the integration of datasets from different platforms.

Availability and implementation: All functions are available, along with descriptions, at https://github.com/adam-sam-brown/aRrayLasso.

Contact: chirag_patel@hms.harvard.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

NAM: association studies in multiple populations

Fri, 11/20/2015 - 02:05

Motivation: Mixed linear models provide important techniques for performing genome-wide association studies. However, current models have pitfalls associated with their strong assumptions. Here, we propose a new implementation designed to overcome some of these pitfalls using an empirical Bayes algorithm.

Results: Here we introduce NAM, an R package that allows users to take into account prior information regarding population stratification to relax the linkage phase assumption of current methods. It allows markers to be treated as a random effect to increase the resolution, and uses a sliding-window strategy to increase power and avoid double fitting markers into the model.

Availability and implementation: NAM is an R package available in the CRAN repository. It can be installed in R by typing install.packages("NAM").

Contact: krainey@purdue.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

stringgaussnet: from differentially expressed genes to semantic and Gaussian networks generation

Fri, 11/20/2015 - 02:05

Motivation: Knowledge-based and co-expression networks are two kinds of gene networks that can be currently implemented by sophisticated but distinct tools. We developed stringgaussnet, an R package that integrates both approaches, starting from a list of differentially expressed genes.

Contact: henri-jean.garchon@inserm.fr

Availability and implementation: Freely available on the web at http://cran.r-project.org/web/packages/stringgaussnet.

cyNeo4j: connecting Neo4j and Cytoscape

Fri, 11/20/2015 - 02:05

Summary: We developed cyNeo4j, a Cytoscape App to link Cytoscape and Neo4j databases to utilize the performance and storage capacities Neo4j offers. We implemented a Neo4j NetworkAnalyzer, ForceAtlas2 layout and Cypher component to demonstrate the possibilities that a distributed setup of Cytoscape and Neo4j offers.

Availability and implementation: The app is available from the Cytoscape App Store at http://apps.cytoscape.org/apps/cyneo4j, the Neo4j plugins at www.github.com/gsummer/cyneo4j-parent and the community and commercial editions of Neo4j can be found at http://www.neo4j.com.

Contact: georg.summer@gmail.com

DSviaDRM: an R package for estimating disease similarity via dysfunctional regulation mechanism

Fri, 11/20/2015 - 02:05

Summary: Elucidation of human disease similarities has provided new insights into etiology, disease classification and drug repositioning. Since dysfunctional regulation would be manifested as the decoupling of expression correlation, disease similarity (DS) in terms of dysfunctional regulation mechanism (DRM) could be estimated by using a differential coexpression based approach, which is described in a companion paper. Due to the lack of tools in the public domain for estimating DS from the viewpoint of DRM, we implemented an R package ‘DSviaDRM’ to identify significant DS via DRM based on transcriptomic data. DSviaDRM contains five easy-to-use functions, DCEA, DCpathway, DS, comDCGL and comDCGLplot, for identifying disease relationships and showing common differential regulation information shared by similar diseases.

Availability and implementation: DSviaDRM is available as an R package, with a user’s guide and source code, at http://cran.r-project.org/web/packages/DSviaDRM/index.html.

Contact: yyli@scbit.org or yxli@scbit.org

Supplementary information: Supplementary data are available at Bioinformatics online.

ASSIsT: an automatic SNP scoring tool for in- and outbreeding species

Fri, 11/20/2015 - 02:05

ASSIsT (Automatic SNP ScorIng Tool) is a user-friendly customized pipeline for efficient calling and filtering of SNPs from Illumina Infinium arrays, specifically devised for custom genotyping arrays. Illumina has developed integrated software for SNP data visualization and inspection called GenomeStudio® (GS). ASSIsT builds on GS-derived data and identifies those markers that follow a bi-allelic genetic model and show reliable genotype calls. Moreover, ASSIsT re-edits SNP calls with null alleles or additional SNPs in the probe annealing site. ASSIsT can be employed in the analysis of different population types such as full-sib families and mating schemes used in the plant kingdom (backcross, F1, F2), and unrelated individuals. The final result can be directly exported in the format required by the most common software for genetic mapping and marker–trait association analysis. ASSIsT is developed in Python and runs on Windows and Linux.

Availability and implementation: The software, example data sets and tutorials are freely available at http://compbiotoolbox.fmach.it/assist/.

Contact: eric.vandeweg@wur.nl

Vizardous: interactive analysis of microbial populations with single cell resolution

Fri, 11/20/2015 - 02:05

Motivation: Single cell time-lapse microscopy is a powerful method for investigating heterogeneous cell behavior. Advances in microfluidic lab-on-a-chip technologies and live-cell imaging render the parallel observation of the development of individual cells in hundreds of populations possible. While image analysis tools are available for cell detection and tracking, biologists are still confronted with the challenge of exploring and evaluating this data.

Results: We present the software tool Vizardous that assists scientists with explorative analysis and interpretation tasks of single cell data in an interactive, configurable and visual way. With Vizardous, lineage tree drawings can be augmented with various, time-resolved cellular characteristics. Associated statistical moments bridge the gap between single cell and the population-average level.

Availability and implementation: The software, including documentation and examples, is available as an executable Java archive as well as in source form at https://github.com/modsim/vizardous.

Contact: k.noeh@fz-juelich.de

Supplementary information: Supplementary data are available at Bioinformatics online.

SurvCurv database and online survival analysis platform update

Fri, 11/20/2015 - 02:05

Summary: Understanding the biology of ageing is an important and complex challenge. Survival experiments are one of the primary approaches for measuring changes in ageing. Here, we present a major update to SurvCurv, a database and online resource for survival data in animals. As well as a substantial increase in data and additions to existing graphical and statistical survival analysis features, SurvCurv now includes extended mathematical mortality modelling functions and survival density plots for more advanced representation of groups of survival cohorts.

Availability and implementation: The database is freely available at https://www.ebi.ac.uk/thornton-srv/databases/SurvCurv/. All data are published under the Creative Commons Attribution License.

Contact: matthias.ziehm@ebi.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

Determining conserved metabolic biomarkers from a million database queries

Fri, 11/20/2015 - 02:05

Motivation: Metabolite databases provide a unique window into metabolome research allowing the most commonly searched biomarkers to be catalogued. Omic scale metabolite profiling, or metabolomics, is finding increased utility in biomarker discovery largely driven by improvements in analytical technologies and the concurrent developments in bioinformatics. However, the successful translation of biomarkers into clinical or biologically relevant indicators is limited.

Results: With the aim of improving the discovery of translatable metabolite biomarkers, we present search analytics for over one million METLIN metabolite database queries. The most common metabolites found in METLIN were cross-correlated against XCMS Online, the widely used cloud-based data processing and pathway analysis platform. Analysis of the METLIN and XCMS common metabolite data has two primary implications: these metabolites might indicate a conserved metabolic response to stressors, and these data may be used to gauge the relative uniqueness of potential biomarkers.

Availability and implementation: METLIN can be accessed by logging on to https://metlin.scripps.edu

Contact: siuzdak@scripps.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

TIPR: transcription initiation pattern recognition on a genome scale

Fri, 11/20/2015 - 02:05

Motivation: The computational identification of gene transcription start sites (TSSs) can provide insights into the regulation and function of genes without performing expensive experiments, particularly in organisms with incomplete annotations. High-resolution general-purpose TSS prediction remains a challenging problem, with little recent progress on the identification and differentiation of TSSs which are arranged in different spatial patterns along the chromosome.

Results: In this work, we present the Transcription Initiation Pattern Recognizer (TIPR), a sequence-based machine learning model that identifies TSSs with high accuracy and resolution for multiple spatial distribution patterns along the genome, including broadly distributed TSS patterns that have previously been difficult to characterize. TIPR predicts not only the locations of TSSs but also the expected spatial initiation pattern each TSS will form along the chromosome—a novel capability for TSS prediction algorithms. As spatial initiation patterns are associated with spatiotemporal expression patterns and gene function, this capability has the potential to improve gene annotations and our understanding of the regulation of transcription initiation. The high nucleotide resolution of this model locates TSSs within 10 nucleotides or less on average.

Availability and implementation: Model source code is made available online at http://megraw.cgrb.oregonstate.edu/software/TIPR/.

Contact: megrawm@science.oregonstate.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

GMcloser: closing gaps in assemblies accurately with a likelihood-based selection of contig or long-read alignments

Fri, 11/20/2015 - 02:05

Motivation: Genome assemblies generated with next-generation sequencing (NGS) reads usually contain a number of gaps. Several tools have recently been developed to close the gaps in these assemblies with NGS reads. Although these gap-closing tools efficiently close the gaps, they entail a high rate of misassembly at gap-closing sites.

Results: We have found that the assembly error rates caused by these tools are 20–500-fold higher than the rate of errors introduced into contigs by de novo assemblers. We here describe GMcloser, a tool that accurately closes these gaps with a preassembled contig set or a long read set (i.e. error-corrected PacBio reads). GMcloser uses likelihood-based classifiers calculated from the alignment statistics between scaffolds, contigs and paired-end reads to correctly assign contigs or long reads to gap regions of scaffolds, thereby achieving accurate and efficient gap closure. We demonstrate with sequencing data from various organisms that the gap-closing accuracy of GMcloser is 3–100-fold higher than those of other available tools, with similar efficiency.

Availability and implementation: GMcloser and an accompanying tool (GMvalue) for evaluating the assembly and correcting misassemblies except SNPs and short indels in the assembly are available at https://sourceforge.net/projects/gmcloser/.

Contact: shunichi.kosugi@riken.jp

Supplementary information: Supplementary data are available at Bioinformatics online.

Mutadelic: mutation analysis using description logic inferencing capabilities

Fri, 11/20/2015 - 02:05

Motivation: As next generation sequencing gains a foothold in clinical genetics, there is a need for annotation tools to characterize increasing amounts of patient variant data for identifying clinically relevant mutations. While existing informatics tools provide efficient bulk variant annotations, they often generate excess information that may limit their scalability.

Results: We propose an alternative solution based on description logic inferencing to generate workflows that produce only those annotations that will contribute to the interpretation of each variant. Workflows are dynamically generated using a novel abductive reasoning framework called a basic framework for abductive workflow generation (AbFab). Criteria for identifying disease-causing variants in Mendelian blood disorders were identified and implemented as AbFab services. A web application was built allowing users to run workflows generated from the criteria to analyze genomic variants. Significant variants are flagged and explanations provided for why they match or fail to match the criteria.

Availability and implementation: The Mutadelic web application is available for use at http://krauthammerlab.med.yale.edu/mutadelic.

Contact: michael.krauthammer@yale.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
