Bioinformatics Journal

Bioinformatics - RSS feed of current issue
Updated: 1 year 50 weeks ago

SuccFind: a novel succinylation sites online prediction tool via enhanced characteristic strategy

Fri, 11/20/2015 - 02:05

Summary: Lysine succinylation orchestrates a variety of biological processes. Annotating succinylation sites across proteomes is a crucial first step towards deciphering the physiological roles of succinylation in pathological processes. In this work, we developed a novel online prediction tool for succinylation sites, called SuccFind, which predicts lysine succinylation sites from two major categories of characteristics, sequence-derived features and evolutionary information derived from the sequence, and applies an enhanced feature strategy for further optimization. The assessment results obtained from cross-validation suggest that SuccFind can provide more instructive guidance for further experimental investigation of protein succinylation.

Availability and implementation: A user-friendly server is freely available on the web at: http://bioinfo.ncu.edu.cn/SuccFind.aspx

Contact: jdqiu@ncu.edu.cn

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

An efficient algorithm for the extraction of HGVS variant descriptions from sequences

Fri, 11/20/2015 - 02:05

Motivation: Unambiguous sequence variant descriptions are important in reporting the outcome of clinical diagnostic DNA tests. The standard nomenclature of the Human Genome Variation Society (HGVS) describes the observed variant sequence relative to a given reference sequence. We propose an efficient algorithm for the extraction of HGVS descriptions from two sequences with three main requirements in mind: minimizing the length of the resulting descriptions, minimizing the computation time and keeping the unambiguous descriptions biologically meaningful.

Results: Our algorithm is able to compute the HGVS descriptions of complete chromosomes or other large DNA strings in a reasonable amount of computation time and its resulting descriptions are relatively small. Additional applications include updating of gene variant database contents and reference sequence liftovers.
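
As a rough illustration of what extracting a description from two sequences involves, the sketch below handles only the simplest case: a single changed region, found by trimming the common prefix and suffix. It is not the algorithm of the paper, which extracts many variants from large sequences; the function name describe_variant is hypothetical, positions are 1-based, and duplications, inversions and insertions before the first base are deliberately not handled.

```python
def describe_variant(ref: str, obs: str) -> str:
    """Describe a single changed region between ref and obs in an
    HGVS-like style (illustrative only; one variant, no duplications)."""
    if ref == obs:
        return "="

    # Length of the longest common prefix.
    start = 0
    limit = min(len(ref), len(obs))
    while start < limit and ref[start] == obs[start]:
        start += 1

    # Trim the longest common suffix without running into the prefix.
    end_ref, end_obs = len(ref), len(obs)
    while (end_ref > start and end_obs > start
           and ref[end_ref - 1] == obs[end_obs - 1]):
        end_ref -= 1
        end_obs -= 1

    deleted = ref[start:end_ref]
    inserted = obs[start:end_obs]
    first, last = start + 1, end_ref  # 1-based, inclusive

    if len(deleted) == 1 and len(inserted) == 1:
        return f"{first}{deleted}>{inserted}"
    span = f"{first}" if first == last else f"{first}_{last}"
    if not inserted:
        return f"{span}del"
    if not deleted:
        # Insertion between the flanking positions start and start + 1.
        return f"{start}_{start + 1}ins{inserted}"
    return f"{span}delins{inserted}"
```

For example, describe_variant("ATGCATGC", "ATGTATGC") yields 4C>T. Trimming shared flanks keeps the description short, which is one of the three requirements named above.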

Availability: The algorithm is accessible as an experimental service in the Mutalyzer program suite (https://mutalyzer.nl). The C++ source code and Python interface are accessible at: https://github.com/mutalyzer/description-extractor.

Contact: j.k.vis@lumc.nl

Categories: Journal Articles

BLSSpeller: exhaustive comparative discovery of conserved cis-regulatory elements

Fri, 11/20/2015 - 02:05

Motivation: The accurate discovery and annotation of regulatory elements remains a challenging problem. The growing number of sequenced genomes creates new opportunities for comparative approaches to motif discovery. Putative binding sites are then considered to be functional if they are conserved in orthologous promoter sequences of multiple related species. Existing methods for comparative motif discovery usually rely on pregenerated multiple sequence alignments, which are difficult to obtain for more diverged species such as plants. As a consequence, misaligned regulatory elements often remain undetected.

Results: We present a novel algorithm that supports both alignment-free and alignment-based motif discovery in the promoter sequences of related species. Putative motifs are exhaustively enumerated as words over the IUPAC alphabet and screened for conservation using the branch length score. Additionally, a confidence score is established in a genome-wide fashion. In order to take advantage of a cloud computing infrastructure, the MapReduce programming model is adopted. The method is applied to four monocotyledon plant species and it is shown that high-scoring motifs are significantly enriched for open chromatin regions in Oryza sativa and for transcription factor binding sites inferred through protein-binding microarrays in O.sativa and Zea mays. Furthermore, the method is shown to recover experimentally profiled ga2ox1-like KN1 binding sites in Z.mays.
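
The idea of treating motifs as words over the IUPAC alphabet can be sketched as follows. This is a minimal illustration, not BLSSpeller itself: it only reports which species' promoters contain a match for a degenerate motif, and does not compute the branch length score, which needs a phylogenetic tree. The function names and the species labels in the usage note are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes mapped to regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(motif: str) -> "re.Pattern[str]":
    """Compile an IUPAC word into an equivalent regular expression."""
    return re.compile("".join(IUPAC[ch] for ch in motif.upper()))


def species_with_motif(motif: str, promoters: dict) -> list:
    """Return the sorted names of species whose promoter contains the motif."""
    pattern = iupac_to_regex(motif)
    return sorted(name for name, seq in promoters.items()
                  if pattern.search(seq.upper()))
```

Given promoters {"osa": "TTGACGTCA", "zma": "AATGACATCA", "sbi": "CCCCCC"}, the motif TGACRTCA matches in "osa" and "zma" because R stands for A or G. A conservation score would then be derived from the set of species in which the motif occurs.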

Availability and implementation: BLSSpeller was written in Java. Source code and manual are available at http://bioinformatics.intec.ugent.be/blsspeller

Contact: Klaas.Vandepoele@psb.vib-ugent.be or jan.fostier@intec.ugent.be

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

LoopIng: a template-based tool for predicting the structure of protein loops

Fri, 11/20/2015 - 02:05

Motivation: Predicting the structure of protein loops is very challenging, mainly because they are not necessarily subject to strong evolutionary pressure. This implies that, unlike the rest of the protein, standard homology modeling techniques are not very effective in modeling their structure. However, loops are often involved in protein function, hence inferring their structure is important for predicting protein structure as well as function.

Results: We describe a method, LoopIng, based on the Random Forest automated learning technique, which, given a target loop, selects a structural template for it from a database of loop candidates. Compared to the most recently available methods, LoopIng is able to achieve similar accuracy for short loops (4–10 residues) and significant enhancements for long loops (11–20 residues). The quality of the predictions is robust to errors that unavoidably affect the stem regions when these are modeled. The method returns a confidence score for the predicted template loops and has the advantage of being very fast (on average: 1 min/loop).

Availability and implementation: www.biocomputing.it/looping

Contact: anna.tramontano@uniroma1.it

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

Accurate disulfide-bonding network predictions improve ab initio structure prediction of cysteine-rich proteins

Fri, 11/20/2015 - 02:05

Motivation: Cysteine-rich proteins cover many important families in nature but there are currently no methods specifically designed for modeling the structure of these proteins. The accuracy of disulfide connectivity pattern prediction, particularly for the proteins of higher-order connections, e.g. >3 bonds, is too low to effectively assist structure assembly simulations.

Results: We propose a new hierarchical order reduction protocol called Cyscon for disulfide-bonding prediction. The most confident disulfide bonds are first identified and bonding prediction is then focused on the remaining cysteine residues based on SVR training. Compared with purely machine learning-based approaches, Cyscon improved the average accuracy of connectivity pattern prediction by 21.9%. For proteins with more than 5 disulfide bonds, Cyscon improved the accuracy by 585% on the benchmark set of PDBCYS. When applied to 158 non-redundant cysteine-rich proteins, Cyscon predictions helped increase (or decrease) the TM-score (or RMSD) of the ab initio QUARK modeling by 12.1% (or 14.4%). This result demonstrates a new avenue to improve the ab initio structure modeling for cysteine-rich proteins.
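
One way to see why higher-order connectivity is hard to predict is to count the candidate patterns. The sketch below assumes every cysteine takes part in exactly one bond, so that n bonds pair up 2n cysteines; the number of possible pairings is then the double factorial (2n - 1)!!. The function name is hypothetical and the calculation is a general combinatorial fact, not part of Cyscon.

```python
def connectivity_patterns(n_bonds: int) -> int:
    """Number of ways to pair 2 * n_bonds cysteines into disulfide bonds,
    i.e. the double factorial (2n - 1)!! = 1 * 3 * 5 * ... * (2n - 1)."""
    count = 1
    for k in range(1, 2 * n_bonds, 2):
        count *= k
    return count
```

Three bonds allow 15 patterns, whereas five bonds already allow 945, which is why fixing the most confident bonds first shrinks the remaining search space so sharply.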

Availability and implementation: http://www.csbio.sjtu.edu.cn/bioinf/Cyscon/

Contact: zhng@umich.edu or hbshen@sjtu.edu.cn

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

Improving protein fold recognition with hybrid profiles combining sequence and structure evolution

Fri, 11/20/2015 - 02:05

Motivation: Template-based modeling, the most successful approach for predicting protein 3D structure, often requires detecting distant evolutionary relationships between the target sequence and proteins of known structure. Developed for this purpose, fold recognition methods use elaborate strategies to exploit evolutionary information, mainly by encoding amino acid sequence into profiles. Since protein structure is more conserved than sequence, the inclusion of structural information can improve the detection of remote homology.

Results: Here, we present ORION, a new fold recognition method based on the pairwise comparison of hybrid profiles that contain evolutionary information from both protein sequence and structure. Our method uses the 16-state structural alphabet Protein Blocks, which provides an accurate 1D description of protein structure local conformations. ORION systematically outperforms PSI-BLAST and HHsearch on several benchmarks, including target sequences from the modeling competitions CASP8, 9 and 10, and detects ~10% more templates at fold and superfamily SCOP levels.

Availability: Software freely available for download at http://www.dsimb.inserm.fr/orion/.

Contact: jean-christophe.gelly@univ-paris-diderot.fr

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

PBAP: a pipeline for file processing and quality control of pedigree data with dense genetic markers

Fri, 11/20/2015 - 02:05

Motivation: Huge genetic datasets with dense marker panels are now common. With the availability of sequence data and recognition of the importance of rare variants, smaller studies based on pedigrees are once again common. Pedigree-based samples often start with a dense marker panel, a subset of which may be used for linkage analysis to reduce computational burden and to limit linkage disequilibrium between single-nucleotide polymorphisms (SNPs). Programs attempting to select markers for linkage panels exist but lack flexibility.

Results: We developed a pedigree-based analysis pipeline (PBAP) suite of programs geared towards SNPs and sequence data. PBAP performs quality control, marker selection and file preparation. PBAP sets up files for MORGAN, which can handle analyses for small and large pedigrees, typically human, and results can be used with other programs and for downstream analyses. We evaluate and illustrate its features with two real datasets.

Availability and implementation: PBAP scripts may be downloaded from http://faculty.washington.edu/wijsman/software.shtml.

Contact: wijsman@uw.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

Identifying kinase dependency in cancer cells by integrating high-throughput drug screening and kinase inhibition data

Fri, 11/20/2015 - 02:05

Motivation: Targeted kinase inhibitors have dramatically improved cancer treatment, but kinase dependency for an individual patient or cancer cell can be challenging to predict. Kinase dependency does not always correspond with gene expression and mutation status. High-throughput drug screens are powerful tools for determining kinase dependency, but drug polypharmacology can make results difficult to interpret.

Results: We developed Kinase Addiction Ranker (KAR), an algorithm that integrates high-throughput drug screening data, comprehensive kinase inhibition data and gene expression profiles to identify kinase dependency in cancer cells. We applied KAR to predict kinase dependency of 21 lung cancer cell lines and 151 leukemia patient samples using published datasets. We experimentally validated KAR predictions of FGFR and MTOR dependence in lung cancer cell line H1581, showing synergistic reduction in proliferation after combining ponatinib and AZD8055.

Availability and implementation: KAR can be downloaded as a Python function or a MATLAB script along with example inputs and outputs at: http://tanlab.ucdenver.edu/KAR/.

Contact: aikchoon.tan@ucdenver.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

A statistical approach to virtual cellular experiments: improved causal discovery using accumulation IDA (aIDA)

Fri, 11/20/2015 - 02:05

Motivation: We address the following question: Does inhibition of the expression of a gene X in a cellular assay affect the expression of another gene Y? Rather than inhibiting gene X experimentally, we aim at answering this question computationally, using observational gene expression data as the only input. Recently, a new statistical algorithm called Intervention calculus when the Directed acyclic graph is Absent (IDA) has been proposed for this problem. For several biological systems, IDA has been shown to outcompete regression-based methods with respect to the number of true positives versus the number of false positives for the top 5000 predicted effects. Further improvements in the performance of IDA have been realized by stability selection, a resampling method wrapped around IDA that enhances the discovery of true causal effects. Nevertheless, the rate of false positive and false negative predictions is still unsatisfactorily high.

Results: We introduce a new resampling approach for causal discovery called accumulation IDA (aIDA). We show that aIDA improves the performance of causal discoveries compared to existing variants of IDA on both simulated and real yeast data. The higher reliability of top causal effect predictions achieved by aIDA promises to increase the rate of success of wet lab intervention experiments for functional studies.

Availability and implementation: R code for aIDA is available in the Supplementary material.

Contact: franziska.taruttis@ur.de, julia.engelmann@ur.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

Impact of normalization methods on high-throughput screening data with high hit rates and drug testing with dose-response data

Fri, 11/20/2015 - 02:05

Motivation: Most data analysis tools for high-throughput screening (HTS) seek to uncover interesting hits for further analysis. They typically assume a low hit rate per plate. Hit rates can be dramatically higher in secondary screening, RNAi screening and in drug sensitivity testing using biologically active drugs. In particular, drug sensitivity testing on primary cells is often based on dose–response experiments, which pose a more stringent requirement for data quality and for intra- and inter-plate variation. Here, we compared common plate normalization and noise-reduction methods, including the B-score and Loess, a local polynomial fit method, under high hit-rate scenarios of drug sensitivity testing. We generated simulated 384-well plate HTS datasets, each with 71 plates having a range of 20 (5%) to 160 (42%) hits per plate, with controls placed either at the edge of the plates or in a scattered configuration.

Results: We identified 20% (77/384) as the critical hit-rate after which the normalizations started to perform poorly. Results from real drug testing experiments supported this estimation. In particular, the B-score resulted in incorrect normalization of high hit-rate plates, leading to poor data quality, which could be attributed to its dependency on the median polish algorithm. We conclude that a combination of a scattered layout of controls per plate and normalization using a polynomial least squares fit method, such as Loess, helps to reduce column, row and edge effects in HTS experiments with high hit-rates and is optimal for generating accurate dose–response curves.
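
The dependency of the B-score on median polish can be made concrete with a short sketch. It assumes the commonly used definition in which the B-score is the median-polish residual of a well divided by a scaled median absolute deviation (MAD) of all residuals on the plate; the constant 1.4826, the fixed iteration count and the function names are illustrative assumptions, not details taken from the paper or its R code.

```python
from statistics import median


def median_polish(plate, n_iter=10):
    """Return residuals after alternately removing row and column medians."""
    resid = [[float(v) for v in row] for row in plate]
    for _ in range(n_iter):
        # Remove the median of each row.
        for row in resid:
            m = median(row)
            for j in range(len(row)):
                row[j] -= m
        # Remove the median of each column.
        for j in range(len(resid[0])):
            m = median(row[j] for row in resid)
            for row in resid:
                row[j] -= m
    return resid


def b_scores(plate, n_iter=10):
    """Scale median-polish residuals by 1.4826 * MAD of the residuals."""
    resid = median_polish(plate, n_iter)
    flat = [v for row in resid for v in row]
    centre = median(flat)
    mad = median(abs(v - centre) for v in flat)
    if mad == 0:
        raise ValueError("MAD of residuals is zero; B-score is undefined")
    scale = 1.4826 * mad
    return [[v / scale for v in row] for row in resid]
```

Because row and column effects are estimated with medians, the fit presumes that most wells in every row and column are inactive. Once hits occupy a large share of a row or column, the medians start to track the hits themselves, which is consistent with the poor behaviour reported above for high hit-rate plates.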

Contact: john.mpindi@helsinki.fi

Availability and implementation, Supplementary information: R code and Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

OVA: integrating molecular and physical phenotype data from multiple biomedical domain ontologies with variant filtering for enhanced variant prioritization

Fri, 11/20/2015 - 02:05

Motivation: Exome sequencing has become a de facto standard method for Mendelian disease gene discovery in recent years, yet identifying disease-causing mutations among thousands of candidate variants remains a non-trivial task.

Results: Here we describe a new variant prioritization tool, OVA (ontology variant analysis), in which user-provided phenotypic information is exploited to infer deeper biological context. OVA combines a knowledge-based approach with a variant-filtering framework. It reduces the number of candidate variants by considering genotype and predicted effect on protein sequence, and scores the remainder on biological relevance to the query phenotype.

We take advantage of several ontologies in order to bridge knowledge across multiple biomedical domains and facilitate computational analysis of annotations pertaining to genes, diseases, phenotypes, tissues and pathways. In this way, OVA combines information regarding molecular and physical phenotypes and integrates both human and model organism data to effectively prioritize variants. By assessing performance on both known and novel disease mutations, we show that OVA performs biologically meaningful candidate variant prioritization and can be more accurate than another recently published candidate variant prioritization tool.

Availability and implementation: OVA is freely accessible at http://dna2.leeds.ac.uk:8080/OVA/index.jsp

Supplementary information: Supplementary data are available at Bioinformatics online.

Contact: umaan@leeds.ac.uk

Categories: Journal Articles

cgmisc: enhanced genome-wide association analyses and visualization

Fri, 11/20/2015 - 02:05

Summary: High-throughput genotyping and sequencing technologies facilitate studies of complex genetic traits and provide new research opportunities. The increasing popularity of genome-wide association studies (GWAS) leads to the discovery of new associated loci and a better understanding of the genetic architecture underlying not only diseases, but also other monogenic and complex phenotypes. Several software packages are available for performing GWAS analyses, the R environment being one of them.

Results: We present cgmisc, an R package that enables enhanced data analysis and visualization of results from GWAS. The package contains several utilities and modules that complement and enhance the functionality of the existing software. It also provides several tools for advanced visualization of genomic data and utilizes the power of the R language to aid in preparation of publication-quality figures. Some of the package functions are specific to domestic dog (Canis familiaris) data.

Availability and implementation: The package is operating system-independent and is available from: https://github.com/cgmisc-team/cgmisc

Contact: marcin.kierczak@imbim.uu.se

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

MICC: an R package for identifying chromatin interactions from ChIA-PET data

Fri, 11/20/2015 - 02:05

Summary: ChIA-PET is rapidly emerging as an important experimental approach to detect chromatin long-range interactions at high resolution. Here, we present Model based Interaction Calling from ChIA-PET data (MICC), an easy-to-use R package to detect chromatin interactions from ChIA-PET sequencing data. By applying a Bayesian mixture model to systematically remove random ligation and random collision noise, MICC could identify chromatin interactions with a significantly higher sensitivity than existing methods at the same false discovery rate.

Availability and implementation: http://bioinfo.au.tsinghua.edu.cn/member/xwwang/MICCusage

Contact: michael.zhang@utdallas.edu or xwwang@tsinghua.edu.cn

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

Proteny: discovering and visualizing statistically significant syntenic clusters at the proteome level

Tue, 10/20/2015 - 09:50

Background: With more and more genomes being sequenced, detecting synteny between genomes becomes more and more important. However, for microorganisms the genomic divergence quickly becomes large, resulting in different codon usage and shuffling of gene order and gene elements such as exons.

Results: We present Proteny, a methodology to detect synteny between diverged genomes. It operates on the amino acid sequence level to be insensitive to codon usage adaptations and clusters groups of exons disregarding order to handle diversity in genomic ordering between genomes. Furthermore, Proteny assigns significance levels to the syntenic clusters such that they can be selected on statistical grounds. Finally, Proteny provides novel ways to visualize results at different scales, facilitating the exploration and interpretation of syntenic regions. We test the performance of Proteny on a standard ground truth dataset, and we illustrate the use of Proteny on two closely related genomes (two different strains of Aspergillus niger) and on two distant genomes (two species of Basidiomycota). In comparison to other tools, we find that Proteny finds clusters with more true homologies in fewer clusters that contain more genes, i.e. Proteny is able to identify a more consistent synteny. Further, we show how genome rearrangements, assembly errors, gene duplications and the conservation of specific genes can be easily studied with Proteny.

Availability and implementation: Proteny is freely available at the Delft Bioinformatics Lab website http://bioinformatics.tudelft.nl/dbl/software.

Contact: t.gehrmann@tudelft.nl

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

A DNA shape-based regulatory score improves position-weight matrix-based recognition of transcription factor binding sites

Tue, 10/20/2015 - 09:50

Motivation: The position-weight matrix (PWM) is a useful representation of a transcription factor binding site (TFBS) sequence pattern because the PWM can be estimated from a small number of representative TFBS sequences. However, because the PWM probability model assumes independence between individual nucleotide positions, the PWMs for some TFs poorly discriminate binding sites from non-binding-sites that have similar sequence content. Since the local three-dimensional DNA structure (‘shape’) is a determinant of TF binding specificity and since DNA shape has a significant sequence-dependence, we combined DNA shape-derived features into a TF-generalized regulatory score and tested whether the score could improve PWM-based discrimination of TFBS from non-binding-sites.

Results: We compared a traditional PWM model to a model that combines the PWM with a DNA shape feature-based regulatory potential score, for accuracy in detecting binding sites for 75 vertebrate transcription factors. The PWM + shape model was more accurate than the PWM-only model, for 45% of TFs tested, with no significant loss of accuracy for the remaining TFs.
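
The PWM-only baseline can be sketched in a few lines. The code below builds a log-odds matrix from aligned example sites and scores a candidate by summing one term per position; that per-position sum is exactly where the independence assumption enters. The pseudocount, the uniform background and the function names are illustrative assumptions, and the DNA shape features of the combined model are not reproduced here.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds position-weight matrix from equal-length sites."""
    length = len(sites[0])
    pwm = []
    for i in range(length):
        column = [site[i] for site in sites]
        total = len(column) + 4 * pseudocount
        pwm.append({
            base: math.log2(
                ((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def score_site(pwm, site):
    """Sum the per-position log-odds; positions contribute independently."""
    return sum(column[base] for column, base in zip(pwm, site))
```

With eight copies of the site "AA" and a pseudocount of 1, each A contributes log2(3) and each mismatch contributes log2(1/3), so "AA" scores 2 * log2(3) and "AC" scores 0. No term depends on a neighbouring base, which is the limitation a shape-based score is meant to address.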

Availability and implementation: The shape-based model is available as an open-source R package that is archived on the GitHub software repository at https://github.com/ramseylab/regshape/.

Contact: stephen.ramsey@oregonstate.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

Estimating beta diversity for under-sampled communities using the variably weighted Odum dissimilarity index and OTUshuff

Tue, 10/20/2015 - 09:50

Motivation: In profiling the composition and structure of complex microbial communities via high throughput amplicon sequencing, a very low proportion of community members are typically sampled. As a result of this incomplete sampling, estimates of dissimilarity between communities are often inflated, an issue we term pseudo β-diversity.

Results: We present a set of tools to identify and correct for the presence of pseudo β-diversity in contrasts between microbial communities. The variably weighted Odum dissimilarity (DwOdum) allows for down-weighting the influence of either abundant or rare taxa in calculating a measure of similarity between two communities. We show that down-weighting the influence of rare taxa can be used to minimize pseudo β-diversity arising from incomplete sampling. Down-weighting the influence of abundant taxa can increase the sensitivity of hypothesis testing. OTUshuff is an associated test for identifying the presence of pseudo β-diversity in pairwise community contrasts.
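
For orientation, the unweighted starting point can be sketched as below, under the assumption that the Odum dissimilarity is the measure also widely known as Bray-Curtis: the summed absolute abundance differences divided by the summed abundances. The variable weighting that defines DwOdum and the OTUshuff test are not reproduced, and the function name is hypothetical.

```python
def odum_dissimilarity(a, b):
    """Unweighted Odum (Bray-Curtis) dissimilarity between two
    taxon abundance vectors listed in the same taxon order."""
    if len(a) != len(b):
        raise ValueError("abundance vectors must have the same length")
    total = sum(a) + sum(b)
    if total == 0:
        raise ValueError("both communities are empty")
    return sum(abs(x - y) for x, y in zip(a, b)) / total
```

Two samples drawn from the same community, such as [50, 30, 1, 0] and [50, 30, 0, 1], share every abundant taxon yet still give a small nonzero value because each picked up a different rare taxon by chance. That is the kind of inflation the abstract calls pseudo β-diversity, and it motivates down-weighting rare taxa.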

Availability and implementation: A Perl script for calculating the DwOdum score from a taxon abundance table and performing pairwise contrasts with OTUshuff can be obtained at http://www.ars.usda.gov/services/software/software.htm?modecode=30-12-10-00.

Contact: daniel.manter@ars.usda.gov

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

Functional classification of CATH superfamilies: a domain-based approach for protein function annotation

Tue, 10/20/2015 - 09:50

Motivation: Computational approaches that can predict protein functions are essential to bridge the widening function annotation gap especially since <1.0% of all proteins in UniProtKB have been experimentally characterized. We present a domain-based method for protein function classification and prediction of functional sites that exploits functional sub-classification of CATH superfamilies. The superfamilies are sub-classified into functional families (FunFams) using a hierarchical clustering algorithm supervised by a new classification method, FunFHMMer.

Results: FunFHMMer generates more functionally coherent groupings of protein sequences than other domain-based protein classifications. This has been validated using known functional information. The conserved positions predicted by the FunFams are also found to be enriched in known functional residues. Moreover, the functional annotations provided by the FunFams are found to be more precise than other domain-based resources. FunFHMMer currently identifies 110 439 FunFams in 2735 superfamilies which can be used to functionally annotate > 16 million domain sequences.

Availability and implementation: All FunFam annotation data are made available through the CATH webpages (http://www.cathdb.info). The FunFHMMer webserver (http://www.cathdb.info/search/by_funfhmmer) allows users to submit query sequences for assignment to a CATH FunFam.

Contact: sayoni.das.12@ucl.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

ERGC: an efficient referential genome compression algorithm

Tue, 10/20/2015 - 09:50

Motivation: Genome sequencing has become faster and more affordable. Consequently, the number of available complete genomic sequences is increasing rapidly. As a result, the cost to store, process, analyze and transmit the data is becoming a bottleneck for research and future medical applications. The need for efficient data compression and data reduction techniques for biological sequencing data is therefore growing by the day. Although a number of standard data compression algorithms exist, they are not efficient in compressing biological data. These generic algorithms do not exploit some inherent properties of the sequencing data while compressing. To exploit statistical and information-theoretic properties of genomic sequences, we need specialized compression algorithms. Five different next-generation sequencing data compression problems have been identified and studied in the literature. We propose a novel algorithm for one of these problems known as reference-based genome compression.

Results: We have done extensive experiments using five real sequencing datasets. The results on real genomes show that our proposed algorithm is indeed competitive and performs better than the best known algorithms for this problem. It achieves compression ratios that are better than those of the currently best performing algorithms. The time to compress and decompress the whole genome is also very promising.

Availability and implementation: The implementations are freely available for non-commercial purposes. They can be downloaded from http://engr.uconn.edu/~rajasek/ERGC.zip.

Contact: rajasek@engr.uconn.edu

Categories: Journal Articles

Error filtering, pair assembly and error correction for next-generation sequencing reads

Tue, 10/20/2015 - 09:50

Motivation: Next-generation sequencing produces vast amounts of data with errors that are difficult to distinguish from true biological variation when coverage is low.

Results: We demonstrate large reductions in error frequencies, especially for high-error-rate reads, by three independent means: (i) filtering reads according to their expected number of errors, (ii) assembling overlapping read pairs and (iii) for amplicon reads, by exploiting unique sequence abundances to perform error correction. We also show that most published paired read assemblers calculate incorrect posterior quality scores.
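
Step (i) can be sketched directly from the definition of Phred scores: a base with quality Q has error probability 10^(-Q/10), and the expected number of errors in a read is the sum of these probabilities. The sketch assumes Phred+33 ASCII encoding; the threshold of 1.0 and the function names are illustrative assumptions and this is not the USEARCH implementation.

```python
def expected_errors(quality: str, offset: int = 33) -> float:
    """Expected number of errors in a read, from its quality string.

    Each character encodes a Phred score Q = ord(ch) - offset, and the
    corresponding error probability is 10 ** (-Q / 10).
    """
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in quality)


def filter_reads(reads, max_ee: float = 1.0):
    """Keep (sequence, quality) pairs with at most max_ee expected errors."""
    return [(seq, qual) for seq, qual in reads
            if expected_errors(qual) <= max_ee]
```

For the quality string "II5+" (Q40, Q40, Q20, Q10) the expected number of errors is 0.0001 + 0.0001 + 0.01 + 0.1 = 0.1102. Unlike an average quality score, this sum is dominated by the few worst bases, so it penalizes reads in proportion to how many errors they are likely to contain.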

Availability and implementation: These methods are implemented in the USEARCH package. Binaries are freely available at http://drive5.com/usearch.

Contact: robert@drive5.com

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles

JASSA: a comprehensive tool for prediction of SUMOylation sites and SIMs

Tue, 10/20/2015 - 09:50

Motivation: Post-translational modification by the Small Ubiquitin-like Modifier (SUMO) proteins, a process termed SUMOylation, is involved in many fundamental cellular processes. SUMO proteins are conjugated to a protein substrate, creating an interface for the recruitment of cofactors harboring SUMO-interacting motifs (SIMs). Mapping both SUMO-conjugation sites and SIMs is required to study the functional consequence of SUMOylation. To define the best candidate sites for experimental validation we designed JASSA, a Joint Analyzer of SUMOylation site and SIMs.

Results: JASSA is a predictor that uses a scoring system based on a Position Frequency Matrix derived from the alignment of experimental SUMOylation sites or SIMs. Compared with existing web tools, JASSA performs on par or better. Novel features were implemented to support a better evaluation of the predictions, including identification of database hits matching the query sequence and representation of candidate sites within the secondary structural elements and/or the 3D fold of the protein of interest, retrievable from deposited PDB files.

Availability and Implementation: JASSA is freely accessible at http://www.jassa.fr/. Website is implemented in PHP and MySQL, with all major browsers supported.

Contact: guillaume.beauclair@inserm.fr

Supplementary information: Supplementary data are available at Bioinformatics online.

Categories: Journal Articles