Bioinformatics Journal

Bioinformatics - RSS feed of current issue
  • Three minimal sequences found in Ebola virus genomes and absent from human DNA
    [Jul 2015]

    Motivation: Ebola virus causes high-mortality hemorrhagic fevers, with more than 25 000 cases and 10 000 deaths in the current outbreak. Only experimental therapies are available; thus, novel diagnostic tools and druggable targets are needed.

    Results: Analysis of Ebola virus genomes from the current outbreak reveals the presence of short DNA sequences that appear nowhere in the human genome. We identify the shortest such sequences with lengths between 12 and 14. Only three absent sequences of length 12 exist and they consistently appear at the same location on two of the Ebola virus proteins, in all Ebola virus genomes, but nowhere in the human genome. The alignment-free method used is able to identify pathogen-specific signatures for quick and precise action against infectious agents, of which the current Ebola virus outbreak provides a compelling example.

    Availability and Implementation: EAGLE is freely available for non-commercial purposes at http://bioinformatics.ua.pt/software/eagle.

    Contact: raquelsilva@ua.pt; pratas@ua.pt

    Supplementary Information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
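
    The idea of sequences that occur in every pathogen genome but nowhere in the host can be illustrated with plain set operations on k-mers. The sketch below is not the EAGLE implementation; the function names are hypothetical, the toy strings stand in for real genomes, and reverse complements are ignored for brevity.

```python
from typing import Iterable, Optional, Set, Tuple


def kmers(sequence: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of a sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def pathogen_specific_kmers(pathogen_genomes: Iterable[str],
                            host_genome: str, k: int) -> Set[str]:
    """k-mers present in every pathogen genome and absent from the host."""
    genomes = list(pathogen_genomes)
    if not genomes:
        return set()
    shared = kmers(genomes[0], k)
    for genome in genomes[1:]:
        shared &= kmers(genome, k)      # keep only k-mers seen in all genomes
    return shared - kmers(host_genome, k)


def shortest_specific_kmers(pathogen_genomes: Iterable[str], host_genome: str,
                            max_k: int) -> Tuple[Optional[int], Set[str]]:
    """Smallest k (up to max_k) that yields at least one specific k-mer."""
    genomes = list(pathogen_genomes)
    for k in range(1, max_k + 1):
        found = pathogen_specific_kmers(genomes, host_genome, k)
        if found:
            return k, found
    return None, set()


if __name__ == "__main__":
    toy_pathogens = ["ACGTTGCA", "GGACGTTT"]   # toy data, not real genomes
    toy_host = "ACGAAACGTAAA"
    print(shortest_specific_kmers(toy_pathogens, toy_host, max_k=4))
```

    Holding the k-mer set of a full human genome in memory this way is impractical at k = 12 to 14, so a real tool would need a more compact index; the sketch only shows the logic.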
  • Learning chromatin states with factorized information criteria
    [Jul 2015]

    Motivation: Recent studies have suggested that both the genome and the genome with epigenetic modifications, the so-called epigenome, play important roles in various biological functions, such as transcription and DNA replication, repair, and recombination. It is well known that specific combinations of histone modifications (e.g. methylations and acetylations) of nucleosomes induce chromatin states that correspond to specific functions of chromatin. Although the advent of next-generation sequencing (NGS) technologies enables measurement of epigenetic information for entire genomes at high-resolution, the variety of chromatin states has not been completely characterized.

    Results: In this study, we propose a method to estimate the chromatin states indicated by genome-wide chromatin marks identified by NGS technologies. The proposed method automatically estimates the number of chromatin states and characterizes each state on the basis of a hidden Markov model (HMM) in combination with a recently proposed model selection technique, factorized information criteria. The method is expected to provide an unbiased model because it relies on only two adjustable parameters and avoids heuristic procedures as much as possible. Computational experiments with simulated datasets show that our method automatically learns an appropriate model, even in cases where methods that rely on Bayesian information criteria fail to learn the model structures. In addition, we comprehensively compare our method to ChromHMM on three real datasets and show that our method estimates more chromatin states than ChromHMM for those datasets.

    Availability and implementation: The details of the characterized chromatin states are available in the Supplementary information. The program is available on request.

    Contact: mhamada@waseda.jp

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • DISSCO: direct imputation of summary statistics allowing covariates
    [Jul 2015]

    Background: Imputation of individual level genotypes at untyped markers using an external reference panel of genotyped or sequenced individuals has become standard practice in genetic association studies. Direct imputation of summary statistics can also be valuable, for example in meta-analyses where individual level genotype data are not available. Two methods (DIST and ImpG-Summary/LD) that assume a multivariate Gaussian distribution for the association summary statistics have been proposed for imputing association summary statistics. However, both methods assume that the correlations between association summary statistics are the same as the correlations between the corresponding genotypes. This assumption can be violated in the presence of confounding covariates.

    Methods: We analytically show that in the absence of covariates, correlation among association summary statistics is indeed the same as that among the corresponding genotypes, thus serving as a theoretical justification for the recently proposed methods. We continue to prove that in the presence of covariates, correlation among association summary statistics becomes the partial correlation of the corresponding genotypes controlling for covariates. We therefore develop direct imputation of summary statistics allowing covariates (DISSCO).

    Results: We consider two real-life scenarios where the difference between correlation and partial correlation is likely to matter in practice: (i) association studies in admixed populations; (ii) association studies in the presence of other confounding covariate(s). Application of DISSCO to real datasets under both scenarios shows at least comparable, if not better, performance compared with existing correlation-based methods, particularly for lower frequency variants. For example, DISSCO can reduce the absolute deviation from the truth by 3.9–15.2% for variants with minor allele frequency <5%.

    Availability and implementation: http://www.unc.edu/~yunmli/DISSCO.

    Contact: yunli@med.unc.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
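
    The distinction between correlation and partial correlation can be made concrete with a small calculation. The sketch below is not DISSCO itself; it uses the standard first-order formula for a single covariate, and the variable names and toy data are illustrative assumptions.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation between two equally long vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(x: Sequence[float], y: Sequence[float],
                        z: Sequence[float]) -> float:
    """Correlation between x and y after controlling for one covariate z."""
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


if __name__ == "__main__":
    # Toy data: a covariate (for example an ancestry score) drives both markers.
    covariate = [-1, -1, -1, -1, 1, 1, 1, 1]
    noise_a = [-1, -1, 1, 1, -1, -1, 1, 1]
    noise_b = [-1, 1, -1, 1, -1, 1, -1, 1]
    marker_a = [c + e for c, e in zip(covariate, noise_a)]
    marker_b = [c + e for c, e in zip(covariate, noise_b)]
    print(pearson(marker_a, marker_b))                         # raw correlation
    print(partial_correlation(marker_a, marker_b, covariate))  # after adjustment
```

    In this toy case the two markers look correlated only because both track the covariate; once it is controlled for, the association disappears, which is the situation the abstract describes for confounded studies.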
  • MEDUSA: a multi-draft based scaffolder
    [Jul 2015]

    Motivation: Completing the genome sequence of an organism is an important task in comparative, functional and structural genomics. However, this remains a challenging issue from both a computational and an experimental viewpoint. Genome scaffolding (i.e. the process of ordering and orientating contigs) of de novo assemblies usually represents the first step in most genome finishing pipelines.

    Results: In this article we present MeDuSa (Multi-Draft based Scaffolder), an algorithm for genome scaffolding. MeDuSa exploits information obtained from a set of (draft or closed) genomes from related organisms to determine the correct order and orientation of the contigs. MeDuSa formalizes the scaffolding problem by means of a combinatorial optimization formulation on graphs and implements an efficient constant factor approximation algorithm to solve it. In contrast to currently used scaffolders, it does not require either prior knowledge on the microorganisms dataset under analysis (e.g. their phylogenetic relationships) or the availability of paired end read libraries. This makes usability and running time two additional important features of our method. Moreover, benchmarks and tests on real bacterial datasets showed that MeDuSa is highly accurate and, in most cases, outperforms traditional scaffolders. The possibility to use MeDuSa on eukaryotic datasets has also been evaluated, leading to interesting results.

    Availability and implementation: MeDuSa web server: http://combo.dbe.unifi.it/medusa. A stand-alone version of the software can be downloaded from https://github.com/combogenomics/medusa/releases. All results presented in this work have been obtained with MeDuSa v. 1.3.

    Contact: marco.fondi@unifi.it

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • DEOD: uncovering dominant effects of cancer-driver genes based on a partial covariance selection method
    [Jul 2015]

    Motivation: The generation of a large volume of cancer genomes has allowed us to identify disease-related alterations more accurately, which is expected to enhance our understanding regarding the mechanism of cancer development. With genomic alterations detected, one challenge is to pinpoint cancer-driver genes that cause functional abnormalities.

    Results: Here, we propose a method for uncovering the dominant effects of cancer-driver genes (DEOD) based on a partial covariance selection approach. Inspired by a convex optimization technique, it estimates the dominant effects of candidate cancer-driver genes on the expression level changes of their target genes. It constructs a gene network as a directed-weighted graph by integrating DNA copy numbers, single nucleotide mutations and gene expressions from matched tumor samples, and estimates partial covariances between driver genes and their target genes. Then, a scoring function to measure the cancer-driver score for each gene is applied. To test the performance of DEOD, a novel scheme is designed for simulating conditional multivariate normal variables (targets and free genes) given a group of variables (driver genes). When we applied the DEOD method to both the simulated data and breast cancer data, DEOD successfully uncovered driver variables in the simulation data, and identified well-known oncogenes in breast cancer. In addition, two highly ranked genes by DEOD were related to survival time. The copy number amplifications of MYC (8q24.21) and TRPS1 (8q23.3) were closely related to the survival time with P-values = 0.00246 and 0.00092, respectively. The results demonstrate that DEOD can efficiently uncover cancer-driver genes.

    Availability and implementation: DEOD was implemented in Matlab, and source codes and data are available at http://combio.gist.ac.kr/softwares/.

    Contact: hyunjulee@gist.ac.kr

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Power and sample-size estimation for microbiome studies using pairwise distances and PERMANOVA
    [Jul 2015]

    Motivation: The variation in community composition between microbiome samples, termed beta diversity, can be measured by pairwise distance based on either presence–absence or quantitative species abundance data. PERMANOVA, a permutation-based extension of multivariate analysis of variance to a matrix of pairwise distances, partitions within-group and between-group distances to permit assessment of the effect of an exposure or intervention (grouping factor) upon the sampled microbiome. Within-group distance and exposure/intervention effect size must be accurately modeled to estimate statistical power for a microbiome study that will be analyzed with pairwise distances and PERMANOVA.

    Results: We present a framework for PERMANOVA power estimation tailored to marker-gene microbiome studies that will be analyzed by pairwise distances, which includes: (i) a novel method for distance matrix simulation that permits modeling of within-group pairwise distances according to pre-specified population parameters; (ii) a method to incorporate effects of different sizes within the simulated distance matrix; (iii) a simulation-based method for estimating PERMANOVA power from simulated distance matrices; and (iv) an R statistical software package that implements the above. Matrices of pairwise distances can be efficiently simulated to satisfy the triangle inequality and incorporate group-level effects, which are quantified by the adjusted coefficient of determination, omega-squared (ω²). From simulated distance matrices, available PERMANOVA power or necessary sample size can be estimated for a planned microbiome study.

    Availability and implementation: http://github.com/brendankelly/micropower.

    Contact: brendank@mail.med.upenn.edu or hongzhe@upenn.edu

    Categories: Journal Articles
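
    The quantities named in this abstract can be computed directly from a distance matrix. The sketch below is written in Python for illustration and is not the micropower R package; it uses the usual PERMANOVA partition of squared distances and the standard ANOVA-style omega-squared, and all function names are hypothetical.

```python
import random
from itertools import combinations
from typing import List, Sequence, Tuple


def sums_of_squares(dist: Sequence[Sequence[float]],
                    groups: Sequence[str]) -> Tuple[float, float]:
    """Total and within-group sums of squares from pairwise distances."""
    n = len(groups)
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    ss_within = 0.0
    for label in set(groups):
        members = [i for i, g in enumerate(groups) if g == label]
        ss_within += sum(dist[i][j] ** 2
                         for i, j in combinations(members, 2)) / len(members)
    return ss_total, ss_within


def pseudo_f(dist: Sequence[Sequence[float]], groups: Sequence[str]) -> float:
    """PERMANOVA pseudo-F statistic for a one-way grouping."""
    n, a = len(groups), len(set(groups))
    ss_total, ss_within = sums_of_squares(dist, groups)
    return ((ss_total - ss_within) / (a - 1)) / (ss_within / (n - a))


def omega_squared(dist: Sequence[Sequence[float]],
                  groups: Sequence[str]) -> float:
    """Adjusted coefficient of determination for the grouping factor."""
    n, a = len(groups), len(set(groups))
    ss_total, ss_within = sums_of_squares(dist, groups)
    ms_within = ss_within / (n - a)
    return (ss_total - ss_within - (a - 1) * ms_within) / (ss_total + ms_within)


def permanova_p_value(dist: Sequence[Sequence[float]], groups: Sequence[str],
                      permutations: int = 999, seed: int = 0) -> float:
    """Permutation p-value: share of label shuffles with F >= observed F."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, groups)
    shuffled: List[str] = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (permutations + 1)


if __name__ == "__main__":
    points = [0.0, 1.0, 10.0, 11.0]                     # toy one-dimensional samples
    dist = [[abs(p - q) for q in points] for p in points]
    labels = ["A", "A", "B", "B"]
    print(pseudo_f(dist, labels), omega_squared(dist, labels))
    print(permanova_p_value(dist, labels, permutations=99))
```

    Power estimation would then repeat the permutation test over many simulated distance matrices and report the fraction of runs with a p-value below the chosen significance level; that outer loop is omitted here.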
  • A statistical physics perspective on alignment-independent protein sequence comparison
    [Jul 2015]

    Motivation: Within bioinformatics, the textual alignment of amino acid sequences has long dominated the determination of similarity between proteins, with all that implies for shared structure, function and evolutionary descent. Despite the relative success of modern-day sequence alignment algorithms, so-called alignment-free approaches offer a complementary means of determining and expressing similarity, with potential benefits in certain key applications, such as regression analysis of protein structure-function studies, where alignment-based similarity has performed poorly.

    Results: Here, we offer a fresh, statistical physics-based perspective focusing on the question of alignment-free comparison, in the process adapting results from ‘first passage probability distribution’ to summarize statistics of ensemble averaged amino acid propensity values. In this article, we introduce and elaborate this approach.

    Contact: d.r.flower@aston.ac.uk

    Categories: Journal Articles
  • HAlign: Fast multiple similar DNA/RNA sequence alignment based on the centre star strategy
    [Jul 2015]

    Motivation: Multiple sequence alignment (MSA) is important work, but bottlenecks arise in the massive MSA of homologous DNA or genome sequences. Most of the available state-of-the-art software tools cannot address large-scale datasets, or they run rather slowly. The similarity of homologous DNA sequences is often ignored. Lack of parallelization is still a challenge for MSA research.

    Results: We developed two software tools to address the DNA MSA problem. The first employed trie trees to accelerate the centre star MSA strategy. The expected time complexity was reduced from quadratic to linear. To address large-scale data, parallelism was applied using the Hadoop platform. Experiments demonstrated the performance of our proposed methods, including their running time, sum-of-pairs scores and scalability. Moreover, we supplied two massive DNA/RNA MSA datasets for further testing and research.

    Availability and implementation: The codes, tools and data are accessible free of charge at http://datamining.xmu.edu.cn/software/halign/.

    Contact: zouquan@nclab.net or ghwang@hit.edu.cn

    Categories: Journal Articles
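
    In the centre star strategy, one sequence is chosen as the centre and every other sequence is aligned against it. The sketch below shows only the textbook centre-selection step, using edit distance as an assumed stand-in for the pairwise score; it is the quadratic baseline, not HAlign's trie-accelerated version, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    previous: List[int] = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def choose_centre(sequences: Sequence[str]) -> Tuple[int, int]:
    """Index of the sequence with the smallest summed distance to the rest."""
    best_index, best_total = -1, -1
    for i, candidate in enumerate(sequences):
        total = sum(edit_distance(candidate, other)
                    for j, other in enumerate(sequences) if j != i)
        if best_index < 0 or total < best_total:
            best_index, best_total = i, total
    return best_index, best_total


if __name__ == "__main__":
    toy = ["ACGA", "ACGT", "ACCT", "TCGT"]   # toy sequences
    print(choose_centre(toy))
```

    Selecting the centre this way needs all pairwise comparisons, which is the cost the abstract reports reducing with trie trees.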
  • Halvade: scalable sequence analysis with MapReduce
    [Jul 2015]

    Motivation: Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine.

    Results: We present Halvade, a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK Best Practices recommendations, supporting both whole genome and whole exome sequencing. Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50x coverage) in <3 h with very high parallel efficiency. Even on a single, multi-core machine, Halvade attains a significant speedup compared with running the individual tools with multithreading.

    Availability and implementation: Halvade is written in Java and uses the Hadoop MapReduce 2.0 API. It supports a wide range of distributions of Hadoop, including Cloudera and Amazon EMR. Its source is available at http://bioinformatics.intec.ugent.be/halvade under GPL license.

    Contact: jan.fostier@intec.ugent.be

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • SPARSE: quadratic time simultaneous alignment and folding of RNAs without sequence-based heuristics
    [Jul 2015]

    Motivation: RNA-Seq experiments have revealed a multitude of novel ncRNAs. The gold standard for their analysis based on simultaneous alignment and folding suffers from extreme time complexity of O(n^6). Subsequently, numerous faster ‘Sankoff-style’ approaches have been suggested. Commonly, the performance of such methods relies on sequence-based heuristics that restrict the search space to optimal or near-optimal sequence alignments; however, the accuracy of sequence-based methods breaks down for RNAs with sequence identities below 60%. Alignment approaches like LocARNA that do not require sequence-based heuristics have been limited to high complexity (≥ quartic time).

    Results: Breaking this barrier, we introduce the novel Sankoff-style algorithm ‘sparsified prediction and alignment of RNAs based on their structure ensembles (SPARSE)’, which runs in quadratic time without sequence-based heuristics. To achieve this low complexity, on par with sequence alignment algorithms, SPARSE features strong sparsification based on structural properties of the RNA ensembles. Following PMcomp, SPARSE gains further speed-up from lightweight energy computation. Although all existing lightweight Sankoff-style methods restrict Sankoff’s original model by disallowing loop deletions and insertions, SPARSE transfers the Sankoff algorithm to the lightweight energy model completely for the first time. Compared with LocARNA, SPARSE achieves similar alignment and better folding quality in significantly less time (speedup: 3.7). At similar run-time, it aligns low sequence identity instances substantially more accurately than RAF, which uses sequence-based heuristics.

    Availability and implementation: SPARSE is freely available at http://www.bioinf.uni-freiburg.de/Software/SPARSE.

    Contact: backofen@informatik.uni-freiburg.de

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Assessing allele-specific expression across multiple tissues from RNA-seq read data
    [Jul 2015]

    Motivation: RNA sequencing enables allele-specific expression (ASE) studies that complement standard genotype expression studies for common variants and, importantly, also allow measuring the regulatory impact of rare variants. The Genotype-Tissue Expression (GTEx) project is collecting RNA-seq data on multiple tissues of the same set of individuals, and novel methods are required for the analysis of these data.

    Results: We present a statistical method to compare different patterns of ASE across tissues and to classify genetic variants according to their impact on the tissue-wide expression profile. We focus on strong ASE effects that we are expecting to see for protein-truncating variants, but our method can also be adjusted for other types of ASE effects. We illustrate the method with a real data example on a tissue-wide expression profile of a variant causal for lipoid proteinosis, and with a simulation study to assess our method more generally.

    Availability and implementation: http://www.well.ox.ac.uk/~rivas/mamba/. R-sources and data examples http://www.iki.fi/mpirinen/

    Contact: matti.pirinen@helsinki.fi or rivas@well.ox.ac.uk

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Identification of a small set of plasma signalling proteins using neural network for prediction of Alzheimer's disease
    [Jul 2015]

    Motivation: Alzheimer’s disease (AD) is a dementia that gets worse with time resulting in loss of memory and cognitive functions. The life expectancy of AD patients following diagnosis is ~7 years. In 2006, researchers estimated that 0.40% of the world population (range 0.17–0.89%) was afflicted by AD, and that the prevalence rate would be tripled by 2050. Usually, examination of brain tissues is required for definite diagnosis of AD. So, it is crucial to diagnose AD at an early stage via some alternative methods. As the brain controls many functions via releasing signalling proteins through blood, we analyse blood plasma proteins for diagnosis of AD.

    Results: Here, we use a radial basis function (RBF) network for feature selection, called the feature selection RBF (FSRBF) network, to select plasma proteins that can help diagnose AD. We have identified a set of plasma proteins, smaller than that of a previous study, with comparable prediction accuracy. We have also analysed mild cognitive impairment (MCI) samples with our selected proteins. We have used neural networks and support vector machines as classifiers. Principal component analysis, Sammon projection and a heat-map of the selected proteins have been used to demonstrate the proteins’ discriminating power for diagnosis of AD. We have also found a set of plasma signalling proteins that can distinguish incipient AD from MCI at an early stage. A literature survey strongly supports the AD diagnosis capability of the selected plasma proteins.

    Availability and implementation: The FSRBF code is available at https://sites.google.com/site/agarwalswapna/publications.

    Contact: agarwal.swapna@gmail.com or swapna_r@isical.ac.in

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Computational modeling of the expansion of human cord blood CD133+ hematopoietic stem/progenitor cells with different cytokine combinations
    [Jul 2015]

    Motivation: Many important problems in cell biology require dense non-linear interactions between functional modules to be considered. The importance of computer simulation in understanding cellular processes is now widely accepted, and a variety of simulation algorithms useful for studying certain subsystems have been designed. Expansion of hematopoietic stem and progenitor cells (HSC/HPC) in ex vivo culture with cytokines and small molecules is a method to increase the restricted numbers of stem cells found in umbilical cord blood (CB), while also enhancing the content of early engrafting neutrophil and platelet precursors. The efficacy of the expanded product depends on the composition of the cocktail of cytokines and small molecules used for culture. Testing the influence of a cytokine or small molecule on the expansion of HSC/HPC is a laborious and expensive process. We therefore developed a computational model based on cellular signaling interactions that predicts the influence of a cytokine on the survival, duplication and differentiation of the CD133+ HSC/HPC subset from human umbilical CB.

    Results: We have used results from in vitro expansion cultures with different combinations of one or more cytokines to develop an ordinary differential equation model that includes the effect of cytokines on survival, duplication and differentiation of the CD133+ HSC/HPC. Comparing the results of in vitro and in silico experiments, we show that the model can predict the effect of a cytokine on the fold expansion and differentiation of CB CD133+ HSC/HPC after 8-day culture on a 3D scaffold.

    Availability and implementation: The model is available at the following URL: http://www.francescopappalardo.net/Bioinformatics_CD133_Model.

    Contact: francesco.pappalardo@unict.it or suzanne.watt@nhsbt.nhs.uk

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Global optimization-based inference of chemogenomic features from drug-target interactions
    [Jul 2015]

    Motivation: Gaining insight into chemogenomic drug–target interactions, such as those involving the substructures of synthetic drugs and protein domains, is important in fragment-based drug discovery and drug repositioning. Previous studies evaluated the interactions locally, thereby ignoring the competitive effects of different substructures or domains, but this could lead to high false-positive estimation, calling for a computational method that presents more predictive power.

    Results: A statistical model, termed Global optimization-based InFerence of chemogenomic features from drug–Target interactions, or GIFT, is proposed herein to evaluate substructure-domain interactions globally such that all substructure-domain contributions to drug–target interaction are analyzed simultaneously. Combinations of different chemical substructures were included since they may function as one unit. When compared to previous methods, GIFT showed better interpretive performance, and performance for the recovery of drug–target interactions was good. Among 53 known drug–domain interactions, 81% were accurately predicted by GIFT. Eighteen of the top 100 predicted combined substructure-domain interactions had corresponding drug–target structures in the Protein Data Bank database, and 15 out of the 18 had been proved. GIFT was then implemented to predict substructure-domain interactions based on drug repositioning. For example, the anticancer activities of tazarotene, adapalene, acitretin and raloxifene were identified. In summary, GIFT is a global chemogenomic inference approach and offers fresh insight into drug–target interactions.

    Availability and implementation: The source codes can be found at http://bioinfo.au.tsinghua.edu.cn/software/GIFT.

    Contact: shaoli@mail.tsinghua.edu.cn

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Outlier detection at the transcriptome-proteome interface
    [Jul 2015]

    Background: In high-throughput experimental biology, it is widely acknowledged that while expression levels measured at the levels of transcriptome and the corresponding proteome do not, in general, correlate well, messenger RNA levels are used as convenient proxies for protein levels. Our interest is in developing data-driven computational models that can bridge the gap between these two levels of measurement at which different mechanisms of regulation may act on different molecular species causing any observed lack of correlations. To this end, we build data-driven predictors of protein levels using mRNA levels and known proxies of translation efficiencies as covariates. Previous work showed that in such a setting, outliers with respect to the model are reliable candidates for post-translational regulation.

    Results: Here, we present and compare two novel formulations of deriving a protein concentration predictor from which outliers may be extracted in a systematic manner. The first approach, outlier rejecting regression, allows explicit specification of a certain fraction of the data as outliers. In a regression setting, this is a non-convex optimization problem which we solve by deriving a difference of convex functions algorithm (DCA). With post-translationally regulated proteins, one expects their concentrations to be affected primarily by disruption of protein stability. Our second algorithm exploits this observation by minimizing an asymmetric loss using quantile regression and extracts outlier proteins whose measured concentrations are lower than what a genome-wide regression would predict. We validate the two approaches on a dataset of yeast transcriptome and proteome. Functional annotation checks on detected outliers demonstrate that the methods are able to identify post-translationally regulated genes with high statistical confidence.

    Contact: mn@ecs.soton.ac.uk

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
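
    The asymmetric loss minimized in quantile regression is commonly written as the pinball (check) loss. The sketch below shows that loss and a simple rule for flagging proteins measured well below their prediction; it is an illustration under assumed names and an assumed fixed threshold, not the authors' algorithm, and it takes the predictions as given rather than fitting a regression.

```python
from typing import List, Sequence


def pinball_loss(observed: Sequence[float], predicted: Sequence[float],
                 tau: float) -> float:
    """Mean pinball loss for quantile level tau (0 < tau < 1)."""
    total = 0.0
    for y, y_hat in zip(observed, predicted):
        residual = y - y_hat
        # Under-prediction is weighted by tau, over-prediction by (1 - tau).
        total += tau * residual if residual >= 0 else (tau - 1) * residual
    return total / len(observed)


def flag_low_outliers(observed: Sequence[float], predicted: Sequence[float],
                      threshold: float) -> List[int]:
    """Indices whose measured value falls below prediction by more than threshold."""
    return [i for i, (y, y_hat) in enumerate(zip(observed, predicted))
            if y_hat - y > threshold]


if __name__ == "__main__":
    measured = [5.0, 1.0, 4.0]     # toy protein levels
    modelled = [5.0, 4.0, 4.5]     # toy predictions from mRNA covariates
    print(pinball_loss(measured, modelled, tau=0.9))
    print(flag_low_outliers(measured, modelled, threshold=1.0))
```

    With tau above 0.5 the fitted curve sits high in the data, so proteins far below it stand out, which matches the abstract's interest in concentrations lower than predicted.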
  • Regulatory network inferred using expression data of small sample size: application and validation in erythroid system
    [Jul 2015]

    Motivation: Modeling regulatory networks using expression data observed in a differentiation process may help identify context-specific interactions. The outcome of the current algorithms highly depends on the quality and quantity of a single time-course dataset, and the performance may be compromised for datasets with a limited number of samples.

    Results: In this work, we report a multi-layer graphical model that is capable of leveraging many publicly available time-course datasets, as well as a cell lineage-specific dataset with a small sample size, to model regulatory networks specific to a differentiation process. First, a collection of network inference methods is used to predict the regulatory relationships in individual public datasets. Then, the inferred directional relationships are weighted and integrated together by evaluating against the cell lineage-specific dataset. To test the accuracy of this algorithm, we collected a time-course RNA-Seq dataset during human erythropoiesis to infer regulatory relationships specific to this differentiation process. The resulting erythroid-specific regulatory network reveals novel regulatory relationships activated in erythropoiesis, which were further validated by genome-wide TR4 binding studies using ChIP-seq. These erythropoiesis-specific regulatory relationships were not identifiable by single dataset-based methods or context-independent integrations. Analysis of the predicted targets reveals that they are all closely associated with hematopoietic lineage differentiation.

    Availability and implementation: The predicted erythroid regulatory network is available at http://guanlab.ccmb.med.umich.edu/data/inferenceNetwork/.

    Contact: gyuanfan@umich.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Using neighborhood cohesiveness to infer interactions between protein domains
    [Jul 2015]

    Motivation: In recent years, large-scale studies have been undertaken to describe, at least partially, protein-protein interaction maps, or interactomes, for a number of relevant organisms, including human. However, current interactomes provide a somewhat limited picture of the molecular details involving protein interactions, mostly because essential experimental information, especially structural data, is lacking. Indeed, the gap between structural and interactomics information is widening and thus, for most interactions, key experimental information is missing. We elaborate on the observation that many interactions between proteins involve a pair of their constituent domains and, thus, the knowledge of how protein domains interact adds very significant information to any interactomic analysis.

    Results: In this work, we describe a novel use of the neighborhood cohesiveness property to infer interactions between protein domains given a protein interaction network. We have shown that some clustering coefficients can be extended to measure a degree of cohesiveness between two sets of nodes within a network. Specifically, we used the meet/min coefficient to measure the proportion of interacting nodes between two sets of nodes and the fraction of common neighbors. This approach extends previous works where homolog coefficients were first defined around network nodes and later around edges. The proposed approach substantially increases both the number of predicted domain-domain interactions and their accuracy compared with current methods.

    Availability and implementation: http://dimero.cnb.csic.es

    Contact: jsegura@cnb.csic.es

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
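
    The meet/min coefficient of two sets is the size of their intersection divided by the size of the smaller set. The sketch below applies it to the neighbourhoods of two node sets in a toy network; how the paper maps domains to node sets is not given in the abstract, so the `neighbours` helper and the example grouping are assumptions.

```python
from typing import Hashable, Iterable, Set, Tuple


def meet_min(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Meet/min coefficient: |A & B| / min(|A|, |B|), or 0.0 if either is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def neighbours(edges: Iterable[Tuple[Hashable, Hashable]],
               nodes: Set[Hashable]) -> Set[Hashable]:
    """All nodes that share an edge with at least one node in the given set."""
    found: Set[Hashable] = set()
    for u, v in edges:
        if u in nodes:
            found.add(v)
        if v in nodes:
            found.add(u)
    return found


if __name__ == "__main__":
    # Toy undirected protein interaction network.
    toy_edges = [("p1", "p3"), ("p2", "p3"), ("p2", "p4"), ("p5", "p4")]
    carriers_a = {"p1", "p2"}   # hypothetical proteins carrying domain A
    carriers_b = {"p5"}         # hypothetical proteins carrying domain B
    shared = meet_min(neighbours(toy_edges, carriers_a),
                      neighbours(toy_edges, carriers_b))
    print(shared)               # fraction of common neighbours
```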
  • Overlap and diversity in antimicrobial peptide databases: compiling a non-redundant set of sequences
    [Jul 2015]

    Motivation: The large variety of antimicrobial peptide (AMP) databases developed to date are characterized by a substantial overlap of data and similarity of sequences. Our goals are to analyze the levels of redundancy for all available AMP databases and use this information to build a new non-redundant sequence database. For this purpose, a new software tool is introduced.

    Results: A comparative study of 25 AMP databases reveals the overlap and diversity among them and the internal diversity within each database. The overlap analysis shows that only one database (Peptaibol) contains exclusive data, not present in any other, whereas all sequences in the LAMP_Patent database are included in CAMP_Patent. However, the majority of databases have their own set of unique sequences, as well as some overlap with other databases. The complete set of non-duplicate sequences comprises 16 990 cases, which is almost half of the total number of reported peptides. On the other hand, the diversity analysis identifies the most and least diverse databases and proves that all databases exhibit some level of redundancy. Finally, we present a new parallel-free software, named Dover Analyzer, developed to compute the overlap and diversity between any number of databases and compile a set of non-redundant sequences. These results are useful for selecting or building a suitable representative set of AMPs, according to specific needs.

    Availability and implementation: The regularly updated non-redundant sequence databases and the Dover Analyzer software to perform custom analysis are available at http://mobiosd-hub.com/doveranalyzer/.

    Contact: ymarrero77@yahoo.es

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
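
    The overlap analysis and non-redundant compilation described in the abstract above can be sketched with plain set operations. This is a simplified Python illustration that treats two sequences as redundant only when they are identical after normalization. It is not the Dover Analyzer algorithm, and the database names and peptide strings are toy values invented for the example.

```python
from itertools import combinations


def normalize(sequence):
    """Strip whitespace and upper-case so identical peptides compare equal."""
    return sequence.strip().upper()


def load(databases):
    """Map each database name to its set of normalized sequences."""
    return {name: {normalize(s) for s in seqs} for name, seqs in databases.items()}


def pairwise_overlap(dbs):
    """Number of shared sequences for every pair of databases."""
    return {(a, b): len(dbs[a] & dbs[b]) for a, b in combinations(sorted(dbs), 2)}


def exclusive_sequences(dbs, name):
    """Sequences found in `name` and in no other database."""
    others = set().union(*(seqs for n, seqs in dbs.items() if n != name))
    return dbs[name] - others


def non_redundant(dbs):
    """Union of all databases with exact duplicates collapsed."""
    return set().union(*dbs.values())


# Toy input: three hypothetical databases with partly shared entries.
toy = {
    "db_a": ["GIGKFLHS", "KWKLFKKI", "gigkflhs"],
    "db_b": ["KWKLFKKI", "FLPLIGRV"],
    "db_c": ["FLPLIGRV"],
}

dbs = load(toy)
overlap = pairwise_overlap(dbs)
merged = non_redundant(dbs)
```

    Here `exclusive_sequences(dbs, "db_c")` is empty because every entry of `db_c` also appears in `db_b`, the same kind of full containment the abstract reports for LAMP_Patent and CAMP_Patent.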
  • 4DGenome: a comprehensive database of chromatin interactions
    [Jul 2015]

    Motivation: The 3D structure of the genome plays a critical role in regulating gene expression. Recent progress in mapping technologies for chromatin interactions has led to a rapid increase in this kind of interaction data. This trend will continue as research in this burgeoning field intensifies.

    Results: We describe the 4DGenome database that stores chromatin interaction data compiled through comprehensive literature curation. The database currently covers both low- and high-throughput assays, including 3C, 4C-Seq, 5C, Hi-C, ChIA-PET and Capture-C. To complement the set of interactions detected by experimental assays, we also include interactions predicted by a recently developed computational method with demonstrated high accuracy. The database currently contains ~8 million records, covering 102 cell/tissue types in five organisms. Records in the database are described using a standardized file format, facilitating data exchange. The vast majority of the interactions were assigned a confidence score. Using the web interface, users can query and download database records via a number of annotation dimensions. Query results can be visualized along with other genomics datasets via links to the UCSC genome browser. We anticipate that 4DGenome will be a valuable resource for investigating the spatial structure-and-function relationship of genomes.

    Availability and Implementation: 4DGenome is freely accessible at http://4dgenome.int-med.uiowa.edu. The database and web interface are implemented in MySQL, Apache and JavaScript with all major browsers supported.

    Contact: kai-tan@uiowa.edu

    Supplementary Information: Supplementary Materials are available at Bioinformatics online.

    Categories: Journal Articles
  • bio-samtools 2: a package for analysis and visualization of sequence and alignment data with SAMtools in Ruby
    [Jul 2015]

    Motivation: bio-samtools is a Ruby language interface to SAMtools, the highly popular library that provides utilities for manipulating high-throughput sequence alignments in the Sequence Alignment/Map format. Advances in Ruby now allow us to improve the analysis capabilities and increase the utility of bio-samtools, allowing users to accomplish a large amount of analysis using a very small amount of code. bio-samtools can also be easily developed to include additional SAMtools methods and hence stay current with the latest SAMtools releases.

    Results: We have added new Ruby classes for the MPileup and Variant Call Format (VCF) data formats emitted by SAMtools and introduced more analysis methods for variant analysis, including alternative allele calculation and allele frequency calling for SNPs. Our new implementation of bio-samtools also ensures that all the functionality of the SAMtools library is now supported and that bio-samtools can be easily extended to include future changes in SAMtools. bio-samtools 2 also provides methods that allow the user to directly produce visualizations of alignment data.

    Availability and implementation: bio-samtools is available as a BioGem from http://www.biogems.info or as source code from https://github.com/helios/bioruby-samtools under the MIT License.

    Contact: dan.maclean@tsl.ac.uk

    Categories: Journal Articles