Bioinformatics Journal

Bioinformatics - RSS feed of current issue
  • Using combined evidence from replicates to evaluate ChIP-seq peaks
    [Aug 2015]

    Motivation: Chromatin Immunoprecipitation followed by sequencing (ChIP-seq) detects genome-wide DNA–protein interactions and chromatin modifications, returning enriched regions (ERs), usually associated with a significance score. Moderately significant interactions can correspond to true, weak interactions, or to false positives; replicates of a ChIP-seq experiment can provide co-localised evidence to decide between the two cases. We designed a general methodological framework to rigorously combine the evidence of ERs in ChIP-seq replicates, with the option to set a significance threshold on the repeated evidence and a minimum number of samples bearing this evidence.

    Results: We applied our method to Myc transcription factor ChIP-seq datasets in K562 cells available in the ENCODE project. Using replicates, we could extend up to 3 times the ER number with respect to single-sample analysis with equivalent significance threshold. We validated the ‘rescued’ ERs by checking for the overlap with open chromatin regions and for the enrichment of the motif that Myc binds with strongest affinity; we compared our results with alternative methods (IDR and jMOSAiCS), obtaining more validated peaks than the former and less peaks than latter, but with a better validation.

    Availability and implementation: An implementation of the proposed method and its source code under GPLv3 license are freely available at http://www.bioinformatics.deib.polimi.it/MSPC/ and http://mspc.codeplex.com/, respectively.

    Contact: marco.morelli@iit.it

    Supplementary information: Supplementary Material are available at Bioinformatics online.

    Categories: Journal Articles
  • Data-dependent bucketing improves reference-free compression of sequencing reads
    [Aug 2015]

    Motivation: The storage and transmission of high-throughput sequencing data consumes significant resources. As our capacity to produce such data continues to increase, this burden will only grow. One approach to reduce storage and transmission requirements is to compress this sequencing data.

    Results: We present a novel technique to boost the compression of sequencing that is based on the concept of bucketing similar reads so that they appear nearby in the file. We demonstrate that, by adopting a data-dependent bucketing scheme and employing a number of encoding ideas, we can achieve substantially better compression ratios than existing de novo sequence compression tools, including other bucketing and reordering schemes. Our method, Mince, achieves up to a 45% reduction in file sizes (28% on average) compared with existing state-of-the-art de novo compression schemes.

    Availability and implementation: Mince is written in C++11, is open source and has been made available under the GPLv3 license. It is available at http://www.cs.cmu.edu/~ckingsf/software/mince.

    Contact: carlk@cs.cmu.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Polyester: simulating RNA-seq datasets with differential transcript expression
    [Aug 2015]

    Motivation: Statistical methods development for differential expression analysis of RNA sequencing (RNA-seq) requires software tools to assess accuracy and error rate control. Since true differential expression status is often unknown in experimental datasets, artificially constructed datasets must be utilized, either by generating costly spike-in experiments or by simulating RNA-seq data.

    Results: Polyester is an R package designed to simulate RNA-seq data, beginning with an experimental design and ending with collections of RNA-seq reads. Its main advantage is the ability to simulate reads indicating isoform-level differential expression across biological replicates for a variety of experimental designs. Data generated by Polyester is a reasonable approximation to real RNA-seq data and standard differential expression workflows can recover differential expression set in the simulation by the user.

    Availability and implementation: Polyester is freely available from Bioconductor (http://bioconductor.org/).

    Contact: jtleek@gmail.com

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • RVD2: an ultra-sensitive variant detection model for low-depth heterogeneous next-generation sequencing data
    [Aug 2015]

    Motivation: Next-generation sequencing technology is increasingly being used for clinical diagnostic tests. Clinical samples are often genomically heterogeneous due to low sample purity or the presence of genetic subpopulations. Therefore, a variant calling algorithm for calling low-frequency polymorphisms in heterogeneous samples is needed.

    Results: We present a novel variant calling algorithm that uses a hierarchical Bayesian model to estimate allele frequency and call variants in heterogeneous samples. We show that our algorithm improves upon current classifiers and has higher sensitivity and specificity over a wide range of median read depth and minor allele fraction. We apply our model and identify 15 mutated loci in the PAXP1 gene in a matched clinical breast ductal carcinoma tumor sample; two of which are likely loss-of-heterozygosity events.

    Availability and implementation: http://genomics.wpi.edu/rvd2/.

    Contact: pjflaherty@wpi.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Phylesystem: a git-based data store for community-curated phylogenetic estimates
    [Aug 2015]

    Motivation: Phylogenetic estimates from published studies can be archived using general platforms like Dryad (Vision, 2010) or TreeBASE (Sanderson et al., 1994). Such services fulfill a crucial role in ensuring transparency and reproducibility in phylogenetic research. However, digital tree data files often require some editing (e.g. rerooting) to improve the accuracy and reusability of the phylogenetic statements. Furthermore, establishing the mapping between tip labels used in a tree and taxa in a single common taxonomy dramatically improves the ability of other researchers to reuse phylogenetic estimates. As the process of curating a published phylogenetic estimate is not error-free, retaining a full record of the provenance of edits to a tree is crucial for openness, allowing editors to receive credit for their work and making errors introduced during curation easier to correct.

    Results: Here, we report the development of software infrastructure to support the open curation of phylogenetic data by the community of biologists. The backend of the system provides an interface for the standard database operations of creating, reading, updating and deleting records by making commits to a git repository. The record of the history of edits to a tree is preserved by git’s version control features. Hosting this data store on GitHub (http://github.com/) provides open access to the data store using tools familiar to many developers. We have deployed a server running the ‘phylesystem-api’, which wraps the interactions with git and GitHub. The Open Tree of Life project has also developed and deployed a JavaScript application that uses the phylesystem-api and other web services to enable input and curation of published phylogenetic statements.

    Availability and implementation: Source code for the web service layer is available at https://github.com/OpenTreeOfLife/phylesystem-api. The data store can be cloned from: https://github.com/OpenTreeOfLife/phylesystem. A web application that uses the phylesystem web services is deployed at http://tree.opentreeoflife.org/curator. Code for that tool is available from https://github.com/OpenTreeOfLife/opentree.

    Contact: mtholder@gmail.com

    Categories: Journal Articles
  • DockStar: a novel ILP-based integrative method for structural modeling of multimolecular protein complexes
    [Aug 2015]

    Motivation: Atomic resolution modeling of large multimolecular assemblies is a key task in Structural Cell Biology. Experimental techniques can provide atomic resolution structures of single proteins and small complexes, or low resolution data of large multimolecular complexes.

    Results: We present a novel integrative computational modeling method, which integrates both low and high resolution experimental data. The algorithm accepts as input atomic resolution structures of the individual subunits obtained from X-ray, NMR or homology modeling, and interaction data between the subunits obtained from mass spectrometry. The optimal assembly of the individual subunits is formulated as an Integer Linear Programming task. The method was tested on several representative complexes, both in the bound and unbound cases. It placed correctly most of the subunits of multimolecular complexes of up to 16 subunits and significantly outperformed the CombDock and Haddock multimolecular docking methods.

    Availability and implementation: http://bioinfo3d.cs.tau.ac.il/DockStar

    Contact: naamaamir@mail.tau.ac.il or wolfson@tau.ac.il

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Automated band annotation for RNA structure probing experiments with numerous capillary electrophoresis profiles
    [Aug 2015]

    Motivation: Capillary electrophoresis (CE) is a powerful approach for structural analysis of nucleic acids, with recent high-throughput variants enabling three-dimensional RNA modeling and the discovery of new rules for RNA structure design. Among the steps composing CE analysis, the process of finding each band in an electrophoretic trace and mapping it to a position in the nucleic acid sequence has required significant manual inspection and remains the most time-consuming and error-prone step. The few available tools seeking to automate this band annotation have achieved limited accuracy and have not taken advantage of information across dozens of profiles routinely acquired in high-throughput measurements.

    Results: We present a dynamic-programming-based approach to automate band annotation for high-throughput capillary electrophoresis. The approach is uniquely able to define and optimize a robust target function that takes into account multiple CE profiles (sequencing ladders, different chemical probes, different mutants) collected for the RNA. Over a large benchmark of multi-profile datasets for biological RNAs and designed RNAs from the EteRNA project, the method outperforms prior tools (QuSHAPE and FAST) significantly in terms of accuracy compared with gold-standard manual annotations. The amount of computation required is reasonable at a few seconds per dataset. We also introduce an ‘E-score’ metric to automatically assess the reliability of the band annotation and show it to be practically useful in flagging uncertainties in band annotation for further inspection.

    Availability and implementation: The implementation of the proposed algorithm is included in the HiTRACE software, freely available as an online server and for download at http://hitrace.stanford.edu.

    Contact: sryoon@snu.ac.kr or rhiju@stanford.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • INPS: predicting the impact of non-synonymous variations on protein stability from sequence
    [Aug 2015]

    Motivation: A tool for reliably predicting the impact of variations on protein stability is extremely important for both protein engineering and for understanding the effects of Mendelian and somatic mutations in the genome. Next Generation Sequencing studies are constantly increasing the number of protein sequences. Given the huge disproportion between protein sequences and structures, there is a need for tools suited to annotate the effect of mutations starting from protein sequence without relying on the structure. Here, we describe INPS, a novel approach for annotating the effect of non-synonymous mutations on the protein stability from its sequence. INPS is based on SVM regression and it is trained to predict the thermodynamic free energy change upon single-point variations in protein sequences.

    Results: We show that INPS performs similarly to the state-of-the-art methods based on protein structure when tested in cross-validation on a non-redundant dataset. INPS performs very well also on a newly generated dataset consisting of a number of variations occurring in the tumor suppressor protein p53. Our results suggest that INPS is a tool suited for computing the effect of non-synonymous polymorphisms on protein stability when the protein structure is not available. We also show that INPS predictions are complementary to those of the state-of-the-art, structure-based method mCSM. When the two methods are combined, the overall prediction on the p53 set scores significantly higher than those of the single methods.

    Availability and implementation: The presented method is available as web server at http://inps.biocomp.unibo.it.

    Contact: piero.fariselli@unibo.it

    Supplementary information: Supplementary Materials are available at Bioinformatics online.

    Categories: Journal Articles
  • Inferring data-specific micro-RNA function through the joint ranking of micro-RNA and pathways from matched micro-RNA and gene expression data
    [Aug 2015]

    Motivation: In practice, identifying and interpreting the functional impacts of the regulatory relationships between micro-RNA and messenger-RNA is non-trivial. The sheer scale of possible micro-RNA and messenger-RNA interactions can make the interpretation of results difficult.

    Results: We propose a supervised framework, pMim, built upon concepts of significance combination, for jointly ranking regulatory micro-RNA and their potential functional impacts with respect to a condition of interest. Here, pMim directly tests if a micro-RNA is differentially expressed and if its predicted targets, which lie in a common biological pathway, have changed in the opposite direction. We leverage the information within existing micro-RNA target and pathway databases to stabilize the estimation and annotation of micro-RNA regulation making our approach suitable for datasets with small sample sizes. In addition to outputting meaningful and interpretable results, we demonstrate in a variety of datasets that the micro-RNA identified by pMim, in comparison to simpler existing approaches, are also more concordant with what is described in the literature.

    Availability and implementation: This framework is implemented as an R function, pMim, in the package sydSeq available from http://www.ellispatrick.com/r-packages.

    Contact: jean.yang@sydney.edu.au

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • INSPEcT: a computational tool to infer mRNA synthesis, processing and degradation dynamics from RNA- and 4sU-seq time course experiments
    [Aug 2015]

    Motivation: Cellular mRNA levels originate from the combined action of multiple regulatory processes, which can be recapitulated by the rates of pre-mRNA synthesis, pre-mRNA processing and mRNA degradation. Recent experimental and computational advances set the basis to study these intertwined levels of regulation. Nevertheless, software for the comprehensive quantification of RNA dynamics is still lacking.

    Results: INSPEcT is an R package for the integrative analysis of RNA- and 4sU-seq data to study the dynamics of transcriptional regulation. INSPEcT provides gene-level quantification of these rates, and a modeling framework to identify which of these regulatory processes are most likely to explain the observed mRNA and pre-mRNA concentrations. Software performance is tested on a synthetic dataset, instrumental to guide the choice of the modeling parameters and the experimental design.

    Availability and implementation: INSPEcT is submitted to Bioconductor and is currently available as Supplementary Additional File S1.

    Contact: mattia.pelizzola@iit.it

    Supplementary Information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Addressing false discoveries in network inference
    [Aug 2015]

    Motivation: Experimentally determined gene regulatory networks can be enriched by computational inference from high-throughput expression profiles. However, the prediction of regulatory interactions is severely impaired by indirect and spurious effects, particularly for eukaryotes. Recently, published methods report improved predictions by exploiting the a priori known targets of a regulator (its local topology) in addition to expression profiles.

    Results: We find that methods exploiting known targets show an unexpectedly high rate of false discoveries. This leads to inflated performance estimates and the prediction of an excessive number of new interactions for regulators with many known targets. These issues are hidden from common evaluation and cross-validation setups, which is due to Simpson’s paradox. We suggest a confidence score recalibration method (CoRe) that reduces the false discovery rate and enables a reliable performance estimation.

    Conclusions: CoRe considerably improves the results of network inference methods that exploit known targets. Predictions then display the biological process specificity of regulators more correctly and enable the inference of accurate genome-wide regulatory networks in eukaryotes. For yeast, we propose a network with more than 22 000 confident interactions. We point out that machine learning approaches outside of the area of network inference may be affected as well.

    Availability and implementation: Results, executable code and networks are available via our website http://www.bio.ifi.lmu.de/forschung/CoRe.

    Contact: robert.kueffner@helmholtz-muenchen.de

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Genome-scale strain designs based on regulatory minimal cut sets
    [Aug 2015]

    Motivation: Stoichiometric and constraint-based methods of computational strain design have become an important tool for rational metabolic engineering. One of those relies on the concept of constrained minimal cut sets (cMCSs). However, as most other techniques, cMCSs may consider only reaction (or gene) knockouts to achieve a desired phenotype.

    Results: We generalize the cMCSs approach to constrained regulatory MCSs (cRegMCSs), where up/downregulation of reaction rates can be combined along with reaction deletions. We show that flux up/downregulations can virtually be treated as cuts allowing their direct integration into the algorithmic framework of cMCSs. Because of vastly enlarged search spaces in genome-scale networks, we developed strategies to (optionally) preselect suitable candidates for flux regulation and novel algorithmic techniques to further enhance efficiency and speed of cMCSs calculation. We illustrate the cRegMCSs approach by a simple example network and apply it then by identifying strain designs for ethanol production in a genome-scale metabolic model of Escherichia coli. The results clearly show that cRegMCSs combining reaction deletions and flux regulations provide a much larger number of suitable strain designs, many of which are significantly smaller relative to cMCSs involving only knockouts. Furthermore, with cRegMCSs, one may also enable the fine tuning of desired behaviours in a narrower range. The new cRegMCSs approach may thus accelerate the implementation of model-based strain designs for the bio-based production of fuels and chemicals.

    Availability and implementation: MATLAB code and the examples can be downloaded at http://www.mpi-magdeburg.mpg.de/projects/cna/etcdownloads.html.

    Contact: krishna.mahadevan@utoronto.ca or klamt@mpi-magdeburg.mpg.de

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • BinDNase: a discriminatory approach for transcription factor binding prediction using DNase I hypersensitivity data
    [Aug 2015]

    Motivation: Transcription factors (TFs) are a class of DNA-binding proteins that have a central role in regulating gene expression. To reveal mechanisms of transcriptional regulation, a number of computational tools have been proposed for predicting TF-DNA interaction sites. Recent studies have shown that genome-wide sequencing data on open chromatin sites from a DNase I hypersensitivity experiments (DNase-seq) has a great potential to map putative binding sites of all transcription factors in a single experiment. Thus, computational methods for analysing DNase-seq to accurately map TF-DNA interaction sites are highly needed.

    Results: Here, we introduce a novel discriminative algorithm, BinDNase, for predicting TF-DNA interaction sites using DNase-seq data. BinDNase implements an efficient method for selecting and extracting informative features from DNase I signal for each TF, either at single nucleotide resolution or for larger regions. The method is applied to 57 transcription factors in cell line K562 and 31 transcription factors in cell line HepG2 using data from the ENCODE project. First, we show that BinDNase compares favourably to other supervised and unsupervised methods developed for TF-DNA interaction prediction using DNase-seq data. We demonstrate the importance to model each TF with a separate prediction model, reflecting TF-specific DNA accessibility around the TF-DNA interaction site. We also show that a highly standardised DNase-seq data (pre)processing is a requisite for accurate TF binding predictions and that sequencing depth has on average only a moderate effect on prediction accuracy. Finally, BinDNase’s binding predictions generalise to other cell types, thus making BinDNase a versatile tool for accurate TF binding prediction.

    Availability and implementation: R implementation of the algorithm is available in: http://research.ics.aalto.fi/csb/software/bindnase/.

    Contact: juhani.kahara@aalto.fi

    Supplementary information: Supplemental data are available at Bioinformatics online.

    Categories: Journal Articles
  • The SwissLipids knowledgebase for lipid biology
    [Aug 2015]

    Motivation: Lipids are a large and diverse group of biological molecules with roles in membrane formation, energy storage and signaling. Cellular lipidomes may contain tens of thousands of structures, a staggering degree of complexity whose significance is not yet fully understood. High-throughput mass spectrometry-based platforms provide a means to study this complexity, but the interpretation of lipidomic data and its integration with prior knowledge of lipid biology suffers from a lack of appropriate tools to manage the data and extract knowledge from it.

    Results: To facilitate the description and exploration of lipidomic data and its integration with prior biological knowledge, we have developed a knowledge resource for lipids and their biology—SwissLipids. SwissLipids provides curated knowledge of lipid structures and metabolism which is used to generate an in silico library of feasible lipid structures. These are arranged in a hierarchical classification that links mass spectrometry analytical outputs to all possible lipid structures, metabolic reactions and enzymes. SwissLipids provides a reference namespace for lipidomic data publication, data exploration and hypothesis generation. The current version of SwissLipids includes over 244 000 known and theoretically possible lipid structures, over 800 proteins, and curated links to published knowledge from over 620 peer-reviewed publications. We are continually updating the SwissLipids hierarchy with new lipid categories and new expert curated knowledge.

    Availability: SwissLipids is freely available at http://www.swisslipids.org/.

    Contact: alan.bridge@isb-sib.ch

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • CiVi: circular genome visualization with unique features to analyze sequence elements
    [Aug 2015]

    Summary: We have developed CiVi, a user-friendly web-based tool to create custom circular maps to aid the analysis of microbial genomes and sequence elements. Sequence related data such as gene-name, COG class, PFAM domain, GC%, and subcellular location can be comprehensively viewed. Quantitative gene-related data (e.g. expression ratios or read counts) as well as predicted sequence elements (e.g. regulatory sequences) can be uploaded and visualized. CiVi accommodates the analysis of genomic elements by allowing a visual interpretation in the context of: (i) their genome-wide distribution, (ii) provided experimental data and (iii) the local orientation and location with respect to neighboring genes. CiVi thus enables both experts and non-experts to conveniently integrate public genome data with the results of genome analyses in circular genome maps suitable for publication.

    Contact: L.Overmars@gmail.com

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Availability and implementation: CiVi is freely available at http://civi.cmbi.ru.nl

    Categories: Journal Articles
  • IonGAP: integrative bacterial genome analysis for Ion Torrent sequence data
    [Aug 2015]

    Summary: We introduce IonGAP, a publicly available Web platform designed for the analysis of whole bacterial genomes using Ion Torrent sequence data. Besides assembly, it integrates a variety of comparative genomics, annotation and bacterial classification routines, based on the widely used FASTQ, BAM and SRA file formats. Benchmarking with different datasets evidenced that IonGAP is a fast, powerful and simple-to-use bioinformatics tool. By releasing this platform, we aim to translate low-cost bacterial genome analysis for microbiological prevention and control in healthcare, agroalimentary and pharmaceutical industry applications.

    Availability and implementation: IonGAP is hosted by the ITER’s Teide-HPC supercomputer and is freely available on the Web for non-commercial use at http://iongap.hpc.iter.es.

    Contact: mcolesan@ull.edu.es or cflores@ull.edu.es

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Interactive analysis of large cancer copy number studies with Copy Number Explorer
    [Aug 2015]

    Summary: Copy number abnormalities (CNAs) such as somatically-acquired chromosomal deletions and duplications drive the development of cancer. As individual tumor genomes can contain tens or even hundreds of large and/or focal CNAs, a major difficulty is differentiating between important, recurrent pathogenic changes and benign changes unrelated to the subject’s phenotype. Here we present Copy Number Explorer, an interactive tool for mining large copy number datasets. Copy Number Explorer facilitates rapid visual and statistical identification of recurrent regions of gain or loss, identifies the genes most likely to drive CNA formation using the cghMCR method and identifies recurrently broken genes that may be disrupted or fused. The software also allows users to identify recurrent CNA regions that may be associated with differential survival.

    Availability and Implementation: Copy Number Explorer is available under the GNU public license (GPL-3). Source code is available at: https://sourceforge.net/projects/copynumberexplorer/

    Contact: scott.newman@emory.edu

    Categories: Journal Articles
  • kSNP3.0: SNP detection and phylogenetic analysis of genomes without genome alignment or reference genome
    [Aug 2015]

    Summary:We announce the release of kSNP3.0, a program for SNP identification and phylogenetic analysis without genome alignment or the requirement for reference genomes. kSNP3.0 is a significantly improved version of kSNP v2.

    Availability and implementation: kSNP3.0 is implemented as a package of stand-alone executables for Linux and Mac OS X under the open-source BSD license. The executable packages, source code and a full User Guide are freely available at https://sourceforge.net/projects/ksnp/files/

    Contact: barryghall@gmail.com

    Categories: Journal Articles
  • Identification of C2H2-ZF binding preferences from ChIP-seq data using RCADE
    [Aug 2015]

    Summary: Current methods for motif discovery from chromatin immunoprecipitation followed by sequencing (ChIP-seq) data often identify non-targeted transcription factor (TF) motifs, and are even further limited when peak sequences are similar due to common ancestry rather than common binding factors. The latter aspect particularly affects a large number of proteins from the Cys2His2 zinc finger (C2H2-ZF) class of TFs, as their binding sites are often dominated by endogenous retroelements that have highly similar sequences. Here, we present recognition code-assisted discovery of regulatory elements (RCADE) for motif discovery from C2H2-ZF ChIP-seq data. RCADE combines predictions from a DNA recognition code of C2H2-ZFs with ChIP-seq data to identify models that represent the genuine DNA binding preferences of C2H2-ZF proteins. We show that RCADE is able to identify generalizable binding models even from peaks that are exclusively located within the repeat regions of the genome, where state-of-the-art motif finding approaches largely fail.

    Availability and implementation: RCADE is available as a webserver and also for download at http://rcade.ccbr.utoronto.ca/.

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Contact: t.hughes@utoronto.ca

    Categories: Journal Articles
  • Tax4Fun: predicting functional profiles from metagenomic 16S rRNA data
    [Aug 2015]

    Motivation: The characterization of phylogenetic and functional diversity is a key element in the analysis of microbial communities. Amplicon-based sequencing of marker genes, such as 16S rRNA, is a powerful tool for assessing and comparing the structure of microbial communities at a high phylogenetic resolution. Because 16S rRNA sequencing is more cost-effective than whole metagenome shotgun sequencing, marker gene analysis is frequently used for broad studies that involve a large number of different samples. However, in comparison to shotgun sequencing approaches, insights into the functional capabilities of the community get lost when restricting the analysis to taxonomic assignment of 16S rRNA data.

    Results: Tax4Fun is a software package that predicts the functional capabilities of microbial communities based on 16S rRNA datasets. We evaluated Tax4Fun on a range of paired metagenome/16S rRNA datasets to assess its performance. Our results indicate that Tax4Fun provides a good approximation to functional profiles obtained from metagenomic shotgun sequencing approaches.

    Availability and implementation: Tax4Fun is an open-source R package and applicable to output as obtained from the SILVAngs web server or the application of QIIME with a SILVA database extension. Tax4Fun is freely available for download at http://tax4fun.gobics.de/.

    Contact: kasshau@gwdg.de

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles