Nucleic Acids Research
The IntAct molecular interaction database has created a new, free, open-source, manually curated resource, the Complex Portal (www.ebi.ac.uk/intact/complex), through which protein complexes from major model organisms are being collated and made available for search, viewing and download. It has been built in close collaboration with other bioinformatics services and populated with data from ChEMBL, MatrixDB, PDBe, Reactome and UniProtKB. Each entry contains information about the participating molecules (including small molecules and nucleic acids), their stoichiometry, topology and structural assembly. Complexes are annotated with details about their function, properties and complex-specific Gene Ontology (GO) terms. Consistent nomenclature is used throughout the resource with systematic names, recommended names and a list of synonyms all provided. The use of the Evidence Code Ontology allows us to indicate for which entries direct experimental evidence is available or if the complex has been inferred based on homology or orthology. The data are searchable using standard identifiers, such as UniProt, ChEBI and GO IDs, protein, gene and complex names or synonyms. This reference resource will be maintained and grow to encompass an increasing number of organisms. Input from groups and individuals with specific areas of expertise is welcome.
ComPPI: a cellular compartment-specific database for protein-protein interaction network analysis
Here we present ComPPI, a cellular compartment-specific database of proteins and their interactions enabling an extensive, compartmentalized protein–protein interaction network analysis (URL: http://ComPPI.LinkGroup.hu). ComPPI enables the user to filter out biologically unlikely interactions, in which the two interacting proteins share no subcellular localization, and to predict novel properties, such as compartment-specific biological functions. ComPPI is an integrated database covering four species (S. cerevisiae, C. elegans, D. melanogaster and H. sapiens). The compilation of nine protein–protein interaction and eight subcellular localization data sets had four curation steps including a manually built, comprehensive hierarchical structure of >1600 subcellular localizations. ComPPI provides confidence scores for protein subcellular localizations and protein–protein interactions. ComPPI has user-friendly search options for individual proteins giving their subcellular localization, their interactions and the likelihood of their interactions considering the subcellular localization of their interacting partners. Download options for search results, whole proteomes, organelle-specific interactomes and subcellular localization data are available on its website. Due to its novel features, ComPPI is useful for the analysis of experimental results in biochemistry and molecular biology, as well as for proteome-wide studies in bioinformatics and network science helping cellular biology, medicine and drug design.
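The localization-based filtering described above can be illustrated with a minimal sketch. It assumes a flat set of compartment names per protein and ignores ComPPI's hierarchical localization structure and confidence scores; the function name `filter_colocalized` and all protein identifiers are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def filter_colocalized(
    interactions: List[Tuple[str, str]],
    localizations: Dict[str, Set[str]],
) -> List[Tuple[str, str, Set[str]]]:
    """Keep only interactions whose two partners share at least one compartment."""
    kept = []
    for protein_a, protein_b in interactions:
        # Proteins with no localization record are treated as having no compartments.
        shared = localizations.get(protein_a, set()) & localizations.get(protein_b, set())
        if shared:
            kept.append((protein_a, protein_b, shared))
    return kept


# Toy, made-up data for illustration only.
toy_localizations = {
    "P1": {"nucleus", "cytosol"},
    "P2": {"nucleus"},
    "P3": {"mitochondrion"},
}
toy_interactions = [("P1", "P2"), ("P1", "P3"), ("P2", "P4")]

colocalized = filter_colocalized(toy_interactions, toy_localizations)
print(colocalized)
```

In this toy input only the P1/P2 pair survives, because it is the only pair with a compartment in common; a real analysis would also need to decide how to treat proteins without any localization data.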
PTMcode v2: a resource for functional associations of post-translational modifications within and between proteins
The post-translational regulation of proteins is mainly driven by two molecular events: their modification by several types of moieties and their interaction with other proteins. These two processes are interdependent and together are responsible for the function of the protein in a particular cell state. Several databases focus on the prediction and compilation of protein–protein interactions (PPIs) and no fewer on the collection and analysis of protein post-translational modifications (PTMs); however, there are no resources that concentrate on describing the regulatory role of PTMs in PPIs. We developed several methods based on residue co-evolution and proximity to predict the functional associations of pairs of PTMs, which we apply to modifications in the same protein and between two interacting proteins. In order to make data available for understudied organisms, PTMcode v2 (http://ptmcode.embl.de) includes a new strategy to propagate PTMs from validated modified sites through orthologous proteins. The second release of PTMcode covers 19 eukaryotic species from which we collected more than 300 000 experimentally verified PTMs (>1 300 000 propagated) of 69 types extracting the post-translational regulation of >100 000 proteins and >100 000 interactions. In total, we report 8 million associations of PTMs regulating single proteins and over 9.4 million interplays tuning PPIs.
dbSNO 2.0: a resource for exploring structural environment, functional and disease association and regulatory network of protein S-nitrosylation
Given the increasing number of proteins reported to be regulated by S-nitrosylation (SNO), it is considered to act, in a manner analogous to phosphorylation, as a pleiotropic regulator that elicits dual effects to regulate diverse pathophysiological processes by altering protein function, stability, and conformation in various cancers and human disorders. Due to its importance in regulating protein functions and cell signaling, dbSNO (http://dbSNO.mbc.nctu.edu.tw) has been extended as a resource for exploring the structural environment of SNO substrate sites and the regulatory networks of S-nitrosylated proteins. An increasing interest in the structural environment of PTM substrate sites motivated us to map all manually curated SNO peptides (4165 SNO sites within 2277 proteins) to PDB protein entries by sequence identity, which provides information on spatial amino acid composition, solvent-accessible surface area, spatially neighboring amino acids, and side chain orientation for 298 substrate cysteine residues. Additionally, annotations of protein molecular functions, biological processes, functional domains and human diseases are integrated to explore the functional and disease associations of the S-nitrosoproteome. In this update, users can search for a group of proteins/genes of interest, and the system reconstructs the SNO regulatory network based on information on metabolic pathways and protein-protein interactions. Most importantly, an endogenous yet pathophysiological S-nitrosoproteomic dataset from colorectal cancer patients was used to demonstrate that dbSNO can discover potential SNO proteins involved in the regulation of NO signaling in cancer pathways.
PhosphoSitePlus® (PSP, http://www.phosphosite.org/), a knowledgebase dedicated to mammalian post-translational modifications (PTMs), contains over 330 000 non-redundant PTMs, including phospho, acetyl, ubiquityl and methyl groups. Over 95% of the sites are from mass spectrometry (MS) experiments. In order to improve data reliability, early MS data have been reanalyzed, applying a common standard of analysis across over 1 000 000 spectra. Site assignments with P > 0.05 were filtered out. Two new downloads are available from PSP. The ‘Regulatory sites’ dataset includes curated information about modification sites that regulate downstream cellular processes, molecular functions and protein-protein interactions. The ‘PTMVar’ dataset, an intersect of missense mutations and PTMs from PSP, identifies over 25 000 PTMVars (PTMs Impacted by Variants) that can rewire signaling pathways. The PTMVar data include missense mutations from UniProtKB, TCGA and other sources that cause over 2000 diseases or syndromes (MIM) and polymorphisms, or are associated with hundreds of cancers. PTMVars include 18 548 phosphorylation sites, 3412 ubiquitylation sites, 2316 acetylation sites, 685 methylation sites and 245 succinylation sites.
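As a rough illustration of what an "intersect of missense mutations and PTMs" involves, the sketch below joins the two data sets on protein accession and residue position. It assumes that a variant counts as a hit only when it falls exactly on the modified residue; the actual PSP criteria are not given here and may differ. The function name `find_ptmvars` and all accessions, positions and residues are made up.

```python
from typing import Dict, List, Tuple

Site = Tuple[str, int]                 # (protein accession, residue position)
Variant = Tuple[str, int, str, str]    # (accession, position, reference aa, variant aa)


def find_ptmvars(ptm_sites: Dict[Site, str], variants: List[Variant]) -> List[dict]:
    """Return the variants that fall exactly on a known modified residue."""
    hits = []
    for accession, position, ref_aa, alt_aa in variants:
        modification = ptm_sites.get((accession, position))
        if modification is not None:
            hits.append(
                {
                    "accession": accession,
                    "position": position,
                    "modification": modification,
                    "change": f"{ref_aa}{position}{alt_aa}",
                }
            )
    return hits


# Hypothetical example records, not taken from PSP.
toy_ptm_sites = {("PROT_A", 15): "phosphorylation", ("PROT_B", 120): "acetylation"}
toy_variants = [
    ("PROT_A", 15, "S", "A"),    # hits the modified serine
    ("PROT_A", 16, "P", "L"),    # adjacent residue: not counted in this sketch
    ("PROT_B", 120, "K", "R"),   # hits the modified lysine
]

ptmvars = find_ptmvars(toy_ptm_sites, toy_variants)
print([hit["change"] for hit in ptmvars])
```

Keying the PTM table on `(accession, position)` makes each variant lookup a constant-time dictionary access, which matters when hundreds of thousands of sites are intersected with large variant collections.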
ProteomeScout: a repository and analysis resource for post-translational modifications and proteins
ProteomeScout (https://proteomescout.wustl.edu) is a resource for the study of proteins and their post-translational modifications (PTMs) consisting of a database of PTMs, a repository for experimental data, an analysis suite for PTM experiments, and a tool for visualizing the relationships between complex protein annotations. The PTM database is a compendium of public PTM data, coupled with user-uploaded experimental data. ProteomeScout provides analysis tools for experimental datasets, including summary views and subset selection, which can identify relationships within subsets of data by testing for statistically significant enrichment of protein annotations. Protein annotations are incorporated in the ProteomeScout database from external resources and include terms such as Gene Ontology annotations, domains, secondary structure and non-synonymous polymorphisms. These annotations are available in the database download, in the analysis tools and in the protein viewer. The protein viewer allows for the simultaneous visualization of annotations in an interactive web graphic, which can be exported in Scalable Vector Graphics (SVG) format. Finally, quantitative data measurements associated with public experiments are also easily viewable within protein records, allowing researchers to see how PTMs change across different contexts. ProteomeScout should prove useful for protein researchers and should benefit the proteomics community by providing a stable repository for PTM experiments.
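Testing a subset of proteins for enrichment of an annotation can be sketched as follows. The text above does not name the statistical test used, so the one-sided hypergeometric test here is an assumption, as are the function name `hypergeom_enrichment_p` and the example counts.

```python
from math import comb


def hypergeom_enrichment_p(
    population: int,
    annotated: int,
    subset: int,
    annotated_in_subset: int,
) -> float:
    """P(X >= annotated_in_subset) when `subset` proteins are drawn without
    replacement from `population` proteins, `annotated` of which carry the term."""
    total = comb(population, subset)
    upper = min(annotated, subset)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, subset - k)
        for k in range(annotated_in_subset, upper + 1)
    )
    return tail / total


# Made-up counts: 10 proteins in the dataset, 5 carry the annotation,
# and all 5 proteins of the selected subset are annotated.
p_value = hypergeom_enrichment_p(population=10, annotated=5, subset=5, annotated_in_subset=5)
print(p_value)
```

A small p-value suggests the annotation is over-represented in the subset relative to the whole dataset; when many annotations are tested at once, a multiple-testing correction would normally be applied as well.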
Phosphatases are crucial enzymes in health and disease, but knowledge of their biological roles is still limited. Identifying substrates continues to be a great challenge. To support research on phosphatase–kinase–substrate networks, we present here an update on the human DEPhOsphorylation Database: DEPOD (http://www.depod.org or http://www.koehn.embl.de/depod). DEPOD is a manually curated open access database providing human phosphatases, their protein and non-protein substrates, dephosphorylation sites, pathway involvements and external links to kinases and small molecule modulators. All internal data are fully searchable, including via a BLAST application. Since the first release, more human phosphatases and substrates, their associated signaling pathways (also from new sources), and interacting proteins for all phosphatases and protein substrates have been added to DEPOD. The user interface has been further optimized; for example, the interactive human phosphatase–substrate network now contains a ‘highlight node’ function for phosphatases, which includes the visualization of neighbors in the network.
The P2CS database (http://www.p2cs.org/) is a comprehensive resource for the analysis of Prokaryotic Two-Component Systems (TCSs). TCSs are comprised of a receptor histidine kinase (HK) and a partner response regulator (RR) and control important prokaryotic behaviors. The latest incarnation of P2CS includes 164 651 TCS proteins, from 2758 sequenced prokaryotic genomes.
Several important new features have been added to P2CS since it was last described. Users can search P2CS via BLAST, adding hits to their cart, and homologous proteins can be aligned using MUSCLE and viewed using Jalview within P2CS. P2CS also provides phylogenetic trees based on the conserved signaling domains of the RRs and HKs from entire genomes. HK and RR trees are annotated with gene organization and domain architecture, providing insights into the evolutionary origin of the contemporary gene set.
The majority of TCSs are encoded by adjacent HK and RR genes; however, ‘orphan’ unpaired TCS genes are also abundant, and identifying their partner proteins is challenging. P2CS now provides paired HK and RR trees with proteins from the same genetic locus indicated. This allows the appraisal of evolutionary relationships across entire TCSs and in some cases the identification of candidate partners for orphan TCS proteins.
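The distinction between paired and orphan TCS genes can be made concrete with a small sketch. It assumes that "adjacent" means immediately neighbouring positions in gene order on the same replicon, which is a simplification; how P2CS itself defines a shared genetic locus is not specified here. The function name and gene identifiers are hypothetical.

```python
from typing import List, Tuple

Gene = Tuple[str, int, str]  # (gene id, index in gene order, "HK" or "RR")


def split_paired_and_orphans(genes: List[Gene]) -> Tuple[List[str], List[str]]:
    """Label a TCS gene as paired if a directly neighbouring gene is of the partner type."""
    by_position = {position: kind for _, position, kind in genes}
    paired, orphans = [], []
    for gene_id, position, kind in genes:
        partner_kind = "RR" if kind == "HK" else "HK"
        neighbour_kinds = (by_position.get(position - 1), by_position.get(position + 1))
        if partner_kind in neighbour_kinds:
            paired.append(gene_id)
        else:
            orphans.append(gene_id)
    return paired, orphans


# Made-up gene order: hk1 and rr1 are neighbours, hk2 sits alone.
toy_genes = [("hk1", 10, "HK"), ("rr1", 11, "RR"), ("hk2", 50, "HK")]
paired_genes, orphan_genes = split_paired_and_orphans(toy_genes)
print(paired_genes, orphan_genes)
```

Genes flagged as orphans by a rule like this are the ones for which candidate partners would then have to be sought by other means, such as comparing the HK and RR trees.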
BioModels: ten-year anniversary
BioModels (http://www.ebi.ac.uk/biomodels/) is a repository of mathematical models of biological processes. A large set of models is curated to verify both correspondence to the biological process that the model seeks to represent, and reproducibility of the simulation results as described in the corresponding peer-reviewed publication. Many models submitted to the database are annotated, cross-referencing its components to external resources such as database records, and terms from controlled vocabularies and ontologies. BioModels comprises two main branches: one is composed of models derived from literature, while the second is generated through automated processes. BioModels currently hosts over 1200 models derived directly from the literature, as well as in excess of 140 000 models automatically generated from pathway resources. This represents an approximate 60-fold growth for literature-based model numbers alone, since BioModels’ first release a decade ago. This article describes updates to the resource over this period, which include changes to the user interface, the annotation profiles of models in the curation pipeline, major infrastructure changes, ability to perform online simulations and the availability of model content in Linked Data form. We also outline planned improvements to cope with a diverse array of new challenges.
CeCaFDB: a curated database for the documentation, visualization and comparative analysis of central carbon metabolic flux distributions explored by 13C-fluxomics
The Central Carbon Metabolic Flux Database (CeCaFDB, available at http://www.cecafdb.org) is a manually curated, multipurpose and open-access database for the documentation, visualization and comparative analysis of the quantitative flux results of central carbon metabolism among microbes and animal cells. It encompasses records for more than 500 flux distributions among 36 organisms and includes information regarding the genotype, culture medium, growth conditions and other specific information gathered from hundreds of journal articles. In addition to its comprehensive literature-derived data, the CeCaFDB supports a common text search function among the data and interactive visualization of the curated flux distributions with compartmentation information based on the Cytoscape Web API, which facilitates data interpretation. The CeCaFDB offers four modules to calculate a similarity score or to perform an alignment between the flux distributions. One of the modules was built using an integer programming algorithm for flux distribution alignment that was specifically designed for this study. Based on these modules, the CeCaFDB also supports an extensive flux distribution comparison function among the curated data. The CeCaFDB is designed to address the broad demands of biochemists, metabolic engineers, systems biologists and members of the -omics community.
CFam: a chemical families database based on iterative selection of functional seeds and seed-directed compound clustering
Similarity-based clustering and classification of compounds enable the search for drug leads and support structural and chemogenomic studies that facilitate chemical, biomedical, agricultural, material and other industrial applications. A database that organizes compounds into similarity-based as well as scaffold-based and property-based families is useful for facilitating these tasks. The CFam chemical family database (http://bidd2.cse.nus.edu.sg/cfam) was developed to hierarchically cluster drugs, bioactive molecules, human metabolites, natural products, patented agents and other molecules into functional families, superfamilies and classes of structurally similar compounds based on the literature-reported high, intermediate and remote similarity measures. The compounds were represented by molecular fingerprints, and molecular similarity was measured by the Tanimoto coefficient. The functional seeds of CFam families were taken from hierarchically clustered drugs, bioactive molecules, human metabolites, natural products and patented agents, respectively, and were used to characterize families and cluster compounds into families, superfamilies and classes. CFam currently contains 11 643 classes, 34 880 superfamilies and 87 136 families of 490 279 compounds (1691 approved drugs, 1228 clinical trial drugs, 12 386 investigative drugs, 262 881 highly active molecules, 15 055 human metabolites, 80 255 ZINC-processed natural products and 116 783 patented agents). Efforts will be made to further expand the CFam database and add more functional categories and families based on other types of molecular representations.
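The Tanimoto coefficient between two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below computes it on sets of "on" bit indices and then assigns a compound to the most similar seed. The assignment rule, the threshold value and all fingerprints are illustrative assumptions, not CFam's actual procedure or cut-offs.

```python
from typing import Dict, Optional, Set


def tanimoto(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Tanimoto coefficient |A & B| / |A | B| on sets of 'on' bit indices."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / union


def assign_to_seed(
    compound_fp: Set[int],
    seed_fps: Dict[str, Set[int]],
    threshold: float,
) -> Optional[str]:
    """Return the name of the most similar seed, or None if below the threshold."""
    best_name, best_score = None, 0.0
    for name, seed_fp in seed_fps.items():
        score = tanimoto(compound_fp, seed_fp)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# Made-up fingerprints: each set lists the indices of bits that are set.
toy_seeds = {"family_1": {1, 2, 3, 4}, "family_2": {10, 11}}
toy_compound = {1, 2, 3, 5}

similarity = tanimoto(toy_compound, toy_seeds["family_1"])
print(similarity, assign_to_seed(toy_compound, toy_seeds, threshold=0.5))
```

Raising or lowering the hypothetical `threshold` loosely corresponds to clustering at a tighter or looser level of similarity, such as family versus superfamily or class.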
The ‘Human Immunodeficiency Virus Type 1 (HIV-1), Human Interaction Database’, available through the National Library of Medicine at http://www.ncbi.nlm.nih.gov/genome/viruses/retroviruses/hiv-1/interactions, serves the scientific community exploring the discovery of novel HIV vaccine candidates and therapeutic targets. Each HIV-1 human protein interaction can be retrieved without restriction by web-based downloads and ftp protocols and includes: Reference Sequence (RefSeq) protein accession numbers, National Center for Biotechnology Information Gene identification numbers, brief descriptions of the interactions, searchable keywords for interactions and PubMed identification numbers (PMIDs) of journal articles describing the interactions. In addition to specific HIV-1 protein–human protein interactions, the database includes effects on HIV-1 replication that result when the expression of individual human genes is blocked using small interfering RNA (siRNA). A total of 3142 human genes are described as participating in 12 786 protein–protein interactions, along with 1316 replication interactions described for 1250 human genes identified using siRNA. Together the data identify 4006 human genes involved in 14 102 interactions. With the inclusion of siRNA interactions we introduce a redesigned web interface to enhance viewing, filtering and downloading of the combined data set.
NCBI Viral Genomes Resource
Recent technological innovations have ignited an explosion in virus genome sequencing that promises to fundamentally alter our understanding of viral biology and profoundly impact public health policy. Yet, any potential benefits from the billowing cloud of next generation sequence data hinge upon well implemented reference resources that facilitate the identification of sequences, aid in the assembly of sequence reads and provide reference annotation sources. The NCBI Viral Genomes Resource is a reference resource designed to bring order to this sequence shockwave and improve usability of viral sequence data. The resource can be accessed at http://www.ncbi.nlm.nih.gov/genome/viruses/ and catalogs all publicly available virus genome sequences and curates reference genome sequences. As the number of genome sequences has grown, so too have the difficulties in annotating and maintaining reference sequences. The rapid expansion of the viral sequence universe has forced a recalibration of the data model to better provide extant sequence representation and enhanced reference sequence products to serve the needs of the various viral communities. This, in turn, has placed increased emphasis on leveraging the knowledge of individual scientific communities to identify important viral sequences and develop well annotated reference virus genome sets.
Increasing evidence reveals that diverse non-coding RNAs (ncRNAs) play critically important roles in viral infection. Viruses can use diverse ncRNAs to manipulate both cellular and viral gene expression to establish a host environment conducive to the completion of the viral life cycle. Many host cellular ncRNAs can also directly or indirectly influence viral replication and even target virus genomes. ViRBase (http://www.rna-society.org/virbase) aims to provide the scientific community with a resource for efficient browsing and visualization of virus-host ncRNA-associated interactions and interaction networks in viral infection. The current version of ViRBase documents more than 12 000 viral and cellular ncRNA-associated virus–virus, virus–host, host–virus and host–host interactions involving more than 460 non-redundant ncRNAs and 4400 protein-coding genes from more than 60 viruses and 20 hosts. Users can query, browse and manipulate these virus–host ncRNA-associated interactions. ViRBase will be of help in uncovering the generic organizing principles of cellular virus–host ncRNA-associated interaction networks in viral infection.
VirHostNet release 2.0 (http://virhostnet.prabi.fr) is a knowledgebase dedicated to the network-based exploration of virus–host protein–protein interactions. Since the previous VirHostNet release (2009), a second round of manual curation was performed to annotate the new torrent of high-throughput protein–protein interaction data from the literature. This resource is shared publicly, in PSI-MI TAB 2.5 format, using a PSICQUIC web service. The new interface of VirHostNet 2.0 is based on the Cytoscape Web library and provides user-friendly access to the most complete and accurate resource of virus–virus and virus–host protein–protein interactions as well as their projection onto their corresponding host cell protein interaction networks. We hope that the VirHostNet 2.0 system will facilitate systems biology and gene-centered analysis of infectious diseases and will help to identify new molecular targets for antiviral drug design. This resource will also continue to help scientists worldwide to improve our knowledge of the molecular mechanisms involved in the antiviral response mediated by the cell and of the strategies selected by viruses to hijack the host immune system.
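PSI-MI TAB 2.5 is a tab-separated format with 15 columns per binary interaction, in which multiple values within a column are separated by `|` and an empty column is written as `-`. The sketch below parses one such line into a dictionary. The short column keys are names chosen for this example, the sample record is made up, and the naive split on `|` ignores the possibility of that character appearing inside quoted text.

```python
from typing import Dict, List

# Column order of PSI-MI TAB 2.5; the key names are chosen for this sketch.
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_id_a", "alt_id_b", "alias_a", "alias_b",
    "detection_method", "first_author", "publication", "taxid_a", "taxid_b",
    "interaction_type", "source_db", "interaction_id", "confidence",
]


def parse_mitab25_line(line: str) -> Dict[str, List[str]]:
    """Split one MITAB 2.5 line into named columns, each a list of values."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(MITAB25_COLUMNS):
        raise ValueError(f"expected 15 columns, found {len(fields)}")
    record = {}
    for name, raw in zip(MITAB25_COLUMNS, fields):
        record[name] = [] if raw == "-" else raw.split("|")
    return record


# Made-up record for illustration; identifiers are placeholders.
sample_line = "\t".join([
    "uniprotkb:P00001", "uniprotkb:Q00002", "-", "-", "-", "-",
    'psi-mi:"MI:0018"(two hybrid)', "Doe et al. (2010)", "pubmed:00000000",
    "taxid:11676", "taxid:9606", 'psi-mi:"MI:0915"(physical association)',
    "example-db", "example-db:EX-1", "-",
])

record = parse_mitab25_line(sample_line)
print(record["id_a"], record["taxid_a"], record["taxid_b"])
```

Records parsed this way can be filtered by the two taxonomy columns to separate virus–virus from virus–host pairs before building a network.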
Viral infections often cause diseases by perturbing several cellular processes in the infected host. Viral proteins target host proteins and either form new complexes or modulate the formation of functional host complexes. Describing and understanding the perturbation of the host interactome following viral infection is essential for basic virology and for the development of antiviral therapies. In order to provide a general overview of such interactions, a few years ago we developed VirusMINT. We have now extended the scope and coverage of VirusMINT and established VirusMentha, a new virus–virus and virus–host interaction resource built on the detailed curation protocols of the IMEx consortium and on the integration strategies developed for mentha. VirusMentha is automatically updated every week by capturing, via the PSICQUIC protocol, interactions curated by five different databases that are part of the IMEx consortium. VirusMentha can be freely browsed at http://virusmentha.uniroma2.it/ and its complete data set is available for download.
rrnDB: improved tools for interpreting rRNA gene abundance in bacteria and archaea and a new foundation for future development
Microbiologists utilize ribosomal RNA genes as molecular markers of taxonomy in surveys of microbial communities. rRNA genes are often co-located as part of an rrn operon, and multiple copies of this operon are present in genomes across the microbial tree of life. rrn copy number variability provides valuable insight into microbial life history, but introduces systematic bias when measuring community composition in molecular surveys. Here we present an update to the ribosomal RNA operon copy number database (rrnDB), a publicly available, curated resource for copy number information for bacteria and archaea. The redesigned rrnDB (http://rrndb.umms.med.umich.edu/) brings a substantial increase in the number of genomes described, improved curation, mapping of genomes to both NCBI and RDP taxonomies, and refined tools for querying and analyzing these data. With these changes, the rrnDB is better positioned to remain a comprehensive resource under the torrent of microbial genome sequencing. The enhanced rrnDB will contribute to the analysis of molecular surveys and to research linking genomic characteristics to life history.
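The bias introduced by variable rrn copy number can be illustrated with a simple correction: divide each taxon's rRNA gene read count by its copy number and renormalize. This is a minimal sketch of the general idea, not the rrnDB tools themselves; the function name, taxon labels and counts are made up.

```python
from typing import Dict


def correct_for_copy_number(
    read_counts: Dict[str, float],
    copy_numbers: Dict[str, float],
) -> Dict[str, float]:
    """Convert rRNA gene read counts into estimated relative cell abundances."""
    # Reads per gene copy approximate the number of genomes (cells) sampled.
    adjusted = {
        taxon: count / copy_numbers[taxon] for taxon, count in read_counts.items()
    }
    total = sum(adjusted.values())
    return {taxon: value / total for taxon, value in adjusted.items()}


# Made-up survey: taxon_A has 7 rrn copies per genome, taxon_B has 1.
toy_reads = {"taxon_A": 700, "taxon_B": 100}
toy_copies = {"taxon_A": 7, "taxon_B": 1}

abundances = correct_for_copy_number(toy_reads, toy_copies)
print(abundances)
```

In the toy survey taxon_A accounts for 87.5% of the reads but only half of the estimated cells once its seven operon copies are taken into account, which is the kind of systematic bias described above.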
Update on RefSeq microbial genomes resources
The NCBI RefSeq genome collection (http://www.ncbi.nlm.nih.gov/genome) represents all three major domains of life, Eukarya, Bacteria and Archaea, as well as Viruses. Prokaryotic genome sequences are the most rapidly growing part of the collection. During 2014, more than 10 000 microbial genome assemblies were publicly released, bringing the total number of prokaryotic genomes close to 30 000. We continue to improve the quality and usability of the microbial genome resources by providing easy access to the data and the results of pre-computed analyses, and by improving analysis and visualization tools. A number of improvements have been incorporated into the Prokaryotic Genome Annotation Pipeline. Several new features have been added to the RefSeq prokaryotic genome data processing pipeline, including the calculation of genome groups (clades) and the optimization of protein cluster generation using a pan-genome approach.
Comprehensive experimental resources, such as ORFeome clone libraries and deletion mutant collections, are fundamental tools for elucidation of gene function. Omics data sets generated using these resources provide key information for functional analysis, modeling and simulation in both individual and systematic approaches. With the long-term goal of complete understanding of a cell, we have over the past decade created a variety of clone and mutant sets for functional genomics studies of Escherichia coli K-12. We have made these experimental resources freely available to the academic community worldwide. Accordingly, these resources have now been used in numerous investigations of a multitude of cell processes. Quality control is extremely important for evaluating results generated by these resources. Because the genome annotation that we originally used for construction has changed since 2005, we have updated these genomic resources accordingly. Here, we describe GenoBase (http://ecoli.naist.jp/GB/), which contains key information about comprehensive experimental resources of E. coli K-12, their quality control and several omics data sets generated using these resources.
MyMpn (http://mympn.crg.eu) is an online resource devoted to studying the human pathogen Mycoplasma pneumoniae, a minimal bacterium causing lower respiratory tract infections. Due to its small size, its ability to grow in vitro, and the amount of data produced over the past decades, M. pneumoniae is an interesting model organism for the development of systems biology approaches for unicellular organisms. Our database hosts a wealth of omics-scale datasets generated by hundreds of experimental and computational analyses. These include data obtained from gene expression profiling experiments, gene essentiality studies, protein abundance profiling, protein complex analysis, metabolic reactions and network modeling, cell growth experiments, comparative genomics and 3D tomography. In addition, the intuitive web interface provides access to several visualization and analysis tools as well as to different data search options. The availability and, even more relevant, the accessibility of properly structured and organized data are of utmost importance when aiming to understand the biology of an organism on a global scale. Therefore, MyMpn constitutes a unique and valuable new resource for the large systems biology and microbiology community.