Nucleic Acids Research
The ArrayExpress Archive of Functional Genomics Data (http://www.ebi.ac.uk/arrayexpress) is an international functional genomics database at the European Bioinformatics Institute (EMBL-EBI) recommended by most journals as a repository for data supporting peer-reviewed publications. It contains data from over 7000 public sequencing and 42 000 array-based studies comprising over 1.5 million assays in total. The proportion of sequencing-based submissions has grown significantly over the last few years and has doubled in the last 18 months, whilst the rate of microarray submissions is growing slightly. All data in ArrayExpress are available in the MAGE-TAB format, which allows robust linking to data analysis and visualization tools and standardized analysis. The main development over the last two years has been the release of a new data submission tool Annotare, which has reduced the average submission time almost 3-fold. In the near future, Annotare will become the only submission route into ArrayExpress, alongside MAGE-TAB format-based pipelines. ArrayExpress is a stable and highly accessed resource. Our future tasks include automation of data flows and further integration with other EMBL-EBI resources for the representation of multi-omics data.
CODEX: a next-generation sequencing experiment database for the haematopoietic and embryonic stem cell communities
CODEX (http://codex.stemcells.cam.ac.uk/) is a user-friendly database for the direct access and interrogation of publicly available next-generation sequencing (NGS) data, specifically aimed at experimental biologists. In an era of multi-centre genomic dataset generation, CODEX provides a single database where these samples are collected, uniformly processed and vetted. The main drive of CODEX is to provide the wider scientific community with instant access to high-quality NGS data, which, irrespective of the publishing laboratory, is directly comparable. CODEX allows users to immediately visualize or download processed datasets, or compare user-generated data against the database's cumulative knowledge-base. CODEX contains four types of NGS experiments: transcription factor chromatin immunoprecipitation coupled to high-throughput sequencing (ChIP-Seq), histone modification ChIP-Seq, DNase-Seq and RNA-Seq. These are largely encompassed within two specialized repositories, HAEMCODE and ESCODE, which are focused on haematopoiesis and embryonic stem cell samples, respectively. To date, CODEX contains over 1000 samples, including 221 unique TFs and 93 unique cell types. CODEX therefore provides one of the most complete resources of publicly available NGS data for the direct interrogation of transcriptional programmes that regulate cellular identity and fate in the context of mammalian development, homeostasis and disease.
Co-expression networks have proven effective at assigning putative functions to genes based on the functional annotation of their co-expressed partners, in candidate gene prioritization studies and in improving our understanding of regulatory networks. The growing number of genome resequencing efforts and genome-wide association studies often identify loci containing novel genes and there is a need to infer their functions and interaction partners. To facilitate this we have expanded GeneFriends, an online database that allows users to identify co-expressed genes with one or more user-defined genes. This expansion entails an RNA-seq-based co-expression map that includes genes and transcripts that are not present in the microarray-based co-expression maps, including over 10 000 non-coding RNAs. The results users obtain from GeneFriends include a co-expression network as well as a summary of the functional enrichment among the co-expressed genes. Novel insights can be gathered from this database for different splice variants and ncRNAs, such as microRNAs and lincRNAs. Furthermore, our updated tool allows candidate transcripts to be linked to diseases and processes using a guilt-by-association approach. GeneFriends is freely available from http://www.GeneFriends.org and can be used to quickly identify and rank candidate targets relevant to the process or disease under study.
Current gene co-expression databases and correlation networks do not support cell-specific analysis. Gene co-expression and expression correlation are subtly different phenomena, although both are likely to be functionally significant. Here, we report a new database, ImmuCo (http://immuco.bjmu.edu.cn), which is a cell-specific database that contains information about gene co-expression in immune cells, identifying co-expression and correlation between any two genes. The strength of co-expression of queried genes is indicated by signal values and detection calls, whereas expression correlation and strength are reflected by Pearson correlation coefficients. A scatter plot of the signal values is provided to directly illustrate the extent of co-expression and correlation. In addition, the database allows the analysis of cell-specific gene expression profile across multiple experimental conditions and can generate a list of genes that are highly correlated with the queried genes. Currently, the database covers 18 human cell groups and 10 mouse cell groups, including 20 283 human genes and 20 963 mouse genes. More than 8.6 x 108 and 7.4 x 108 probe set combinations are provided for querying each human and mouse cell group, respectively. Sample applications support the distinctive advantages of the database.
The eukaryotic cell division cycle is a highly regulated process that consists of a complex series of events and involves thousands of proteins. Researchers have studied the regulation of the cell cycle in several organisms, employing a wide range of high-throughput technologies, such as microarray-based mRNA expression profiling and quantitative proteomics. Due to its complexity, the cell cycle can also fail or otherwise change in many different ways if important genes are knocked out, which has been studied in several microscopy-based knockdown screens. The data from these many large-scale efforts are not easily accessed, analyzed and combined due to their inherent heterogeneity. To address this, we have created Cyclebase—available at http://www.cyclebase.org—an online database that allows users to easily visualize and download results from genome-wide cell-cycle-related experiments. In Cyclebase version 3.0, we have updated the content of the database to reflect changes to genome annotation, added new mRNA and protein expression data, and integrated cell-cycle phenotype information from high-content screens and model-organism databases. The new version of Cyclebase also features a new web interface, designed around an overview figure that summarizes all the cell-cycle-related data for a gene.
MOPED (Multi-Omics Profiling Expression Database; http://moped.proteinspire.org) has transitioned from solely a protein expression database to a multi-omics resource for human and model organisms. Through a web-based interface, MOPED presents consistently processed data for gene, protein and pathway expression. To improve data quality, consistency and use, MOPED includes metadata detailing experimental design and analysis methods. The multi-omics data are integrated through direct links between genes and proteins and further connected to pathways and experiments. MOPED now contains over 5 million records, information for approximately 75 000 genes and 50 000 proteins from four organisms (human, mouse, worm, yeast). These records correspond to 670 unique combinations of experiment, condition, localization and tissue. MOPED includes the following new features: pathway expression, Pathway Details pages, experimental metadata checklists, experiment summary statistics and more advanced searching tools. Advanced searching enables querying for genes, proteins, experiments, pathways and keywords of interest. The system is enhanced with visualizations for comparing across different data types. In the future MOPED will expand the number of organisms, increase integration with pathways and provide connections to disease.
The Addgene Repository (http://www.addgene.org) was founded to accelerate research and discovery by improving access to useful, high-quality research materials and information. The repository archives plasmids generated by scientists, conducts quality control, annotates the associated data and makes the plasmids and their data available to the scientific community. Plasmid associated data undergoes ongoing curation by members of the scientific community and by Addgene scientists. The growing database contains information on >31 000 unique plasmids spanning most experimental biological systems and organisms. The library includes a large number of plasmid tools for use in a wide variety of research areas, such as empty backbones, lentiviral resources, fluorescent protein vectors and genome engineering tools. The Addgene Repository database is always evolving with new plasmid deposits so it contains currently pertinent resources while ensuring the information on earlier deposits is still available. Custom search and browse features are available to access information on the diverse collection. Extensive educational materials and information are provided by the database curators to support the scientists that are accessing the repository's materials and data.
The Biobanking Analysis Resource Catalogue (BARCdb): a new research tool for the analysis of biobank samples
We report the development of a new database of technology services and products for analysis of biobank samples in biomedical research. BARCdb, the Biobanking Analysis Resource Catalogue (http://www.barcdb.org), is a freely available web resource, listing expertise and molecular resource capabilities of research centres and biotechnology companies. The database is designed for researchers who require information on how to make best use of valuable biospecimens from biobanks and other sample collections, focusing on the choice of analytical techniques and the demands they make on the type of samples, pre-analytical sample preparation and amounts needed. BARCdb has been developed as part of the Swedish biobanking infrastructure (BBMRI.se), but now welcomes submissions from service providers throughout Europe. BARCdb can help match resource providers with potential users, stimulating transnational collaborations and ensuring compatibility of results from different labs. It can promote a more optimal use of European resources in general, both with respect to standard and more experimental technologies, as well as for valuable biobank samples. This article describes how information on service and reagent providers of relevant technologies is made available on BARCdb, and how this resource may contribute to strengthening biomedical research in academia and in the biotechnology and pharmaceutical industries.
BioAssay Research Database (BARD): chemical biology and probe-development enabled by structured metadata and result types
BARD, the BioAssay Research Database (https://bard.nih.gov/) is a public database and suite of tools developed to provide access to bioassay data produced by the NIH Molecular Libraries Program (MLP). Data from 631 MLP projects were migrated to a new structured vocabulary designed to capture bioassay data in a formalized manner, with particular emphasis placed on the description of assay protocols. New data can be submitted to BARD with a user-friendly set of tools that assist in the creation of appropriately formatted datasets and assay definitions. Data published through the BARD application program interface (API) can be accessed by researchers using web-based query tools or a desktop client. Third-party developers wishing to create new tools can use the API to produce stand-alone tools or new plug-ins that can be integrated into BARD. The entire BARD suite of tools therefore supports three classes of researcher: those who wish to publish data, those who wish to mine data for testable hypotheses, and those in the developer community who wish to build tools that leverage this carefully curated chemical biology resource.
INFRAFRONTIER--providing mutant mouse resources as research tools for the international scientific community
The laboratory mouse is a key model organism to investigate mechanism and therapeutics of human disease. The number of targeted genetic mouse models of disease is growing rapidly due to high-throughput production strategies employed by the International Mouse Phenotyping Consortium (IMPC) and the development of new, more efficient genome engineering techniques such as CRISPR based systems. We have previously described the European Mouse Mutant Archive (EMMA) resource and how this international infrastructure provides archiving and distribution worldwide for mutant mouse strains. EMMA has since evolved into INFRAFRONTIER (http://www.infrafrontier.eu), the pan-European research infrastructure for the systemic phenotyping, archiving and distribution of mouse disease models. Here we describe new features including improved search for mouse strains, support for new embryonic stem cell resources, access to training materials via a comprehensive knowledgebase and the promotion of innovative analytical and diagnostic techniques.
PlasmoGEM, a database supporting a community resource for large-scale experimental genetics in malaria parasites
The Plasmodium Genetic Modification (PlasmoGEM) database (http://plasmogem.sanger.ac.uk) provides access to a resource of modular, versatile and adaptable vectors for genome modification of Plasmodium spp. parasites. PlasmoGEM currently consists of >2000 plasmids designed to modify the genome of Plasmodium berghei, a malaria parasite of rodents, which can be requested by non-profit research organisations free of charge. PlasmoGEM vectors are designed with long homology arms for efficient genome integration and carry gene specific barcodes to identify individual mutants. They can be used for a wide array of applications, including protein localisation, gene interaction studies and high-throughput genetic screens. The vector production pipeline is supported by a custom software suite that automates both the vector design process and quality control by full-length sequencing of the finished vectors. The PlasmoGEM web interface allows users to search a database of finished knock-out and gene tagging vectors, view details of their designs, download vector sequence in different formats and view available quality control data as well as suggested genotyping strategies. We also make gDNA library clones and intermediate vectors available for researchers to produce vectors for themselves.
SEVA 2.0: an update of the Standard European Vector Architecture for de-/re-construction of bacterial functionalities
The Standard European Vector Architecture 2.0 database (SEVA-DB 2.0, http://seva.cnb.csic.es) is an improved and expanded version of the platform released in 2013 (doi: 10.1093/nar/gks1119) aimed at assisting the choice of optimal genetic tools for de-constructing and re-constructing complex prokaryotic phenotypes. By adopting simple compositional rules, the SEVA standard facilitates combinations of functional DNA segments that ease both the analysis and the engineering of diverse Gram-negative bacteria for fundamental or biotechnological purposes. The large number of users of the SEVA-DB during its first two years of existence has resulted in a valuable feedback that we have exploited for fixing DNA sequence errors, improving the nomenclature of the SEVA plasmids, expanding the vector collection, adding new features to the web interface and encouraging contributions of materials from the community of users. The SEVA platform is also adopting the Synthetic Biology Open Language (SBOL) for electronic-like description of the constructs available in the collection and their interfacing with genetic devices developed by other Synthetic Biology communities. We advocate the SEVA format as one interim asset for the ongoing transition of genetic design of microorganisms from being a trial-and-error endeavor to become an authentic engineering discipline.
The Immuno Polymorphism Database (IPD) was developed to provide a centralized system for the study of polymorphism in genes of the immune system. Through the IPD project we have established a central platform for the curation and publication of locus-specific databases involved either directly or related to the function of the Major Histocompatibility Complex in a number of different species. We have collaborated with specialist groups or nomenclature committees that curate the individual sections before they are submitted to IPD for online publication. IPD consists of five core databases, with the IMGT/HLA Database as the primary database. Through the work of the various nomenclature committees, the HLA Informatics Group and in collaboration with the European Bioinformatics Institute we are able to provide public access to this data through the website http://www.ebi.ac.uk/ipd/. The IPD project continues to develop with new tools being added to address scientific developments, such as Next Generation Sequencing, and to address user feedback and requests. Regular updates to the website ensure that new and confirmatory sequences are dispersed to the immunogenetics community, and the wider research and clinical communities.
Classification of the structures of the complementarity determining regions (CDRs) of antibodies is critically important for antibody structure prediction and computational design. We have previously performed a clustering of antibody CDR conformations and defined a systematic nomenclature consisting of the CDR, length and an integer starting from the largest to the smallest cluster in the data set (e.g. L1-11-1). We present PyIgClassify (for Python-based immunoglobulin classification; available at http://dunbrack2.fccc.edu/pyigclassify/), a database and web server that provides access to assignments of all CDR structures in the PDB to our classification system. The database includes assignments to the IMGT germline V regions for heavy and light chains for several species. For humanized antibodies, the assignment of the frameworks is to human germlines and the CDRs to the germlines of mice or other species sources. The database can be searched by PDB entry, cluster identifier and IMGT germline group (e.g. human IGHV1). The entire database is downloadable so that users may filter the data as needed for antibody structure analysis, prediction and design.
The BRENDA enzyme information system (http://www.brenda-enzymes.org/) has developed into an elaborate system of enzyme and enzyme-ligand information obtained from different sources, combined with flexible query systems and evaluation tools. The information is obtained by manual extraction from primary literature, text and data mining, data integration, and prediction algorithms. Approximately 300 million data include enzyme function and molecular data from more than 30 000 organisms. The manually derived core contains 3 million data from 77 000 enzymes annotated from 135 000 literature references. Each entry is connected to the literature reference and the source organism. They are complemented by information on occurrence, enzyme/disease relationships from text mining, sequences and 3D structures from other databases, and predicted enzyme location and genome annotation. Functional and structural data of more than 190 000 enzyme ligands are stored in BRENDA. New features improving the functionality and analysis tools were implemented. The human anatomy atlas CAVEman is linked to the BRENDA Tissue Ontology terms providing a connection between anatomical and functional enzyme data. Word Maps for enzymes obtained from PubMed abstracts highlight application and scientific relevance of enzymes. The EnzymeDetector genome annotation tool and the reaction database BKM-react including reactions from BRENDA, KEGG and MetaCyc were improved. The website was redesigned providing new query options.
The many functional partnerships and interactions that occur between proteins are at the core of cellular processing and their systematic characterization helps to provide context in molecular systems biology. However, known and predicted interactions are scattered over multiple resources, and the available data exhibit notable differences in terms of quality and completeness. The STRING database (http://string-db.org) aims to provide a critical assessment and integration of protein–protein interactions, including direct (physical) as well as indirect (functional) associations. The new version 10.0 of STRING covers more than 2000 organisms, which has necessitated novel, scalable algorithms for transferring interaction information between organisms. For this purpose, we have introduced hierarchical and self-consistent orthology annotations for all interacting proteins, grouping the proteins into families at various levels of phylogenetic resolution. Further improvements in version 10.0 include a completely redesigned prediction pipeline for inferring protein–protein associations from co-expression data, an API interface for the R computing environment and improved statistical analysis for enrichment tests in user-provided networks.
The EzCatDB database (http://ezcatdb.cbrc.jp/EzCatDB/) has emphasized manual classification of enzyme reactions from the viewpoints of enzyme active-site structures and their catalytic mechanisms based on literature information, amino acid sequences of enzymes (UniProtKB) and the corresponding tertiary structures from the Protein Data Bank (PDB). Reaction types such as hydrolysis, transfer, addition, elimination, isomerization, hydride transfer and electron transfer have been included in the reaction classification, RLCP. This database includes information related to ligand molecules on the enzyme structures in the PDB data, classified in terms of cofactors, substrates, products and intermediates, which are also necessary to elucidate the catalytic mechanisms. Recently, the database system was updated. The 3D structures of active sites for each PDB entry can be viewed using Jmol or Rasmol software. Moreover, sequence search systems of two types were developed for the EzCatDB database: EzCat-BLAST and EzCat-FORTE. EzCat-BLAST is suitable for quick searches, adopting the BLAST algorithm, whereas EzCat-FORTE is more suitable for detecting remote homologues, adopting the algorithm for FORTE protein structure prediction software. Another system, EzMetAct, is also available to searching for major active-site structures in EzCatDB, for which PDB-formatted queries can be searched.
Rhea (http://www.ebi.ac.uk/rhea) is a comprehensive and non-redundant resource of expert-curated biochemical reactions described using species from the ChEBI (Chemical Entities of Biological Interest) ontology of small molecules. Rhea has been designed for the functional annotation of enzymes and the description of genome-scale metabolic networks, providing stoichiometrically balanced enzyme-catalyzed reactions (covering the IUBMB Enzyme Nomenclature list and additional reactions), transport reactions and spontaneously occurring reactions. Rhea reactions are extensively curated with links to source literature and are mapped to other publicly available enzyme and pathway databases such as Reactome, BioCyc, KEGG and UniPathway, through manual curation and computational methods. Here we describe developments in Rhea since our last report in the 2012 database issue of Nucleic Acids Research. These include significant growth in the number of Rhea reactions and the inclusion of reactions involving complex macromolecules such as proteins, nucleic acids and other polymers that lie outside the scope of ChEBI. Together these developments will significantly increase the utility of Rhea as a tool for the description, analysis and reconciliation of genome-scale metabolic models.
Recent improvements to Binding MOAD: a resource for protein-ligand binding affinities and structures
For over 10 years, Binding MOAD (Mother of All Databases; http://www.BindingMOAD.org) has been one of the largest resources for high-quality protein–ligand complexes and associated binding affinity data. Binding MOAD has grown at the rate of 1994 complexes per year, on average. Currently, it contains 23 269 complexes and 8156 binding affinities. Our annual updates curate the data using a semi-automated literature search of the references cited within the PDB file, and we have recently upgraded our website and added new features and functionalities to better serve Binding MOAD users. In order to eliminate the legacy application server of the old platform and to accommodate new changes, the website has been completely rewritten in the LAMP (Linux, Apache, MySQL and PHP) environment. The improved user interface incorporates current third-party plugins for better visualization of protein and ligand molecules, and it provides features like sorting, filtering and filtered downloads. In addition to the field-based searching, Binding MOAD now can be searched by structural queries based on the ligand. In order to remove redundancy, Binding MOAD records are clustered in different families based on 90% sequence identity. The new Binding MOAD, with the upgraded platform, features and functionalities, is now equipped to better serve its users.
The Biological General Repository for Interaction Datasets (BioGRID: http://thebiogrid.org) is an open access database that houses genetic and protein interactions curated from the primary biomedical literature for all major model organism species and humans. As of September 2014, the BioGRID contains 749 912 interactions as drawn from 43 149 publications that represent 30 model organisms. This interaction count represents a 50% increase compared to our previous 2013 BioGRID update. BioGRID data are freely distributed through partner model organism databases and meta-databases and are directly downloadable in a variety of formats. In addition to general curation of the published literature for the major model species, BioGRID undertakes themed curation projects in areas of particular relevance for biomedical sciences, such as the ubiquitin-proteasome system and various human disease-associated interaction networks. BioGRID curation is coordinated through an Interaction Management System (IMS) that facilitates the compilation interaction records through structured evidence codes, phenotype ontologies, and gene annotation. The BioGRID architecture has been improved in order to support a broader range of interaction and post-translational modification types, to allow the representation of more complex multi-gene/protein interactions, to account for cellular phenotypes through structured ontologies, to expedite curation through semi-automated text-mining approaches, and to enhance curation quality control.