This document summarizes a presentation about accessing chemical data using the ChEBI database. It introduces ChEBI as a manually annotated database and ontology of small chemical entities. It covers searching and browsing ChEBI, understanding the ChEBI ontology structure, and methods for programmatic access including downloads of the ChEBI data in different file formats and via a web service API.
1. Accessing small molecule data using ChEBI Janna Hastings, Duncan Hull and Nico Adams Programmatic Access to Biological Databases (Perl) 22-26 February 2010 @ EBI
4. Small Molecules within Bioinformatics Literature Nucleotide sequences Genomes Expressions Protein sequences Protein domains, families 3D structures Enzymes Small molecules Pathways Systems
5. Small Molecules within Bioinformatics Literature Nucleotide sequences Genomes Expressions Protein sequences Protein domains, families 3D structures Enzymes Small molecules Pathways Systems (the "Small molecules" label is repeated five more times across the diagram)
13. Drug types 2003 - 2009 'Small molecules' in various shades of blue (http://chembl.blogspot.com/)
15. Small molecule data sources PubChem (http://pubchem.ncbi.nlm.nih.gov/): deposition-driven, publicly available compound repository containing more than 25 million unique structures. ChemSpider (http://www.chemspider.com/): automatic aggregation of publicly available chemistry data with crowdsourced annotation. ChEBI (http://www.ebi.ac.uk/chebi/): manually annotated database and ontology.
46. Viewing ChEBI ontology [2] ChEBI – Chemical Entities of Biological Interest 25.02.10 Tree view
47. Browsing ChEBI ontology (OLS) Browse the ontology Ontology Lookup Service (OLS): http://www.ebi.ac.uk/ontology-lookup/
49. OBO Foundry “The OBO Foundry is a collaborative experiment involving developers of science-based ontologies who are establishing a set of principles for ontology development with the goal of creating a suite of orthogonal interoperable reference ontologies in the biomedical domain.”
52. ChEBI domain model Self-referencing - merging
54. Compound IDs and Merging [2]
Callouts on the slide: Additional acc; Parent ID; This compound ID = additional acc
Compounds table:
ID     STATUS  CHEBI_ACCN   SOURCE  PARENT_ID  NAME   DEFINITION
15377  C       CHEBI:15377  ChEBI   null       water  null
5585   C       CHEBI:5585   KEGG    15377      null   null
Accession rows (the COMPOUND column holds a compound ID):
ID     COMPOUND  ACCN_NUMBER  TYPE          STATUS  SOURCE  URL_ABBR
16213  5585      C00001       KEGG accn     C       KEGG    KEGG
17314  5585      7732-18-5    CAS Registry  C       KEGG    null
59. SDF File complete format Entries separated by $$$$
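The record delimiter mentioned on this slide is enough to split a downloaded SDF file into individual entries before any further parsing. The following is a minimal sketch in Python, not the method used in the presentation; the function name `split_sdf` and the sample text (including the `> <ID>` data tag and the placeholder molblock lines) are illustrative assumptions rather than actual ChEBI file content.

```python
def split_sdf(text):
    """Split SDF text into one string per entry, using lines that contain only $$$$."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            # Delimiter line: close the current entry and start a new one.
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    # Keep a trailing entry that lacks a final delimiter, if it has any content.
    if any(l.strip() for l in current):
        records.append("\n".join(current))
    return records


# Hypothetical two-entry sample; the structure blocks are placeholders, not real molfiles.
sample = (
    "water\n  placeholder molblock\n> <ID>\nCHEBI:15377\n\n$$$$\n"
    "methane\n  placeholder molblock\n$$$$\n"
)
records = split_sdf(sample)
```

Each element of `records` can then be handed to a chemistry toolkit or parsed for its data fields.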
64. Web service client object model getLiteEntity getCompleteEntity getOntology (Parents and Children)
In the worst case, the time taken by a full substructure search grows exponentially with the number of atoms, so running the full search against every entry in the database is impractical.
The molecular formula provides a crude heuristic for narrowing the set of candidates in a substructure search. Fingerprints are a much more powerful device.
An algorithm generates a pattern for each atom, for each bonded group of two atoms, three… up to paths 8 bonds long. Each pattern is then hashed into a bit string, and the hashed results are combined with a logical OR to create the final fingerprint.
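The fingerprint construction described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the fingerprinting algorithm ChEBI actually uses: molecules are given as a list of element symbols plus a list of bonded index pairs, bond orders and hydrogens are ignored, MD5 stands in for whatever hash function a real toolkit applies, and the names `path_patterns`, `fingerprint`, and `n_bits` are hypothetical.

```python
import hashlib


def path_patterns(atoms, bonds, max_bonds=8):
    """Collect linear atom-path patterns of 0..max_bonds bonds (bond orders ignored)."""
    adjacency = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)

    patterns = set()

    def walk(path):
        forward = "-".join(atoms[i] for i in path)
        backward = "-".join(atoms[i] for i in reversed(path))
        # A path read in either direction is the same pattern; keep one form.
        patterns.add(min(forward, backward))
        if len(path) - 1 == max_bonds:
            return
        for nxt in adjacency[path[-1]]:
            if nxt not in path:  # do not revisit an atom within one path
                walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return patterns


def fingerprint(atoms, bonds, n_bits=64, max_bonds=8):
    """Hash every pattern to a bit position and OR the bits into one integer."""
    fp = 0
    for pattern in path_patterns(atoms, bonds, max_bonds):
        digest = hashlib.md5(pattern.encode()).digest()
        bit = int.from_bytes(digest[:4], "big") % n_bits
        fp |= 1 << bit
    return fp


# Heavy-atom skeletons only: ethanol (C-C-O) and ethane (C-C).
ethanol_fp = fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
ethane_fp = fingerprint(["C", "C"], [(0, 1)])
```

Because every pattern found in ethane also occurs in ethanol, every bit set in `ethane_fp` is also set in `ethanol_fp`. Different patterns can hash to the same bit, so a fingerprint can rule a candidate out but cannot confirm a match on its own.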
Identity search is subject to the limitations of InChI uniqueness, but in general it finds exactly the structure you entered, if that structure exists in the database. For substructure search, the fingerprint is used to narrow the range of candidates from the database: every bit set in the substructure (query) fingerprint must also be set in the fingerprint of any structure that contains it. For similarity search, the Tanimoto coefficient is calculated from the two fingerprints as T = c/(a + b - c), where a and b are the numbers of bits set in each fingerprint and c is the number of bits set in both.
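Both fingerprint operations reduce to simple bit arithmetic. The sketch below assumes fingerprints are stored as Python integers used as bit strings; the function names and the example bit patterns are made up for illustration and do not come from ChEBI.

```python
def passes_screen(query_fp, candidate_fp):
    """True if every bit set in the query fingerprint is also set in the candidate."""
    return (query_fp & candidate_fp) == query_fp


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient T = c / (a + b - c) computed on bit counts."""
    a = bin(fp_a).count("1")         # bits set in the first fingerprint
    b = bin(fp_b).count("1")         # bits set in the second fingerprint
    c = bin(fp_a & fp_b).count("1")  # bits set in both
    if a + b - c == 0:
        return 0.0  # both fingerprints empty; returning 0.0 is a convention chosen here
    return c / (a + b - c)


# Made-up 4-bit fingerprints.
query = 0b0110
candidate = 0b1110
unrelated = 0b1001

screen_hit = passes_screen(query, candidate)   # candidate survives the screen
screen_miss = passes_screen(query, unrelated)  # unrelated is discarded
similarity = tanimoto(query, candidate)        # c = 2, a = 2, b = 3
```

Only candidates that pass the screen need the expensive atom-by-atom substructure match, which is where the saving comes from.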
When retrieving the compound accession from a data item such as a database accession or compound name, the relevant entry in the Compounds table must also be retrieved and its parent_id field examined. If the parent_id is not empty, it links to the compound that holds the primary identifier for this merged group of entities. A given compound can therefore have more than one ID.
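The lookup described above can be sketched with the water rows from slide 54 held in memory. This is an illustration only: the dictionary layout, the name `DATABASE_ACCESSION` for the accession rows, and the function `primary_accession` are assumptions made for the example, and a real client would query the downloaded database tables instead.

```python
# Rows taken from slide 54; only the columns needed for the lookup are kept.
COMPOUNDS = {
    15377: {"chebi_accn": "CHEBI:15377", "parent_id": None, "name": "water"},
    5585: {"chebi_accn": "CHEBI:5585", "parent_id": 15377, "name": None},
}

# Hypothetical name for the accession rows; COMPOUND refers to a compound ID.
DATABASE_ACCESSION = [
    {"id": 16213, "compound": 5585, "accn_number": "C00001", "type": "KEGG accn"},
    {"id": 17314, "compound": 5585, "accn_number": "7732-18-5", "type": "CAS Registry"},
]


def primary_accession(accn_number):
    """Return the primary ChEBI accession for an external accession, or None."""
    for row in DATABASE_ACCESSION:
        if row["accn_number"] == accn_number:
            compound = COMPOUNDS[row["compound"]]
            # A non-empty parent_id means this entry was merged into its parent.
            if compound["parent_id"] is not None:
                compound = COMPOUNDS[compound["parent_id"]]
            return compound["chebi_accn"]
    return None


kegg_result = primary_accession("C00001")
cas_result = primary_accession("7732-18-5")
```

Both external accessions are attached to compound 5585, whose parent_id points to 15377, so both resolve to the primary accession of water rather than to CHEBI:5585.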