A couple of weeks ago we wrote about scientific data enrichment. Our goal was to highlight the use of ontologies, taxonomies, and controlled vocabularies to facilitate research across multiple disparate databases. This post is about early academic work in biomedical knowledge graphs, which require a significant amount of data enrichment to perform well and deliver on a range of use cases. In future posts we will delve more deeply into graph databases, and how ResoluteAI is connecting concepts that would not easily be revealed through other types of data structures.
Visualizing biomedical knowledge
Biological systems are incredibly complex, and the need to understand them in order to study human disease has taken researchers down many diverse paths. Occasionally these paths lead to connections between unexpected entities or ideas, and in turn to progress or a breakthrough. As Isaac Asimov once said:
The most exciting phrase to hear in science, the one that heralds new discoveries, is not “Eureka!” (I found it!) but “That’s funny …”
Connections, or relationships between concepts, are often difficult to identify when looking at unstructured data. For that reason, over the past 10 or so years researchers have been building biomedical graph databases and knowledge graphs to help analyze and visualize large quantities of biomedical research and data. Most of these initiatives were academic exercises that were completed and then left unmaintained, and they quickly lost value as the volume of new data grew. Based on our research into these projects, ResoluteAI has begun building its own biomedical graph database, which we plan to maintain and grow. Here are some examples of what has come before:
BN++ - A Biological Information System1 was published by Jan Küntzer, Torsten Blum, Andreas Gerasch, Christina Backes, Andreas Hildebrandt, Michael Kaufmann, Oliver Kohlbacher, and Hans-Peter Lenhof in the Journal of Integrative Bioinformatics in 2006. In this paper, the authors “present BN++, the biochemical network library, a powerful software package for integrating, analyzing, and visualizing biochemical data in the context of networks. BN++ is based on a comprehensive and extensible object model (BioCore), which has been implemented as a C++ framework, a Java class library, and a relational database.”
BN++ connected sequence data, metabolic and regulatory networks, and protein interaction data, among other datasets. Because this work predates the widespread availability of graph databases, all the data was stored in a relational database. Graph databases have since made queries over large, highly connected datasets considerably faster. In 2008 Jan Küntzer published a 104-page book, BN++: A Biological Information System.
“Integration of heterogeneous data types is a challenging problem, especially in biology, where the number of databases and data types increase rapidly. Amongst the problems that one has to face are integrity, consistency, redundancy, connectivity, expressiveness and updatability.” This is the introduction to Aaron Birkland and Golan Yona’s February 2006 paper BIOZON: a system for unification, management and analysis of heterogeneous biological data.2
Biological entities are strongly related and mutually dependent on each other. Therefore, there is a growing need to corroborate and integrate data from different resources and aspects of biological systems in order to analyze them effectively. Biozon is a unified biological database that integrates heterogeneous data types such as proteins, structures, domain families, protein–protein interactions and cellular pathways, and establishes the relationships between them.
An interesting feature of Biozon was its use of a ranking algorithm to surface the results with the most connections within the graph, much as PageRank does for Google search results.
When the results are ranked, the top ranked hits tend to be the most highly connected on the graph, as compared to a random ordering. These high ranked objects can direct biologists that study specific systems to other equivalent systems that were studied extensively, thus providing them with a myriad of relevant information.3
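To make the ranking idea concrete, here is a minimal sketch of a PageRank-style score computed by power iteration on a toy graph. It is not Biozon's actual algorithm, whose details are not reproduced in this post. The node names, the edges, the damping factor, and the function name `rank_nodes` are all illustrative assumptions.

```python
from collections import defaultdict


def rank_nodes(edges, damping=0.85, iterations=50):
    """Score the nodes of an undirected graph with a PageRank-style power iteration."""
    # Build an adjacency map; every edge is treated as bidirectional.
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    nodes = sorted(neighbors)
    score = {n: 1.0 / len(nodes) for n in nodes}

    for _ in range(iterations):
        new_score = {}
        for n in nodes:
            # Each neighbor passes on an equal share of its current score.
            incoming = sum(score[m] / len(neighbors[m]) for m in neighbors[n])
            new_score[n] = (1 - damping) / len(nodes) + damping * incoming
        score = new_score
    return score


# Hypothetical toy graph mixing entity types; these are not real records.
EDGES = [
    ("protein:P1", "structure:S1"),
    ("protein:P1", "family:F1"),
    ("protein:P1", "pathway:W1"),
    ("protein:P1", "protein:P2"),
    ("protein:P2", "family:F1"),
    ("protein:P3", "pathway:W1"),
]

scores = rank_nodes(EDGES)
ranked = sorted(scores, key=scores.get, reverse=True)
for node in ranked:
    print(f"{node}\t{scores[node]:.3f}")
```

In this toy graph the most highly connected node, `protein:P1`, ends up at the top of the ranking, which is the behavior the Biozon authors describe for their own ranked results.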
It’s unclear how long Biozon was available, but it is no longer accessible on the internet.
In June of 2012, six years after the Biozon paper was published, Lauri Eronen and Hannu Toivonen published Biomine: predicting links between biological entities using network models of heterogeneous databases4 in BMC Bioinformatics. This was accompanied by the availability of Biomine Explorer, an online tool that facilitated link discovery and interactive exploration in biological databases. In the image below connections are being made between BRCA2, the breast cancer type 2 susceptibility protein, and various genes, phenotypes, biological processes and PubMed articles. Databases connected in this project included UniProt, GO, Entrez Gene (now Gene), PubMed and others.
Biomine Explorer appears to be still active, but the last database updates were several years ago, so the usefulness of the tool has diminished significantly.
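The kind of link discovery Biomine Explorer offered can be pictured as a search for a chain of records connecting two entities. Biomine itself scores candidate links with network models; the sketch below swaps in a plain breadth-first search, which only finds the shortest chain. The function name `shortest_path` and every node label other than BRCA2 are hypothetical placeholders, not real database entries or real associations.

```python
from collections import deque


def shortest_path(edges, start, goal):
    """Return the shortest chain of nodes linking start to goal, or None."""
    # Build an undirected adjacency map from the edge list.
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    if start not in graph or goal not in graph:
        return None

    previous = {start: None}  # remembers how each node was first reached
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through `previous` to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in sorted(graph[node]):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


# Hypothetical edges; the labels stand in for gene, GO, PubMed and phenotype records.
EDGES = [
    ("gene:BRCA2", "go:term_A"),
    ("go:term_A", "gene:GENE_X"),
    ("gene:BRCA2", "pubmed:article_1"),
    ("gene:GENE_X", "phenotype:PHENO_Y"),
    ("gene:GENE_Z", "pubmed:article_2"),
]

path = shortest_path(EDGES, "gene:BRCA2", "phenotype:PHENO_Y")
print(" -> ".join(path))
```

A real system would also weight edges by reliability and relevance so that a short but weak chain does not outrank a longer, well-supported one.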
Linking Life Sciences Data Using Graph-Based Mapping5 was presented at the 6th International Workshop on Data Integration in the Life Sciences in July of 2009 and was followed up in 2013 with a paper in Bioinformatics titled Ondex Web: web-based visualization and exploration of heterogeneous biological networks.6 In between, Jan Taubert wrote a 200+ page book called Ondex.
There are over 1100 different databases available containing primary and derived data of interest to research biologists. It is inevitable that many of these databases contain overlapping, related or conflicting information. Data integration methods are being developed to address these issues by providing a consolidated view over multiple databases. However, a key challenge for data integration is the identification of links between closely related entries in different life sciences databases when there is no direct information that provides a reliable cross-reference.
Ondex was an open source data integration platform that was built to provide a general framework for the integration of heterogeneous biological data. ChEMBL and UniProtKB were two of the initial datasets that were integrated. Support for Ondex has apparently been discontinued, but similar network exploration is available in a separate, actively maintained tool called Cytoscape. Cytoscape is an open source software platform for visualizing molecular interaction networks and biological pathways and integrating these networks with annotations, gene expression profiles and other state data. Although Cytoscape was originally designed for biological research, it is now a general platform for complex network analysis and visualization. This is a screenshot of the Cytoscape desktop application:
In March of 2015, five scientists from a company called Era7 Bioinformatics published Bio4j: a high-performance cloud-enabled graph-based data platform7 on bioRxiv. Era7 was focused on next-generation sequencing and built Bio4j to assist in that process.
A key need for achieving fast, reproducible, and cost-effective data analysis at such scale is being able to access and query the vast amount of publicly available data, specially so in the case of knowledge-intensive, semantically rich data: incredibly valuable information about proteins and their functions, genes, pathways, or all sort of biological knowledge encoded in ontologies remains scattered, semantically and physically fragmented. Methods and Results. Guided by this, we have designed and developed Bio4j. It aims to offer a platform for the integration of semantically rich biological data using typed graph models.
Bio4j integrated most of the data from UniProtKB, Gene Ontology (GO), UniRef, NCBI Taxonomy, RefSeq, and enzymeDb. As the name suggests, Bio4j was built on an early version of Neo4j, whose developer has grown to be one of the largest vendors of graph database technology. This is a simple diagram of the Bio4j structure:
Bio4j does not appear to have been updated recently, and there is not much new on the Era7 Bioinformatics website. As with the earlier examples, updating and maintaining a biological graph database is difficult, never-ending work, and the volume of biological and related information is growing at an accelerating rate.
ResoluteAI will shortly be releasing its biological knowledge graph, which will include many of the databases mentioned in this post as well as several others that are frequently used by our customers. Stay tuned.
1 Küntzer, J., Blum, T., Gerasch, A., Backes, C., Hildebrandt, A., Kaufmann, M., Kohlbacher, O., Lenhof, H.-P. BN++ - A Biological Information System. Journal of Integrative Bioinformatics 3 (2006). http://journal.imbio.de/index.php?paper_id=34. https://doi.org/10.2390/biecoll-jib-2006-34
2 Birkland, A., Yona, G. BIOZON: a system for unification, management and analysis of heterogeneous biological data. BMC Bioinformatics 7, 70 (2006). https://doi.org/10.1186/1471-2105-7-70
3 Birkland, A., Yona, G. BIOZON: a hub of heterogeneous biological data. Nucleic Acids Research 34, D235-D242 (2006). https://doi.org/10.1093/nar/gkj153
4 Eronen, L., Toivonen, H. Biomine: predicting links between biological entities using network models of heterogeneous databases. BMC Bioinformatics 13, 119 (2012). https://doi.org/10.1186/1471-2105-13-119
5 Taubert, J., Hindle, M., Lysenko, A., Weile, J., Köhler, J., Rawlings, C. Linking Life Sciences Data Using Graph-Based Mapping. pp. 16-30 (2009). https://doi.org/10.1007/978-3-642-02879-3_3
6 Taubert, J., Hassani-Pak, K., Castells-Brooke, N., Rawlings, C.J. Ondex Web: web-based visualization and exploration of heterogeneous biological networks. Bioinformatics 30(7), 1034-1035 (2014). https://doi.org/10.1093/bioinformatics/btt740. PMID: 24363379; PMCID: PMC3967113.
7 Pareja Tobes, P., Tobes, R., Manrique, M., Pareja, E., Pareja-Tobes, E. Bio4j: a high-performance cloud-enabled graph-based data platform. bioRxiv (2015). https://doi.org/10.1101/016758