The problem: Enriching siloed and unstructured scientific data
There are hundreds of publicly available databases containing information about the life sciences and health care. PubMed Central® (PMC), for example, is a free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM); Europe PMC is its European counterpart. There are specialized databases for almost every therapeutic area, and for some therapeutic areas there are aggregators of public domain databases, such as the Cancer Research Data Commons. But while there is an abundance of data, much of it is siloed and available “as is,” meaning that little value has been added on top of these disparate, often unstructured databases. Another way of putting this is that the databases are “unenriched.”
Data enrichment can take many forms. Data can be classified and tagged to make searching easier and to facilitate the clustering of results. Sentiment scores can be added to individual records. A variety of machine learning techniques can be used to identify the data that would be of most interest to a particular researcher. One challenge of data enrichment is applying these techniques consistently across disparate datasets from different sources.
Our solution: Apply consistent structured metadata
Foundation, from ResoluteAI, is a service that aggregates about 20 databases, most of them in the public domain, unstructured, and related to life sciences. We enrich the content in these databases by adding a consistent layer of structured metadata to facilitate granular searching, information discovery, and uncovering hidden connections. The tags and categories we assign to each piece of content are generated by our proprietary, machine-learning-driven tagging engine. As our business has grown, we have moved up the science ladder, from initially engaging with business development and partnership executives to now working increasingly with scientists and researchers. To better serve more advanced scientific research, we needed to up our data enrichment game.
Our first enhancement in this area was adding PubChem tagging to several of our key databases: PubMed, Patents, Clinical Trials, Grants, and Tech Transfer. This enabled more fine-grained searches for documents mentioning a specific chemical compound and let a researcher quickly pull up basic information about any compound mentioned in a document.
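For illustration only, the kind of compound lookup this enables can be sketched against PubChem's public PUG REST API; this is not our production tagging pipeline, and the compound name and properties below are arbitrary examples.

```python
import requests

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def pubchem_basic_info(compound_name: str) -> dict:
    """Resolve a compound name to a PubChem CID and fetch a few basic properties."""
    # Step 1: resolve the free-text name to one or more compound identifiers (CIDs).
    cid_resp = requests.get(f"{PUG_REST}/compound/name/{compound_name}/cids/JSON", timeout=30)
    cid_resp.raise_for_status()
    cid = cid_resp.json()["IdentifierList"]["CID"][0]

    # Step 2: fetch basic properties for that CID.
    props = "MolecularFormula,MolecularWeight,CanonicalSMILES"
    prop_resp = requests.get(f"{PUG_REST}/compound/cid/{cid}/property/{props}/JSON", timeout=30)
    prop_resp.raise_for_status()
    return prop_resp.json()["PropertyTable"]["Properties"][0]

print(pubchem_basic_info("aspirin"))
```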
Customer response to this capability was extremely positive, so we proceeded to look for more metadata to enrich our corpus. We were aware of the Unified Medical Language System (UMLS) and the National Center for Biomedical Ontology (NCBO). The goal was to further annotate our key datasets and simultaneously build a process by which we could tag our customers’ proprietary datasets with these more specific taxonomies.
The challenge: Fast and secure data enrichment at scale
The NCBO hosts BioPortal, “the world’s most comprehensive repository of biomedical ontologies.” To annotate a corpus with these ontologies, a tool called the OntoPortal Virtual Appliance has been developed and successfully implemented in a number of environments. We tried to use the Appliance but, unfortunately, ran into many hurdles that were not going to be easy to overcome.
The two most critical issues were: 1) the Appliance didn’t scale (we wanted to tag many, many millions of documents, and it was not designed for that), and 2) it had security vulnerabilities that would keep an OntoPortal deployment from being SOC 2 compliant, an important requirement for ResoluteAI customers. BioPortal presented a treasure trove of data enrichment opportunities waiting to be accessed, but we needed to find or build a new way to deploy it.
We then looked into SciSpacy, an open-source library from the Allen Institute for AI containing spaCy models for processing biomedical, scientific, or clinical text. SciSpacy can link entities to UMLS concepts1, though not to specific ontologies within UMLS. The pipeline uses a neural network2 trained on several datasets, including MedMentions and OntoNotes.
SciSpacy takes a text as input, extracts entities from it, and maps those entities to UMLS concepts. However, SciSpacy's entity linker was too slow for our purposes, given the large number of records and entities we needed to match. Furthermore, we needed to link entities to various specific ontologies, rather than just to higher-level concepts. Lastly, we could not rely on the SciSpacy maintainers to update their files as new revisions of UMLS are released. The NCBO tools were still useful, though, for seeing how they handled UMLS metafile parsing edge cases, as well as their heuristics for choosing ontologies relevant to a text.
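For reference, a minimal SciSpacy pipeline of the kind described above looks roughly like the sketch below. The model name (en_core_sci_sm) and the scispacy_linker pipe follow SciSpacy's public documentation; exact APIs vary by version.

```python
import spacy
from scispacy.linking import EntityLinker  # noqa: F401  (registers the "scispacy_linker" pipe)

# Load a biomedical spaCy model and attach SciSpacy's UMLS entity linker.
nlp = spacy.load("en_core_sci_sm")
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})
linker = nlp.get_pipe("scispacy_linker")

doc = nlp("Spinal and bulbar muscular atrophy is an inherited motor neuron disease.")

for ent in doc.ents:                       # extracted mentions
    for cui, score in ent._.kb_ents[:1]:   # top-ranked UMLS candidate for each mention
        concept = linker.kb.cui_to_entity[cui]
        print(ent.text, cui, round(score, 3), concept.canonical_name)
```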
Although we did not use the NCBO BioPortal or most of the SciSpacy entity linker pipeline, studying the methodology and implementation of these tools, and recognizing their shortcomings, drove the evolution of our ontology linker. Our implementation uses SciSpacy's entity extraction models with our own UMLS parser and entity linker. The entity linker is much like SciSpacy's, except that it is built to scale using Spark, a distributed data processing framework. We make use of approximate nearest neighbor search3 in Spark between billions of mentions and millions of candidate aliases. Unlike SciSpacy, we are able to match on specific UMLS ontologies. We intend to build on this further to improve both recall and precision on the matched ontologies, as well as build this capability into our enterprise search platform, Nebula.
In all, the ResoluteAI solution is two orders of magnitude faster than the SciSpacy method.
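To give a flavor of the approach, a simplified Spark sketch of the matching step might look like the following. It uses MinHash LSH over character 3-gram sets; the table contents, column names, and parameters are hypothetical, and the real pipeline is considerably more involved.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType
from pyspark.ml.feature import HashingTF, MinHashLSH

spark = SparkSession.builder.appName("umls-linker-sketch").getOrCreate()

# Toy inputs: mentions extracted by the NER models, and UMLS concept aliases.
# The CUI below is a placeholder, not a real UMLS identifier.
mentions = spark.createDataFrame(
    [(1, "post traumatic stress disorder")], ["mention_id", "text"])
aliases = spark.createDataFrame(
    [("C0000000", "posttraumatic stress disorder")], ["cui", "text"])

@F.udf(ArrayType(StringType()))
def char_trigrams(s):
    # Break a string into overlapping 3-character n-grams.
    s = s.lower()
    return [s[i:i + 3] for i in range(max(len(s) - 2, 1))]

# Hash each n-gram set into a fixed-width sparse feature vector.
tf = HashingTF(inputCol="ngrams", outputCol="features", numFeatures=1 << 18)
mention_vecs = tf.transform(mentions.withColumn("ngrams", char_trigrams("text")))
alias_vecs = tf.transform(aliases.withColumn("ngrams", char_trigrams("text")))

# Locality-sensitive hashing plus an approximate similarity join on Jaccard distance.
lsh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5).fit(alias_vecs)
matches = lsh.approxSimilarityJoin(mention_vecs, alias_vecs, threshold=0.6, distCol="dist")

matches.select(
    F.col("datasetA.mention_id"),
    F.col("datasetA.text").alias("mention"),
    F.col("datasetB.cui"),
    F.col("datasetB.text").alias("alias"),
    "dist",
).show(truncate=False)
```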
The improvement: Enriching data with new ontologies, taxonomies, and controlled vocabularies
Despite some fits and starts, we were ultimately able to annotate a number of large research databases with several ontologies4, taxonomies, and controlled vocabularies available through UMLS. This has dramatically improved ResoluteAI’s faceted search feature, allowing users to filter their search results by using combinations of metadata. Analytics, especially heatmaps, have also been improved. For large datasets such as PubMed and Patents, this accelerates the research process, and in several cases to date it has been extremely helpful in white space searches.
We are currently working on three next steps:
- Enrichment of publicly available databases with additional hierarchical metadata has proven to be valuable to researchers. We plan to extend this capability to customer-proprietary datasets as part of our Nebula enterprise search platform. Most corporate research data is poorly tagged and haphazardly managed; internal research can be dramatically improved by adding more structured, domain-specific metadata.
- We selected the taxonomies from UMLS that we believed would have the most immediate impact on our customers’ workflows. We have already had requests to further enrich our public domain databases with more UMLS taxonomies, as well as several from the Open Biological and Biomedical Ontology (OBO) Foundry.
- While data enrichment has obvious benefits, in some respects it can contribute to information overload. To address this challenge ResoluteAI is incorporating many of these ontologies, taxonomies, and controlled vocabularies into a graph database, which will allow users to view the connections between content from disparate databases more efficiently. By clustering search results around annotations or combinations of annotations, users will be able to quickly identify papers, patents, clinical trials, etc. that meet highly granular specifications.
1 SciSpacy's entity linker matches entities to UMLS concepts based on textual similarity. It works as follows: it extracts entities from the text, called mentions. For each mention, it does a nearest neighbor search in a pre-computed index of concept aliases in the UMLS knowledge base. To perform the approximate nearest neighbor (ANN) search, a mention is converted into a numerical vector based on a TF-IDF analysis of its 3-character n-grams. In other words, the text is broken down into n-grams like this: “biomedical” → [bio iom ome med edi dic ica cal], and then each individual n-gram is assigned a weight indicating how “important” it is, based on its frequency of occurrence within each concept alias versus across all concepts. The final result is a vector that can be compared for similarity to other vectors, such that similarity between the vectors indicates a strong similarity between the texts they represent; the comparison is robust enough to ignore minor spelling differences and word inversions. This robustness is another advantage over OntoPortal's use of mgrep, since mgrep is not able to recognize that minor variations are likely equivalent.
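As a toy illustration of this encoding-and-matching idea, here is a sketch using scikit-learn rather than SciSpacy's own code, with a tiny made-up alias list; at production scale the brute-force neighbor search would be replaced by an approximate (ANN) index.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# A tiny stand-in for the UMLS concept alias index.
aliases = ["biomedical", "bio-medical", "biomedicine", "post traumatic stress disorder"]

# Character 3-gram TF-IDF, analogous to the encoding described above.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
alias_vectors = vectorizer.fit_transform(aliases)

# Brute-force cosine nearest neighbors over the alias vectors.
index = NearestNeighbors(n_neighbors=1, metric="cosine").fit(alias_vectors)

mention = ["bio medical"]   # a minor spelling variation of an alias
dist, idx = index.kneighbors(vectorizer.transform(mention))
print(aliases[idx[0][0]], 1 - dist[0][0])   # closest alias and its cosine similarity
```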
2 A specialized entity extractor powered by a neural network trained on medical and scientific corpora does a better job at entity extraction than simpler methods such as word-boundary tokenization. For example, it will recognize a phrase like "Post Traumatic Stress Disorder" as a single concept.
3 Approximate Nearest Neighbor Search is a proximity search for points (e.g., UMLS aliases) close to a given query point (e.g., an extracted mention). We use LSH (locality-sensitive hashing) but have begun transitioning to HNSW (hierarchical navigable small world graphs).
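For the HNSW side of that transition, a minimal sketch with the open-source hnswlib library (our choice of library here is illustrative, as are the random stand-in vectors and parameters) could look like:

```python
import hnswlib
import numpy as np

dim = 256                                                      # illustrative vector dimensionality
alias_vecs = np.random.rand(10_000, dim).astype(np.float32)    # stand-in alias vectors

# Build an HNSW index over the alias vectors using cosine distance.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=alias_vecs.shape[0], ef_construction=200, M=16)
index.add_items(alias_vecs, np.arange(alias_vecs.shape[0]))
index.set_ef(50)                                               # query-time accuracy/speed trade-off

# Query with a batch of mention vectors; returns ids and distances of the nearest aliases.
mention_vecs = np.random.rand(5, dim).astype(np.float32)
labels, distances = index.knn_query(mention_vecs, k=3)
print(labels.shape, distances.shape)
```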
4 MeSH, MedDRA, RxNorm, OMIM, GO, SNOMED CT, and ICD-10.