Semantic graph database underpins healthcare data lake

A semantic graph database developer uses triple store technology to create healthcare data lakes so providers can use their own and open data for predictive analytics.

Each month, the editors at SearchHealthIT recognize an innovative software, service or technology approach. Franz, Inc.'s AllegroGraph semantic graph database and the healthcare data lake it underpins are our March 2016 selections.

Products: AllegroGraph, Semantic Data Lake for Healthcare

Release Dates: AllegroGraph has been in use since 2005; Semantic Data Lake for Healthcare has been used in beta format at Montefiore Medical Center in New York since November 2015.

What AllegroGraph and the health data lake do

Franz Inc., in partnership with Montefiore Health System, is bringing the data lake to health IT using Franz's semantic graph database technology.

Until its venture into the healthcare and pharmaceutical industries over the past few years, the 31-year-old Oakland, Calif., company had done business mainly in the worlds of national defense and intelligence. There it sold its artificial intelligence-based triple store database, which uses semantic rather than relational database technology.

The system Franz has adapted for health IT, with partners such as Montefiore in the Bronx, N.Y., is based on AllegroGraph, one of its flagship products. Montefiore is using the system, called the Semantic Data Lake for Healthcare, to perform sophisticated predictive analytics in a quest to improve patient care and lower hospital costs.


AllegroGraph uses the Resource Description Framework (RDF) standard, whose basic unit of data is the "triple," to process and represent data semantically, and graph visualization software for visual discovery. RDF graph databases, also known as triple stores, hold sets of three semantic elements -- subject, predicate and object -- that turn chains of data into statements. Triple stores are a way to manage, manipulate and query many triples.
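To make the idea of a triple concrete, here is a minimal sketch in Python of a toy in-memory store that holds three-part statements and answers pattern queries. It is not AllegroGraph's API; the class name `TinyTripleStore`, its methods and the identifiers such as `patient:1` are hypothetical and exist only for illustration.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class TinyTripleStore:
    """Toy in-memory triple store; hypothetical, for illustration only."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        """Store one statement as a (subject, predicate, object) triple."""
        self._triples.add((subject, predicate, obj))

    def match(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> List[Triple]:
        """Return all triples matching the pattern; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            t for t in self._triples
            if all(want is None or want == got for want, got in zip(pattern, t))
        )


# Made-up example data, not real patient records.
store = TinyTripleStore()
store.add("patient:1", "diagnosedWith", "dx:pneumonia")
store.add("patient:1", "prescribed", "drug:A")
store.add("patient:2", "diagnosedWith", "dx:pneumonia")

# "Which statements say someone was diagnosed with pneumonia?"
pneumonia_cases = store.match(predicate="diagnosedWith", obj="dx:pneumonia")
```

A production triple store would index the triples and expose a query language such as SPARQL rather than scanning every statement, but the pattern-matching idea is the same.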

Unlike the linear representation and analysis of data in most relational databases, Franz's semantic graph database technology employs visual and spatial charting that lets users see data elements and their relationships graphically.

Why the semantic graph database and health data lake matter

Instead of having to reach into many siloed data warehouses in the hospital system, providers such as Montefiore can query one large data lake for the information they need.

"What [Montefiore] wanted was one platform, in the form of graphs, where you could build all these blocks without having to go through all the data again," Franz CEO Jans Aasman said. "In the data platform we have, we have all the information about patients you can ever think of."

Semantic graph database linking data sources from inside and outside the hospital.

For the health IT data lake, which at Montefiore is built on Intel servers, a Hadoop stack and a Cloudera-certified monitoring system, AllegroGraph stores and integrates healthcare-specific semantic elements such as medical vocabularies, taxonomies and ontologies.

As a result, a physician who wants to find out what happened to every cancer patient who took morphine sulfate, an ingredient in many drugs, can readily query the data lake for all cases involving both morphine sulfate and cancer.
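The morphine sulfate example can be sketched as a join across triples. The Python below is a simplified illustration, not Franz's implementation: the predicate names (`prescribed`, `hasIngredient`, `diagnosedWith`, `subClassOf`), the identifiers and the data are all made up, and the ontology lookup follows only one level of subclass.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Made-up statements mixing patient facts, drug ingredients and a tiny ontology.
TRIPLES: Set[Triple] = {
    ("patient:1", "prescribed", "drug:A"),
    ("patient:1", "diagnosedWith", "dx:lung_cancer"),
    ("patient:2", "prescribed", "drug:B"),
    ("patient:2", "diagnosedWith", "dx:fracture"),
    ("patient:3", "prescribed", "drug:C"),
    ("patient:3", "diagnosedWith", "dx:breast_cancer"),
    ("drug:A", "hasIngredient", "ingredient:morphine_sulfate"),
    ("drug:B", "hasIngredient", "ingredient:morphine_sulfate"),
    ("drug:C", "hasIngredient", "ingredient:ibuprofen"),
    ("dx:lung_cancer", "subClassOf", "dx:cancer"),
    ("dx:breast_cancer", "subClassOf", "dx:cancer"),
}


def subjects(predicate: str, objects: Set[str]) -> Set[str]:
    """Subjects of every triple with this predicate and an object in `objects`."""
    return {s for s, p, o in TRIPLES if p == predicate and o in objects}


def patients_with(ingredient: str, condition: str) -> List[str]:
    """Patients prescribed any drug containing `ingredient` AND diagnosed
    with `condition` or one of its direct subclasses."""
    drugs = subjects("hasIngredient", {ingredient})
    conditions = subjects("subClassOf", {condition}) | {condition}
    took_drug = subjects("prescribed", drugs)
    has_condition = subjects("diagnosedWith", conditions)
    return sorted(took_drug & has_condition)


matches = patients_with("ingredient:morphine_sulfate", "dx:cancer")
```

The point of the sketch is that the drug vocabulary and the disease taxonomy sit in the same graph as the patient data, so one query can cross all three without the physician listing every brand name or cancer subtype.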

That is possible because the data lake also incorporates troves of structured patient data from ICD-9 codes -- for retrospective analytics -- and ICD-10 codes going forward. Beyond anatomical data, the system also includes data from hospitals' EHRs, tissue banks, connected medical devices, metadata from medical images and genomic data for precision medicine.

"Everything goes into these triples," Aasman said.

Importantly, the health data lake also incorporates open health data from sources such as the U.S. Census, Centers for Disease Control and Prevention, the National Institutes of Health and publicly available clinical trials.

Franz sees the current era of healthcare financing as ripe for predictive analytics powered by the health data lake because the federal government and private payers are heading quickly toward value- and merit-based reimbursement.

Data sources and uses for Semantic Data Lake.

On the financial side, the partnership will market the data lake system -- against stiff brand name competition from rival cognitive computing vendors such as IBM Watson Health -- to healthcare organizations for sophisticated business intelligence.

"It's doctors, clinicians, researchers that will use this platform, but one very important one is the CEO, CFO and CMO," Aasman, a native of Holland, said. "They want to measure and predict reimbursements, the costs and the outcomes of clinical pathways."

What a user says

Parsa Mirhaji, M.D., director of clinical research informatics at Montefiore and Albert Einstein College of Medicine in the Bronx, is a cardiologist with a doctorate in medical informatics.

Mirhaji launched a real-world application of the AllegroGraph-based data lake in November 2015 and is now nearing the end of a six-month trial period. The data lake-based system -- which Montefiore plans to expand to other applications within its medical system, and also perhaps to market to other providers in partnership with Franz -- is already producing results, Mirhaji said.

Using the Semantic Data Lake, Montefiore has been running a preclinical predictive model to assess whether high-risk patients would have a life-threatening respiratory event.

Mirhaji said most existing models predict mortality or respiratory failure with 70% to 80% accuracy and within a 72-hour window.

"We are now able to predict with more than 85% accuracy the likelihood of respiratory failure or life-threatening events throughout the entire hospital within 48 hours," he said. "This will have huge implications in terms of quality of care and patient experience. We will save more lives, with less cost to the hospital and less pain to the patients."

"We are live. The system is working," Mirhaji said.

Montefiore, which has one of the busiest inner-city emergency departments in the country, is also an accountable care organization based on the Institute for Healthcare Improvement's "Triple Aim": improving patients' experience of care, improving population health and reducing the per capita cost of healthcare.

The ability to meet those goals will be greatly aided by the healthcare data lake, said Mirhaji, who represented Montefiore and the Einstein medical school at a recent White House summit roundtable focused on President Obama's Precision Medicine Initiative.

"It saves the hospital a lot of money if you can deal with the patient on the floor before the patient goes into the ICU, which is orders of magnitude more risky and more expensive to manage once that terminal condition happens," Mirhaji said. "Montefiore is really committed to building an analytics infrastructure because we want to be in the business of making predictions and prescriptions that are evidence based and data driven."

Montefiore also recently started another program that uses the data lake for cardio-genetic predictive analytics, estimating patients' likelihood of sudden cardiac death based on their genetic background.

"That requires a lot of data management doing cross-family analysis," Mirhaji said. "This is a project that will basically test the limits of the data lake."


Franz and Montefiore position the Semantic Data Lake for Healthcare as a single data platform for all health IT analytics.

Examples of queries providers can make to the data lake, according to Franz, are:

  • For readmission reduction: "Create a cohort for patients with pneumonia that were readmitted and a cohort for the same disease that were not readmitted."
  • High-cost patient prediction: "Find all patients with cardiac disorder of a genetic origin. Find the drugs prescribed. Use public linked data to find potential drug interactions. Break down by hospitals within 50 miles of San Jose, [Calif.], and treated by the coronary care units from June 2004 to May 2014."
  • Liability and risk mitigation: "Create cohorts of patients with similar diagnoses, but with different clinical outcomes and which resulted in legal expenses. Compare the cohorts and find the top 20 factors that might explain different outcomes, such as department, provider, socioeconomic factors, treatment plan, physician, geographical location."
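As a rough illustration of the first query above, the Python sketch below splits patients with a given diagnosis into a readmitted cohort and a not-readmitted cohort. The record layout (`patient`, `diagnosis`, `readmitted`) and the sample data are assumptions made for this example; the article does not describe how the data lake stores encounters or defines a readmission.

```python
from typing import Dict, List, Tuple, Union

Encounter = Dict[str, Union[str, bool]]

# Made-up encounter records, not real patient data.
ENCOUNTERS: List[Encounter] = [
    {"patient": "p1", "diagnosis": "pneumonia", "readmitted": True},
    {"patient": "p1", "diagnosis": "pneumonia", "readmitted": False},
    {"patient": "p2", "diagnosis": "pneumonia", "readmitted": False},
    {"patient": "p3", "diagnosis": "asthma", "readmitted": True},
    {"patient": "p4", "diagnosis": "pneumonia", "readmitted": False},
]


def build_cohorts(
    encounters: List[Encounter], diagnosis: str
) -> Tuple[List[str], List[str]]:
    """Return (readmitted, not_readmitted) patient lists for one diagnosis.

    A patient counts as readmitted if ANY encounter with that diagnosis
    is flagged as having led to a readmission.
    """
    seen = set()
    readmitted = set()
    for enc in encounters:
        if enc["diagnosis"] != diagnosis:
            continue
        seen.add(enc["patient"])
        if enc["readmitted"]:
            readmitted.add(enc["patient"])
    return sorted(readmitted), sorted(seen - readmitted)


readmitted_cohort, control_cohort = build_cohorts(ENCOUNTERS, "pneumonia")
```

Once the two cohorts exist, the comparison step described in the queries (finding factors that differ between them) would run on top of these patient lists.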


AllegroGraph is licensed per CPU core and is a component of the Semantic Data Lake for Healthcare. Exact pricing for the data lake for healthcare is not available.

Editor's note: This story was updated to clarify the nature of the partnership between Franz, Inc. and Montefiore Health System.
