
From chaos to clarity: Unlearn’s quest to build the ultimate clinical dataset

April 9, 2024

By Evan Estola, Head of Data

At Unlearn, we have an ambitious mission: to advance AI to eliminate trial and error in medicine. That journey begins with building one of the largest and richest health outcomes datasets in the world. After all, no existing health dataset has everything: rich prognostic information on every patient, outcomes that match clinical trial objectives, and many thousands of patients across every indication. That’s why we’ve set out to build one.

As at any AI company, data is at the heart of what we do at Unlearn. We create advanced AI models called digital twin generators. Each one forecasts a clinical trial participant’s clinical outcomes over time, and we call that forecast the participant’s digital twin. Digital twins allow us to run faster and more efficient clinical trials. Digital twin generators are trained on extensive patient-level data from past clinical trials and observational studies. To create digital twins for trial participants with a specific indication, like Alzheimer’s disease, the AI models need to learn from many examples of patients with that condition.

Step one in building a game-changing health dataset is partnering with research institutes, advocacy groups, academics, commercial vendors—and anyone else collecting health information to advance medicine. Through these partnerships, we collect de-identified, longitudinal data—i.e., patient data collected over time where identifiers have been removed to protect privacy. We’ve built a dataset of over 1 million patients across 30+ indications to date.

Step two is organizing, cleaning, and harmonizing that data, ensuring every piece of information can seamlessly integrate into our AI models. No two data sources are organized the same way. Every field name and acronym needs to be tracked down and standardized; we need to understand every protocol and assessment. We’ve built tools that allow us to process and integrate data from these disparate sources, but it’s still a monumental task.
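
To make the field-name problem concrete, here is a minimal sketch in Python of what harmonizing two sources might look like. It is an illustration only, not our production tooling: the source names, field names, and the `harmonize` helper are all hypothetical.

```python
# Hypothetical mapping from each source's field names to one shared schema.
# The sources, raw names, and canonical names are invented for illustration.
FIELD_ALIASES = {
    "study_a": {"PTID": "patient_id", "VISDAY": "visit_day", "MMSCORE": "mmse_total"},
    "study_b": {"subject": "patient_id", "day": "visit_day", "mmse": "mmse_total"},
}


def harmonize(record, source):
    """Rename one raw record's fields to the shared schema."""
    aliases = FIELD_ALIASES[source]
    # Fail loudly on any field we have not tracked down yet,
    # rather than silently dropping it.
    unknown = sorted(set(record) - set(aliases))
    if unknown:
        raise KeyError(f"unmapped fields from {source}: {unknown}")
    out = {canonical: record[raw] for raw, canonical in aliases.items() if raw in record}
    out["source"] = source  # keep provenance alongside the harmonized values
    return out


rows = [
    harmonize({"PTID": "A-001", "VISDAY": 0, "MMSCORE": 24}, "study_a"),
    harmonize({"subject": "B-17", "day": 90, "mmse": 21}, "study_b"),
]
```

After this step, both records share the same keys, so they can be stacked into a single table. Raising an error on unmapped fields is a deliberate choice in this sketch: it forces every new field name to be reviewed before it enters the dataset.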

Our efforts so far have culminated in a cleaned and harmonized dataset that now includes over 300,000 patients and more than 1 million patient-provider interactions, and we continue to add to it. This dataset provides a robust foundation for training our 13 digital twin generators (DTGs) and spearheading future innovations.

Behind Unlearn’s vision is a team of dedicated experts from healthcare, data science, machine learning, and engineering who are working together to revolutionize drug development with AI. Visit our publications page to learn more about our AI innovation, and check out our open roles to join us at the forefront of AI in medicine.