September 22, 2025
by Mary Ellen Schneider

TRIALSCOPE: Using AI to scale real-world data

A framework that uses generative artificial intelligence (AI), such as biomedical language models and probabilistic modeling, was able to extract and structure unstructured patient data from electronic medical records (EMRs) and reproduce the results of lung and pancreatic cancer clinical trials.
 
The findings, published in NEJM AI on September 22, suggest that the framework, called TRIALSCOPE, could be used to generate population-level real-world evidence (RWE) from EMRs at scale.
 
“TRIALSCOPE combines automated EMR curation from unstructured and noisy data sources with a clinical trial simulator for causal treatment effect estimation. We integrate state-of-the-art methods to enhance data quality, with a quality evaluation process designed for robustness and transparency,” Javier González of Microsoft Research in Cambridge, UK, and colleagues wrote.
 
TRIALSCOPE has five components: a data structuring pipeline that extracts and structures EMR data, a probabilistic latent variable model that “denoises” the data and fills in missing values, a patient triaging pipeline that identifies patients for target trials, a causal model for simulating a trial, and a set of validation tests for the simulation.
 
Using a cancer patient cohort from a large US health network, the researchers simulated 11 previously published clinical trials of advanced non-small cell lung cancer (NSCLC) to gauge whether the RWE produced using TRIALSCOPE was consistent with validated trial evidence. The key metric for each simulation was the hazard ratio for overall survival, with the simulated hazard ratio compared against the reference hazard ratio. In cases where a reference hazard ratio was unavailable, the researchers applied a series of other tests, including computing the hazard ratio with a random down-sampling of data.
 
Nine of the 11 published trials reported hazard ratios. In those nine trials, the results of the simulated trials and the published trials were statistically equivalent.
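
For readers who want a concrete sense of what comparing two hazard ratios can involve, here is a minimal Python sketch. It is not the method used in the study, which this article does not detail. It assumes each hazard ratio is reported with a 95% confidence interval, recovers the standard error of the log hazard ratio from that interval, and applies a simple z-test to the difference. The function names and all numbers are hypothetical, and failing to detect a difference is a weaker claim than a formal equivalence test.

```python
import math


def log_hr_se(ci_lower, ci_upper, z=1.96):
    """Approximate standard error of log(HR), recovered from a 95% CI."""
    return (math.log(ci_upper) - math.log(ci_lower)) / (2 * z)


def compare_hazard_ratios(simulated, reference, z=1.96):
    """Compare two hazard ratios on the log scale.

    Each argument is a tuple (hr, ci_lower, ci_upper).
    Returns (z_statistic, consistent), where `consistent` is True when
    the difference in log(HR) is not significant at the two-sided 5% level.
    """
    diff = math.log(simulated[0]) - math.log(reference[0])
    se = math.sqrt(
        log_hr_se(simulated[1], simulated[2], z) ** 2
        + log_hr_se(reference[1], reference[2], z) ** 2
    )
    z_stat = diff / se
    return z_stat, abs(z_stat) < z


# Hypothetical values for illustration only; not taken from the study.
simulated_hr = (0.70, 0.55, 0.89)
reference_hr = (0.73, 0.60, 0.89)
z_stat, consistent = compare_hazard_ratios(simulated_hr, reference_hr)
print(f"z = {z_stat:.2f}, consistent = {consistent}")
```

Working on the log scale is the usual choice because hazard ratios are multiplicative, so their confidence intervals are roughly symmetric after taking logarithms.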
 
In the case of one trial (EMPHASIS-lung), which was discontinued due to lack of enrolled patients, the TRIALSCOPE framework was able to simulate the unconducted trial and predict a hazard ratio suggesting that the treatment being studied was effective. The simulation passed five diagnostic tests of its validity.
 
Finally, to assess TRIALSCOPE’s generalizability outside of lung cancer, the researchers simulated a trial of treatment regimens for patients with metastatic pancreatic cancer. The simulated trial produced a statistically equivalent hazard ratio. They also ran an additional simulation that removed the trial’s eligibility criteria and included all patients with pancreatic cancer in the database, demonstrating TRIALSCOPE’s potential to show how treatments will perform in real-world populations.
 
“Our experimental results demonstrate that combining structuring capabilities powered by biomedical language models and causal inference capabilities can transform EMRs into scalable RWE generators. We have also shown that evidence about the efficacy of drugs does not need to be limited to cases where a randomized trial exists,” González and colleagues wrote.
 
However, Issa J. Dahabreh of Harvard T.H. Chan School of Public Health in Boston and colleagues wrote in an accompanying editorial that the TRIALSCOPE simulations did not go far enough in addressing “confounding by unmeasured variables” in the target trials, even with the current validation steps. In the case of high-stakes clinical and regulatory decisions, direct trial evidence may still be necessary.
 
Dahabreh and colleagues suggested that the TRIALSCOPE framework could still be valuable for informing randomized trial design and augmenting randomized trials. 
 
In another editorial on TRIALSCOPE, David Ouyang of Kaiser Permanente Northern California and Jeffrey M. Drazen of Brigham and Women’s Hospital in Boston suggested that the framework has potential in cancer, where clear-cut diagnostic measures and outcomes mean that EMR data and trial data are often aligned. However, it is unclear whether TRIALSCOPE would work as well in other conditions, such as asthma or headache, where diagnosis and outcomes carry more subjectivity, they wrote.
 
“Many areas of medicine have a paucity of well-conducted clinical trials as well as few treatment options. For such cases, it would be hard to confirm the accuracy of TRIALSCOPE. For TRIALSCOPE to prove its real worth, it will need to show that ‘real-world data’ can be used to successfully emulate clinical trials for conditions outside of cancer,” Ouyang and Drazen wrote.
 
NEJM AI study
Dahabreh et al. editorial
Ouyang and Drazen editorial