What makes Cochrane’s new AI study innovative?

Discover how the Cochrane Evaluation of (Semi-) Automated Review methods (CESAR) project works and what makes it novel


Cochrane has released a pre-print of the protocol for our innovative study that will test whether artificial intelligence (AI) tools can support or enhance evidence synthesis. Here Gerald Gartlehner from Cochrane Austria, Principal Investigator for the study, discusses how the study works, and what makes it interesting.

About the study

In this study, the Cochrane Evaluation of (Semi-) Automated Review methods (CESAR) project, we’re putting a range of AI tools to the test across roughly 15 Cochrane review updates, comparing their performance against traditional methods as carried out by author teams. The study is set up as an adaptive platform Study Within A Review (SWAR), an innovative and flexible design that lets us evaluate multiple interventions simultaneously under a single protocol.

The protocol outlines criteria we expect the AI tools to meet, such as sensitivity, error proportion, and usability. Because this is a platform study with an interim analysis, tools that fall short can be removed, while new AI tools can be added over time.

Following a selection process guided by Responsible AI in Evidence SynthEsis (RAISE) principles, two tools (Laser AI and Nested Knowledge) have been chosen for the study. Five additional tools remain on a reserve list and may be incorporated later as the study progresses.

Why is this innovative?

Providing flexibility

We believe that the study combines methodological rigour with a design that reflects the rapidly evolving AI landscape. Adaptive platform trials transformed clinical research during COVID‑19 by allowing rapid, evidence‑based decisions within a single ongoing study.  

By adapting the logic of adaptive platform trials from clinical research, the study makes an important methodological advance over static AI tool evaluations. We have the flexibility to bring in new tools, discontinue underperforming ones, or add more stages of the review as technologies evolve, while preserving a structured comparative framework. This flexibility is especially important in a field where AI systems change rapidly and results from conventional evaluations may become outdated before they are used.

Navigating grey areas

Working in an emerging field sometimes means navigating areas where there is a lack of clarity. For example, there is currently no agreement across the evidence synthesis community on what constitutes “good enough” performance for AI tools. However, our study needed defined thresholds for performance metrics. To tackle this, the study built on findings from a community survey that explored expectations for evidence synthesis when using AI (led by Cochrane as part of its work in the Destiny project) and also drew on expert advice from the joint AI Methods Group. These insights helped us define performance metrics that reflect community expectations.

Being the first

Another unresolved question is how AI might influence review findings. Many people assume AI is acceptable as long as it does not affect the review’s findings and conclusions, but so far, no one has quantified this risk. Ours is the first study that aims to quantify the downstream effects of errors on the findings and conclusions.

It is also the first real-world effectiveness validation built on the foundations of RAISE that includes metrics in all key areas. In addition to the potential downstream effects of errors on findings and conclusions, we will be assessing accuracy, stability, efficiency, and usability.

Encouraging others

If we want to know whether AI actually helps in evidence synthesis, we have to test it where the work really happens, not just on tidy benchmark datasets. Our study is based on a SWAR protocol, and it is designed to be adapted: we can plug in, update, or retire AI tools across screening and data extraction tasks without disrupting the framework. And embedding evaluations in ongoing reviews means that findings will reflect real-world performance rather than benchmark artifacts.

In addition, the study provides a best-practice blueprint for other teams seeking to validate AI tools within their evidence synthesis workflows. We encourage other review teams to adapt the protocol to their needs.

Setting new standards

This study combines adaptive design, responsible AI principles, informed performance thresholds, and real-world testing within ongoing Cochrane reviews. We hope it will not only tell us whether specific AI tools can support evidence synthesis, but also help establish how AI can be evaluated and what standards should be applied.

In a landscape where AI is advancing quickly and claims are plentiful, this study offers careful, transparent, and evidence‑based evaluation, and that is what makes it truly innovative.

Find the full protocol here

Part of the CESAR study was supported by the Wellcome Trust grant number 323143/Z/24/Z.

Find out more

Cochrane announces selected AI tools for innovative platform study

Read more about the selection process and how Cochrane values were applied

Cochrane launches innovative study to assess AI tools for evidence synthesis
