From searching and screening to drafting text, artificial intelligence (AI) tools offer the promise of increased efficiency. But with promise comes responsibility. Ella Flemyng, Cochrane’s Head of Editorial Policy and Research Integrity, highlights what you need to consider when deciding whether to use an AI tool, and when you might decide not to.
We want to help empower evidence synthesists to be critically cautious. That means asking the right questions to help justify your decision to use a tool. The considerations in this blog are grounded in the Responsible use of AI in evidence SynthEsis (RAISE) guidance, particularly RAISE 3, which focuses on selecting and using AI tools.
Based on RAISE, Cochrane’s position statement outlines four expectations for evidence synthesists using AI:
- You are ultimately responsible for your research, including the decision to use AI and how it is used
- You can use AI provided you can demonstrate it will not compromise the methodological rigour or integrity of your synthesis
- AI use should be fully and transparently reported
- AI should be used with human oversight
Expectations 1 and 2 are where you assess an AI tool to help make an informed decision on whether to use it. But what steps do you need to take to do that?
Expectation 1: assess the tool before you use it
The first step is to assess the tool using the responsible handover framework, provided in RAISE 3. This covers five areas:
- What is the purpose of the tool?
- Where have the training, testing and validation data come from?
- Is the tool validated and performing sufficiently well for use?
- Usability and user capability
- Transparency, licenses, availability & documentation
This process relies on publicly available information, but don’t hesitate to contact developers directly to highlight the information you need. Many tool developers aren’t aware of the expectations and are still learning what information users need. And we’ve seen that engagement can lead to improved transparency: some developers we’ve contacted have made information publicly available following our enquiries.
While working through the framework, you can decide not to proceed at any point. Reasons for this include (but aren’t limited to!):
- unvalidated tools, weak evidence, or risks that can't be mitigated
- no published validation in a relevant context
- validation that is not replicable
- performance claims based solely on developer‑led studies or weak methods
- a lack of compliance with organizational, national, or international legal/policy requirements
- terms of use allow content to be reused for model training without an opt‑out
- inadequate human oversight (e.g. no monitoring or auditability)
- lack of responsiveness or transparency from the AI developer
If you do not find any of these types of red flags, you may decide to proceed if it’s a validated tool with strong supporting evidence, well understood risks and manageable limitations. Alternatively, you might decide to proceed with mitigations. This applies to promising tools with evidence gaps or moderate, monitorable risks.
Expectation 2: show the tool won’t compromise your review
This brings us to expectation 2: demonstrating that a tool will not compromise the methodological rigour or integrity of your synthesis. The evidence gathered during the responsible handover framework helps you do this, and if you’ve decided to proceed with mitigations, additional verification or validation is needed.
To help systematic reviewers, RAISE 3 includes an overview of how AI is being used in different types of tools at different stages of the review process, alongside a recommendation on their use. The recommendations range from:
- acceptable for use, which would require a disclosure with a brief justification
- human verification, which would require a disclosure with description of verification methods and justification
- requires validation within the review, which would require a full disclosure with a description of validation methods within your review and justification
- exploratory and supplementary, which would also require a full disclosure with a description of validation methods within your review methods and justification
- not acceptable, which, obviously, should not be used!
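As a minimal sketch, the pairing of each recommendation with its disclosure requirement in the list above can be written as a simple lookup. The names `DISCLOSURE_BY_RECOMMENDATION` and `required_disclosure` are hypothetical and used only for illustration; the wording of each entry is taken from the list, and RAISE 3 remains the authoritative source.

```python
# Hypothetical lookup: recommendation level -> what the disclosure should contain.
# The entries paraphrase the list above; they are not an official RAISE data structure.
DISCLOSURE_BY_RECOMMENDATION = {
    "acceptable for use": "disclosure with a brief justification",
    "human verification": (
        "disclosure with a description of verification methods and justification"
    ),
    "requires validation within the review": (
        "full disclosure with a description of validation methods "
        "within the review and justification"
    ),
    "exploratory and supplementary": (
        "full disclosure with a description of validation methods "
        "within the review methods and justification"
    ),
}


def required_disclosure(recommendation: str) -> str:
    """Return the disclosure expected for a given recommendation level."""
    key = recommendation.strip().lower()
    if key == "not acceptable":
        # No disclosure applies because the tool should not be used at all.
        raise ValueError("Tools rated 'not acceptable' should not be used")
    return DISCLOSURE_BY_RECOMMENDATION[key]


print(required_disclosure("Human verification"))
```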
At present, the recommendation for all large language model or generative AI-based tools is to “proceed with mitigations”. Because they function as “black boxes” and their inner workings are poorly understood, you will either need to carry out human verification or validate the tool in your review, by carrying out a study within a review (SWaR).
A real-world example of a SWaR is the Cochrane Evaluation of (Semi-) Automated Review methods (CESAR) project. This study will test whether AI tools can support or enhance evidence synthesis by putting different tools to the test across roughly 15 Cochrane review updates and comparing their performance with traditional methods through the work of author teams.
Because CESAR is a platform study, tools that don’t meet the study’s expectations can be removed, while new tools can be added over time. However, to know if the tools are meeting expectations, the study needs defined performance thresholds.
One thing to note is that these tools are meant to make processes more efficient and review creation faster. Therefore, if after the standard onboarding the tool makes your work more difficult or the process less efficient, and the support available to overcome these challenges is inadequate, this could be a reason not to proceed with the AI tool.
Defining performance thresholds
Why do thresholds matter for validation studies for AI tools in evidence synthesis?
One of the biggest challenges in AI for evidence synthesis is determining what counts as “good enough.” Is this AI tool appropriate, or isn’t it?
Although there are recommendations in RAISE around rigour, there is currently no agreement on what constitutes “good enough” for AI tool performance, both in terms of the level of confidence and the level of performance. Thresholds or benchmarks may vary based on the AI tool or system and currently, there isn’t consensus on what these should be.
Because we are working in an emerging field, we sometimes have to navigate areas where there is a lack of clarity. However, defining thresholds is essential for ensuring transparency and accountability, supporting consistent decision making, and preventing the degradation of methods.
Some studies are beginning to define informed thresholds that we can build on, and CESAR is an example of this. As a platform study, we will stop using a tool if it doesn’t meet expectations. This required predefined thresholds for the performance metrics covering screening, data extraction and usability.
CESAR thresholds were informed by findings from a community survey that explored expectations for evidence synthesis when using AI (led by Cochrane as part of their work in the Destiny project), RAISE and expert advice from the joint AI Methods Group.
| Performance metrics | Futility boundaries (point estimate) | Non-inferiority margins (upper limit of 95% CI)* | Decision rules |
|---|---|---|---|
| Screening | | | |
| Sensitivity | <80% | <95% | Stop if either boundary is crossed |
| Specificity (for full-text screening only) | <50% | <60% | Stop if either boundary is crossed |
| Data extraction | | | |
| Sensitivity | <92% | <97% | Stop if either boundary is crossed |
| Major error proportion | >3% | >2% | Stop if either boundary is crossed |
| Usability | | | |
| System usability scale (score) | <57 | <75 | Stop if threshold is not met |
The futility boundary is the minimum acceptable level of performance and so a point estimate below the futility boundary indicates clear underperformance.
The non-inferiority margin is the performance we are aiming for, so if even the most optimistic estimate (the upper confidence limit) cannot reach this level, the tool lacks sufficient promise to justify continued use.
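To make the two rules concrete, here is a minimal sketch of how a stopping check could be computed for a metric where higher is better, such as screening sensitivity. It is an illustration only, not CESAR's analysis code. The function names are hypothetical, and the Wilson score interval is an assumption: the method used to calculate the 95% confidence interval is not stated here. Metrics where lower is better, such as the major error proportion, would need their own handling and are not covered.

```python
from math import sqrt


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%).

    Assumption: the choice of the Wilson method is ours, made for illustration.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


def should_stop(successes: int, total: int, futility: float, margin: float) -> bool:
    """Apply both rules to a higher-is-better metric.

    Stop if the point estimate is below the futility boundary, or if the
    upper limit of the 95% CI is below the non-inferiority margin.
    """
    point_estimate = successes / total
    _, upper_limit = wilson_interval(successes, total)
    return point_estimate < futility or upper_limit < margin


# Screening sensitivity thresholds from the table: futility 80%, margin 95%.
# Hypothetical data: the tool found 170 of 200 relevant records.
print(should_stop(170, 200, futility=0.80, margin=0.95))
```

In this hypothetical case the point estimate is 85%, which clears the futility boundary, but the upper limit of the interval is roughly 89%, which falls short of the 95% margin, so the check returns `True` and the rule says stop.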
These thresholds have been helpful for this project, but we know that agreement on thresholds will evolve as the evidence synthesis community shares learning, publishes evaluations, and uses adaptive study designs.
If you are setting your own thresholds, you should consider them prospectively. You should have a clear idea of acceptable thresholds when you are working through the responsible handover framework and critiquing validation studies, but you also need to be clear when you are validating the tool within your own review. Even in an emerging field, clarity about what “good enough” means is essential.
What next?
If you’re using AI you need to be rigorous in what you do, but also in how you report it. This takes us to expectation 3. You should follow the reporting guidance in the position statement and Cochrane’s review template (see the ‘Disclosure of AI use’ section for the full guidance on what to report):
“We will use [AI system/tool/approach name, version, date of use] developed by [organization/developer] for [specific purpose(s)] in [the evidence synthesis process]. The [AI system/tool/approach] will [state it will be used according to the user guide, and include reference, and/or briefly describe any customization, training, or parameters to be applied].
Outputs from the [AI system/tool/approach] are justified for use in our synthesis because:
- [state the degree of human oversight such as any steps taken to review, verify, or override AI-generated outputs.]
- [describe how you have determined it is methodologically sound and will not undermine the trustworthiness or reliability of the synthesis or its conclusions (e.g., model validation, feature validation)]
- [describe how it has been validated or calibrated to ensure that it is appropriate for use in the context of the specific evidence synthesis, to include degree of author involvement, if not covered in the user guide, evaluations or elsewhere (e.g., real-world effectiveness)].
Limitations [of the AI system/tool/approach] include [describe known limitations, potential biases, and ethical concerns]/ [are included as a supplementary material]. [If applicable] A detailed description of the methodology, including parameters and validation procedures, is available in [supplementary materials].”
Your report should clearly describe which tools were used, how they were used, any validation or verification methods and your justification for using them. Transparency is essential for reproducibility and trust.
If you meet expectations 1 to 3, then you meet expectation 4: AI must be used with human oversight! At its core, this expectation is about intentionality and accountability: the conscious decision to use AI and the justification for it.
Meeting this expectation is the overall process of using public information, published evidence, and the verification and/or real-world validation methods you decided on to justify how you plan to use the tool.
Cochrane and the wider community are continuing to develop this field through work including the CESAR project, Destiny project and the Joint AI Methods Group. We hope these efforts will help refine standards, improve guidance, and build consensus on what “good enough” really means.
Part of the CESAR study was supported by the Wellcome Trust grant number 323143/Z/24/Z.
Find out more
- What makes Cochrane’s new AI study innovative?
- Cochrane announces selected AI tools for innovative platform study
- Read more about the selection process and how Cochrane values were applied
- Cochrane launches innovative study to assess AI tools for evidence synthesis
- AI tools for evidence synthesis - Are you a tool provider? Register your expression of interest here