When planning an evaluation with finite resources, which evidence shall we gather? Shall we interview this key informant, or hold a focus group with those people, or do a questionnaire with everyone, or even conduct a randomised controlled trial (RCT)? Should we take any notice of proposed systems of “levels of evidence” (Berriet-Solliec, Labarthe, and Laurent, 2014) which give absolute scores for each kind of source, and which would always rate an RCT over an interview?
Here’s an interesting made-up case which kicks the tyres of the problem a little.
Drug X has been licensed for several years now to treat illness J. We work for the regulatory authority, and we have a broadly-defined responsibility for evaluation too. We already have a reasonably adequate set of randomised controlled trials (RCTs) which demonstrate the efficacy of X. We are considering commissioning another RCT to replicate one of the studies in this evidence base. But then we receive an anonymous call from someone who is able to prove that they used to work for the manufacturer of drug X. This informant claims that the method used to assess symptoms in all preceding studies about X was flawed due to an unintentional error in instrument design. Our experts confirm that the claim is plausible, though not obviously true. The informant says that she has video evidence which will demonstrate “beyond reasonable doubt” that the flaw exists and that the size of the benefit due to the drug is actually so small that it is outweighed by the risks, which are small but not negligible. So it might be better to take the drug off the market. What’s more, the flawed method is the only one available to reliably assess the symptoms, so any additional trial would just repeat the error.
In this situation, we are almost certain to decide to spend a day interviewing the informant and seeing the video, and to at least delay the planned trial. Let’s be clear what that means: in this case, our first choice for improving our causal model is a single interview rather than a whole RCT.
Note that the strength of each arrow in the causal diagram is negative, as more treatment means fewer symptoms. The scenario “additional RCT conducted, does not confirm existing findings” is not shown.
Our (correct) gut feeling is to prefer the interview because it has a reasonable chance of producing evidence which would substantially change our knowledge (it would substantially change our best estimate of the size of the influence of the drug on symptoms – a causal parameter).
This power of a piece of evidence to update our knowledge is called probative value. “By probative value, we mean the power of specific items of evidence to increase or decrease our confidence in a specific claim” (Befani and Stedman-Bryce, 2017).¹ The RCT, on the other hand, has a very good chance of adding more evidence which would hardly change the content of our knowledge but would usefully improve our confidence in it (it would not change our best estimate of the influence but it would narrow the confidence intervals, or, from a Bayesian perspective, tighten the probability distribution). That improvement in confidence sounds very desirable, but not as desirable as the potential change in knowledge offered by the interview: the RCT’s probative value is not as high.
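To make that contrast concrete, here is a minimal sketch in Python of the distinction just described. Everything in it – the prior effect size, the standard deviations, the 40% plausibility of the flaw and the `normal_update` helper – is an invented illustration, not part of the scenario or of anyone’s actual method.

```python
# Toy sketch (not anyone's published method): comparing how two possible pieces of
# evidence would change our belief about the drug's effect on symptoms.
# All numbers below are made-up assumptions for illustration only.

def normal_update(prior_mean, prior_sd, obs_mean, obs_sd):
    """Conjugate normal-normal update: returns posterior mean and sd."""
    prior_prec = 1 / prior_sd**2
    obs_prec = 1 / obs_sd**2
    post_prec = prior_prec + obs_prec
    post_mean = (prior_mean * prior_prec + obs_mean * obs_prec) / post_prec
    return post_mean, post_prec**-0.5

# Current belief about the (negative) effect of drug X on symptoms,
# summarising the existing RCT evidence base.
prior_mean, prior_sd = -0.50, 0.10

# Option A: a replication RCT. We expect it to land near the current estimate,
# so it mostly tightens the distribution rather than moving it.
rct_mean, rct_sd = normal_update(prior_mean, prior_sd, obs_mean=-0.50, obs_sd=0.10)

# Option B: the interview plus video. Suppose our experts think there is a 40%
# chance the instrument flaw is real; if it is, the true effect is only about -0.05.
p_flaw = 0.40
interview_mean = p_flaw * -0.05 + (1 - p_flaw) * prior_mean  # expected new estimate

print(f"RCT:       estimate {rct_mean:+.2f}, sd {rct_sd:.3f}  (estimate barely moves)")
print(f"Interview: expected estimate {interview_mean:+.2f}      (estimate moves a lot)")
```

On these invented numbers, the replication RCT mainly shrinks the spread around an unchanged estimate, while the interview, if the flaw turns out to be real, drags the estimate itself towards zero – which is the sense in which its probative value is higher.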
(Our decision will most likely also add estimates of the benefits and dis-benefits of the different outcomes into the mix. The likely change in our knowledge due to the interview would bring substantial benefit in terms of risks avoided and resources saved in the future, again more than we would expect from the RCT. We could even include costs – interview versus RCT – in the calculation. In order to do these “calculations”, we’d have to work not just with our single best guess of the strength of the drug’s impact on symptoms, but with its probability distribution. That’s quite tricky in practice. Anyway, for now, let’s just focus on the knowledge outcome, the probative value, rather than the benefits and costs of that knowledge, what one might perhaps call “probative consequences”.)
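And here, equally schematically, is what folding in those “probative consequences” might look like. Again, every figure below (the cost of each option, the value of avoiding future harm, the chance that each option actually reveals the flaw) is a made-up placeholder; the point is only the shape of the calculation.

```python
# Toy sketch of "probative consequences": weighing the expected decision benefit
# of each piece of evidence against its cost. Every number is an invented placeholder.

p_flaw = 0.40             # assumed chance the instrument flaw is real
harm_avoided = 1_000_000  # assumed value of withdrawing a net-harmful drug (arbitrary units)

# The interview can reveal the flaw (and trigger withdrawal); the replication RCT,
# which would reuse the same flawed instrument, cannot.
options = {
    "interview":       {"cost": 1_000,   "p_reveals_flaw": 0.9},
    "replication RCT": {"cost": 500_000, "p_reveals_flaw": 0.0},
}

for name, o in options.items():
    expected_benefit = p_flaw * o["p_reveals_flaw"] * harm_avoided
    print(f"{name:16s} expected net value: {expected_benefit - o['cost']:>10,.0f}")
```

On these numbers the interview wins by a wide margin; what matters is that the comparison is driven by probative value and its consequences, not by the status of the method.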
Sceptic: “I object! The interview only has good probative value because it is adding to pre-existing causal knowledge which was gained using RCTs.”
Me: “Well yes, but all our knowledge acquisition, including via RCTs, depends to a greater or lesser extent on pre-existing causal knowledge – for example, in this case, our knowledge about the accuracy of the instrument, which was probably based on some experiment in years gone by and which now turns out to be possibly erroneous. In fact, the interpretation of any piece of evidence always stands on the shoulders of other giants, depending on many other pieces of knowledge gained in various different ways, including by experiment.”
Sceptic: “Anyway, the interview only serves to update our knowledge about a causal parameter – the strength of a causal link. It doesn’t identify new variables or prove or disprove the existence of causal links between variables.”
Me: “That’s true in this case. It’s true that a particular strength of RCTs is to identify causal links, i.e. to show that the strength of the causal link from one variable to some other variable is not zero. It’s true that this task is different, and in some ways harder, than showing that the strength of the link is, say, .5 rather than .8. Anyway, my next task is to construct an example in which an interview (or some other trivial-seeming method) can give evidence about the existence or non-existence of causal links.”
To sum up, this example illustrates how we might make decisions about using one source of evidence rather than another by comparing probative value. We didn’t need to consider any of the various rather arbitrary schemes for scoring “Levels of Evidence” (Berriet-Solliec et al., 2014) according to which an RCT would automatically score above an interview.
Befani and Stedman-Bryce (2017) suggest getting a panel of experts to actually quantify probative value, to help decide which evidence to gather in difficult and high-stakes cases. But my intention here is to show that it can be useful even to do the thought-experiment of imagining “putting numbers on” these kinds of problems. Such exercises can help cure us of the idea that one type of evidence is always going to be better than another just because of the “method” involved.
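For the binary-claim version of the idea (see the footnote below), the arithmetic of “putting numbers on” probative value is just Bayes’ rule. The sketch below is a generic illustration with guessed probabilities, not the worked procedure from Befani and Stedman-Bryce (2017).

```python
# A generic Bayes-rule sketch of "putting numbers on" probative value for a binary
# claim ("the instrument flaw is real"). The probabilities are illustrative guesses
# of the kind a panel might supply, not figures from any published study.

def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Probability the claim is true after seeing the evidence."""
    numerator = prior * p_evidence_if_true
    return numerator / (numerator + (1 - prior) * p_evidence_if_false)

prior = 0.40  # how plausible our experts find the flaw before seeing the video

# Highly probative evidence: very likely if the claim is true, unlikely otherwise.
print(posterior(prior, p_evidence_if_true=0.95, p_evidence_if_false=0.05))  # ~0.93

# Weakly probative evidence: almost as likely either way.
print(posterior(prior, p_evidence_if_true=0.60, p_evidence_if_false=0.50))  # ~0.44
```

Evidence that is much more likely if the claim is true than if it is false moves us a long way (here from 0.40 to about 0.93); evidence that is nearly as likely either way barely moves us at all – low probative value.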
Befani, B., and Stedman-Bryce, G. (2017). Process Tracing and Bayesian Updating for impact evaluation. Evaluation, 23(1), 42–60. https://doi.org/10.1177/1356389016654584
Berriet-Solliec, M., Labarthe, P., and Laurent, C. (2014). Goals of evaluation and types of evidence. Evaluation, 20(2), 195–213. https://doi.org/10.1177/1356389014529836
¹ Note that the examples in that paper involve only binary outcomes, so gains in knowledge can be expressed only in terms of confidence. The imaginary case I present here is more general because it involves a continuous outcome, so I have to treat the estimate and our confidence in that estimate as separate aspects of our knowledge. My version is therefore a modest, if vague, generalisation of the idea.