I took the National Institutes of Health’s course Protecting Human Research Participants this week. It’s aimed at people setting up experiments involving human subjects and covers areas such as risk (including identification, minimisation and comparison with benefits, both personal and communal), recruitment (including coercion, balance), rights of the participants (including consent, welfare, vulnerable groups) and statutes (including international research, differences between definitions in different US bodies).
In a section on the design of a clinical trial (where interventions – such as drugs – are being compared for efficacy in treating some illness in humans), the term equipoise was introduced. The definition given is:
Substantial scientific uncertainty about which treatments will benefit subjects most, or a lack of consensus in the field that one intervention is superior to another.
and the course notes say:
A state of equipoise is required for conducting research that may pose risks to research participants.
For a clinical trial to be in equipoise, investigators must not know that one arm of a clinical trial provides greater efficacy over another, or there must be genuine uncertainty among professionals about whether one treatment is superior than another.
Equipoise is essential for obtaining generalizable knowledge. If a clear and agreed-upon answer exists, asking research participants to assume the risks of research that will provide the same information is not acceptable; no new knowledge will be gained from the study.
According to Wikipedia, regulatory bodies do not agree on the utility of the concept, but the pro argument runs like this: the need for equipoise arises from a potential ethical dilemma. For example, is it ethical to run a trial in which the researchers are certain that one treatment is substantially better than another, i.e. where some of the participants with an illness will almost certainly receive something other than the optimal treatment? Or, even if an experiment started with genuine uncertainty, imagine that the researcher has strong evidence that one drug is dramatically outperforming the others under investigation before the experiment ends. Should they switch all of the participants to the better treatment? If so, at what point?
Opponents of the use of equipoise argue that it mixes up concepts of therapy and research to create the ethical dilemma. Only if the investigators are regarded as providing treatment to the subjects is there any medical-ethical obligation on them to provide the best treatment. In research, the greater good of the population as a whole might outweigh the needs of a specific individual – presumably up to some point where the treatment under test is proving actively detrimental to the subject.
And so to software testing. We’re in the risk business too. Could equipoise be of use to us?
At one level there’s the idea that peer decision-making is appropriate under some circumstances. I think we’d probably all accept that. Which of us doesn’t sometimes ask a colleague or the community for advice or a second opinion? As stated here, though, the need for equipoise is predicated on the possibility of risk to participants. Glossing this as simply risk to someone who matters, there’s unlikely to be a test that doesn’t pose some risk if only because by performing one test, in a fixed budget, there’s almost certainly some other test you didn’t perform.
Continuing that thought, the notion of equipoise itself doesn’t seem to take account of the level of risk or the cost of an experiment. If it’s cheap to perform a test with low risk to the people who matter, perhaps the level of uncertainty required to motivate that test should be lower than the level required for an expensive, high-risk test.
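To make that thought concrete, here is a minimal sketch of a cost-sensitive threshold. It is purely illustrative: the function names, the idea of a single numeric "payoff" for learning something new, and the break-even rule are all my assumptions, not anything from the course or from an established testing method.

```python
def required_uncertainty(cost, payoff):
    """Smallest chance of learning something new that justifies a test.

    Hypothetical break-even rule: run the test when
    (chance of a novel result) * payoff >= cost.
    'cost' is assumed to include risk and opportunity cost, in the
    same arbitrary units as 'payoff'.
    """
    if payoff <= 0:
        raise ValueError("payoff must be positive")
    return min(1.0, cost / payoff)


def worth_running(p_novel, cost, payoff):
    """True if our estimated chance of a novel result clears the threshold."""
    return p_novel >= required_uncertainty(cost, payoff)


# Same 10% hunch that we'd learn something, very different costs.
cheap = worth_running(p_novel=0.1, cost=1, payoff=100)
expensive = worth_running(p_novel=0.1, cost=50, payoff=100)
print(cheap, expensive)
```

Under these assumptions the cheap test needs only a sliver of doubt to be worth running, while the expensive one needs something much closer to genuine equipoise.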
The text I quoted includes the phrase “if a clear and agreed-upon answer exists … no new knowledge will be gained from the study”. When trying to decide whether to perform a particular test, we’ll ask ourselves whether we think we might find something new – a debate that frequently squirts out of the side of discussion on regression testing such as this one yesterday on Twitter – or, perhaps better, whether the cost (including opportunity cost) of running the test justifies the risk that we won’t generate novel results.
In the general case, trials are long-running with reasonably specific aims couched in terms of hypotheses. The investigators will designate confidence thresholds above which an hypothesis will be said to be true. For example: drug A is better on the general population than drug B because the statistics we have done were significant at the 95% confidence level. Those statistics were based on the attributes of the patients that we measured and on our sample set, which we selected carefully to be representative of the population and randomised for the trial, which was also double-blinded. Most software testing isn’t framed this way or, at least, not formally. Much software testing is framed in the form of a question and a binary opposition – does it X (by whatever criteria)? Pass or fail.
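As a rough illustration of the difference in framing, here is a sketch of the kind of calculation that might sit behind "significant at the 95% confidence level". I've assumed a simple two-proportion z-test on made-up recovery counts; a real trial would use a pre-registered analysis that is almost certainly more involved, and the function names here are my own.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference in success rates."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Every outcome was identical, so there is no evidence of a difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Standard normal CDF computed from the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)


def significant(success_a, n_a, success_b, n_b, alpha=0.05):
    """True if the two arms differ at the (1 - alpha) confidence level."""
    _, p_value = two_proportion_z_test(success_a, n_a, success_b, n_b)
    return p_value < alpha


# Invented numbers: 80 of 100 patients recover on drug A, 60 of 100 on drug B.
print(significant(80, 100, 60, 100))
```

The contrast with a typical check is that the answer here is a statement about evidence across a sample, whereas a pass/fail test usually asserts a single observation against a single expectation.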
Looking for insight I tried analogy (leaving aside any scenarios where the software is medical in nature).
The researcher is probably uncontroversially the tester. The subject of the experiment has a couple of obvious candidates: the software and (by proxy) its stakeholders. And if the subject is the software then the risks might be to do with performance, robustness, scalability, functionality and so on. If the subject is the stakeholders, then risks are to the value that the stakeholders obtain from the software.
What is analogous to the trial? Is it simply a test? Or is it a sequence of tests? What if the trial is asking multiple questions? Is each of them a test? In the clinical trial, would a single dose of some compound be a test? What about if data was collected and analysed after that dose? And how about the intervention? Perhaps that’s somewhat like test data, or the steps used to perform the test, or a configuration for the system under test?
Under what circumstances could we have an ethical dilemma? In the trial, the health of the subjects and the obligation or otherwise of the researchers motivate it. In our scenario, what would the health of our subjects be? Well, a program might perform better (be faster, use less resource or whatever) in some contexts. A stakeholder might be happier if the product can do some things rather than others.
Which brings us back to the intervention – under what circumstances could a test provide these “health” benefits? Well, configuration options could tune a product, or a particular environment could allow it to be more or less performant. In the case of stakeholders, perhaps their confidence is boosted by test results. Or maybe we need the intervention to be something that could change the product and more directly affect value, perhaps a test and corresponding software change, if required?
So are we in a position to recast equipoise and maybe find a potential ethical dilemma? Here’s one attempt:
Equipoise is substantial uncertainty about which tests and corresponding fixes will benefit the software most, or a lack of consensus that some test/possible fix cycle is superior to another.
A state of equipoise is required for conducting tests and applying any corresponding fixes that may pose risks to the product’s performance.
Casting the product health in terms of performance, there’s a possible facsimile ethical conflict. Imagine an experiment on a complex system with many configuration options. If some combination of settings is found to tune the system for incredible speed and low resource usage then we might be tempted to immediately apply that configuration elsewhere. However, if the result was found in the test lab and then applied to production instances, it’s not a dilemma because those machines were not subjects of the test.
To have the dilemma, we’d need to be applying it to the test machines. So perhaps we’re testing for ways to improve the reliability of the test machines; we find that one particular GPU provides sufficient uptime. Would it be a dilemma to simply stop the test and fit that GPU to them all? Perhaps we’re testing in production when we find the magically performant setting combination. Perhaps that’s closer to the human case – should we let some of our customers continue on the non-performant settings until the end of the experiment?
I don’t think I’ve convinced myself that there’s much of value to testers in the notion of equipoise. But on balance I probably need a peer consensus. What do you think?