Spoiler alert: there is no happy ending here.
Often a blog post or article will talk about some inspiration gained from outside of testing which was applied successfully to the author’s testing and yielded useful results. Those are great posts to read … but this isn’t one of them.
This one starts with inspiration from outside of testing followed by research, discussion, criticism and thought experiment but ultimately no application. The yield here is the documentation of all of that. Hopefully it’s useful and still a good read.
What was the inspiration?
I’ve long had the notion – sadly dormant and largely sidelined – that I’d like to look into the use of statistics in testing. (One of my earliest posts here was on entropy.) As testers we talk a lot about such things as the scientific method and about data, about running experiments, about crafting experiments such that they generate usable and valuable data, about making decisions on where to test next based on that data, about reporting on those experiments and about the information we can give to stakeholders. Statistics has the potential to help us in all of those areas.
the purpose of statistics is to analyse the data in such a way as to lead to new conjecture.
Limiting ourselves to basic counts and averages – as I see us doing and do myself – is like a carpenter who only uses a tape measure and his bare hands. Statistics gives us not only new avenues to explore, but a lexicon for talking about the likelihood of some result and also about the degree of confidence we can have in the likelihood itself. It provides tools for helping us to calculate those values and, importantly, is very explicit about the assumptions that any particular set of tools requires.
Frequently, it will also provide methods to permit an assumption to be violated but at some other cost – for example, penalising the confidence we have in a result – or to calculate which parameters of an experiment need to be changed by what amount in order to get a level of confidence that would be acceptable – say, by collecting twice as much data. It can be used to organise, summarise, investigate and extrapolate.
And then I read this sentence:
Ask any question starting with ‘how many…?’ and capture-recapture stands a good chance of supplying an answer.
It’s the last sentence in an article in Significance, the Royal Statistical Society’s magazine, that’s sufficiently light on the statistics that I could actually read it all the way to the end [Amoros]. The piece describes how, in 1783, Laplace used a reasonably simple statistical technique for estimating the population of France based on a couple of sets of incomplete data and some simplifying assumptions. Essentially the same technique is these days called Capture-Recapture (CR) and is used to estimate populations of all kinds from the number of fish in a pool to the incidence of cancer in a city.
I wondered whether there was a possibility of estimating bug counts using the technique and, if there was, under what assumptions and conditions it might apply.
Why would you want to estimate bug numbers?
Good question. Among the many challenges in software testing are deciding where to apply test effort and when to stop applying that effort. Many factors help us to make those decisions, including our gut, the testing we’ve already done in this round, our experience, our expertise, information from users, information about the users, information from developers, information about the developers, historical information about the system under test and so on. While bare bug counts might not be that interesting, a relative bug “density” could be, in addition to other information.
Imagine that you could point to an analysis of the new features in your latest release, and see that, with a high degree of (statistical) confidence, a couple of them were more likely to contain more bugs than the others proportionate to some factor of interest (such as their perceived size or complexity or potential to interact with other parts of the product). Mightn’t that be an interesting input to your decision making? How about if the statistics told you that (with low confidence) there wasn’t much to choose between the features, but that confidence in the estimates would be increased with a defined amount of investigative effort. Could that help your planning?
Err, OK. How might it work?
CR basically works with two samples of items from a population and uses a ratio calculated from those samples to estimate the size of the population. Let’s say we’d like to know how many fish there are in a pond – that’s the population we’re trying to estimate. We will catch fish – the items – in the pond on two occasions – our two samples. On the first occasion we’ll mark the fish we catch (perhaps with a tag) and release them back. On the second occasion we’ll count how many of the marked fish we recaught:
Sample 1: 50 fish caught and marked
Sample 2: 40 fish caught, of which 10 are marked
Which gives an estimated number of fish in the pool of 40 * 50/10, i.e. 200. This accords reasonably well with naive intuition: we know there are 50 marked fish; we caught 20% of them in the second sample; the second sample was size 40 so we estimate that 40 is 20% of the total population, which would be 200.
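As a minimal sketch, the arithmetic above can be written as a small Python function. The function and parameter names are my own, chosen for illustration; the formula is just the ratio used in the fish example.

```python
def estimate_population(first_sample: int, second_sample: int, recaptured: int) -> float:
    """Simple two-sample estimate: first_sample * second_sample / recaptured."""
    if recaptured <= 0:
        raise ValueError("no recaptures: the simple ratio is undefined")
    if recaptured > min(first_sample, second_sample):
        raise ValueError("recaptured cannot exceed either sample size")
    return first_sample * second_sample / recaptured


# 50 fish marked, 40 caught on the second occasion, 10 of those already marked
print(estimate_population(50, 40, 10))  # 200.0
```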
In Laplace’s work, pre-existing data was used instead of physical captures. The same thing is still done today in, for example, epidemiological studies. [Corrao] attempts to estimate the number of people with alcohol problems in the Voghera region of Italy using lists such as the members of the local Alcoholics Anonymous (AA) group and people discharged from the area’s hospital with an alcohol-related condition. The lists form the samples but there is no capture and no marking. Instead we identify which individuals are the same in both data sets by matching on names, dates of birth and so on. Once we have that, we can use the same basic technique:
AA: 50 people
hospital: 40 people, of which 10 also attended AA
and now the calculation is that the estimated number of people in the region with alcohol problems is, as before, 40 * 50/10.
If it could work, the analogy for bugs might be something like this: two periods of testing on the same system (or feature or whatever) generate two sets of bug reports. These are our samples; bugs which are considered duplicates across the two periods are our recaptured items and so we have:
Period 1: 50 bugs found
Period 2: 40 bugs found, of which 10 are duplicates of those from Period 1
giving 40 * 50/10 or 200 bugs as the estimate of the number of bugs to be found in the system under test. Note that this is not an estimate of undiscovered bugs, but the total including those already found.
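Here is a sketch of the same calculation driven by sets of bug reports rather than bare counts. It assumes (hypothetically) that every report has already been mapped to a canonical bug identifier, so that duplicates across the two periods share an ID. The identifiers are invented; only the counts match the example above. It also makes the distinction between the estimated total and the estimated not-yet-found bugs explicit.

```python
def estimate_from_reports(period_1: set[str], period_2: set[str]) -> dict[str, float]:
    """Estimate the total number of bugs from two sets of canonical bug IDs."""
    duplicates = period_1 & period_2          # bugs seen in both periods
    if not duplicates:
        raise ValueError("no duplicates: the simple ratio is undefined")
    estimated_total = len(period_1) * len(period_2) / len(duplicates)
    found_so_far = len(period_1 | period_2)   # distinct bugs seen at least once
    return {
        "estimated_total": estimated_total,
        "found_so_far": found_so_far,
        "estimated_not_yet_found": estimated_total - found_so_far,
    }


# Hypothetical IDs: 50 bugs in period 1, 40 in period 2, 10 in common
period_1 = {f"BUG-{i}" for i in range(1, 51)}    # BUG-1 .. BUG-50
period_2 = {f"BUG-{i}" for i in range(41, 81)}   # BUG-41 .. BUG-80

print(estimate_from_reports(period_1, period_2))
# {'estimated_total': 200.0, 'found_so_far': 80, 'estimated_not_yet_found': 120.0}
```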
Seems straightforward enough, but …
You’re right to be doubtful. There are a bunch of assumptions imposed by the basic technique itself:
- capture and recapture (or the data sets) are independent
- all items have the same probability of being captured
- we can identify instances of the same item in both samples
- there is no change to the population during the investigation
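To get a feel for why the equal-probability assumption matters, here is a small simulation sketch. Everything in it is an assumption made for illustration: a fixed population of 200 bugs, half of them "easy" to find and half "hard", with made-up detection probabilities, and two independent rounds of testing. The estimate uses the simple ratio with one added to each count so that a round with no overlap does not divide by zero.

```python
import random


def simulate_unequal_catchability(easy=100, hard=100, p_easy=0.4, p_hard=0.05,
                                  trials=2000, seed=1):
    """Average two-sample estimate when bugs differ in how likely they are to be found."""
    rng = random.Random(seed)
    probabilities = [p_easy] * easy + [p_hard] * hard
    estimates = []
    for _ in range(trials):
        # Each round finds each bug independently with that bug's own probability
        first = {i for i, p in enumerate(probabilities) if rng.random() < p}
        second = {i for i, p in enumerate(probabilities) if rng.random() < p}
        overlap = len(first & second)
        # Add-one adjustment keeps the estimate finite when overlap is zero
        estimates.append((len(first) + 1) * (len(second) + 1) / (overlap + 1) - 1)
    return sum(estimates) / trials


mean_estimate = simulate_unequal_catchability()
print(f"true population: 200, average estimate: {mean_estimate:.0f}")
```

With these made-up numbers the average estimate comes out well below the true 200, because the easy bugs dominate both samples and the overlap between them.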
It’s not hard to find other questions or concerns either. With your tester head on you’re already probably wondering about that estimation formula:
- if the two samples are identical does the estimate say that we have captured the entire population? (Using the same formula as above, with N bugs found in both samples, we’d have N*N/N = N bugs estimated to be found, i.e. we estimate we’ve found them all.) Under what realistic circumstances would we be prepared to believe that? With what degree of confidence?
- if the two samples don’t overlap at all the formula results in a division by zero: N*M/0. In statistics this is generally dealt with by invoking a more sophisticated estimator (the formula) such as an adjusted form of the Lincoln-Petersen model [WLF448].
- is the method valid when samples are small or even zero? [Barnard], applying CR to software inspection, throws away data where 0 or 1 defects are found. [WLF448] says that the Lincoln-Petersen model can apply here too.
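One widely used adjustment that avoids the division by zero is Chapman's variant of the two-sample estimator, which adds one to each count and subtracts one from the result. I include it as an illustration of what a more robust estimator can look like; I am not claiming it is the specific formula given in [WLF448]. The function name is my own.

```python
def chapman_estimate(first_sample: int, second_sample: int, recaptured: int) -> float:
    """Chapman's adjusted two-sample estimate; defined even with zero recaptures."""
    return (first_sample + 1) * (second_sample + 1) / (recaptured + 1) - 1


print(chapman_estimate(50, 40, 0))             # 2090.0 (finite, but very uncertain)
print(round(chapman_estimate(50, 40, 10), 1))  # 189.1 (the simple ratio gives 200)
print(chapman_estimate(30, 30, 30))            # 30.0 (identical samples)
```

A finite number for the no-overlap case is not the same as a trustworthy one: with no recaptures there is very little information about the population size.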
We can come up with issues of variability:
- can we account for different sampling conditions? When catching fish, perhaps the nets are bigger for the second sample, or the water is rougher; in the case of bugs, perhaps the testers have different approaches, skill levels, domain knowledge and so on.
- can the method account for variation in opportunity to sample e.g. short time spent on sample 1 but large time spent on sample 2?
- can attributes of the items be taken into account? For example some bugs are much more important than others.
- can different “kinds” of items be collected in a single sample? Is it right that “CR can only work with one kind of bug at a time. Stats on recapture of a particular fish says nothing about the population of snails.” [Twitter]
There is discussion of these kinds of issues in the CR literature. The state of the art in CR is more advanced than the simplistic walk-through I’ve given here. More than two samples can be combined (e.g. in the [Corrao] paper I mentioned above there are five) and, as we’ve seen already, more sophisticated estimators used. [Tilling] talks about techniques for overcoming violation of the methodological assumptions, such as stratification of the samples (for example, looking for different “kinds” of bugs) and notes that appropriate experimental design can reduce bias and improve estimate accuracy. [Barnard] and [Miller] discuss models which attempt to account for variability in the detection probability (to permit different kinds of bugs to be found with different probabilities) and detection capability (to permit different testers to have different probabilities of finding bugs).
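As an illustration of combining more than two samples, here is a sketch of the Schnabel estimator, one classical multi-sample method. It is not necessarily the approach used in the papers cited above, and it still relies on the basic assumptions (closed population, equal capture probability). The bug identifiers are invented and, as before, I assume duplicates have already been mapped to a shared ID.

```python
def schnabel_estimate(samples: list[set[str]]) -> float:
    """Schnabel estimate: sum(catch * previously_marked) / sum(recaptures)."""
    seen: set[str] = set()      # items "marked" by appearing in an earlier sample
    weighted_catch = 0
    recaptures = 0
    for sample in samples:
        weighted_catch += len(sample) * len(seen)
        recaptures += len(sample & seen)
        seen |= sample
    if recaptures == 0:
        raise ValueError("no recaptures in any sample")
    return weighted_catch / recaptures


round_1 = {f"BUG-{i}" for i in range(1, 51)}     # 50 bugs
round_2 = {f"BUG-{i}" for i in range(41, 81)}    # 40 bugs, 10 seen before
round_3 = {f"BUG-{i}" for i in range(61, 101)}   # 40 bugs, 20 seen before

print(schnabel_estimate([round_1, round_2]))                     # 200.0
print(round(schnabel_estimate([round_1, round_2, round_3]), 1))  # 173.3
```

With exactly two samples this reduces to the simple ratio used earlier, which is a handy sanity check.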
I didn’t come across references which said that sampling methodology had to be identical in each capture, although there is the underlying assumption that the probability of any item being captured is equal. I found examples in which the sampling method was undoubtedly different – although this was not part of the experimental design – but the analysis proceeded, e.g. [Nichols] where external factors – a raccoon interfering with the vole traps used in the study – caused some data to be discarded. Further, epidemiological studies frequently use data collected by disparate methods for reasons other than CR.
My understanding is that the items captured need only be as homogeneous as the study context requires. For instance, in the [Corrao] study multiple different alcohol problems were identified as relevant to the hospital sample. This was an experimental design decision. Increasing or decreasing the set would likely change the results. It’s up to the experimenter to be clear about what they want to study, the way they are going to approach sampling it, and the strength of claim they can make based on that. More specific sampling may mean a less general claim. Statistics, like testing, is a sapient activity where judgement and intelligence are critical.
In addition to variability in the methodology, there is the question of variability – or uncertainty, or error due to the probabilistic nature of the techniques – in the estimates produced by it. As one correspondent put it “[I] suspect too many unknown [variables] to accurately estimate unknown population” [Twitter]. There’s a very good chance this is true in the general case. The level of statistical error around the numbers might be sufficiently high to make them of limited utility. There are established techniques in statistics for estimating this kind of number. Plugging the figures from my examples above into one online calculator suggests a sample error of up to 14%.
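One way to put a rough interval around a two-sample estimate is sketched below. It uses an adjusted form of the estimator that adds one to each count (Chapman's variant), a commonly quoted approximate variance for it, and a normal approximation for the interval. These are my choices for illustration, not the method behind the online calculator, so the figures are not directly comparable. The normal approximation is crude when the number of recaptures is small.

```python
import math


def chapman_with_interval(n1: int, n2: int, m: int, z: float = 1.96):
    """Return (estimate, standard_error, (low, high)) for two samples with m recaptures."""
    estimate = (n1 + 1) * (n2 + 1) / (m + 1) - 1
    # Approximate variance of the adjusted estimate
    variance = ((n1 + 1) * (n2 + 1) * (n1 - m) * (n2 - m)) / ((m + 1) ** 2 * (m + 2))
    std_error = math.sqrt(variance)
    return estimate, std_error, (estimate - z * std_error, estimate + z * std_error)


estimate, std_error, (low, high) = chapman_with_interval(50, 40, 10)
print(f"estimate {estimate:.0f}, standard error {std_error:.0f}, "
      f"approx. 95% interval {low:.0f} to {high:.0f}")
```

For the 50/40/10 example this gives an estimate of roughly 189 with an interval stretching about 80 either side, which shows how wide the uncertainty can be with only ten recaptures.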
And specifically for testing?
Good point, let’s consider a few questions directly related to the application of CR in testing, some of which came from a thread I started on [Twitter]:
- “it is necessary to assume that finding bugs is anything like finding animals.” [Twitter]
- “[CR] assumes same methods find same bugs, and that all bugs can be found via same method?” [Twitter]
- “counting defects has very little practical use or pragmatic meaning.” [Twitter]
I don’t think it is true that we need to assume that finding bugs and animals are “anything like” each other: the epidemiological literature is good evidence of that. I also don’t think it’s true that CR assumes that all bugs can be found by the same method: again, the epidemiological application does not make this assumption.
We’ve already discussed why we might be prepared to consider estimating defect counts. I’d agree in general counting bugs is of little practical use, but I don’t think I would advocate using the technique for finding counts as an end in itself, only as additional evidence for decision making, treated with appropriate caution. I’m interested in the idea that we might be able to derive some numbers from existing data or data that’s gathered in the process of performing testing in any case.
It’s not hard to think of other potential concerns. In testing we’ll often choose not to repeat tests and the second sampling could be seen as repetition. But, is it the case that different testers in the same area with the same mission are repeating tests? Is it even the case that the same tester with the same mission is necessarily repeating a test? But then again, can you afford two testers with the same mission in the same area for gathering CR data when there are other areas to be tested?
If testers were being used to sample twice in the same area we might worry that their experience from the first sampling would alter how they approached the second. Certainly, they will have prior knowledge – and perhaps assumptions – which could influence how they tested in the second sample. Protection against this could include using different testers for each sample, or deliberately assigning testers to do different things in each sample.
In order to make the most of CR we have to ensure that the sampling has the best chance to choose randomly from the whole of the population of interest. If the two samples are set up to choose from only a subset of the population (e.g. a net with a 1m pole in a pool 3m deep can never catch fish in two-thirds of the pool) then the estimate will be only of the subset population. Cast this back to testing now: in order to provide the most chance of sampling most widely we’d need different techniques, testers etc. But this is likely to either increase cost or introduce data sparsity.
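The short-net problem can be illustrated with a simulation sketch. The numbers are invented: 300 fish in the pool, of which only 100 ever swim within reach of the net. Both samples are drawn at random from the reachable fish, and the estimate uses the simple ratio with one added to each count so it stays finite if a trial has no recaptures.

```python
import random


def simulate_restricted_net(reachable_fish=100, first_catch=50, second_catch=40,
                            trials=2000, seed=1):
    """Average two-sample estimate when only part of the population can be caught."""
    rng = random.Random(seed)
    reachable = range(reachable_fish)   # only these fish can ever be netted
    estimates = []
    for _ in range(trials):
        marked = set(rng.sample(reachable, first_catch))
        second = rng.sample(reachable, second_catch)
        recaptured = sum(1 for fish in second if fish in marked)
        estimates.append((first_catch + 1) * (second_catch + 1) / (recaptured + 1) - 1)
    return sum(estimates) / trials


total_fish = 300   # hypothetical true population, two-thirds of it out of reach
average = simulate_restricted_net()
print(f"true population: {total_fish}, average estimate: {average:.0f}")
```

The average lands close to 100, the size of the reachable subset, and says nothing about the 200 fish the net can never touch.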
Can we agree on what a bug is anyway and does it matter here? Is it important to try to distinguish fault from failure when counting? [Barnard] applies CR to Fagan inspections. The definition of this on Wikipedia describes a defect as a requirement not satisfied. If this is the same definition used in [Barnard] then it would appear to exclude the possibility of missing or incorrect requirements. In [LSE] there’s no definition but implicitly it seems broader in their attempts to find bugs in code review. Again, it’s on the experimenter to be clear about what they’re capturing and hence how they interpret the results and what they’re claiming to estimate. Unless the experiment collects specific meta data accurately, the results won’t be able to distinguish different classes of issue, which may make them weaker. And, of course, more data is likely to mean higher cost.
Bugs are subject to the relative rule; they are not a tangible thing, but only exist in relation to some person at some time, and possibly other factors. In this sense they are somewhat like self-reported levels of happiness found in psychological tests such as the Subjective Happiness Scale and less like a decision about whether the thing in the net is a fish or the patient in question has cancerous cells. The underlying variability of the construct will contribute to uncertainty in any analysis built on top of data about that kind of construct.
An interesting suggestion from [Klas] is that CR “can only be applied for controlling testing activities, not for planning them, because information collected during the current testing activity is required for the estimates.” I’m not sure that the controlling/planning distinction would bother most context-driven testers in practice, but it’s a valid point that until some current data is collected the method necessarily provides less value.
So, has anyone made it work?
[Petersson] suggests that (at the time of his writing) there weren’t many industrial studies of capture-recapture. Worryingly, [Miller] reviews a study that concludes that four or five samples will be needed (in software inspections rather than testing) to show value from any of the estimators considered although he does go on to point out shortcomings in the approach used. However, this might tie in with the thought above that there are too many unknown variables – getting more data can help to ameliorate that issue, but at the cost of gathering that extra data.
As noted earlier [Barnard] claims to have successfully applied the approach to defect detection using software inspections giving evidence of an improvement in the prediction of the need for reinspection over methods based purely on the feelings of the inspectors. The development methodology here (on software for the Space Shuttle) is pretty formal and the data used is historical, so provides a good basis for evaluation. This kind of inspection appears to exclude the possibility of identifying missing or incorrect requirements. In principle I don’t think this affects the result – the Fagan inspection has a very restrictive oracle and so can only be expected to identify certain kinds of defects. With broader oracles, the scope for defect detection could presumably increase.
Have you tried to make it work yourself?
Well, I’ve been wondering about what might make a reasonable experiment, ideally with some validation of the estimates produced. I have ready access to several historical data sources that list defects and are of a reasonable size. They were collected with no intention of being used in CR and so are compromised in various ways but this is not so different from the situation that epidemiological studies encounter. The sources include:
- Session-Based Test Management reports
- customer support tickets
- bug database
- informal and ad hoc test and experiment reports
When CR is used in biology, it is not generally possible to capture the same item twice in the same sample. In the epidemiological case it is possible, for example if a patient has presented multiple times with different identities. In the bug case, it is perfectly possible to see the same issue over and over. The relative frequency of observation might be an interesting aspect of any analysis that CR will not exploit, if the data even existed.
We could consider randomly sub-dividing this report data into sets to provide multiple artificial samples. If we did that, we might initially think that we could use meta data that flags some bug reports as duplicates of one another. However, it will often be the case that bugs are observed but not reported because they’re a duplicate and so no data will be gathered.
Further, if we use only the observed bugs – or perhaps, only the observed and recorded bugs – then we stand a good chance of making biased inferences [Tilling]. All sorts of factors determine whether observed issues will end up in whatever bug tracking tool(s) are being used.
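For what it's worth, the random sub-division itself is easy to sketch. The report structure here is hypothetical: each report is a pair of a report name and the set of canonical bug IDs noted in it. The caveats above still apply in full, since bugs that were observed but never written down cannot appear in either artificial sample.

```python
import random

Report = tuple[str, set[str]]   # (report name, bug IDs noted in that report)


def split_into_samples(reports: list[Report], seed: int = 0) -> tuple[list[Report], list[Report]]:
    """Randomly divide reports into two artificial samples of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(reports)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def bugs_in(reports: list[Report]) -> set[str]:
    """Collect the distinct bug IDs mentioned across a group of reports."""
    found: set[str] = set()
    for _name, bug_ids in reports:
        found |= bug_ids
    return found


# Invented example data
reports = [
    ("session-01", {"BUG-1", "BUG-2", "BUG-3"}),
    ("session-02", {"BUG-2", "BUG-4"}),
    ("session-03", {"BUG-1", "BUG-5"}),
    ("session-04", {"BUG-3", "BUG-4", "BUG-6"}),
]
sample_a, sample_b = split_into_samples(reports)
bugs_a, bugs_b = bugs_in(sample_a), bugs_in(sample_b)
# Sample sizes and overlap: the inputs a two-sample estimate would need
print(len(bugs_a), len(bugs_b), len(bugs_a & bugs_b))
```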
Our session reports from SBTM are more likely to contain notes of existing issues encountered, so it might be more tractable to try to mine them. Other data collected in those reports would permit partitioning into “areas” of the product for consistency of comparison, at the expense of making data sparser. Typically, we won’t run sessions with the same charter frequently, so objections about the probability of finding issues would be raised. However, we have sometimes run “all-hands” testing where we set up a single high-level mission and share a reporting page and test in areas guided by what others are doing.
Given the way my company operates, we’re frequently working on daily builds. Effectively we’re changing the environment in which our population lives and that churns the population, violating one of the assumptions of CR. To make any experiment more valid we’d probably need to try to use consistent builds.
To consider one scenario: I might like to be able to use data from reports generated in the late stages of a release cycle to estimate relative numbers (or maybe density, as above) of issues found in parts of the software and then compare those to the relative figures obtained from our customer support tickets. If they match I might feel justified in having some faith in the predictive power of CR (for my application, tested the way we test it and so on). If they don’t match, I might start casting around for reasons why. But, of course, this is confirmation bias. Any reasons I can come up with for it failing to work could have analogues which inadvertently caused it to appear to work too.
Right now, I don’t have a good experiment in mind that would use the data I already have.
So what might a new experiment look like?
Practical constraints based on what’s been discussed here might include:
- try to vary the testers across the two samples – to avoid prior knowledge tainting a second sample.
- even if the two sets of testers are distinct, try to stop them talking to one another – to preserve independence of the samples.
- ensure that the reporting framework for a sample permits issues seen to be recorded independently of the other sample – so the testers don’t hold back from filing dupes of existing issues.
- direct testers to use particular methods or be sure to understand the methods used – so as to give some idea of the space in which there was potential for bugs to be found in each sample, and where there was not.
- be aware of the opportunity cost of collecting the data and the fact that multiple rounds might be needed to get statistical confidence.
We might consider random assignment of testers to samples – treating it like assignment to experimental groups in a study, say. But this may be problematic on smaller teams where the low number of people involved would probably give a high potential for error.
I’d like to think of a way to evaluate the estimates systematically. Perhaps more directed testing in the same areas, by different testers again, for some longer period? Comparison against an aggregate of all issues filed by anyone against the product for some period? Comparison to later customer reports of issues? Comparison to later rounds of testing on different builds? Informal, anecdotal, retrospective analysis of the effects of using the results of the CR experiment to inform subsequent testing?
All of these have problems but it’s traditional in this kind of scenario to quote Box’s observation that while all models are wrong, some are useful. It would not be unreasonable to use CR, even in contexts where its assumptions were violated, if it was felt that the results had utility.
Even given this, I don’t have a new experiment in mind either.
Hah! Plenty of words here but ultimately: so what?
When we talk about risk-based testing, how often do we attempt to quantify the risk(s) we perceive? Knight made an interesting distinction between risk and uncertainty where the former is quantifiable and the latter is not (and is related to Taleb’s Black Swan events). Statistics can’t tell you the outcome of the next coin toss but it can tell you over the longer term what proportion of coin tosses will be heads and tails. Likewise, statistical techniques are not going to guarantee that a particular user doesn’t encounter a particular bug, but could in principle give bigger picture information on the likelihood of some (kinds of) bugs being found by some users.
You might already use bug-finding curves, or perhaps the decreasing frequency of observation of new issues, as data for your decision making. With appropriate caution, naturally [Kaner]. Statistics has the potential to help but I don’t really see testers, myself included, using it. I was delighted when one of my team presented his simulations of a system we’re building that used a Poisson process to model incoming traffic to try to predict potential bottlenecks, but it’s symptomatic (for me) that when I was thinking about how I might set up and evaluate an experiment on the utility of CR I did not think about using hypothesis testing in the evaluation until long after I started.
Even if we were using statistics regularly, I would not be advocating it as a replacement for thinking. Take CR: imagine a scenario where the pond contains large visible and small invisible fish. Assume that statistical techniques exist which can account for sample biases, such as those listed above and others such as the bait used, the location and the time of day. If we sample with a net with holes bigger than the small fish, we will never catch any of them. If we don’t use observation techniques (such as sonar) that can identify the invisible (to our eyes) we won’t even know they’re there. The estimates produced will be statistically valid but they won’t be the estimates we want. Statistics gives answers … to the questions we ask. It’s on us to ask the right questions.
I’m very interested in practical attempts to use statistics in software testing – with successful or unsuccessful outcomes. Perhaps you’ve used them in performance testing – comparing runs across versions of the software and using statistics to help see whether the variability observed is down to software changes or some other factor; or to perform checks to some “tolerance” where the particular instances of some action are less important than the aggregate behaviour of multiple instances; or perhaps modelling an application or generating data for it? If you have experiences or references to share, I’d love to hear about them.
Amoros: Jaume Amorós, Recapturing Laplace, Significance (2014)
Barnard: Julie Barnard, Khaled El Emam, and Dave Zubrow, Using Capture-Recapture Models for the Reinspection Decision, Software Quality Professional (2003)
Corrao: Giovanni Corrao, Vincenzo Bagnardi, Giovanni Vittadini and Sergio Favilli, Capture-recapture methods to size alcohol related problems in a population, Journal of Epidemiology and Community Health (2000)
Kaner: Cem Kaner, Software Testing as a Social Science (2008)
Klas: Michael Kläs, Frank Elberzhager, Jürgen Münch, Klaus Hartjes and Olaf von Graevemeyer, Transparent Combination of Expert and Measurement Data for Defect Prediction – An Industrial Case Study, ICSE (2010)
LSE: The capture-recapture code inspection
Miller: James Miller, Estimating the number of remaining defects after inspection, Software Testing, Verification and Reliability (1998)
Nichols: James D. Nichols, Kenneth H. Pollock and James E. Hines, The Use of a Robust Capture-Recapture Design in Small Mammal Population Studies: A Field Example with Microtus pennsylvanicus, ACTA THERIOLOGICA
Petersson: H. Petersson, T. Thelin, P. Runeson and C. Wohlin, Capture-Recapture in Software Inspections after 10 Years Research – Theory, Evaluation and Application, Journal of Systems and Software (2004)
Pitt: Capture Recapture Web Page
Scott: Hanna Scott and Claes Wohlin, Capture-recapture in Software Unit Testing – A Case Study, ESEM (2008)
Tilling: Kate Tilling, Capture-recapture methods—useful or misleading? International Journal of Epidemiology (2001)
Twitter: @jamesmarcusbach @raine_check @HelenaJ_M @JariLaakso @kinofrost @huibschoots, Twitter thread (2014)
WLF448: Fish & Wildlife Population Ecology (2008)