I have talked about and against bad metrics during my Rapid Software Testing courses and at conferences for a couple of years now. The “Lightning Talk” metrics rant that I did at CAST 2011 is available on YouTube (click here to view) although it is very “rough”. I felt that it was time that I cleaned up my thoughts on this subject and so I have created my first blog post.
Earlier this year I was very flattered to be the keynote speaker at the KWSQA conference and my topic was “Bad metrics”. I felt that this was not enough information for a keynote so I added: “and what you can do about it”. For this post I will just cover off the bad metrics aspect. I will leave the discussion of alternatives for a later post. I don’t want to run out of blog ideas too quickly.
Characteristics of Bad Metrics:
I have identified four characteristics of “bad metrics”, in addition to the basic failure to serve a clear and useful purpose. I do not expand much on that basic point because many organizations feel that their metrics are providing them with clear and useful information with which they can make decisions on the quality of their product. Simply stating that their metrics do not meet this description is of little use. The problem is that they are typically unaware that their metrics are not only unclear, but are probably also causing highly undesirable behaviour within their teams.
1. Comparing elements of varying sizes as if they are equal
What is a test case? If you ask 10 testers you will likely receive 10 different answers. How long does it take to execute a test case? Again, this question does not have a simple answer. The effort could vary from less than a minute to over a week – yet when tracking progress many companies track test case completion and do not differentiate the effort required. If you work at a company that still counts test cases and the execution rate of test cases and you are looking to instigate change then you can ask your management this:
“How many containers do you need for all of your possessions?”
They will likely answer that it depends on the size of the container – and then you can say “Exactly!” I have heard the argument “we are tracking the average execution time” but that has issues, too. The easy stuff tends to happen first and goes quickly. Then the slower, harder stuff comes along and before you realize what is happening the test team is holding up the release.
Heck, I’ll admit it. I used to track test case completion. I created pretty charts that showed my progress against my plan. All the quick and easy tests (including automated tests that counted the same as manual tests) were executed first – giving the illusion that we were well ahead of schedule. Unfortunately, when the harder, slower tests were the only tests left the executives felt that we had started to fall behind.
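To make that illusion concrete, here is a minimal Python sketch of the same test plan measured two ways. The plan, the effort estimates, and the function names are all invented for illustration; they are not taken from any real project or tool.

```python
# Hypothetical test plan: (name, estimated effort in hours, executed?)
# Eight quick tests have been run; two slow ones have not.
tests = [(f"quick test {i}", 0.25, True) for i in range(8)]
tests += [
    ("full upgrade and rollback", 19.0, False),
    ("week-long soak test", 19.0, False),
]


def progress_by_count(plan):
    """Fraction of test cases executed, treating every test as equal."""
    executed = sum(1 for _, _, done in plan if done)
    return executed / len(plan)


def progress_by_effort(plan):
    """Fraction of the estimated effort that has actually been spent."""
    total_hours = sum(hours for _, hours, _ in plan)
    spent_hours = sum(hours for _, hours, done in plan if done)
    return spent_hours / total_hours


print(f"by count:  {progress_by_count(tests):.0%}")   # 80%
print(f"by effort: {progress_by_effort(tests):.0%}")  # 5%
```

The chart built from the first number says the team is nearly done. The second number says they have barely started. Both are computed from exactly the same plan.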
2. Counting the number of anything per person and/or team
On the surface this one sounds like it isn’t a problem. Why not measure how many bugs each tester (or team) raises? You will be able to identify your top performers, right? Wrong! Metrics like “number of bugs raised per tester” or “number of test cases executed per tester” cause competition which results in a decrease in teamwork within teams and information sharing between teams.
Imagine the situation where the amount of my annual raise depends on my bug count (regardless of whether I’m being measured on sheer volume of bugs, or hard to find bugs, or severity 1 bugs). One day I will be happily testing and I will find an area that has a large number of bugs. Will I help the company by immediately mentioning to my manager that I found this treasure chest of bugs? Perhaps, but more likely I would mention it only after I have spent a few days investigating and writing bug reports thus securing myself a nice raise. Wouldn’t the company benefit more if the tester shared the information sooner rather than later?
What if I had a cool test technique that frequently found some interesting bugs? Would I share that technique with others and potentially let them find bugs that I would otherwise find myself? Or would I keep the technique to myself to help me outperform my peers?
If members of a team are measured on the number of test cases they each execute per week, then some testers would probably decide to execute the quick and easy tests to pad their numbers instead of executing the tests that make the most sense to execute.
Will testers take the extra time required to investigate strange behaviour if that would mean that they will fall behind in their execution rate? Likely not, thus leaving more bugs undiscovered in the product.
3. Easy to game or circumvent the desired intention
Making a metric into a target will cause unwanted behaviours. If you have a target of a “95% pass rate” before you can ship then your teams will achieve that pass rate no matter how much ineffective testing they have to perform to meet the target. I used to think that looking at pass rate was a good metric until I had a discussion with James Bach about 8 years ago. We had a 10-minute discussion where I was trying my hardest to fight for the validity of the coveted “pass rate” metric. Here is a brief summary of the conversation:
If you had a pass rate of 99.9% but the one failure caused database corruption and occurred fairly regularly, would you ship the product? (“No”, I answered) OK. What if you had an 80% pass rate but all the failures were minor and our customers could easily accept them in the product (obscure features, corner cases, etc.), would you ship the product? (“Probably”, I answered) So, what difference does it make what the pass rate is? (“We use it as a comparison of general health of the product”, I attempted) That is nonsense. My previous questions are still valid. If the pass rate dropped but all the new failures are minor inconveniences what difference does it make? If the pass rate climbed by 5% by fixing minor bugs, but a new failure was found that caused the product to crash, is the product getting better? Why not just look at the actual list of bugs and your product coverage and ignore the pass rate?
I really felt that the metric was good before that conversation. Now I am invited to talk at conferences to help spread the word about bad metrics.
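The two builds from that conversation can be sketched in a few lines of Python. The severity scale (1 = critical, 4 = minor), the test counts, and every name below are assumptions made up for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    summary: str
    severity: int  # hypothetical scale: 1 = critical ... 4 = minor


def pass_rate(total_tests, failures):
    """The single number that ends up on the executive slide."""
    return (total_tests - len(failures)) / total_tests


def has_showstopper(failures):
    """What you learn by reading the actual list of failures."""
    return any(f.severity == 1 for f in failures)


# Build A: one failure out of 1000 tests, but it corrupts the database.
build_a = [Failure("database corruption on save", 1)]
# Build B: 200 failures out of 1000 tests, all of them minor.
build_b = [Failure(f"cosmetic issue {i}", 4) for i in range(200)]

for name, failures in (("A", build_a), ("B", build_b)):
    rate = pass_rate(1000, failures)
    print(f"build {name}: pass rate {rate:.1%}, "
          f"showstopper: {has_showstopper(failures)}")
```

Build A has the far better pass rate and is the one you would refuse to ship. The pass rate alone cannot tell the two apart in any way that matters.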
4. Single numbers that summarize too much information for executives (out of context)
Some companies require 100% code coverage and/or 100% requirements coverage before they ship a product. There can be some very useful information gathered by verifying that you have the coverage that you are expecting. These metrics are very similar to testing in general. We cannot prove coverage (prove no bugs) but we can show a lack of coverage (find a bug). These metrics may help the test team identify holes in their coverage but they cannot show the lack of holes. For example, a single test may test multiple requirements but only touch a small portion of requirement “A”. As long as that test is executed, the requirements coverage will show that requirement “A” has been tested, but if no other tests have been executed against requirement “A” it is actually not being tested very well at all. This fact is hidden by the summary of the coverage into a single number taken out of context.
If a product is tested to 100% code coverage that only means that each line of code was executed at least once. That can provide useful information to the test team much in the way that designers find information in their code compiling without warnings. There is some merit to the test team seeing that they have executed every line of code, but the code also needs to be executed with extreme values, different state models, and varying H/W configurations (to name only a few variants).
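Here is a small, made-up Python example of how this plays out. The function and its test are hypothetical; the point is only that one passing test executes every line, while an extreme input (an empty list) still breaks the code.

```python
def average(values):
    """Return the arithmetic mean of a list of numbers."""
    total = 0
    for value in values:
        total += value
    return total / len(values)  # fails when values is empty


def test_average():
    # This one test executes every line of average(),
    # so a line coverage tool would report 100%.
    assert average([2, 4, 6]) == 4


test_average()  # passes

# The extreme value that the fully "covered" code never saw:
try:
    average([])
except ZeroDivisionError:
    print("100% line coverage, and an empty list still crashes it")
```

The coverage number was never wrong. It simply answers a much narrower question than the one the executives think they are asking.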
When executives see “100% code coverage, 100% requirement coverage, and a 99% pass rate” they will likely feel pretty good about shipping the product. The message they think they are seeing is that the product was very thoroughly tested but that may not be the case.
Coverage metrics can be useful to the test team to show them areas that may have been missed completely but will not replace other means of determining the coverage of their testing.
I hope this post will help some people explain to their management just why some (most) metrics are misleading and can cause unwanted and unexpected actions by their test teams.
Instead of using traditional metrics why not look at:
- the actual list of open defects
- an assessment of the coverage
- progress against planned effort (not test cases but actual effort)
There will be more to come in a future blog post on the alternatives to the typical bad metrics commonly used today.