The other day I left a highly fashionable shop (ok, a supermarket) with some clothes and, when I got home, I realised that the tags were still attached. So the following day I went to a different store of the same chain to get the tags removed, and was pleased to note that the dreaded alarm didn’t go off as I initially walked in the door. Being an enthusiastic test engineer, this made me think about errors and how we handle them.
“When and where” data
The importance of errors can vary depending on the type of work you’re doing. As a general rule, if an error is encountered in a system, the ideal result would be that the user is given enough information to work around the issue or at least understand why it happened, and the product designers get the information they need in order to know what happened and why.
I’m not saying that we should focus our energy on the errors – of course it’s more important that the system works correctly – however, we should definitely not ignore them. At Red Gate, the tools have the ability to automatically send in error reports. These error reports should provide us with the essential information, which in many circumstances will enable us to understand how the error occurred and thus determine why. These error reports then automatically generate a new bug in our repository or update an already existing one if they are linked.
When it comes to deciding which functions and errors are the most important, Elisabeth Hendrickson raises the concept of the Never and Always heuristic in her book Explore It!: what must your application always do and what must it never do? For these security tags (as a trivial example), presumably they should never fall off and they should always set off the alarm when they pass through the detectors at the exit. Any failure in those areas is critical for the designers and store-owners, who need to know “When & Where” in detail to determine whether it’s a failure of the tag itself or just of that store’s scanner. The end-users don’t particularly care beyond whether or not the tags set off alarms incorrectly!
Testing that errors behave correctly
Recently there were several articles about the widely used GnuTLS library, where an error case that did not behave correctly opened several operating systems and applications up to a security hole. The issue, which left many systems vulnerable to connection monitoring and packet decryption, came about because certain errors that might occur during X.509 certificate verification were not being handled correctly and were being reported as success. This flaw would enable someone to create an invalid certificate that would pass verification and fool otherwise-secure systems, and thus allow for the decryption of protected communications. This is a perfect example of why even error cases demand attention – let’s take a quick look into what led up to this issue.
Coding errors vs. missing tests
Surprisingly to me, the Ars Technica article I read talks of this being “the result of someone making critical mistakes in source code that controls critical functions of the program”, whereas my internal testing alarm bell is going off, asking: where were the tests for this? This is a testable case, just like everything else in the code-base should be.
Of course, just as with any other bug in the code, the problem doesn’t boil down to either poor coding or lack of testing alone; it is the combination of both that left this library so vulnerable. Looking at the code itself, there appeared to be confusion over C error-handling conventions, where a negative value is returned for failure and zero for success, as compared with the boolean convention where 0 means false and 1 means true – something so simple that can have such far-reaching implications.
The article goes on to mention that it is “significant that no one managed to notice such glaring errors”, which makes me suspect that the author hasn’t done much production coding. We all know that, even with the best of intentions, we occasionally commit large chunks of code, which results in code reviews becoming more of a shoulder-surfing exercise!
The lesson to be learned here is that attention to detail is absolutely vital, and the best way to maintain that is to encode it in your test suite. In addition, keep code commits small. If you don’t already do code reviews or pair programming, consider introducing them in your organisation.
If the code seems hard to test, that can sometimes imply that we haven’t designed it correctly in the first place. When designing a new piece of code, bear this question in mind: how will it be tested? Consider using elements of test-driven development, as it can really help ensure testability as well as correctness.
How do you test for correct errors?
Our team deliberately keeps an error in our system that we use for testing – specifically, we check that it displays the correct information, which gives us confidence that other errors will be handled in the same manner. That said, we recently found out that this was not enough: when we made changes to log4net, this error handling broke and the relevant information was suddenly not being sent in. On the plus side, we were already expecting our errors to behave in a certain way, which gave us a fighting chance of spotting the change in behaviour.
We’ve now extended our test to check that the information is being displayed correctly and that this information is then handed over to the application engine which will report it back to the team. As you can see, this isn’t a horrendously complex case, so there’s no reason you shouldn’t be running tests for it!
I know the tags at a supermarket are only a deterrent, but if they don’t work the cost to the business could be huge. We should always focus on making sure that the product does what it’s supposed to, but we should also always allocate some time to determine what will happen if (when) an error occurs.
What information will be collected? Can you automatically send it back to the organisation? Could you even automatically open an issue from it? And just as we prioritise which functionality is important, it’s important to identify where an error would be most costly, and then to try to provoke those errors to check how they’re handled. It is always better to know.