The Oracle Problem and the Teaching of Software Testing (Cem Kaner, J.D., Ph.D.)

On September 11, 2012, in Syndicated, by Association for Software Testing

When I first studied testing, I learned that a test involved comparison of the test result to an expected result. The expected result was the oracle: the thing that would tell you whether the program passed or failed the test. We generalized this a bit, especially for automated testing–we would look to a reference program as our oracle. The reference program is a program that generates the expected results, rather than the results themselves. But the idea was the same: testing involved comparison with a known result.

The Oracle Problem

Oracle is an interesting choice of terminology, because the oracles of Greece (the original “oracles”) were mythological. And Greek tragedies are full of stories of people who misinterpreted what an oracle told them, and behaved (on the basis of their understanding) in ways that brought disaster on them.

If we define a software testing oracle as a tool that tells you whether the program passed your test, we are describing a myth–something that doesn’t exist. Relying on the oracle, you might make either of the classic mistakes of decision theory:

  • The miss: you believe the program has passed even though it did something wrong.
  • The false alarm: you believe the program has failed even though it has behaved appropriately.

So we soften the definition: a software testing oracle is a tool that helps you decide whether the program passed your test.

Seen this way, oracles are heuristic devices: they are useful tools that help us make decisions, but sometimes they point us to the wrong decision.

If you don’t have authoritative oracles (“authoritative” = an oracle that is always correct), then how can you test? How can you specify a test in a way that a junior tester or a computer can run the test and correctly tell you whether the program passed it?

The Instructional Problem

I’ve been emphasizing the oracle problem in my testing courses for about a dozen years. I see this as one of the defining problems of software testing: one of the reasons that skilled testing is a complex cognitive activity rather than a routine activity. Most of the time, I start my courses with a survey of the fundamental challenges of software testing, including an extended discussion of the oracle problem.

If you’ve seen the BBST-Foundations courses (for example, the Association for Software Testing teaches a version of this course), you’ve seen my introduction to the oracle problem and to heuristic oracles. Students typically work through one or more theoretical and/or practical labs in BBST. Once they understand the oracle problem, the course presents two approaches for using oracles:

  • One approach, that I associate with James Bach and Michael Bolton, lists 8 general types of expectations. For example, we expect a product to operate consistently across all of its features. We expect it to operate consistently from version to version, etc. (I’ll list the set of consistencies later.)
  • A different approach, that I associate with Doug Hoffman (but I think it’s been independently developed and followed by lots of people), lists specific heuristics. None of them is complete. Each focuses on a specific prediction about the results of a test and ignores other aspects of the test. For example, if we are testing a program that does calculations, checking whether it says 2+3=5 (5 is the expectation) is not complete, but it is useful. Testing whether we can invert an operation (take the square root of a square of a number) isn’t a complete test of a square-a-number function, but it is useful. (More examples below…)

Bach and Bolton have done a good job of explaining their approach. Mike Kelly provides a great summary of it. In my lectures, I follow an explanatory approach that I learned from an early version of Bach’s RST course, and it works well. Students understand the consistencies and (generally) find them compelling. The list feels complete–any specific oracle you can think of can be classified as an example of one of their consistencies.

I think there are three problems with Bach and Bolton’s consistencies:

  1. These provide a useful way to think about a bug after you find it and try to report it. The list of consistencies can structure your thinking as you try to figure out how to explain to someone else why a particular program behavior feels wrong (what feels wrong about it?). However, even though I do find them helpful for evaluating test results, I don’t find them helpful for designing tests.
  2. I think they are particularly worthless for designing automated tests. Automated testing depends on oracles–the automated-testing-program that runs a zillion tests has to decide whether the software under test passed each test or not. The consistencies don’t guide testers (not me, not the testers I know, not the students I teach) toward oracle ideas that are specific enough to be programmed and used by an automaton.
  3. In my courses, the consistencies capture my students’ imagination and interfere with their thinking about oracles that would be useful for designing tests, especially automated tests.

It is the third problem that I have been wrestling with for several years and that will cause me to rewrite the BBST-Foundations course.

In exam after exam, when I give students a specific scenario that clearly involves automated testing and ask them to suggest oracles they would use to support their automated testing and how they would use them, they ramble through a memorized list of consistency heuristics and don’t come up with ideas–some students don’t come up with any ideas–for oracles that support the automation.

  • I tried to correct this problem by raising the problem in supplementary lectures. It didn’t work.
  • I went even further, telling students that this was a classic problem in this course and they needed to answer questions about oracles in test design with specific oracles. It didn’t work.
  • I went even further, telling students as part of the exam question itself that this question called for specific ideas about the oracles they would design into specific tests and they shouldn’t rely on general descriptions of consistencies. It didn’t work.
  • Even when I gave students a set of exam questions in advance, and they drafted their answers with full benefit of time, course notes, lecture notes and videos, discussions with each other, and anything they could find on the web–even when these questions carried cautionary notes that this was a question about test automation and that they shouldn’t just present a consistency oracle–it still didn’t work.

They just keep giving back worthless ideas about test design and flunk that part of the exam.


An Important Instructional Heuristic

When a few students give bad answers on an exam, the problem is in the students. They don’t understand the material well enough.

When a lot of students give bad answers on the exam, the problem is in the instruction. It’s the responsibility of the teacher to troubleshoot and fix this.

When a lot of students give bad answers that are weak in a consistent way, something specific in the instruction leads them down that path. In my experience, that something is often something the instructor is particularly attached to.

I like the consistencies a lot. But I think that in the Foundations course, they are an attractive nuisance (an almost-irresistible invitation to take a hazardous or counterproductive path).


  • In the next generation of the Foundations course, I will probably drop the oracle consistencies approach altogether.
  • In the next generation of the Bug Advocacy course, I will probably add the oracle consistencies as a useful tool for persuasive bug report writing.

Appendix: More Details on Oracles

  1. The underlying problem: Oracles are necessarily incomplete
  2. Oracles are heuristics
  3. Bach and Bolton’s consistencies
  4. Hoffman’s approach
  5. Applying this to test automation

Oracles are Necessarily Incomplete

Back when dinosaurs roamed the earth, some testasauruses theorized that a properly designed software test involves:

  • a set of preconditions that specify the state of the software and system when you start the test
  • a set of procedures that specify what you do when you do the test
  • comparison of what the software under test does with a set of postconditions: the predicted state of the system under test after you run the test. This set of postconditions make up the expected results of the test.

We can call the postconditions the oracle or we can say that a program that generates the expected results is the oracle, but in either case, the testasauruses said, good testing involves comparing the program’s test behavior to expected results, and to do good testing, you need an oracle. (Fossils from this era have been preserved in IEEE Standard 829 on software test documentation.)

Elaine Weyuker’s (1980) On Testing Nontestable Programs shattered that view. Weyuker argued that “it is unusual for … an oracle to be pragmatically attainable or even to exist” (p. 3). Instead, she said, testers rely on partial oracles. For example:

  • A tester might recognize a result of a calculation as impossibly large even though she doesn’t know what the exact result should be. (You might not know offhand what 1.465732 x 2.74312 is, but if a program said 7,000,000 you could reject that as obviously wrong without doing any calculations.)
  • A tester might recognize behavior as inappropriate, even if she doesn’t know exactly how the program should behave.
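
A partial oracle of this kind can be programmed. The sketch below is only an illustration: the function name `plausible_product` and the bounding rule are assumptions made for this example, not something taken from Weyuker's paper. It brackets the product of two positive numbers using their integer floors and ceilings, so it can reject 7,000,000 without ever computing the exact answer.

```python
import math


def plausible_product(a: float, b: float, reported: float) -> bool:
    """Partial oracle for multiplying two positive numbers.

    It does not know the exact product. It only knows that the product
    must lie between floor(a) * floor(b) and ceil(a) * ceil(b).
    """
    if a <= 0 or b <= 0:
        raise ValueError("this sketch only handles positive inputs")
    lower = math.floor(a) * math.floor(b)
    upper = math.ceil(a) * math.ceil(b)
    return lower <= reported <= upper


# 1.465732 x 2.74312 must lie between 1 * 2 = 2 and 2 * 3 = 6.
print(plausible_product(1.465732, 2.74312, 7_000_000))  # False: impossibly large
print(plausible_product(1.465732, 2.74312, 4.02))       # True: plausible
print(plausible_product(1.465732, 2.74312, 5.0))        # True, even though 5.0 is wrong
```

The last call shows why the oracle is only partial: a wrong answer that happens to fall inside the plausible range is a miss.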

Weyuker’s paper wasn’t widely noticed in the practitioner community. I don’t think we appreciated the extent of this problem until the Quality Week conference in 1998, when Doug Hoffman (A Taxonomy for Test Oracles) explained this problem and its implications this way:

Suppose that we specify a test by describing

    • the starting state of the system under test
    • the test inputs (the data and operations you use to carry out the test)
    • the expected test outputs

We can still make mistakes in interpreting the test results.

    • We might incorrectly decide that the program passed the test because its outputs matched the expected outputs but it misbehaved in some other way. For example, a program that adds 2+2 might get 4, but it is clearly broken in some way if it takes 10 hours to get that result of 4.
    • We might incorrectly decide that the program failed the test because its outputs did not match the expected results, but on more careful examination, we might realize that it did the right thing. For example, imagine sending a test document to a network printer with the expectation that the printer will print a specific page within 1 minute–but during the test, another computer sent a long document to the printer and so it didn’t actually get to the test document for a long time. This might be exactly the correct behavior under the circumstances, but it doesn’t match the expectation.

Most testers, doing manual testing, would probably not make either mistake. But an automated test would make both mistakes. So would a manual tester who was trying to exactly follow a fully-detailed script.
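
Both mistakes are easy to reproduce in a scripted check. The following is a minimal sketch, assuming a hypothetical harness that records each run as an `Observation` (the class, the two oracle functions, and the time limits are all invented for illustration). An output-only check misses the 10-hour addition. Adding a time limit catches that, but the same limit raises a false alarm on the busy printer.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """What a (hypothetical) test harness recorded about one test run."""
    output: str
    elapsed_seconds: float


def output_only_oracle(obs: Observation, expected_output: str) -> bool:
    """Scripted check: pass if and only if the output matches."""
    return obs.output == expected_output


def output_and_time_oracle(obs: Observation, expected_output: str,
                           max_seconds: float) -> bool:
    """Slightly richer check: output must match and arrive in time."""
    return obs.output == expected_output and obs.elapsed_seconds <= max_seconds


# 2+2 returned 4, but only after 10 hours.
slow_add = Observation(output="4", elapsed_seconds=10 * 60 * 60)
# The right page printed after 15 minutes because the queue was busy.
busy_printer = Observation(output="test page", elapsed_seconds=15 * 60)

print(output_only_oracle(slow_add, "4"))                         # True: a miss
print(output_and_time_oracle(slow_add, "4", max_seconds=1))      # False: caught
print(output_and_time_oracle(busy_printer, "test page", 60))     # False: a false alarm
```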

Doug argued that both types of mistakes were inevitable in testing because no one could fully specify the starting state of the system and no one could fully specify the ending state of the system. There are too many potentially-relevant variables. For example, suppose in your 2+2 test, you do specify the expected time for the test to complete:

  • Did you specify the contents of the stack? What if the program adds stuff to the stack but doesn’t remove it, or corrupts the stack in some other way?
  • Did you specify the contents of memory? Memory leaks are common bugs. And buffer overflows are a common example of a class of bug that corrupts memory.
  • Did you specify the contents of the hard disk? What if the program saves something or deletes something?
  • Did you specify what the printer would do during the test? What if the program sends something to the printer, even though it is not supposed to, or sends unauthorized email, etc.?

If you don’t have experience thinking about the diversity of ways that something can go wrong, but you have a bit of technical savvy, the Hewlett-Packard printer diagnostics can be eye-opening. You can find documentation of these in Management Information Bases (MIB’s) published by HP. I find these at but if this source goes away, you can find third party sites like OiDView. For example, the MIB file for the LaserJet 9250c runs 8506 lines, documenting 176 commands, many of them with many possible parameters. A program can go wrong in hundreds (or thousands) of different ways.

From a diagnostic point of view, imagine running a test and checking the state of the printer. For example, you might check how much free memory there is, or how long the last command took to execute, or the most recent internal error code. Each diagnostic command that you run changes the state of the machine, and so the results of the next diagnostic are no longer looking at the system as it was right after the test completed.

So in practical terms, even if you could fully specify the state of the system after a test (you can’t, but pretend that you could), you still couldn’t check whether the system actually reached that state after the test because each of the diagnostics that you would run to check the state of the system would change the state. The next diagnostic tests the machine that is now in a different state. In practical terms, you can only run a few diagnostics as part of a test (maybe just one) before the diagnostics stop being informative. If these diagnostics don’t look for a problem in the right places, you won’t see it. This is sometimes called the Heisenbug problem, in honor of the Heisenberg Uncertainty Principle.
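
To make the effect concrete, here is a toy model. It is entirely hypothetical: `ToyPrinter`, its commands, and the memory costs are invented for this sketch and do not describe any real HP device or MIB. In it, every diagnostic query is itself a command that uses a little memory and overwrites the "last command" register.

```python
class ToyPrinter:
    """Hypothetical device in which every diagnostic changes the state."""

    def __init__(self, free_kb: int) -> None:
        self.free_kb = free_kb
        self.last_command = "none"

    def _record(self, command: str, cost_kb: int) -> None:
        # Every command, including a diagnostic, consumes some memory
        # and becomes the most recent command.
        self.free_kb -= cost_kb
        self.last_command = command

    def print_page(self) -> None:
        self._record("print_page", cost_kb=100)

    def query_free_memory(self) -> int:
        self._record("query_free_memory", cost_kb=1)
        return self.free_kb

    def query_last_command(self) -> str:
        answer = self.last_command
        self._record("query_last_command", cost_kb=1)
        return answer


printer = ToyPrinter(free_kb=1000)
printer.print_page()                       # the test itself: 900 KB now free
free_seen = printer.query_free_memory()    # reports 899, not 900
last_seen = printer.query_last_command()   # reports the previous diagnostic
print(free_seen, last_seen)
```

In this toy, the first diagnostic is already slightly off, and the second one describes the first diagnostic rather than the test.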

Oracles are Heuristics

Hoffman argued that no oracle can fully specify the postcondition state of the system under test and therefore no oracle is complete. Given that an oracle is incomplete, you might use the oracle and incorrectly conclude that the program failed the test when it didn’t or passed the test when it didn’t. Either way, reliance on an oracle can lead you to the wrong conclusion.

A decision rule that is useful but not always correct is called a heuristic.

My favorite presentations of the ideas underlying Heuristics were written by Billy V. Koen. See his book (I prefer the shorter and simpler ASEE early edition used in introductory engineering courses, but the current version is good too) and a wonderful historical article that he wrote for BBST.

The Bach / Bolton Consistency Heuristics

Imagine running a test. The program misbehaves. The tester notices the behavior and recognizes that something is wrong. What is it that makes the tester decide this is wrong behavior?

In Bach’s view (as I understand it from talking with him and teaching about this with him), what happens is that the tester makes a comparison between the behavior and some expectations about the ways the program should (or should not) behave. These comparisons might be conscious or unconscious, but Bach posits that they must happen because every explanation of why a program’s behavior has been evaluated as a misbehavior can be mapped to one of these types of consistency.

Here’s the list:

  • Consistent within product: Function behavior consistent with behavior of comparable functions or functional patterns within the product.
  • Consistent with comparable products: Function behavior consistent with that of similar functions in comparable products.
  • Consistent with history: Present behavior consistent with past behavior.
  • Consistent with our image: Behavior consistent with an image the organization wants to project.
  • Consistent with claims: Behavior consistent with documentation, specifications, or ads.
  • Consistent with standards or regulations: Behavior consistent with externally-imposed requirements.
  • Consistent with user’s expectations: Behavior consistent with what we think users want.
  • Consistent with purpose: Behavior consistent with product or function’s apparent purpose.

(If you reorder the list, you can use a mnemonic abbreviation to memorize it: HICCUPPS.)

For example, imagine that there is a program specification and that the program behaves differently from what you would predict from the specification. The behavior might be reasonable, but if it contradicts the specification, you should probably write a bug report. Your explanation of the problem in the report wouldn’t be “this is bad”. It would be “this is bad because it is inconsistent with the specification.”

The list is designed to cover every type of consistency-expectation that testers rely on. If they realize the list is incomplete, they add a new type.

For the sake of argument, I will assume that this list is complete, i.e. that every rationale that a tester provides for why a program is misbehaving can be mapped to one of these 8 types of consistency.

I have seen it argued (mainly on Twitter) that this is the “right” list. That every other oracle can be mapped to this list (this oracle tests for this type of inconsistency) and therefore they are all special cases. If you know this list, the argument goes, you can derive (or imagine) (or something) all the oracles from it.

As far as I know, there is no empirical research to support the claim that testers in fact always rely on comparisons to expectations or that these particular categories of expectations map to the comparisons that go on in testers’ heads.

  • That assertion does not match my subjective impression of what happens in my head when I test. It seems to me that misbehaviors often strike me as obvious without any reference to an alternative expectation. One could counter this by saying that the comparison is implicit (unconscious) and maybe it is. But there is no empirical evidence of this, and until there is, I get to group the assertion with Santa Claus and the Tooth Fairy. Interesting, useful, but not necessarily true.
  • The assertion also does not match my biases about the nature of concept formation and categorical reasoning. As a graduate student, I studied cognition with Professor Lee R. Brooks. Some of his most famous work was on nonanalytic concept formation (see his 1978 chapter in Rosch & Lloyd’s classic Cognition and Categorization or his 1984 paper with Larry Jacoby on nonanalytic cognition). A traditional view of cognition holds that we make many types of judgment on the basis of rules that put things into categories–something is this or that because of a set of rules that we consult either consciously or unconsciously. Bach and Bolton’s consistencies are examples of the kinds of categories that I think of when I think of this tradition. A very different view holds that we make judgments on the basis of similarity to exemplars. (An exemplar is a memorable example.) A person can learn arbitrarily many exemplars. Experts have probably learned many more than nonexperts and so they make better evaluations. One of the most interesting experiments in Lee’s lab required the experimental subject to make a judgment (saying which category something belonged to) and explain the judgment. The subjects in the experiments described what they said were their decision rules to explain each choice. But over a long series of decisions, you can ask whether these rules actually describe the judgments being made. The answer was negative. The subject would describe a rule that recently he hadn’t followed and that he would again not follow later. Instead, the more accurate predictor of his decisions was the similarity of the thing he was categorizing to other things he had previously categorized. It appeared that unconscious processing was going on, but it was nonanalytic (similarity-based), not analytic (rule-based). I found, and still find, this line of results persuasive.

A list can be useful as a heuristic device, as a tool that helps you consciously think about a problem, whether the list describes the actual underlying psychology of testing or not.

But if it is to be a good heuristic device, it has to be more useful than not. As a tool for teaching oracles as part of test design, my experience is that the consistency list fails the utility criterion.

I don’t have any scientific research to back up my conclusion, just a lot of personal experience. But when dealing with a heuristic device that is not backed up by any scientific research (just a lot of personal experience), I get to rely on what I’ve got.

Doug Hoffman’s Approach

I first saw Doug talk about oracles in 1998, at Quality Week. That was the start of a long series of publications on oracles and the use of oracles in test automation. Along with the papers, I have the benefit of having taught courses on test automation with Doug and having talked at length with him while he struggled to get his ideas on paper.

Doug made two key points in 1998:

  • All oracles are heuristic (we’ve already covered that ground)
  • There are a lot of incomplete oracles available. Given that we have to rely on incomplete oracles (because no oracles are complete), we should think about what combinations of oracles we can use to learn interesting things about the software.

Doug’s work was so striking that we opened the Fifth Los Altos Workshop on Software Testing with it. That meeting became an intense, 2-day long, moderated debate between Doug and James Bach. We learned so much about managing difficult debates in that meeting that we were able to create what I think of as the current structure of LAWST, adopted in LAWST 6.

Doug has published several lists of specific types of oracles. Unfortunately, each of the ones I’ve read has its own idiosyncrasies that can be confusing, so I won’t try to restate them. Instead, I’ll work from BBST-Foundations-2013 (in preparation), which refines a list that I prepared for the current BBST Foundations with Doug’s coaching.

  • We use the constraint oracle to check for impossible values or impossible relationships. For example, an American ZIP code must be 5 or 9 digits. If you see something that is non-numeric or has some other number of digits, it cannot be a ZIP code. A program that produces such a thing as a ZIP code has a bug.
  • We use the regression oracle to check results of the current test against results of execution of the same test on a previous version of the product.
  • We use self-verifying data as an oracle. In this case, we embed the correct answer in the test data. For example, if a protocol specifies that when a program sends a message to another program, the other one will return a specific response (or one of a few possible responses), the test could include the acceptable responses. An automated test would generate the message, then check whether the response was in the list or was the specific one in the list that is expected for this message under this circumstance.
  • We use a physical model as an oracle when we test a software simulation of a physical process. For example, does the movement of a character or object in a game violate the laws of gravity?
  • We use a business model the same way we use a physical model. If we have a model of a system, we can make predictions about what will happen when events X take place. The model makes predictions. If the software emulates the business process as we intend, it should give us behavior that is consistent with those predictions. Of course, as with all heuristics, if the program “fails” the test, it might be the model that is wrong.
  • We use a statistical model to tell us that a certain behavior or sequence of behaviors is very unlikely, or very unlikely in response to a specific action. The behavior is not impossible, but it is suspicious. We can test whether the actual behavior in the test is within the tolerance limits predicted by the model. This is often useful for looking for patterns in larger sets of data (longer sequences of tests). For example, suppose we expect an eCommerce website to get 80% of its customers from the local area, but in beta trials of its customer-analysis software, the software reports that 70% of the transactions that day were from far away. Maybe this was a special day, but probably this software has a bug. If we can predict a statistical pattern (correlations among variables, for example), we can check for it.
  • Another type of statistical oracle starts with an input stream that has known statistical characteristics and then checks the output stream to see if it has the same characteristics. For example, send a stream of random packets, compute statistics of the set, and then have the target system send back the statistics of the data it received. If this is a large data set, this can save a lot of transmission time. Testing transmission using checksums is an example of this approach. (Of course, if a message has a checksum built into the message, that is self-verifying data.)
  • We use a state model to specify what the program does in response to an input that happens when it is in a known state. A full state model specifies, for every state the program can be in, how the program will respond (what state it will transition to) for every input.
  • We can build an interaction model to help us test the interaction between this program and another one. The model specifies how that program will behave in response to events in (actions of) this program and how this program will behave in response to actions of the other program. The automaton triggers the action, then checks the expected behavior.
  • We use calculation oracles to check the calculations of a program. For example, if the program adds 5 numbers, we can use some other program to add the 5 numbers and see what we get. Or we can add the numbers and then successively subtract one at a time to see if we get a zero.
  • The inverse oracle is often a special case of a calculation oracle (the square of the square root of 2 should be 2) but not always. For example, imagine taking a list that is sorted low to high, sorting it high to low and then sorting it low to high. Do we get back the same list?
  • The reference program generates the same responses to a set of inputs as the software under test. Of course, the behavior of the reference program will differ from the software under test in some ways (they would be identical in all ways only if they were the same program). For example, the time it takes to add 1000 numbers might be different in the reference program versus the software under test, but if they ultimately yield the same sum, we can say that the software under test passed the test.
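
Three of the oracles in this list (constraint, calculation, and inverse) are small enough to sketch in a few lines. What follows is an illustration under stated assumptions, not a definitive implementation: the function names are invented, the ZIP check accepts an optional hyphen before the last four digits (the text above only says "5 or 9 digits"), and the calculation check is shown for integers only, because floating-point subtraction would need a tolerance.

```python
import re

# Assumption: 5 digits, or 9 digits with an optional hyphen (ZIP+4 style).
ZIP_PATTERN = re.compile(r"[0-9]{5}(-?[0-9]{4})?")


def constraint_oracle_zip(value: str) -> bool:
    """Constraint oracle: reject anything that cannot be a ZIP code."""
    return ZIP_PATTERN.fullmatch(value) is not None


def calculation_oracle_sum(sum_fn, numbers) -> bool:
    """Calculation oracle: add the numbers with the function under test,
    then subtract them one at a time and expect to reach zero."""
    remainder = sum_fn(numbers)
    for n in numbers:
        remainder -= n
    return remainder == 0


def inverse_oracle_sort(sort_fn, ascending) -> bool:
    """Inverse oracle: sort high to low, then low to high, and expect
    to get the original ascending list back."""
    descending = sort_fn(ascending, reverse=True)
    return sort_fn(descending, reverse=False) == list(ascending)


def buggy_sum(numbers):
    """Stand-in for software under test: it forgets the last number."""
    return sum(numbers[:-1])


print(constraint_oracle_zip("32901"), constraint_oracle_zip("3290A"))
print(calculation_oracle_sum(sum, [3, 1, 4, 1, 5]))
print(calculation_oracle_sum(buggy_sum, [3, 1, 4, 1, 5]))
print(inverse_oracle_sort(sorted, [1, 2, 2, 7, 9]))
```

Each function returns a simple pass or fail, which is what an automaton needs, and each one ignores everything about the run except the single prediction it checks.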

You can probably imagine lots of other possibilities for this list.
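
As one more illustration from the list above, a state-model oracle can be written as a transition table that the automaton consults after every input. The two-state media player, the `BuggyPlayer` class, and its seeded bug are all hypothetical and exist only to show the mechanics.

```python
# Full state model for a hypothetical player: every (state, input) pair
# maps to the state the program should move to.
TRANSITIONS = {
    ("stopped", "play"): "playing",
    ("stopped", "stop"): "stopped",
    ("playing", "play"): "playing",
    ("playing", "stop"): "stopped",
}


class BuggyPlayer:
    """Stand-in for the software under test."""

    def __init__(self) -> None:
        self.state = "stopped"

    def handle(self, event: str) -> None:
        if event == "play":
            self.state = "playing"
        elif event == "stop":
            # Seeded bug: "stop" toggles instead of always stopping.
            self.state = "stopped" if self.state == "playing" else "playing"


def first_disagreement(player, events):
    """Drive the player and return the first step where it leaves the model,
    as (step, state_before, event, expected_state, observed_state)."""
    for step, event in enumerate(events):
        before = player.state
        player.handle(event)
        expected = TRANSITIONS[(before, event)]
        if player.state != expected:
            return step, before, event, expected, player.state
    return None


print(first_disagreement(BuggyPlayer(), ["play", "stop", "stop", "play"]))
```

The bug only shows on the second consecutive "stop", which is the kind of sequence-dependent failure that a state model is meant to expose.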

Applying This to Automation

What is special about these oracles is that they are programmable. You can create automated tests that will check the behavior of the program against the result predicted by (or predicted against by) any of these oracles or by any (well, probably, almost any) combination of these oracles.

Given a programmable oracle, you can do high-volume automated testing. Have the test-design-and-execution program randomly generate inputs to the software under test and check whether the software responds the way the oracle predicts. You might use some type of model to drive the random number generator (making some events more likely than others). You might create a long random sequence of tests (e.g. regression tests) by randomly selecting which test to run next from a pool of already-built tests (each of which has an expected result that you can check against). Given an oracle, you can detect whatever failures that oracle can expose. For example, you might test with several oracles:

  • one oracle predicts how long an operation should take (or a range of possibility). If the program takes substantially more or less time, that’s a problem.
  • another oracle can predict the calculation result of the operation (or the functional result if you’re doing something else, like sorting, that isn’t exactly a calculation)
  • another oracle might predict the amount of free memory, or at least might tell you whether a large data set (or memory-intensive calculation) should fit in memory. If so, you can detect memory leaks this way.
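
A rough sketch of applying these three oracles to one run follows. The function name `run_with_oracles` and the limits are assumptions chosen for the example. The memory check uses the peak allocation reported by Python's `tracemalloc` as a stand-in for "amount of free memory", which is a simplification.

```python
import time
import tracemalloc


def run_with_oracles(operation, data, expected, max_seconds, max_peak_bytes):
    """Run one operation and return a separate verdict from each oracle."""
    tracemalloc.start()
    start = time.perf_counter()
    result = operation(data)
    elapsed = time.perf_counter() - start
    _current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "result": result == expected,       # functional-result oracle
        "time": elapsed <= max_seconds,     # timing oracle
        "memory": peak <= max_peak_bytes,   # memory oracle (peak allocation)
    }


def wasteful_sort(values):
    """Stand-in for software under test: correct output, but it holds
    200 needless copies of the input while it works."""
    scratch = [list(values) for _ in range(200)]
    return sorted(values)


data = list(range(10_000, 0, -1))
expected = list(range(1, 10_001))
limits = {"max_seconds": 5.0, "max_peak_bytes": 10_000_000}

print(run_with_oracles(sorted, data, expected, **limits))
print(run_with_oracles(wasteful_sort, data, expected, **limits))
```

Keeping the verdicts separate, rather than collapsing them into one pass or fail, makes it easier to see which oracle objected and to judge whether that objection is a real bug or a false alarm.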

No matter what combination you choose, you will miss some types of errors. You cannot test all the dimensions of the result of a test with any oracle or any combination of oracles. But if you test a feature by machine using some oracles, then when you test that feature with a human painstakingly designing and running each test individually, that person will know that she doesn’t have to waste time checking whether certain types of bugs are there or not, because if they were there, the automaton would already have exposed them.
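
The high-volume approach described in this section can be sketched with a reference-program oracle. Everything named here is an assumption for illustration: Python's built-in `sorted` plays the trusted reference, `sort_under_test` is a stand-in with a seeded bug, and the input sizes and value ranges are arbitrary.

```python
import random


def reference_sort(values):
    """Reference program: the built-in sort, assumed trustworthy here."""
    return sorted(values)


def sort_under_test(values):
    """Stand-in for the software under test. Seeded bug: it drops duplicates."""
    return sorted(set(values))


def high_volume_test(under_test, reference, runs, seed):
    """Generate random inputs and collect every input on which the
    software under test disagrees with the reference program."""
    rng = random.Random(seed)  # seeded so that a failing run can be replayed
    failures = []
    for _ in range(runs):
        size = rng.randint(0, 8)
        data = [rng.randint(-5, 5) for _ in range(size)]
        if under_test(list(data)) != reference(list(data)):
            failures.append(data)
    return failures


failures = high_volume_test(sort_under_test, reference_sort, runs=1000, seed=1)
print(len(failures), "failing inputs; first:", failures[0] if failures else None)
```

This oracle exposes the dropped duplicates, but it would say nothing about a version that sorted correctly and leaked memory on every call. That would take a different oracle.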

An Exam Question

Here’s an exam question (that students in the current version of BBST Foundations have often handled poorly):

Suppose you have written a test tool that allows you to feed commands and data to Microsoft Excel and to Open Office Calc and to see the results. The test tool is complete, and it works correctly. You have been asked to test a new version of Calc and told to automate all of your testing. What oracles would you use to help you find bugs? What types of information would you expect to get with each oracle?

Note: Don’t just echo back a consistency heuristic. Be specific in your description of a relevant oracle and of the types of information or bugs that you expect.

Look back at the Hoffman list and think of what oracles you could use for this test, to facilitate extensive automated testing.

Now look further back to the Bach / Bolton list and think of what oracles their list suggests that would work well for designing extensive automated testing.

For me, the Hoffman list works better. (And if I thought about additional oracles that were specific enough to support automation, I would add them to the Hoffman list, growing it into something that gets longer and longer the more I use it.) What about for you?
