Catching Heisenbugs in Test Automation

„Ah, but I may as well try and catch the wind.“ (Donovan)
GUI-based testautomation (hopefully done with Selenium WebDriver) is programming. No, it’s even harder than common programming, because you have to cope with insane effects. Even if your test suite works fine locally and even if it works for a while on the server, it’s no guarantee that it works reliable. That’s because the AUT is a living thing. Everyone trying to testautomate for few weeks knows what I mean. Sometimes the performance of the AUT’s server is bad and fails your beautiful testcases without reason. Sometimes the AUTs developers feel that they have to change the IDs and Bang! Sometimes heaven decides to change something in the layout and your f*** testcase waits for a f*** link that isn’t visible anymore (without scrolling) because the f*** floating menu decided to shift 50 pixels. Sometimes this, sometimes that. Ok, one could live with it: being called a bungler, analyse, fix and getting more and more stable with every new release. But things aren’t that „easy“: you got a ticket from the defect manager that your testcase works incorrectly (proven with a screenshot). You are running the test suite again and again without beeing able to even reproduce the bug. This type of bugs is called a heisenbug and unfortunately that’s not a rare case in testautomation. A bug in your test suite that comes in let’s say 7 of 100 runs and happens non-deterministic. Without being able to reproduce it, of course you can’t verify if any of your fixes works. Dead end!
But there is a way to „reproduce“ the bug: use the bulk! Run your test-suite 100 times and count the number of appearances of that bug (in above case = 7). Try a fix in your code and rerun the test-suite 100 times. If the appearance of that bug is 0 you managed to fix it.
You can implement the bulk in various ways – my favorite is Jenkins:
0.) Given you’re running your nice testsuite in a Jenkins Job yet and called it „RottenTest“
1.) Create 100 jobs and run them in parallel.
Execute in jenkins script console:

def jobName = „RottenTest“
def job = Jenkins.instance.getItem(jobName) //get a reference to the job containing the heisenbug
def i = 1
while (i < 101) { def newJobName = "CatchingTheWind" + jobName + i def newJob = Hudson.instance.copy(job, newJobName) //create new jobs to get the execute in parallel for shortest possible total execution time newJob.scheduleBuild(new hudson.model.Cause.UserIdCause()) //start the new job i++ } [/sourcecode] 2.) Create a listview to directly compare the results of your 100 runs. Filter it with the following Regex: [sourcecode language="text"] CatchingTheWind.* [/sourcecode] 3.) Delete the 100 jobs after successful bugfix: [sourcecode language="text"] def jobName = "RottenTest" def job = Jenkins.instance.getItem(jobName) //get a reference to the job containing the heisenbug def i = 1 while (i < 101) { def newJobName = "CatchingTheWind" + jobName + i newJob = Jenkins.instance.getItem(newJobName) newJob.delete() i++ } [/sourcecode] Thanks to Karen, the unknown hero, for helping me with the quirks of Jenkins API.
BTW:
Run the complete test-suite because you can’t anticipate all dependencies of your buggy testcase.
The night is your friend.

IT Kosmopolit

Catching Heisenbugs in Test Automation

Anpacken!