Back to CarlSpeare.com

Making a good test harness

If you are going to test anything to a degree of statistical significance, you will need to create a test harness that automates the majority of the work. Humans are generally fairly poor at doing complex tasks repeatedly, so automation will allow multiple people to run the test in a reproducible and reliable manner.

First, be specific about what you are testing. Distill the test down to the most fundamental element: are you testing bandwidth? That's a broad test. You are really going to test a specific pattern of traffic between a dedicated pair of endpoints, with the medium in between being the test case. You can vary the pattern, but you need to make multiple identical tests of each pattern in order to ensure you have really reproduced the results. The first test might be a simple HTTP transfer; subsequently, you might test UDP traffic, or multicast traffic, or something else — but the essence should be that you are testing one thing, and can run the test frequently enough that you've removed "luck of the draw" situations.
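One way to check whether you have run a pattern often enough is to summarize the repeated runs and look at the spread. The following is a minimal sketch, not a full statistical treatment; the function name and the sample numbers are made up for illustration.

```python
import statistics


def summarize(samples_mb_s):
    """Summarize repeated throughput measurements (MB/s) of one traffic pattern."""
    mean = statistics.mean(samples_mb_s)
    stdev = statistics.stdev(samples_mb_s)  # sample standard deviation; needs at least 2 runs
    return {
        "runs": len(samples_mb_s),
        "mean": mean,
        "stdev": stdev,
        "min": min(samples_mb_s),
        "max": max(samples_mb_s),
        # Coefficient of variation: the spread relative to the mean, in percent.
        "cv_percent": 100.0 * stdev / mean,
    }


# Hypothetical results from five identical runs of the same pattern.
runs = [10.5, 10.7, 10.8, 10.6, 10.7]
summary = summarize(runs)
print(summary)
```

A small spread across many identical runs suggests the result is not a fluke; a large spread suggests you need more samples, or that something other than the medium is varying.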

Second, determine the best tools to run the test. In the bandwidth example, you might use curl or wget for HTTP patterns, and then switch to iperf3 for UDP patterns, and finally use msend/mdump for multicast patterns. However, do some investigation and make sure that you've at least done a superficial survey of what tools are available — it might be possible that a better testing tool exists compared to what you might have used in the past. (In the old days, I used ttcp extensively; now, almost nobody uses it, because tools like iperf3 have better features or more robust measurement methods.)

Third, once you have defined your test and selected the tool to run it, do some preliminary testing to ascertain that you really are measuring what you think you are measuring, and that the results make sense. For example, someone might use wget to test network bandwidth but put in an HTTPS URL. This might seem like an acceptable test, but you end up measuring two things at the same time: the bandwidth, and the encryption/decryption speed of both endpoints. Switch to HTTP, and run the tests many times to ensure that the sender has the file cached (if possible); similarly, have the client write the file to the bit bucket. Both steps serve the same purpose: you are removing disk speeds from the test, and trying to remove everything but the network throughput. Do the HTTP speeds match what you expect? For example, if you are getting 150 MB/s on a link that is claimed to be 1 Gb/s, you should distrust the result: 1 Gb/s is at most 125 MB/s even before protocol overhead, so 150 MB/s is not possible on a gigabit line. Perhaps the data is being compressed before it is sent? Make sure you are using random data to avoid compression effects.
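The sanity check on the gigabit example is simple arithmetic, and it is worth building into the harness. This is a small sketch with hypothetical function names; it uses decimal units (1 Gb/s = 1000 Mb/s) and ignores protocol overhead, so the real ceiling is somewhat lower than what it reports.

```python
def line_rate_mb_s(link_gb_s):
    """Theoretical ceiling in MB/s for a link rated in Gb/s (decimal units, 8 bits per byte)."""
    return link_gb_s * 1000.0 / 8.0


def is_plausible(measured_mb_s, link_gb_s):
    """True if the measured rate does not exceed the link's raw bit rate."""
    return measured_mb_s <= line_rate_mb_s(link_gb_s)


print(line_rate_mb_s(1.0))       # 125.0
print(is_plausible(150.0, 1.0))  # False: something other than the network is being measured
```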

Finally, once you have a reasonable certainty you are measuring the correct aspect, and have some guidance on what results are sensible, create the automation of the test. If you have a wget recipe for the test, put it into a loop and collect the results in a file. If you have an FTP job, get the .netrc set and create a script to capture all the results. Make sure systems and output have accurate time stamps so you can look at the results later and understand ordering. Do a few smaller test runs of the harness before you scale it up. If you are running a loop of wget operations, try 3-4 to make sure everything is working. Then, unleash it on a much larger sample size: 50, 100, whatever is enough to prove things. When in doubt, add more samples — not fewer.
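As a rough sketch of what such a loop might look like, the following runs a command a fixed number of times and appends one time-stamped row per run to a CSV file. The function name, the column layout, and the results.csv file name are my own choices for illustration, and the command shown is only a placeholder that does nothing; substitute your own test recipe.

```python
import csv
import subprocess
import sys
import time
from datetime import datetime, timezone


def run_harness(command, runs, results_path):
    """Run `command` `runs` times and append one time-stamped row per run to a CSV file."""
    rows = []
    with open(results_path, "a", newline="") as handle:
        writer = csv.writer(handle)
        for run_number in range(1, runs + 1):
            # Wall-clock time stamp (UTC) so results can be ordered later.
            started_at = datetime.now(timezone.utc).isoformat()
            # Monotonic clock for the duration, so clock adjustments do not skew it.
            start = time.monotonic()
            completed = subprocess.run(
                command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
            )
            elapsed = time.monotonic() - start
            row = [started_at, run_number, completed.returncode, f"{elapsed:.3f}"]
            writer.writerow(row)
            handle.flush()  # keep results on disk even if a later run hangs
            rows.append(row)
    return rows


if __name__ == "__main__":
    # Placeholder command that does nothing; replace it with the real test.
    placeholder = [sys.executable, "-c", "pass"]
    for row in run_harness(placeholder, runs=3, results_path="results.csv"):
        print(row)
```

For an HTTP test, the command could be something like ["wget", "-q", "-O", "/dev/null", "http://server.example/testfile.bin"], where the URL is hypothetical. Start with runs=3 to confirm everything works, then raise it to the real sample size.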

The best verification of a test harness is to let someone else try it. Give them a guide (written, not shown in real-time) and ask them to try it. See what results are produced. If a third person is available, let that person try it as well. Make sure there aren't assumptions being made, which might be obvious to you but not to someone else. The justification for multiple runs across multiple people is simple: a test harness, if done correctly, can be run by anyone with access to a suitable environment. That's the point of documenting experiments in the scientific method: you want someone else to try it, and see if they get the same results.

In defense of the scientific method

One of my earliest encounters with troubleshooting a non-trivial technology problem occurred at IBM Research, sometime around 1998-1999. Around that time, we were experimenting with 100 Mb/s Ethernet; prior to this point, the main Watson center building had a campus-wide 16 Mb/s Token Ring network. As Token Ring is a single-talker and non-collision medium, you expected that the throughput would be exactly the raw bit rate minus overhead — and it was, consistently. However, there was a delay in when you could send, so while the active throughput was predictable, the "experienced" throughput could be lower in a busy and large ring.

The Networking team had recently acquired a 100 Mb/s layer 2 switch, which would allow us to test providing faster access for applications that could benefit from the increased bandwidth. Since it was a many-talker situation, you do expect some degradation of performance in a busy network. However, we were testing a brand new switch in isolation — so our expectations were high.

We had two systems available for testing: a Solaris 7 system that I had put together, and a newer RS/6000 running AIX 4.3 with an add-in Ethernet card. So, I set up an FTP server on the AIX system and used the FTP client on Solaris to grab an ISO or some other reasonably large file (large for the time; 650 MB today is considered "small"). We measured the first transfer: about 9.8 MB/s. OK, that's fairly good! Converting to Mb/s, that's about 78.4 Mb/s, or near 78% efficiency. I think we should be able to get higher, though.

We did some investigation, and made some tweaks to window sizes on both the AIX and Solaris systems. We ran the test again: 10.2 MB/s. That's about 82% efficiency. Seems fairly good. We change a few more options, reboot the switch, apply a few settings on the switch side, and so forth. We finally get the best run we've gotten: 10.8 MB/s. That's 86% efficient. We assume we cannot possibly do better, so we call it a day. (It is possible to do better, but that would be later on, long after this examination.)
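The efficiency figures in this story come from converting megabytes per second to megabits per second and comparing against the 100 Mb/s line rate. A short sketch of that arithmetic, with hypothetical function names:

```python
def to_megabits(measured_mbytes_s):
    """Convert a rate in MB/s to Mb/s (8 bits per byte)."""
    return measured_mbytes_s * 8.0


def efficiency_percent(measured_mbytes_s, link_mbits_s):
    """Measured rate as a percentage of the link's raw bit rate."""
    return 100.0 * to_megabits(measured_mbytes_s) / link_mbits_s


# The three transfer rates from the story, on a 100 Mb/s link.
for rate in (9.8, 10.2, 10.8):
    print(f"{rate} MB/s = {to_megabits(rate):.1f} Mb/s, "
          f"{efficiency_percent(rate, 100.0):.1f}% of line rate")
```

This gives 78.4%, 81.6%, and 86.4%, which round to the 78%, 82%, and 86% quoted above.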

Just for the fun of it, I decide to run the test a few times, and we see small variations: maybe 10.5 MB/s here or there, 10.7 MB/s to make it interesting. However, it's always in the band between about 10.5 and 10.8 MB/s. I moved on to the next project, but I left the test running. Over and over again, the Solaris system would get the remote file, record the time it took, and remove the local copy of the file. I left for the day, the test still running.

The next day I came in, and could not believe what I saw: the rate was about 450-500 KB/s. Wait, how could this be? I called the Networking person I had been working with and showed him the results. He saw nothing on the switch out of the ordinary. We honestly had no clue what was going on. Fine, let's reboot both the AIX and Solaris systems, and see what happens, one at a time. Post-reboot of the AIX system, the same results: not more than 500 KB/s. Then we reboot the Solaris system — and find the same result: not higher than about 500 KB/s. Since we rebooted both endpoints, our next tactic was to reboot the switch. We tested again, and poof! The first post-switch-reboot run came in at 10.7 MB/s.

So did the second run, and the third. We left it running. Later that day, we saw it drop off again to 500 KB/s. We did the same thing: reboot both endpoints, one at a time with a test in between; each was still slow. Reboot the switch; fast again. So, at least it was repeatable. We agreed that we'd give it a rest for the day, but the Networking guy would start the process of opening a case with the switch vendor — Cisco.

The next day, we reproduced the results again, and sent all the information to Cisco. At first, they did not accept that their switch could have this problem. We sent all the data, in full detail, showing exactly what we did. We explained how we could reproduce it over and over again, and had already done so three times. Finally they relented, took a look at the situation, and discovered that yes, there was a bug in their switch. Once some counter reached a 32-bit boundary, it wrapped around and started generating errors, but not any administrator-visible error that would appear in the "show" commands, and nothing was otherwise visibly wrong in the status of the switch. After a reboot, the counter would start off at zero and be fine until it hit 2^32-1, at which point the next packet would cause a wrap and start the errors. They made a fix, we got the updated firmware, and the problem went away. We tested for a week with no issues. Eventually, we brought more systems onto the switch and tested more, never encountering that problem again.
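For anyone who has not seen a counter wrap before, here is a tiny illustration of unsigned 32-bit wraparound. It shows only the general mechanism; I do not know which counter was involved or how the switch firmware implemented it, so the names here are purely illustrative.

```python
COUNTER_BITS = 32
COUNTER_MASK = (1 << COUNTER_BITS) - 1  # 2**32 - 1 = 4294967295


def increment(counter):
    """Add one to an unsigned 32-bit counter, wrapping to zero past the maximum."""
    return (counter + 1) & COUNTER_MASK


last_good = COUNTER_MASK
print(last_good)             # 4294967295
print(increment(last_good))  # 0: the very next increment wraps around
```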

Had we not been careful to document everything, and to systematically approach the discovery of the problem, we would not have had a good way to determine what was really going on. Even though it wasn't formally done, we were effectively engaging in a simplified case of the scientific method. We had a hypothesis (there was an overflow problem, or something similar) and we took steps to reproduce the issue consistently. Had we had more control over the situation, we could have taken more steps: adding other systems, having multiple consumers for one sender, multiple consumers and multiple senders, etc. We could have tested other operating systems to ensure it wasn't a specific flaw between Solaris and AIX. There is a lot more we could have done, but the basic premise was still reliable: document everything tested, take a methodical approach to narrowing down the next tests, and refine the hypothesis with each additional data point.

Cisco switches have come a long way since 1999; Solaris is a minor inhabitant of the computing world, and AIX is even less popular than Solaris. And yet, today we do the same thing: we use Linux systems now, and the tools might have changed (we don't generally use FTP for testing throughput) — but the approach is the same. Be methodical, document everything, let the data refine the hypothesis even if it means your original hypothesis is wrong.

If you can't embrace the scientific method, working in technology will be rather difficult.
