Back to CarlSpeare.com
Making a good test harness
If you are going to test anything to a degree of statistical
significance, you will need to create a test harness that automates
the majority of the work. Humans are generally fairly poor at doing
complex tasks repeatedly, so automation will allow multiple people to
run the test in a reproducible and reliable manner.
First, be specific about what you are testing. Distill the test down
to the most fundamental element: are you testing bandwidth? That's a
broad test. You are really going to test a specific pattern of
traffic between a dedicated pair of endpoints, with the medium in
between being the test case. You can vary the pattern, but you need
to make multiple identical tests of each pattern in order to ensure
you have really reproduced the results. The first test might be a
simple HTTP transfer; subsequently, you might test UDP traffic, or
multicast traffic, or something else — but the essence should be that
you are testing one thing, and can run the test frequently enough
that you've removed "luck of the draw" situations.
Second, determine the best tools to run the test. In the bandwidth
example, you might use curl or wget for HTTP patterns, and then
switch to iperf3 for UDP patterns, and finally use msend/mdump for
multicast patterns. However, do some investigation and make sure
that you've at least done a superficial survey of what tools are
available; a better testing tool may exist than the one you used
in the past. (In the old days,
I used ttcp extensively; now, almost nobody uses it, because tools
like iperf3 have better features or more robust measurement methods.)
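As an illustration only (the host name and file here are placeholders, not taken from any real setup), the HTTP and UDP patterns above map to invocations along these lines:

```shell
# HTTP pattern: fetch a file, discard it, and report the average
# download speed that curl measured.
curl -sS -o /dev/null -w 'HTTP: %{speed_download} bytes/s\n' \
    http://server.example/testfile.bin

# UDP pattern: push 100 Mb/s of UDP traffic at an iperf3 server
# for 10 seconds; iperf3 reports loss and jitter at the end.
iperf3 -c server.example -u -b 100M -t 10
```

Note that both tools measure at the application layer, so their numbers include protocol overhead in different ways; pick one tool per pattern and stay with it across runs.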
Third, once you have defined your test and selected the tool to run
the test, do some preliminary testing to ascertain that you really
are measuring what you think you are measuring — and that the results
make sense. For example, it might be possible that someone uses wget
to test network bandwidth, but ends up putting in an HTTPS URL. This
might seem like an acceptable test, but you end up measuring two
things at the same time: the bandwidth, and the encryption/decryption
speed of both endpoints. Switch to HTTP, and run the tests many
times to ensure that the sender has the file cached (if possible);
similarly, have the client write the file to the bit bucket. Both
situations are the same: you are removing disk speeds as part of the
test, and trying to remove everything but the network throughput. Do
the HTTP speeds match what you expect? For example, if you are
getting 150 MB/s on a link that is claimed to be 1 Gb/s, you should
distrust the result: a gigabit line carries at most 125 MB/s even
before protocol overhead, so 150 MB/s is not possible. Perhaps the
data is being compressed before it is sent? Use random (and
therefore incompressible) data to avoid compression effects.
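Both sanity checks can be scripted with standard tools; this is a minimal sketch, where the file name and size are arbitrary choices rather than anything from a real test plan:

```shell
# Theoretical ceiling of a 1 Gb/s link, ignoring protocol overhead:
# 1,000,000,000 bits/s divided by 8 bits/byte = 125 MB/s.
awk 'BEGIN { printf "max: %.1f MB/s\n", 1e9 / 8 / 1e6 }'

# Create a 10 MB file of random (incompressible) data to transfer,
# so on-the-wire compression cannot inflate the measured rate.
dd if=/dev/urandom of=testfile.bin bs=1M count=10 2>/dev/null
wc -c testfile.bin
```

Any sustained measurement above the computed ceiling means you are not measuring the network alone.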
Finally, once you have a reasonable certainty you are measuring the
correct aspect, and have some guidance on what results are sensible,
create the automation of the test. If you have a wget recipe for the
test, put it into a loop and collect the results in a file. If you
have an FTP job, set up your .netrc and create a script to capture
all the results. Make sure the systems' clocks are synchronized and
the output carries accurate timestamps, so you can look at the
results later and understand their ordering. Do a few
smaller test runs of the harness before you scale it up. If you are
running a loop of wget operations, try 3-4 to make sure everything is
working. Then, unleash it on a much larger sample size: 50, 100,
whatever is enough to prove things. When in doubt, add more samples
— not fewer.
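The loop described above can be sketched as follows. The dd command is only a local stand-in workload so the sketch runs anywhere; swap your actual wget or curl recipe into run_test:

```shell
#!/bin/sh
# Minimal harness sketch: run one well-defined test repeatedly and
# log one timestamped result line per run for later analysis.

run_test() {
    # Stand-in workload; replace with your real recipe, e.g.:
    #   wget -q -O /dev/null http://server.example/testfile.bin
    dd if=/dev/zero of=/dev/null bs=1M count=64 2>/dev/null
}

RESULTS=results.csv
SAMPLES=4   # start small; scale up to 50-100 once the loop works

for i in $(seq 1 "$SAMPLES"); do
    start=$(date +%s)
    run_test
    end=$(date +%s)
    # ISO-8601 UTC timestamp, run number, elapsed seconds
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ),$i,$((end - start))" >> "$RESULTS"
done
```

Because every line carries a timestamp and a run number, the results file can be sorted, graphed, or diffed between runs by different people.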
The best verification of a test harness is to let someone else try
it. Give them a written guide (not a real-time demonstration) and
ask them to try it. See what results are produced. If a third
person is available, let that person try it as well. Make sure the
guide does not rest on assumptions that are obvious to you but not
to someone else. The justification for multiple runs across
multiple people is simple: a test harness, if done correctly, can
be run by anyone with
access to a suitable environment. That's the point of documenting
experiments in the scientific method: you want someone else to try
it, and see if they get the same results.
In defense of the scientific method
One of my earliest encounters with troubleshooting a non-trivial
technology problem occurred at IBM Research, sometime around
1998-1999. Around that time, we were experimenting with 100 Mb/s
Ethernet; prior to this point, the main Watson center building had a
campus-wide 16 Mb/s Token Ring network. Because Token Ring is a
single-talker, collision-free medium, you could expect the
throughput to be exactly the raw bit rate minus overhead, and it
was, consistently. However, a station had to wait its turn for the
token before it could send, so while the active throughput was
predictable, the "experienced" throughput could be lower on a busy
and large ring.
The Networking team had recently acquired a 100 Mb/s layer 2 switch,
which would allow us to test providing faster access for applications
that could benefit from the increased bandwidth. Since Ethernet is
a many-talker medium, you would expect some degradation of
performance in a busy network. However, we were testing a brand-new
switch in isolation, so our expectations were high.
We had two systems available for testing: a Solaris 7 system that I
had put together, and a newer RS/6000 running AIX 4.3 with an
add-in Ethernet card. So, I set up an FTP server on the AIX system
and used the FTP client on Solaris to grab an ISO or some other
reasonably large file (large for the time; a 650 MB file today is
considered "small"). We measured the first transfer: about 9.8
MB/s. Ok, that's fairly good! Converting to Mb/s, that's about 78.4
Mb/s, or roughly 78% efficiency. We thought we should be able to
get higher, though.
We did some investigation, and made some tweaks to window sizes on
both the AIX and Solaris systems. We ran the test again: 10.2 MB/s.
That's about 82% efficiency. Seems fairly good. We changed a few
more options, rebooted the switch, applied a few settings on the
switch side, and so forth. We finally got our best run yet: 10.8
MB/s. That's 86% efficient. We assumed we could not possibly do
better, so we called it a day. (It is possible to do better, but
that came later, long after this examination.)
Just for the fun of it, I decided to run the test a few more times,
and we saw small variations: maybe 10.5 MB/s here or there, 10.7
MB/s to make it interesting. However, the rate always stayed in the
band between about 10.5 and 10.8 MB/s. I moved on to the next
project, but left the test running. Over and over again, the
Solaris system would get
the remote file, record the time it took, and remove the local copy
of the file. I left for the day, the test still running.
The next day I came in, and could not believe what I saw: the rate
was about 450-500 KB/s. Wait, how could this be? I called the
Networking person I had been working with and showed him the
results. He saw nothing on the switch out of the ordinary. We
honestly had no clue what was going on. Fine, let's reboot both the
AIX and Solaris systems, and see what happens, one at a time.
Post-reboot of the AIX system, the same results: not more than 500
KB/s. Then we reboot the Solaris system — and find the same result:
not higher than about 500 KB/s. Since we rebooted both endpoints,
our next tactic was to reboot the switch. We tested again, and
poof! The first post-switch-reboot run came in at 10.7 MB/s.
So did the second run, and the third. We left it running. Later
that day, we saw it drop off again to 500 KB/s. We did the same
thing: reboot both endpoints, one at a time with a test in between;
each was still slow. Reboot the switch; fast again. So, at least it
was repeatable. We agreed that we'd give it a rest for the day, but
the Networking guy would start the process of opening a case with the
switch vendor — Cisco.
The next day, we reproduced the results again, and sent all the
information to Cisco. At first, they did not accept that their
switch could have this problem. We sent all the data, in full
detail, showing exactly what we did. We explained how we could
reproduce it over and over again, and had already done so three
times.
Finally they relented, took a look at the situation and discovered
that yes — there was a bug in their switch. Once some counter
reached a 32-bit boundary, it wrapped around and would start
generating errors — but not any administrator-visible error that
would appear in the "show" commands, and nothing was otherwise
visibly wrong in the status of the switch. After a reboot, the
counter would start off at zero and be fine until it hit 2^32-1, at
which point the next packet would cause a wrap and start the errors.
They made a fix, we got the updated firmware, and the problem went
away. We tested for a week with no issues. Eventually, we brought more
systems onto the switch and tested more, never crossing that problem
again.
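The wrap itself is simple to illustrate. The switch's internal counter is not documented beyond being 32-bit, so this is only a model of the arithmetic:

```shell
# Simulate a 32-bit unsigned counter wrapping. Shell arithmetic is
# 64-bit, so mask the result back down to 32 bits after incrementing.
MASK=4294967295              # 2^32 - 1, the counter's maximum value
counter=$MASK
echo "counter before: $counter"
counter=$(( (counter + 1) & MASK ))
echo "counter after:  $counter"   # the "next packet" pushes it to 0
```

This is the classic failure shape of a wraparound bug: everything is fine for a long, roughly fixed interval after each reboot, and then behavior degrades suddenly with nothing visible in the status output.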
Had we not been careful to document everything, and to systematically
approach the discovery of the problem, we would not have had a good
way to determine what was really going on. Even though it wasn't
formally done, we were effectively engaging in a simplified case of
the scientific method. We had a hypothesis — there was an overflow
problem, or something similar — and we took steps to reproduce the
issue consistently. Had we had more control over the situation, we could have
taken more steps: adding other systems, having multiple consumers for
one sender, multiple consumers and multiple senders, etc. We could
have tested other operating systems to ensure it wasn't a specific
flaw between Solaris and AIX. There is a lot more we could have
done, but the basic premise was still reliable: document everything
tested, take a methodical approach to narrowing down the next tests,
and refine the hypothesis with each additional data point.
Cisco switches have come a long way since 1999; Solaris is a minor
inhabitant of the computing world, and AIX is even less popular than
Solaris. And yet, today we do the same thing: we use Linux systems
now, and the tools might have changed (we don't generally use FTP for
testing throughput) — but the approach is the same. Be methodical,
document everything, let the data refine the hypothesis even if it
means your original hypothesis is wrong.
If you can't embrace the scientific method, working in technology
will be rather difficult.