Rules of Debug and Testing

This week I am debugging code. I am doing some pro bono work for a customer who asked for a test feature to be added to a product so that its performance can be evaluated and improved. Rule #1: you can’t improve what you can’t measure. The customer has been good to me in the past, and I expect the result of the evaluation may lead to new business. So, I am writing a SPI interface that will allow the product to offload a ridiculous amount of raw signal data that can be fed into an ideal system model, evaluated, and compared to the product’s performance. Rule #2: compare multiple interpretations of a system and validate each individually. There really isn’t a “golden” model; there are just multiple interpretations, and verification is a process of evaluating consensus. Is the implementation wrong, is it the test, or is it the presumption of desired behavior?
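A minimal sketch of what Rule #2 looks like on the PC side, with hypothetical names: neither stream is treated as golden, the tooling just reports where the two interpretations disagree so the disagreement itself can be investigated.

```python
# Minimal sketch (hypothetical names): compare the product's captured samples
# against an independent "ideal" model and report where the two interpretations
# disagree, rather than assuming either one is the golden reference.

def compare_interpretations(captured, modeled, tolerance=0):
    """Return the indices and values where the two interpretations disagree."""
    mismatches = []
    for i, (c, m) in enumerate(zip(captured, modeled)):
        if abs(c - m) > tolerance:
            mismatches.append((i, c, m))
    return mismatches

if __name__ == "__main__":
    captured = [0, 1, 2, 3, 5]   # stand-in for data offloaded over SPI
    modeled  = [0, 1, 2, 3, 4]   # stand-in for the ideal system model output
    for index, got, expected in compare_interpretations(captured, modeled):
        print(f"sample {index}: product={got} model={expected}")
```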

I spent an hour describing, in a document, the interface and the data stream format, which will be captured by SPI-to-USB adapters connected to a PC and stored as raw data files on an HDD. Then I wrote the code in a couple of hours and installed it into the code database for the product. Since I architected the code database and already had a SPI slave module in the project library, all I had to code was the state machine that captures the data, formats it, and FIFOs it to the SPI module. The stream is real-time and the capture rate can be either slower or faster than the serial rate, so the state machine and the data format have to handle both underflow and overflow conditions; the captured data also does not match the transfer size of the SPI, so the stream has to be “packed”. And, since past experience with the high-speed adapter has shown that transfer errors can occur when pushing the rate limit, a CRC is added to the stream at “block” intervals to protect data integrity.
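For flavor, here is a minimal sketch of what the receiving PC might do with those raw capture files. The block length, the 12-bit sample width, and the use of CRC-32 are illustrative assumptions, not the actual stream format.

```python
# Host-side sketch: unpack the "packed" samples and check the per-block CRC.
# BLOCK_PAYLOAD, the 12-bit width, and CRC-32 are assumptions for illustration.
import zlib

BLOCK_PAYLOAD = 384   # bytes of packed samples per block (assumed)
CRC_BYTES = 4         # CRC-32 appended to each block (assumed)

def unpack_12bit(payload):
    """Unpack 12-bit samples that were packed two-per-three-bytes for SPI."""
    samples = []
    for i in range(0, len(payload) - 2, 3):
        b0, b1, b2 = payload[i], payload[i + 1], payload[i + 2]
        samples.append((b0 << 4) | (b1 >> 4))
        samples.append(((b1 & 0x0F) << 8) | b2)
    return samples

def parse_blocks(raw):
    """Yield (crc_ok, samples) per block from a raw capture file."""
    block_size = BLOCK_PAYLOAD + CRC_BYTES
    for offset in range(0, len(raw) - block_size + 1, block_size):
        payload = raw[offset:offset + BLOCK_PAYLOAD]
        crc = int.from_bytes(raw[offset + BLOCK_PAYLOAD:offset + block_size], "big")
        ok = (zlib.crc32(payload) & 0xFFFFFFFF) == crc
        yield ok, unpack_12bit(payload)

if __name__ == "__main__":
    fake = bytes(BLOCK_PAYLOAD)
    fake += (zlib.crc32(fake) & 0xFFFFFFFF).to_bytes(4, "big")
    for ok, samples in parse_blocks(fake):
        print(ok, len(samples))   # expect: True 256
```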

Now it is time to debug my code, and in any effort that is N times more work than the concept or implementation phase. Rule #3: schedule 2 units of time for concept, 1 unit for implementation, and a minimum of 3 units for testing. When I was in charge of large ASIC designs, that often meant 2 to 6 months in concept, which included a lot of detailed documentation, then 1 to 3 months writing code, followed by 6 to 18 months of testing. What I find is that if you skimp on the concept phase, the implementation phase runs longer because decisions have to be revisited. And if you skimp on the last phase, which is what upper management and marketing will generally push you to do once they hear that coding and initial testing are complete, the product will fail miserably in the field. During the first phase, design-for-test was always part of the product documentation, and during the second phase I also allocated resources to developing a test plan for the product.

On this particular project the concept wasn’t totally new, so I did skimp on phase 1 by leveraging past experience, and so far I have gotten away with it in phase 2. However, because phases 1 and 2 went so fast, I have a feeling phase 3 is going to take longer. This is not because I shorted the first two phases but because this particular rule does not scale down to the smallness of this effort. In other words, I am probably looking at a 1-2-6 ratio where 1 unit is half a day. Fortunately, the original product already has an extensible design-for-test architecture and testbench, so adding the new feature to that part of the database was no big effort.

Rule #4: Testing is an exponential problem. Thorough testing grows exponentially with the number of modes and states. For example, in this simple little add-on I have defined 8 modes for collecting the data, and there are the overflow and underflow conditions as well as several exceptions to verify. Exceptions are conditions that the state machine can’t control and that are outside the list of defined conditions and expected operation. These conditions are often the most difficult to test and verify. I generally break the verification process into three steps. The first is to get a basic mode operational: I pick the simplest scenario and write a test to find the easiest implementation failures, which usually include syntax and interface issues as well as initialization, idle, and return-to-idle issues. This is where I am after about a day of testing.
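A rough illustration of Rule #4: the scenario count is the product of the independent dimensions, so it grows multiplicatively with every dimension you add. The specific mode and exception names below are made up for the example.

```python
# Enumerate the scenario matrix: modes x flow conditions x exception conditions.
from itertools import product

modes      = [f"mode_{i}" for i in range(8)]          # 8 collection modes
flow       = ["nominal", "underflow", "overflow"]
exceptions = ["none", "select_negated", "reset_mid_block"]

scenarios = list(product(modes, flow, exceptions))
print(len(scenarios))   # 8 * 3 * 3 = 72 scenarios, before boundaries are even considered
```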

The second step is a thorough test of the defined modes and scenarios. Ideally these tests can be automated in both execution and evaluation. In an ongoing system development they become known as the regression test, which is run on a snapshot of the database on a repeating schedule to confirm that verified functionality stays stable as the implementation evolves with features and corrections. This step is usually fairly quick to develop but time-consuming to execute. If you have the resources, it can be parallelized, even with the other two steps.
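A minimal sketch of that kind of automated, parallelized regression runner. The test names and the simulate() hook are placeholders for whatever actually launches a simulation and scores its result.

```python
# Run every test in the suite against a snapshot of the database, in parallel,
# and report pass/fail. simulate() stands in for the real launch-and-check step.
from concurrent.futures import ProcessPoolExecutor

def simulate(test_name):
    """Placeholder: launch one simulation and evaluate its result."""
    return test_name, True  # (name, passed)

TESTS = ["basic_mode", "mode_2_overflow", "mode_5_underflow", "crc_block_check"]

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(simulate, TESTS))
    failed = [name for name, passed in results if not passed]
    print("PASS" if not failed else f"FAIL: {failed}")
```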

The third step I refer to as stress testing. In this step the boundaries of operation are explored for both proper operation and proper handling of exception scenarios. It usually involves at least two types of testing: directed run-up to a boundary and randomized testing. Where the boundaries of operation are known and have defined limits, specific tests are written that run up to the limit, cross it, and return to within limits. However, with many modes of operation, many defined limits, and many externally controlled inputs, it is often difficult to prioritize the testing of every possible exception condition. It may even be difficult to test every possible proper operational condition. This is where randomized testing can be applied.
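A toy version of the directed run-up described above, with the rate limit and the health check standing in for the real ones: approach the limit, cross it so the overflow handling is exercised, then return below it and confirm the stream recovers.

```python
# Directed boundary test sketch: run up to the rate limit, cross it, return.
RATE_LIMIT = 1_000_000  # samples/s the SPI link can sustain (assumed)

def rate_profile():
    """Approach the limit, cross it, then return to within limits."""
    return [RATE_LIMIT - 10, RATE_LIMIT, RATE_LIMIT + 10, RATE_LIMIT - 10]

def stream_healthy(rate):
    """Placeholder for driving the capture at 'rate' and checking the output."""
    return rate <= RATE_LIMIT

for rate in rate_profile():
    expected = rate <= RATE_LIMIT
    assert stream_healthy(rate) == expected, f"unexpected behavior at {rate}"
print("boundary run-up passed")
```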

Randomized testing used to be the Holy Grail and was the part of testing that would get trimmed away when marketing and budgetary pressures pushed. Now, however, because of the exponential rule, it is replacing directed mode and boundary testing. This is what has driven the SystemVerilog language and the formal verification methodologies. The proper term might be constrained-random testing, and the difficulty is measuring coverage and tracing failures to root cause, especially in fault-tolerant systems.
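A toy constrained-random sketch: stimulus is drawn at random but constrained to legal values, and coverage bins record which combinations were actually hit. The constraints and bin definitions here are invented for the example; measuring that coverage, and tracing failures back to root cause, is the hard part in practice.

```python
# Constrained-random stimulus with crude coverage bins.
import random
from collections import Counter

random.seed(0)
coverage = Counter()

for _ in range(1000):
    mode = random.randrange(8)                     # constraint: legal modes only
    burst = random.choice([1, 4, 16, 64])          # constraint: legal burst sizes
    rate_ratio = random.uniform(0.5, 1.5)          # may land either side of the limit
    flow = "overflow" if rate_ratio > 1.0 else "nominal"
    coverage[(mode, burst, flow)] += 1             # record which bin was exercised

total_bins = 8 * 4 * 2
print(f"hit {len(coverage)} of {total_bins} coverage bins")
```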

Even if you do a thorough job of testing, the product will fail in the field. Rule #5: All products have failures lurking, regardless of the amount of testing or experience in the field. The concern is the likelihood, frequency, severity, and recovery of the failures. I offer the following wager to anyone: $1 million USD to anyone who can prove a product has no failure modes, as long as they will match the wager if I can prove it has one. The value of a product can best be measured by the thoroughness of its errata sheet. We even see this today in people comparison shopping by looking at online customer reviews (really customer complaints). If a product has well-documented complaints, we can evaluate them and determine whether those shortcomings affect how we are going to use (value) the product. Rule #6: Testing and documenting errata is often more valuable than fixing errata.

So, am I doing all three steps of testing, including randomized testing, on my SPI feature? Probably not; what do you expect for free? I am currently testing the basic mode, and when that is done I will test all 16 mode scenarios. I might add it to the regression suite for the whole product, and I will look at the most obvious exception modes, primarily transfer truncations due to unexpected negation of the SPI slave select lines. I will leave randomized testing to my client in the field, and he and I will both benefit from a re-programmable FPGA instead of a multi-million-dollar ASIC investment. Rule #7: Product integrity is limited by product value.
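As a closing sketch of that truncation exception, again with assumed format details: if slave select negates mid-transfer, the block arrives short and its CRC no longer matches, so the host-side tooling should flag the partial data rather than silently accept it.

```python
# Truncation check sketch: a block cut short by an unexpected select negation
# should fail its CRC check on the host side. CRC-32 trailer is an assumption.
import zlib

def block_with_crc(payload):
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def block_ok(block):
    payload, crc = block[:-4], int.from_bytes(block[-4:], "big")
    return zlib.crc32(payload) == crc

good = block_with_crc(bytes(range(48)))
truncated = good[:30]          # transfer cut short mid-block
print(block_ok(good), block_ok(truncated))   # expect: True False
```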

 
