Rooting out impossible bugs

The use of effective embedded diagnostics to identify failures in today’s high-availability systems becomes more critical as semiconductor and board technologies increase in speed and integration. But many OEMs seem to be overlooking this vital aspect of their products’ post-sales operation, possibly because of their drive for shorter lead times and reduced costs. This myopia has led to many newsworthy stories in recent times. But it wasn’t always like this…

When I was a young field service engineer with Nortel Networks, I was often assigned the task of troubleshooting the most intermittent, elusive bugs in its DMS-100 central office voice switch software. With my software design background, I had found that I had a talent for looking at other people’s code and figuring out where it was broken. But I distinctly remember one bug that taxed my problem-solving skills to the utmost.

Back in the 1980s, Nortel’s flagship product, the DMS-100, was enjoying great success as one of the first digital voice switches on the market. The result of years of R&D effort, the switch was almost totally proprietary: it had a proprietary processor at its heart (the “NT-40”), a proprietary operating system (SOS, the Switch Operating System) and a proprietary programming language (PROTEL, the Procedure Oriented Type Enforcing Language). The system was built from the ground up with reliability in mind, since downtime for 911 calls was simply unacceptable. I was lucky enough to work in Customer Service on this new product, and every day was a new adventure. The platform was new, and we had a lot of powerful software debugging tools to help us troubleshoot issues. Customers were always delighted whenever an exotic problem popped up and we were able to dive in and fix the software in fairly short order.

But one day we got a call from the switching center at Teleglobe’s office in Montreal, Canada. Teleglobe’s switches carried all of Canada’s international calls to foreign exchanges, and they had just received a new software release. Over a period of several days, the staff there found that some of the voice-switching peripherals were spontaneously going into an In-System Trouble (ISTb) state. This was not an emergency, because each of the peripherals was mated in a redundant pair configuration. But if the other half of a pair were to go into trouble at the same time, it could turn into a full-fledged outage. Something had to be done.

When I started looking at it, I was initially stumped. The system was not putting out any logs when the peripheral went ISTb. That was very unusual: normally, when a peripheral underwent a state transition, there was some sort of log, alarm, or operational measurement. So I had no marker to help me track down the root cause. It appeared to be a problem with the low-level maintenance subsystem (module “MTCBASE”), but I could not be sure.

Fortunately, Nortel’s proprietary platform had an excellent embedded diagnostics capability, DEBUG: you could “instrument” a software module by inserting breakpoints at suspicious sections of the code. So I logged into Teleglobe’s switch, placed a few well-chosen breakpoints, and then sat and waited. Sure enough, after a few hours, one of the breakpoints was hit. I was able to extract some key debug information which told me exactly what was happening in the code at the time of the failure. As it turned out, the new software release had left a dangling “else” statement in a key portion of the maintenance software. We created a quick software patch and the problem never occurred again.
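
For readers who have not run into one, here is a minimal sketch of how a dangling “else” can produce exactly this kind of silent state change. It is written in Python purely for illustration, with the else branch indented the way a compiler would actually pair it. The actual PROTEL code is not shown in this article, so every name below (update_buggy, update_fixed, probe, the link_up and audit_failed flags) is hypothetical, and the probe function is only a loose stand-in for a breakpoint, not a model of DEBUG.

```python
# Hypothetical illustration only: none of these names come from the DMS-100 code.

def update_buggy(link_up, audit_failed, logs):
    """The else was meant for the outer if, but it pairs with the inner one."""
    state = "OK"
    if link_up:
        if audit_failed:
            logs.append("audit failed")
        else:
            # Reached when the link is up and the audit passed:
            # a healthy unit is marked ISTb and nothing is logged.
            state = "ISTb"
    return state


def update_fixed(link_up, audit_failed, logs):
    """Same statements, with the else paired with the outer if as intended."""
    state = "OK"
    if link_up:
        if audit_failed:
            logs.append("audit failed")
    else:
        state = "ISTb"
    return state


def probe(fn, link_up, audit_failed):
    """Breakpoint stand-in: capture the inputs when ISTb appears with no log."""
    logs = []
    state = fn(link_up, audit_failed, logs)
    if state == "ISTb" and not logs:
        return {"link_up": link_up, "audit_failed": audit_failed}
    return None


if __name__ == "__main__":
    print(probe(update_buggy, True, False))
    print(probe(update_fixed, True, False))
```

In this sketch the probe fires for the buggy version on a perfectly healthy unit, which is the kind of evidence a well-placed breakpoint can hand you when the system itself reports nothing.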

Now, over twenty years later, it seems that Nortel’s system debugging capabilities were way ahead of their time. There are many purportedly high-availability systems, such as routers, switches, storage systems, servers, gateways and wireless base stations, that even today don’t have capabilities equivalent to Nortel’s DEBUG, PMIST, MSGTRACE, and other utilities. It’s my hope that OEMs will focus more on quality and reliability so they can raise the availability of their systems and get to the root cause of some of the most challenging bugs. Effective embedded diagnostics are the way forward.

Alan Sguigna