Debugging CATERR and other failures

Debugging Intel-based (and other) systems can be a huge challenge. What new tricks are available for debugging system hangs, crashes, or application errors?

A system is characterized by the hardware, firmware and software components that comprise it. More specifically, these are:

  • System board
  • CPU
  • Chipset
  • BIOS
  • BMC
  • Virtualization environment
  • Drivers
  • OS
  • Application
  • Storage, I/O, Network, and Graphics devices

The debugging challenge is large because any one component can cause system hangs, crashes and application errors. And the growing number of buses, tighter component integration, faster I/O, and more complex firmware, BIOS and software stacks make things even tougher.

Some of the most difficult-to-debug problems occur before the OS runs, and/or when processor memory, I/O, or configuration reads time out. Some of these bugs cause a catastrophic error within the system, and the CATERR signal is asserted. One example is a processor reorder buffer timeout: http://download.intel.com/design/intarch/papers/324353.pdf. On rare occasions, the x86 Machine Check Exception handler does not execute properly. In that case, it is necessary to read the MCi_STATUS registers directly, and these are cleared at the next boot. The trick is to stop the processor at a certain boot location with a breakpoint and inspect the MC registers at that point. And of course, stopping the processor before the Machine Check event and focusing on what happens just prior to it can help enormously with debugging.
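Once a raw MCi_STATUS value has been captured at the breakpoint, it still has to be decoded by hand. The Python sketch below shows one way to do that, assuming the architectural IA32_MCi_STATUS layout (flag bits 63 down to 57, model-specific error code in bits 31:16, MCA error code in bits 15:0) and the usual MSR numbering in which bank 0's status register sits at 0x401 with four MSRs per bank. The function names and the sample value are made up for illustration; how the raw value is read out (JTAG debugger, embedded diagnostics, or otherwise) is left open, and the remaining bits are model-specific and not decoded here.

```python
# Minimal sketch: decode a raw 64-bit IA32_MCi_STATUS value.
# Helper names are hypothetical; bit positions assume the architectural layout.

IA32_MC0_STATUS = 0x401  # MSR address of bank 0's status register
MSRS_PER_BANK = 4        # each bank has CTL, STATUS, ADDR and MISC

# Architectural flag bits in IA32_MCi_STATUS.
FLAG_BITS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was uncorrected
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MCi_MISC holds valid information
    "ADDRV": 58,  # MCi_ADDR holds a valid address
    "PCC": 57,    # processor context may be corrupted
}


def mc_status_msr(bank):
    """Return the MSR address of the status register for a given bank."""
    return IA32_MC0_STATUS + MSRS_PER_BANK * bank


def decode_mci_status(raw):
    """Split a raw status value into its flags and error codes."""
    fields = {name: bool((raw >> bit) & 1) for name, bit in FLAG_BITS.items()}
    fields["model_specific_code"] = (raw >> 16) & 0xFFFF
    fields["mca_error_code"] = raw & 0xFFFF
    return fields


if __name__ == "__main__":
    sample = 0xB200000000110005  # invented value, not from a real system
    print("bank 3 status MSR:", hex(mc_status_msr(3)))
    for name, value in decode_mci_status(sample).items():
        print(name, "=", hex(value) if isinstance(value, int) and not isinstance(value, bool) else value)
```

A register with VAL clear carries no logged error, so in practice only the banks with VAL set need a closer look.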

If the bug you're trying to troubleshoot happens frequently enough that in a given system it can be reproduced fairly quickly, a benchtop JTAG emulator debugger is ideal. If the failure is more infrequent and/or not easily reproducible, having the debug routines directly in the system (as with our ScanWorks Embedded Diagnostics tool) can contribute greatly to identifying its root cause.

Alan Sguigna