Debugging Dead Boards in Production

A Case Study showed that 50% of circuit boards that tested as “dead” in manufacturing production actually had defects on their memory buses. What categories of memory interconnect defects cause a dead board?

In the blog Defects on High-Speed Memory, I showed that certain defects, such as a DQS shorted to GND, impair the memory’s performance but otherwise allow the memory to train and run. Other defects, such as an open circuit on DQ, cause an outright system failure. In many cases, the BIOS or boot loader determines how the system behaves when it encounters a catastrophic defect. Because the end user typically has little or no ability to diagnose such failures, the boot loader will quite commonly just crash or hang the system to prevent it from being used. On the production floor, this means the dead board is brought to a debug workbench, where, ideally, diagnostic tools exist to help root-cause and salvage it. In the field, the same behavior results in a warranty return.
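The distinction between defects that degrade memory and defects that kill the board can be pictured as a small lookup table. The Python sketch below is illustrative only: the names `Outcome`, `KNOWN_DEFECTS`, `classify`, and `boot_policy` are hypothetical, the table holds just the two examples mentioned above, and the halt-on-catastrophic policy is an assumption about a typical boot loader, not any specific BIOS.

```python
from enum import Enum


class Outcome(Enum):
    DEGRADED = "memory trains and runs, with impaired performance"
    DEAD = "outright system failure; board appears dead"


# Only the two examples named in the text. A real table would be
# populated from fault-injection or production test data.
KNOWN_DEFECTS = {
    ("DQS", "short_to_gnd"): Outcome.DEGRADED,
    ("DQ", "open"): Outcome.DEAD,
}


def classify(signal, defect):
    """Return the expected Outcome, or None if the defect is not in the table."""
    return KNOWN_DEFECTS.get((signal, defect))


def boot_policy(outcomes):
    """Assumed boot-loader policy: halt if any catastrophic defect is present."""
    return "halt" if Outcome.DEAD in outcomes else "continue"


if __name__ == "__main__":
    found = [classify("DQS", "short_to_gnd"), classify("DQ", "open")]
    print(boot_policy(found))
```

A board with only the DQS short would continue to boot under this assumed policy, whereas adding the DQ open makes the policy halt, which is the behavior that sends the board to the debug workbench.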

In a recent study, ASSET classified the root causes of a production run of dead boards, and categorized the failures in terms of memory-related faults on DQ, DQS, ECC, and other signals:

How to Debug Dead Boards in Production

The results of the Case Study, and a general treatment of debugging dead boards, can be found in the white paper How to Debug Dead Boards in Production.

Alan Sguigna