In early 2011, Intel discovered a design issue in its Cougar Point chipset and took an approximately $700 million charge against earnings to repair and replace affected parts and systems. What might have been the root cause, and how might it have been prevented?
System product recalls happen all the time. Some make the news, and some don’t. In previous blogs I’ve written about design issues and manufacturing variances that can affect the operating margins of a product and, especially over time, erode system performance to the point where the owner/operator sees sluggish execution, excess power consumption, or hangs and crashes. These issues can originate in both the chips and the boards that make up a system. The Intel Cougar Point problem is especially interesting because of its financial impact and its connection to some of the design errors I’ve written about.
In the AnandTech interview with Intel’s Steve Smith, the root cause of the SATA problem was traced back to a transistor in the 3Gbps PLL clocking tree. This transistor was biased with too high a voltage, which could cause SATA ports 2 through 5 to fail over time. In fact, the problem could be coaxed out by running the part at elevated temperature and voltage; Intel discovered the problem itself through thermal chamber testing. The differential, AC-coupled SATA physical layer uses embedded strobes derived from the PLL to clock the 8b/10b encoding, so leakage and drift in the PLL logic ultimately lead to clocking marginalities, a growing number of re-transmits with the associated performance hit, and eventually (months? years? never?) failure of the ports.
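To get a feel for why a slowly degrading clock would show up as a creeping performance hit long before outright port failure, here is a rough back-of-the-envelope sketch. It assumes independent bit errors and that any single bit error in a frame forces a re-transmit; the function names, the 8 KiB payload, and the bit error rates are illustrative assumptions of mine, not Intel measurements.

```python
import math


def frame_retry_probability(ber, frame_bits):
    """Probability that a frame contains at least one bit error.

    Assumes independent bit errors (a simplification) and computes
    1 - (1 - ber) ** frame_bits in a numerically stable way.
    """
    if not 0.0 <= ber < 1.0:
        raise ValueError("ber must satisfy 0 <= ber < 1")
    return -math.expm1(frame_bits * math.log1p(-ber))


def expected_transmissions(ber, frame_bits):
    """Mean number of sends per frame if every errored frame is retried."""
    p_retry = frame_retry_probability(ber, frame_bits)
    return 1.0 / (1.0 - p_retry)


if __name__ == "__main__":
    # Hypothetical 8 KiB payload; 8b/10b puts 10 line bits on the wire per byte.
    frame_bits = 8192 * 10
    for ber in (1e-12, 1e-9, 1e-6, 1e-5):
        p = frame_retry_probability(ber, frame_bits)
        sends = expected_transmissions(ber, frame_bits)
        print(f"BER {ber:.0e}: retry probability {p:.3e}, mean sends {sends:.3f}")
```

Under these assumptions the retry rate is negligible at very low error rates and climbs steeply as the error rate rises by a few orders of magnitude, which is consistent with a port that gets gradually slower before it stops working.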
Aside from testing their designs across a wide swing of temperature and voltage conditions, circuit board signal integrity (SI) engineers have other tools at their disposal to catch these kinds of defects. Stressful patterns that work the I/O logic to its utmost and provoke crosstalk, inter-symbol interference (ISI), and clock recovery problems are essential. The challenge is that SI design or manufacturing issues can exist in both the chips and the boards, so marginalities in either can compound system defects and the resulting customer dissatisfaction. Tools such as those described in the e-Book Bandwidth Tests Reveal Shrinking Eye Diagrams and Signal Integrity Problems may help catch these issues before they result in field problems or product recalls.
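As a small illustration of what a stressful pattern can look like, the sketch below generates PRBS7, a pseudo-random bit sequence commonly used in serial link testing (polynomial x^7 + x^6 + 1). The choice of PRBS7 is my assumption for the example; it is not necessarily the pattern used by the tools in the e-Book or by Intel. The helper names are likewise hypothetical.

```python
def prbs7(seed=0x7F):
    """Yield one full period (127 bits) of PRBS7, polynomial x^7 + x^6 + 1."""
    if not 0 < seed < 0x80:
        raise ValueError("seed must be a non-zero 7-bit value")
    state = seed
    for _ in range(127):
        # Feedback is the XOR of the two highest bits of the 7-bit register.
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | new_bit) & 0x7F
        yield new_bit


def longest_run(bits):
    """Length of the longest run of identical consecutive bits."""
    best = current = 0
    previous = None
    for bit in bits:
        current = current + 1 if bit == previous else 1
        best = max(best, current)
        previous = bit
    return best


if __name__ == "__main__":
    pattern = list(prbs7())
    print("period length:", len(pattern))
    print("ones in period:", sum(pattern))
    print("longest run of identical bits:", longest_run(pattern))
```

The mix of long runs and rapid transitions in such a sequence is what makes it useful: long runs starve the clock recovery circuit of edges, while isolated bits after a long run are the ones most exposed to ISI.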