The Intel Cougar Point SATA Bug

In early 2011, Intel discovered a design issue in its
Cougar Point chipset and took an approximately $700 million charge against
earnings to repair and replace affected parts and systems. What was the likely root cause,
and how might it have been prevented?

System product recalls happen all the time. Some of them make
the news, and some don’t. In previous blog posts I’ve written about design issues and manufacturing variances that can affect the operating
margins of a product and, especially over time, erode the performance of a
system to the point where the owner/operator sees sluggish
execution, excess power consumption, or hangs and crashes. These
issues can arise in both the chips and the boards that make up a system. The
Intel Cougar Point problem is especially interesting because of its financial
impact and its association with some of the design errors that I’ve written
about.

According to the AnandTech interview with Intel’s Steve Smith, the root
cause of the SATA problem was traced back to a transistor in the 3Gbps PLL
clocking tree. This transistor was biased with too high a voltage, which
could result in a failure of SATA ports 2 through 5 over time. In fact, the
problem could be coaxed out by running the part at elevated temperature and
voltage; Intel discovered the problem itself with thermal chamber testing. The
differential, AC-coupled SATA physical layer uses a clock that is derived from
the PLL and embedded in the 8b/10b-encoded data stream, so leakage and drift in the PLL logic
lead to clocking marginalities and an increasing number of
re-transmits over time, with the associated performance hit, and eventually
(months? years? never?) failure of the ports.
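
To get a feel for how a creeping bit error rate turns into re-transmits, here is a minimal sketch. It assumes independent bit errors and a hypothetical frame of 8,192 payload bytes, or 81,920 line bits after 8b/10b encoding, ignoring headers and CRC. The function names, the frame size, and the error rates are illustrative assumptions, not measurements from the Cougar Point parts.

```python
import math

# Hypothetical frame size: 8192 payload bytes, 10 line bits per byte with 8b/10b.
BITS_PER_FRAME = 8192 * 10


def frame_error_probability(ber: float, bits_per_frame: int = BITS_PER_FRAME) -> float:
    """Probability that at least one bit in a frame is corrupted.

    Assumes bit errors are independent, so P = 1 - (1 - ber) ** bits_per_frame.
    log1p/expm1 keep the result accurate when ber is very small.
    """
    if not 0.0 <= ber < 1.0:
        raise ValueError("ber must be in [0, 1)")
    return -math.expm1(bits_per_frame * math.log1p(-ber))


def expected_transmissions(ber: float, bits_per_frame: int = BITS_PER_FRAME) -> float:
    """Average number of attempts per frame if every bad frame is re-sent."""
    p = frame_error_probability(ber, bits_per_frame)
    return 1.0 / (1.0 - p)


if __name__ == "__main__":
    # Illustrative error rates only: a healthy link versus a degrading one.
    for ber in (1e-12, 1e-9, 1e-6, 1e-5):
        print(ber, frame_error_probability(ber), expected_transmissions(ber))
```

The point of the model is the shape of the curve: the frame error probability stays negligible over many orders of magnitude of bit error rate and then climbs quickly, which is consistent with a link that looks fine at first and only later shows a visible performance hit.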

Aside from testing their designs across a wide range of
temperature and voltage conditions, circuit board signal integrity (SI) engineers
have other tools at their disposal to catch these kinds of defects. Stressful
data patterns that work the I/O logic to its
utmost and provoke crosstalk, inter-symbol interference (ISI) and clock recovery problems are essential. The
challenge is that SI design or manufacturing issues can exist in both the chips
and the boards, so marginalities in either or both can exacerbate any system
defects and the resulting customer dissatisfaction. Tools such as those described
in the e-Book “Bandwidth Tests Reveal Shrinking Eye Diagrams and Signal
Integrity Problems” may help catch these before they result in field issues
or product recalls.
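
As one concrete example of a stress pattern, here is a minimal sketch of a PRBS7 generator, a pseudo-random bit sequence commonly used to exercise ISI and clock recovery on serial links. It is a generic 7-bit linear feedback shift register, not the specific pattern set used by Intel or by the tools in the e-Book; the function name and default seed are illustrative assumptions.

```python
def prbs7(seed: int = 0x7F, nbits: int = 127) -> list[int]:
    """Generate nbits of a PRBS7 sequence from a 7-bit LFSR.

    Each new bit is the XOR of the two oldest bits in the register
    (a primitive feedback polynomial of degree 7), so any nonzero
    seed yields a sequence that repeats every 2**7 - 1 = 127 bits.
    """
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be nonzero, or the LFSR locks up at all zeros")
    bits = []
    for _ in range(nbits):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1  # XOR of the two oldest bits
        bits.append(new_bit)
        state = ((state << 1) | new_bit) & 0x7F      # shift in the new bit
    return bits


if __name__ == "__main__":
    pattern = prbs7()
    print("".join(str(b) for b in pattern))
```

One period mixes long runs of identical bits with rapid toggling, which is what makes this kind of pattern useful for exposing marginal clock recovery. Longer sequences such as PRBS15 or PRBS31 follow the same idea with wider registers.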

Alan Sguigna