Structural Defects on Intel QuickPath Interconnect

A customer shared some empirical results from boundary-scan testing of the Intel QuickPath Interconnect (QPI) nets on their design. These nets cannot be covered using In-Circuit Test (ICT), and some short-circuit and open-circuit defects defy detection using conventional functional test. What did they find?

Intel QPI runs at 9.6 GT/s per lane on Haswell Xeon systems, up from 8 GT/s on Sandy Bridge Xeon, and this speed is only expected to increase in the future. At these speeds, signal integrity issues preclude the placement of ICT test pads on these nets, rendering ICT unable to provide any test coverage. And, given that QPI uses differential signaling, its I/O receivers may be able to reconstruct the incoming data stream even in the presence of board-level structural defects. So, lanes with defects may still initialize at the physical layer and train up, albeit at a degraded level of overall throughput. What happens next depends on the overall margins of the board and chip, but typically such systems are subject to reduced performance, undefined behavior, lane drop-outs, and even crashes or hangs. Unfortunately, these often occur at the customer's premises.

One manufacturer recently fired up boundary-scan test for an Intel Xeon-based server platform, and immediately began to see a 2.9% failure rate from this test step. Somewhat perplexed (since the boards were booting fine and they had not been getting any failures from their functional test step), they did some root cause analysis by first performing a 3-D X-ray of the CPU BGA sockets. This is what they saw:

QPI Xray graphic

The above pictures take a bit of explanation. In the layout picture on the left, the yellow features represent the "dog bone" via and land for node QPI1_RX_4_DP (the land is at the bottom, the via is circled in red) and the green feature circled in orange is a land for node GND. The 3-D X-ray picture on the right suggests that there may be a short between the GND land and the adjacent via belonging to QPI1_RX_4_DP.

When the processor's BGA socket was removed, a visual inspection of the PCB yielded the following:

QPI optical inspection graphic

It can be seen that the via for QPI1_RX_4_DP (circled in red) is covered by solder. Such a situation makes it easy for it to be shorted against balls at either of the two adjacent lands, which are circled here in green. What's actually happening can be depicted by a graphical representation of a cross-section of the BGA socket ball, lands and vias for the surrounding area:

QPI representation cross section ball land via

As stated above, these defects on high-speed serial I/O will often escape detection by conventional functional test, because the ports do train successfully and pass traffic, appearing to operate normally. Even more sophisticated functional test algorithms, which report the contents of the QPI error counter registers, may or may not indicate a potential failure, depending on the bit error rate induced by the defect and the duration of the functional test. The expected bit error count is the product of the bit error rate observed during the test, the number of bits transferred per second, and the test duration.
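
To make the dependence on test duration concrete, here is a minimal sketch of that arithmetic. It assumes one bit per transfer on a 9.6 GT/s lane, a lane that carries traffic for the whole test, and independent bit errors (so the error count is approximately Poisson distributed). The function names and the example bit error rate of 1e-12 are hypothetical, chosen only for illustration; they are not measurements from the customer's boards.

```python
import math

# 9.6 GT/s per lane (from the text), assuming one bit per transfer.
LANE_RATE_BITS_PER_S = 9.6e9


def expected_bit_errors(ber, duration_s, rate_bits_per_s=LANE_RATE_BITS_PER_S):
    """Expected error count on one lane: BER x bits/second x seconds."""
    return ber * rate_bits_per_s * duration_s


def detection_probability(ber, duration_s, rate_bits_per_s=LANE_RATE_BITS_PER_S):
    """Chance of logging at least one error, assuming independent bit errors
    (Poisson approximation): 1 - exp(-expected_errors)."""
    return -math.expm1(-expected_bit_errors(ber, duration_s, rate_bits_per_s))


if __name__ == "__main__":
    hypothetical_ber = 1e-12  # illustrative value only, not a measured figure
    for seconds in (10, 60, 600):
        errors = expected_bit_errors(hypothetical_ber, seconds)
        chance = detection_probability(hypothetical_ber, seconds)
        print(f"{seconds:>4} s: expected errors = {errors:.3f}, "
              f"P(at least one) = {chance:.1%}")
```

Under these assumptions, a 60-second test on a lane with the hypothetical bit error rate expects fewer than one error, so the error counters would stay at zero on a large share of runs. That is one way a marginal lane can pass a short functional test.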

For a more detailed treatment of the effects of defects on high-speed serial I/O and memory and how to detect them, read our white paper: Detection and Diagnosis of Printed Circuit Board Defects and Variances

Alan Sguigna