I was asked recently whether engineers could just check the CRC error counts coming from the Operating System to ensure they had good signal integrity and operating margins. After all, a CRC checks for bit errors, right? Here’s why this is not good enough:
We’re all aware that signal integrity (SI) has gained in importance as bus speeds have increased. With I/O technologies such as PCIe Gen3, SATA 3, USB 3.0, Intel® QuickPath Interconnect (QPI), XLAUI and others all running at 5.0 GT/s and faster, design defects and silicon and board manufacturing variances all reduce the operating margins on today’s board designs. Many of these technologies use schemes such as encoding, scrambling, and adaptive equalization to “open the eyes” on the buses, but the effects of jitter, inter-symbol interference (ISI), crosstalk and other impairments must still be simulated and tested for.
Let’s look at PCIe Gen3 for a moment. It’s an AC-coupled, differential bus which uses embedded clocking to provide a robust and survivable data path. It uses 128b/130b encoding and data scrambling to avoid long strings of consecutive identical bits (CIDs). These long strings are the bane of signal integrity, as they dramatically impact the clock recovery (CR) circuits’ ability to lock and hold. Encoding and scrambling also have the benefit of achieving DC balance in the bit stream, reducing data wander and improving error recovery.
PCI Express encapsulates its data within a TLP (Transaction Layer Packet), which carries a CRC (Cyclic Redundancy Check) that “protects” the entire packet (with the exception of the framing start/end bytes). A TLP looks like this:
For PCIe 3, the link CRC (LCRC) is 32 bits wide because of the large, variable-sized payload. The end-to-end CRC (ECRC) provides some level of data integrity across multiple link hops. For other buses like QPI, which use a smaller, fixed-sized payload, the link CRC is 8 bits.
Now that we’ve covered that background, let’s look at the four reasons why CRC checking is inadequate, versus pattern-based checking:
1. It takes a long time to detect failures at nominal voltage and time.
PCIe Gen3 runs at roughly 8 Gbps and is rated within the PCI-SIG specification for one bit error in 10^12. When I say “rated”, this means that at nominal voltage and time, the bit error rate (BER) should be below this rate. This is because the signaling schemes across all serial buses are never guaranteed to deliver the bits perfectly across the interconnects. The physical layer is always designed to minimize the probability of incorrect transmission and/or reception of a bit, not down to zero, but down to the rated BER. At or below this BER, the physical layer allows routine transmission errors to occur, and recovery mechanisms in the link layer are employed so that the higher-level functions are not aware of or affected by these errors.
So given that, to see errors above the BER threshold, you have to run traffic for a long time. The confidence level in the bus is given by the equation in our whitepaper Platform Validation using Intel Interconnect Built-In Self Test (Intel IBIST). For a bus like QPI, which is rated at 1 in 10^14, achieving a high confidence level can take days or weeks. Engineers don’t have weeks to test signal integrity given today’s aggressive design delivery schedules.
2. It doesn’t give the design’s true margins under real-world conditions.
When system-level, OS-based testing using CRCs is performed, it is usually done at nominal time and voltage, and when the design is “fresh”, that is, under “perfect” conditions. But we’ve seen that Process/Voltage/Temperature (PVT) effects can result in a wide swing in margins. That’s why silicon vendors bin their devices based upon their performance at the fringe of the envelope. And drift in high-speed I/O circuits (aging of capacitors, variations in power supplies, the effect of current leakage on gates over time, etc.) will also negatively impact operating margins. So just testing a board design under ideal conditions may mislead you into thinking that signal integrity is OK.
Testing margins with a worst-case extreme stress synthetic pattern which maps to an eye mask that takes drift and PVT effects into account will address this.
3. CRC is not perfect
A CRC uses polynomial arithmetic to create a checksum over the data it is intended to protect. The design of the CRC polynomial depends on the maximum total length of the block to be protected (data + CRC bits), the desired error protection features, the type of resources available for implementing the CRC, and the desired performance. Trade-offs between these are quite common. For example, a typical PCI Express 3.0 packet CRC polynomial is:
x^32 + x^26 + x^23 + x^22 + x^16 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
Whereas for Ethernet frames, the CRC generator may use the following polynomial:
x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
The PCI Express 3.0 CRC-32 for the TLP LCRC will detect 1-bit, 2-bit, and 3-bit errors. 4-bit errors may escape detection. Bit slips or adds have no guarantee of detection. “Burst” errors of 32 bits or less will likely be detected.
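To make “may escape detection” concrete, here is a minimal Python sketch of a textbook CRC. It assumes a zero initial value, no final XOR, and no bit reflection, and it uses the Ethernet polynomial quoted above. Real link-layer implementations differ in those details, and the helper names are hypothetical. The point it illustrates is general, though: any error pattern that is itself a multiple of the generator polynomial leaves the remainder unchanged, so the check still passes.

```python
# Ethernet CRC-32 generator including the x^32 term (assumed for illustration).
GEN = 0x104C11DB7
CRC_BITS = 32

def poly_mod(value, gen=GEN):
    """Remainder of value divided by gen, both treated as polynomials over GF(2)."""
    gen_degree = gen.bit_length() - 1
    while value.bit_length() > gen_degree:
        # Cancel the current top bit of value with a shifted copy of gen.
        value ^= gen << (value.bit_length() - 1 - gen_degree)
    return value

def make_codeword(message):
    """Append the CRC so the whole codeword is divisible by the generator."""
    return (message << CRC_BITS) | poly_mod(message << CRC_BITS)

def passes_crc(codeword):
    return poly_mod(codeword) == 0

message = 0xDEADBEEFCAFE                 # arbitrary example payload
codeword = make_codeword(message)
single_bit = codeword ^ (1 << 17)        # one flipped bit: always caught
sneaky = codeword ^ (GEN << 5)           # 15 flipped bits forming a multiple of GEN

print(passes_crc(codeword), passes_crc(single_bit), passes_crc(sneaky))
# prints: True False True
```

The corrupted codeword in the last case differs from the original in 15 bit positions, yet the receiver would accept it as good data.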
For QPI, the 8-bit CRC can detect the following within flits:
- All 1b, 2b, and 3b errors
- Any odd number of bit errors
- All bit errors of burst length 8 or less
- Burst length refers to the number of contiguous bits in error in the payload being checked (i.e. ‘1xxxxxx1’).
- 99% of all errors with burst length 9
- 99.6% of all errors of burst length > 9
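The burst percentages above can be sanity-checked by brute force. The sketch below assumes a generic 8-bit generator, x^8 + x^2 + x + 1 (0x107), chosen only for illustration; it is not claimed to be the polynomial QPI actually uses. It enumerates every error pattern of a given burst length (first and last bit in error, anything in between) and counts how many divide evenly by the generator and would therefore slip through.

```python
# Hypothetical CRC-8 generator x^8 + x^2 + x + 1; an assumption, not QPI's polynomial.
GEN8 = 0x107

def poly_mod(value, gen=GEN8):
    """Remainder of value divided by gen over GF(2)."""
    gen_degree = gen.bit_length() - 1
    while value.bit_length() > gen_degree:
        value ^= gen << (value.bit_length() - 1 - gen_degree)
    return value

def burst_patterns(burst_len):
    """All error patterns whose first and last bits are 1 and span burst_len bits."""
    if burst_len == 1:
        return [1]
    top = 1 << (burst_len - 1)
    return [top | (middle << 1) | 1 for middle in range(1 << (burst_len - 2))]

def undetected_fraction(burst_len, gen=GEN8):
    """Fraction of bursts of this length that leave a zero remainder."""
    patterns = burst_patterns(burst_len)
    missed = sum(1 for p in patterns if poly_mod(p, gen) == 0)
    return missed / len(patterns)

for length in (4, 8, 9, 10, 12):
    detected = 100.0 * (1.0 - undetected_fraction(length))
    print(f"burst length {length:2d}: {detected:.2f}% detected")
```

With this assumed polynomial, every burst of length 8 or less is caught, 1 in 128 bursts of length 9 is missed (about 99.2% detected), and 1 in 256 longer bursts is missed (about 99.6% detected), which lines up with the figures in the list above.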
4. OS-based traffic CRC checking doesn’t really stress much
Most CRC-based tests saturate the bus with heavy OS-based traffic, e.g. streaming video. This can get bus utilization up over 90% or so. This normal functional traffic is subject to 128b/130b encoding and scrambling, which reduces the occurrence of long strings of 31 consecutive identical bits (CIDs) to below 10^-12, which is the BER threshold. But stressing clock recovery (CR) circuits requires exercising them with longer CIDs. Just running traffic and checking CRCs doesn’t cut it.
“Synthetic” or “killer” patterns are necessary to aggravate all reasonably likely ISI (inter-symbol interference), challenge the ability of clock recovery circuits to lock and hold, and check receiver circuitry against drift. A PRBS31 pattern fulfills these criteria. More detail on PRBS31 is available here. The intent is to generate the most stressful patterns possible, then check the bits one-by-one. It doesn’t get more precise than that.
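For readers who haven’t worked with PRBS31, here is a small software sketch of the idea. It assumes the standard PRBS31 generator polynomial x^31 + x^28 + 1 and an all-ones seed. Real hardware generators often also invert the output, which is omitted here, and the function names are made up for the example. The checker simply compares received bits against the expected sequence one at a time.

```python
from itertools import islice

MASK31 = (1 << 31) - 1

def prbs31(seed=MASK31):
    """Yield PRBS31 bits from a 31-bit LFSR with taps at x^31 and x^28.

    Bit 0 of the state holds the newest bit; bit 30 holds the oldest.
    """
    state = seed & MASK31
    if state == 0:
        raise ValueError("seed must be non-zero or the LFSR stays stuck at zero")
    while True:
        new_bit = ((state >> 30) ^ (state >> 27)) & 1
        state = ((state << 1) | new_bit) & MASK31
        yield new_bit

def count_bit_errors(received, expected):
    """Compare two bit sequences position by position and count mismatches."""
    return sum(r != e for r, e in zip(received, expected))

expected = list(islice(prbs31(), 10_000))

# Simulate a channel that corrupts three bits.
received = list(expected)
for position in (123, 4567, 9000):
    received[position] ^= 1

print(count_bit_errors(received, expected))  # prints: 3
```

Unlike a CRC, this kind of check has no blind spots: every flipped bit is counted, no matter how the errors are arranged.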
So after all this, let’s ask, why is this important? Well, obviously, bad signal integrity in a design runs the high risk of uncorrectable errors or system crashes, resulting in costly field repairs or even product recalls. Also, let’s not forget power consumption: if SI is not optimized, any adaptive equalization can increase power requirements by 15% - 30%. So if your signal integrity is bad…
SI only gets worse over time. All systems run slower and eventually start to hang or crash over time. You want your design to run clean when it is first shipped.