Signal Integrity Review: DDR4/DDR5 routing rules, timing budgets, and topology requirements
In DDR4/DDR5, data signals (DQ) are captured by their associated strobe (DQS). Each byte lane has 8 data bits and one differential strobe pair. The timing relationship between DQ and DQS within a byte lane is the most critical matching requirement.
Byte Lane 0: DQ[0:7], DQS0/DQS0#, DM0/DBI0
Byte Lane 1: DQ[8:15], DQS1/DQS1#, DM1/DBI1
Byte Lane 2: DQ[16:23], DQS2/DQS2#, DM2/DBI2
Byte Lane 3: DQ[24:31], DQS3/DQS3#, DM3/DBI3
(For x32 interface - 4 byte lanes)
Each byte lane operates INDEPENDENTLY:
- DQS0 captures DQ[0:7] - only these 8 bits need to match DQS0
- DQ[0] does NOT need to match DQ[8] (different byte lanes)
- DQS0 does NOT need to match DQS1
| Matching Group | Tolerance | Notes |
|---|---|---|
| DQ to DQS within byte lane | +/- 10 mil (0.25 mm) | Most critical - determines data valid window |
| DQS+ to DQS- (intra-pair) | +/- 5 mil (0.13 mm) | Differential pair skew |
| DM/DBI to DQS within byte lane | +/- 10 mil | Same as DQ-DQS |
| Byte lane to byte lane (DQS-DQS) | +/- 100 mil (2.5 mm) | Controller compensates per-lane |
| Address/Command to CLK | See fly-by section | Different rules - fly-by topology |
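The per-group tolerances above lend themselves to an automated rule check. A minimal Python sketch (net names and routed lengths are hypothetical examples, not from any real design):

```python
# Sketch: check routed lengths against the matching-group tolerances above.
# Net names and lengths (in mils) are hypothetical.
TOLERANCE_MIL = {
    "dq_to_dqs": 10,       # DQ (and DM/DBI) vs DQS within a byte lane
    "dqs_intra_pair": 5,   # DQS+ vs DQS-
    "lane_to_lane": 100,   # DQS-to-DQS across byte lanes
}

def violations(reference_mil, net_lengths_mil, tol_mil):
    """Return {net: deviation} for nets outside +/-tol_mil of the reference."""
    return {net: length - reference_mil
            for net, length in net_lengths_mil.items()
            if abs(length - reference_mil) > tol_mil}

# Byte lane 0 checked against DQS0 at 1450 mil
lane0 = {"DQ0": 1452, "DQ1": 1448, "DQ6": 1455, "DQ7": 1439}
bad = violations(1450, lane0, TOLERANCE_MIL["dq_to_dqs"])
# DQ7 deviates -11 mil, outside the +/-10 mil budget
```

Note that each byte lane gets its own reference (its DQS), so the check runs once per lane, never across lanes.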
Impedance targets (DDR4):
Single-ended DQ: 40 ohm (+/- 10%) = 36-44 ohm
Differential DQS: 80 ohm (+/- 10%) = 72-88 ohm
Single-ended DQ trace width (typical L3 stripline): 5.8 mil for 40 ohm
DQS pair: 4.8 mil width, 5.0 mil space for 80 ohm differential
Spacing rules:
DQ-to-DQ within byte: 3W minimum (14 mil c-c for 4.8 mil traces)
DQS to DQ: 3W minimum (keep DQS centered in byte lane if possible)
Between byte lanes: 5W minimum (24 mil c-c) or greater
DQ/DQS to other signals (non-DDR): 4W minimum (20 mil c-c)
DDR4-2400, Byte Lane 0, all on L3 stripline:
DQS0+ : 1450 mil (reference), DQS0- : 1448 mil (2 mil skew - within 5 mil)
DQ[0]: 1452 mil (+2), DQ[1]: 1448 mil (-2), DQ[2]: 1453 mil (+3)
DQ[3]: 1447 mil (-3), DQ[4]: 1451 mil (+1), DQ[5]: 1449 mil (-1)
DQ[6]: 1455 mil (+5), DQ[7]: 1446 mil (-4), DM0: 1450 mil (0)
Max deviation from DQS: +5/-4 mil (within +/-10 mil spec)
Designer routes DQ[0:3] on L3 (stripline, Tpd = 174 ps/in) and DQ[4:7] on L1 (microstrip, Tpd = 139 ps/in). Physical lengths are matched to +/-5 mil, but the delay mismatch over 1.5 inches is 1.5 in * (174 - 139 ps/in) = 52.5 ps between the two groups. At DDR4-3200 (312.5 ps UI), 52.5 ps is 17% of the data window - likely causing failures during memory training.
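The failure above is caught by comparing delays rather than lengths. A quick sketch using the Tpd values from the scenario:

```python
# Sketch: matching lengths across layers is not enough - compare delays.
# Tpd values are the ones quoted in the scenario above.
TPD_PS_PER_IN = {"L1_microstrip": 139, "L3_stripline": 174}

def delay_ps(length_in, layer):
    return length_in * TPD_PS_PER_IN[layer]

UI_PS = 312.5  # DDR4-3200 unit interval

skew_ps = delay_ps(1.5, "L3_stripline") - delay_ps(1.5, "L1_microstrip")
window_fraction = skew_ps / UI_PS
# 52.5 ps of skew, ~17% of the data window, despite matched lengths
```

The fix is either to keep a whole byte lane on one layer, or to match in picoseconds (most routers support delay-based tuning).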
Pin swap for routability: DQ bits within a byte lane can be swapped (DQ[0] and DQ[3] can exchange pins) because the controller treats them equivalently during training. This is a powerful routing optimization. But DQ bits CANNOT be swapped between byte lanes, and DM/DQS cannot be swapped.
DRAM pin-out matters: Not all DRAM packages have byte-lane-friendly pin assignments. Verify the specific DRAM part number's pin map. Some x16 parts split one "byte lane" across both sides of the BGA, making routing very difficult.
DDR4 address/command (ADDR/CMD) signals use a fly-by topology where the signal visits each DRAM in sequence. The timing is different from DQ - address/command signals are sampled only on the rising edge of the clock (single data rate, so the timing budget is more relaxed than DQ), and the fly-by delay is compensated by write leveling.
Address bus: A[0:16], BA[0:1], BG[0:1]
Command bus: ACT#, CS# (active LOW); RAS#/CAS#/WE# are multiplexed onto A16/A15/A14 when ACT# is HIGH
Control: CKE, ODT, CS#[0:3] (active LOW chip select)
Clock: CK/CK# (differential, fly-by routing)
Reset: RESET# (can be slower, less critical)
Total ADDR/CMD signals: ~25-30 signals (depends on topology)
Key characteristic: ADDR/CMD signals are COMMON to all DRAMs
(unlike DQ which is per-DRAM). They use fly-by topology.
Fly-by means: Signal routes from controller to DRAM0, then DRAM0 to DRAM1, etc.
Each DRAM sees the signal at a different time (intentional skew).
The clock also flies by in the same order, maintaining the clock-data relationship.
Routing requirements:
1. ADDR signals must match each other: +/- 25 mil within the address group
2. CLK must match ADDR at each DRAM (not globally - per-DRAM matching)
3. CKE, ODT, CS# match to within +/- 25 mil of CLK at their respective DRAM
Fly-by segment lengths (example: 2-DRAM rank):
Controller to DRAM0: L0 (this is the first fly-by segment)
DRAM0 to DRAM1: L1 (second segment - "fly-by stub")
Total length to DRAM1: L0 + L1
CLK routing:
CLK must follow the SAME physical path as ADDR (same order of DRAMs)
CLK at DRAM0 must arrive at the same time as ADDR at DRAM0
CLK at DRAM1 must arrive at the same time as ADDR at DRAM1
This means CLK-to-ADDR matching is PER DRAM, not end-to-end!
2-DRAM DDR4 fly-by routing:
CLK route: Controller --[800 mil]--> DRAM0 --[600 mil]--> DRAM1
A[0] route: Controller --[795 mil]--> DRAM0 --[605 mil]--> DRAM1
Matching at DRAM0: CLK=800, A[0]=795: skew = 5 mil (within 25 mil)
Matching at DRAM1: CLK=1400, A[0]=1400: skew = 0 mil (within 25 mil)
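The per-DRAM check in this example generalizes to a cumulative-length comparison. A small sketch (segment lengths taken from the example above):

```python
# Sketch: fly-by matching is evaluated at EACH DRAM using cumulative length.
# Segments are (controller->DRAM0, DRAM0->DRAM1, ...) in mils.
def cumulative_mil(segments):
    totals, running = [], 0
    for seg in segments:
        running += seg
        totals.append(running)
    return totals

clk = cumulative_mil([800, 600])   # CLK length at DRAM0, DRAM1
a0  = cumulative_mil([795, 605])   # A[0] length at DRAM0, DRAM1

skew_per_dram = [abs(c - a) for c, a in zip(clk, a0)]
# [5, 0] mil - both DRAMs within the +/-25 mil per-DRAM budget
```

Note a short first segment and a long second segment can cancel (as here: -5 then +5), which is why the check must run at every node, not just at the end of the chain.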
Write leveling is a DDR4 training mechanism that compensates for the fly-by clock routing delay. Each DRAM receives the clock at a different time due to fly-by propagation. Write leveling adjusts the DQS phase at each DRAM so writes occur at the correct time.
Problem solved by write leveling:
CLK arrives at DRAM0 first, DRAM1 second (fly-by delay)
DQ/DQS connect directly (point-to-point, not fly-by)
Without compensation: DQS arrives at all DRAMs at the same time,
but CLK arrives at different times. DQS-CLK relationship differs per DRAM.
Write leveling operation:
1. Controller sends DQS to DRAM during write leveling mode
2. DRAM samples DQS with its internal clock (derived from CLK)
3. DRAM returns the sampled value on DQ[0]
4. Controller shifts DQS phase until it sees 0-to-1 transition on DQ[0]
5. This establishes the correct DQS-to-CLK phase for each DRAM
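Steps 1-5 above can be modeled as a simple delay sweep. This toy sketch finds the DQS delay at which the sampled clock first transitions 0 to 1; the clock period, fly-by arrival time, and step size are all assumed for illustration, not taken from any controller spec:

```python
# Toy model of the write-leveling sweep: delay DQS in steps and watch the
# value returned on DQ[0] for the 0->1 transition, which marks the DQS
# rising edge aligning with the CK rising edge at this DRAM.
TCK_PS = 1250.0          # clock period (DDR4-1600, illustrative)
CLK_ARRIVAL_PS = 437.0   # fly-by delay of CK at this DRAM (assumed)

def dram_samples_clk_high(dqs_delay_ps):
    """1 if the DQS rising edge lands in the CK-high half-period."""
    phase = (dqs_delay_ps - CLK_ARRIVAL_PS) % TCK_PS
    return 1 if phase < TCK_PS / 2 else 0

def write_leveling(step_ps=25.0):
    prev = dram_samples_clk_high(0.0)
    t = step_ps
    while t <= TCK_PS:
        cur = dram_samples_clk_high(t)
        if prev == 0 and cur == 1:
            return t          # first delay that samples CK high
        prev, t = cur, t + step_ps
    return None
```

The converged delay (450 ps here) approximates the fly-by clock arrival at that DRAM, which is exactly the skew the controller must absorb.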
PCB design implication:
Maximum fly-by delay must be within the write leveling range of the controller.
Typical controller WL range: 0 to 2.0 ns (0 to ~12 inches fly-by at stripline speed)
If fly-by delay exceeds WL range: controller cannot compensate = FAILURE
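A sanity check of total fly-by delay against the controller's write-leveling range is a one-liner. Numbers below are the ones quoted above; confirm the actual range in the SoC datasheet:

```python
# Sketch: total fly-by delay must fit inside the write-leveling range.
TPD_STRIPLINE_PS_PER_IN = 174   # stripline propagation delay used in the text
WL_RANGE_PS = 2000              # typical controller range: 0 to 2.0 ns

max_flyby_in = WL_RANGE_PS / TPD_STRIPLINE_PS_PER_IN
# ~11.5 inches, matching the "~12 inches" figure above

def flyby_ok(total_length_in):
    return total_length_in * TPD_STRIPLINE_PS_PER_IN <= WL_RANGE_PS
```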
Over-compensating in PCB routing: Some designers try to route DQ to closer DRAMs longer and DQ to farther DRAMs shorter, attempting to manually compensate the fly-by delay. This is WRONG - it fights against the write leveling algorithm and can push the DQS phase outside the controller's compensation window.
Multi-rank considerations: In dual-rank configurations (2 sets of DRAMs sharing DQ bus), write leveling must work for both ranks. The fly-by topology should keep both ranks in a similar position relative to the DQ bus.
On-Die Termination (ODT) in DDR4 provides impedance matching at the DRAM and controller. Proper ODT configuration is critical for signal integrity at high data rates. The ODT values affect signal amplitude, ringing, and power consumption.
| Component | ODT Setting | Typical Value | When Active |
|---|---|---|---|
| DRAM DQ (write) | RTT_WR | 120/240 ohm | During writes to this rank |
| DRAM DQ (read, non-target) | RTT_NOM | 60/120 ohm | During reads from other rank |
| DRAM DQ (idle) | RTT_PARK | 240/Disabled | When rank is idle |
| Controller DQ | Ron (drive) | 34/48 ohm | During writes (driving) |
| Controller DQ | ODT (receive) | 40/60/80 ohm | During reads (termination) |
| DRAM CA (command/addr) | Not available | - | No on-die CA termination in DDR4; terminate externally (e.g., at end of fly-by) if needed |
Single-rank write (controller driving, DRAM receiving):
Driver: Controller Ron = 34 ohm
Trace: Zo = 40 ohm
Load: DRAM RTT_WR = 120 ohm
Reflection at DRAM: rho = (120-40)/(120+40) = 0.5 (significant!)
But POD signaling terminates to Vddq rather than a mid-rail Vtt, so the effective load differs:
Effective Z at DRAM = RTT_WR pulling to Vddq - it acts as the termination
Single-rank read (DRAM driving, controller receiving):
Driver: DRAM Ron = 34 ohm
Trace: Zo = 40 ohm
Load: Controller ODT = 60 ohm
Reflection at controller: rho = (60-40)/(60+40) = 0.2 (acceptable)
Optimal matching for 40-ohm trace:
Ideal termination = Zo = 40 ohm
Controller ODT = 40 ohm gives rho = 0 (perfect match)
But 40 ohm draws more current: I = 1.2V/40 = 30 mA per pin
Trade-off: 60 ohm ODT = slight mismatch (rho = 0.2) but ~33% less termination power
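The reflection and power numbers above come from two one-line formulas, sketched here:

```python
# Sketch: reflection coefficient at a resistive termination, and static
# termination current for POD signaling (values from the examples above).
def rho(termination_ohm, z0_ohm):
    """Reflection coefficient at the load end of a line of impedance z0."""
    return (termination_ohm - z0_ohm) / (termination_ohm + z0_ohm)

def term_current_ma(vddq_v, odt_ohm):
    """Worst-case static current when driving low against an ODT to Vddq."""
    return vddq_v / odt_ohm * 1000

Z0 = 40.0
rho_wr = rho(120, Z0)              # write into RTT_WR = 120 ohm: 0.5
rho_rd = rho(60, Z0)               # read into controller ODT = 60 ohm: 0.2
i_40 = term_current_ma(1.2, 40)    # ~30 mA per pin
i_60 = term_current_ma(1.2, 60)    # ~20 mA per pin, ~33% less
```

These formulas only cover resistive terminations; run IBIS simulation (below) before committing ODT settings, since driver nonlinearity and package parasitics shift the real numbers.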
Micron DDR4 IBIS Models: Download from Micron website. Includes IBIS models with configurable ODT settings (RTT_NOM, RTT_WR, RTT_PARK, Ron). Use with HyperLynx or Sigrity.
JEDEC Board SI Tool: Free simulation tool from JEDEC that models DDR4 channels with configurable ODT. Good for initial ODT exploration before full simulation.
DDR4 uses internal VREF (generated inside the DRAM via Mode Register settings) for data signals, but external VREF may be needed for command/address signals on some controllers. VREF quality directly affects the noise margin of the entire DDR interface.
DDR4 VREF for DQ (internal to DRAM):
- Training Range 1: 60% to 92.5% of Vddq (in 0.65% steps)
- Training Range 2: 45% to 77.5% of Vddq (in 0.65% steps)
- Default: 70% of Vddq for Range 1 = 0.7 * 1.2V = 0.84V
- Adjusted during VREF training to center the eye
DDR4 VREF for CA (controller-side):
- Some controllers use internal VREF (generated from Vddq)
- Some require external VREF pin: must be 0.5 * Vddq = 0.6V
- External VREF accuracy: +/- 0.5% (= +/- 6 mV)
- Noise on VREF directly subtracts from noise margin
External VREF circuit (if needed):
Option 1: Resistor divider from Vddq
R1 = R2 = 200 ohm (divides by 2), decoupled with 100nF + 1uF
Accuracy depends on resistor tolerance: 0.1% resistors give 0.1% VREF accuracy
Option 2: Dedicated VREF IC (e.g., TI REF3312, a 1.25 V reference divided down to 0.6 V) or a resistor-programmed LDO
Better noise performance, but adds cost and complexity
VREF noise budget: With DDR4-3200 having approximately 260 mV noise margin per side, even 5 mV of VREF noise consumes 2% of the budget. At 10 mV, it's nearly 4%. Keep VREF noise below 2 mV peak-to-peak for DDR4-3200 designs.
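The divider and noise-budget arithmetic above can be sketched in a few lines (Vddq = 1.2 V and the 260 mV margin are the values from the text):

```python
# Sketch: external VREF divider output and VREF-noise share of the margin.
VDDQ_V = 1.2

def divider_vref(vddq_v, r1_ohm, r2_ohm):
    """VREF from a resistor divider between Vddq and ground."""
    return vddq_v * r2_ohm / (r1_ohm + r2_ohm)

def noise_fraction(noise_mv, margin_mv=260):
    """Share of the per-side noise margin consumed by VREF noise."""
    return noise_mv / margin_mv

vref = divider_vref(VDDQ_V, 200, 200)   # 0.6 V nominal (0.5 * Vddq)
pct_5mv = noise_fraction(5)             # ~2% of a 260 mV margin
```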
Internal VREF training: DDR4 DQ VREF is internally generated and trained. The PCB designer doesn't need to provide it externally. But the training algorithm needs sufficient eye opening to converge. If the eye is already marginal from other SI issues, VREF training may not find an optimal point.
DDR4 mandates a fly-by topology for clock and command/address signals. Unlike the T-branch topology used through DDR2, fly-by (introduced with DDR3) routes signals in a daisy-chain fashion past each DRAM. This improves signal quality at high data rates but introduces skew that write leveling must compensate.
Topology comparison:
DDR2 (T-branch): Controller ---+--- DRAM0
                               |
                               +--- DRAM1
Clock arrives at both DRAMs simultaneously (matched stub lengths)
DDR4 (Fly-by): Controller----DRAM0----DRAM1----DRAM2----DRAM3
Clock arrives at DRAM0 first, DRAM3 last
Delay from DRAM0 to DRAM3 can be 1-2 ns (compensated by write leveling)
Why fly-by is better at high speed:
- Eliminates T-branch reflection from unterminated stub
- Each DRAM sees only forward-traveling wave (no reflection from next DRAM)
- DRAM input capacitance (2-3 pF) is absorbed into transmission line
- Signal quality at each DRAM is better than T-topology above 1 GHz
Segment matching (ADDR/CMD signals):
All ADDR/CMD signals in a group must be matched to +/- 25 mil at EACH DRAM.
CLK pair must follow the same fly-by path and match ADDR group.
Fly-by stub length (from via to DRAM pad):
Keep stub from fly-by trace to DRAM pad as short as possible.
Maximum stub: 100 mil (absolute max), target: < 50 mil.
Long stubs create reflections that degrade signal quality at high speeds.
DRAM loading:
Each DRAM presents approximately 2 pF input capacitance per pin.
4 DRAMs = 8 pF total loading on fly-by network.
This loading reduces the effective impedance and increases rise time.
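The impedance drop from distributed DRAM loading can be estimated with the standard loaded-line approximation. A sketch (per-segment length and load capacitance are the figures used above; treat the result as a first-order estimate, not a substitute for simulation):

```python
# Sketch: effective impedance of a fly-by segment loaded by DRAM input
# capacitance, using Z_eff = Z0 / sqrt(1 + C_load / C_line).
import math

def loaded_impedance(z0_ohm, tpd_ps_per_in, seg_len_in, c_load_pf):
    # Intrinsic capacitance of the segment: C = Tpd / Z0 (ps/ohm == pF)
    c_line_pf = tpd_ps_per_in * seg_len_in / z0_ohm
    return z0_ohm / math.sqrt(1 + c_load_pf / c_line_pf)

# 40-ohm fly-by, 174 ps/in, 500 mil (0.5 in) between DRAMs, 2 pF per DRAM
z_eff = loaded_impedance(40, 174, 0.5, 2.0)
# effective impedance drops to roughly 29 ohm
```

This is why some designers deliberately route the fly-by segments at a higher unloaded impedance, so the loaded impedance lands back near the target.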
Fly-by termination:
Some designs add a termination resistor at the end of the fly-by chain:
Rt = 100-200 ohm to Vtt (reduces ringing at last DRAM)
Check SoC vendor recommendation - not always needed for 1-2 DRAM configs.
DDR4, 2 DRAMs, fly-by routing:
CLK: Controller(0 mil) -> 900 mil -> DRAM0(stub 30 mil) -> 500 mil -> DRAM1(stub 35 mil)
A[0]: Controller(0 mil) -> 905 mil -> DRAM0(stub 28 mil) -> 498 mil -> DRAM1(stub 32 mil)
A[0] matches CLK at DRAM0: |905-900| = 5 mil (within 25 mil)
A[0] matches CLK at DRAM1: |(905+498)-(900+500)| = |1403-1400| = 3 mil (within 25 mil)
Fly-by delay: 500 mil * 174 ps/in = 87 ps (well within WL range)
Designer routes CLK to DRAM1 first, then DRAM0, but ADDR signals go to DRAM0 first, then DRAM1. The fly-by order is reversed between CLK and ADDR. At DRAM0: CLK arrives late (it went to DRAM1 first), ADDR arrives early. The CLK-to-ADDR skew at DRAM0 equals the entire fly-by delay (800 ps), far exceeding the 25 mil (4 ps) matching requirement. Memory fails to initialize.
Cadence Allegro: Use "Route > Interactive Fly-by" feature (DDR routing wizard). Set up the fly-by chain with DRAM order, and the tool maintains matching at each node automatically.
Altium Designer: The xSignals feature supports DDR fly-by routing with per-node matching. Define xSignals from controller to each DRAM, then use length tuning with fly-by awareness.
HyperLynx: Use BoardSim to simulate the complete fly-by topology. Model DRAM input capacitance with IBIS models. Check signal quality at each DRAM independently.