
Module 2.7 - DDR Memory Interface

Signal Integrity Review: DDR4/DDR5 routing rules, timing budgets, and topology requirements

Checkpoint 2.7.1: DQ/DQS Matching Per Byte Lane (Critical)

In DDR4/DDR5, data signals (DQ) are captured by their associated strobe (DQS). Each byte lane has 8 data bits and one differential strobe pair. The timing relationship between DQ and DQS within a byte lane is the most critical matching requirement.

DDR4 Byte Lane Structure

Byte Lane 0: DQ[0:7], DQS0/DQS0#, DM0/DBI0
Byte Lane 1: DQ[8:15], DQS1/DQS1#, DM1/DBI1
Byte Lane 2: DQ[16:23], DQS2/DQS2#, DM2/DBI2
Byte Lane 3: DQ[24:31], DQS3/DQS3#, DM3/DBI3
(For x32 interface - 4 byte lanes)

Each byte lane operates INDEPENDENTLY:
- DQS0 captures DQ[0:7] - only these 8 bits need to match DQS0
- DQ[0] does NOT need to match DQ[8] (different byte lanes)
- DQS0 does NOT need to match DQS1
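
This grouping maps directly onto the match groups you define in a constraint manager. A minimal sketch of the data structure in Python (net names are hypothetical - match them to your schematic):

# Byte-lane match groups for a x32 DDR4 interface (hypothetical net names).
# Each DQ/DM net is matched against its own lane's DQS - never across lanes.
BYTE_LANES = {
    0: {"dqs": ("DQS0_P", "DQS0_N"), "dq": [f"DQ{i}" for i in range(0, 8)],   "dm": "DM0"},
    1: {"dqs": ("DQS1_P", "DQS1_N"), "dq": [f"DQ{i}" for i in range(8, 16)],  "dm": "DM1"},
    2: {"dqs": ("DQS2_P", "DQS2_N"), "dq": [f"DQ{i}" for i in range(16, 24)], "dm": "DM2"},
    3: {"dqs": ("DQS3_P", "DQS3_N"), "dq": [f"DQ{i}" for i in range(24, 32)], "dm": "DM3"},
}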
            

Length Matching Requirements (DDR4-3200)

Matching Group                      Tolerance              Notes
DQ to DQS within byte lane          +/- 10 mil (0.25 mm)   Most critical - determines data valid window
DQS+ to DQS- (intra-pair)           +/- 5 mil (0.13 mm)    Differential pair skew
DM/DBI to DQS within byte lane      +/- 10 mil             Same as DQ-DQS
Byte lane to byte lane (DQS-DQS)    +/- 100 mil (2.5 mm)   Controller compensates per-lane
Address/Command to CLK              See fly-by section     Different rules - fly-by topology

Routing Rules for DQ/DQS

Impedance targets (DDR4):

Single-ended DQ: 40 ohm (+/- 10%) = 36-44 ohm

Differential DQS: 80 ohm (+/- 10%) = 72-88 ohm

Single-ended DQ trace width (typical L3 stripline): 5.8 mil for 40 ohm

DQS pair: 4.8 mil width, 5.0 mil space for 80 ohm differential


Spacing rules:

DQ-to-DQ within byte: 3W minimum (17 mil c-c for 5.8 mil traces)

DQS to DQ: 3W minimum (keep DQS centered in byte lane if possible)

Between byte lanes: 5W minimum (29 mil c-c) or greater

DQ/DQS to other signals (non-DDR): 4W minimum (23 mil c-c)

Step-by-Step Routing Process

  1. Place DRAM ICs with byte lane pin groupings aligned to controller byte lanes (check pin mapping carefully).
  2. Assign each DQ pin to its byte lane group in the constraint system.
  3. Route DQS pair FIRST for each byte lane (this is the timing reference).
  4. Route all 8 DQ bits to match DQS length within +/- 10 mil.
  5. Route DM/DBI to match DQS within +/- 10 mil.
  6. Keep ALL signals within a byte lane on the same layer (critical for delay matching).
  7. Add serpentine tuning to shorter signals to match the longest signal in the byte lane.
  8. Verify: all signals in byte lane within tolerance of the DQS reference length.

Worked example - DDR4-2400, Byte Lane 0, all on L3 stripline:

DQS0+ : 1450 mil (reference), DQS0- : 1448 mil (2 mil skew - within 5 mil)

DQ[0]: 1452 mil (+2), DQ[1]: 1448 mil (-2), DQ[2]: 1453 mil (+3)

DQ[3]: 1447 mil (-3), DQ[4]: 1451 mil (+1), DQ[5]: 1449 mil (-1)

DQ[6]: 1455 mil (+5), DQ[7]: 1446 mil (-4), DM0: 1450 mil (0)

Max deviation from DQS: +5/-4 mil (within +/-10 mil spec)
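
The check in step 8 is easy to script from a CAD length report. A minimal sketch in Python, using the lengths from the worked example above (tolerances as quoted in the table):

# Verify byte-lane matching: every DQ/DM length vs. the DQS reference,
# plus intra-pair skew of the DQS pair itself. Lengths in mils.
lengths = {
    "DQS0+": 1450, "DQS0-": 1448,
    "DQ0": 1452, "DQ1": 1448, "DQ2": 1453, "DQ3": 1447,
    "DQ4": 1451, "DQ5": 1449, "DQ6": 1455, "DQ7": 1446,
    "DM0": 1450,
}

DQ_TO_DQS_TOL = 10   # mil, DQ/DM to DQS within the byte lane
PAIR_SKEW_TOL = 5    # mil, DQS+ to DQS- intra-pair

ref = lengths["DQS0+"]  # DQS is the timing reference
pair_skew = abs(lengths["DQS0+"] - lengths["DQS0-"])
print(f"DQS intra-pair skew: {pair_skew} mil "
      f"({'OK' if pair_skew <= PAIR_SKEW_TOL else 'FAIL'})")

for net, length in lengths.items():
    if net.startswith("DQS"):
        continue
    dev = length - ref
    status = "OK" if abs(dev) <= DQ_TO_DQS_TOL else "FAIL"
    print(f"{net}: {length} mil ({dev:+d} mil) {status}")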

Common mistake - mixed-layer routing: a designer routes DQ[0:3] on L3 (stripline, Tpd = 174 ps/in) and DQ[4:7] on L1 (microstrip, Tpd = 139 ps/in) over a 1.5-inch route. Physical lengths are matched to +/- 5 mil, but the delay mismatch is 1.5 in * (174 - 139) ps/in = 52.5 ps of skew between the two groups. At DDR4-3200 (312 ps UI), 52.5 ps is 17% of the data window - likely causing failures during memory training.
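Because matching is really about delay, not physical length, mixed-layer routes must be checked in picoseconds. A short sketch of the arithmetic above (Tpd values as quoted; the 1.5-inch route length is the assumed common length):

# Delay skew from routing length-matched traces on layers with different Tpd.
TPD_STRIPLINE = 174.0   # ps/in (L3 in this example)
TPD_MICROSTRIP = 139.0  # ps/in (L1 in this example)
route_len_in = 1.5      # inches, assumed common route length

skew_ps = route_len_in * (TPD_STRIPLINE - TPD_MICROSTRIP)
ui_ps = 1e6 / 3200      # DDR4-3200: one UI = 1/(3.2 GT/s) = 312.5 ps

print(f"Skew: {skew_ps:.1f} ps = {100 * skew_ps / ui_ps:.0f}% of the "
      f"{ui_ps:.1f} ps unit interval")  # 52.5 ps, ~17% of the UI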

Pin swap for routability: DQ bits within a byte lane can be swapped (DQ[0] and DQ[3] can exchange pins) because the controller treats them equivalently during training. This is a powerful routing optimization. But DQ bits CANNOT be swapped between byte lanes, and DM/DQS cannot be swapped.

DRAM pin-out matters: Not all DRAM packages have byte-lane-friendly pin assignments. Verify the specific DRAM part number's pin map. Some x16 parts split one "byte lane" across both sides of the BGA, making routing very difficult.

Checkpoint 2.7.2: Address/Command Routing (Critical)

DDR4 address/command (ADDR/CMD) signals use a fly-by topology where the signal visits each DRAM in sequence. The timing is different from DQ - address signals are sampled on both edges of the clock, and the fly-by delay is compensated by write leveling.

DDR4 Address/Command Signal List

Address bus: A[0:16], BA[0:1], BG[0:1]
Command bus: ACT#, plus RAS#/CAS#/WE# (multiplexed onto A[16:14], active LOW as commands)
Control: CKE, ODT, CS#[0:3] (active LOW chip select)
Clock: CK/CK# (differential, fly-by routing)
Reset: RESET# (can be slower, less critical)

Total ADDR/CMD signals: ~25-30 signals (depends on topology)

Key characteristic: ADDR/CMD signals are COMMON to all DRAMs
(unlike DQ which is per-DRAM). They use fly-by topology.
            

Fly-By Topology Routing Rules

Fly-by means: Signal routes from controller to DRAM0, then DRAM0 to DRAM1, etc.

Each DRAM sees the signal at a different time (intentional skew).

The clock also flies by in the same order, maintaining the clock-data relationship.


Routing requirements:

1. ADDR signals must match each other: +/- 25 mil within the address group

2. CLK must match ADDR at each DRAM (not globally - per-DRAM matching)

3. CKE, ODT, CS# match to within +/- 25 mil of CLK at their respective DRAM


Fly-by segment lengths (example: 2-DRAM rank):

Controller to DRAM0: L0 (this is the first fly-by segment)

DRAM0 to DRAM1: L1 (second segment - "fly-by stub")

Total length to DRAM1: L0 + L1


CLK routing:

CLK must follow the SAME physical path as ADDR (same order of DRAMs)

CLK at DRAM0 must arrive at the same time as ADDR at DRAM0

CLK at DRAM1 must arrive at the same time as ADDR at DRAM1

This means CLK-to-ADDR matching is PER DRAM, not end-to-end!

Step-by-Step Routing

  1. Define the fly-by order (which DRAM is visited first, second, etc.) during placement.
  2. Route CLK differential pair first along the fly-by path.
  3. Route ADDR/CMD signals parallel to CLK, matching at each DRAM connection point.
  4. At each DRAM: ADDR signals and CLK should arrive simultaneously (+/- 25 mil of each other).
  5. Keep ADDR/CMD signals on the same layer as CLK for consistent propagation delay.
  6. Use fly-by topology (DDR4 mandates it for CLK and ADDR/CMD - see Checkpoint 2.7.6); tailor end termination to the number of ranks and loads.
  7. Verify per-DRAM timing with simulation (write leveling will compensate remaining skew).

Worked example - 2-DRAM DDR4 fly-by routing:

CLK route: Controller --[800 mil]--> DRAM0 --[600 mil]--> DRAM1

A[0] route: Controller --[795 mil]--> DRAM0 --[605 mil]--> DRAM1

Matching at DRAM0: CLK=800, A[0]=795: skew = 5 mil (within 25 mil)

Matching at DRAM1: CLK=1400, A[0]=1400: skew = 0 mil (within 25 mil)
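
Per-DRAM matching means comparing cumulative lengths at each node, not end-to-end totals. A minimal sketch using the segment lengths from the example above:

# Cumulative length of each fly-by signal at every DRAM, compared to CLK.
# Segment lists run controller -> DRAM0 -> DRAM1 (mils, from the example).
segments = {
    "CLK": [800, 600],
    "A0":  [795, 605],
}
TOL = 25  # mil, ADDR/CMD to CLK at each DRAM

clk_cum = 0
a0_cum = 0
for dram, (clk_seg, a0_seg) in enumerate(zip(segments["CLK"], segments["A0"])):
    clk_cum += clk_seg
    a0_cum += a0_seg
    skew = abs(clk_cum - a0_cum)
    status = "OK" if skew <= TOL else "FAIL"
    print(f"DRAM{dram}: CLK={clk_cum} mil, A0={a0_cum} mil, skew={skew} mil {status}")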

Checkpoint 2.7.3: Write Leveling Support (Major)

Write leveling is a DDR4 training mechanism that compensates for the fly-by clock routing delay. Each DRAM receives the clock at a different time due to fly-by propagation. Write leveling adjusts the DQS phase at each DRAM so writes occur at the correct time.

Write Leveling Mechanism

Problem solved by write leveling:

CLK arrives at DRAM0 first, DRAM1 second (fly-by delay)

DQ/DQS connect directly (point-to-point, not fly-by)

Without compensation: DQS arrives at all DRAMs at the same time,

but CLK arrives at different times. DQS-CLK relationship differs per DRAM.


Write leveling operation:

1. Controller sends DQS to DRAM during write leveling mode

2. DRAM samples DQS with its internal clock (derived from CLK)

3. DRAM returns the sampled value on DQ[0]

4. Controller shifts DQS phase until it sees 0-to-1 transition on DQ[0]

5. This establishes the correct DQS-to-CLK phase for each DRAM


PCB design implication:

Maximum fly-by delay must be within the write leveling range of the controller.

Typical controller WL range: 0 to 2.0 ns (0 to ~12 inches fly-by at stripline speed)

If fly-by delay exceeds WL range: controller cannot compensate = FAILURE
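
The check reduces to one conversion: fly-by length from first to last DRAM, times propagation delay, compared against the controller's WL range. A sketch (the 2.0 ns range is the typical figure quoted above; the fly-by length is a hypothetical value - use your actual routed length and the datasheet WL range):

# Check total fly-by delay (first DRAM to last DRAM) against the
# controller's write-leveling compensation range.
TPD_STRIPLINE_PS_PER_IN = 174.0   # ps/in, typical inner-layer stripline
WL_RANGE_PS = 2000.0              # ps, assumed controller WL range (check datasheet)

flyby_len_mil = 2200              # hypothetical first-to-last DRAM fly-by length
flyby_delay_ps = (flyby_len_mil / 1000) * TPD_STRIPLINE_PS_PER_IN

margin = WL_RANGE_PS - flyby_delay_ps
print(f"Fly-by delay: {flyby_delay_ps:.0f} ps, WL range: {WL_RANGE_PS:.0f} ps "
      f"-> {'OK' if margin > 0 else 'FAIL'} ({margin:.0f} ps margin)")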

Design Rules for Write Leveling Support

  1. Calculate total fly-by delay from first DRAM to last DRAM in the rank.
  2. Verify this delay is within the controller's write leveling range (check datasheet).
  3. Ensure DQ/DQS routing is point-to-point (not fly-by) - direct from controller to each DRAM.
  4. DQ/DQS length from controller to each DRAM should be roughly similar (+/- 500 mil between DRAMs is typically fine, as WL compensates the rest).
  5. Do NOT try to manually compensate fly-by delay in DQ routing - write leveling does this automatically.

Over-compensating in PCB routing: Some designers try to route DQ to closer DRAMs longer and DQ to farther DRAMs shorter, attempting to manually compensate the fly-by delay. This is WRONG - it fights against the write leveling algorithm and can push the DQS phase outside the controller's compensation window.

Multi-rank considerations: In dual-rank configurations (2 sets of DRAMs sharing DQ bus), write leveling must work for both ranks. The fly-by topology should keep both ranks in a similar position relative to the DQ bus.

Checkpoint 2.7.4: ODT Configuration (Major)

On-Die Termination (ODT) in DDR4 provides impedance matching at the DRAM and controller. Proper ODT configuration is critical for signal integrity at high data rates. The ODT values affect signal amplitude, ringing, and power consumption.

DDR4 ODT Values and Configuration

Component                     ODT Setting     Typical Value      When Active
DRAM DQ (write)               RTT_WR          120/240 ohm        During writes to this rank
DRAM DQ (read, non-target)    RTT_NOM         60/120 ohm         During reads from other rank
DRAM DQ (idle)                RTT_PARK        240 ohm/Disabled   When rank is idle
Controller DQ                 Ron (drive)     34/48 ohm          During writes (driving)
Controller DQ                 ODT (receive)   40/60/80 ohm       During reads (termination)
DRAM CA (command/addr)        Not available   -                  No on-die termination on CA; external Vtt termination at end of fly-by if needed

ODT Design Process

Single-rank write (controller driving, DRAM receiving):

Driver: Controller Ron = 34 ohm

Trace: Zo = 40 ohm

Load: DRAM RTT_WR = 120 ohm

Reflection at DRAM: rho = (120-40)/(120+40) = 0.5 (significant!)

But POD signaling has Vtt pull-up, so effective load is different:

Effective Z at DRAM = RTT_WR || (pulls to Vddq) = acts as termination


Single-rank read (DRAM driving, controller receiving):

Driver: DRAM Ron = 34 ohm

Trace: Zo = 40 ohm

Load: Controller ODT = 60 ohm

Reflection at controller: rho = (60-40)/(60+40) = 0.2 (acceptable)


Optimal matching for 40-ohm trace:

Ideal termination = Zo = 40 ohm

Controller ODT = 40 ohm gives rho = 0 (perfect match)

But 40 ohm draws more current: I = 1.2V/40 = 30 mA per pin

Trade-off: 60 ohm ODT = slight mismatch (rho = 0.2) but about 33% less power (20 mA vs 30 mA per pin)
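
The trade-off is simple enough to tabulate. A sketch computing reflection coefficient and static termination current for candidate controller ODT values (Zo and Vddq as quoted; this is a single-resistor model only - real POD behavior needs IBIS simulation):

# Reflection coefficient and static termination current vs. controller ODT.
# Simplified resistive model; use IBIS simulation for real POD waveforms.
ZO = 40.0     # ohm, trace impedance
VDDQ = 1.2    # V

for rtt in (40.0, 60.0, 80.0):
    rho = (rtt - ZO) / (rtt + ZO)    # reflection at the receiver
    i_ma = VDDQ / rtt * 1000         # worst-case static current, driver low
    print(f"ODT {rtt:>4.0f} ohm: rho = {rho:+.2f}, I = {i_ma:.0f} mA/pin")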

Step-by-Step ODT Optimization

  1. Start with SoC vendor recommended ODT settings (from reference design or app note).
  2. Run signal integrity simulation with proposed ODT values.
  3. Check eye diagram at receiver for all DQ bits in each byte lane.
  4. If eye is too small: try lower ODT value (stronger termination, better matching).
  5. If power is too high: try higher ODT value (weaker termination, saves power).
  6. Verify for BOTH write and read directions (different ODT active in each direction).
  7. Document final ODT settings for firmware/BIOS memory training configuration.

Micron DDR4 IBIS Models: Download from Micron website. Includes IBIS models with configurable ODT settings (RTT_NOM, RTT_WR, RTT_PARK, Ron). Use with HyperLynx or Sigrity.

JEDEC Board SI Tool: Free simulation tool from JEDEC that models DDR4 channels with configurable ODT. Good for initial ODT exploration before full simulation.

Checkpoint 2.7.5: VREF Generation (Major)

DDR4 uses internal VREF (generated inside the DRAM via Mode Register settings) for data signals, but external VREF may be needed for command/address signals on some controllers. VREF quality directly affects the noise margin of the entire DDR interface.

DDR4 VREF Architecture

DDR4 VREF for DQ (internal to DRAM):
  - Training Range 1: 60% to 92.5% of Vddq (in 0.65% steps)
  - Training Range 2: 45% to 77.5% of Vddq (in 0.65% steps)
  - Default: 70% of Vddq for Range 1 = 0.7 * 1.2V = 0.84V
  - Adjusted during VREF training to center the eye

DDR4 VREF for CA (controller-side):
  - Some controllers use internal VREF (generated from Vddq)
  - Some require external VREF pin: must be 0.5 * Vddq = 0.6V
  - External VREF accuracy: +/- 0.5% (= +/- 6 mV)
  - Noise on VREF directly subtracts from noise margin

External VREF circuit (if needed):
  Option 1: Resistor divider from Vddq
    R1 = R2 = 200 ohm (divides by 2), decoupled with 100nF + 1uF
    Accuracy depends on resistor tolerance: 0.1% resistors give 0.1% VREF accuracy

  Option 2: Dedicated reference IC (e.g., TI REF3312 1.25V output, divided and buffered down to 0.6V) or a resistor-programmed LDO
    Better noise performance, but adds cost and complexity
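
For Option 1, the worst-case divider error is easy to bound. A sketch, assuming the 0.1% resistors suggested above (supply tolerance is a separate term - add your rail spec on top):

# Worst-case VREF error from a matched resistor divider (R1 = R2).
# VREF = Vddq * R2 / (R1 + R2); error is worst when R1 and R2 move oppositely.
VDDQ = 1.2        # V, nominal
R_NOM = 200.0     # ohm
R_TOL = 0.001     # 0.1% resistors

r1 = R_NOM * (1 + R_TOL)   # R1 high, R2 low: VREF pulled below nominal
r2 = R_NOM * (1 - R_TOL)
vref_min_mv = VDDQ * r2 / (r1 + r2) * 1000
vref_nom_mv = VDDQ / 2 * 1000
err_mv = vref_nom_mv - vref_min_mv

print(f"VREF nominal: {vref_nom_mv:.0f} mV, worst-case divider error: "
      f"+/-{err_mv:.2f} mV ({100 * err_mv / vref_nom_mv:.2f}%)")
# 0.1% resistors -> ~0.6 mV (~0.1%), well inside the +/- 0.5% (6 mV) budget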
            

VREF PCB Layout Requirements

  1. If external VREF needed: place generation circuit within 500 mil of the VREF pin.
  2. Decouple VREF with low-ESR capacitors: 100 nF (0402) + 1 uF (0402) to ground.
  3. Route VREF trace away from switching signals (5W minimum separation).
  4. Use a guard trace around VREF if routed near DDR data bus.
  5. VREF resistor divider should be powered from the same Vddq supply rail feeding the DRAMs.
  6. Verify VREF decoupling provides <1 mV ripple at DDR clock frequency and harmonics.

VREF noise budget: With DDR4-3200 having approximately 260 mV noise margin per side, even 5 mV of VREF noise consumes 2% of the budget. At 10 mV, it's nearly 4%. Keep VREF noise below 2 mV peak-to-peak for DDR4-3200 designs.

Internal VREF training: DDR4 DQ VREF is internally generated and trained. The PCB designer doesn't need to provide it externally. But the training algorithm needs sufficient eye opening to converge. If the eye is already marginal from other SI issues, VREF training may not find an optimal point.

Checkpoint 2.7.6: Fly-By Topology for DDR4 (Critical)

DDR4 mandates a fly-by topology for clock and command/address signals. Unlike DDR2's T-branch topology, fly-by routes signals in a daisy-chain fashion past each DRAM. This improves signal quality at high data rates but introduces skew that write leveling must compensate.

Fly-By Topology Design

Topology comparison:
  DDR2 (T-branch): Controller---+---DRAM0
                                |
                                +---DRAM1
  Clock arrives at both DRAMs simultaneously (matched stub lengths)

  DDR4 (Fly-by):  Controller----DRAM0----DRAM1----DRAM2----DRAM3
  Clock arrives at DRAM0 first, DRAM3 last
  Delay from DRAM0 to DRAM3 can be 1-2 ns (compensated by write leveling)

Why fly-by is better at high speed:
  - Eliminates T-branch reflection from unterminated stub
  - Each DRAM sees only forward-traveling wave (no reflection from next DRAM)
  - DRAM input capacitance (2-3 pF) is absorbed into transmission line
  - Signal quality at each DRAM is better than T-topology above 1 GHz
            

Fly-By Routing Rules

Segment matching (ADDR/CMD signals):

All ADDR/CMD signals in a group must be matched to +/- 25 mil at EACH DRAM.

CLK pair must follow the same fly-by path and match ADDR group.


Fly-by stub length (from via to DRAM pad):

Keep stub from fly-by trace to DRAM pad as short as possible.

Maximum stub: 100 mil (absolute max), target: < 50 mil.

Long stubs create reflections that degrade signal quality at high speeds.


DRAM loading:

Each DRAM presents approximately 2 pF input capacitance per pin.

4 DRAMs = 8 pF total loading on fly-by network.

This loading reduces the effective impedance and increases rise time.
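
The impedance drop can be estimated with the standard loaded-line formula, Z_loaded = Zo / sqrt(1 + Cd/C0), where C0 is the intrinsic capacitance of the loaded trace section (C0 = Tpd x length / Zo). A sketch with the numbers above (the section length and 40-ohm unloaded impedance are illustrative assumptions):

import math

# Effective impedance of a fly-by section with distributed DRAM loading.
# Z_loaded = Zo / sqrt(1 + Cd / C0), C0 = intrinsic capacitance of section.
ZO = 40.0                 # ohm, unloaded trace impedance (example value)
TPD = 174e-12             # s/in, stripline propagation delay
section_len_in = 2.0      # in, assumed length of the loaded fly-by section
CD = 4 * 2e-12            # F, 4 DRAMs x ~2 pF each

c0 = TPD * section_len_in / ZO          # intrinsic capacitance of the section
z_loaded = ZO / math.sqrt(1 + CD / c0)
tpd_loaded = TPD * math.sqrt(1 + CD / c0)

print(f"C0 = {c0*1e12:.1f} pF, loaded Zo = {z_loaded:.1f} ohm, "
      f"loaded Tpd = {tpd_loaded*1e12:.0f} ps/in")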


Fly-by termination:

Some designs add a termination resistor at the end of the fly-by chain:

Rt = 100-200 ohm to Vtt (reduces ringing at last DRAM)

Check SoC vendor recommendation - not always needed for 1-2 DRAM configs.

Step-by-Step Implementation

  1. During placement: arrange DRAMs in a line that allows natural fly-by routing from controller.
  2. Define the fly-by order (closest DRAM to controller is first, furthest is last).
  3. Route CLK differential pair first along the fly-by path with short stubs to each DRAM CLK pin.
  4. Route all ADDR/CMD signals parallel to CLK, maintaining the same fly-by order.
  5. Match all ADDR/CMD signals to each other within +/- 25 mil at each DRAM connection point.
  6. Keep fly-by stub length from main route to DRAM pad under 50 mil.
  7. If fly-by termination needed: place at end of chain, value per SoC vendor recommendation.
  8. Simulate the complete fly-by network with all DRAM loads to verify adequate signal quality at each DRAM.

Worked example - DDR4, 2 DRAMs, fly-by routing:

CLK: Controller(0 mil) -> 900 mil -> DRAM0(stub 30 mil) -> 500 mil -> DRAM1(stub 35 mil)

A[0]: Controller(0 mil) -> 905 mil -> DRAM0(stub 28 mil) -> 498 mil -> DRAM1(stub 32 mil)

A[0] matches CLK at DRAM0: |905-900| = 5 mil (within 25 mil)

A[0] matches CLK at DRAM1: |(905+498)-(900+500)| = |1403-1400| = 3 mil (within 25 mil)

Fly-by delay: 500 mil * 174 ps/in = 87 ps (well within WL range)

Common mistake - reversed fly-by order: a designer routes CLK to DRAM1 first, then DRAM0, but ADDR signals go to DRAM0 first, then DRAM1. The fly-by order is reversed between CLK and ADDR. At DRAM0, CLK arrives late (it went to DRAM1 first) while ADDR arrives early. The CLK-to-ADDR skew at DRAM0 is on the order of the entire fly-by delay (800 ps in this scenario), far exceeding the +/- 25 mil (~4 ps) matching requirement. Memory fails to initialize.

Cadence Allegro: Use "Route > Interactive Fly-by" feature (DDR routing wizard). Set up the fly-by chain with DRAM order, and the tool maintains matching at each node automatically.

Altium Designer: The xSignals feature supports DDR fly-by routing with per-node matching. Define xSignals from controller to each DRAM, then use length tuning with fly-by awareness.

HyperLynx: Use BoardSim to simulate the complete fly-by topology. Model DRAM input capacitance with IBIS models. Check signal quality at each DRAM independently.