OCP Server Memory Channel Testing DRAFT



OCP Server Memory Channel Testing Overview – Draft Version 0.1 May 4 2015

Contributed by: David Woolf, UNH-IOL, [email protected], and Barbara Aichinger, FuturePlus Systems, [email protected]

Executive Summary: Cloud Computing is pervasive in our society today, and at the heart of every cloud computing server is DDR Memory. Some data centers have reported that DDR Memory errors are the #2 cause of failure in their facilities. Error detection and correction techniques fall short when there are more than 1-2 bit errors in a 64-72 bit line of information. Industry studies have shown that DDR Memory errors are much more pervasive in the field than vendors' data sheets would lead you to believe. Adding a cost-effective and relatively quick post-validation check of the DDR Memory channel for OCP Servers would add credibility to the OCP brand. In addition, it would help identify the elusive cause of the post-manufacturing field memory errors seen in data centers across the globe. This effort would be of value to OCP manufacturers, OCP customers and the Cloud Computing industry in general.

A second goal of this effort would be to further the investigation into the recently publicized failure mechanism of DDR3 memory called ‘Row Hammer’. Google has identified this not only as a reliability issue but also as a security risk that can be exploited to gain complete control over a targeted Server.

Goal #1: Memory Channel Validation Audit: Add value to the OCP brand of Servers by specifying a robust memory channel test procedure. This test procedure is not meant to be a design validation. It is meant to be an audit confirming that a robust electrical and protocol DDR Memory Channel validation was done. As an added benefit, this procedure can also be used to:

o Spot check motherboards from manufacturing to ensure quality.
o Isolate failing memory channels in the field on servers displaying above-average memory errors.
o Check for BIOS bugs that program the Memory Controller incorrectly, thus causing JEDEC specification violations.

Procedure: This testing will be broken into two parts. The first is the electrical audit and the

second is the protocol and timing JEDEC specification testing.

Electrical Audit: This testing is not meant to be an electrical signal integrity validation. Rather, it ensures that a validation has been done and that the signals at the DDR DIMM connector are acceptable with regard to signal swing, alignment and data valid eye size, and that none of the strobe, data, address, command or control signals look appreciably degraded in form or function. It will be a qualitative measurement.


Protocol Timing Audit: This test procedure is not meant to replace a protocol timing

validation of the server. Rather it ensures that the BIOS has programmed the memory

controller correctly for key timing parameters of the DDR memory. It also ensures that under

heavy traffic loads the memory controller adheres to the JEDEC specification.

Outline: Electrical Audit of the DDR Memory Channel

Run Eye Scan [1] on all bus signals in all slots:

o Address
o Command
o Control
o Data Signals
o Data Strobes

Check for:

Signal Alignment
  o Bytes to each other (fly-by)
  o Signals within a byte
  o Strobes to Data
      - Read
      - Write
  o Command to Address and Control

Data Valid Eye (all signals)
  o Signals within a byte
  o Composite eye (millions of samples laid on top of each other)
  o Burst Scan (beats within a burst)

Signal Swing
  o At the start and end of a burst on Data and Data Strobe (DQ/DQS)
  o All address, command and control signals

Method of Implementation:

For the audit, a qualitative measurement will be performed. An oscilloscope would be the tool of choice for a robust validation, but that is not the goal here. For an audit, it is much more cost-effective, in both money and time, to use a logic analyzer and an interposer installed in the DIMM slot. Using a DIMM slot interposer requires no soldering or delicate, time-consuming probing.

[1] Eye Scan is a general term describing a high-speed digital sampling of a signal with a scope, logic analyzer or protocol analyzer.


Figure 1: Keysight U4154A/B logic analyzer with a FS2510 interposer from FuturePlus Systems

This setup has a sampling resolution of 5 ps x 2 mV. It can sample every signal on the DDR4 bus simultaneously. All signals are scanned under the same conditions and viewed with respect to each other. This method does not replace a high-bandwidth scope; however, it does provide rapid qualitative insight that acts as an ‘audit’ that the memory slot has good signal quality.

Pass/Fail Criteria

Eye Scan

Address, Command and Control signals all have similar eye shape and opening size. A numeric output (Excel sheet) from the tool allows this comparison to be made quickly.

DQ/DQS signals all have similar eye shape and opening size. Byte fly-by is present and consistent between bytes.
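The "similar eye shape and opening size" comparison could be automated along the following lines. This is only a minimal sketch: it assumes the tool's spreadsheet has been exported as CSV with hypothetical columns `signal`, `eye_width_ps` and `eye_height_mv`, and the 15% tolerance is an illustrative value, not a criterion from this draft or from JEDEC.

```python
import csv
import io
from statistics import median

def flag_outlier_eyes(rows, tolerance=0.15):
    """Return signals whose eye width or height deviates from the group
    median by more than `tolerance` (a fraction, e.g. 0.15 = 15%)."""
    med_w = median(float(r["eye_width_ps"]) for r in rows)
    med_h = median(float(r["eye_height_mv"]) for r in rows)
    flagged = []
    for r in rows:
        w_dev = abs(float(r["eye_width_ps"]) - med_w) / med_w
        h_dev = abs(float(r["eye_height_mv"]) - med_h) / med_h
        if w_dev > tolerance or h_dev > tolerance:
            flagged.append(r["signal"])
    return flagged

# Hypothetical export for part of one byte lane; values are made up.
SAMPLE_CSV = """signal,eye_width_ps,eye_height_mv
DQ0,310,420
DQ1,305,415
DQ2,300,410
DQ3,190,400
DQ4,308,418
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(flag_outlier_eyes(rows))  # DQ3 has a much narrower eye than its neighbors
```

Comparing each signal against the median of its own group, rather than against a fixed limit, matches the qualitative intent of the audit: it highlights signals that look different from their peers without claiming to be a signal integrity validation.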

Figure 2: Eye Scan showing alignment of Data and Strobe for a Write


Burst Scan

Beats 0-7 (or beats 0-3 for burst chop, BC) are consistent with good signal quality. This is checked for all DQ/DQS signals.

Figure 3: Example of a Burst Scan on Data byte 0 with associated strobe signal

Outline: Protocol Timing Audit of the DDR Memory Channel

The JEDEC specification has hundreds of timing parameters that govern the ordering and timing between transactions on the DDR memory bus. These are set up by a complicated sequence of steps at boot time and, if not configured correctly, can result in corruption of data in memory.

Ensure that the OCP memory channel adheres to the JEDEC specification with regard to:

o Command to Command timing

o Refresh Rate

o Calibration Commands

o Correct Operation of Mode Register Settings

o ODT operation

o Rank to Rank command timing

o DIMM to DIMM command timing

o Power management operation
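To make the idea of a command-to-command and refresh-rate check concrete, here is a rough sketch of how a captured command trace might be scanned for spacing violations. The trace format, the `Cmd` tuple and all numeric limits are assumptions for illustration; real limits depend on the JEDEC speed bin and the DIMM's SPD data, and a commercial analyzer checks far more parameters than this.

```python
from collections import namedtuple

# Hypothetical trace record: timestamp in ns, command name, bank (None for REF).
Cmd = namedtuple("Cmd", "t_ns name bank")

# Illustrative minimum spacings (same bank); NOT taken from a JEDEC table.
MIN_SPACING_NS = {
    ("ACT", "ACT"): 45.0,  # stand-in for tRC
    ("ACT", "RD"): 13.0,   # stand-in for tRCD
    ("PRE", "ACT"): 13.0,  # stand-in for tRP
}
# Assumed limit: 7.8 us average refresh interval, up to 8 refreshes postponed.
MAX_REFRESH_GAP_NS = 9 * 7800.0

def find_violations(trace):
    """Scan a time-ordered trace and return (rule, time) for each violation."""
    last_seen = {}   # (command name, bank) -> time of most recent occurrence
    last_ref = None
    violations = []
    for cmd in trace:
        if cmd.name == "REF":
            if last_ref is not None and cmd.t_ns - last_ref > MAX_REFRESH_GAP_NS:
                violations.append(("REF gap", cmd.t_ns))
            last_ref = cmd.t_ns
            continue
        for (first, second), limit in MIN_SPACING_NS.items():
            if cmd.name != second:
                continue
            prev = last_seen.get((first, cmd.bank))
            if prev is not None and cmd.t_ns - prev < limit:
                violations.append((f"{first}->{second}", cmd.t_ns))
        last_seen[(cmd.name, cmd.bank)] = cmd.t_ns
    return violations

trace = [
    Cmd(0.0, "ACT", 0), Cmd(13.5, "RD", 0), Cmd(35.0, "PRE", 0),
    Cmd(40.0, "ACT", 0),                       # too soon after ACT and PRE
    Cmd(100.0, "REF", None), Cmd(80100.0, "REF", None),  # refresh gap too long
]
for rule, t in find_violations(trace):
    print(f"{rule} violated at {t} ns")
```

A table of (first command, second command) pairs keeps the rules separate from the scanning loop, so additional parameters such as rank-to-rank or DIMM-to-DIMM spacing could be added without changing the loop.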

Method of Implementation:

For the audit, a protocol analyzer or logic analyzer based solution will suffice. The measurement is made at the DIMM slot, so no soldering or delicate, time-consuming probing is needed.


Figure 4: Automated Violation Detection with the FS2800 DDR Detective®

Figure 5: DDR Detective interposer installed in 1 slot of a 2 slot memory channel

Pass/Fail Criteria

All protocol and timing checks defined by the JEDEC specification pass without error, with each test run for 1 hour.

Software

To exercise the memory channel, software benchmarks will be run. They will be selected using the following criteria:


o Creates near-theoretical bandwidth on the data bus
o Causes a variety of commands on the bus
o Exercises all supported power management commands
o Causes power spikes and inter-symbol interference
o Creates the Row Hammer event

The challenge will be to find the smallest number of different software programs that cause the above events. It is anticipated that once failure mechanisms are uncovered and better understood, software that targets those failure mechanisms will be added to the above list.
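Choosing the fewest programs that together meet all the criteria is an instance of the set cover problem. The sketch below uses the standard greedy heuristic, which is not guaranteed to find the true minimum but is usually close. The benchmark names and the criteria each one covers are entirely hypothetical; a real table would come from measuring candidate programs on the bus.

```python
def pick_benchmarks(coverage, required):
    """Greedy set cover: repeatedly pick the program that covers the most
    still-uncovered criteria. Returns (chosen programs, uncovered criteria)."""
    remaining = set(required)
    chosen = []
    while remaining:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(coverage), key=lambda name: len(coverage[name] & remaining))
        gained = coverage[best] & remaining
        if not gained:
            break  # no program covers what is left
        chosen.append(best)
        remaining -= gained
    return chosen, remaining

# Hypothetical programs and the criteria each was observed to exercise.
coverage = {
    "stream_like": {"bandwidth", "power_isi"},
    "mixed_workload": {"command_variety", "bandwidth", "power_isi"},
    "idle_cycler": {"power_mgmt"},
    "hammer_loop": {"row_hammer", "command_variety"},
}
required = {"bandwidth", "command_variety", "power_mgmt", "power_isi", "row_hammer"}

chosen, uncovered = pick_benchmarks(coverage, required)
print(chosen, uncovered)
```

Returning the uncovered criteria alongside the selection makes it obvious when the candidate pool is insufficient, for example when no available program creates the Row Hammer event.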

Documentation

The following documents will be created and electronically stored for each Server tested, for each slot of each memory channel:

o Eye Scan results with ‘eye’ Excel spreadsheets for each signal
o Burst Scan for each byte (including ECC if implemented) for both Reads and Writes
o Protocol Violation Report

Example: A Server with 4 memory channels and 3 slots per channel has 12 slots in total, so the tests are repeated 12 times. Each Eye Scan file will contain all signals for that slot, as will each Burst Scan file. Each slot will also have a protocol violation report, so 12 protocol violation reports will be generated. In total, this example system will have 36 files (12 Eye Scan, 12 Burst Scan and 12 protocol violation reports). A summary file can then be created that gives an easy-to-read pass/fail summary. Thus 37 files will be generated for each Server.
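The file count in this example generalizes to other configurations. As a small sketch (the function name is illustrative only), three files per slot plus one summary file gives:

```python
def files_per_server(channels, slots_per_channel):
    """Number of result files for one server under the scheme above."""
    slots = channels * slots_per_channel
    per_slot_files = 3  # Eye Scan, Burst Scan, protocol violation report
    return slots * per_slot_files + 1  # plus one pass/fail summary file

print(files_per_server(4, 3))  # 37, matching the example above
```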

Goal #2: Row Hammer Detection

As geometries shrink and capacities increase, DDR Memory cells become susceptible to leakage current from adjacent cells. In DDR Memory, a ROW subjected to excessive ACTIVATE commands can leak current into adjacent ROWS. The ROW being activated is referred to as the ‘aggressor’. If the adjacent ROWS, called the victim ROWS, are near the end of the refresh cycle, their charge is low, so they are susceptible to leakage current that can cause a bit flip. The failure of a DDR Memory cell to hold its charge due to leakage current from an adjacent ROW that is targeted with excessive ACTIVATE commands is known as “Row Hammer”. The name was coined because the ROW is being ‘hammered’ with ACTIVATE commands.
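Following this definition, a trace of ACTIVATE commands can be screened for potential aggressor rows by counting activations per row within one refresh window. The sketch below is a simplified illustration: the trace format is hypothetical, the 64 ms window is an assumed refresh period, and the threshold in the example is deliberately tiny so the output is easy to follow. A realistic threshold would have to come from experiments on the parts under test.

```python
from collections import defaultdict, deque

def find_hammered_rows(activates, window_ns, threshold):
    """activates: iterable of (t_ns, bank, row), sorted by time.
    Returns {(bank, row): peak ACTIVATE count inside any window}
    for rows whose peak count exceeds `threshold`."""
    recent = defaultdict(deque)  # per-row timestamps still inside the window
    peak = defaultdict(int)
    for t_ns, bank, row in activates:
        q = recent[(bank, row)]
        q.append(t_ns)
        while q and t_ns - q[0] > window_ns:
            q.popleft()  # drop activations older than the window
        peak[(bank, row)] = max(peak[(bank, row)], len(q))
    return {key: n for key, n in peak.items() if n > threshold}

REFRESH_WINDOW_NS = 64_000_000  # assumed 64 ms refresh window

# Made-up trace: row 0x10 is activated six times, row 0x20 only twice.
activates = sorted(
    [(i * 100.0, 0, 0x10) for i in range(6)] + [(50.0, 0, 0x20), (700.0, 0, 0x20)]
)
print(find_hammered_rows(activates, REFRESH_WINDOW_NS, threshold=4))
```

A sliding window is used instead of fixed time buckets so that a burst of activations straddling a bucket boundary is not undercounted.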


How should OCP Certification address this problem? The answer is not altogether clear at this point, since this is a relatively new issue for the industry. Here are some of the possible scenarios.

1. Replace all DDR3 Memory with Tested and Certified parts that do not have this failure.

It is not clear that this is an economical or viable option.

2. Implement Row Hammer mitigation strategies for DDR3. There are several but they do

not totally prevent the problem and are a power and performance hit to the server.

3. Determine whether the customer's application software creates the Row Hammer event. This can be detected using hardware test equipment that looks for excessive ACTIVATE commands being generated by the Memory Controller. If it does not, then perhaps no action needs to be taken.

4. The DDR Memory DRAM itself is the culprit. Identify which parts are most susceptible

and purge only those parts from critical applications.

5. Move as quickly as possible to DDR4, which some in the industry claim is not susceptible

to Row Hammer failures. This has not been proven to be correct.

In any event, testing and research by reputable test labs should be undertaken and the results

published. This will arm the OCP community and the Cloud Computing industry with good

information on how to tackle this problem.

Initial Row Hammer Investigation

o Gather information and repeat Row Hammer experiments
o Measure the effectiveness of the mitigation strategies and the power and performance tradeoffs
o Identify what software shows the problem the quickest
o Compile a Known Good Parts list for DDR3
o Identify what types of applications are most susceptible and publish results
o Develop a network of industry experts to review and publish results

Current Row Hammer Resources

Google’s Article: http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html

CMU exposé: http://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf

Various papers and articles on the topic:

https://blogs.synopsys.com/committedtomemory/2015/03/09/row-hammering-what-it-is-and-how-hackers-could-use-it-to-gain-access-to-your-system/

http://www.ddrdetective.com/files/6414/1036/5710/The_Known_Failure_Mechanism_in_DDR3_memory_referred_to_as_Row_Hammer.pdf

https://www.youtube.com/watch?v=7wIUQ04Vkes

Wikipedia Link: http://en.wikipedia.org/wiki/Row_hammer#cite_note-googleprojectzero-4

Products identifying Row Hammer events and causing them:

http://teledynelecroy.com/pressreleases/document.aspx?news_id=1805

http://www.eurosoft-uk.com/eurosoft-test-bulletin-testing-row-hammer/

http://www.ddrdetective.com/files/3314/1036/5702/Description_of_the_Row_Hammer_feature_on_the_FS2800_DDR_Detective.pdf

http://www.memtest86.com/

Summary

A Memory Channel Validation Audit and a well-understood Row Hammer mitigation strategy will put OCP in a leadership position. OCP is the only industry standards organization that encompasses the entire ‘food chain’ of Cloud Computing, from component vendors, OEMs, Server vendors and software vendors to large data center operators. OCP members represent the spectrum of industries reliant on Cloud Computing, from the financial sector to social media. OCP is poised to address this issue where other standards organizations, due to corporate malfeasance or just plain ignorance, have failed to do so. These goals are attainable and in the best interests of the OCP community.