OCP Server Memory Channel Testing DRAFT
OCP Server Memory Channel Testing Overview – Draft Version 0.1 May 4 2015
Contributed by: David Woolf UNH-OL [email protected] and Barbara Aichinger
FuturePlus Systems [email protected]
Executive Summary: Cloud Computing is pervasive in our society today, and at the
heart of every cloud computing server is DDR Memory. Some data centers have reported that
DDR Memory errors are the #2 failure in their data centers. Error detection and correction
techniques fall short if there are more than 1-2 bit errors in a 64-72 bit line of information.
Industry studies have shown that DDR Memory errors are much more pervasive in the field than
the vendors' data sheets would lead you to believe. Adding a cost-effective and relatively quick
post-validation check of the DDR Memory channel for OCP Servers would add credibility to the
OCP brand. In addition, it would help identify the elusive cause of post-manufacturing field
memory errors seen in data centers across the globe. This effort would be of value to OCP
manufacturers, OCP customers and the Cloud Computing industry in general.
A second goal of this effort would be to further the investigation into the recently publicized
failure mechanism of DDR3 memory called ‘Row Hammer’. Google has identified this not only
as a reliability issue but also as a security risk that can be exploited to gain complete control
over a targeted Server.
Goal #1: Memory Channel Validation Audit: Add value to the OCP brand of
Servers by specifying a robust memory channel test procedure. This test procedure is not meant
to be a design validation. It is meant to be an audit confirming that a robust electrical and
protocol DDR Memory Channel validation was done. As an added benefit this procedure can also be
used to:
o Spot check motherboards from manufacturing to ensure quality
o Isolate failing memory channels in the field on servers displaying above-average memory errors
o Check for BIOS bugs that program the Memory Controller incorrectly, thus causing JEDEC
specification violations
Procedure: This testing will be broken into two parts. The first is the electrical audit and the
second is the protocol and timing JEDEC specification testing.
Electrical Audit: This testing is not meant to be an electrical signal integrity validation.
Rather it ensures that a validation has been done and that the signals at the DDR DIMM
connector are acceptable with regard to signal swing, alignment and data valid eye size, and that
none of the strobe, data, address, command or control signals look appreciably
degraded with respect to their form or function. It will be a qualitative measurement.
Protocol Timing Audit: This test procedure is not meant to replace a protocol timing
validation of the server. Rather it ensures that the BIOS has programmed the memory
controller correctly for key timing parameters of the DDR memory. It also ensures that under
heavy traffic loads the memory controller adheres to the JEDEC specification.
Outline: Electrical Audit of the DDR Memory Channel
Run Eye Scan (footnote 1) on all bus signals in all slots:
Address
Command
Control
Data Signals
Data Strobes
Check for
Signal Alignment
o Bytes to each other (fly by)
o Signals within a byte
o Strobes to Data (Read and Write)
o Command to Address and Control
Data Valid Eye (all signals)
o Signals within a byte
o Composite eye (millions of samples laid on top of each other)
o Burst Scan (beats within a burst)
Signal Swing
o At the start and end of a burst on Data and Data Strobe (DQ/DQS)
o All address command and control signals
Method of Implementation:
For the audit, a qualitative measurement will be performed. An oscilloscope would be the tool
of choice for a robust validation; however, that is not our goal. For an audit it is much more
cost-effective, in both money and time, to use a logic analyzer and an interposer installed in
the DIMM slot. Using a DIMM slot interposer requires no soldering or delicate, time-consuming
probing.
1 Eye Scan is a general term describing a high speed digital sampling of a signal with either a scope, logic analyzer or protocol analyzer.
Figure 1: Keysight U4154A/B logic analyzer with a FS2510 interposer from FuturePlus Systems
This setup has a sampling resolution of 5 ps x 2 mV. It can sample every signal on the DDR4 bus
simultaneously. All signals are scanned under the same conditions and viewed with respect to
each other. This method does not replace a high-bandwidth scope; however, it does provide a
rapid qualitative insight that will act as an ‘audit’ that the memory slot has good signal quality.
Pass/Fail Criteria
Eye Scan
Address/Command and Control Signals have similar eye shape and opening size. A numeric
output (Excel sheet) from the tool can be used to make the comparison quickly.
DQ/DQS signals all have similar eye shape and opening size. Byte fly-by is present and
consistent between bytes.
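The "similar eye shape and opening size" comparison above could be automated along the following lines. This is only a minimal sketch: it assumes the tool's numeric output has already been reduced to one eye width (ps) and eye height (mV) per signal, and the function name, the 20% tolerance and the sample values are all hypothetical, not taken from any tool or specification.

```python
from statistics import median


def flag_dissimilar_eyes(eyes, tolerance=0.2):
    """Return signals whose eye opening differs from the group median.

    eyes: dict mapping signal name -> (eye_width_ps, eye_height_mv).
    tolerance: allowed fractional deviation from the median (assumed value).
    """
    med_width = median(w for w, _ in eyes.values())
    med_height = median(h for _, h in eyes.values())
    flagged = []
    for name, (width, height) in sorted(eyes.items()):
        width_off = abs(width - med_width) > tolerance * med_width
        height_off = abs(height - med_height) > tolerance * med_height
        if width_off or height_off:
            flagged.append(name)
    return flagged


# Hypothetical measurements for five data signals of one byte lane.
example = {
    "DQ0": (310, 420),
    "DQ1": (305, 415),
    "DQ2": (300, 410),
    "DQ3": (220, 405),  # noticeably narrower eye than its neighbours
    "DQ4": (308, 412),
}
print(flag_dissimilar_eyes(example))
```

Because the audit is qualitative, a flagged signal would be a prompt to look at the eye diagram itself rather than an automatic failure.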
Figure 2: Eye Scan showing alignment of Data and Strobe for a Write
Burst Scan
Beats 0-7 (0-3 for burst chop, BC) are consistent with good signal quality. This is checked for
all DQ/DQS signals.
Figure 3: Example of a Burst Scan on Data byte 0 with associated strobe signal
Outline: Protocol Timing Audit of the DDR Memory Channel
The JEDEC specification has hundreds of timing parameters that govern the ordering and timing
between transactions on the DDR memory bus. These are set up by a complicated set of steps at
boot time and, if not configured correctly, can result in data corruption of the memory.
Ensure that the OCP memory channel adheres to the JEDEC specification with regard to:
o Command to Command timing
o Refresh Rate
o Calibration Commands
o Correct Operation of Mode Register Settings
o ODT operation
o Rank to Rank command timing
o DIMM to DIMM command timing
o Power management operation
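As an illustration of one item in the list above, the Refresh Rate check could be sketched as follows. The sketch assumes a list of REFRESH command timestamps has been exported from the analyzer trace. The constants are assumptions to be confirmed against the applicable JEDEC specification: a 7.8 us average refresh interval (tREFI) and at most 8 postponed refreshes, both typical of DDR3 and DDR4 in the normal temperature range. The function name and sample trace are hypothetical.

```python
TREFI_NS = 7800.0    # assumed average refresh interval (7.8 us)
MAX_POSTPONED = 8    # assumed limit on postponed REFRESH commands


def check_refresh(ref_times_ns, trefi_ns=TREFI_NS, max_postponed=MAX_POSTPONED):
    """Check spacing of REFRESH commands captured from a trace.

    ref_times_ns: ascending timestamps (ns) of REFRESH commands to one rank.
    Returns (average_gap_ns, violations), where violations lists the
    (gap_index, gap_ns) pairs longer than (max_postponed + 1) * trefi_ns.
    """
    gaps = [later - earlier for earlier, later in zip(ref_times_ns, ref_times_ns[1:])]
    limit_ns = (max_postponed + 1) * trefi_ns
    violations = [(i, gap) for i, gap in enumerate(gaps) if gap > limit_ns]
    average = sum(gaps) / len(gaps) if gaps else 0.0
    return average, violations


# Hypothetical trace: the third gap (74400 ns) exceeds 9 x 7800 ns = 70200 ns.
average, violations = check_refresh([0, 7800, 15600, 90000, 97800])
print(average, violations)
```

A dedicated protocol analyzer performs this kind of check in hardware across the whole capture; the sketch only shows the arithmetic behind a single parameter.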
Method of Implementation:
For the audit, a protocol or logic analyzer based solution will suffice. The measurement is
made at the DIMM slot, so no soldering or delicate, time-consuming probing is needed.
Figure 4: Automated Violation Detection with the FS2800 DDR Detective®
Figure 5: DDR Detective interposer installed in 1 slot of a 2 slot memory channel
Pass/Fail Criteria
All JEDEC protocol and timing checks, as specified by the JEDEC Specification, pass without
error when each test is run for 1 hour.
Software
In order to exercise the memory channel, software benchmarks will be run. They will be selected
using the following criteria:
o Creates near-theoretical bandwidth on the data bus
o Causes a variety of commands on the bus
o Exercises all supported power management commands
o Causes power spikes and inter-symbol interference
o Creates the Row Hammer event
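To judge the first criterion, a benchmark's measured throughput can be compared with the theoretical peak of the data bus. The sketch below assumes a 64-bit data path (ECC bits carry no payload) and uses DDR4-2133 purely as an example speed grade; the function names and the measured figure are hypothetical.

```python
def peak_bandwidth_mb_s(transfer_rate_mt_s, data_width_bits=64):
    """Theoretical peak bandwidth of one memory channel in MB/s.

    transfer_rate_mt_s: transfers per second in millions (e.g. 2133).
    data_width_bits: data bits moved per transfer, excluding ECC.
    """
    return transfer_rate_mt_s * data_width_bits / 8


def bus_utilization(measured_mb_s, transfer_rate_mt_s, data_width_bits=64):
    """Fraction of the theoretical peak that a benchmark achieved."""
    return measured_mb_s / peak_bandwidth_mb_s(transfer_rate_mt_s, data_width_bits)


# Example speed grade: 2133 MT/s x 8 bytes per transfer = 17064 MB/s peak.
print(peak_bandwidth_mb_s(2133))
# Hypothetical benchmark result of 8532 MB/s, i.e. half of the peak.
print(bus_utilization(8532, 2133))
```

What counts as "near" theoretical bandwidth is left open in this draft, so no threshold is assumed here.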
The challenge will be to find the smallest number of different software programs that cause the
above events. It is anticipated that, once failure mechanisms are uncovered and better
understood, software that targets those failure mechanisms will be added to the above
list.
Documentation
The following documents will be created and electronically stored for each Server tested, for
each slot of each memory channel:
o Eye Scan results with ‘eye’ Excel spreadsheets for each signal
o Burst Scan for each byte (including ECC if implemented) for both Reads and Writes
o Protocol Violation Report
Example: A Server with 4 memory channels and 3 slots per channel has 12 total slots. The tests
are thus repeated 12 times. Each Eye Scan file will contain all signals for that slot. Each Burst
Scan file will contain all signals for that slot. Each slot will have a protocol violation check
report, thus 12 protocol violation reports will be generated. In total this example system will
have 36 files. A summary file can then be created that gives an easy-to-read pass/fail summary.
Thus 37 files will be generated for each Server.
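The file count in the example generalizes to other Server configurations. The following sketch lists the expected files for a given number of channels and slots; the file naming scheme is hypothetical and only serves to make the count concrete.

```python
REPORT_TYPES = ("eye_scan", "burst_scan", "protocol_violations")


def expected_files(channels, slots_per_channel):
    """List the result files expected for one Server (hypothetical names)."""
    names = [
        f"ch{channel}_slot{slot}_{report}"
        for channel in range(channels)
        for slot in range(slots_per_channel)
        for report in REPORT_TYPES
    ]
    names.append("summary")  # one pass/fail summary per Server
    return names


# 4 channels x 3 slots = 12 slots; 12 x 3 reports = 36 files, plus the summary.
files = expected_files(4, 3)
print(len(files))
```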
Goal #2: Row Hammer Detection
As geometries shrink and capacities increase, DDR Memory cells become susceptible to leakage
current from adjacent cells. In the case of DDR Memory, a ROW subjected to excessive
ACTIVATE commands can leak current into adjacent ROWS. This ROW is referred to as the
‘aggressor’. If the adjacent ROWS, called the victim ROWS, are at the tail end of the cyclical
refresh cycle, their charge is low. Thus they are susceptible to leakage current that can cause a
bit flip. The failure of the DDR Memory cell to hold its charge due to leakage current from an
adjacent ROW when the adjacent ROW is targeted with excessive ACTIVATE commands is
known as “Row Hammer”. The name was coined because the ROW is being ‘hammered’ with
ACTIVATE commands.
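Detecting "excessive ACTIVATE commands" amounts to counting ACTIVATEs per ROW within a refresh period. The sketch below does this over a decoded command trace. It assumes a 64 ms refresh period (typical for DDR3 at normal temperature) and an alarm threshold of 100,000 ACTIVATEs, which is a hypothetical value and not a figure from this document or from JEDEC. It also uses simple fixed windows rather than a sliding window, so it is an approximation.

```python
from collections import Counter

REFRESH_WINDOW_NS = 64_000_000  # assumed 64 ms refresh period
ACT_THRESHOLD = 100_000         # hypothetical alarm threshold


def hammered_rows(activates, window_ns=REFRESH_WINDOW_NS, threshold=ACT_THRESHOLD):
    """Find ROWS that receive more than `threshold` ACTIVATEs in one window.

    activates: iterable of (timestamp_ns, rank, bank, row), ascending in time.
    Returns the set of (rank, bank, row) aggressor candidates.
    """
    counts = Counter()
    flagged = set()
    window_start = None
    for timestamp, rank, bank, row in activates:
        # Start a new fixed window once the current one has elapsed.
        if window_start is None or timestamp - window_start >= window_ns:
            window_start = timestamp
            counts.clear()
        key = (rank, bank, row)
        counts[key] += 1
        if counts[key] > threshold:
            flagged.add(key)
    return flagged


# Tiny hypothetical trace, with a small window and threshold for illustration.
trace = [
    (0, 0, 0, 5), (10, 0, 0, 5), (20, 0, 0, 7), (30, 0, 0, 5),
    (40, 0, 0, 5), (150, 0, 0, 7), (160, 0, 0, 7),
]
print(hammered_rows(trace, window_ns=100, threshold=3))
```

The ROWS physically adjacent to a flagged ROW would be the victims to inspect; mapping logical row addresses to physical adjacency is device specific and outside this sketch.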
How should OCP Certification address this problem? The answer is not altogether clear at this
point in time, since this is a relatively new issue for the industry. Here are some of the possible
scenarios.
1. Replace all DDR3 Memory with Tested and Certified parts that do not have this failure.
It is not clear that this is an economical or viable option.
2. Implement Row Hammer mitigation strategies for DDR3. There are several but they do
not totally prevent the problem and are a power and performance hit to the server.
3. Does the customer's application software create the Row Hammer event? This can be
detected using hardware test equipment that looks for excessive ACTIVATE commands
being generated by the Memory Controller. If it does not, then perhaps no action needs
to be taken.
4. The DDR Memory DRAM itself is the culprit. Identify which parts are most susceptible
and purge only those parts from critical applications.
5. Move as quickly as possible to DDR4, which some in the industry claim is not susceptible
to Row Hammer failures. This has not been proven to be correct.
In any event, testing and research by reputable test labs should be undertaken and the results
published. This will arm the OCP community and the Cloud Computing industry with good
information on how to tackle this problem.
Initial Row Hammer Investigation
Gather information and repeat Row Hammer experiments
Measure the effectiveness of the mitigation strategies and the power and performance
tradeoffs
Identify what software shows the problem the quickest
Compile a Known Good Parts list for DDR3
Identify what types of applications are most susceptible and publish results
Develop a network of industry experts to review and publish results
Current Row Hammer Resources
Google’s Article: http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html
CMU expose: http://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf
Various Papers and articles on the topic:
https://blogs.synopsys.com/committedtomemory/2015/03/09/row-hammering-what-it-is-and-how-hackers-could-use-it-to-gain-access-to-your-system/
http://www.ddrdetective.com/files/6414/1036/5710/The_Known_Failure_Mechanism_in_DDR3_memory_referred_to_as_Row_Hammer.pdf
https://www.youtube.com/watch?v=7wIUQ04Vkes
Wikipedia Link: http://en.wikipedia.org/wiki/Row_hammer#cite_note-googleprojectzero-4
Products identifying Row Hammer events and causing them
http://teledynelecroy.com/pressreleases/document.aspx?news_id=1805
http://www.eurosoft-uk.com/eurosoft-test-bulletin-testing-row-hammer/
http://www.ddrdetective.com/files/3314/1036/5702/Description_of_the_Row_Hammer_feature_on_the_FS2800_DDR_Detective.pdf
http://www.memtest86.com/
Summary
A Memory Channel Validation Audit and a well-understood Row Hammer mitigation
strategy will put OCP in a leadership position. OCP is the only industry standards organization
that encompasses the entire ‘food chain’ of Cloud Computing from component vendors, OEMs,
Server Vendors, Software vendors to large data center operators. OCP members represent the
spectrum of industries reliant on Cloud Computing from the financial sector to social media.
OCP is poised to address this issue where other standards organizations, due to corporate
malfeasance or just plain ignorance, have failed to do so. These goals are attainable and in the
best interests of the OCP community.