Digital signature using MD5 algorithm Hardware Acceleration
Final Presentation
Students: Eyal Mendel & Aleks Dyskin
Instructor: Evgeny Fiksman
High Speed Digital Systems Laboratory
Agenda
Introduction
HW/SW System Design
Performance Evaluation
Conclusions & Summary
Project Goals
• Evaluation of the C-to-FPGA technique
• Case study: MD5 algorithm
• Tool: ASC – A Stream Compiler
• Hardware Accelerator Design & Implementation
MD5 Goals/Usage
Goal: The MD5 (Message Digest 5) algorithm is intended for digital signature applications, where a large file must be "compressed" in a secure manner before being encrypted with a private (secret) key under a public-key cryptosystem.
Usage: MD5 is widely used as a cryptographic hash function. As an Internet standard (RFC 1321), MD5 has been employed in a wide variety of security applications and is commonly used to check the integrity of files.
MD5 steps (1)
Step 1: Append Padding Bits
The message is "padded" so that its length (in bits) is congruent to 448, modulo 512.
Step 2: Append Length
A 64-bit representation of b (the length of the message before the padding bits were added) is appended to the result of the previous step.
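Steps 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration following RFC 1321, not the project's code; the function name `md5_pad` is made up for the example.

```python
import struct

def md5_pad(message: bytes) -> bytes:
    """Apply MD5 Steps 1 and 2 (padding bits, then the 64-bit length)."""
    bit_len = (len(message) * 8) & 0xFFFFFFFFFFFFFFFF   # b, kept to 64 bits
    padded = message + b"\x80"                           # a single '1' bit, then '0' bits
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)   # until length = 448 (mod 512) bits
    return padded + struct.pack("<Q", bit_len)           # length b, low-order byte first

padded = md5_pad(b"a")
print(len(padded) * 8, "bits after padding")
```

Padding is always applied, so a 56-byte string grows to two 512-bit chunks.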
MD5 steps (2)
Step 3: Initialize MD buffer
a = 0x67452301; b = 0xefcdab89; c = 0x98badcfe; d = 0x10325476
$$F(x, y, z) = (x \land y) \lor (\lnot x \land z)$$
$$G(x, y, z) = (x \land z) \lor (y \land \lnot z)$$
$$H(x, y, z) = x \oplus y \oplus z$$
$$I(x, y, z) = y \oplus (x \lor \lnot z)$$
Step 4-5: Process message in 16-word blocks and Output
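As a software reference for Steps 3 to 5, the sketch below follows RFC 1321 in plain Python. It is an illustration only: it is neither the project's PowerPC code nor the ASC hardware description, and the names `md5_hex`, `S`, `T` and `rotl` are chosen for the example.

```python
import math
import struct

MASK = 0xFFFFFFFF
# Per-step rotation amounts and sine-derived constants (RFC 1321).
S = [7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 + [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4
T = [int(abs(math.sin(i + 1)) * 2**32) & MASK for i in range(64)]

def rotl(x, s):
    """Rotate a 32-bit word left by s bits."""
    return ((x << s) | (x >> (32 - s))) & MASK

def md5_hex(message: bytes) -> str:
    # Steps 1-2: append padding bits and the 64-bit message length.
    data = message + b"\x80"
    data += b"\x00" * ((56 - len(data) % 64) % 64)
    data += struct.pack("<Q", (len(message) * 8) & 0xFFFFFFFFFFFFFFFF)
    # Step 3: initialize the MD buffer.
    a0, b0, c0, d0 = 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476
    # Step 4: process each 16-word (512-bit) block.
    for off in range(0, len(data), 64):
        x = struct.unpack("<16I", data[off:off + 64])
        a, b, c, d = a0, b0, c0, d0
        for i in range(64):
            if i < 16:
                f, k = (b & c) | (~b & d), i                      # F
            elif i < 32:
                f, k = (b & d) | (c & ~d), (5 * i + 1) % 16       # G
            elif i < 48:
                f, k = b ^ c ^ d, (3 * i + 5) % 16                # H
            else:
                f, k = c ^ (b | (~d & MASK)), (7 * i) % 16        # I
            f = (f + a + T[i] + x[k]) & MASK
            a, d, c = d, c, b
            b = (b + rotl(f, S[i])) & MASK
        a0, b0 = (a0 + a) & MASK, (b0 + b) & MASK
        c0, d0 = (c0 + c) & MASK, (d0 + d) & MASK
    # Step 5: output a, b, c, d, low-order byte first.
    return struct.pack("<4I", a0, b0, c0, d0).hex()

print(md5_hex(b"a"))
```

The inner loop over `i` is the part that the slides refer to as the "process" step, which is the portion moved to hardware.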
ASC Overview
• ASC (A Stream Compiler) simplifies the exploration of hardware accelerators by turning the hardware design task into a software design process that uses only 'gcc' and 'make' to obtain a hardware netlist.
• A single C++ program with custom types and operators is the only syntax needed.
• ASC provides the whole environment and implements all the protocols needed for communication between the HW module and the CPU.
SW Model Evaluation (1)
• Maximum speed-up in the ideal case (process and speed_up take 0 sec to evaluate):
$$SU_{max} = \frac{1 - 0.49}{1 - 0.49 - 0.33} = \frac{0.51}{0.18} = 2.83$$
where 0.33 is the accelerated part.
• The evaluation of the finish stage was done for the worst case, i.e. the append_bits step is performed. In the general case, append_bits is performed only once per file/string.
• All measurements were taken on a Xilinx PowerPC.
SW Model Evaluation (2)
For a very large number of chunks, the total speed-up will be:
$$SU_{total} = \lim_{n\to\infty}\frac{T_{sw}}{T_{hw}} = \lim_{n\to\infty}\frac{(n-1)\,T_{sw1} + T_{sw2}}{(n-1)\,T_{hw1} + T_{hw2}} = \lim_{n\to\infty}\frac{T_{sw1} - \frac{1}{n}T_{sw1} + \frac{1}{n}T_{sw2}}{T_{hw1} - \frac{1}{n}T_{hw1} + \frac{1}{n}T_{hw2}} = \frac{T_{sw1}}{T_{hw1}}$$
Where:
• n is the number of chunks
• Tsw1, Thw1 are the average execution times of a not_last chunk
• Tsw2, Thw2 are the average execution times of the last chunk
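The limit can be checked numerically. In the sketch below the four per-chunk times are hypothetical values picked only for illustration (they are not measurements from this project), and `total_speedup` is a name invented for the example.

```python
def total_speedup(n, t_sw1, t_sw2, t_hw1, t_hw2):
    """Speed-up for n chunks: (n - 1) not_last chunks plus one last chunk."""
    t_sw = (n - 1) * t_sw1 + t_sw2
    t_hw = (n - 1) * t_hw1 + t_hw2
    return t_sw / t_hw

# Hypothetical average times in usec (illustrative only).
T_SW1, T_SW2 = 30.0, 100.0   # software: not_last chunk, last chunk
T_HW1, T_HW2 = 12.0, 75.0    # hardware: not_last chunk, last chunk

for n in (1, 10, 100, 10_000):
    print(n, round(total_speedup(n, T_SW1, T_SW2, T_HW1, T_HW2), 3))
print("limit Tsw1/Thw1:", T_SW1 / T_HW1)
```

With a single chunk the result is Tsw2/Thw2; as n grows the cost of the last chunk is amortized and the ratio approaches Tsw1/Thw1.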
System High-Level
Serial communication manager between PC and M310 board
This module serves as input/output of the system, starting and finishing the process.
Manages MD5 hardware interface.
SW reference module for comparison
Step 4 implementation
SW/HW algorithm flow
HW Accelerator insights
Basic structure of the hardware module after the initial design "on paper":
Processing Unit
Detailed explanation of one process cycle:
The process cycle is run 16 times per 512-bit input (32 bits × 16 = 512 bits).
Function Masking
Problem: which result is relevant for a given ‘i’?
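One way to picture the masking idea in software: compute all four auxiliary functions in parallel and let an all-ones or all-zeros mask keep only the result that belongs to the current step. The sketch assumes that ‘i’ is the RFC 1321 step index (0 to 63) and that the round is i / 16; the slides do not show the actual hardware mechanism, and `select_round_function` is a hypothetical name.

```python
MASK32 = 0xFFFFFFFF

def select_round_function(i, x, y, z):
    """Evaluate F, G, H, I together and mask out all but the one valid for step i."""
    results = (
        (x & y) | (~x & z),         # F, steps 0..15
        (x & z) | (y & ~z),         # G, steps 16..31
        x ^ y ^ z,                  # H, steps 32..47
        y ^ (x | (~z & MASK32)),    # I, steps 48..63
    )
    selected = 0
    for rnd, value in enumerate(results):
        enable = MASK32 if (i >> 4) == rnd else 0   # all-ones only for the active round
        selected |= value & enable                  # AND-OR structure, like a 4-to-1 mux
    return selected & MASK32

print(hex(select_round_function(32, 0x1, 0x2, 0x4)))
```

In hardware the four results exist every cycle anyway, so selecting by mask costs only a few gates and avoids any branching.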
T-Table Access (1)
Every process cycle we need to fetch 32 × 4 = 128 bits from the T-table.
a. Problem: ASC supports only 32-bit-wide memories.
b. Using a 2-port BRAM results in 2 clock cycles.
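To make the 128-bit requirement concrete, the sketch below builds the T-table as defined in RFC 1321 and splits it into four 32-bit-wide banks so that one word per bank can be read in the same cycle. The banking scheme is an assumption made for illustration; the slides do not state how the memory was actually organized.

```python
import math

# T[i] = floor(2^32 * |sin(i + 1)|) for i = 0..63 (RFC 1321).
T = [int(abs(math.sin(i + 1)) * 2**32) for i in range(64)]

# Hypothetical layout: bank r holds the 16 constants of round r, each 32 bits wide.
banks = [T[16 * r:16 * (r + 1)] for r in range(4)]

def fetch(cycle):
    """Return the four 32-bit words (128 bits in total) read in one process cycle."""
    return tuple(bank[cycle] for bank in banks)

print([hex(word) for word in fetch(0)])
```

Each bank stays within the 32-bit width limit, while four independent reads together deliver the 128 bits.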
T-Table Access (2)
HW Module Performance
One data process of 512 bits takes 680 ns (at clock_freq = 100 MHz).
S_CYCLE = 4 clock cycles, S_LOOP = 16 + 1
$$\frac{\text{S\_CYCLE} \times \text{S\_LOOP}}{\text{clock\_freq}} = \frac{68\ \text{clock cycles}}{100\ \text{MHz}} = 680\ \text{ns}$$
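The same arithmetic in Python, as a quick check. The throughput figure is derived here from the block time and does not appear on the slide; it covers the processing step only, not communication or padding.

```python
S_CYCLE = 4              # clock cycles per process cycle (from the slide)
S_LOOP = 16 + 1          # process cycles per 512-bit block (from the slide)
CLOCK_FREQ_HZ = 100e6    # 100 MHz

cycles = S_CYCLE * S_LOOP
block_time_ns = cycles / CLOCK_FREQ_HZ * 1e9
# Derived value, not stated on the slide: raw processing throughput.
throughput_mbit_s = 512 / (block_time_ns * 1e-9) / 1e6

print(cycles, "cycles,", round(block_time_ns, 1), "ns,", round(throughput_mbit_s, 1), "Mbit/s")
```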
Measurements (1)
All times are in usec.

String                 Software                             Hardware
                       Init.  Append  Finish_SW  Total      Init.  Append  Finish_HW         Total
‘a’                    2.1    6.68    91.14      99.92      2.1    6.68    66.3+0.68=66.98   75.76
‘Aleks’                2.1    8.58    89.62      100.3      2.1    8.58    64.1+0.68=64.78   75.46
‘message digest’       2.1    13.1    86.2       101.4      2.1    13.1    57.2+0.68=57.88   73.08
All 56-byte strings    2.1    8.77    73.24      84.11      2.1    8.77    50.1+0.68=50.78   61.65

Finish_SW = append Bits_SW + Process_SW + Output_SW
Finish_HW = append Bits_SW + Process_HW + Output_SW
Average speed-up HW-SW = 1.34998 times
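The average speed-up can be reproduced from the Total columns of the table. A short check, using only the figures listed above:

```python
# (software total, hardware total) in usec, copied from the table.
totals = {
    "a": (99.92, 75.76),
    "Aleks": (100.3, 75.46),
    "message digest": (101.4, 73.08),
    "All 56-byte strings": (84.11, 61.65),
}

speedups = {name: sw / hw for name, (sw, hw) in totals.items()}
average = sum(speedups.values()) / len(speedups)

for name, su in speedups.items():
    print(f"{name}: {su:.4f}")
print(f"average: {average:.5f}")
```

The figure is the mean of the four per-string ratios, not the ratio of the summed times.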
Measurements (2)
All times are in usec.

String                 Finish Software                   Finish Hardware
                       Append bits  Process  Output      Append bits  Process  Output
‘a’                    64.1         24.84    2.2         64.1         0.68     2.2
‘Aleks’                62           25.52    2.1         62           0.68     2.1
‘message digest’       55           29       2.2         55           0.68     2.2
All 56-byte strings    47.9         23.14    2.2         47.9         0.68     2.2
Conclusions (1)
• x1.35 speed-up with the HW implementation (worst case). The expected speed-up in the ideal case for one chunk is:
$$\frac{1}{1 - 0.31} = 1.45$$
• A speed-up larger than 1.35 can be achieved with large inputs (many chunks), when append_bit is evaluated only for the last chunk. In that case an ideal speed-up of 2.83 is expected; in the measurements a speed-up of ~2.75 is reached (graph on the next slide).
• The ASC tool proved able to implement complicated hardware modules with only a few software commands, and its code is easy to read.
$$T_{software} = (n-1) \cdot T_{1s} + T_{2s}$$
$$T_{hardware} = (n-1) \cdot T_{1h} + T_{2h}$$
$$Su_{total} = \frac{(n-1) \cdot su_2 + su_1}{n}$$
Where:
• T1s, T1h are the average execution times of a not_last chunk
• T2s, T2h are the average execution times of the last chunk
• su2 is the speed-up for a not_last chunk
• su1 is the speed-up for the last chunk
• n is the number of chunks
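The prediction formula is easy to tabulate. The sketch below plugs in su1 = 1.35 (the measured worst case, taken here as the last-chunk speed-up) and su2 = 2.83 (the ideal speed-up without append_bits); pairing these two numbers with su1 and su2 is an assumption based on the text above, and `predicted_speedup` is a name chosen for the example.

```python
def predicted_speedup(n, su1=1.35, su2=2.83):
    """Su_total = ((n - 1) * su2 + su1) / n for n chunks."""
    return ((n - 1) * su2 + su1) / n

for n in (1, 2, 10, 100, 1000):
    print(n, round(predicted_speedup(n), 3))
```

The prediction starts at su1 for a single chunk and rises toward su2 as the number of chunks grows.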
Conclusions (2)
Speed-Up Prediction
Summary
• We learned ASC: the design approach and the debug and synthesis process.
• We showed the feasibility of an MD5 implementation with ASC.
• Implementation design of the algorithm, from pseudo code to hardware
• Masking mechanism
• Parallel processing and mux-ing of the appropriate result
• Overcoming hardware limitations with a creative approach (memory implementation)
• Flow control
• Project goals were partially achieved.
• The file version was not implemented.
Further Work
• Further acceleration can be reached using a pipeline architecture.
• Further development of the file version.
The End
Thank you for your time.