Notary: Hardware Techniques to Enhance Signatures
Luke Yen
Collaborator: Prof. Stark C. Draper
Advisor: Prof. Mark D. Hill
University of Wisconsin, Madison
MICRO-41 - November 11, 2008
www.cs.wisc.edu/multifacet/papers/micro08_notary.pdf
Executive Summary
Tackle 2 problems with hardware signatures:
• Problem 1: Best signature hashing (i.e., H3) has high area & power overheads
• Solution 1: Use entropy analysis to guide lower-cost hashing (Page-Block-XOR, PBX) that performs similarly to H3
– Ex: 160 gates for H3 vs 20 gates for PBX
• Problem 2: Spurious signature conflicts caused by signature bits set by private memory addrs
• Solution 2: Avoid inserting private stack addrs, propose privatization interface for higher performance
04/20/23 University of Wisconsin-Madison2
Outline
• Signature background
• Entropy
• Entropy results & PBX
• Privatization
• Methodology & workloads
• Results
• Conclusions & Future Work
Signature background
• Signatures (hardware Bloom filters) used to summarize and detect conflicts with a transaction’s read- and write-sets
– Inspired by Bulk system [Ceze, ISCA’06]
– Implemented in LogTM-SE [Yen, HPCA’07]
– Can have false positives, but never false negatives
– Also proposed for non-TM purposes (e.g., SC violation detection, atomicity violation detection, race recording)
• Ex: Use k Bloom filters of size m/k, with independent hash functions
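A minimal software sketch of the structure described above: k Bloom filter banks of m/k bits each, with one hash function per bank. The class name `Signature` and the salted SHA-256 hash are assumptions made for illustration. Real signatures use cheap hardware hashes, not a cryptographic one.

```python
import hashlib


class Signature:
    """Parallel Bloom filter: k banks of m // k bits, one hash function per bank."""

    def __init__(self, m_bits=2048, k=2):
        self.k = k
        self.bank_bits = m_bits // k
        self.banks = [0] * k  # each bank is an int used as a bit vector

    def _index(self, bank, addr):
        # Stand-in hash (assumption): salted SHA-256 reduced to a bank index.
        digest = hashlib.sha256(f"{bank}:{addr}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.bank_bits

    def insert(self, addr):
        for b in range(self.k):
            self.banks[b] |= 1 << self._index(b, addr)

    def may_contain(self, addr):
        # A hit requires the indexed bit to be set in every bank.
        return all((self.banks[b] >> self._index(b, addr)) & 1 for b in range(self.k))


read_set = [0x1000, 0x1040, 0x2080]
sig = Signature()
for a in read_set:
    sig.insert(a)

no_false_negatives = all(sig.may_contain(a) for a in read_set)
print(no_false_negatives)  # True: inserted addresses always hit
```

An address that was never inserted may still hit if its bits alias with inserted ones. That is the false-positive case.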
Signature hash functions
• Which hash function is best? [Sanchez, MICRO’07]
– Bit-selection? Hash simply decodes some number of input bits
– H3? Each bit of a hash value is an XOR of (on avg.) half of the input address bits
• Result: H3 better with >=2 hash functions
[Figure: LogTM-SE w/ 2kb signatures]
• However, H3 uses many multi-level XOR trees
• Can we improve this?
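The two hash styles can be sketched in software as follows. This is an illustrative model, not the hardware design: the 6-bit block offset, the helper names `bit_select_hash` and `make_h3`, and the seeded random masks are all assumptions.

```python
import random


def bit_select_hash(addr, c, shift=6):
    """Bit-selection: use c address bits directly (here, just above a 64B block offset)."""
    return (addr >> shift) & ((1 << c) - 1)


def make_h3(c, addr_bits=32, seed=0):
    """H3: each output bit is the parity (XOR) of a random subset of address bits."""
    rng = random.Random(seed)
    # Each input bit is included in a mask with probability 1/2.
    masks = [rng.getrandbits(addr_bits) for _ in range(c)]

    def h3(addr):
        value = 0
        for i, mask in enumerate(masks):
            parity = bin(addr & mask).count("1") & 1
            value |= parity << i
        return value

    return h3


h3 = make_h3(c=10)
a, b = 0x00012340, 0x7FF12340  # differ only in high-order bits
same_under_bit_select = bit_select_hash(a, 10) == bit_select_hash(b, 10)
print(same_under_bit_select)  # True: bit-selection ignores the bits that differ
```

Each H3 output bit corresponds to one XOR tree over roughly half the address bits, which is where the gate cost comes from.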
H3 implementation
• Num XOR = $\frac{\text{addr length in bits}}{4} \cdot c \cdot k$
• Ex: 2kb signatures, k=2, c=10, 32-bit addr = 160 XOR gates per signature
• Can we reduce the total gate count?
Entropy overview
• Not all address bits have equal randomness
– Ex: High-level address bits unlikely to change if working set size is small
• Key insight: If input bits are random and those bits are used as inputs to hash functions, random hash values result
– Use entropy to measure bit randomness
• Entropy – measure of the uncertainty of a random variable x
Entropy formally defined
• Entropy = $-\sum_{i=1}^{N} p(x_i) \log_2(p(x_i))$
• p(x_i) = the probability of the occurrence of value x_i
• N = number of sample values random variable x can take on
• Entropy = amount of information required on average to describe outcome of variable x (in bits)
– Ex: What is the best possible lossless compression?
[Figure: Entropy value of n-bit field, from min to max]
– 0 bits (min): n-bit field has constant value
– n bits (max): All bit patterns in n-bit field equally likely
– Other cases fall in between
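A short sketch of the entropy definition applied to sampled values, covering the two extremes and one case in between. The helper name `entropy_bits` and the sample lists are illustrative assumptions.

```python
import math
from collections import Counter


def entropy_bits(samples):
    """Shannon entropy, in bits, of the empirical distribution of `samples`."""
    counts = Counter(samples)
    total = len(samples)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


n = 4
constant = [0b1010] * 64           # n-bit field never changes
uniform = list(range(2 ** n)) * 4  # every n-bit pattern equally likely
skewed = [0] * 48 + [1] * 16       # an in-between case

print(entropy_bits(constant))  # 0.0
print(entropy_bits(uniform))   # 4.0
```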
Our measures of entropy
• For our workloads, we care about:
• Q1: What is the best achievable entropy?
– Global entropy – upper bound on entropy of address
• Q2: How does entropy change within an address?
– Local entropy – entropy of bit-field within the address
[Figure: global entropy is computed over address bits 31-6; local entropy is computed over a bit-field within bits 31-6, positioned by NSkip]
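The two measures can be sketched over an address trace as below. This assumes NSkip counts the low-order bits skipped above a 6-bit block offset. The synthetic trace, the function names, and the 16-bit default window are illustrative choices, not the paper's tooling.

```python
import math
import random
from collections import Counter


def entropy_bits(samples):
    counts = Counter(samples)
    total = len(samples)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def global_entropy(addrs, block_bits=6):
    """Entropy of the whole block address (bits 31..6)."""
    return entropy_bits([a >> block_bits for a in addrs])


def local_entropy(addrs, n_skip, window=16, block_bits=6):
    """Entropy of a `window`-bit field starting n_skip bits above the block offset."""
    mask = (1 << window) - 1
    return entropy_bits([(a >> (block_bits + n_skip)) & mask for a in addrs])


# Synthetic trace (assumption): random addresses inside a 1 MB working set.
rng = random.Random(1)
trace = [0x40000000 + rng.randrange(1 << 20) for _ in range(4096)]

g = global_entropy(trace)
low = local_entropy(trace, n_skip=0)
high = local_entropy(trace, n_skip=10)
print(low > high)  # True: the high-order window sees far fewer distinct values
```

Because a bit-field is a function of the full block address, local entropy can never exceed global entropy.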
Entropy results
• Workloads to be described later
• Global entropy is at most 16 bits
• Bit-window for local entropy is 16 bits wide (NSkip from 0-10)
– Smaller windows (<16b) may not reach global entropy value
– Larger windows (>16b) hide some fine-grain info
Entropy results summary
• More entropy results in our MICRO paper
• In summary, for our workloads entropy monotonically decreases when moving towards high-order bits
– We calculate the average entropy across the entire workload’s execution
– May miss entropy changes due to program phase behavior
• Our Page-Block-XOR (PBX) hash takes advantage of this overall trend
Page-Block-XOR (PBX)
• Motivated by 3 findings:
– (1) Lower-order bits have most entropy
• Follows from our entropy results
– (2) XORing two bit-fields produces random hash values
• From prior work on XOR hashing (e.g., data placement in caches, DRAM)
– (3) Bit-field overlaps can lead to higher false positives
• Correlation between the two bit-fields can reduce the range of hash values produced (worse for larger signatures)
PBX implementation
• For 2kb signatures with 2 hash functions:
– 20 XOR gates for PBX vs 160 XOR gates for H3!
• PPN and Cache-index fields not tied to system params:
– Use entropy to find two non-overlapping bit-fields with high randomness
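A minimal sketch of the PBX idea for a single hash function: XOR two non-overlapping c-bit fields of the address. The field positions (`low_start=6`, `high_start=16`) are hypothetical. In practice they would be chosen from the entropy measurements.

```python
def pbx_hash(addr, c=10, low_start=6, high_start=16):
    """XOR two non-overlapping c-bit fields of the address: c two-input XOR gates."""
    assert high_start >= low_start + c, "bit-fields must not overlap"
    mask = (1 << c) - 1
    low_field = (addr >> low_start) & mask    # high-entropy, low-order field
    high_field = (addr >> high_start) & mask  # page-number-like field
    return low_field ^ high_field


c, k = 10, 2
pbx_gates = c * k  # one XOR gate per hash bit, per hash function
print(pbx_gates)   # 20
```

With one two-input XOR per hash bit, c=10 and k=2 give 20 gates, which matches the PBX count quoted on the slide.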
Summary thus far
• Problem 1: H3 has high area & power overheads
• Solution 1: Use entropy analysis to guide lower-cost PBX
– Ex: 160 gates for H3 vs 20 gates for PBX
• Problem 2: Spurious signature conflicts caused by signature bits set by private memory addrs
• Solution 2: To be described
Motivation
• False conflicts caused by thread-private addrs
– Avoid conflicts if addrs not inserted in thread’s signatures
Privatization solutions
• Two solutions proposed:
– (1) Remove private stack references from sigs.
• Very little work for programmer/compiler
• Benefits depend on fraction of stack addresses versus all transactional references
– (2) Language-level interface (e.g., private_malloc(), shared_malloc())
• Even higher performance boost
• For skilled programmer
• WARNING: Incorrectly marking shared objects as private can lead to program errors!
Page-based implementation
• Each page is assigned a status, private or shared
– Invariant: Page is shared if any object on it is shared
• If stack is private, library marks stack pages as private
• If using privatization heap functions, mark heap pages accordingly
OS support
• OS allocates different physical page frames for shared and private pages
– Sets a per-frame bit in translation entry if shared
– Reduce number of page frames used by packing objects with same status together
• Signatures insert memory addresses of transactional references to shared pages
– Query page sharing bit in HW TLB & current transactional status
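The insertion filter described above can be modeled roughly as follows. The `Tlb` class, the 8 KB page size, the default-to-shared policy for unknown pages, and the use of a Python set in place of a hardware signature are all assumptions made for illustration.

```python
PAGE_BITS = 13  # assumption: 8 KB pages


class Tlb:
    """Toy TLB: maps page number -> shared bit (True if the page is shared)."""

    def __init__(self):
        self.shared_bit = {}

    def is_shared(self, addr):
        # Unknown pages default to shared, the conservative choice.
        return self.shared_bit.get(addr >> PAGE_BITS, True)


def maybe_insert(signature, tlb, addr, in_transaction):
    """Insert addr only for transactional references to shared pages."""
    if in_transaction and tlb.is_shared(addr):
        signature.add(addr)  # a Python set stands in for the hardware signature
        return True
    return False


tlb = Tlb()
tlb.shared_bit[0x10000 >> PAGE_BITS] = False  # private page (e.g., stack)
tlb.shared_bit[0x20000 >> PAGE_BITS] = True   # shared page

read_sig = set()
maybe_insert(read_sig, tlb, 0x10040, in_transaction=True)   # private: filtered out
maybe_insert(read_sig, tlb, 0x20040, in_transaction=True)   # shared: inserted
maybe_insert(read_sig, tlb, 0x20080, in_transaction=False)  # non-transactional: skipped
print(read_sig == {0x20040})  # True
```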
Methodology
• Full-system simulation using Simics and Wisconsin GEMS timing modules
• Transistor-level design for area & power of XOR gates
• CACTI for Bloom filter bit array area & power
• Simulated system
– Single-chip CMP
– 16 single-threaded, in-order cores
– 32kB, 4-way private L1 I & D, write-back
– 8MB, 8-way shared L2 cache
– MESI directory protocol
– Signatures from 64b-64kb (8B-8kB) & “Perfect”
Workloads
• Micro-benchmarks
– BTree – read and write ops on shared tree
– Sparse Matrix – algorithm from dense column vector multiplication kernel
• SPLASH-2 apps
– Barnes & Raytrace – exert most signature pressure
• Stanford STAMP apps
– Vacation, Genome, Delaunay, Bayes, Labyrinth
• DNS server
– BIND
PBX vs H3 area & power
• Area & power overheads (2kb, k=4):

Type of overhead | Bloom filter bit array | H3 hash | PBX hash | H3 sig. | PBX sig. | % savings for PBX sig.
Area (mm^2)      | 2.70e-2                | 8.10e-3 | 4.70e-4  | 3.50e-2 | 2.70e-2  | 23
Power (mW)       | 1.80e2                 | 1.04e1  | 1.02     | 1.90e2  | 1.81e2   | 4.7
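The savings column can be checked directly from the H3 sig. and PBX sig. columns. The sketch below copies the table values. The dictionary layout and the helper name `pct_savings` are illustrative.

```python
# Values copied from the table above (2kb signatures, k=4).
area_mm2 = {"bit_array": 2.70e-2, "h3_hash": 8.10e-3, "pbx_hash": 4.70e-4,
            "h3_sig": 3.50e-2, "pbx_sig": 2.70e-2}
power_mw = {"bit_array": 1.80e2, "h3_hash": 1.04e1, "pbx_hash": 1.02,
            "h3_sig": 1.90e2, "pbx_sig": 1.81e2}


def pct_savings(row):
    """Percent reduction of the PBX signature relative to the H3 signature."""
    return 100.0 * (row["h3_sig"] - row["pbx_sig"]) / row["h3_sig"]


area_savings = pct_savings(area_mm2)
power_savings = pct_savings(power_mw)
print(round(area_savings))      # 23
print(round(power_savings, 1))  # 4.7
```

The signature columns appear to be the bit array plus the corresponding hash logic, rounded. The bit array dominates both totals, which is why the power savings are much smaller than the area savings.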
PBX vs H3 execution time
• PBX performs similarly to H3
• Additional workload results in paper
Privatization results summary
• Removing private stack references from signatures did not help much
– Most addr references not to stack
– Most likely because running with SPARC ISA. Other ISAs (e.g., x86) likely have more benefits
• Privatization interface helps four workloads
– Remainder either do not have private heap structures or do not have high transactional duty cycle
Conclusions
• Tackle 2 problems with signature designs:
– (1) Area and power overheads of H3 hashing
• E.g., 160 XOR gates for H3, 20 for PBX
– (2) False conflicts due to signature bits set by private memory references
• Our solutions:
– (1) Use entropy analysis to guide hashing function (PBX), a low-cost alternative that performs similarly to H3
– (2) Prevent private stack references from entering signatures, and propose a privatization interface for heap allocations
• Notary can be applied to non-TM uses:
– PBX hashing can directly transfer
– Privatization may transfer if addr filtering applies
Future Work
• Dynamic entropy calculation:
– How to adapt PBX hashing to entropy changes over time?
• Dynamic privatization characteristics:
– How common is it for objects to change sharing status (i.e., from private to shared, and vice versa)?
Privatization interface
Privatization function | Usage
shared_malloc(size), private_malloc(size) | Dynamic allocation of shared and private memory objects
shared_free(ptr), private_free(ptr) | Frees up memory allocated by shared or private allocators
privatize_barrier(num_threads, ptr, size), publicize_barrier(num_threads, ptr, size) | Program threads come to a common point to privatize or publicize an object. Must be used outside of transactions
Dynamic privatization
• Dynamically switch from private to shared, and vice versa
• If transitioning from private -> shared, safe to mark page as shared (at cost of performance)
• If transitioning from shared -> private, default policy is to disallow if there exist other shared objects on same page
• Otherwise, trap to user software and let programmer call shared_free(), followed by private_malloc() on object
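The page-status rules for dynamic privatization can be modeled in a few lines. This toy model covers only the page-level policy: the `Page` class and the `publicize` / `privatize` helpers are hypothetical names, and the trap to user software is not modeled.

```python
class Page:
    """Toy page record: tracks which objects on the page are shared."""

    def __init__(self):
        self.shared_objects = set()

    @property
    def status(self):
        # Invariant from the slides: a page is shared if any object on it is shared.
        return "shared" if self.shared_objects else "private"


def publicize(page, obj):
    """private -> shared: always safe; the whole page becomes shared."""
    page.shared_objects.add(obj)
    return True


def privatize(page, obj):
    """shared -> private: disallowed while other shared objects live on the page."""
    if page.shared_objects - {obj}:
        return False
    page.shared_objects.discard(obj)
    return True


crowded = Page()
publicize(crowded, "a")
publicize(crowded, "b")
blocked = privatize(crowded, "a")  # False: "b" is still shared on this page

solo = Page()
publicize(solo, "c")
allowed = privatize(solo, "c")     # True: no other shared object on the page
print(blocked, allowed, solo.status)  # False True private
```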
Signature Operation Example
Program:
xbegin
LD A
ST B
LD C
LD D
ST C
…
[Figure: addresses A, B, C, D pass through the hash function(s) and set bits in the read (R) and write (W) signatures; after the program runs, R = 00100100 and W = 00100010]
• External ST E: aliases with set signature bits. FALSE POSITIVE: CONFLICT!
• External ST F: NO CONFLICT
Type of Hash Functions
• In real programs, addresses neither independent nor uniformly distributed (key assumptions to derive PFP(n))
• But can generate hash values that are almost uniformly distributed and uncorrelated with good (universal/almost universal) hash functions
• Hash functions considered:
– Bit-selection (inexpensive, low quality)
– H3 [Carter, CSS79] (moderate, higher quality)