File Caching with SSD Arrays
description
Transcript of File Caching with SSD Arrays
US ATLAS Distributed Facility Workshop University of California, Santa Cruz
1
File Caching with SSD Arrays
Wei Yang
11/14/12
US ATLAS Distributed Facility Workshop University of California, Santa Cruz
2
Motivation• We are curious
– No immediate needs, but future needs– Caching (only) analysis job inputs– SSD has limited write cycles – Other goals, see the last slide
• File level caching– Conventional LFU/LRU algorithms
• can not capture ATLAS analysis jobs data usage pattern (if there is such a pattern)
– Sub-file level caching would be great! But book keeping is hard– We search for caching algorithm
• Out-Bytes > In-Bytes under ATLAS workload • Use LRU, but based on days/weeks/months job usage pattern
11/14/12
US ATLAS Distributed Facility Workshop University of California, Santa Cruz
3
Analysis jobs visit SSD cache first
Cache miss!forward to HD storage
Day 1 ... Day N
File 0001 X1 Xn
File 0002 Y1 Yn
Fill the cache
Xrootd monitoring stream
Setup 1: Caching based on File Access Frequency
o A table records access Frequency of all fileso Rotate columns to maintain N days of records
11/14/12
US ATLAS Distributed Facility Workshop University of California, Santa Cruz
4
Analysis jobs visit SSD cache first
Cache miss!forward to HD storage
Fill the cache
Xrootd monitoring stream to
UCSD collector
Setup 2: Caching based on Historic File Access Info
o Record every file access as event like infoo save to ROOT files for later analysis
11/14/12
US ATLAS Distributed Facility Workshop University of California, Santa Cruz
5
Hardware of the SSD Box• Dell 610
– 8-core 2.4 Ghz– 24GB– Intel dual X520 10Gb NIC– LSI SAS 9200-8e (support TRIM)– RHEL 6 x86_64– Xrootd
• SSD Array– Dell MD1220– 12x OCZ Talos 960GB MLC SSDs, total ~11TB
• Non-raid to support TRIM. • Xrootd take care to gluing them together as a single space
11/14/12
US ATLAS Distributed Facility Workshop University of California, Santa Cruz
6
The box can deliverCan the caching algorithm deliver?
3-hour plot2012-11-05
6-month plot as of 2012-11-12
File Access Freq. Alg.Net data sink, not cache
Algorithm: Bytes-read/file size > 110% during the last 5 days, prioritized by this ratio and up to 200GB/hour
Sept 1
Lack of jobs
Cache brings in ~200GB/hour
11/14/12
US ATLAS Distributed Facility Workshop University of California, Santa Cruz
7
GB/hour from SSD + HDDGB/hour from SSDGB/hour to SSD
Lost monitoringdata from HDD
UCSD collectordead
Lack of jobsfor the last 4 days
Ceiling of 10Gb NIC
11/14/12
US ATLAS Distributed Facility Workshop University of California, Santa Cruz
8
Simulate the Cache with Historic Data
Cache size requiredfor day [-x, -1]
day –n -n+1 -1 0
Day 0:Size of all files read
Bytes read from SSD+HDD
Bytes read from SSD
Cache size required for [-x+1, 0] - = New data to cache
For a given caching algorithm, what do we want to learn from those historic data?
11/14/12
US ATLAS Distributed Facility Workshop University of California, Santa Cruz
9
Algorithm: every files during the last N days.11/14/12
US ATLAS Distributed Facility Workshop University of California, Santa Cruz
10
Algorithm: every files during the last N daysCache hit rate = Byte from SSD/Bytes from SSD+HDD
11/14/12
US ATLAS Distributed Facility Workshop University of California, Santa Cruz
11
Algorithm: Bytes-read/file size > 110% during the last 5 days11/14/12
US ATLAS Distributed Facility Workshop University of California, Santa Cruz
12
Analyzing the Historic Data
Try to find a way to identify data worth caching.
So far, not much success
Worth caching
11/14/12
US ATLAS Distributed Facility Workshop University of California, Santa Cruz
1311/14/12
Do the jobs tend to open the same file in a short time window?• If some, we may not have a chance to cache
File that worth caching• Access time (open) scatter over several hours – cacheable• But “scattering over several hours” doesn’t mean the file worth caching
US ATLAS Distributed Facility Workshop University of California, Santa Cruz
14
Next Step
• So far focusing on making it a good cache– More work to be done– Should also look at
• Asking Panda for input files lists of coming jobs• Possibility of sub-file level caching
• How much can the cache speed up analysis jobs?– All files are in SSD cache– Normal caching --- some files in SSD cache, some are
not
11/14/12