File Caching with SSD Arrays

US ATLAS Distributed Facility Workshop University of California, Santa Cruz

1

File Caching with SSD Arrays

Wei Yang

11/14/12


2

Motivation• We are curious

– No immediate needs, but future needs– Caching (only) analysis job inputs– SSD has limited write cycles – Other goals, see the last slide

• File level caching– Conventional LFU/LRU algorithms

• can not capture ATLAS analysis jobs data usage pattern (if there is such a pattern)

– Sub-file level caching would be great! But book keeping is hard– We search for caching algorithm

• Out-Bytes > In-Bytes under ATLAS workload • Use LRU, but based on days/weeks/months job usage pattern

11/14/12


3

Analysis jobs visit SSD cache first

Cache miss!forward to HD storage

Day 1 ... Day N

File 0001 X1 Xn

File 0002 Y1 Yn

Fill the cache

Xrootd monitoring stream

Setup 1: Caching based on File Access Frequency

o A table records access Frequency of all fileso Rotate columns to maintain N days of records

11/14/12


4

Analysis jobs visit SSD cache first

Cache miss!forward to HD storage

Fill the cache

Xrootd monitoring stream to

UCSD collector

Setup 2: Caching based on Historic File Access Info

o Record every file access as event like infoo save to ROOT files for later analysis

11/14/12


5

Hardware of the SSD Box• Dell 610

– 8-core 2.4 Ghz– 24GB– Intel dual X520 10Gb NIC– LSI SAS 9200-8e (support TRIM)– RHEL 6 x86_64– Xrootd

• SSD Array– Dell MD1220– 12x OCZ Talos 960GB MLC SSDs, total ~11TB

• Non-raid to support TRIM. • Xrootd take care to gluing them together as a single space

11/14/12


6

The box can deliverCan the caching algorithm deliver?

3-hour plot2012-11-05

6-month plot as of 2012-11-12

File Access Freq. Alg.Net data sink, not cache

Algorithm: Bytes-read/file size > 110% during the last 5 days, prioritized by this ratio and up to 200GB/hour

Sept 1

Lack of jobs

Cache brings in ~200GB/hour

11/14/12


7

GB/hour from SSD + HDDGB/hour from SSDGB/hour to SSD

Lost monitoringdata from HDD

UCSD collectordead

Lack of jobsfor the last 4 days

Ceiling of 10Gb NIC

11/14/12


8

Simulate the Cache with Historic Data

Cache size requiredfor day [-x, -1]

day –n -n+1 -1 0

Day 0:Size of all files read

Bytes read from SSD+HDD

Bytes read from SSD

Cache size required for [-x+1, 0] - = New data to cache

For a given caching algorithm, what do we want to learn from those historic data?

11/14/12


9

Algorithm: every files during the last N days.11/14/12


10

Algorithm: every files during the last N daysCache hit rate = Byte from SSD/Bytes from SSD+HDD

11/14/12


11

Algorithm: Bytes-read/file size > 110% during the last 5 days11/14/12


12

Analyzing the Historic Data

Try to find a way to identify data worth caching.

So far, not much success

Worth caching

11/14/12


1311/14/12

Do the jobs tend to open the same file in a short time window?• If some, we may not have a chance to cache

File that worth caching• Access time (open) scatter over several hours – cacheable• But “scattering over several hours” doesn’t mean the file worth caching


14

Next Step

• So far focusing on making it a good cache– More work to be done– Should also look at

• Asking Panda for input files lists of coming jobs• Possibility of sub-file level caching

• How much can the cache speed up analysis jobs?– All files are in SSD cache– Normal caching --- some files in SSD cache, some are

not

11/14/12

File Caching with SSD Arrays

Documents

Transcript of File Caching with SSD Arrays