SDEC2011 Glory-FS development & Experiences


http://sdec.kr/

Transcript of SDEC2011 Glory-FS development & Experiences

Page 1: SDEC2011 Glory-FS development & Experiences


Page 2: SDEC2011 Glory-FS development & Experiences


GLORY: Global Resource Management System For Future Internet Services
- Development & Experiences -

2011.6.28

진기성 ([email protected])

Page 3: SDEC2011 Glory-FS development & Experiences


Contents

Overview

Architecture

Key Features

Performance Status

Page 4: SDEC2011 Glory-FS development & Experiences


Page 5: SDEC2011 Glory-FS development & Experiences


What’s GLORY-FS?
Cluster file system software that provides large-scale file storage space using low-cost storage servers

Large-scale data: computer modeling, 3-D modeling, CAD/CAM, PDF, satellite imagery, music/audio, video/graphics (UCC), CDN, archives

Data characteristics: throughput-oriented processing, sequential read/write

Research Goal
Storage system software for large-scale Internet services that uses thousands to tens of thousands of commodity servers as storage servers to achieve:
- minimal storage TCO
- high performance
- effective control over failures

Thousands to tens of thousands of storage servers: more than petabytes
TCO minimization: autonomous storage management
High performance: linearly scalable I/O performance
Failure control: self-detection, self-healing, etc.

Overview

Page 6: SDEC2011 Glory-FS development & Experiences


Overview

[Figure: storage system landscape plotted by I/O performance (60 MBps to 1000 Gbps) against capacity (500 GB to 100 PB): local file systems (NTFS, ext3); Network Attached Storage for SMB business (CIFS, NFS); SAN file systems (GPFS, StorageTank, ZFS); high-performance/parallel file systems for supercomputing (Lustre, Panasas); clustered NAS for enterprise and Web 2.0 (Isilon, IBRIX); cloud storage (Google FS, Hadoop DFS); and the target position of GLORY-FS.]

Page 7: SDEC2011 Glory-FS development & Experiences


GLORY Storage Solution

GLORY storage system = GLORY-FS™ cluster file system S/W running on hardware of any class (PC / entry-level / mid-range / high-end)

• Creates a single huge virtual drive
• Highest performance through an asymmetric cluster
• Easy expansion without service interruption
• Minimal storage cost through low-cost hardware and autonomous management
• Integrated web management tool

Overview

Page 8: SDEC2011 Glory-FS development & Experiences


Page 9: SDEC2011 Glory-FS development & Experiences


Architecture - Components

GLORY-FS Metadata Server
• Data server management/monitoring
• File system metadata management: directory tree, inodes, chunk locations (a toy sketch of these records follows below)
• Stores metadata in MySQL
• Minimum of 1 node; 2 for availability, more than 2 for higher performance

GLORY-FS Data Server
• Stores file data as variable-sized chunks
• Keeps chunk data on a local file system such as ext3 or xfs
• Minimum of 1 node; 2 or more for availability and capacity

GLORY-FS Client
• Provides a Linux POSIX API-compatible mount point
• Provides a Windows FS API-compatible network drive
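To make the metadata server's role concrete, here is a minimal sketch of the kind of per-file records it could keep and hand to clients. This is illustrative only; the class and field names are assumptions, not GLORY-FS source code (the slides only say that the MDS tracks the directory tree, inodes, and chunk locations in MySQL).

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ChunkLocation:
    chunk_id: int            # identifier of one variable-sized chunk
    data_servers: List[str]  # DS nodes holding a replica of this chunk

@dataclass
class Inode:
    ino: int
    path: str
    size: int
    mode: int
    chunks: List[ChunkLocation] = field(default_factory=list)

# Toy in-memory namespace standing in for the MySQL-backed metadata store.
namespace = {
    "/share/big.avi": Inode(
        ino=1001, path="/share/big.avi", size=700 * 2**20, mode=0o644,
        chunks=[
            ChunkLocation(1, ["ds01", "ds02"]),
            ChunkLocation(2, ["ds03", "ds01"]),
        ],
    )
}

def lookup(path: str) -> Inode:
    """What a client asks the MDS for before doing any data I/O."""
    return namespace[path]
```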

Page 10: SDEC2011 Glory-FS development & Experiences


Architecture - Operation Flow

GLORY-FS Metadata Server SW (gfs_mds start)
GLORY-FS Data Server SW (gfs_ds start)
GLORY-FS Client SW (mount -t gfs mds:/vol /mnt)

[Figure: clients, the metadata server, and the data servers connected through a 1/10 Gbps Ethernet switch. Metadata requests go to the metadata server, which holds the volume tree (/, home, share, big.avi); file data flows directly between clients and data servers. Each data server is a low-cost x86 server box built from PC-class HDDs (7200 rpm SATA, about 1 TB and 80 MBps per disk, about 10 TB and 2 Gbps per server).]
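To make the flow concrete, here is a minimal sketch of the two-step path a read takes: a metadata lookup against the MDS, followed by chunk reads that go straight to the data servers. The helper names and returned values are hypothetical stand-ins for the real GLORY-FS RPCs, which the slides do not show.

```python
def mds_lookup(path):
    """Ask the metadata server for the chunk list of a file (metadata path)."""
    return [  # (chunk_id, data servers holding a replica)
        (1, ["ds01", "ds02"]),
        (2, ["ds03", "ds01"]),
    ]

def ds_read_chunk(server, chunk_id):
    """Fetch one chunk directly from a data server (data path, no MDS involved)."""
    return f"<data of chunk {chunk_id} from {server}>".encode()

def read_file(path):
    data = bytearray()
    for chunk_id, replicas in mds_lookup(path):       # 1) metadata lookup
        data += ds_read_chunk(replicas[0], chunk_id)   # 2) direct data I/O
    return bytes(data)

print(read_file("/share/big.avi")[:40])
```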

Page 11: SDEC2011 Glory-FS development & Experiences


Architecture - Process Structure

All-Usermode Architecture
• The metadata server and data server processes are user-mode daemons
• No kernel dependency
• Binary-level distribution
• No kernel panics

Linux client requires FUSE kernel support
• FUSE is a user-level file system SDK for Linux (http://fuse.sourceforge.net); a minimal FUSE example follows below
• FUSE is supported by nearly all Linux distributions (Linux 2.4.21 or later, Linux 2.6.x), as well as FreeBSD, NetBSD, Mac OS X, OpenSolaris, and GNU/Hurd

Windows client requires Callback File System
• Callback File System (CBFS) is a user-level file system SDK for Windows (http://www.eldos.com)
• The GLORY-FS Windows client is distributed with a free CBFS binary license
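For a flavor of what a FUSE-based user-level file system looks like, here is a minimal read-only example using the third-party fusepy bindings. It is a generic illustration of the FUSE model, not GLORY-FS client code.

```python
# pip install fusepy; run as: python hellofs.py /mnt/point
import errno, stat, sys, time
from fuse import FUSE, FuseOSError, Operations

HELLO = b"Hello from a user-level file system\n"

class HelloFS(Operations):
    """Serves a single read-only file, /hello.txt, entirely from user space."""

    def getattr(self, path, fh=None):
        now = time.time()
        if path == "/":
            return dict(st_mode=(stat.S_IFDIR | 0o755), st_nlink=2,
                        st_ctime=now, st_mtime=now, st_atime=now)
        if path == "/hello.txt":
            return dict(st_mode=(stat.S_IFREG | 0o444), st_nlink=1,
                        st_size=len(HELLO),
                        st_ctime=now, st_mtime=now, st_atime=now)
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        return [".", "..", "hello.txt"]

    def read(self, path, size, offset, fh):
        return HELLO[offset:offset + size]

if __name__ == "__main__":
    FUSE(HelloFS(), sys.argv[1], foreground=True)
```

Unmount with fusermount -u /mnt/point when done.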

Page 12: SDEC2011 Glory-FS development & Experiences


Page 13: SDEC2011 Glory-FS development & Experiences


Online Capacity/Performance Expansion

GLORY-FS consists of at least 2 data servers (for reliability)
Capacity, performance, and reliability increase as data servers are added

[Figure: clients attached to a growing pool of storage servers.]

Each storage server has 10 TB of storage and Gigabit Ethernet:

  Data servers | Performance | Capacity
  2            | 2 Gbps      | 20 TB
  3            | 3 Gbps      | 30 TB
  4            | 4 Gbps      | 40 TB
  5            | 5 Gbps      | 50 TB
  6            | 6 Gbps      | 60 TB
  7            | 7 Gbps      | 70 TB


Page 14: SDEC2011 Glory-FS development & Experiences


Replication-Based Automatic Recovery

Each file is sliced into pieces, called CHUNKs, and stored across multiple data servers
Once CHUNKs are stored, REPLICA chunks are made on different data servers
When a data server failure occurs, lost chunks are RECOVERED from their replicas
All REPLICAs are used for file read access (read load balancing)

[Figure: the chunks of a file "FILE" spread and replicated across five data servers.]
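A rough sketch of slicing, placement, and recovery follows. The chunk size, the round-robin placement policy, and the names are assumptions made for illustration; the slides only state that chunks are variable-sized and replicated across data servers.

```python
import itertools

DATA_SERVERS = ["ds01", "ds02", "ds03", "ds04", "ds05"]
CHUNK_SIZE = 64 * 2**20   # fixed size here for simplicity; GLORY-FS chunks are variable-sized

def slice_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Slice a file's bytes into CHUNK-sized pieces."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def place_chunks(chunks, servers=DATA_SERVERS, replicas=2):
    """Round-robin placement: every chunk gets `replicas` distinct data servers."""
    start_points = itertools.cycle(range(len(servers)))
    placement = {}
    for chunk_id, _ in enumerate(chunks):
        start = next(start_points)
        placement[chunk_id] = [servers[(start + r) % len(servers)] for r in range(replicas)]
    return placement

def recover(placement, failed_server, servers=DATA_SERVERS):
    """Re-replicate every chunk that lost a copy on the failed server."""
    for holders in placement.values():
        if failed_server in holders:
            holders.remove(failed_server)
            spare = next(s for s in servers if s != failed_server and s not in holders)
            holders.append(spare)   # the new copy is pulled from a surviving replica
    return placement

placement = place_chunks(slice_into_chunks(b"FILE" * 8, chunk_size=4))
print(recover(placement, "ds02"))
```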

Page 15: SDEC2011 Glory-FS development & Experiences


Dedicated Replication Network Support

Dedicated Replication Network
• Separates replication traffic from the service network
• Guarantees stable service I/O quality

[Figure: file system clients reach the data servers through a Gigabit switch for service I/O traffic, while data replication traffic between data servers runs over a separate 1/10 Gigabit switch.]

Page 16: SDEC2011 Glory-FS development & Experiences


Per-Volume/Directory/File Replication Factor

Configure Directory-Level Replication
• Set a replication factor on each directory (a small resolution sketch follows below)
• Critical and recent directories get a higher value
• Old directories get a lower value

[Figure: a volume with a default replica count of 3 and a year/month directory tree (2009, 2010, 2011 with monthly subdirectories 01 through 12); the replica count of old directories can be set to 1 or 2.]

Get more usable storage space without adding data servers!
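One simple way to realize this, assumed here for illustration because the slides do not describe the actual mechanism, is to store an optional replication factor per directory and let each file inherit the nearest setting above it in the tree:

```python
import posixpath

DEFAULT_VOLUME_REPLICAS = 3

# Per-directory overrides; directories without an entry inherit from their parent.
dir_replicas = {
    "/2009": 1,       # old data: a single copy is enough
    "/2010": 2,
    "/2011/06": 3,    # recent, critical data
}

def effective_replicas(path: str) -> int:
    """Walk up the directory tree until an explicit replication factor is found."""
    d = posixpath.dirname(path)
    while True:
        if d in dir_replicas:
            return dir_replicas[d]
        if d in ("/", ""):
            return DEFAULT_VOLUME_REPLICAS
        d = posixpath.dirname(d)

print(effective_replicas("/2009/01/old.avi"))   # -> 1
print(effective_replicas("/2011/06/new.avi"))   # -> 3
```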

Page 17: SDEC2011 Glory-FS development & Experiences


Advanced Automatic Recovery Features

Prioritized Recovery
• Example: a chunk with only 1 remaining replica is re-replicated before a chunk with 2 replicas (see the sketch after this list)

Physical Disk Relocation & IP Transparency
• The MDS assigns a UUID to each disk
• Supports relocating a disk to a different DS (upon node failure)
• This preserves the chunks on that disk, eliminating replica recovery

User-Defined Procedures on Major Events
• Triggered when any DS starts, stops, crashes, ...
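A minimal sketch of prioritized recovery, assuming the MDS tracks how many live replicas each damaged chunk still has; the queue and the sample data are invented for illustration.

```python
import heapq

# (chunk_id, live replica count) as the MDS might see them after a DS failure.
damaged_chunks = [("c17", 2), ("c03", 1), ("c42", 2), ("c08", 1)]

def recovery_order(chunks):
    """Chunks closest to data loss (fewest live replicas) are repaired first."""
    heap = [(replicas, chunk_id) for chunk_id, replicas in chunks]
    heapq.heapify(heap)
    while heap:
        replicas, chunk_id = heapq.heappop(heap)
        yield chunk_id, replicas

for chunk_id, replicas in recovery_order(damaged_chunks):
    print(f"re-replicate {chunk_id} (only {replicas} replica(s) left)")
```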

Page 18: SDEC2011 Glory-FS development & Experiences


Metadata Server Cluster

Cluster Architecture
• Management Server (MGT): cluster resource management
• Metadata Server (MDS): inodes & chunk locations

Scalable Metadata Capacity & Performance
• Up to 10 MDS nodes & 1 billion files
• Up to 50,000 metadata IOPS

Unbounded data capacity & performance (over 10 PB)
Unbounded metadata capacity & performance (over 1 billion files)

[Figure: file system clients send metadata lookups to the metadata server cluster (one MGT plus multiple MDS nodes) and perform data I/O directly against the data servers.]

Page 19: SDEC2011 Glory-FS development & Experiences


Multi-Volume Support

Service-Oriented Multiple Volumes
• Online volume addition
• Supports 30,000 unique volumes

Online Management of Volume Attributes
• Resizing the volume quota
• Resizing the volume replication level
• Real-time volume statistics monitoring

[Web management screenshot]

Page 20: SDEC2011 Glory-FS development & Experiences


Hot-Spot Avoidance

For massive read operations such as streaming services
A hot file is detected and replicated across more data servers automatically

[Figure: file "H" becomes HOT on two data servers and is then REPLICATED so that all five data servers hold a copy.]
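A toy illustration of the idea, with a made-up read-count threshold; the slides do not state how GLORY-FS actually detects hot files or how many extra replicas it creates.

```python
from collections import Counter

HOT_READS_PER_INTERVAL = 1000   # assumed threshold, for illustration only
MAX_REPLICAS = 5                # e.g. one copy on every data server

read_counts = Counter()
replica_count = {"H": 2, "C": 2}   # current replicas per file

def record_read(name):
    read_counts[name] += 1

def rebalance_hot_files():
    """Boost the replica count of files read unusually often, then reset counters."""
    for name, reads in read_counts.items():
        if reads >= HOT_READS_PER_INTERVAL and replica_count[name] < MAX_REPLICAS:
            replica_count[name] = MAX_REPLICAS   # spread the hot file across all data servers
    read_counts.clear()

for _ in range(1500):
    record_read("H")
record_read("C")
rebalance_hot_files()
print(replica_count)   # {'H': 5, 'C': 2}
```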

Page 21: SDEC2011 Glory-FS development & Experiences


POSIX-Compliant Linux Client

Supported APIs:
access, chdir, chmod, chown, chroot, close, create, fchdir, fchmod, fchown, fcntl, fdatasync, flock, fstat, fstatfs, fsync, ftruncate, getcwd, getdents, getxattr, lchown, link, listxattr, lockf, lseek, lstat, mkdir, mknod, mmap, mount, munmap, open, pread, pwrite, read, readlink, readv, rename, rmdir, setxattr, stat, statfs, symlink, sync, truncate, umount, unlink, ustat, utime

Limitations:
fcntl: POSIX locking not supported yet
flock: not supported (NFS also does not support it)
lockf: POSIX locking not supported yet
mmap: writable shared mmap not supported
open: O_DIRECT mode not supported

Page 22: SDEC2011 Glory-FS development & Experiences


Windows API Compatible Client

Windows Client
• Uses the Windows Callback File System library (CBFS V3)
• Integrated into Windows Explorer
• Supports various Windows versions (XP, Windows 2003/2008, Vista, Win7)
• GLORY-FS client management tool

[Screenshots: the client management tool mounting/unmounting GLORY-FS, and the mounted volume in Windows Explorer.]

Page 23: SDEC2011 Glory-FS development & Experiences


Monitoring and Web Management Tools

[Screenshots: command-line management tool and web-based management tool.]

Page 24: SDEC2011 Glory-FS development & Experiences


Page 25: SDEC2011 Glory-FS development & Experiences


Single-MDS Metadata Performance

• Measured with the IOZone benchmark tool (fileop)

[Chart: metadata operations per second (scale 0 to 3,500), comparing GLORY (labeled LAKE) against NFS on a single metadata server (mds1).]

Page 26: SDEC2011 Glory-FS development & Experiences


Multi-MDS Metadata Performance

[Chart: open, setattr, and create throughput (scale 0 to 35,000) for a single MDS versus clustered MDS configurations with 1, 2, and 4 nodes (CMDS #1, #2, #4).]

Page 27: SDEC2011 Glory-FS development & Experiences


Single-Client I/O Performance

• Measured with the IOZone benchmark tool

[Surface chart: initial read throughput (MBps, 0 to 120) as a function of file size (MB) and I/O size (1 to 64 KB); initial write and re-write look similar.]

Page 28: SDEC2011 Glory-FS development & Experiences


Single-Client I/O Performance

• Measured with the IOZone benchmark tool

[Surface chart: re-read throughput (MBps, 0 to 3,000) as a function of file size (MB) and I/O size (1 to 64 KB).]

Page 29: SDEC2011 Glory-FS development & Experiences


Single-Client I/O Performance Summary

Data path stages: FM cache (cache I/O, 4 GBps) -> network I/O (1 Gbps) -> DS cache / DS disk (disk I/O, 60 MBps)

Read
• Cached read bandwidth: >2 GBps (depends on CPU and bus speed)
• Bursty read bandwidth: ≒105 MBps
• Sustained read bandwidth: ≒80 MBps

Write
• Write-back cache not supported by FUSE
• Bursty write bandwidth: ≒105 MBps
• Sustained write bandwidth: ≒80 MBps

Page 30: SDEC2011 Glory-FS development & Experiences


Multi-Client Aggregate I/O Performance

• 1 MDS, 6 data servers, 20 clients
• Measured with the IOZone benchmark tool
• 20 clients, 1 GB file I/O on each node

[Chart: aggregate throughput (MB/sec, 0 to 600) versus number of clients (1 to 20), showing WRITE (LAKE) and READ (LAKE) against the ideal network limit.]

Page 31: SDEC2011 Glory-FS development & Experiences
