Transcript of sector-sphere
Distributed Data Storage and Parallel Processing Engine
Sector & Sphere
Yunhong Gu, Univ. of Illinois at Chicago
What is Sector/Sphere?
Sector: Distributed File System
Sphere: Parallel Data Processing Engine (generic MapReduce)
Open source software, GPL/BSD, written in C++
Started in 2006; current version 1.23
http://sector.sf.net
Overview
Motivation
Sector
Sphere
Experimental Results
Motivation
[Diagram: storage and compute separated vs. co-located with the data]
Super-computer model: expensive, data I/O bottleneck
Sector/Sphere model: inexpensive, parallel data I/O, data locality
Motivation
Parallel/distributed programming with MPI, etc.: flexible and powerful, but too complicated.
Sector/Sphere model (cloud model): the cluster appears as a single entity to the developer, with a simplified programming interface; limited to certain data-parallel applications.
Motivation
Systems designed for a single data center: require additional effort to locate and move data.
Sector/Sphere model: supports wide-area data collection and distribution.
[Diagram: multiple data centers, with a data provider at a US location and a data reader at an Asia location]
Sector/Sphere
[Diagram: data providers at US and Europe locations upload data into Sector/Sphere; a data user at a US location runs processing; a data reader at an Asia location downloads results]
Sector Distributed File System
[Architecture diagram: a Security Server handles user accounts, data protection, and system security; Masters maintain metadata and scheduling and act as the service provider; Clients provide the system access tools and application programming interfaces. Masters and Clients talk to the Security Server over SSL; the slaves provide storage and processing; data moves over UDT, with optional encryption.]
Sector Distributed File System
Sector stores files on the native/local file system of each slave node.
Sector does not split files into blocks.
Pro: simple/robust, suitable for wide-area deployment, fast and flexible data processing.
Con: users need to handle file sizes properly.
The master nodes maintain the file system metadata; no permanent metadata is needed.
Topology aware.
Sector: Performance
The data channel is set up directly between a slave and a client.
Multiple active-active masters (load balance), starting from version 1.24.
UDT is used for high-speed data transfer. UDT is a high-performance UDP-based data transfer protocol, much faster than TCP over wide-area links.
UDT: UDP-based Data Transfer
http://udt.sf.net
Open source UDP-based data transfer protocol with reliability control and congestion control.
Fast, firewall friendly, easy to use.
Already used in much commercial and research software.
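As an illustration of UDT's socket-style C++ API, here is a minimal client sketch; the host, port, and message are hypothetical, and the headers and error handling follow the style of the UDT SDK examples (details may vary across versions).

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <cstring>
#include <iostream>
#include <udt.h>

int main()
{
    UDT::startup();                                      // initialize the UDT library
    UDTSOCKET client = UDT::socket(AF_INET, SOCK_STREAM, 0);

    sockaddr_in serv;
    std::memset(&serv, 0, sizeof(serv));
    serv.sin_family = AF_INET;
    serv.sin_port = htons(9000);                         // hypothetical port
    inet_pton(AF_INET, "203.0.113.10", &serv.sin_addr);  // hypothetical server address

    if (UDT::ERROR == UDT::connect(client, (sockaddr*)&serv, sizeof(serv))) {
        std::cerr << "connect: " << UDT::getlasterror().getErrorMessage() << "\n";
        return 1;
    }

    const char* msg = "hello over UDT";
    UDT::send(client, msg, std::strlen(msg), 0);         // reliable transfer over UDP

    UDT::close(client);
    UDT::cleanup();
    return 0;
}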
Sector: Fault Tolerance
Sector uses replication for better reliability and availability.
Replicas can be made either at write time (instantly) or periodically.
Sector supports multiple active-active masters for high availability.
Sector: Security
Sector uses a security server to maintain user accounts and IP access control for masters, slaves, and clients.
Control messages are encrypted (not completely finished in the current version).
Data transfer can optionally be encrypted.
The data transfer channel is set up by rendezvous; there is no listening server.
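To illustrate the rendezvous setup (no listening server), here is a sketch using UDT's rendezvous socket option; the port and peer address are hypothetical, and both peers would run the same logic with the address pair swapped.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <cstring>
#include <udt.h>

int main()
{
    UDT::startup();
    UDTSOCKET sock = UDT::socket(AF_INET, SOCK_STREAM, 0);

    bool rendezvous = true;                               // enable rendezvous mode
    UDT::setsockopt(sock, 0, UDT_RENDEZVOUS, &rendezvous, sizeof(bool));

    sockaddr_in local, peer;
    std::memset(&local, 0, sizeof(local));
    std::memset(&peer, 0, sizeof(peer));
    local.sin_family = AF_INET;
    peer.sin_family = AF_INET;
    local.sin_port = htons(9000);                         // both sides bind a known port
    local.sin_addr.s_addr = INADDR_ANY;
    peer.sin_port = htons(9000);
    inet_pton(AF_INET, "203.0.113.7", &peer.sin_addr);    // hypothetical peer address

    UDT::bind(sock, (sockaddr*)&local, sizeof(local));
    UDT::connect(sock, (sockaddr*)&peer, sizeof(peer));   // both peers call connect()

    UDT::close(sock);
    UDT::cleanup();
    return 0;
}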
Sector: Tools and API
Supported file system operations: ls, stat, mv, cp, mkdir, rm, upload, download. Wildcard characters are supported.
System monitoring: sysinfo.
C++ API: list, stat, move, copy, mkdir, remove, open, close, read, write, sysinfo.
FUSE (see the POSIX sketch below).
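Because Sector can be mounted through FUSE, ordinary POSIX I/O works against it; a minimal sketch, assuming a hypothetical mount point /mnt/sector and file path.

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    // Open a Sector file through the FUSE mount; the path is hypothetical.
    int fd = open("/mnt/sector/sdss/image-0001.dat", O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        std::fwrite(buf, 1, (size_t)n, stdout);   // stream the file to stdout

    close(fd);
    return 0;
}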
Sphere: Simplified Data Processing
Data-parallel applications.
Data is processed where it resides, or on the nearest possible node (locality).
The same user-defined functions (UDFs) are applied to all elements (records, blocks, or files).
Processing output can be written to Sector files or sent back to the client.
Generalized Map/Reduce.
Sphere: Simplified Data Processing
for each file F in (SDSS datasets)
    for each image I in F
        findBrownDwarf(I, …);
[Diagram: the Sphere client application splits the input stream into data segments, locates and schedules SPEs on the slaves, and collects the results into the output stream]
SphereStream sdss;
sdss.init("sdss files");
SphereProcess myproc;
myproc->run(sdss, "findBrownDwarf", …);
myproc->read(result);

findBrownDwarf(char* image, int isize, char* result, int rsize);
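For illustration, a shape-only sketch of a UDF matching the simplified signature above; the brown-dwarf detection logic is replaced by a placeholder (counting bright pixels), and the real Sphere interface passes additional metadata not shown here.

#include <cstdio>

int findBrownDwarf(char* image, int isize, char* result, int rsize)
{
    // Placeholder "detection": treat the input block as raw 8-bit pixels
    // and count the bright ones.
    int bright = 0;
    for (int i = 0; i < isize; ++i)
        if ((unsigned char)image[i] > 200)
            ++bright;

    // Write a small summary record into the output buffer supplied by Sphere.
    int written = std::snprintf(result, rsize, "bright_pixels=%d\n", bright);
    return (written > 0 && written < rsize) ? 0 : -1;     // 0 = success
}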
Sphere: Data Movement
Slave -> Slave (local)
Slave -> Slaves (Shuffle/Hash)
Slave -> Client
[Diagram: Stage 1 (shuffling): SPEs read segments n..n+m of the input stream and hash records into buckets 0..b of an intermediate stream; Stage 2 (sorting): SPEs sort each bucket to produce the output stream]
Sphere/UDF vs. MapReduce
Sphere:    Record Offset Index  | UDF | Hashing / Bucket | -       | UDF    | -
MapReduce: Parser / Input Reader | Map | Partition        | Compare | Reduce | Output Writer
Sphere/UDF vs. MapReduce
Sphere is more straightforward and flexible.
A UDF can be applied directly to records, blocks, files, and even directories.
Native binary data support.
Sorting is required by Reduce, but it is optional in Sphere.
Sphere uses a PUSH model for data movement, which is faster than the PULL model used by MapReduce.
Why Sector doesn’t Split Files?
Certain applications need to process a whole file or even a whole directory
Certain legacy applications need a file or a directory as input
Certain applications need multiple inputs, e.g., everything in a directory
In Hadoop, all blocks would have to be moved to one node for processing, hence no data locality benefit.
Load Balance
The number of data segments is much larger than the number of SPEs. When an SPE completes a data segment, a new segment is assigned to it.
Data transfer is balanced across the system to optimize network bandwidth usage.
Fault Tolerance
Map failure is recoverable: if one SPE fails, the data segment assigned to it will be re-assigned to another SPE and processed again.
Reduce failure is unrecoverable. In small to medium systems, machine failure during run time is rare; if necessary, developers can split the input into multiple sub-tasks to reduce the cost of a reduce failure.
Open Cloud Testbed
4 racks in Baltimore (JHU), Chicago (StarLight and UIC), and San Diego (Calit2).
10 Gb/s inter-site connection on CiscoWave.
2 Gb/s inter-rack connection.
Each node: two dual-core AMD CPUs, 12GB RAM, a single 1TB disk.
The testbed will be doubled by Sept. 2009.
Open Cloud Testbed
The TeraSort Benchmark
Data is split into small files, scattered across all slaves.
Stage 1: On each slave, an SPE scans the local files and sends each record to a bucket file on a remote node according to the key.
Stage 2: On each destination node, an SPE sorts all data inside each bucket.
TeraSort
[Diagram: each 100-byte record consists of a 10-byte key and a 90-byte value. Stage 1: hash each record into one of 1024 bucket files (Bucket-0 .. Bucket-1023) based on the first 10 bits of the key. Stage 2: sort each bucket on the local node.]
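A small sketch of the Stage 1 bucket computation described above: the record layout (10-byte key + 90-byte value) and the use of the first 10 bits of the key come from the slide; the I/O and bucket-file handling are omitted, and the sample key bytes are arbitrary.

#include <cstdio>

constexpr int RECORD_SIZE = 100;   // 10-byte key + 90-byte value
constexpr int NUM_BUCKETS = 1024;  // 2^10 buckets, selected by the first 10 bits of the key

// Return the bucket index (0..1023) for a record, using the first 10 bits of its key.
int bucket_of(const unsigned char* record)
{
    return ((record[0] << 2) | (record[1] >> 6)) & (NUM_BUCKETS - 1);
}

int main()
{
    unsigned char record[RECORD_SIZE] = {0};
    record[0] = 0xAB;                          // example key bytes
    record[1] = 0xCD;

    std::printf("record goes to Bucket-%d\n", bucket_of(record));
    return 0;
}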
Performance Results: TeraSort
Racks                          | Data Size | Sphere | Hadoop (3 replicas) | Hadoop (1 replica)
UIC                            | 300GB     | 1265   | 2889                | 2252
UIC + StarLight                | 600GB     | 1361   | 2896                | 2617
UIC + StarLight + Calit2       | 900GB     | 1430   | 4341                | 3069
UIC + StarLight + Calit2 + JHU | 1.2TB     | 1526   | 6675                | 3702

Run time in seconds. Sector v1.16 vs Hadoop 0.17.
Performance Results: TeraSort
Sorting 1.2TB on 120 nodes.
Sphere hash & local sort: 981 sec + 545 sec. Hadoop: 3702/6675 seconds.
Sphere hash: CPU 130%, MEM 900MB. Sphere local sort: CPU 80%, MEM 1.4GB.
Hadoop: CPU 150%, MEM 2GB.
The MalStone Benchmark
Drive-by problem: visit a web site and get compromised by malware.
MalStone-A: compute the infection ratio of each site.
MalStone-B: compute the infection ratio of each site from the beginning up to the end of every week.
http://code.google.com/p/malgen/
MalStone
[Diagram: each text record is transformed into a key/value pair with the site ID as key and the time as value, then hashed into one of 1000 bucket files (site-000X .. site-999X) using a 3-byte bucket index (000-999) derived from the site ID. Stage 1: process each record and hash it into buckets according to the site ID. Stage 2: compute the infection rate for each site.]

Text record format: Event ID | Timestamp | Site ID | Compromise Flag | Entity ID
Example: 00000000005000000043852268954353585368|2008-11-08 17:56:52.422640|3857268954353628599|1|000000497829
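As a sketch of the Stage 2 computation for MalStone-A, the following reads pipe-delimited text records in the format shown above and computes the per-site infection ratio (compromised visits divided by total visits); it is illustrative only, not the benchmark implementation, and the input is assumed to arrive on standard input.

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>

int main()
{
    // Each input line: Event ID | Timestamp | Site ID | Compromise Flag | Entity ID
    std::map<std::string, std::pair<long, long>> sites;   // site -> {compromised, total}

    std::string line;
    while (std::getline(std::cin, line)) {
        std::istringstream rec(line);
        std::string event_id, timestamp, site_id, flag, entity_id;
        std::getline(rec, event_id, '|');
        std::getline(rec, timestamp, '|');
        std::getline(rec, site_id, '|');
        std::getline(rec, flag, '|');
        std::getline(rec, entity_id, '|');

        std::pair<long, long>& counts = sites[site_id];
        counts.second += 1;                    // total visits to this site
        if (flag == "1") counts.first += 1;    // visits that led to a compromise
    }

    for (const auto& s : sites)
        std::cout << s.first << " infection ratio = "
                  << double(s.second.first) / double(s.second.second) << "\n";
    return 0;
}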
Performance Results: MalStone
                        | MalStone-A | MalStone-B
Hadoop                  | 454m 13s   | 840m 50s
Hadoop Streaming/Python | 87m 29s    | 142m 32s
Sector/Sphere           | 33m 40s    | 43m 44s

Processing 10 billion records on 20 OCT (Open Cloud Testbed) nodes (local).
* Courtesy of Collin Bennet and Jonathan Seidman of Open Data Group.
System Monitoring (Testbed)
System Monitoring (Sector/Sphere)
For More Information
Sector/Sphere code & docs: http://sector.sf.net
Open Cloud Consortium: http://www.opencloudconsortium.org
NCDM: http://www.ncdm.uic.edu