Transcript of sector-sphere
Distributed Data Storage and Parallel Processing Engine
Sector & Sphere
Yunhong Gu, Univ. of Illinois at Chicago
What is Sector/Sphere?
Sector: Distributed File System
Sphere: Parallel Data Processing Engine (generic MapReduce)
Open source software, GPL/BSD, written in C++
Started in 2006; current version 1.23
http://sector.sf.net
Overview
Motivation
Sector
Sphere
Experimental Results
Motivation
[Diagram: storage and compute separated vs. co-located with the data]
Super-computer model: expensive, data I/O bottleneck
Sector/Sphere model: inexpensive, parallel data I/O, data locality
Motivation
Parallel/distributed programming with MPI, etc.: flexible and powerful, but too complicated.
Sector/Sphere model (cloud model): the cluster appears as a single entity to the developer, with a simplified programming interface; limited to certain data-parallel applications.
Motivation
Systems designed for a single data center: require additional effort to locate and move data.
Sector/Sphere model: supports wide-area data collection and distribution.
[Diagram: multiple data centers, with a data provider at a US location and a data reader at an Asia location]
Sector/Sphere
[Diagram: data providers at US and Europe locations upload data into Sector/Sphere; a data user at a US location runs processing; a data reader at an Asia location downloads results]
Sector Distributed File System
[Architecture diagram: a Security Server handles user accounts, data protection, and system security; Masters maintain metadata and scheduling and act as the service provider; Clients provide the system access tools and application programming interfaces. Masters and Clients talk to the Security Server over SSL; the slaves provide storage and processing; data moves over UDT, with optional encryption.]
Sector Distributed File System
Sector stores files on the native/local file system of each slave node.
Sector does not split files into blocks.
Pro: simple/robust, suitable for wide-area deployment, fast and flexible data processing.
Con: users need to handle file sizes properly.
The master nodes maintain the file system metadata; no permanent metadata is needed.
Topology aware.
Sector: Performance
The data channel is set up directly between a slave and a client.
Multiple active-active masters (load balance), starting from version 1.24.
UDT is used for high-speed data transfer. UDT is a high-performance UDP-based data transfer protocol, much faster than TCP over wide-area links.
UDT: UDP-based Data Transfer
http://udt.sf.net
Open source UDP-based data transfer protocol with reliability control and congestion control.
Fast, firewall friendly, easy to use.
Already used in much commercial and research software.
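As an illustration of UDT's socket-style C++ API, here is a minimal client sketch; the host, port, and message are hypothetical, and the headers and error handling follow the style of the UDT SDK examples (details may vary across versions).

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <cstring>
#include <iostream>
#include <udt.h>

int main()
{
    UDT::startup();                                      // initialize the UDT library
    UDTSOCKET client = UDT::socket(AF_INET, SOCK_STREAM, 0);

    sockaddr_in serv;
    std::memset(&serv, 0, sizeof(serv));
    serv.sin_family = AF_INET;
    serv.sin_port = htons(9000);                         // hypothetical port
    inet_pton(AF_INET, "203.0.113.10", &serv.sin_addr);  // hypothetical server address

    if (UDT::ERROR == UDT::connect(client, (sockaddr*)&serv, sizeof(serv))) {
        std::cerr << "connect: " << UDT::getlasterror().getErrorMessage() << "\n";
        return 1;
    }

    const char* msg = "hello over UDT";
    UDT::send(client, msg, std::strlen(msg), 0);         // reliable transfer over UDP

    UDT::close(client);
    UDT::cleanup();
    return 0;
}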
Sector: Fault Tolerance
Sector uses replication for better reliability and availability.
Replicas can be made either at write time (instantly) or periodically.
Sector supports multiple active-active masters for high availability.
Sector: Security
Sector uses a security server to maintain user accounts and IP access control for masters, slaves, and clients.
Control messages are encrypted (not completely finished in the current version).
Data transfer can optionally be encrypted.
The data transfer channel is set up by rendezvous; there is no listening server.
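To illustrate the rendezvous setup (no listening server), here is a sketch using UDT's rendezvous socket option; the port and peer address are hypothetical, and both peers would run the same logic with the address pair swapped.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <cstring>
#include <udt.h>

int main()
{
    UDT::startup();
    UDTSOCKET sock = UDT::socket(AF_INET, SOCK_STREAM, 0);

    bool rendezvous = true;                               // enable rendezvous mode
    UDT::setsockopt(sock, 0, UDT_RENDEZVOUS, &rendezvous, sizeof(bool));

    sockaddr_in local, peer;
    std::memset(&local, 0, sizeof(local));
    std::memset(&peer, 0, sizeof(peer));
    local.sin_family = AF_INET;
    peer.sin_family = AF_INET;
    local.sin_port = htons(9000);                         // both sides bind a known port
    local.sin_addr.s_addr = INADDR_ANY;
    peer.sin_port = htons(9000);
    inet_pton(AF_INET, "203.0.113.7", &peer.sin_addr);    // hypothetical peer address

    UDT::bind(sock, (sockaddr*)&local, sizeof(local));
    UDT::connect(sock, (sockaddr*)&peer, sizeof(peer));   // both peers call connect()

    UDT::close(sock);
    UDT::cleanup();
    return 0;
}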
Sector: Tools and API
Supported file system operations: ls, stat, mv, cp, mkdir, rm, upload, download. Wildcard characters are supported.
System monitoring: sysinfo.
C++ API: list, stat, move, copy, mkdir, remove, open, close, read, write, sysinfo.
FUSE (see the POSIX sketch below).
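Because Sector can be mounted through FUSE, ordinary POSIX I/O works against it; a minimal sketch, assuming a hypothetical mount point /mnt/sector and file path.

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    // Open a Sector file through the FUSE mount; the path is hypothetical.
    int fd = open("/mnt/sector/sdss/image-0001.dat", O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        std::fwrite(buf, 1, (size_t)n, stdout);   // stream the file to stdout

    close(fd);
    return 0;
}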
Sphere: Simplified Data Processing
Data-parallel applications.
Data is processed where it resides, or on the nearest possible node (locality).
The same user-defined functions (UDFs) are applied to all elements (records, blocks, or files).
Processing output can be written to Sector files or sent back to the client.
Generalized Map/Reduce.
Sphere: Simplified Data Processing
for each file F in (SDSS datasets)
    for each image I in F
        findBrownDwarf(I, …);
[Diagram: the Sphere client application splits the input stream into data segments, locates and schedules SPEs on the slaves, and collects the results into the output stream]
SphereStream sdss;
sdss.init("sdss files");
SphereProcess myproc;
myproc->run(sdss, "findBrownDwarf", …);
myproc->read(result);

findBrownDwarf(char* image, int isize, char* result, int rsize);
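For illustration, a shape-only sketch of a UDF matching the simplified signature above; the brown-dwarf detection logic is replaced by a placeholder (counting bright pixels), and the real Sphere interface passes additional metadata not shown here.

#include <cstdio>

int findBrownDwarf(char* image, int isize, char* result, int rsize)
{
    // Placeholder "detection": treat the input block as raw 8-bit pixels
    // and count the bright ones.
    int bright = 0;
    for (int i = 0; i < isize; ++i)
        if ((unsigned char)image[i] > 200)
            ++bright;

    // Write a small summary record into the output buffer supplied by Sphere.
    int written = std::snprintf(result, rsize, "bright_pixels=%d\n", bright);
    return (written > 0 && written < rsize) ? 0 : -1;     // 0 = success
}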
Sphere: Data Movement
Slave -> Slave (local)
Slave -> Slaves (Shuffle/Hash)
Slave -> Client
[Diagram: Stage 1 (shuffling): SPEs read segments n..n+m of the input stream and hash records into buckets 0..b of an intermediate stream; Stage 2 (sorting): SPEs sort each bucket to produce the output stream]
Sphere/UDF vs. MapReduce
Sphere:    Record Offset Index  | UDF | Hashing / Bucket | -       | UDF    | -
MapReduce: Parser / Input Reader | Map | Partition        | Compare | Reduce | Output Writer
Sphere/UDF vs. MapReduce
Sphere is more straightforward and flexible.
A UDF can be applied directly to records, blocks, files, and even directories.
Native binary data support.
Sorting is required by Reduce, but it is optional in Sphere.
Sphere uses a PUSH model for data movement, which is faster than the PULL model used by MapReduce.
Why Sector doesn’t Split Files?
Certain applications need to process a whole file or even a whole directory
Certain legacy applications need a file or a directory as input
Certain applications need multiple inputs, e.g., everything in a directory
In Hadoop, all blocks would have to be moved to one node for processing, hence no data locality benefit.
Load Balance
The number of data segments is much larger than the number of SPEs. When an SPE completes a data segment, a new segment is assigned to it.
Data transfer is balanced across the system to optimize network bandwidth usage.
Fault Tolerance
Map failure is recoverable: if one SPE fails, the data segment assigned to it will be re-assigned to another SPE and processed again.
Reduce failure is unrecoverable. In small to medium systems, machine failure during run time is rare; if necessary, developers can split the input into multiple sub-tasks to reduce the cost of a reduce failure.
Open Cloud Testbed
4 racks in Baltimore (JHU), Chicago (StarLight and UIC), and San Diego (Calit2).
10 Gb/s inter-site connection on CiscoWave.
2 Gb/s inter-rack connection.
Each node: two dual-core AMD CPUs, 12GB RAM, a single 1TB disk.
The testbed will be doubled by Sept. 2009.
Open Cloud Testbed
The TeraSort Benchmark
Data is split into small files, scattered across all slaves.
Stage 1: On each slave, an SPE scans the local files and sends each record to a bucket file on a remote node according to the key.
Stage 2: On each destination node, an SPE sorts all data inside each bucket.
TeraSort
[Diagram: each 100-byte record consists of a 10-byte key and a 90-byte value. Stage 1: hash each record into one of 1024 bucket files (Bucket-0 .. Bucket-1023) based on the first 10 bits of the key. Stage 2: sort each bucket on the local node.]
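A small sketch of the Stage 1 bucket computation described above: the record layout (10-byte key + 90-byte value) and the use of the first 10 bits of the key come from the slide; the I/O and bucket-file handling are omitted, and the sample key bytes are arbitrary.

#include <cstdio>

constexpr int RECORD_SIZE = 100;   // 10-byte key + 90-byte value
constexpr int NUM_BUCKETS = 1024;  // 2^10 buckets, selected by the first 10 bits of the key

// Return the bucket index (0..1023) for a record, using the first 10 bits of its key.
int bucket_of(const unsigned char* record)
{
    return ((record[0] << 2) | (record[1] >> 6)) & (NUM_BUCKETS - 1);
}

int main()
{
    unsigned char record[RECORD_SIZE] = {0};
    record[0] = 0xAB;                          // example key bytes
    record[1] = 0xCD;

    std::printf("record goes to Bucket-%d\n", bucket_of(record));
    return 0;
}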
Performance Results: TeraSort
Racks                          | Data Size | Sphere | Hadoop (3 replicas) | Hadoop (1 replica)
UIC                            | 300GB     | 1265   | 2889                | 2252
UIC + StarLight                | 600GB     | 1361   | 2896                | 2617
UIC + StarLight + Calit2       | 900GB     | 1430   | 4341                | 3069
UIC + StarLight + Calit2 + JHU | 1.2TB     | 1526   | 6675                | 3702

Run time in seconds. Sector v1.16 vs Hadoop 0.17.
Performance Results: TeraSort
Sorting 1.2TB on 120 nodes.
Sphere hash & local sort: 981 sec + 545 sec. Hadoop: 3702/6675 seconds.
Sphere hash: CPU 130%, MEM 900MB. Sphere local sort: CPU 80%, MEM 1.4GB.
Hadoop: CPU 150%, MEM 2GB.
The MalStone Benchmark
Drive-by problem: visit a web site and get compromised by malware.
MalStone-A: compute the infection ratio of each site.
MalStone-B: compute the infection ratio of each site from the beginning up to the end of every week.
http://code.google.com/p/malgen/
MalStone
[Diagram: each text record is transformed into a key/value pair with the site ID as key and the time as value, then hashed into one of 1000 bucket files (site-000X .. site-999X) using a 3-byte bucket index (000-999) derived from the site ID. Stage 1: process each record and hash it into buckets according to the site ID. Stage 2: compute the infection rate for each site.]

Text record format: Event ID | Timestamp | Site ID | Compromise Flag | Entity ID
Example: 00000000005000000043852268954353585368|2008-11-08 17:56:52.422640|3857268954353628599|1|000000497829
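As a sketch of the Stage 2 computation for MalStone-A, the following reads pipe-delimited text records in the format shown above and computes the per-site infection ratio (compromised visits divided by total visits); it is illustrative only, not the benchmark implementation, and the input is assumed to arrive on standard input.

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>

int main()
{
    // Each input line: Event ID | Timestamp | Site ID | Compromise Flag | Entity ID
    std::map<std::string, std::pair<long, long>> sites;   // site -> {compromised, total}

    std::string line;
    while (std::getline(std::cin, line)) {
        std::istringstream rec(line);
        std::string event_id, timestamp, site_id, flag, entity_id;
        std::getline(rec, event_id, '|');
        std::getline(rec, timestamp, '|');
        std::getline(rec, site_id, '|');
        std::getline(rec, flag, '|');
        std::getline(rec, entity_id, '|');

        std::pair<long, long>& counts = sites[site_id];
        counts.second += 1;                    // total visits to this site
        if (flag == "1") counts.first += 1;    // visits that led to a compromise
    }

    for (const auto& s : sites)
        std::cout << s.first << " infection ratio = "
                  << double(s.second.first) / double(s.second.second) << "\n";
    return 0;
}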
Performance Results: MalStone
                        | MalStone-A | MalStone-B
Hadoop                  | 454m 13s   | 840m 50s
Hadoop Streaming/Python | 87m 29s    | 142m 32s
Sector/Sphere           | 33m 40s    | 43m 44s

Processing 10 billion records on 20 OCT (Open Cloud Testbed) nodes (local).
* Courtesy of Collin Bennet and Jonathan Seidman of Open Data Group.
System Monitoring (Testbed)
System Monitoring (Sector/Sphere)
For More Information
Sector/Sphere code & docs: http://sector.sf.net
Open Cloud Consortium: http://www.opencloudconsortium.org
NCDM: http://www.ncdm.uic.edu