Project Matsu: Elastic Clouds for Disaster Relief


Description

This is a talk I gave at OGF 29 in Chicago on June 21, 2010.

Transcript of Project Matsu: Elastic Clouds for Disaster Relief

Page 1: Project Matsu: Elastic Clouds for Disaster Relief

Project Matsu: Large Scale On-Demand Image Processing for Disaster Relief

Collin Bennett, Robert Grossman, Yunhong Gu, and Andrew Levine

Open Cloud Consortium
June 21, 2010

www.opencloudconsortium.org

Page 2: Project Matsu: Elastic Clouds for Disaster Relief

Project Matsu Goals

• Provide persistent data resources and elastic computing to assist in disasters:

– Make imagery available for disaster relief workers

– Elastic computing for large scale image processing

– Change detection for temporally different and geospatially identical image sets

• Provide a resource to test standards and conduct interoperability studies for large data clouds

Page 3: Project Matsu: Elastic Clouds for Disaster Relief

Part 1: Open Cloud Consortium

Page 4: Project Matsu: Elastic Clouds for Disaster Relief

• 501(c)(3) not-for-profit corporation

• Supports the development of standards, interoperability frameworks, and reference implementations.

• Manages testbeds: Open Cloud Testbed and Intercloud Testbed.

• Manages cloud computing infrastructure to support scientific research: Open Science Data Cloud.

• Develops benchmarks.

www.opencloudconsortium.org

Page 5: Project Matsu: Elastic Clouds for Disaster Relief

OCC Members

• Companies: Aerospace, Booz Allen Hamilton, Cisco, InfoBlox, Open Data Group, Raytheon, Yahoo

• Universities: CalIT2, Johns Hopkins, Northwestern Univ., University of Illinois at Chicago, University of Chicago

• Government agencies: NASA

• Open source projects: Sector Project


Page 6: Project Matsu: Elastic Clouds for Disaster Relief

Operates Clouds

Infrastructure:

• 500 nodes
• 3,000 cores
• 1.5+ PB
• Four data centers
• 10 Gbps
• Target to refresh 1/3 of the hardware each year

Clouds:

• Open Cloud Testbed
• Open Science Data Cloud
• Intercloud Testbed
• Project Matsu: Cloud-based Disaster Relief Services

Page 7: Project Matsu: Elastic Clouds for Disaster Relief

Open Science Data Cloud


• Astronomical data
• Biological data (Bionimbus)
• Networking data
• Image processing for disaster relief

Page 8: Project Matsu: Elastic Clouds for Disaster Relief

Focus of OCC Large Data Cloud Working Group


[Figure: the layered large data cloud stack. Applications sit on top of Cloud Storage Services; Cloud Compute Services (MapReduce, UDF, and other programming frameworks); Table-based Data Services; and Relational-like Data Services.]

• Developing APIs for this framework.

Page 9: Project Matsu: Elastic Clouds for Disaster Relief

Tools and Standards

• Apache Hadoop/MapReduce

• Sector/Sphere large data cloud

• Open Geospatial Consortium

– Web Map Service (WMS)

• OCC tools are open source (matsu-project)

– http://code.google.com/p/matsu-project/
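
For context, a WMS GetMap call is a plain HTTP GET whose parameters name the layer, bounding box, and output format. A hypothetical request against a Matsu-style WMS endpoint might look like the following (the host and layer name are illustrative, not the project's actual ones):

    http://wms.example.org/wms?SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap
        &LAYERS=haiti_delta&STYLES=
        &SRS=EPSG:4326&BBOX=-135.0,45.0,-112.5,67.5
        &WIDTH=512&HEIGHT=512&FORMAT=image/png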

Page 10: Project Matsu: Elastic Clouds for Disaster Relief

Part 2: Technical Approach

• Hadoop – Lead: Andrew Levine

• Hadoop with Python Streams – Lead: Collin Bennett

• Sector/Sphere – Lead: Yunhong Gu

Page 11: Project Matsu: Elastic Clouds for Disaster Relief

Implementation 1: Hadoop & MapReduce

Andrew Levine

Page 12: Project Matsu: Elastic Clouds for Disaster Relief

Image Processing in the Cloud - Mapper

Step 1: Input to Mapper

• Mapper Input Key: Bounding Box
• Mapper Input Value: the image

Step 2: Processing in Mapper

• The mapper resizes and/or cuts up the original image into pieces, emitting one output bounding box per piece, e.g. (minx = -135.0, miny = 45.0, maxx = -112.5, maxy = 67.5)

Step 3: Mapper Output

• Mapper Output Key: Bounding Box (one per piece)
• Mapper Output Value: the image piece + timestamp
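As a concrete illustration of these three steps, here is a minimal Python sketch of the tiling logic. It is not the project's actual code: Pillow is assumed for image handling, and the map_image helper and the n x n grid are illustrative.

    # A minimal sketch of the mapper described above (illustrative,
    # not the project's code). Cuts an image into an n x n grid and
    # emits (sub-bounding-box, (tile bytes, timestamp)) pairs.
    from io import BytesIO
    from PIL import Image

    def map_image(bbox, image_bytes, timestamp, n=2):
        minx, miny, maxx, maxy = bbox
        dx, dy = (maxx - minx) / n, (maxy - miny) / n
        img = Image.open(BytesIO(image_bytes))
        tw, th = img.size[0] // n, img.size[1] // n
        for i in range(n):          # column (west to east)
            for j in range(n):      # row (south to north)
                # Pixel rows count down from the top, latitude counts up,
                # so row j from the bottom spans rows (n-1-j)*th .. (n-j)*th.
                tile = img.crop((i * tw, (n - 1 - j) * th,
                                 (i + 1) * tw, (n - j) * th))
                sub_bbox = (minx + i * dx, miny + j * dy,
                            minx + (i + 1) * dx, miny + (j + 1) * dy)
                buf = BytesIO()
                tile.save(buf, format="PNG")
                yield sub_bbox, (buf.getvalue(), timestamp)

The real job would also resize tiles to fixed zoom levels for WMS; that step is omitted here.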

Page 13: Project Matsu: Elastic Clouds for Disaster Relief

Image Processing in the Cloud - Reducer

Step 1: Input to Reducer

• Reducer Input Key: Bounding Box, e.g. (minx = -45.0, miny = -2.8125, maxx = -43.59375, maxy = -2.109375)
• Reducer Input Value: the image pieces for that bounding box, one per timestamp

Step 2: Process difference in Reducer

• Assemble the images based on timestamps and compare them; the result is a delta of the two images.

Step 3: Reducer Output

• All images go to different map layers for display in WMS: the Timestamp 1 set, the Timestamp 2 set, and the Delta set.
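A matching sketch of the reducer's change detection, under the same assumptions (Pillow, exactly two timestamps per bounding box, equally sized tiles); ImageChops.difference gives the per-pixel delta:

    # A minimal sketch of the reducer described above (illustrative,
    # not the project's code).
    from io import BytesIO
    from PIL import Image, ImageChops

    def reduce_bbox(bbox, values):
        # values: iterable of (tile bytes, timestamp) pairs for one bbox.
        (old_bytes, t1), (new_bytes, t2) = sorted(values, key=lambda v: v[1])
        old = Image.open(BytesIO(old_bytes)).convert("RGB")
        new = Image.open(BytesIO(new_bytes)).convert("RGB")
        delta = ImageChops.difference(old, new)   # per-pixel |new - old|
        buf = BytesIO()
        delta.save(buf, format="PNG")
        # Three outputs, one per WMS map layer.
        yield bbox, ("timestamp1", old_bytes)
        yield bbox, ("timestamp2", new_bytes)
        yield bbox, ("delta", buf.getvalue())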

Page 14: Project Matsu: Elastic Clouds for Disaster Relief

Implementation 2: Hadoop & Python Streams

Collin Bennett

Page 15: Project Matsu: Elastic Clouds for Disaster Relief

Preprocessing Step

• All images (in a batch to be processed) are combined into a single file.

• Each line contains the image's byte array transformed to pixels (raw bytes don't seem to work well with the one-line-at-a-time Hadoop streaming paradigm).

geolocation \t timestamp | tuple size ; image width ; image height ; comma-separated list of pixels

The metadata fields (tuple size, image width, and image height) are needed to process the image in the reducer.
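A minimal sketch of this serialization, assuming Pillow; the file names and the image_to_record helper are illustrative, not the project's code:

    # Illustrative sketch of the preprocessing step. Flattens each
    # image into one text line in the record format above.
    from PIL import Image

    def image_to_record(path, geolocation, timestamp):
        img = Image.open(path).convert("RGB")
        width, height = img.size
        tuple_size = 3   # RGB components per pixel
        pixels = ",".join(str(c) for px in img.getdata() for c in px)
        return "%s\t%s|%d;%d;%d;%s" % (
            geolocation, timestamp, tuple_size, width, height, pixels)

    # One combined file for the whole batch, one line per image.
    with open("batch_input.txt", "w") as out:
        for path, geo, ts in [("tile1.png", "-135.0,45.0", "20100112")]:
            out.write(image_to_record(path, geo, ts) + "\n")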

Page 16: Project Matsu: Elastic Clouds for Disaster Relief

Map and Shuffle

• We can use the identity mapper; all of the work for mapping was done in the preprocessing step.

• The map/shuffle key is the geolocation.

• In the reducer, the timestamp will be the 1st field of each record when splitting on '|'.
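To make the record handling concrete, here is a hedged sketch of the reducer side of the stream (illustrative, not the project's code). Hadoop streaming delivers lines sorted by key, so grouping on the geolocation is a single scan; the parsing follows the record format from the preprocessing slide:

    #!/usr/bin/env python
    # Illustrative reducer skeleton for the streaming job. Streaming sorts
    # lines by key, so records sharing a geolocation arrive consecutively.
    import sys
    from itertools import groupby

    def parse(line):
        geolocation, value = line.rstrip("\n").split("\t", 1)
        timestamp, payload = value.split("|", 1)  # timestamp is the 1st field
        return geolocation, timestamp, payload

    records = (parse(line) for line in sys.stdin)
    for geo, group in groupby(records, key=lambda r: r[0]):
        by_time = sorted(group, key=lambda r: r[1])
        # ... decode the pixel payloads and compute the image delta here ...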

Page 17: Project Matsu: Elastic Clouds for Disaster Relief

Implementation 3: Sector/Sphere

Yunhong Gu

Page 18: Project Matsu: Elastic Clouds for Disaster Relief

Sector Distributed File System

• Sector aggregates hard disk storage across commodity computers

– Single namespace; file-system-level reliability (using replication); high availability

• Sector does not split files

– A single image will not be split, so when it is being processed the application does not need to read data from other nodes over the network

– As an option, a directory can also be kept together on a single node

Page 19: Project Matsu: Elastic Clouds for Disaster Relief

Sphere UDF

• Sphere allows a User Defined Function (UDF) to be applied to each file (whether it holds a single image or multiple images)

• Existing applications can be wrapped in a Sphere UDF

• In many situations, the Sphere streaming utility accepts a data directory and an application binary as inputs:

• ./stream -i haiti -c ossim_foo -o results
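As a conceptual illustration only: wrapping an existing application amounts to applying it file by file over a directory, which is what the stream command above arranges across the cluster. This sketch mimics that behavior locally; it is not Sphere's API, and the assumption that ossim_foo takes an input path and an output path is hypothetical.

    # Conceptual sketch, not Sphere's API: apply an existing binary to
    # every file in an input directory, as ./stream does for a Sphere job.
    import subprocess
    from pathlib import Path

    def run_over_directory(binary, in_dir, out_dir):
        Path(out_dir).mkdir(exist_ok=True)
        for f in sorted(Path(in_dir).iterdir()):
            if f.is_file():
                # Sector keeps each file whole on one node, so the wrapped
                # application reads its input locally.
                subprocess.run([binary, str(f), str(Path(out_dir) / f.name)],
                               check=True)

    run_over_directory("ossim_foo", "haiti", "results")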

Page 20: Project Matsu: Elastic Clouds for Disaster Relief

For More Information

[email protected]