Tera/Petabyte data distribution architectures

Tera/Petabyte data distribution architectures. Chris A. Mattmann. USC-CSE Annual Research Review. Tuesday, June 28, 2022

description

Tera/Petabyte data distribution architectures. Chris A. Mattmann USC-CSE Annual Research Review Sunday, October 26, 2014. Outline. Research Problem and Importance Background and Related Work Problem Statement Approach Evaluation Strategy Conclusions. Research Problem and Importance. - PowerPoint PPT Presentation

Transcript of Tera/Petabyte data distribution architectures

Page 1: Tera/Petabyte data distribution architectures

Tera/Petabyte data distribution architectures

Chris A. Mattmann
USC-CSE Annual Research Review
Monday, April 24, 2023

Page 2: Tera/Petabyte data distribution architectures

Apr 24, 2023 MATTMANN-ARR CAM-2

Outline

Research Problem and Importance
Background and Related Work
Problem Statement
Approach
Evaluation Strategy
Conclusions

Page 3: Tera/Petabyte data distribution architectures

Research Problem and Importance

Volume of data returned from scientific experiments and media content providers growing rapidly
– Planetary Data System
  – Current: 20 terabytes for all NASA missions
  – Growing to: over 200 terabytes from a single mission!
– Orbiting Carbon Observatory
  – Current: hundreds of gigabytes to a single terabyte
  – Growing to: over 150 terabytes!

[Figure: PDS Archive Volume Growth*. x-axis: Year (1990 to 2008); y-axis: accumulated volume in TBytes (0 to 90).]

* Projected as of 1/11/04

Page 4: Tera/Petabyte data distribution architectures

Research Problem and Importance

National Cancer Institute’s Early Detection Research Network (EDRN)
– Current: tens of gigabytes to hundreds of gigabytes
– Growing to: hundreds of gigabytes to terabytes
Question: how to distribute these voluminous data sets?

Page 5: Tera/Petabyte data distribution architectures

Distributing Large Volumes of Data

Use existing infrastructure?
HTTP/REST?
Issues:
– Scalability?
– Single entry point?
– Limited bandwidth?
– What about other distribution mechanisms?

[Figure: Planetary Data System with candidate distribution mechanisms. Labels: HTTP/REST, RMI, SOAP, GridFTP.]

Page 6: Tera/Petabyte data distribution architectures

Distributing Large Volumes of Data

Few data movement mechanisms in place for scientists, students, educators, etc. to get their data
– EDRN: HTTP/REST
– National Space Science Data Archive: FTP
– Physical Oceanography Data Active Archive Center: FTP, and Aspera commercial UDP technology
– Even Google: HTTP/REST, SOAP
Even when there are many mechanisms in place, how do we select the correct one?
Sometimes, we may even need to use them in concert
– Certain users may only be able to get data from GridFTP, while others may require HTTP/REST
– HTTP combined with a UDP based mechanism may speed up the transfer

Page 7: Tera/Petabyte data distribution architectures

Distributing Large Volumes of Data

Understanding the Tradeoffs
– HTTP/REST isn’t all bad: it’s pervasive, it’s ubiquitous, it’s a standard
  – It’s good in many situations, but not all situations
– Same goes for many of the other distribution mechanisms
  – RMI is scalable, but ties you to Java
  – Peer-to-Peer is highly scalable and efficient, but may neglect dependability and consistency
Understanding how many different data movement technologies there are:
– GridFTP, Aspera software, HTTP/REST, RMI, CORBA, SOAP, XML-RPC, Bittorrent, JXTA, UFTP, FTP, SFTP, SCP, Siena, GLIDE/PRISM-MW
– …and that’s just off the top of my head!
Understanding the classes of data movement technologies

Page 8: Tera/Petabyte data distribution architectures

Software Architecture

The definition of a system in the form of its canonical building blocks
– Software Components: the computational units in the system
– Software Connectors: the communications and interactions between software components
– Software Configurations: arrangements of components and connectors and the rules that guide their composition
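The three building blocks above can be pictured as a small data model. The following is only a sketch under stated assumptions: the class names, the `attach` method, and the single composition rule are illustrative and do not come from the slides or from OODT.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Component:
    """A computational unit in the system."""
    name: str


@dataclass(frozen=True)
class Connector:
    """A communication or interaction path between two components."""
    name: str
    source: Component
    target: Component


@dataclass
class Configuration:
    """An arrangement of components and connectors plus a composition rule."""
    components: set = field(default_factory=set)
    connectors: list = field(default_factory=list)

    def attach(self, connector: Connector) -> None:
        # Illustrative composition rule (an assumption): a connector may only
        # link components that already belong to this configuration.
        for end in (connector.source, connector.target):
            if end not in self.components:
                raise ValueError(f"{end.name} is not in the configuration")
        self.connectors.append(connector)


# Hypothetical example: a product client talking to a product server over HTTP/REST.
client = Component("product-client")
server = Component("product-server")
config = Configuration(components={client, server})
config.attach(Connector("http-rest", client, server))
```

Keeping the connector as its own object, rather than burying it inside a component, is what makes it possible to swap one distribution mechanism for another without touching the components.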

Page 9: Tera/Petabyte data distribution architectures

A Software Architectural View of the Data Distribution Problem

…Understanding the architectures of existing data systems

[Figure: architecture diagram of the Planetary Data System. Hosts: user host, pds.jpl.nasa.gov (Linux), themis.asu.edu (Linux), another.pds.server (AnotherOS). Labels: PDS Portal (Query Client), Profile Client, Product Client, PDS Query Server, messaging layer (HTTP), Dataset/Product Catalog, Profile Server, Product Server, PDS Profile Handler, PDS Profile Handler (Oracle), PDS Query Handler, DBMS (PDS Metadata), DB/Filesys/Data System (PDS metadata), Filesystem (PDS Products), Catalog and Archive Server, OODT “Sandbox”, Other Applications. Legend: OODT Component, OODT Connector, Data/metadata store, Hardware host, OODT controlled portion of machine, Black Box, data/control flow.]

Page 10: Tera/Petabyte data distribution architectures

A Software Architectural View of the Data Distribution Problem

…Deciding the appropriate software connectors for data distribution (and their combinations) to use

[Figure: Planetary Data System and candidate connectors. Labels: GridFTP, HTTP/REST, SOAP, Bittorrent, …, Connector N.]

Page 11: Tera/Petabyte data distribution architectures

A Software Architectural View of the Data Distribution Problem

…Satisfying specified user scenarios for data distribution
– 20 users, 5 user types, 10 terabytes of data, users located at JPL, ESA, Goddard, 2 delivery intervals...
– 100 users, 1 user type, 1 terabyte of data, users located at ISI, 1 delivery interval...
– 10,000 users, 20 user types, 200 terabytes of data, users located in North America, South America and Europe, 25 delivery intervals...
– 250 users, 2 user types, 100 gigabytes of data, users located at JPL, Goddard, 10 delivery intervals...

Page 12: Tera/Petabyte data distribution architectures

A Software Architectural View of the Data Distribution Problem

…Making these people happy!

Page 13: Tera/Petabyte data distribution architectures

Research Question

What types of software connectors are best suited for delivering these huge amounts of data to the users, that satisfy their particular scenarios, in a manner that is performant and scalable, in these hugely distributed data systems?

Page 14: Tera/Petabyte data distribution architectures

Problem Statement

Identifying and selecting suitable software connectors for data distribution* that satisfy user specified constraints
Use eight key dimensions of data distribution
– Literature review
– Our own experience in the context of planetary science and cancer research at JPL
– User specified constraints on eight dimensions are data distribution scenarios
Identification of four basic distribution connector classes
– RPC, P2P, Grid, Event-based
– What classes are appropriate for which distribution scenarios?

* Referred to as “distribution connectors” or “data distribution connectors”
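One way to picture the question "which classes are appropriate for which scenarios" is as a filter over per-class profiles. The sketch below is hypothetical and is not the selection method of the proposed framework. The only characterizations taken from the slides are that Peer-to-Peer is scalable and efficient but may neglect dependability and consistency, and that RMI is scalable; treating RMI as representative of the RPC class is an assumption, and the Grid and Event-based profiles are deliberately left empty because the slides do not characterize them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectorClassProfile:
    """Hypothetical profile of one distribution connector class."""
    name: str
    strengths: frozenset = frozenset()
    weaknesses: frozenset = frozenset()


PROFILES = [
    # Assumption: RMI's scalability is taken as representative of the RPC class.
    ConnectorClassProfile("RPC", strengths=frozenset({"scalability"})),
    # From the tradeoffs slide: scalable and efficient, may neglect
    # dependability and consistency.
    ConnectorClassProfile(
        "P2P",
        strengths=frozenset({"scalability", "efficiency"}),
        weaknesses=frozenset({"dependability", "consistency"}),
    ),
    # Not characterized in the slides, so left empty on purpose.
    ConnectorClassProfile("Grid"),
    ConnectorClassProfile("Event-based"),
]


def candidate_classes(required, profiles=PROFILES):
    """Keep every class whose known weaknesses do not clash with the
    performance requirements of the scenario."""
    required = frozenset(required)
    return [p.name for p in profiles if not (p.weaknesses & required)]


print(candidate_classes({"consistency"}))
# ['RPC', 'Grid', 'Event-based']
```

An empty profile is never excluded, so this filter only rules classes out; ranking the survivors would need measured data of the kind the evaluation strategy describes.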

Page 15: Tera/Petabyte data distribution architectures

Eight Dimensions of Data Distribution

[Figure: taxonomy diagram rooted at “Data Distribution”. Labels: Delivery Schedule; Number of Intervals; Volume Per Interval; Timing of Interval; Performance Requirements; Consistency; Scalability; Dependability; Efficiency; Access Policies; Producers; Consumers; Automatic; Initiated; Geographic Distribution; WAN; LAN; Number of Data Types; Types of Data; Data; Metadata; Total Volume; Number of Users; Number of User Types.]

Page 16: Tera/Petabyte data distribution architectures

Eight Dimensions of Data Distribution

Total Volume - the total amount of data that needs to be transferred from providers of data to consumers of data.

Number of Delivery Intervals - the number, size and frequency (timing) of intervals that the volume of data should be delivered within.

Performance Requirements - any constraints and requirements on the scalability, efficiency, consistency, and dependability of the distribution scenario.

Number of Users - the number of unique users that the data volume needs to be delivered to.

Number of User Types - the number of unique user types, such as scientists or students, that the data volume needs to be delivered to.

Data Types - the number of different data types that are part of the total volume to be delivered.

Geographic Distribution - the geographic distribution of the data providers and consumers.

Access Policies - the number and types of access policies in place at each producer and consumer of data.
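A distribution scenario is a set of user specified constraints on these eight dimensions, so it can be written down as a plain record. The following is a minimal sketch: the field names, the decimal terabyte constant, and the even split of volume across delivery intervals are all assumptions made for illustration. The sample values are one of the example scenarios from earlier in the deck (20 users, 5 user types, 10 terabytes, users at JPL, ESA and Goddard, 2 delivery intervals).

```python
from dataclasses import dataclass, field

TB = 10**12  # decimal terabyte; an assumption, the slides do not fix a convention


@dataclass
class DistributionScenario:
    """Hypothetical record of constraints on the eight dimensions."""
    total_volume_bytes: int
    delivery_intervals: int
    performance_requirements: set = field(default_factory=set)
    num_users: int = 1
    num_user_types: int = 1
    num_data_types: int = 1
    geographic_distribution: tuple = ()
    access_policies: tuple = ()

    def volume_per_interval(self) -> float:
        # Assumption: the total volume is split evenly across the intervals.
        if self.delivery_intervals < 1:
            raise ValueError("a scenario needs at least one delivery interval")
        return self.total_volume_bytes / self.delivery_intervals


scenario = DistributionScenario(
    total_volume_bytes=10 * TB,
    delivery_intervals=2,
    num_users=20,
    num_user_types=5,
    geographic_distribution=("JPL", "ESA", "Goddard"),
)
print(scenario.volume_per_interval() / TB)  # 5.0
```

The performance requirements and access policies are left empty here because the example scenario on the slide does not state them.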

Page 17: Tera/Petabyte data distribution architectures

Approach

[Figure: overview of the approach. Labels: Connector Classes (GRID, P2P, RPC); Categorization Framework; Classifier; Categorizer; Software Framework; Distribution Scenarios; Selector; Integrator; Preference Evaluator; User Preferences; User Inputs; Combined Connector; Testing Framework; Performance Evaluator; Satisfaction of Distribution Scenario?; Framework Output; DCPs. Phases: Classification, Categorization, Integration, Testing/Evaluation.]

Page 18: Tera/Petabyte data distribution architectures

Evaluation Strategy

Empirical evaluation using real world systems
– NASA Planetary Data System
– NASA Orbiting Carbon Observatory Mission
– National Cancer Institute’s Early Detection Research Network
Quantifiably measure
– consistency (data delivered is data sent)
– efficiency (memory footprint and data throughput)
– scalability (data volume and number of hosts)
– dependability (uptime, number of faults)
Compare to off-the-shelf connector solutions
– OODT, GridFTP, Aspera, UFTP, Bittorrent, possibly more
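Two of the measures listed above have simple operational forms. As a sketch only, and not the measurement harness used in the study: consistency ("data delivered is data sent") can be checked by comparing digests of the sent and received bytes, and data throughput is bytes moved divided by elapsed time. The function names and the decimal megabyte unit are assumptions.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Hex digest used as a fingerprint of a payload."""
    return hashlib.sha256(data).hexdigest()


def is_consistent(sent: bytes, received: bytes) -> bool:
    """Consistency check: the data delivered is the data sent."""
    return sha256_of(sent) == sha256_of(received)


def throughput_mb_per_s(num_bytes: int, seconds: float) -> float:
    """Data throughput in decimal megabytes per second."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return num_bytes / 1_000_000 / seconds


payload = b"planetary data product"
print(is_consistent(payload, payload))             # True
print(is_consistent(payload, payload[:-1]))        # False
print(throughput_mb_per_s(500_000_000, 10.0))      # 50.0
```

For real multi-gigabyte products the digest would be computed incrementally over file chunks rather than over one in-memory byte string.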

Page 19: Tera/Petabyte data distribution architectures

Current Progress

Preliminary Study with NASA’s Planetary Data System
Classified and Compared Data Movement Technologies
– Parallel TCP/IP technologies
  – GridFTP, bbFTP
– UDP bursting technologies
  – Aspera, UFTP
– Baseline technologies
  – SCP, FTP, HTTP

Page 20: Tera/Petabyte data distribution architectures

Experimental Results

Classified and Evaluated each technology against data distribution dimensions
Measured transfer rate
– LAN-based
– WAN-based
– Varied dataset sizes from 10s of MBs to 10s of GBs
Ease to operate, ease to install
UDP technologies not testable on WAN (firewall, security, ease to configure)

* GridFTP (blue), bbFTP (red), FTP (green)

Page 21: Tera/Petabyte data distribution architectures

Conclusions

Proposed approach for classifying, selecting and evaluating different software connectors for data distribution
Preliminary results suggest parallel TCP/IP technologies beneficial in real world system (PDS)
Currently formalizing connector metadata and developing connector XML profiles

Page 22: Tera/Petabyte data distribution architectures

Questions?
Thanks for your attention!

Page 23: Tera/Petabyte data distribution architectures

Backup

Page 24: Tera/Petabyte data distribution architectures

Refereed Papers

C. Mattmann, S. Kelly, D. Crichton, S. Hughes, S. Hardman, P. Ramirez and R. Joyner. A Classification and Evaluation of Data Movement Technologies for the Delivery of Highly Voluminous Scientific Data Products. In Proceedings of NASA/IEEE Conference on Mass Storage Systems and Technologies, May 2006.

C. Mattmann, D. Crichton, N. Medvidovic and S. Hughes. A Software Architecture-Based Framework for Highly Distributed and Data Intensive Scientific Applications. In Proceedings of ICSE, Shanghai, China, May 20th-28th, 2006.

N. Medvidovic and C. Mattmann. The GridLite DREAM: Bringing the Grid to Your Pocket. In Proceedings of the Monterey Workshop on Networked Systems, Irvine, CA, September, 2005.

C. Mattmann, N. Medvidovic, P. Ramirez and V. Jakobac. Unlocking the Grid. In Proceedings of the 8th ACM SIGSOFT International Symposium on Component-based Software Engineering (CBSE8), pp. 322-336. LNCS 3489, St. Louis, Missouri, May 14th-15th, 2005.

C. Mattmann, S. Malek, N. Beckman, M. Mikic-Rakic, N. Medvidovic and D. Crichton. GLIDE: A Grid-based, Lightweight, Infrastructure for Data-intensive Environments. In Proceedings of the European Grid Conference (EGC2005), pp. 68-77. LNCS 3470, Amsterdam, The Netherlands, February 14-16, 2005.

Page 25: Tera/Petabyte data distribution architectures

Refereed Papers

J. Steven Hughes, D. Crichton, S. Kelly, C. Mattmann, R. Joyner, J. Wilf and J. Crichton. A Planetary Data System for the 2006 Mars Reconnaissance Orbiter Era and Beyond. In Proceedings of the 2nd ESA Symposium on Ensuring the Long Term Preservation and Adding Value to Scientific and Technical Data (PV-2004). Frascati, Italy, October 5-7, 2004.

C. Mattmann, D. Crichton, J.S. Hughes, S. Kelly and P. Ramirez. Software Architecture for Large scale, Distributed, Data-Intensive Systems. In Proceedings of the 4th IEEE/IFIP Working Conference on Software Architecture (WICSA-4), pp. 255-264. Oslo, Norway, June 12th-15th, 2004.

C. Mattmann, P. Ramirez, D. Crichton and J.S. Hughes. Packaging Data Products using Data Grid Middleware for Deep Space Mission Systems. In Proceedings of the 8th International Conference on Space Operations (Spaceops-2004), AIAA Press. Montreal, Canada, May 2004.