Tera/Petabyte data distribution architectures
Chris A. Mattmann
USC-CSE Annual Research Review
Monday, April 24, 2023
Apr 24, 2023 MATTMANN-ARR CAM-2
Outline
– Research Problem and Importance
– Background and Related Work
– Problem Statement
– Approach
– Evaluation Strategy
– Conclusions
Research Problem and Importance
– Volume of data returned from scientific experiments and media content providers growing rapidly
  – Planetary Data System
    – Current: 20 terabytes for all NASA missions
    – Growing to: over 200 terabytes from a single mission!
  – Orbiting Carbon Observatory
    – Current: hundreds of gigabytes to a single terabyte
    – Growing to: over 150 terabytes!
[Figure: PDS Archive Volume Growth*. x-axis: Year (1990 to 2008); y-axis: TB (accumulated), 0 to 90. * Projected as of 1/11/04]
Research Problem and Importance
– National Cancer Institute’s Early Detection Research Network (EDRN)
  – Current: tens of gigabytes to hundreds of gigabytes
  – Growing to: hundreds of gigabytes to terabytes
– Question: how to distribute these voluminous data sets?
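To give a feel for the scale behind this question, here is a back-of-the-envelope sketch. The volumes are the ones quoted on the slides above; the link speeds (100 Mbit/s and 1 Gbit/s), the decimal terabyte, and the idea of an ideal, fully used link with no protocol overhead are assumptions made only for illustration.

```python
# Rough transfer-time estimate. Link speeds and the ideal-link model are
# illustrative assumptions, not measurements from the talk.
TB = 10**12  # bytes in one decimal terabyte (assumption)


def transfer_days(volume_bytes: float, link_bits_per_s: float) -> float:
    """Days needed to move volume_bytes over an ideal, fully used link."""
    seconds = volume_bytes * 8 / link_bits_per_s
    return seconds / 86_400


volumes = {
    "PDS, all missions today (20 TB)": 20 * TB,
    "OCO projected (150 TB)": 150 * TB,
    "PDS single mission projected (200 TB)": 200 * TB,
}
links = {"100 Mbit/s": 100e6, "1 Gbit/s": 1e9}

for name, volume in volumes.items():
    for label, rate in links.items():
        print(f"{name} over {label}: {transfer_days(volume, rate):.1f} days")
```

Under these assumptions, a single copy of 200 terabytes takes about 185 days at 100 Mbit/s and about 18.5 days at 1 Gbit/s, before multiplying by the number of consumers.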
Distributing Large Volumes of Data
– Use existing infrastructure?
– HTTP/REST?
– Issues:
  – Scalability?
  – Single entry point?
  – Limited bandwidth?
  – What about other distribution mechanisms?
[Figure: Planetary Data System with candidate distribution mechanisms: HTTP/REST, RMI, SOAP, GridFTP]
Distributing Large Volumes of Data
– Few data movement mechanisms in place for scientists, students, educators, etc. to get their data
  – EDRN: HTTP/REST
  – National Space Science Data Archive: FTP
  – Physical Oceanography Data Active Archive Center: FTP, and Aspera commercial UDP technology
  – Even Google: HTTP/REST, SOAP
– Even when there are many mechanisms in place, how do we select the correct one?
– Sometimes, we may even need to use them in concert
  – Certain users may only be able to get data from GridFTP, while others may require HTTP/REST
  – HTTP combined with a UDP-based mechanism may speed up the transfer
Distributing Large Volumes of Data
– Understanding the Tradeoffs
  – HTTP/REST isn’t all bad: it’s pervasive, it’s ubiquitous, it’s a standard
    – It’s good in many situations, but not all situations
  – Same goes for many of the other distribution mechanisms
    – RMI is scalable, but ties you to Java
    – Peer-to-Peer is highly scalable and efficient, but may neglect dependability and consistency
– Understanding how many different data movement technologies there are:
  – GridFTP, Aspera software, HTTP/REST, RMI, CORBA, SOAP, XML-RPC, Bittorrent, JXTA, UFTP, FTP, SFTP, SCP, Siena, GLIDE/PRISM-MW
  – …and that’s just off the top of my head!
– Understanding the classes of data movement technologies
Software Architecture
– The definition of a system in the form of its canonical building blocks
  – Software Components: the computational units in the system
  – Software Connectors: the communications and interactions between software components
  – Software Configurations: arrangements of components and connectors and the rules that guide their composition
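The three building blocks above can be sketched as plain data types. This is a minimal illustration, not the notation used in the talk: the class names, the `attach` method, and the single composition rule (both ends of a link must already belong to the configuration) are assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Component:
    """A computational unit in the system."""
    name: str


@dataclass(frozen=True)
class Connector:
    """A communication or interaction mechanism between components."""
    name: str


@dataclass
class Configuration:
    """An arrangement of components and connectors plus one composition rule."""
    components: Set[Component] = field(default_factory=set)
    # Each link is (source component, connector, target component).
    links: List[Tuple[Component, Connector, Component]] = field(default_factory=list)

    def attach(self, source: Component, via: Connector, target: Component) -> None:
        # Composition rule (an assumption for this sketch): both ends of a
        # link must already be part of the configuration.
        if source not in self.components or target not in self.components:
            raise ValueError("both components must be added before linking")
        self.links.append((source, via, target))


# Hypothetical two-component system joined by one connector.
client = Component("product client")
server = Component("product server")
config = Configuration(components={client, server})
config.attach(client, Connector("HTTP/REST"), server)
print([(s.name, c.name, t.name) for s, c, t in config.links])
```

Treating the connector as a value of its own, rather than burying it in either component, is what lets it be swapped (say, HTTP/REST for GridFTP) without touching the components.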
A Software Architectural View of the Data Distribution Problem
– …Understanding the architectures of existing data systems
[Figure: OODT-based architecture of the Planetary Data System. Hosts shown: pds.jpl.nasa.gov (Linux), themis.asu.edu (Linux), another.pds.server (AnotherOS), and a user host. Elements shown: PDS Portal (Query Client), Profile Client, Product Client, PDS Query Server, Profile Servers and Product Servers inside OODT “Sandboxes”, PDS Profile Handlers (one on Oracle), PDS Query Handlers, Dataset/Product Catalog, DBMS (PDS Metadata), DB/Filesys/Data System (PDS metadata), Filesystem (PDS Products), Catalog and Archive Server, Other Applications, and a messaging layer (HTTP). Legend: OODT Component, OODT Connector, Data/metadata store, Hardware host, OODT controlled portion of machine, Black Box, data/control flow.]
A Software Architectural View of the Data Distribution Problem
– …Deciding the appropriate software connectors for data distribution (and their combinations) to use
[Figure: Planetary Data System with candidate connectors: GridFTP, HTTP/REST, SOAP, Bittorrent, …, Connector N]
A Software Architectural View of the Data Distribution Problem
– …Satisfying specified user scenarios for data distribution
  – 20 users, 5 user types, 10 terabytes of data, users located at JPL, ESA, Goddard, 2 delivery intervals...
  – 100 users, 1 user type, 1 terabyte of data, users located at ISI, 1 delivery interval...
  – 10,000 users, 20 user types, 200 terabytes of data, users located in North America, South America and Europe, 25 delivery intervals...
  – 250 users, 2 user types, 100 gigabytes of data, users located at JPL, Goddard, 10 delivery intervals...
A Software Architectural View of the Data Distribution Problem
– …Making these people happy!
Research Question
– What types of software connectors are best suited for delivering these huge amounts of data to users, satisfying their particular scenarios, in a manner that is performant and scalable, in these hugely distributed data systems?
Problem Statement
– Identifying and selecting suitable software connectors for data distribution* that satisfy user specified constraints
– Use eight key dimensions of data distribution
  – Literature review
  – Our own experience in the context of planetary science and cancer research at JPL
  – User specified constraints on eight dimensions are data distribution scenarios
– Identification of four basic distribution connector classes
  – RPC, P2P, Grid, Event-based
  – What classes are appropriate for which distribution scenarios?
* Referred to as “distribution connectors” or “data distribution connectors”
Eight Dimensions of Data Distribution
[Figure: Data Distribution and its eight dimensions. Node labels: Total Volume; Delivery Schedule (Number of Intervals, Volume Per Interval, Timing of Interval); Performance Requirements (Consistency, Scalability, Dependability, Efficiency); Number of Users; Number of User Types; Number of Data Types and Types of Data (Data, Metadata); Geographic Distribution (WAN, LAN); Access Policies (Producers, Consumers; Automatic, Initiated)]
Eight Dimensions of Data Distribution
– Total Volume - the total amount of data that needs to be transferred from providers of data to consumers of data.
– Number of Delivery Intervals - the number, size and frequency (timing) of intervals that the volume of data should be delivered within.
– Performance Requirements - any constraints and requirements on the scalability, efficiency, consistency, and dependability of the distribution scenario.
– Number of Users - the number of unique users that the data volume needs to be delivered to.
– Number of User Types - the number of unique user types, such as scientists or students, that the data volume needs to be delivered to.
– Data Types - the number of different data types that are part of the total volume to be delivered.
– Geographic Distribution - the geographic distribution of the data providers and consumers.
– Access Policies - the number and types of access policies in place at each producer and consumer of data.
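One way to picture a distribution scenario is as a record with one field per dimension. The sketch below is an illustration only: the class name, field names, the decimal terabyte, and the even split of volume across delivery intervals are assumptions, and the two sample records reuse figures from the user scenarios quoted earlier in the talk. Dimensions those scenarios leave unspecified stay empty.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Dict, List, Optional

TB = 10**12  # bytes in one decimal terabyte (assumption)


@dataclass
class DistributionScenario:
    """User-specified constraints on the eight dimensions (hypothetical layout)."""
    total_volume_bytes: int
    delivery_intervals: int
    num_users: int
    num_user_types: int
    locations: List[str]                       # geographic distribution
    num_data_types: Optional[int] = None       # None means "not specified"
    performance_requirements: Dict[str, str] = field(default_factory=dict)
    access_policies: List[str] = field(default_factory=list)

    def volume_per_interval(self) -> float:
        # Assumes the volume is split evenly across intervals.
        return self.total_volume_bytes / self.delivery_intervals


small = DistributionScenario(
    total_volume_bytes=10 * TB, delivery_intervals=2,
    num_users=20, num_user_types=5,
    locations=["JPL", "ESA", "Goddard"],
)
large = DistributionScenario(
    total_volume_bytes=200 * TB, delivery_intervals=25,
    num_users=10_000, num_user_types=20,
    locations=["North America", "South America", "Europe"],
)

for scenario in (small, large):
    print(scenario.num_users, "users,",
          scenario.volume_per_interval() / TB, "TB per interval")
```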
Approach
[Figure: overview of the proposed approach. Stage labels: Classification, Categorization, Integration, Testing/Evaluation. Other labels shown: Connector Classes (GRID, P2P, RPC); Categorization Framework (Classifier, Categorizer); Software Framework (Selector, Integrator, Preference Evaluator); User Inputs (Distribution Scenarios, User Preferences); DCPs; Combined Connector; Testing Framework (Performance Evaluator); Satisfaction of Distribution Scenario?; Framework Output]
Evaluation Strategy
– Empirical evaluation using real world systems
  – NASA Planetary Data System
  – NASA Orbiting Carbon Observatory Mission
  – National Cancer Institute’s Early Detection Research Network
– Quantifiably measure
  – consistency (data delivered is data sent)
  – efficiency (memory footprint and data throughput)
  – scalability (data volume and number of hosts)
  – dependability (uptime, number of faults)
– Compare to off-the-shelf connector solutions
  – OODT, GridFTP, Aspera, UFTP, Bittorrent, possibly more
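Two of these measures can be sketched in a few lines: consistency as a digest comparison between what was sent and what arrived, and throughput as bytes moved per second. This is a hypothetical harness, not the evaluation code from the study; `measure_transfer`, `loopback`, and `lossy` are made-up names, and the in-memory "connectors" stand in for a real transfer mechanism.

```python
import hashlib
import time
from typing import Callable, Tuple


def sha256_of(data: bytes) -> str:
    """Hex digest used to check that delivered data equals sent data."""
    return hashlib.sha256(data).hexdigest()


def measure_transfer(payload: bytes,
                     transfer: Callable[[bytes], bytes]) -> Tuple[bool, float]:
    """Run transfer(payload); return (consistent?, throughput in bytes/s)."""
    sent_digest = sha256_of(payload)
    start = time.perf_counter()
    received = transfer(payload)
    elapsed = time.perf_counter() - start
    consistent = sha256_of(received) == sent_digest
    throughput = len(payload) / elapsed if elapsed > 0 else float("inf")
    return consistent, throughput


def loopback(payload: bytes) -> bytes:
    """Stand-in connector that copies the bytes in memory."""
    return bytes(payload)


def lossy(payload: bytes) -> bytes:
    """Stand-in connector that drops the final byte, to show a failed check."""
    return payload[:-1]


for connector in (loopback, lossy):
    ok, rate = measure_transfer(b"x" * 1_000_000, connector)
    print(f"{connector.__name__}: consistent={ok}, {rate / 1e6:.1f} MB/s")
```

Memory footprint, scalability, and dependability need longer-running instrumentation (process monitoring, multiple hosts, fault logs) and are left out of this sketch.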
Current Progress
– Preliminary Study with NASA’s Planetary Data System
– Classified and Compared Data Movement Technologies
  – Parallel TCP/IP technologies
    – GridFTP, bbFTP
  – UDP bursting technologies
    – Aspera, UFTP
  – Baseline technologies
    – SCP, FTP, HTTP
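The grouping above amounts to a small lookup table. The sketch below records only the assignments listed on this slide; the dictionary and function names are hypothetical, and any technology the slide does not place in a group is reported as "unclassified" rather than guessed.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Category assignments exactly as listed on the slide above.
CATEGORY_OF: Dict[str, str] = {
    "GridFTP": "parallel TCP/IP",
    "bbFTP": "parallel TCP/IP",
    "Aspera": "UDP bursting",
    "UFTP": "UDP bursting",
    "SCP": "baseline",
    "FTP": "baseline",
    "HTTP": "baseline",
}


def group_by_category(technologies: Iterable[str]) -> Dict[str, List[str]]:
    """Group technology names by category, keeping input order within a group."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for tech in technologies:
        groups[CATEGORY_OF.get(tech, "unclassified")].append(tech)
    return dict(groups)


print(group_by_category(["HTTP", "GridFTP", "UFTP", "Bittorrent"]))
```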
Experimental Results
– Classified and Evaluated each technology against data distribution dimensions
– Measured transfer rate
  – LAN-based
  – WAN-based
  – Varied dataset sizes from 10s of MBs to 10s of GBs
– Ease to operate, ease to install
– UDP technologies not testable on WAN (firewall, security, ease to configure)
[Figure legend: GridFTP (blue), bbFTP (red), FTP (green)]
Conclusions
– Proposed approach for classifying, selecting and evaluating different software connectors for data distribution
– Preliminary results suggest parallel TCP/IP technologies beneficial in real world system (PDS)
– Currently formalizing connector metadata and developing connector XML profiles
Questions?
Thanks for your attention!
Backup
Refereed Papers
– C. Mattmann, S. Kelly, D. Crichton, S. Hughes, S. Hardman, P. Ramirez and R. Joyner. A Classification and Evaluation of Data Movement Technologies for the Delivery of Highly Voluminous Scientific Data Products. In Proceedings of NASA/IEEE Conference on Mass Storage Systems and Technologies, May 2006.
– C. Mattmann, D. Crichton, N. Medvidovic and S. Hughes. A Software Architecture-Based Framework for Highly Distributed and Data Intensive Scientific Applications. In Proceedings of ICSE, Shanghai, China, May 20th-28th, 2006.
– N. Medvidovic and C. Mattmann. The GridLite DREAM: Bringing the Grid to Your Pocket. In Proceedings of the Monterey Workshop on Networked Systems, Irvine, CA, September, 2005.
– C. Mattmann, N. Medvidovic, P. Ramirez and V. Jakobac. Unlocking the Grid. In Proceedings of the 8th ACM SIGSOFT International Symposium on Component-based Software Engineering (CBSE8), pp. 322-336. LNCS 3489, St. Louis, Missouri, May 14th-15th, 2005.
– C. Mattmann, S. Malek, N. Beckman, M. Mikic-Rakic, N. Medvidovic and D. Crichton. GLIDE: A Grid-based, Lightweight, Infrastructure for Data-intensive Environments. In Proceedings of the European Grid Conference (EGC2005), pp. 68-77. LNCS 3470, Amsterdam, The Netherlands, February 14-16, 2005.
– J. Steven Hughes, D. Crichton, S. Kelly, C. Mattmann, R. Joyner, J. Wilf and J. Crichton. A Planetary Data System for the 2006 Mars Reconnaissance Orbiter Era and Beyond. In Proceedings of the 2nd ESA Symposium on Ensuring the Long Term Preservation and Adding Value to Scientific and Technical Data (PV-2004). Frascati, Italy, October 5-7, 2004.
– C. Mattmann, D. Crichton, J.S. Hughes, S. Kelly and P. Ramirez. Software Architecture for Large scale, Distributed, Data-Intensive Systems. In Proceedings of the 4th IEEE/IFIP Working Conference on Software Architecture (WICSA-4), pp. 255-264. Oslo, Norway, June 12th-15th, 2004.
– C. Mattmann, P. Ramirez, D. Crichton and J.S. Hughes. Packaging Data Products using Data Grid Middleware for Deep Space Mission Systems. In Proceedings of the 8th International Conference on Space Operations (Spaceops-2004), AIAA Press. Montreal, Canada, May 2004.