Tools and Techniques for the Data Grid
Gagan Agrawal
Ohio State University, Department of Computer Science and Engineering

Transcript of the talk (65 slides)

Slide 1: Tools and Techniques for the Data Grid (title slide)
Slide 2: Tools and Techniques for the Data Grid (Gagan Agrawal)

Slide 3: Scientific Data Analysis on Grid-based Data Repositories
- Scientific data repositories
  - Large volume: gigabytes, terabytes, petabytes
  - Distributed datasets
  - Generated/collected by scientific simulations or instruments
  - Data could be streaming in nature
- Scientific data analysis: data specification, data organization, data extraction, data movement, data analysis, data visualization

Slide 4: Opportunities
- Scientific simulations and data collection instruments are generating large-scale data
- Grid standards are enabling sharing of data
- Rapidly increasing wide-area bandwidths

Slide 5: Motivating Scientific Applications
- Magnetic Resonance Imaging; Oil Reservoir Management
- Data-driven applications from science, engineering, and biomedicine:
  - Oil Reservoir Management; Water Contamination Studies
  - Cancer Studies using MRI; Telepathology with Digitized Slides
  - Satellite Data Processing; Virtual Microscope

Slide 6: Existing Efforts
- Data grids are recognized as an important component of grid/distributed computing
- Major topics:
  - Efficient/secure data movement
  - Replica selection
  - Metadata catalogs / metadata services
  - Setting up workflows

Slide 7: Open Issues
- Accessing/retrieving/processing data from scientific repositories: need to deal with low-level formats
- Integrating tools and services having/requiring data with different formats
- Support for processing streaming data in a distributed environment
- Efficient distributed data-intensive applications
- Developing scalable data analysis applications

Slide 8: Ongoing Projects
- Automatic data virtualization
- On-the-fly information integration in a distributed environment
- Middleware for processing streaming data: supporting coarse-grained pipelined parallelism
- Compiling XQuery on scientific and streaming data
- Middleware and algorithms for scalable data mining

Slide 9: Outline
- Automatic data virtualization: Relational/SQL based; XML/XQuery based
- Information integration
- Middleware for streaming data: coarse-grained pipelined parallelism

Slide 10: Automatic Data Virtualization: Motivation
- Emergence of grid-based data repositories can enable sharing of data in an unprecedented way
- Access mechanisms for remote repositories: complex low-level formats make accessing and processing data difficult
- Main desired functionality: the ability to select, download, and process a subset of data

Slide 11: Current Approaches
- Databases
  - Relational model using SQL
  - Properties of transactions: atomicity, consistency, isolation, durability
  - Good, but is it too heavyweight for read-mostly scientific data?
- Manual implementation based on low-level datasets: needs a detailed understanding of low-level formats (HDF5, NetCDF, etc.)
- No single established standard: BinX, BFD, DFDL offer machine-readable descriptions, but the application remains dependent on a specific layout

Slide 12: Data Virtualization
(diagram: dataset, Data Service, Data Virtualization as an abstract view of data)
- As defined by the Global Grid Forum's DAIS working group: a Data Virtualization describes an abstract view of data.
A Data Service implements the mechanism to access and process data through the Data Virtualization.

Slide 13: Our Approach: Automatic Data Virtualization
- Automatically create data services: a new application of compiler technology
- A meta-data descriptor describes the layout of data on a repository
- An abstract view is exposed to the users
- Two implementations: Relational/SQL based; XML/XQuery based

Slide 14: System Overview
(system diagram: Select Query, Query Frontend, Compiler Analysis and Code Generation, Meta-data Descriptor, User-Defined Aggregate, Extraction Service (STORM), Aggregation Service, Input)

Slide 15: Designing a Meta-data Description Language
Requirements:
- Specify the relationship of a dataset to the virtual dataset schema
- Describe the dataset's physical layout within a file
- Describe the dataset's distribution on the nodes of one or more clusters
- Specify the subsetting index attributes
- Be easy to use for data repository administrators and convenient for our code generation

Slide 16: Design Overview
- Dataset schema description component
- Dataset storage description component
- Dataset layout description component

Slide 17: An Example: Oil Reservoir Management
- The dataset comprises several simulations on the same grid
- For each realization and each grid point, a number of attributes are stored
- The dataset is stored on a 4-node cluster
Component I: Dataset Schema Description

  [IPARS]            // {* Dataset schema name *}
  REL = short int    // {* Data type definition *}
  TIME = int
  X = float
  Y = float
  Z = float
  SOIL = float
  SGAS = float

Component II: Dataset Storage Description

  [IparsData]                   // {* Dataset name *}
  DatasetDescription = IPARS    // {* Dataset schema for IparsData *}
  DIR[0] = osu0/ipars
  DIR[1] = osu1/ipars
  DIR[2] = osu2/ipars
  DIR[3] = osu3/ipars

Slide 18: Compiler Analysis
(flow: meta-data descriptor, create AFC, process AFC, index and extraction function code)

  Data_Extract {
      Find_File_Groups()
      Process_File_Groups()
  }

  Find_File_Groups {
      Let S be the set of files that match against the query
      Classify the files in S by the set of attributes they have
      Let S1, ..., Sm be the m sets
      T = {}
      foreach {s1, ..., sm}, si ∈ Si {    {* cartesian product between S1, ..., Sm *}
          if the values of the implicit attributes are not inconsistent {
              T = T ∪ {s1, ..., sm}
          }
      }
      Output T
  }

  Process_File_Groups {
      foreach {s1, ..., sm} ∈ T {
          Find_Aligned_File_Chunks()
          Supply the implicit attributes for each file chunk
          foreach aligned file chunk {
              Check against the index
              Compute the offset and length
              Output the aligned file chunk
          }
      }
  }

Slide 19: Testing the Code Generation Tool
- Oil reservoir management
- The performance difference is within 4%-10% relative to Layout 0
- The tool correctly and efficiently handles a variety of different layouts for the same data

Slide 20: Evaluating the Scalability of Our Tool
- Scale the number of nodes hosting the oil reservoir management dataset
- Extract a subset of interest of size 1.3 GB
- The execution times scale almost linearly
- The performance difference varies between 5% and 34%, with an average difference of 16%
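The flat name = type layout of the dataset schema description (Component I above) makes it straightforward to parse mechanically. As a rough illustration, this sketch (hypothetical class and method names, not part of the actual system) reads such a descriptor into an ordered schema map:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaDescriptorSketch {
    // Parse a "Component I"-style schema descriptor (lines of the form
    // "NAME = type", with an optional bracketed schema name and //-comments)
    // into an ordered name -> type map.
    public static Map<String, String> parseSchema(String descriptor) {
        Map<String, String> schema = new LinkedHashMap<>();
        for (String line : descriptor.split("\n")) {
            int c = line.indexOf("//");
            if (c >= 0) line = line.substring(0, c);    // strip inline comments
            line = line.trim();
            if (line.isEmpty() || line.startsWith("[")) continue;  // schema name
            String[] parts = line.split("=", 2);
            if (parts.length == 2) schema.put(parts[0].trim(), parts[1].trim());
        }
        return schema;
    }

    public static void main(String[] args) {
        String ipars = "[IPARS]\nREL = short int\nTIME = int\nX = float\n"
                     + "Y = float\nZ = float\nSOIL = float\nSGAS = float";
        System.out.println(parseSchema(ipars).size());      // 7
        System.out.println(parseSchema(ipars).get("REL"));  // short int
    }
}
```

An ordered map preserves the attribute order given by the administrator, which matters when the schema order is later matched against the physical layout.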
Slide 21: Comparison with an Existing Database (PostgreSQL)
- 6 GB of data for satellite data processing; the total storage required after loading the data into PostgreSQL is 18 GB
- Indexes were created for both the spatial coordinates and S1 in PostgreSQL
- No special performance tuning was applied for the experiment

  No.  Description
  1    SELECT * FROM TITAN;
  2    SELECT * FROM TITAN WHERE X >= 0 AND X <= ... AND Y >= 0 AND Y <= ... AND Z ...

Slide 28: XQuery Overview
- XQuery: a language for querying and processing XML documents
  - Functional language
  - Single assignment
  - Strongly typed
- XQuery expressions: for / let / where / return (FLWR); unordered; path expressions

  unordered(
      for $d in document("depts.xml")//deptno
      let $e := document("emps.xml")//emp[Deptno = $d]
      where count($e) >= 10
      return { $d, { count($e) }, { avg($e/salary) } }
  )

Slide 29: Satellite: XQuery Code

  unordered(
      for $i in ($minx to $maxx)
      for $j in ($miny to $maxy)
      let $p := document("sate.xml")/data/pixel
      where lat = $i and long = $j
      return { $i } { $j } { accumulate($p) }
  )

  define function accumulate($p) as double {
      let $inp := item-at($p, 1)
      let $NVDI := (($inp/band1 - $inp/band0) div ($inp/band1 + $inp/band0) + 1) * 512
      return
          if (empty($p)) then 0
          else max($NVDI, accumulate(subsequence($p, 2)))
  }

Slide 30: Challenges
- Need to translate to the low-level schema: focus on correctness and avoiding unnecessary reads
- Enhancing locality: data-centric execution of XQuery constructs, using information on the low-level data layout
- Issues specific to XQuery: reductions expressed as recursive functions
- Generating code in an imperative language, for either direct compilation or use as part of a runtime system; requires type conversion

Slide 31: Outline
- Automatic data virtualization: Relational/SQL based; XML/XQuery based
- Information integration
- Middleware for streaming data: coarse-grained pipelined parallelism

Slide 32: Introduction: Information Integration
- Goal: provide a uniform access/query interface to multiple heterogeneous sources
- Challenges: global schema, query optimization, resource discovery, ontology discrepancy, etc.

Slide 33: Introduction: Wrappers
- Goal: provide the integration system with transparent access to data sources
- Challenges: development cost, performance, transportability

Slide 34: Roadmap
- Introduction
- System overview
- Meta-data description language
- Wrapper generation
- Conclusion

Slide 35: Overview: Main Components
- User's view of the data: meta-data description language
- Mapping between the input and output schemas: schema mapping
- Parsing inputs and generating outputs: DataReader and DataWriter

Slide 36: System Overview
(system diagram: Meta Data Descriptor, Parser, Mapping Generator, Internal Data Entry Representation, Schema Mapping, Code Generator, DataReader, DataWriter, Integrator, Source Dataset, Target Dataset)

Slide 37: Meta Data Descriptor (1)
- Design goals: easy to interpret and process; easy to write; sufficiently expressive
- Added features (for bioinformatics datasets):
  - Strings with no fixed size
  - Delimiters used for separating fields
  - Fields may be divided into lines/variables
  - Total number of items unknown

Slide 38: Meta Data Descriptor (2)
Component I.
Schema Description

  [FASTA]                 // Schema name
  ID = string             // Data field name = data type
  DESCRIPTION = string
  SEQ = string

Slide 39: Meta Data Descriptor (3)
Component II. Layout Description

  DATASET FASTAData {          // Dataset name
      DATATYPE { FASTA }       // Schema name
      DATASPACE LINESIZE=80 {  // File layout
          LOOP ENTRY 1:EOF:1 {
              > ID DESCRIPTION \n
              | EOF
          }
      }
      DATA { osu/fasta }       // File location
  }

Example data (fields ID, DESCRIPTION, SEQ):

  >Example1 envelope protein
  ELRLRYCAPAGFALLKCNDA
  DYDGFKTNCSNVSVVHCTNL
  MNTTVTTGLLLNGSYSENRT
  QIWQKHRTSNDSALILLNKH
  >Example2 synthetic peptide
  HITREPLKHIPKERYRGTNDT

Slide 40: Meta Data Descriptor (4)
Input SWISSPROT data:

  LOOP ENTRY 1:EOF:1 {
      ID ID
      LOOP I 1:3:1 { \nDT DATE }
      [\nOG ORGANELLE]
      ...
  }

Example entry:

  P1;CRAM_CRAAB
  TTCCPSIVAR SNFNVCRLPG TPEAICATYT GCIIIPGATC PGDYAN

Slide 41: Wrapper Generation: Mapping Generator
- Goal: generate the schema mapping from the schema descriptors
- Criterion: strict name matching

  [SWISSPROT]:[FASTA]        // [input schema]:[output schema]
  ID:ID                      // source field : target field
  DESCRIPTION:DESCRIPTION    // from SWISSPROT
  SEQ:SEQ

Slide 42: Wrapper Generation: Parser
- Key observation: data is stored in an entry-wise manner
  LOOP ENTRY 1:EOF:1 { single data entry }
- An entry is made of delimiter-variable pairs, with environment symbols in between

Slide 43: Wrapper Generation: Parse Tree

  LOOP ENTRY 1:EOF:1 {
      > ID DESCRIPTION \n
      | EOF
  }

(parse tree: Data Entry with children >-ID, (space)-DESCRIPTION, \n-SEQ, \n-DUMMY | EOF)

Slide 44: Wrapper Generation: Code Generator
Creates two application-specific modules.
- DataReader:
  - Scans the source data file
  - Locates each DLM-VAR (delimiter-variable) pair
  - Submits each variable required by the target, along with its order
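To make the DataReader's role concrete, here is a minimal sketch (hypothetical class and method names, not the generated wrapper code) that walks FASTA text using the '>' and newline delimiters from the layout description above and emits (ID, DESCRIPTION, SEQ) triples:

```java
import java.util.ArrayList;
import java.util.List;

public class FastaReaderSketch {
    // Split the FASTA text into entries at each '>' delimiter and emit
    // the (ID, DESCRIPTION, SEQ) fields of each entry in schema order.
    public static List<String[]> read(String text) {
        List<String[]> entries = new ArrayList<>();
        for (String entry : text.split(">")) {
            if (entry.trim().isEmpty()) continue;
            String[] lines = entry.split("\n");
            // Header line: ID up to the first space, DESCRIPTION after it.
            String header = lines[0].trim();
            int sp = header.indexOf(' ');
            String id = sp < 0 ? header : header.substring(0, sp);
            String desc = sp < 0 ? "" : header.substring(sp + 1);
            // The remaining lines form the sequence.
            StringBuilder seq = new StringBuilder();
            for (int i = 1; i < lines.length; i++) seq.append(lines[i].trim());
            entries.add(new String[] { id, desc, seq.toString() });
        }
        return entries;
    }

    public static void main(String[] args) {
        String data = ">Example1 envelope protein\nELRLRYCAPAGFALLKCNDA\n"
                    + ">Example2 synthetic peptide\nHITREPLKHIPKERYRGTNDT\n";
        List<String[]> entries = read(data);
        System.out.println(entries.get(0)[0]);  // Example1
        System.out.println(entries.get(1)[2]);  // HITREPLKHIPKERYRGTNDT
    }
}
```

A generated reader would additionally report each variable's order to the DataWriter, as described above; this sketch only shows the delimiter-driven field extraction.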
- DataWriter:
  - Takes in each variable and its order
  - Looks up the DLM-VAR pair
  - Checks the line size
  - Writes the target file

Slide 45: Outline
- Automatic data virtualization: Relational/SQL based; XML/XQuery based
- Information integration
- Middleware for streaming data: coarse-grained pipelined parallelism

Slide 46: Streaming Data Model
- Continuous data arrival and processing
- An emerging model for data processing
  - Sources that produce data continuously: sensors, long-running simulations
  - WAN bandwidths are growing faster than disk bandwidths
- An active topic in many computer science communities: databases, data mining, networking, ...

Slide 47: Summary/Limitations of Current Work
- Current work focuses on centralized processing of a stream from a single source (databases, data mining), or on communication only (networking)
- Many applications involve distributed processing of streams, and streams from multiple sources

Slide 48: Motivating Application
(diagram: switches, Network X, network fault management system)

Slide 49: Motivating Application (2)
- Computer-vision-based surveillance

Slide 50: Features of Distributed Stream Processing Applications
- Data sources could be distributed over a WAN
- Continuous data arrival, enormous volume: probably cannot communicate it all to one site
- Results from the analysis may be desired at multiple sites
- Real-time constraints
- In short, a real-time, high-throughput, distributed processing problem

Slide 51: Motivation: Challenges and Possible Solutions
- Challenge 1: data-, communication-, and/or compute-intensive
(diagram: switches, Network X)

Slide 52: Challenges and Possible Solutions
- Challenge 1: data- and/or computation-intensive
  - Solution: grid computing technologies

Slide 53: Challenges and Possible Solutions
- Challenge 1: data- and/or computation-intensive
  - Solution: grid computing technologies
- Challenge 2: real-time analysis is required
  - Solution: self-adaptation functionality is desired

Slide 54: Need for a Grid-Based Stream Processing Middleware
- Application developers interested in data stream processing would like to have grid standards and interfaces, as well as the adaptation function, abstracted away, so they can focus on algorithms only
- GATES is a middleware for Grid-based self-adapting data stream processing

Slide 55: Using GATES
- Break the analysis down into several sub-tasks that form a pipeline
- Implement each sub-task in Java
- Write an XML configuration file for the sub-tasks to be automatically deployed
- Launch the application by running a Java program (StreamClient.class) provided by GATES

Slide 56: System Architecture

Slide 57: Adaptation for Real-time Processing
- Analysis of streaming data is approximate
- The accuracy versus execution-rate trade-off can be captured by certain parameters (adaptation parameters), such as the sampling rate or the size of a summary structure
- Application developers can expose these parameters and a range of values

Slide 58: API for Adaptation

  public class Sampling-Stage implements StreamProcessing {
      void init() { }
      void work(buffer in, buffer out) {
          while (true) {
              Image img = get-from-buffer-in-GATES(in);
              Image img-sample = Sampling(img, sampling-ratio);
              put-to-buffer-in-GATES(img-sample, out);
          }
      }
  }

  API calls for adaptation:
  sampling-ratio = GATES.getSuggestedParameter();
  GATES.Information-About-Adjustment-Parameter(min, max, 1);

Slide 59: Outline
- Automatic data virtualization: Relational/SQL based; XML/XQuery based
- Information integration
- Middleware for streaming data: coarse-grained pipelined parallelism

Slide 60: Context: Coarse-Grained Pipelined Parallelism
- Motivating application scenarios
(diagram: Internet, data)

Slide 61: Motivating Application Classes
- Scientific data analysis: solving shallow water equations (SWE); developing an Eastern North Pacific tidal model
- Data mining: k-nearest-neighbor search, k-means clustering, hot-list queries
- Visualization: visualizing time-dependent, two-dimensional wake vortex computations; iso-surface rendering
- Image analysis: virtual microscope

Slide 62: Ways to Implement: Local processing
(diagram: Internet, data)

Slide 63: Ways to Implement: Remote processing
(diagram: Internet, data)

Slide 64: Ways to Implement: Our approach
- A coarse-grained pipelined execution model is a good match
(diagram: Internet, data)

Slide 65: Overview of Our Efforts
- Language and compiler framework for coarse-grained pipelined parallelism (SC 2003)
- Reduction strategies (SC 2003)
- Support for program adaptation (SC 2004)
- DataCutter runtime system (Saltz et al.)
- Packet size optimization (ICPP 2004)
- Filter decomposition problem (submitted)

Slide 66: Group Members
- Ph.D. students: Liang Chen, Wei Du, Leo Glimcher, Ruoming Jin, Xiaogang Li, Kaushik Sinha, Li Weng, Xuan Zhang
- Master's students: Anjan Goswami, Swarup Sahoo

Slide 67: Getting Involved
- Talk to me
- The most recent papers are available online
- Sign up for my 888
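As a closing illustration of the coarse-grained pipelined execution model above, this sketch (a simplification with hypothetical names, not the GATES or DataCutter API) runs two stages as threads connected by a bounded buffer, so that production and consumption of data chunks overlap:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineSketch {
    static final int DONE = -1;  // end-of-stream marker

    // Stage 1 produces chunks 1..n (standing in for reading remote data);
    // stage 2 consumes and aggregates them (standing in for analysis).
    // The bounded buffer lets the two stages run concurrently.
    public static int runPipeline(int n) throws InterruptedException {
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(4);
        final int[] sum = {0};

        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= n; i++) buffer.put(i);
                buffer.put(DONE);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread consumer = new Thread(() -> {
            try {
                for (int v; (v = buffer.take()) != DONE; ) sum[0] += v;
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        producer.start(); consumer.start();
        producer.join(); consumer.join();
        return sum[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runPipeline(5));  // 15
    }
}
```

In a real deployment the stages would run on different machines along the path from data source to destination, with the buffer replaced by a network channel.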