DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department of Computer Science University of Texas at Dallas

Page 1:

Page 2:

Outline

Semantic Web Technologies & Cloud Computing Frameworks
Goal & Motivation
Current Approaches
System Architecture & Storage Schema
SPARQL Query by MapReduce
Query Plan Generation
Experiment
Future Work

Page 3:

Semantic Web Technologies

Data in machine-understandable format
Infer new knowledge
Standards:
  Data representation – RDF triples
    Example: Subject http://test.com/s1, Predicate foaf:name, Object "John Smith"
  Ontology – OWL, DAML
  Query language – SPARQL

Page 4:

Cloud Computing Frameworks

Proprietary:
  Amazon S3
  Amazon EC2
  Force.com
Open source:
  Hadoop – Apache's open source implementation of Google's proprietary GFS file system
  MapReduce – functional programming paradigm using key-value pairs
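The key-value paradigm can be illustrated with a minimal in-memory word count (plain Python standing in for a Hadoop job; all names here are illustrative, not Hadoop APIs):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle/sort groups pairs by key; reduce sums the values per key."""
    groups = defaultdict(list)
    for key, value in pairs:          # shuffle & sort
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(map_phase(["hello world", "hello hadoop"]))
```

In Hadoop the same two functions run distributed over HDFS blocks; only the map/reduce signatures matter to the programmer.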

Page 5:

Outline

Semantic Web Technologies & Cloud Computing Frameworks
Goal & Motivation
Current Approaches
System Architecture & Storage Schema
SPARQL Query by MapReduce
Query Plan Generation
Experiment
Future Work

Page 6:

Goal

To build efficient storage using Hadoop for large amounts of data (e.g., a billion triples)
To build an efficient query mechanism
Publish as an open source project: http://code.google.com/p/hadooprdf/
Integrate with Jena as a Jena Model

Page 7:

Motivation

Current Semantic Web frameworks do not scale to a large number of triples, e.g.:
  Jena In-Memory, Jena RDB, Jena SDB
  AllegroGraph
  Virtuoso Universal Server
  BigOWLIM
There is a lack of a distributed framework with persistent storage
Hadoop runs on low-end hardware, providing a distributed framework with high fault tolerance and reliability

Page 8:

Outline

Semantic Web Technologies & Cloud Computing Frameworks
Goal & Motivation
Current Approaches
System Architecture & Storage Schema
SPARQL Query by MapReduce
Query Plan Generation
Experiment
Future Work

Page 9:

Current Approaches

State-of-the-art approach (our approach): store RDF data in HDFS and query through MapReduce programming
Traditional approach: store data in HDFS and process queries outside of Hadoop
  Done in the BIOMANTA1 project (details of querying could not be found)

1. http://biomanta.org/

Page 10:

Outline

Semantic Web Technologies & Cloud Computing Frameworks
Goal & Motivation
Current Approaches
System Architecture & Storage Schema
SPARQL Query by MapReduce
Query Plan Generation
Experiment
Future Work

Page 11:

System Architecture (diagram)

LUBM Data Generator → RDF/XML → Preprocessor (N-Triples Converter, Predicate Based Splitter, Object Type Based Splitter) → Preprocessed Data → Hadoop Distributed File System / Hadoop Cluster
MapReduce Framework: Query Rewriter, Query Plan Generator, Plan Executor
Flow: 1. Query → 2. Jobs → 3. Answer

Page 12:

Storage Schema

Data in N-Triples
Using namespaces
  Example: http://utdallas.edu/res1 becomes utd:resource1
Predicate based Splits (PS): split data according to predicates
Predicate Object based Splits (POS): split further according to the rdf:type of objects
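The PS and POS splits can be sketched in a few lines of Python. This is an illustrative sketch, not the actual preprocessor; file naming follows the rdf_type_* / predicate_type convention used in the examples:

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

def predicate_split(triples):
    """PS: group triples into one 'file' per predicate; the predicate
    itself is dropped from each line, which is where space is saved."""
    files = defaultdict(list)
    for s, p, o in triples:
        files[p.replace(":", "_")].append((s, o))
    return files

def predicate_object_split(triples):
    """POS: split each predicate file further by the rdf:type of the object."""
    types = {s: o for s, p, o in triples if p == RDF_TYPE}
    files = defaultdict(list)
    for s, p, o in triples:
        if p == RDF_TYPE:
            # type file: only the subject remains
            files["rdf_type_" + o.split(":")[-1]].append((s,))
        elif o in types:
            name = p.replace(":", "_") + "_" + types[o].split(":")[-1]
            files[name].append((s, o))
        else:
            files[p.replace(":", "_")].append((s, o))
    return files

triples = [
    ("D0U0:GraduateStudent20", "rdf:type", "lehigh:GraduateStudent"),
    ("lehigh:University0", "rdf:type", "lehigh:University"),
    ("D0U0:GraduateStudent20", "lehigh:memberOf", "lehigh:University0"),
]
pos = predicate_object_split(triples)
```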

Page 13:

Example

Raw triples:
  D0U0:GraduateStudent20 rdf:type lehigh:GraduateStudent
  lehigh:University0 rdf:type lehigh:University
  D0U0:GraduateStudent20 lehigh:memberOf lehigh:University0

PS:
  File rdf_type:
    D0U0:GraduateStudent20 lehigh:GraduateStudent
    lehigh:University0 lehigh:University
  File lehigh_memberOf:
    D0U0:GraduateStudent20 lehigh:University0

POS:
  File rdf_type_GraduateStudent:
    D0U0:GraduateStudent20
  File rdf_type_University:
    lehigh:University0
  File lehigh_memberOf_University:
    D0U0:GraduateStudent20 lehigh:University0

Page 14:

Space Gain

Data size at various steps for LUBM1000:

Step                          Number of Files   Size (GB)   Space Gain
N-Triples                     20020             24          --
Predicate Split (PS)          17                7.1         70.42%
Predicate Object Split (POS)  41                6.6         72.5%
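The space-gain column follows directly from the sizes in the table:

```python
def space_gain(original_gb, split_gb):
    """Percentage saved relative to the raw N-Triples size."""
    return round((1 - split_gb / original_gb) * 100, 2)

ps_gain = space_gain(24, 7.1)   # Predicate Split
pos_gain = space_gain(24, 6.6)  # Predicate Object Split
```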

Page 15:

Outline

Semantic Web Technologies & Cloud Computing Frameworks
Goal & Motivation
Current Approaches
System Architecture & Storage Schema
SPARQL Query by MapReduce
Query Plan Generation
Experiment
Future Work

Page 16:

SPARQL Query

SPARQL – SPARQL Protocol And RDF Query Language

Example:
SELECT ?x ?y WHERE {
  ?z foaf:name ?x .
  ?z foaf:age ?y
}
Query + Data → Result
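The semantics of this two-pattern query can be mimicked with a toy in-memory matcher (a sketch for intuition only, not a real SPARQL engine; terms starting with "?" are variables):

```python
def match(triples, patterns):
    """Return variable bindings satisfying all patterns."""
    bindings = [{}]
    for pattern in patterns:
        new = []
        for b in bindings:
            for triple in triples:
                b2 = dict(b)
                ok = True
                for term, value in zip(pattern, triple):
                    if term.startswith("?"):
                        # bind the variable, or check an existing binding
                        if b2.setdefault(term, value) != value:
                            ok = False
                            break
                    elif term != value:
                        ok = False
                        break
                if ok:
                    new.append(b2)
        bindings = new
    return bindings

data = [
    ("s1", "foaf:name", "John Smith"),
    ("s1", "foaf:age", "42"),
]
rows = [(b["?x"], b["?y"]) for b in match(data, [
    ("?z", "foaf:name", "?x"),
    ("?z", "foaf:age", "?y"),
])]
```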

Page 17:

SPARQL Query by MapReduce

Example query:
SELECT ?p WHERE {
  ?x rdf:type lehigh:Department .
  ?p lehigh:worksFor ?x .
  ?x subOrganizationOf <http://University0.edu>
}

Rewritten query:
SELECT ?p WHERE {
  ?p lehigh:worksFor_Department ?x .
  ?x subOrganizationOf <http://University0.edu>
}
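One way to implement this rewriting step, folding an rdf:type pattern into the POS predicate of the pattern that joins on it, can be sketched as follows (illustrative only; `rewrite` and its logic are an assumption, not the actual Query Rewriter):

```python
def rewrite(patterns):
    """Fold '?v rdf:type T' patterns into the predicates of other
    patterns whose object is ?v, so they read POS files directly."""
    types = {s: o for s, p, o in patterns if p == "rdf:type"}
    rewritten = []
    for s, p, o in patterns:
        if p == "rdf:type" and any(
                o2 == s for s2, p2, o2 in patterns if p2 != "rdf:type"):
            continue  # absorbed into a POS predicate below
        if p != "rdf:type" and o in types:
            p = p + "_" + types[o].split(":")[-1]
        rewritten.append((s, p, o))
    return rewritten

query = [
    ("?x", "rdf:type", "lehigh:Department"),
    ("?p", "lehigh:worksFor", "?x"),
    ("?x", "subOrganizationOf", "<http://University0.edu>"),
]
out = rewrite(query)
```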

Page 18:

Inside Hadoop MapReduce Job

Input:
  File subOrganizationOf_University:
    Department1 http://University0.edu
    Department2 http://University1.edu
  File worksFor_Department:
    Professor1 Department1
    Professor2 Department2

Map: filter subOrganizationOf triples on Object == http://University0.edu; emit department-keyed pairs
  Department1 → SO#http://University0.edu
  Department1 → WF#Professor1
  Department2 → WF#Professor2

Shuffle & sort:
  Department1 → SO#http://University0.edu, WF#Professor1
  Department2 → WF#Professor2

Reduce: join on the department key
Output: WF#Professor1
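The pictured job can be simulated in memory; as on the slide, the SO#/WF# prefixes tag which split file each value came from (the Python functions are an illustrative stand-in for the Hadoop job):

```python
from collections import defaultdict

sub_org = [("Department1", "http://University0.edu"),
           ("Department2", "http://University1.edu")]
works_for = [("Professor1", "Department1"),
             ("Professor2", "Department2")]

def map_phase():
    # subOrganizationOf_University: filter on the bound object, key by department
    for dept, univ in sub_org:
        if univ == "http://University0.edu":
            yield (dept, "SO#" + univ)
    # worksFor_Department: key by department, tag the professor
    for prof, dept in works_for:
        yield (dept, "WF#" + prof)

def reduce_phase(pairs):
    groups = defaultdict(list)
    for k, v in pairs:                # shuffle & sort
        groups[k].append(v)
    # the join succeeds only for departments that also have an SO# value
    return [v[3:]
            for vs in groups.values()
            if any(v.startswith("SO#") for v in vs)
            for v in vs if v.startswith("WF#")]

answer = reduce_phase(map_phase())
```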

Page 19:

Outline

Semantic Web Technologies & Cloud Computing Frameworks
Goal & Motivation
Current Approaches
System Architecture & Storage Schema
SPARQL Query by MapReduce
Query Plan Generation
Experiment
Future Work

Page 20:

Query Plan Generation

Challenge: one Hadoop job may not be sufficient to answer a query
  In a single Hadoop job, a single triple pattern cannot take part in joins on more than one variable simultaneously
Solution: an algorithm for query plan generation
  A query plan is a sequence of Hadoop jobs which answers the query
  Exploit the fact that in a single Hadoop job, a single triple pattern can take part in more than one join on a single variable simultaneously

Page 21:

Example

Example query:
SELECT ?X ?Y ?Z WHERE {
  ?X pred1 obj1 .
  subj2 ?Z obj2 .
  subj3 ?X ?Z .
  ?Y pred4 obj4 .
  ?Y pred5 ?X
}

Simplified view (variables per triple pattern):
  1. X
  2. Z
  3. XZ
  4. Y
  5. XY

Page 22:

Join Graph & Hadoop Jobs

(Figure: the join graph for triple patterns 1–5, with edges labeled by the joining variables X, Y and Z; three colorings of the same graph show Valid Job 1, Valid Job 2 and an Invalid Job.)

Page 23:

Possible Query Plans

A. job1: (x, xz, xy) = yz; job2: (yz, y) = z; job3: (z, z) = done

(Figure: starting from the join graph, Job 1 merges patterns 1, 3, 5 on X; Job 2 joins the result with pattern 4 on Y, giving 1, 3, 4, 5; Job 3 joins with pattern 2 on Z, giving the result 1, 2, 3, 4, 5.)

Page 24:

Possible Query Plans

B. job1: (y, xy) = x; (z, xz) = x; job2: (x, x, x) = done

(Figure: Job 1 performs two joins, merging patterns 4, 5 on Y and patterns 2, 3 on Z; Job 2 joins the three resulting X-triples with pattern 1, giving the result 1, 2, 3, 4, 5.)

Page 25:

Query Plan Generation

Goal: generate a minimum-cost job plan
Backtracking approach:
  Exhaustively generates all possible plans
  Uses a two-coloring scheme on a graph to find jobs, with colors WHITE and BLACK; two WHITE nodes cannot be adjacent
  User-defined cost model; chooses the best plan according to the cost model
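Since two WHITE nodes cannot be adjacent, the joins selected for one job form an independent set in the join graph. A sketch that enumerates all valid single-job selections (the node and edge lists below are illustrative, not taken from a particular query):

```python
from itertools import combinations

def valid_jobs(nodes, edges):
    """Enumerate non-empty subsets of join-graph nodes that may be
    colored WHITE (executed in one job): no two WHITE nodes adjacent."""
    jobs = []
    for r in range(1, len(nodes) + 1):
        for subset in combinations(nodes, r):
            chosen = set(subset)
            if not any(a in chosen and b in chosen for a, b in edges):
                jobs.append(chosen)
    return jobs

# three mutually conflicting joins (a triangle): only singleton jobs are valid
edges = [(1, 2), (2, 3), (1, 3)]
jobs = valid_jobs([1, 2, 3], edges)
```

The exhaustive planner would recurse over such selections, which is exactly why the search space grows exponentially.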

Page 26:

Some Definitions

Triple Pattern (TP): an ordered collection of subject, predicate and object which appears in a SPARQL query WHERE clause. The subject, predicate and object can each be either a variable (unbounded) or a concrete value (bounded).

Triple Pattern Join (TPJ): a join between two TPs on a variable.

MapReduceJoin (MRJ): a join between two or more triple patterns on a variable.

Page 27:

Some Definitions

Job (JB): a Hadoop job in which one or more MRJs are done. A JB has a set of input files and a set of output files.

Conflicting MapReduceJoins (CMRJ): a pair of MRJs that share a triple pattern but join on different variables.

Non-Conflicting MapReduceJoins (NCMRJ): a pair of MRJs either not sharing any triple pattern, or sharing a triple pattern with both MRJs on the same variable.

Page 28:

Example

LUBM Query:
SELECT ?X WHERE {
  1  ?X rdf:type ub:Chair .
  2  ?Y rdf:type ub:Department .
  3  ?X ub:worksFor ?Y .
  4  ?Y ub:subOrganizationOf <http://www.U0.edu>
}

Page 29:

Example (contd.)

(Figure: Triple Pattern Graph and Join Graph for the LUBM query — Triple Pattern Graph (TPG)#1, Join Graph (JG)#1, Join Graph (JG)#2, Triple Pattern Graph (TPG)#2.)

Page 30:

Example (contd.)

The figure shows the TPG and JG for the query. On the left is the TPG, where each node represents a triple pattern in the query; nodes are named in the order the patterns appear. In the middle is the JG; each node in the JG represents an edge in the TPG.
For this query, an FQP can have two jobs: the first dealing with the NCMRJ between triple patterns 2, 3 and 4, and the second with the NCMRJ between triple pattern 1 and the output of the first join.
An IQP would be a first job having CMRJs between patterns 1, 3 and 4, and a second job having an MRJ between triple pattern 2 and the output of the first join.

Page 31:

Query Plan Generation: Backtracking

Page 32:

Query Plan Generation: Backtracking

Page 33:

Query Plan Generation: Backtracking

Drawbacks of the backtracking approach:
  Computationally intractable
  Search space is exponential in size

Page 34:

Steps a Hadoop Job Goes Through

1. The executable file (containing MapReduce code) is transferred from the client machine to the JobTracker
2. The JobTracker decides which TaskTrackers will execute the job
3. The executable file is distributed to the TaskTrackers over the network
4. Map processes start by reading data from HDFS
5. Map outputs are written to disk
6. Map outputs are read from disk, shuffled (transferred over the network to the TaskTrackers that will run the Reduce processes), sorted, and written to disk
7. Reduce processes start by reading the input from disk
8. Reduce outputs are written to disk

Page 35:

MapReduce Data Flow

http://developer.yahoo.com/hadoop/tutorial/module4.html#dataflow

Page 36:

Observations & an Approximate Solution

Observations:
  Fixed overheads of a Hadoop job
  Multiple reads and writes to disk
  Data transferred over the network multiple times
  Even a "Hello World" MapReduce job takes a couple of seconds because of the fixed overheads
Approximate solution: minimize the number of jobs
  This is a good approximation, since the overhead of each job (e.g. jar file distribution, multiple disk reads/writes, multiple network transfers) and of job switching is huge

Page 37:

Greedy Algorithm: Terms

Joining variable: a variable that is common to two or more triples
  Ex: given x, y, xy, xz, za – x, y and z are joining variables; a is not
Complete elimination: a join operation that eliminates a joining variable
  y can be completely eliminated if we join (xy, y)
Partial elimination: a join that partially eliminates a joining variable
  After complete elimination of y, x can be partially eliminated by joining (xz, x)

Page 38:

Greedy Algorithm: Terms

E-count: the number of joining variables in the resultant triple after a complete elimination
  In the example x, y, z, xy, xz:
    E-count of x is 2 (resultant triple: yz)
    E-count of y is 1 (resultant triple: x)
    E-count of z is 1 (resultant triple: x)
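E-count can be computed mechanically when triples are represented as sets of their joining variables (a sketch; `e_count` is an illustrative helper name):

```python
def e_count(var, triples):
    """Join (complete elimination) every triple containing var and count
    the joining variables left in the resultant triple."""
    # a variable is 'joining' if it occurs in at least two triples
    joining = {v for t in triples for v in t
               if sum(v in t2 for t2 in triples) > 1}
    merged = set().union(*[t for t in triples if var in t]) - {var}
    return len(merged & joining)

# the slide's example: triples x, y, z, xy, xz
triples = [{"x"}, {"y"}, {"z"}, {"x", "y"}, {"x", "z"}]
```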

Page 39:

Greedy Algorithm: Proposition

Maximum number of jobs required for any SPARQL query:
  K, if K <= 1
  min(ceil(1.71 * log2 K), N), if K > 1
where K is the number of triples in the query and N is the total number of joining variables

Page 40:

Greedy Algorithm: Proof

If we make just one join with each joining variable, then all joins can be done in N jobs (one join per job)
Special case: suppose each joining variable is common to exactly two triples
  Example: ab, bc, cd, de, ef, … (a chain)
  In each job we can make K/2 joins, which reduces the number of triples to half (i.e., K/2)
  So each job halves the number of triples
  Therefore the total number of jobs required is log2 K < 1.71 * log2 K

Page 41:

Greedy Algorithm: Proof (Continued)

General case: sort the variables in decreasing order of their frequency across triples
  Let vi have frequency fi; therefore fi <= fi-1 <= fi-2 <= … <= f1
  Note that if f1 = 2, the problem reduces to the special case
  Therefore f1 > 2 in the general case; also fN >= 2
  Now we keep joining on v1, v2, …, vN as long as there is no conflict

Page 42:

Greedy Algorithm: Proof (Continued)

Suppose L triples could not be reduced because each of them is left only with one or more conflicting joining variables (e.g. try reducing xy, yz, zx)
Therefore M >= L joins have been performed, producing M triples (M + L triples remaining in total)
Since each join involves at least 2 triples:
  2M + L <= K
  2(L + e) + L <= K      (letting M = L + e, e >= 0)
  3L + 2e <= K
  2L + (4/3)e <= (2/3)K  (multiplying both sides by 2/3)

Page 43:

Greedy Algorithm: Proof (Continued)

Since e >= 0, the number of remaining triples is M + L = 2L + e <= (2/3)K
So each job reduces the number of triples to at most 2/3
Therefore, for Q jobs:
  K * (2/3)^Q >= 1 >= K * (2/3)^(Q+1)
  (3/2)^Q <= K <= (3/2)^(Q+1)
  Q <= log_{3/2} K = 1.71 * log2 K <= Q + 1
In most real-world scenarios, more than 100 triples in a query is extremely rare
So the maximum number of jobs required in that case is 12
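In numeric form (a direct transcription of the proposition; `max_jobs` is an illustrative name), which also verifies the 12-job figure for K = 100:

```python
import math

def max_jobs(K, N):
    """Upper bound on Hadoop jobs for a query with K triples
    and N joining variables."""
    if K <= 1:
        return K
    return min(math.ceil(1.71 * math.log2(K)), N)
```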

Page 44:

Greedy Algorithm

Early elimination heuristic: make as many complete eliminations in each job as possible
  This leaves the fewest variables to join in the next job
  Must first choose the join with the least e-count (fewest joining variables in the resultant triple)

Page 45:

Greedy Algorithm

Page 46:

Greedy Algorithm

Step I: remove non-joining variables
Step II: sort the variables according to e-count
Step III: choose a variable for elimination as long as complete or partial elimination is possible – these joins make a job
Step IV: continue to Step II if more triples are available
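One possible reading of Steps I–IV as runnable code. This is a heavily simplified sketch under stated assumptions, not the authors' implementation: each outer iteration is one job, and every chosen variable joins all unconsumed triples containing it:

```python
def greedy_jobs(triples):
    """Greedy plan sketch over triples given as sets of variables.
    Returns the number of Hadoop jobs in the generated plan."""
    triples = [set(t) for t in triples]
    jobs = 0
    while len(triples) > 1:
        # joining variables occur in at least two triples
        joining = {v for t in triples for v in t
                   if sum(v in t2 for t2 in triples) > 1}
        if not joining:
            break
        # Step I: remove non-joining variables
        triples = [t & joining for t in triples]

        def e_count(v):
            merged = set().union(*[t for t in triples if v in t]) - {v}
            return len(merged & joining)

        # Step II: sort variables by e-count (fewest leftover variables first)
        order = sorted(joining, key=e_count)
        # Step III: eliminate while at least two unconsumed triples share the variable
        consumed, produced = [], []
        for v in order:
            avail = [t for t in triples
                     if v in t and not any(t is c for c in consumed)]
            if len(avail) >= 2:
                produced.append(set().union(*avail) - {v})
                consumed.extend(avail)
        triples = [t for t in triples
                   if not any(t is c for c in consumed)] + produced
        jobs += 1  # Step IV: loop while more triples remain
    return jobs

# the slide's example x, y, z, xy, xz: y and z eliminated in job 1, x in job 2
jobs = greedy_jobs([{"x"}, {"y"}, {"z"}, {"x", "y"}, {"x", "z"}])
```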

Page 47:

Outline

Semantic Web Technologies & Cloud Computing Frameworks
Goal & Motivation
Current Approaches
System Architecture & Storage Schema
SPARQL Query by MapReduce
Query Plan Generation
Experiment
Future Work

Page 48:

Experiment

Dataset and queries
Cluster description
Comparison with Jena In-Memory, Jena SDB and BigOWLIM frameworks
Experiments with the number of Reducers
Algorithm runtimes: Greedy vs. Exhaustive
Some query results

Page 49:

Dataset And Queries

LUBM
  Dataset generator
  14 benchmark queries
  Generates data about imaginary universities
  Used for query execution performance comparison by many researchers

Page 50:

Our Clusters

10-node cluster in the SAIAL lab
  4 GB main memory
  Intel Pentium IV 3.0 GHz processor
  640 GB hard drive
OpenCirrus HP Labs test bed

Page 51:

Comparison: LUBM Query 2

Page 52:

Comparison: LUBM Query 9

Page 53:

Comparison: LUBM Query 12

Page 54:

Experiment with Number of Reducers

Page 55:

Greedy vs. Exhaustive Plan Generation

Page 56:

Some Query Results

(Chart: query runtimes in seconds against dataset size in millions of triples.)

Page 57:

Outline

Semantic Web Technologies & Cloud Computing Frameworks
Goal & Motivation
Current Approaches
System Architecture & Storage Schema
SPARQL Query by MapReduce
Query Plan Generation
Experiment
Future Work

Page 58:

Future Work

Enable the plan generation algorithm to handle queries with complex structures
Ontology-driven file partitioning for faster query answering
Balanced partitioning for data sets with skewed distributions
Materialization with a limited number of jobs for inference
Experiments with non-homogeneous clusters

Page 59:

Publications

Mohammad Husain, Latifur Khan, Murat Kantarcioglu, Bhavani M. Thuraisingham: Data Intensive Query Processing for Large RDF Graphs Using Cloud Computing Tools, IEEE International Conference on Cloud Computing, 2010 (acceptance rate 20%)

Mohammad Husain, Pankil Doshi, Latifur Khan, Bhavani M. Thuraisingham: Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce, International Conference on Cloud Computing Technology and Science, Beijing, China, 2009

Mohammad Husain, Mohammad M. Masud, James McGlothlin, Latifur Khan, Bhavani Thuraisingham: Greedy Based Query Processing for Large RDF Graphs Using Cloud Computing, IEEE Transactions on Knowledge and Data Engineering Special Issue on Cloud Computing (submitted)

Mohammad Farhan Husain, Tahseen Al-Khateeb, Mohmmad Alam, Latifur Khan: Ontology based Policy Interoperability in Geo-Spatial Domain, CSI Journal (to appear)

Mohammad Farhan Husain, Mohmmad Alam, Tahseen Al-Khateeb, Latifur Khan: Ontology based policy interoperability in geo-spatial domain, ICDE Workshops 2008

Chuanjun Li, Latifur Khan, Bhavani M. Thuraisingham, M. Husain, Shaofei Chen, Fang Qiu: Geospatial Data Mining for National Security: Land Cover Classification and Semantic Grouping, Intelligence and Security Informatics, 2007

Page 60:

Questions/Discussion

Page 61:

A TOKEN-BASED ACCESS CONTROL SYSTEM FOR RDF DATA IN THE CLOUDS

Arindam Khaled, Mohammad Farhan Husain, Latifur Khan, Kevin Hamlen, Bhavani Thuraisingham
Department of Computer Science, University of Texas at Dallas

Research Funded by AFOSR

CloudCom 2010

Page 62:

Outline

Motivation and Background: Semantic Web, Security, Scalability
Access control
Proposed Architecture
Results

Page 63:

Motivation

The Semantic Web is gaining immense popularity
Resource Description Framework (RDF) is one of the ways to represent data in the Semantic Web
But most existing frameworks either lack scalability or don't incorporate security
Our framework provides both

Page 64:

Semantic Web

Originally proposed by Sir Tim Berners-Lee, who envisioned it as a machine-understandable web
Powerful, since it allows relationships between web resources
The Semantic Web and ontologies are used to represent knowledge
Resource Description Framework (RDF) is used for its expressive power, semantic interoperability, and reusability

Page 65:

Semantic Web Technologies

Data in machine-understandable format
Infer new knowledge
Standards:
  Data representation – RDF triples
    Example: Subject http://test.com/s1, Predicate foaf:name, Object "John Smith"
  Ontology – OWL, DAML
  Query language – SPARQL

Page 66:

Current Technologies

Joseki [15], Kowari [17], 3store [10], and Sesame [5] are a few RDF stores
  Security is not addressed by these
In Jena [14, 20], efforts have been made to incorporate security
  But Jena lacks scalability – queries over large data often become intractable [12, 13]

Page 67:

Cloud Computing Frameworks

Proprietary:
  Amazon S3
  Amazon EC2
  Force.com
Open source:
  Hadoop – Apache's open source implementation of Google's proprietary GFS file system
  MapReduce – functional programming paradigm using key-value pairs

Page 68:

Cloud as RDF Stores

Large RDF graphs can be efficiently stored and queried in the clouds [6, 12, 13, 18]
These stores lack access control
We address this problem by generating tokens for specified access levels
Agents are assigned these tokens based on their business requirements and restrictions
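A minimal sketch of the token idea: a token grants an agent access to some slice of the data, and queries see only the triples the agent's tokens cover. The token/access-level model below is an assumption for illustration, not the paper's exact design:

```python
# Illustrative only: here a token grants access to one predicate's triples.
TOKENS = {
    "t1": {"predicate": "foaf:name"},   # token t1 covers foaf:name triples
    "t2": {"predicate": "foaf:age"},    # token t2 covers foaf:age triples
}

def visible_triples(agent_tokens, triples):
    """Return only the triples the agent's tokens grant access to."""
    allowed = {TOKENS[t]["predicate"] for t in agent_tokens if t in TOKENS}
    return [t for t in triples if t[1] in allowed]

data = [("s1", "foaf:name", "John Smith"), ("s1", "foaf:age", "42")]
view = visible_triples(["t1"], data)    # agent holding only t1
```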

Page 69:

System Architecture (diagram)

LUBM Data Generator → RDF/XML → Preprocessor (N-Triples Converter, Predicate Based Splitter, Object Type Based Splitter) → Preprocessed Data → Hadoop Distributed File System / Hadoop Cluster
MapReduce Framework: Query Rewriter, Query Plan Generator, Plan Executor, with an Access Control module attached
Flow: 1. Query → 2. Jobs → 3. Answer

Page 70:

Storage Schema

Data in N-Triples
Using namespaces
  Example: http://utdallas.edu/res1 becomes utd:resource1
Predicate based Splits (PS): split data according to predicates
Predicate Object based Splits (POS): split further according to the rdf:type of objects

Page 71:

Example

Raw triples:
  D0U0:GraduateStudent20 rdf:type lehigh:GraduateStudent
  lehigh:University0 rdf:type lehigh:University
  D0U0:GraduateStudent20 lehigh:memberOf lehigh:University0

PS:
  File rdf_type:
    D0U0:GraduateStudent20 lehigh:GraduateStudent
    lehigh:University0 lehigh:University
  File lehigh_memberOf:
    D0U0:GraduateStudent20 lehigh:University0

POS:
  File rdf_type_GraduateStudent:
    D0U0:GraduateStudent20
  File rdf_type_University:
    lehigh:University0
  File lehigh_memberOf_University:
    D0U0:GraduateStudent20 lehigh:University0

Page 72:

Space Gain

Data size at various steps for LUBM1000:

Step                          Number of Files   Size (GB)   Space Gain
N-Triples                     20020             24          --
Predicate Split (PS)          17                7.1         70.42%
Predicate Object Split (POS)  41                6.6         72.5%

Page 73: DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department.

SPARQL Query

SPARQL – SPARQL Protocol And RDF Query Language

Example

SELECT ?x ?y WHERE {
  ?z foaf:name ?x .
  ?z foaf:age ?y
}

[Diagram: the query is matched against the data to produce the result]

Page 74

SPARQL Query by MapReduce

Example query:
SELECT ?p WHERE {
  ?x rdf:type lehigh:Department .
  ?p lehigh:worksFor ?x .
  ?x subOrganizationOf http://University0.edu
}

Rewritten query:
SELECT ?p WHERE {
  ?p lehigh:worksFor_Department ?x .
  ?x subOrganizationOf http://University0.edu
}

Page 75

Inside Hadoop MapReduce Job

INPUT (two split files):
File subOrganizationOf_University:
Department1 http://University0.edu
Department2 http://University1.edu
File worksFor_Department:
Professor1 Department1
Professor2 Department2

MAP: each mapper keys its output on the department and tags the value with its predicate (SO# or WF#); mappers reading subOrganizationOf_University also apply the filter Object == http://University0.edu.

SHUFFLE & SORT: values are grouped by department:
Department1: SO#http://University0.edu, WF#Professor1
Department2: WF#Professor2

REDUCE: a reducer emits WF# values only for groups that also contain an SO# value.

OUTPUT: WF#Professor1
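The join logic of this job can be sketched in plain Java, simulating the map-side tagging and filtering, the shuffle grouping, and the reduce-side join. This is a simulation with assumed names, not Hadoop API code:

```java
import java.util.*;

public class JoinDemo {
    // subOrg: department -> university; worksFor: professor -> department.
    // Returns professors working for a department of wantedUniversity.
    public static List<String> join(Map<String, String> subOrg,
            Map<String, String> worksFor, String wantedUniversity) {
        Map<String, List<String>> grouped = new TreeMap<>();
        // map phase over subOrganizationOf_University, with the Object filter
        for (Map.Entry<String, String> e : subOrg.entrySet())
            if (e.getValue().equals(wantedUniversity))
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                       .add("SO#" + e.getValue());
        // map phase over worksFor_Department: key on the department
        for (Map.Entry<String, String> e : worksFor.entrySet())
            grouped.computeIfAbsent(e.getValue(), k -> new ArrayList<>())
                   .add("WF#" + e.getKey());
        // reduce phase: emit WF# values only when an SO# value is present
        List<String> out = new ArrayList<>();
        for (List<String> values : grouped.values())
            if (values.stream().anyMatch(v -> v.startsWith("SO#")))
                for (String v : values)
                    if (v.startsWith("WF#")) out.add(v.substring(3));
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> so = Map.of("Department1", "http://University0.edu",
                                        "Department2", "http://University1.edu");
        Map<String, String> wf = Map.of("Professor1", "Department1",
                                        "Professor2", "Department2");
        System.out.println(join(so, wf, "http://University0.edu")); // [Professor1]
    }
}
```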

Page 76

Access Control in Our Architecture

[Diagram: MapReduce Framework with Query Rewriter, Query Plan Generator, and Plan Executor, all connected to the Access Control module]

The access control module is linked to all the components of the MapReduce Framework.

Page 77

Motivation

It is important to keep the data safe from unwanted access.

Encryption can be used, but it carries little or no semantic value.

By issuing and manipulating different levels of access control, an agent can access the data intended for it or make inferences.

Page 78

Access Control Terminology

Access Tokens (AT): denoted by integers; they allow agents to access security-relevant data.

Access Token Tuples (ATT): have the form <AccessToken, Element, ElementType, ElementName>, where Element can be Subject, Object, or Predicate, and ElementType can be URI, DataType, Literal, Model (Subject), or BlankNode.
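These definitions can be captured as a small data type. The class below is an illustrative sketch; the field names are assumptions, since the slides give only the tuple form:

```java
// Minimal sketch of an Access Token Tuple <AccessToken, Element, ElementType, ElementName>.
public class AccessTokenTuple {
    final int accessToken;     // AT: an integer token
    final String element;      // Subject, Object, or Predicate
    final String elementType;  // URI, DataType, Literal, Model, or BlankNode
    final String elementName;  // "_" when unspecified

    AccessTokenTuple(int accessToken, String element, String elementType, String elementName) {
        this.accessToken = accessToken;
        this.element = element;
        this.elementType = elementType;
        this.elementName = elementName;
    }

    @Override public String toString() {
        return "<" + accessToken + ", " + element + ", " + elementType + ", " + elementName + ">";
    }

    public static void main(String[] args) {
        // The slide's example tuple granting AT 1 access to subject MichaelScott
        System.out.println(new AccessTokenTuple(1, "Subject", "URI", "MichaelScott"));
    }
}
```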

Page 79

Six Access Control Levels

Predicate Data Access: defined for a particular predicate; an agent can access the predicate file. For example, an agent possessing the ATT <1, Predicate, isPaid, _> can access the entire predicate file isPaid.

Predicate and Subject Data Access: more restrictive than the previous level. Combining a Subject ATT with a Predicate data access ATT having the same AT grants the agent access to a specific subject of a specific predicate. For example, the ATTs <1, Predicate, isPaid, _> and <1, Subject, URI, MichaelScott> permit an agent with AT 1 to access the subject with URI MichaelScott of the predicate isPaid.

Page 80

Access Control Levels (Cont.)

Predicate and Object: this access level permits a principal to extract the names of subjects satisfying a particular predicate and object.

Subject Access: one of the less restrictive access control levels. The subject can be a URI, DataType, or BlankNode.

Object Access: the object can be a URI, DataType, Literal, or BlankNode.

Page 81

Access Control Levels (Cont.)

Subject Model Level Access: this permits an agent to read all necessary predicate files to obtain all objects of a given subject. URI objects obtained in one step are treated as subjects to extract their respective predicates and objects; this iterative process continues until all objects are blank nodes or literals. Agents may generate models on a given subject.

Page 82

Access Token Assignment

Each agent has an Access Token list (AT-list) containing zero or more ATs assigned to the agent, along with their issuing timestamps.

These timestamps are used to resolve conflicts (explained later).

The set of triples accessible by an agent is the union of the result sets of the ATs in the agent's AT-list.

Page 83

Conflict

A conflict arises when three conditions hold: an agent possesses two ATs, AT 1 and AT 2; the result set of AT 2 is a proper subset of that of AT 1; and the timestamp of AT 1 is earlier than the timestamp of AT 2.

The later, more specific AT supersedes the former, so AT 1 is discarded from the AT-list to resolve the conflict.
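The timestamp rule above can be sketched as follows; the `AT` class, the `resolve` helper, and the representation of result sets as string sets are assumptions for illustration, not the paper's algorithm listing:

```java
import java.util.*;

public class ConflictDemo {
    // An AT with its issuing timestamp and the set of triples it grants.
    static class AT {
        final int id; final long timestamp; final Set<String> resultSet;
        AT(int id, long timestamp, Set<String> resultSet) {
            this.id = id; this.timestamp = timestamp; this.resultSet = resultSet;
        }
    }

    // Discard any earlier AT whose result set strictly contains a later AT's
    // result set: the later, more specific token supersedes it.
    static List<AT> resolve(List<AT> atList) {
        List<AT> kept = new ArrayList<>();
        for (AT a : atList) {
            boolean superseded = false;
            for (AT b : atList)
                if (a.timestamp < b.timestamp
                        && a.resultSet.containsAll(b.resultSet)
                        && !a.resultSet.equals(b.resultSet))
                    superseded = true;
            if (!superseded) kept.add(a);
        }
        return kept;
    }

    public static void main(String[] args) {
        AT at1 = new AT(1, 100, new HashSet<>(Arrays.asList("t1", "t2", "t3")));
        AT at2 = new AT(2, 200, new HashSet<>(Arrays.asList("t1")));
        // AT 1 is discarded: AT 2 is later and strictly more specific
        System.out.println(resolve(Arrays.asList(at1, at2)).size()); // prints 1
    }
}
```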

Page 84

Conflict Type

Subset Conflict: occurs when AT 2 (issued later) is a conjunction of ATTs that refine AT 1. For example, AT 1 is defined by the ATT <1, Subject, URI, Sam>, and AT 2 by the ATTs <2, Subject, URI, Sam> and <2, Predicate, HasAccounts, _>. If AT 2 is issued to the possessor of AT 1 at a later time, a conflict occurs and AT 1 is discarded from the agent's AT-list.

Page 85

Conflict Type

Subtype Conflict: occurs when the ATTs in AT 2 involve data types that are subtypes of those in AT 1. The data types can be those of subjects, objects, or both.

Page 86

Conflict Resolution Algorithm

Page 87

Experiment

Dataset and queries
Cluster description
Comparison with Jena In-Memory, SDB, and BigOWLIM frameworks
Experiments with number of reducers
Algorithm runtimes: Greedy vs. Exhaustive
Some query results

Page 88

Dataset And Queries

LUBM: dataset generator and 14 benchmark queries
Generates data for imaginary universities
Used for query execution performance comparison by many researchers

Page 89

Our Clusters

10-node cluster in the SAIAL lab: 4 GB main memory, Intel Pentium IV 3.0 GHz processor, and 640 GB hard drive per node

OpenCirrus HP Labs test bed

Page 90

Results

Scenario 1, “takesCourse”: a list of sensitive courses cannot be viewed by a normal user for any student.

Page 91

Results

Scenario 2, “displayTeachers”: a normal user is allowed to view information about the lecturers only.

Page 92

Future Works

Build a generic system that incorporates tokens and resolves policy conflicts.

Implement Subject Model Level Access that recursively extracts objects of subjects and treats these objects as subjects as long as these objects are URIs. An agent with proper access level can construct a model on that subject.

Page 98

CLOUD TOOLS OVERVIEW

Page 99

HADOOP

Page 100

Outline

Hadoop basics
HDFS: goals, architecture, other functions
MapReduce: basics, Word Count example, handy tools, finding-shortest-path example
Related Apache sub-projects (Pig, HBase, Hive)

Page 101

Hadoop - Why?

Need to process huge datasets on large clusters of computers
Very expensive to build reliability into each application
Nodes fail every day: failure is expected, rather than exceptional
The number of nodes in a cluster is not constant
Need a common infrastructure: efficient, reliable, easy to use; open source (Apache License)

Page 102

Who uses Hadoop? Amazon/A9, Facebook, Google, New York Times, Veoh, Yahoo!, and many more

Page 103

Commodity Hardware

Typically a two-level architecture: nodes are commodity PCs, with 30-40 nodes per rack; uplink from each rack is 3-4 gigabit, while rack-internal bandwidth is 1 gigabit (rack switches connect to an aggregation switch).

Page 104

HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

Original Slides by

Dhruba Borthakur

Apache Hadoop Project Management Committee

Page 105

Goals of HDFS

Very large distributed file system: 10K nodes, 100 million files, 10 PB
Assumes commodity hardware: files are replicated to handle hardware failure; failures are detected and recovered from
Optimized for batch processing: data locations are exposed so that computations can move to where data resides; provides very high aggregate bandwidth

Page 106

Distributed File System

Single namespace for the entire cluster
Data coherency: write-once-read-many access model; clients can only append to existing files
Files are broken up into blocks: typically 64 MB block size, with each block replicated on multiple DataNodes
Intelligent client: the client can find the location of blocks and accesses data directly from the DataNode

Page 107

HDFS Architecture

Page 108

Functions of a NameNode

Manages the file system namespace: maps a file name to a set of blocks; maps a block to the DataNodes where it resides
Cluster configuration management
Replication engine for blocks

Page 109

NameNode Metadata

Metadata in memory: the entire metadata is kept in main memory; no demand paging of metadata
Types of metadata: list of files; list of blocks for each file; list of DataNodes for each block; file attributes, e.g. creation time and replication factor
A transaction log records file creations, file deletions, etc.

Page 110

DataNode

A block server: stores data in the local file system (e.g. ext3); stores metadata of a block (e.g. CRC); serves data and metadata to clients
Block report: periodically sends a report of all existing blocks to the NameNode
Facilitates pipelining of data: forwards data to other specified DataNodes

Page 111

Block Placement

Current strategy: one replica on the local node, the second replica on a remote rack, the third replica on the same remote rack, and additional replicas placed randomly
Clients read from the nearest replicas
Would like to make this policy pluggable
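The current strategy can be sketched with a made-up rack map; node and rack names below are illustrative, and HDFS's real replica chooser is considerably more involved:

```java
import java.util.*;

public class PlacementDemo {
    // Sketch of the default placement for 3 replicas: first on the writer's
    // node, second and third on two nodes of a single remote rack.
    static List<String> place(String localNode, Map<String, List<String>> racks) {
        String localRack = null;
        for (Map.Entry<String, List<String>> r : racks.entrySet())
            if (r.getValue().contains(localNode)) localRack = r.getKey();
        List<String> replicas = new ArrayList<>(List.of(localNode));
        for (Map.Entry<String, List<String>> r : racks.entrySet()) {
            if (r.getKey().equals(localRack)) continue;   // skip the local rack
            for (String n : r.getValue())
                if (replicas.size() < 3) replicas.add(n);
            break;  // replicas 2 and 3 come from one remote rack
        }
        return replicas;
    }

    public static void main(String[] args) {
        Map<String, List<String>> racks = new TreeMap<>();
        racks.put("rack1", List.of("n1", "n2"));
        racks.put("rack2", List.of("n3", "n4"));
        System.out.println(place("n1", racks)); // [n1, n3, n4]
    }
}
```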

Page 112

Heartbeats

DataNodes send a heartbeat to the NameNode once every 3 seconds
The NameNode uses heartbeats to detect DataNode failure

Page 113

Replication Engine

The NameNode detects DataNode failures: it chooses new DataNodes for new replicas, balances disk usage, and balances communication traffic to DataNodes

Page 114

Data Correctness

Checksums (CRC32) are used to validate data
File creation: the client computes a checksum per 512 bytes; the DataNode stores the checksum
File access: the client retrieves the data and checksum from the DataNode; if validation fails, the client tries other replicas

Page 115

NameNode Failure

A single point of failure
The transaction log is stored in multiple directories: a directory on the local file system and a directory on a remote file system (NFS/CIFS)
Need to develop a real HA solution

Page 116

Data Pipelining

The client retrieves a list of DataNodes on which to place replicas of a block and writes the block to the first DataNode
The first DataNode forwards the data to the next node in the pipeline
When all replicas are written, the client moves on to write the next block in the file

Page 117

Rebalancer

Goal: % disk full on DataNodes should be similar
Usually run when new DataNodes are added
The cluster stays online while the Rebalancer is active
The Rebalancer is throttled to avoid network congestion
Command line tool

Page 118

Secondary NameNode

Copies the FsImage and transaction log from the NameNode to a temporary directory
Merges the FsImage and transaction log into a new FsImage in the temporary directory
Uploads the new FsImage to the NameNode; the transaction log on the NameNode is then purged

Page 119

User Interface

Commands for the HDFS user:
hadoop dfs -mkdir /foodir
hadoop dfs -cat /foodir/myfile.txt
hadoop dfs -rm /foodir/myfile.txt

Commands for the HDFS administrator:
hadoop dfsadmin -report
hadoop dfsadmin -decommission datanodename

Web interface: http://host:port/dfshealth.jsp

Page 120

MAPREDUCE

Original Slides by

Owen O’Malley (Yahoo!)

& Christophe Bisciglia, Aaron Kimball & Sierra Michells-Slettvet

Page 121

MapReduce - What?

MapReduce is a programming model for efficient distributed computing
It works like a Unix pipeline:
cat input | grep | sort | uniq -c | cat > output
Input | Map | Shuffle & Sort | Reduce | Output
Efficiency comes from streaming through data (reducing seeks) and pipelining
A good fit for a lot of applications: log processing, web index building

Page 122

MapReduce - Dataflow

Page 123

MapReduce - Features

Fine-grained Map and Reduce tasks: improved load balancing; faster recovery from failed tasks
Automatic re-execution on failure: in a large cluster, some nodes are always slow or flaky; the framework re-executes failed tasks
Locality optimizations: with large data, bandwidth to data is a problem; MapReduce + HDFS is a very effective solution; MapReduce queries HDFS for the locations of input data, and map tasks are scheduled close to the inputs when possible

Page 124

Word Count Example

Mapper - input: lines of text; output: key: word, value: 1
Reducer - input: key: word, value: set of counts; output: key: word, value: sum
Launching program - defines this job and submits it to the cluster

Page 125

Word Count Dataflow

Page 126

Word Count Mapper

public static class Map extends MapReduceBase implements
    Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}

Page 127

Word Count Reducer

public static class Reduce extends MapReduceBase implements
    Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
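The Mapper/Reducer pair can be simulated with plain collections to see the data flow end to end. This is a self-contained simulation of the word-count logic, not Hadoop API code:

```java
import java.util.*;

public class WordCountDemo {
    // Simulate the map phase (emit (word, 1) per token), the shuffle
    // (grouping by key), and the reduce phase (summing the counts).
    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {                       // map phase
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens())
                mapped.add(Map.entry(tokenizer.nextToken(), 1));
        }
        Map<String, Integer> counts = new TreeMap<>();    // shuffle + reduce
        for (Map.Entry<String, Integer> kv : mapped)
            counts.merge(kv.getKey(), kv.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("a b a", "b a"))); // {a=3, b=2}
    }
}
```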

Page 128

Word Count Example

Jobs are controlled by configuring JobConfs
JobConfs are maps from attribute names to string values
The framework defines attributes to control how the job is executed:
conf.set("mapred.job.name", "MyApp");
Applications can add arbitrary values to the JobConf:
conf.set("my.string", "foo");
conf.set("my.integer", 12);
The JobConf is available to all tasks

Page 129

Putting it all together

Create a launching program for your application
The launching program configures: the Mapper and Reducer to use; the output key and value types (input types are inferred from the InputFormat); the locations for your input and output
The launching program then submits the job and typically waits for it to complete

Page 130

Putting it all together

JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));

JobClient.runJob(conf);

Page 131

Input and Output Formats

A Map/Reduce job may specify how its input is to be read by specifying an InputFormat
A Map/Reduce job may specify how its output is to be written by specifying an OutputFormat
These default to TextInputFormat and TextOutputFormat, which process line-based text data
Another common choice is SequenceFileInputFormat and SequenceFileOutputFormat for binary data
These are file-based, but they are not required to be

Page 132

How many Maps and Reduces

Maps: usually as many as the number of HDFS blocks being processed (the default); else the number of maps can be specified as a hint; the number of maps can also be controlled by specifying the minimum split size
The actual size of each map input is computed by:
max(min(block_size, data/#maps), min_split_size)

Reduces: unless the amount of data being processed is small,
0.95 * num_nodes * mapred.tasktracker.tasks.maximum
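Both formulas can be checked with plain arithmetic; the numbers below (64 MB blocks, 1 GB of data, 10 requested maps, a 10-node cluster) are made-up examples:

```java
public class TaskCountDemo {
    // The split-size formula from the slide:
    // max(min(block_size, data/#maps), min_split_size)
    static long splitSize(long blockSize, long dataSize, long requestedMaps, long minSplitSize) {
        return Math.max(Math.min(blockSize, dataSize / requestedMaps), minSplitSize);
    }

    // Suggested reducer count: 0.95 * nodes * reduce slots per TaskTracker
    static int reducers(int nodes, int slotsPerTracker) {
        return (int) (0.95 * nodes * slotsPerTracker);
    }

    public static void main(String[] args) {
        // 1 GB / 10 maps > 64 MB, so the block size wins
        System.out.println(splitSize(64L << 20, 1L << 30, 10, 1L << 20)); // 67108864
        System.out.println(reducers(10, 2)); // 19
    }
}
```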

Page 133

Some handy tools: Partitioners, Combiners, Compression, Counters, Speculation, Zero Reduces, Distributed File Cache, Tool

Page 134

Partitioners

Partitioners are application code that define how keys are assigned to reduces
Default partitioning spreads keys evenly, but randomly: it uses key.hashCode() % num_reduces
Custom partitioning is often required, for example, to produce a total order in the output: implement the Partitioner interface and set it by calling conf.setPartitionerClass(MyPart.class)
To get a total order, sample the map output keys and pick values that divide the keys into roughly equal buckets, then use those in your partitioner
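The default partition function described above amounts to a hash-mod; the sign-bit mask below mirrors what Hadoop's HashPartitioner does to keep the result non-negative:

```java
public class PartitionDemo {
    // Default partitioning: hash the key, mask the sign bit, mod the
    // number of reduces, so every occurrence of a key lands in the
    // same reduce.
    static int partition(String key, int numReduces) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduces;
    }

    public static void main(String[] args) {
        System.out.println(partition("hadoop", 4) == partition("hadoop", 4)); // true
    }
}
```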

Page 135

Combiners

When maps produce many repeated keys, it is often useful to do a local aggregation following the map
Done by specifying a Combiner; the goal is to decrease the size of the transient data
Combiners have the same interface as reducers, and often are the same class
Combiners must not have side effects, because they run an intermediate number of times
In WordCount: conf.setCombinerClass(Reduce.class);

Page 136

Compression

Compressing the outputs and intermediate data will often yield huge performance gains
Can be specified via a configuration file or set programmatically: set mapred.output.compress to true to compress job output; set mapred.compress.map.output to true to compress map outputs
Compression types (mapred(.map)?.output.compression.type): "block" - groups of keys and values are compressed together; "record" - each value is compressed individually; block compression is almost always best
Compression codecs (mapred(.map)?.output.compression.codec): default (zlib) - slower, but more compression; LZO - faster, but less compression

Page 137

Counters

Map/Reduce applications often have countable events; for example, the framework counts records into and out of the Mapper and Reducer
To define user counters:
static enum Counter { EVENT1, EVENT2 };
reporter.incrCounter(Counter.EVENT1, 1);
Define nice names in a MyClass_Counter.properties file:
CounterGroupName=MyCounters
EVENT1.name=Event 1
EVENT2.name=Event 2

Page 138

Speculative execution

The framework can run multiple instances of slow tasks; output from the instance that finishes first is used
Controlled by the configuration variable mapred.speculative.execution
Can dramatically shorten the long tails on jobs

Page 139

Zero Reduces

Frequently, we only need to run a filter on the input data: no sorting or shuffling is required by the job
Set the number of reduces to 0; output from the maps will go directly to the OutputFormat and disk

Page 140

Distributed File Cache

Sometimes tasks need read-only copies of data on the local computer, and downloading 1 GB of data for each Mapper is expensive. Instead, define the list of files you need in the JobConf; files are downloaded once per computer.

Add to the launching program:

DistributedCache.addCacheFile(new URI("hdfs://nn:8020/foo"), conf);

Add to the task:

Path[] files = DistributedCache.getLocalCacheFiles(conf);

Page 141

Tool

Handles "standard" Hadoop command line options:
- -conf file: load a configuration file named file
- -D prop=value: define a single configuration property prop

A class looks like:

public class MyApp extends Configured implements Tool {
  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyApp(), args));
  }
  public int run(String[] args) throws Exception {
    .... getConf() ....
  }
}

Page 142

Finding the Shortest Path

A common graph search application is finding the shortest path from a start node to one or more target nodes. This is commonly done on a single machine with Dijkstra's algorithm. Can we use BFS to find the shortest path via MapReduce?

Page 143

Finding the Shortest Path: Intuition

We can define the solution to this problem inductively:
- DistanceTo(startNode) = 0
- For all nodes n directly reachable from startNode, DistanceTo(n) = 1
- For all nodes n reachable from some other set of nodes S, DistanceTo(n) = 1 + min(DistanceTo(m), m ∈ S)

Page 144

From Intuition to Algorithm

A map task receives a node n as its key and (D, points-to) as its value:
- D is the distance to the node from the start
- points-to is a list of nodes reachable from n

For each p ∈ points-to, emit (p, D+1). The reduce task gathers the possible distances to a given p and selects the minimum one.

Page 145

What This Gives Us

This MapReduce task can advance the known frontier by one hop. To perform the whole BFS, a non-MapReduce driver feeds the output of this step back into the MapReduce task for another iteration.

Problem: where did the points-to list go?
Solution: the mapper emits (n, points-to) as well.
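The map and reduce steps above, together with the driver loop, can be sketched as a small single-process simulation. This is a minimal illustration of the algorithm, not Hadoop API code; the graph, function names, and the termination check are assumptions made for the example:

```python
# Single-process sketch of BFS-by-MapReduce: each round advances the
# known frontier by one hop; we stop when no distance improves.
from collections import defaultdict

INF = float("inf")

def bfs_map(node, value):
    dist, points_to = value
    yield node, ("graph", points_to)   # re-emit adjacency so it survives the round
    yield node, ("dist", dist)         # carry the current distance to the reducer
    if dist < INF:
        for p in points_to:
            yield p, ("dist", dist + 1)

def bfs_reduce(node, values):
    points_to, best = [], INF
    for tag, v in values:
        if tag == "graph":
            points_to = v
        else:
            best = min(best, v)        # select the minimum distance seen
    return node, (best, points_to)

def shortest_distances(graph, start):
    # state: node -> (distance from start, adjacency list)
    state = {n: (0 if n == start else INF, adj) for n, adj in graph.items()}
    while True:
        grouped = defaultdict(list)
        for node, value in state.items():
            for key, val in bfs_map(node, value):
                grouped[key].append(val)
        new_state = dict(bfs_reduce(n, vs) for n, vs in grouped.items())
        if new_state == state:         # no better distances found: terminate
            return {n: d for n, (d, _) in state.items()}
        state = new_state
```

For example, `shortest_distances({"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}, "A")` returns `{"A": 0, "B": 1, "C": 1, "D": 2}`; unreachable nodes keep an infinite distance.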

Page 146

Blow-up and Termination

This algorithm starts from one node; subsequent iterations include many more nodes of the graph as the frontier advances. Does this ever terminate? Yes: eventually no new routes between nodes are discovered and no better distances are found. When the distances stop changing, we stop. The mapper should also emit (n, D) to ensure that the "current distance" is carried into the reducer.

Page 147

HADOOP SUBPROJECTS

Page 148

Hadoop Related Subprojects

- Pig: high-level language for data analysis
- HBase: table storage for semi-structured data
- Zookeeper: coordinating distributed applications
- Hive: SQL-like query language and metastore
- Mahout: machine learning

Page 149

PIG

Original Slides by

Matei Zaharia

UC Berkeley RAD Lab

Page 150

Pig

Started at Yahoo! Research; now runs about 30% of Yahoo!'s jobs. Features:
- Expresses sequences of MapReduce jobs
- Data model: nested "bags" of items
- Provides relational (SQL-style) operators (JOIN, GROUP BY, etc.)
- Easy to plug in Java functions

Page 151

An Example Problem

Suppose you have user data in one file and website data in another, and you need to find the top 5 most visited pages by users aged 18-25. The dataflow:

Load Users / Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5

Page 152

In MapReduce

Page 153

In Pig Latin

Users    = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages    = load 'pages' as (user, url);
Joined   = join Filtered by name, Pages by user;
Grouped  = group Joined by url;
Summed   = foreach Grouped generate group, count(Joined) as clicks;
Sorted   = order Summed by clicks desc;
Top5     = limit Sorted 5;
store Top5 into 'top5sites';

Page 154

Ease of Translation

Each step of the dataflow maps to one Pig Latin statement:

Load Users      -> Users = load ...
Filter by age   -> Fltrd = filter ...
Load Pages      -> Pages = load ...
Join on name    -> Joined = join ...
Group on url    -> Grouped = group ...
Count clicks    -> Summed = ... count() ...
Order by clicks -> Sorted = order ...
Take top 5      -> Top5 = limit ...

Page 155

Ease of Translation

The same dataflow, with the statements grouped into the MapReduce jobs Pig generates:

Job 1: Load Users, Load Pages, Filter by age, Join on name
       (Users = load ..., Fltrd = filter ..., Pages = load ..., Joined = join ...)
Job 2: Group on url, Count clicks
       (Grouped = group ..., Summed = ... count() ...)
Job 3: Order by clicks, Take top 5
       (Sorted = order ..., Top5 = limit ...)

Page 156

HBASE

Original Slides by

Tom White

Lexeme Ltd.

Page 157

HBase - What?

Modeled on Google's Bigtable. A row/column store: billions of rows, millions of columns. Column-oriented, so nulls are free. Untyped: it stores byte[].

Page 158: DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department.

HBase - Data Model

Row        | Timestamp | animal:type | animal:size | repairs:cost
enclosure1 | t2        | zebra       |             | 1000 EUR
           | t1        | lion        | big         |
enclosure2 | ...       | ...         | ...         | ...

(animal: and repairs: are column families; animal:type, animal:size, and repairs:cost are columns within them.)

Page 159: DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department.

HBase - Data Storage

Column family animal:
(enclosure1, t2, animal:type) -> zebra
(enclosure1, t1, animal:size) -> big
(enclosure1, t1, animal:type) -> lion

Column family repairs:
(enclosure1, t1, repairs:cost) -> 1000 EUR
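The cell layout above can be modeled as a map keyed by (row, column, timestamp), where a read returns the newest version of a cell. A toy sketch of this storage model, using the zoo data from the slide (this is illustrative only, not the HBase client API):

```python
# Toy model of HBase-style cell storage: versioned cells keyed by
# (row, column), with reads returning the newest timestamp.
class ToyTable:
    def __init__(self):
        self.cells = {}                      # (row, col) -> {timestamp: value}

    def put(self, row, col, ts, value):
        self.cells.setdefault((row, col), {})[ts] = value

    def get(self, row, col):
        versions = self.cells.get((row, col), {})
        if not versions:
            return None                      # "nulls are free": nothing is stored
        return versions[max(versions)]       # newest timestamp wins

table = ToyTable()
table.put("enclosure1", "animal:type", 1, "lion")
table.put("enclosure1", "animal:size", 1, "big")
table.put("enclosure1", "repairs:cost", 1, "1000 EUR")
table.put("enclosure1", "animal:type", 2, "zebra")   # newer version shadows "lion"
```

Here `table.get("enclosure1", "animal:type")` returns "zebra", because the t2 version is newer than the t1 "lion".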

Page 160: DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department.

HBase - Code

HTable table = ...
Text row = new Text("enclosure1");
Text col1 = new Text("animal:type");
Text col2 = new Text("animal:size");
BatchUpdate update = new BatchUpdate(row);
update.put(col1, "lion".getBytes("UTF-8"));
update.put(col2, "big".getBytes("UTF-8"));
table.commit(update);

update = new BatchUpdate(row);
update.put(col1, "zebra".getBytes("UTF-8"));
table.commit(update);

Page 161: DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department.

HBase - Querying

Retrieve a cell:
Cell = table.getRow("enclosure1").getColumn("animal:type").getValue();

Retrieve a row:
RowResult = table.getRow("enclosure1");

Scan through a range of rows:
Scanner s = table.getScanner(new String[] { "animal:type" });

Page 162: DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department.

HIVE

Original Slides by

Matei Zaharia

UC Berkeley RAD Lab

Page 163: DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department.

Hive

Developed at Facebook; used for the majority of Facebook jobs. A "relational database" built on Hadoop:
- Maintains a list of table schemas
- SQL-like query language (HiveQL)
- Can call Hadoop Streaming scripts from HiveQL
- Supports table partitioning, clustering, complex data types, and some optimizations

Page 164: DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department.

Creating a Hive Table

CREATE TABLE page_views(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

Partitioning breaks the table into separate files for each (dt, country) pair, e.g.:
/hive/page_view/dt=2008-06-08,country=USA
/hive/page_view/dt=2008-06-08,country=CA

Page 165: DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department.

A Simple Query

Find all page views coming from xyz.com during March 2008:

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

Hive reads only the matching date partitions instead of scanning the entire table.
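The partition-pruning idea is simple to sketch: because the table is laid out as one directory per (dt, country) pair, only directories whose dt falls in the queried range need to be read. A toy illustration (the paths and helper are assumptions for the example, not Hive internals):

```python
# Sketch of partition pruning: select only the partition directories
# whose dt value falls inside the queried date range.
partitions = [
    "/hive/page_view/dt=2008-02-28,country=USA",
    "/hive/page_view/dt=2008-03-01,country=USA",
    "/hive/page_view/dt=2008-03-01,country=CA",
    "/hive/page_view/dt=2008-04-02,country=USA",
]

def prune(paths, lo, hi):
    keep = []
    for p in paths:
        dt = p.split("dt=")[1].split(",")[0]   # extract the dt partition value
        if lo <= dt <= hi:                      # ISO dates compare as strings
            keep.append(p)
    return keep
```

`prune(partitions, "2008-03-01", "2008-03-31")` keeps only the two 2008-03-01 directories, so the February and April files are never scanned.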

Page 166: DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department.

Aggregation and Joins

Count the users who visited each page, by gender:

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

Page 167: DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department.

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_views.userid, page_views.date)
USING 'map_script.py'
AS dt, uid
CLUSTER BY dt
FROM page_views;

Page 168: DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department.

STORM

Original Slides by

Nathan Marz

Twitter

Page 169: DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department.

Storm

Developed by BackType, which was acquired by Twitter. There are lots of tools for batch data processing (Hadoop, Pig, HBase, Hive, ...), but none of them are realtime systems, and realtime is becoming a real requirement for businesses. Storm provides realtime computation that is:
- Scalable
- Guarantees no data loss
- Extremely robust and fault-tolerant
- Programming language agnostic

Page 170: DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department.

Before Storm

Page 171: DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department.

Before Storm – Adding a worker

Deploy

Reconfigure/Redeploy

Page 172: DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department.

Problems Scaling is painful Poor fault-tolerance Coding is tedious

Page 173: DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department.

What we want

- Guaranteed data processing
- Horizontal scalability
- Fault-tolerance
- No intermediate message brokers!
- Higher level abstraction than message passing
- "Just works"!!

Page 174: DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department.

Storm Cluster

- A master node (similar to the Hadoop JobTracker)
- A Zookeeper cluster, used for cluster coordination
- Nodes that run the worker processes

Page 175: DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department.

Concepts: Streams, Spouts, Bolts, Topologies

Page 176: DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department.

Streams

An unbounded sequence of tuples:

Tuple, Tuple, Tuple, Tuple, ...

Page 177: DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department.

Spouts

Source of streams

Page 178: DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department.

Bolts

Process input streams and produce new streams; they can implement functions such as filters, aggregations, joins, etc.

Page 179: DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department.

Topology

Network of spouts and bolts
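A topology wired from these pieces can be sketched as plain Python generators: a spout that yields tuples, a bolt that transforms its input stream, and a bolt that aggregates. The names and data are assumptions for illustration, not the Storm API:

```python
# A tiny word-count "topology": spout -> split bolt -> count bolt.
from collections import Counter

def sentence_spout():
    # Source of the stream: emits sentence tuples.
    yield from ["the cow", "the dog"]

def split_bolt(stream):
    # Bolt: consumes sentences, produces a stream of word tuples.
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    # Bolt: consumes words, keeps running totals.
    counts = Counter()
    for word in stream:
        counts[word] += 1
    return counts

counts = count_bolt(split_bolt(sentence_spout()))
```

In Storm each of these stages would run as many parallel tasks across the cluster, with the stream grouping (next slides) deciding which task each tuple goes to.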

Page 180: DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department.

Topology

Spouts and bolts execute as many tasks across the cluster

Page 181: DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department.

Stream Grouping

When a tuple is emitted which task does it go to?

Page 182: DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department.

Stream Grouping

- Shuffle grouping: pick a random task
- Fields grouping: consistent hashing on a subset of tuple fields
- All grouping: send to all tasks
- Global grouping: pick the task with the lowest id
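The four groupings can be sketched as routing functions that map a tuple to one or more of N downstream tasks. This is a simplified illustration of the routing rules, not the Storm API; the task count and function names are assumptions:

```python
# Sketch of Storm's stream groupings, assuming 4 downstream tasks.
import random

N_TASKS = 4

def shuffle_grouping(tup):
    return random.randrange(N_TASKS)         # pick a random task

def fields_grouping(tup, field):
    # Hash a chosen field, so equal field values always reach the same task.
    return hash(tup[field]) % N_TASKS

def all_grouping(tup):
    return list(range(N_TASKS))              # send to every task

def global_grouping(tup):
    return 0                                 # the task with the lowest id
```

Fields grouping is what makes per-key state (such as a word count per word) possible: every tuple carrying the same field value lands on the same task.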