DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS
Mohammad Farhan Husain
Dr. Latifur Khan
Dr. Bhavani Thuraisingham
Department of Computer Science, University of Texas at Dallas
Outline
Semantic Web Technologies & Cloud Computing Frameworks
Goal & Motivation
Current Approaches
System Architecture & Storage Schema
SPARQL Query by MapReduce
Query Plan Generation
Experiment
Future Works
Semantic Web Technologies
Data in machine-understandable format
Infer new knowledge
Standards:
Data representation: RDF (Triples)
Ontology: OWL, DAML
Query language: SPARQL
Example triple (Subject Predicate Object):
http://test.com/s1 foaf:name “John Smith”
Cloud Computing Frameworks
Proprietary: Amazon S3, Amazon EC2, Force.com
Open source tools:
Hadoop: Apache’s open source implementation of Google’s proprietary GFS file system
MapReduce: functional programming paradigm using key-value pairs
Outline
Semantic Web Technologies & Cloud Computing Frameworks
Goal & Motivation
Current Approaches
System Architecture & Storage Schema
SPARQL Query by MapReduce
Query Plan Generation
Experiment
Future Works
Goal
To build efficient storage using Hadoop for large amounts of data (e.g., billions of triples)
To build an efficient query mechanism
Publish as an open source project: http://code.google.com/p/hadooprdf/
Integrate with Jena as a Jena Model
Motivation
Current Semantic Web frameworks do not scale to large numbers of triples, e.g. Jena In-Memory, Jena RDB, Jena SDB, AllegroGraph, Virtuoso Universal Server, BigOWLIM
There is a lack of distributed frameworks with persistent storage
Hadoop uses low-end hardware, providing a distributed framework with high fault tolerance and reliability
Outline
Semantic Web Technologies & Cloud Computing Frameworks
Goal & Motivation
Current Approaches
System Architecture & Storage Schema
SPARQL Query by MapReduce
Query Plan Generation
Experiment
Future Works
Current Approaches
State-of-the-art approach: store RDF data in HDFS and query through MapReduce programming (our approach)
Traditional approach: store data in HDFS and process queries outside of Hadoop; done in the BIOMANTA1 project (details of querying could not be found)
1. http://biomanta.org/
Outline
Semantic Web Technologies & Cloud Computing Frameworks
Goal & Motivation
Current Approaches
System Architecture & Storage Schema
SPARQL Query by MapReduce
Query Plan Generation
Experiment
Future Works
System Architecture
LUBM Data Generator
Preprocessor: N-Triples Converter, Predicate Based Splitter, Object Type Based Splitter
Hadoop Distributed File System / Hadoop Cluster
MapReduce Framework: Query Rewriter, Query Plan Generator, Plan Executor
Data flow: RDF/XML → Preprocessed Data; 1. Query → 2. Jobs → 3. Answer
Storage Schema
Data in N-Triples, using namespaces. Example: http://utdallas.edu/res1 becomes utd:resource1
Predicate based Splits (PS): split data according to Predicates
Predicate Object based Splits (POS): split further according to rdf:type of Objects
Example
Raw N-Triples:
D0U0:GraduateStudent20 rdf:type lehigh:GraduateStudent
lehigh:University0 rdf:type lehigh:University
D0U0:GraduateStudent20 lehigh:memberOf lehigh:University0

PS (Predicate Split):
File: rdf_type
D0U0:GraduateStudent20 lehigh:GraduateStudent
lehigh:University0 lehigh:University
File: lehigh_memberOf
D0U0:GraduateStudent20 lehigh:University0

POS (Predicate Object Split):
File: rdf_type_GraduateStudent
D0U0:GraduateStudent20
File: rdf_type_University
lehigh:University0
File: lehigh_memberOf_University
D0U0:GraduateStudent20 lehigh:University0
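The PS and POS steps above can be sketched in a few lines of Python. This is only an illustration of the splitting idea, not the project's actual code; the file-naming scheme and namespace handling here are simplified assumptions.

```python
# Sketch of the PS and POS splits; triples arrive as (subject, predicate, object).
from collections import defaultdict

def predicate_split(triples):
    """PS: group triples by predicate; the predicate is dropped from each line."""
    files = defaultdict(list)
    for s, p, o in triples:
        files[p.replace(":", "_")].append((s, o))
    return dict(files)

def predicate_object_split(ps_files, rdf_type_file="rdf_type"):
    """POS: split the rdf:type file by object, and split the remaining
    predicate files by the rdf:type of their objects."""
    # type of each resource, taken from the rdf:type split
    type_of = {s: o.split(":")[-1] for s, o in ps_files.get(rdf_type_file, [])}
    pos = defaultdict(list)
    for fname, pairs in ps_files.items():
        for s, o in pairs:
            if fname == rdf_type_file:
                pos[f"{fname}_{o.split(':')[-1]}"].append((s,))
            elif o in type_of:
                pos[f"{fname}_{type_of[o]}"].append((s, o))
            else:
                pos[fname].append((s, o))
    return dict(pos)

triples = [
    ("D0U0:GraduateStudent20", "rdf:type", "lehigh:GraduateStudent"),
    ("lehigh:University0", "rdf:type", "lehigh:University"),
    ("D0U0:GraduateStudent20", "lehigh:memberOf", "lehigh:University0"),
]
ps = predicate_split(triples)
pos = predicate_object_split(ps)
```

Running this on the slide's three triples yields files named `rdf_type_GraduateStudent`, `rdf_type_University`, and `lehigh_memberOf_University`, mirroring the POS stage above.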
Space Gain
Example
Data size at various steps for LUBM1000:

Step                         | Number of Files | Size (GB) | Space Gain
N-Triples                    | 20,020          | 24        | --
Predicate Split (PS)         | 17              | 7.1       | 70.42%
Predicate Object Split (POS) | 41              | 6.6       | 72.5%
Outline
Semantic Web Technologies & Cloud Computing Frameworks
Goal & Motivation
Current Approaches
System Architecture & Storage Schema
SPARQL Query by MapReduce
Query Plan Generation
Experiment
Future Works
SPARQL Query
SPARQL – SPARQL Protocol And RDF Query Language
Example
SELECT ?x ?y WHERE {
  ?z foaf:name ?x .
  ?z foaf:age ?y
}
SPARQL Query by MapReduce
Example query:
SELECT ?p WHERE {
  ?x rdf:type lehigh:Department .
  ?p lehigh:worksFor ?x .
  ?x subOrganizationOf http://University0.edu
}
Rewritten query:
SELECT ?p WHERE {
  ?p lehigh:worksFor_Department ?x .
  ?x subOrganizationOf http://University0.edu
}
Inside Hadoop MapReduce Job
INPUT
File subOrganizationOf_University:
Department1 http://University0.edu
Department2 http://University1.edu
File worksFor_Department:
Professor1 Department1
Professor2 Department2

MAP: filter subOrganizationOf_University tuples (Object == http://University0.edu); emit tagged key-value pairs, e.g. (Department1, SO#http://University0.edu), (Department1, WF#Professor1), (Department2, WF#Professor2)

SHUFFLE & SORT: group values by the join key (the department)

REDUCE: join the SO# and WF# values sharing a key

OUTPUT
Department1 SO#http://University0.edu WF#Professor1
Department2 WF#Professor2
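The join above can be simulated in plain Python. The SO#/WF# tags and the filtering step mirror the slide; the code itself is only a sketch of how the reduce-side join works, not the system's implementation.

```python
# Simulated reduce-side join for the rewritten SPARQL query.
from collections import defaultdict

sub_org = [("Department1", "http://University0.edu"),
           ("Department2", "http://University1.edu")]
works_for = [("Professor1", "Department1"), ("Professor2", "Department2")]

# MAP: filter subOrganizationOf on the bound object, tag each value
mapped = []
for dept, univ in sub_org:
    if univ == "http://University0.edu":   # filtering: Object == University0
        mapped.append((dept, "SO#" + univ))
for prof, dept in works_for:
    mapped.append((dept, "WF#" + prof))

# SHUFFLE & SORT: group values by the join key (the department)
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# REDUCE: a department qualifies only if an SO# value arrived alongside WF# values
result = {}
for dept, values in sorted(groups.items()):
    if any(v.startswith("SO#") for v in values):
        result[dept] = [v[3:] for v in values if v.startswith("WF#")]

print(result)  # {'Department1': ['Professor1']}
```

Department2 drops out in the reduce phase because its subOrganizationOf tuple was filtered away in the map phase, which is exactly the behavior the final SELECT ?p answer needs.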
Outline
Semantic Web Technologies & Cloud Computing Frameworks
Goal & Motivation
Current Approaches
System Architecture & Storage Schema
SPARQL Query by MapReduce
Query Plan Generation
Experiment
Future Works
Query Plan Generation
Challenge: one Hadoop job may not be sufficient to answer a query
In a single Hadoop job, a single triple pattern cannot take part in joins on more than one variable simultaneously
Solution: an algorithm for query plan generation
A query plan is a sequence of Hadoop jobs which answers the query
Exploit the fact that in a single Hadoop job, a single triple pattern can take part in more than one join on a single variable simultaneously
Example
Example query:
SELECT ?X, ?Y, ?Z WHERE {
  ?X pred1 obj1 .
  subj2 ?Z obj2 .
  subj3 ?X ?Z .
  ?Y pred4 obj4 .
  ?Y pred5 ?X
}
Simplified view: 1. X, 2. Z, 3. XZ, 4. Y, 5. XY
Join Graph & Hadoop Jobs
[Figure: join graph over triple patterns 1-5 with edges labeled by the joining variables X, Y, Z; panels show the join graph, two valid jobs, and an invalid job]
Possible Query Plans
A. job1: (x, xz, xy)=yz, job2: (yz, y) = z, job3: (z, z) = done
[Figure: the join graph evolving through Job 1, Job 2, and Job 3 to the final result]
Possible Query Plans
B. job1: (y, xy)=x; (z,xz)=x, job2: (x, x, x) = done
[Figure: the join graph evolving through Job 1 and Job 2 to the final result]
Query Plan Generation
Goal: generate a minimum-cost job plan
Backtracking approach:
Exhaustively generates all possible plans
Uses a two-coloring scheme on a graph to find jobs, with colors WHITE and BLACK; two WHITE nodes cannot be adjacent
User-defined cost model; chooses the best plan according to the cost model
Some Definitions
Triple Pattern, TP: A triple pattern is an ordered collection of subject, predicate and object which appears in a SPARQL query WHERE clause. The subject, predicate and object can each be either a variable (unbounded) or a concrete value (bounded).
Triple Pattern Join, TPJ: A triple pattern join is a join between two TPs on a variable.
MapReduceJoin, MRJ: A MapReduceJoin is a join between two or more triple patterns on a variable.
Some Definitions
Job, JB: A job JB is a Hadoop job where one or more MRJs are done. JB has a set of input files and a set of output files.
Conflicting MapReduceJoins, CMRJ: A conflicting MapReduceJoins is a pair of MRJs sharing a triple pattern, where the joins are on different variables.
NON-Conflicting MapReduceJoins, NCMRJ: A non-conflicting MapReduceJoins is a pair of MRJs either not sharing any triple pattern, or sharing a triple pattern with the MRJs on the same variable.
Example
LUBM Query:
SELECT ?X WHERE {
  1. ?X rdf:type ub:Chair .
  2. ?Y rdf:type ub:Department .
  3. ?X ub:worksFor ?Y .
  4. ?Y ub:subOrganizationOf <http://www.U0.edu>
}
Example (contd.)
Triple Pattern Graph and Join Graph for the LUBM Query
Triple Pattern Graph (TPG)#1
Join Graph (JG)#1
Join Graph (JG)#2
Triple Pattern Graph (TPG)#2
Example(contd.)
The figure shows the TPG and JG for the query.
On the left, we have the TPG, where each node represents a triple pattern in the query; nodes are named in the order they appear.
In the middle, we have the JG. Each node in the JG represents an edge in the TPG.
For the query, an FQP can have two jobs: the first dealing with an NCMRJ between triple patterns 2, 3 and 4, and the second with an NCMRJ between triple pattern 1 and the output of the first join.
An IQP would be a first job having CMRJs between 1, 3 and 4, and a second having an MRJ between triple pattern 2 and the output of the first join.
Query Plan Generation: Backtracking
Drawbacks of the backtracking approach:
Computationally intractable: the search space is exponential in size
Steps a Hadoop Job Goes Through
The executable file (containing MapReduce code) is transferred from the client machine to the JobTracker
The JobTracker decides which TaskTrackers will execute the job
The executable file is distributed to the TaskTrackers over the network
Map processes start by reading data from HDFS
Map outputs are written to disk
Map outputs are read from disk, shuffled (transferred over the network to the TaskTrackers which will run the Reduce processes), sorted, and written to disk
Reduce processes start by reading the input from disk
Reduce outputs are written to disk
MapReduce Data Flow
http://developer.yahoo.com/hadoop/tutorial/module4.html#dataflow
Observations & an Approximate Solution
Observations:
Fixed overheads of a Hadoop job
Multiple reads and writes to disk
Data transferred over the network multiple times
Even a “Hello World” MapReduce job takes a couple of seconds because of the fixed overheads
Approximate solution: minimize the number of jobs
This is a good approximation since the overhead of each job (e.g. jar file distribution, multiple disk reads and writes, multiple network data transfers) and of job switching is huge
Greedy Algorithm: Terms
Joining variable: a variable that is common to two or more triples. Ex: in x, y, xy, xz, za, the variables x, y, z are joining; a is not
Complete elimination: a join operation that eliminates a joining variable. y can be completely eliminated if we join (xy, y)
Partial elimination: a join that partially eliminates a joining variable. After complete elimination of y, x can be partially eliminated by joining (xz, x)
Greedy Algorithm: Terms
E-count: the number of joining variables in the resultant triple after a complete elimination
In the example x, y, z, xy, xz:
E-count of x is 2 (resultant triple: yz)
E-count of y is 1 (resultant triple: x)
E-count of z is 1 (resultant triple: x)
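E-count follows directly from this set-of-variables view of triples (the encoding used in the simplified-view slides); a small sketch:

```python
# Sketch of the e-count computation; triples are sets of joining variables.
def e_count(var, triples):
    """Number of joining variables left in the triple produced by completely
    eliminating `var`, i.e. joining every triple that contains it."""
    resultant = set()
    for t in triples:
        if var in t:
            resultant |= t
    return len(resultant - {var})

triples = [{"x"}, {"y"}, {"z"}, {"x", "y"}, {"x", "z"}]
print(e_count("x", triples))  # 2 (resultant triple: yz)
print(e_count("y", triples))  # 1 (resultant triple: x)
print(e_count("z", triples))  # 1 (resultant triple: x)
```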
Greedy Algorithm: Proposition
Maximum number of jobs required for any SPARQL query:
K, if K <= 1
min(ceil(1.71 * log2 K), N), if K > 1
where K is the number of triples in the query and N is the total number of joining variables
Greedy Algorithm: Proof
If we make just one join with each joining variable, then all joins can be done in N jobs (one join per job)
Special case scenario: suppose each joining variable is common to exactly two triples. Example: ab, bc, cd, de, ef, ... (like a chain)
At each job, we can make K/2 joins, which reduces the number of triples to half (i.e., K/2)
So each job halves the number of triples; therefore the total number of jobs required is log2 K < 1.71 * log2 K
Greedy Algorithm: Proof (Continued)
General case: suppose we sort the variables in decreasing order of their frequency in different triples
Let vi have frequency fi; therefore fi <= fi-1 <= fi-2 <= ... <= f1
Note that if f1 = 2, then it reduces to the special case; therefore f1 > 2 in the general case, and also fN >= 2
Now we keep joining on v1, v2, ..., vN as long as there is no conflict
Greedy Algorithm: Proof (Continued)
Suppose L triples could not be reduced because each of them is left with one or more joining variables that conflict (e.g. try reducing xy, yz, zx)
Therefore M >= L joins have been performed, producing M triples (M + L triples remaining in total)
Since each join involved at least 2 triples:
2M + L <= K
2(L + e) + L <= K (letting M = L + e, e >= 0)
3L + 2e <= K
2L + (4/3)e <= (2/3)K (multiplying both sides by 2/3)
Greedy Algorithm: Proof (Continued)
2L + e <= (2/3)K, so each job reduces the number of triples to at most 2/3
Therefore K * (2/3)^Q >= 1 >= K * (2/3)^(Q+1)
(3/2)^Q <= K <= (3/2)^(Q+1), so Q <= log3/2 K = 1.71 * log2 K <= Q + 1
In most real-world scenarios, more than 100 triples in a query is extremely rare
So the maximum number of jobs required in this case is 12
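The proposition's bound is easy to check numerically; the sketch below restates it as a function (assuming N, the number of joining variables, does not bind for the K = 100 case, as on the slide):

```python
# The job bound from the proposition: K if K <= 1, else min(ceil(1.71*log2 K), N).
import math

def max_jobs(K, N):
    if K <= 1:
        return K
    return min(math.ceil(1.71 * math.log2(K)), N)

print(max_jobs(100, 50))  # 12, matching the slide's "at most 12 jobs" figure
```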
Greedy Algorithm
Greedy algorithm
Early elimination heuristic: make as many complete eliminations in each job as possible
This leaves the fewest number of variables to join in the next job
Must choose first the join with the least e-count (the least number of joining variables in the resultant triple)
Greedy Algorithm
Step I: remove non-joining variables
Step II: sort the variables according to e-count
Step III: choose a variable for elimination as long as complete or partial elimination is possible; these joins make a job
Step IV: continue to Step II if more triples are available
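Steps I-IV can be sketched as a loop over the set-of-variables view of triples. This is an illustration of the heuristic under that simplified encoding, not the authors' implementation; ties between equal e-counts are broken alphabetically here, which is an arbitrary choice.

```python
# Greedy plan sketch: each iteration builds one Hadoop job by eliminating
# variables in increasing e-count order until triples conflict.
def greedy_jobs(triples):
    triples = [set(t) for t in triples]   # Step I assumed done: joining vars only
    jobs = []
    while len(triples) > 1:
        joining = {v for v in set().union(*triples)
                   if sum(v in t for t in triples) >= 2}
        if not joining:
            break
        def ecount(v):
            return len(set().union(*[t for t in triples if v in t]) - {v})
        used, job, produced = set(), [], []
        # Steps II-III: take variables in increasing e-count order; each
        # eliminable variable's triples collapse into one join of this job
        for v in sorted(sorted(joining), key=ecount):
            group = [i for i, t in enumerate(triples) if v in t and i not in used]
            if len(group) >= 2:
                job.append(v)
                used.update(group)
                produced.append(set().union(*[triples[i] for i in group]) - {v})
        if not job:
            break
        triples = [t for i, t in enumerate(triples) if i not in used] + produced
        jobs.append(job)   # Step IV: loop back while triples remain
    return jobs

# Simplified query from the earlier slide: triples x, z, xz, y, xy
print(greedy_jobs([{"x"}, {"z"}, {"x", "z"}, {"y"}, {"x", "y"}]))
```

On the earlier five-triple example this produces two jobs, eliminating y and z first (e-count 1) and then x, the same shape as query plan B.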
Outline
Semantic Web Technologies & Cloud Computing Frameworks
Goal & Motivation
Current Approaches
System Architecture & Storage Schema
SPARQL Query by MapReduce
Query Plan Generation
Experiment
Future Works
Experiment
Dataset and queries
Cluster description
Comparison with Jena In-Memory, SDB and BigOWLIM frameworks
Experiments with number of Reducers
Algorithm runtimes: Greedy vs. Exhaustive
Some query results
Dataset And Queries
LUBM: dataset generator and 14 benchmark queries
Generates data for some imaginary universities
Used for query execution performance comparison by many researchers
Our Clusters
10-node cluster in SAIAL lab: each node with 4 GB main memory, Intel Pentium IV 3.0 GHz processor, 640 GB hard drive
OpenCirrus HP labs test bed
Comparison: LUBM Query 2
Comparison: LUBM Query 9
Comparison: LUBM Query 12
Experiment with Number of Reducers
Greedy vs. Exhaustive Plan Generation
Some Query Results
[Chart: query runtimes in seconds vs. millions of triples]
Outline
Semantic Web Technologies & Cloud Computing Frameworks
Goal & Motivation
Current Approaches
System Architecture & Storage Schema
SPARQL Query by MapReduce
Query Plan Generation
Experiment
Future Works
Future Works
Enable plan generation algorithm to handle queries with complex structures
Ontology driven file partitioning for faster query answering
Balanced partitioning for data set with skewed distribution
Materialization with limited number of jobs for inference
Experiment with non-homogenous cluster
Publications
Mohammad Husain, Latifur Khan, Murat Kantarcioglu, Bhavani M. Thuraisingham: Data Intensive Query Processing for Large RDF Graphs Using Cloud Computing Tools, IEEE International Conference on Cloud Computing, 2010 (acceptance rate 20%)
Mohammad Husain, Pankil Doshi, Latifur Khan, Bhavani M. Thuraisingham: Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce, International Conference on Cloud Computing Technology and Science, Beijing, China, 2009
Mohammad Husain, Mohammad M. Masud, James McGlothlin, Latifur Khan, Bhavani Thuraisingham: Greedy Based Query Processing for Large RDF Graphs Using Cloud Computing, IEEE Transactions on Knowledge and Data Engineering Special Issue on Cloud Computing (submitted)
Mohammad Farhan Husain, Tahseen Al-Khateeb, Mohmmad Alam, Latifur Khan: Ontology based Policy Interoperability in Geo-Spatial Domain, CSI Journal (to appear)
Mohammad Farhan Husain, Mohmmad Alam, Tahseen Al-Khateeb, Latifur Khan: Ontology based policy interoperability in geo-spatial domain. ICDE Workshops 2008
Chuanjun Li, Latifur Khan, Bhavani M. Thuraisingham, M. Husain, Shaofei Chen, Fang Qiu : Geospatial Data Mining for National Security: Land Cover Classification and Semantic Grouping, Intelligence and Security Informatics, 2007
Questions/Discussion
A TOKEN-BASED ACCESS CONTROL SYSTEM FOR RDF DATA IN THE CLOUDS
Arindam Khaled, Mohammad Farhan Husain, Latifur Khan, Kevin Hamlen, Bhavani Thuraisingham
Department of Computer Science, University of Texas at Dallas
Research Funded by AFOSR
CloudCom 2010
Outline
Motivation and Background: Semantic Web, Security, Scalability
Access control
Proposed Architecture
Results
Motivation
The Semantic web is gaining immense popularity.
Resource Description Framework (RDF) is one of the ways to represent data in the Semantic web.
But most of the existing frameworks either lack scalability or do not incorporate security.
Our framework incorporates both.
Semantic Web
Originally proposed by Sir Tim Berners-Lee who envisioned it as a machine-understandable web.
Powerful since it allows relationships between web resources.
Semantic web and Ontologies are used to represent knowledge.
Resource Description Framework (RDF) is used for its expressive power, semantic interoperability, and reusability.
Semantic Web Technologies
Data in machine-understandable format
Infer new knowledge
Standards:
Data representation: RDF (Triples)
Ontology: OWL, DAML
Query language: SPARQL
Example triple (Subject Predicate Object):
http://test.com/s1 foaf:name “John Smith”
Current Technologies
Joseki [15], Kowari [17], 3store [10], and Sesame [5] are a few RDF stores; security is not addressed for these.
In Jena [14, 20], efforts have been made to incorporate security.
But Jena lacks scalability: queries over large data often become intractable [12, 13].
Cloud Computing Frameworks
Proprietary: Amazon S3, Amazon EC2, Force.com
Open source tools:
Hadoop: Apache’s open source implementation of Google’s proprietary GFS file system
MapReduce: functional programming paradigm using key-value pairs
Cloud as RDF Stores
Large RDF graphs can be efficiently stored and queried in the clouds [6, 12, 13, 18], but these stores lack access control.
We address this problem by generating tokens for specified access levels.
Agents are assigned these tokens based on their business requirements and restrictions.
System Architecture
LUBM Data Generator
Preprocessor: N-Triples Converter, Predicate Based Splitter, Object Type Based Splitter
Hadoop Distributed File System / Hadoop Cluster
MapReduce Framework: Query Rewriter, Query Plan Generator, Plan Executor, Access Control
Data flow: RDF/XML → Preprocessed Data; 1. Query → 2. Jobs → 3. Answer
Storage Schema
Data in N-Triples, using namespaces. Example: http://utdallas.edu/res1 becomes utd:resource1
Predicate based Splits (PS): split data according to Predicates
Predicate Object based Splits (POS): split further according to rdf:type of Objects
Example
Raw N-Triples:
D0U0:GraduateStudent20 rdf:type lehigh:GraduateStudent
lehigh:University0 rdf:type lehigh:University
D0U0:GraduateStudent20 lehigh:memberOf lehigh:University0

PS (Predicate Split):
File: rdf_type
D0U0:GraduateStudent20 lehigh:GraduateStudent
lehigh:University0 lehigh:University
File: lehigh_memberOf
D0U0:GraduateStudent20 lehigh:University0

POS (Predicate Object Split):
File: rdf_type_GraduateStudent
D0U0:GraduateStudent20
File: rdf_type_University
lehigh:University0
File: lehigh_memberOf_University
D0U0:GraduateStudent20 lehigh:University0
Space Gain
Example
Data size at various steps for LUBM1000:

Step                         | Number of Files | Size (GB) | Space Gain
N-Triples                    | 20,020          | 24        | --
Predicate Split (PS)         | 17              | 7.1       | 70.42%
Predicate Object Split (POS) | 41              | 6.6       | 72.5%
SPARQL Query
SPARQL – SPARQL Protocol And RDF Query Language
Example
SELECT ?x ?y WHERE {
  ?z foaf:name ?x .
  ?z foaf:age ?y
}
SPARQL Query by MapReduce
Example query:
SELECT ?p WHERE {
  ?x rdf:type lehigh:Department .
  ?p lehigh:worksFor ?x .
  ?x subOrganizationOf http://University0.edu
}
Rewritten query:
SELECT ?p WHERE {
  ?p lehigh:worksFor_Department ?x .
  ?x subOrganizationOf http://University0.edu
}
Inside Hadoop MapReduce Job
INPUT
File subOrganizationOf_University:
Department1 http://University0.edu
Department2 http://University1.edu
File worksFor_Department:
Professor1 Department1
Professor2 Department2

MAP: filter subOrganizationOf_University tuples (Object == http://University0.edu); emit tagged key-value pairs, e.g. (Department1, SO#http://University0.edu), (Department1, WF#Professor1), (Department2, WF#Professor2)

SHUFFLE & SORT: group values by the join key (the department)

REDUCE: join the SO# and WF# values sharing a key

OUTPUT
Department1 SO#http://University0.edu WF#Professor1
Department2 WF#Professor2
Access Control in Our Architecture
MapReduce Framework: Query Rewriter, Query Plan Generator, Plan Executor, Access Control
The access control module is linked to all the components of the MapReduce Framework
Motivation
It is important to keep data safe from unwanted access.
Encryption can be used, but it has little or no semantic value.
By issuing and manipulating different levels of access control, an agent can access the data intended for it, or make inferences.
Access Control Terminology
Access Tokens (AT): denoted by integers; allow agents to access security-relevant data.
Access Token Tuples (ATT): have the form <AccessToken, Element, ElementType, ElementName>, where Element can be Subject, Object, or Predicate, and ElementType can be URI, DataType, Literal, Model (Subject), or BlankNode.
Six Access Control Levels
Predicate Data Access: Defined for a particular predicate. An agent can access the predicate file. For example: An agent possessing ATT <1, Predicate, isPaid, _> can access the entire predicate file isPaid.
Predicate and Subject Data Access: More restrictive than the previous one. Combining one of these Subject ATT’s with a Predicate data access ATT having the same AT grants the agent access to a specific subject of a specific predicate. For example, having ATT’s <1, Predicate, isPaid, _> and <1, Subject, URI , MichaelScott> permits an agent with AT 1 to access a subject with URI MichaelScott of predicate isPaid.
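The ATT layout and the predicate-plus-subject check could be modeled as below. The tuple fields mirror the ATT definition and the isPaid/MichaelScott values are the slide's own example; the `can_access` function and the second subject name are hypothetical illustrations, not the system's API.

```python
# Sketch: combining ATTs that share an AT to check predicate + subject access.
from collections import namedtuple

ATT = namedtuple("ATT", "token element element_type name")

atts = [
    ATT(1, "Predicate", None, "isPaid"),
    ATT(1, "Subject", "URI", "MichaelScott"),
]

def can_access(atts, token, predicate, subject):
    """An AT grants access to (subject, predicate) if its ATTs cover the
    predicate and, when a Subject ATT exists, that subject too."""
    mine = [a for a in atts if a.token == token]
    pred_ok = any(a.element == "Predicate" and a.name == predicate for a in mine)
    subj_atts = [a for a in mine if a.element == "Subject"]
    subj_ok = (not subj_atts) or any(a.name == subject for a in subj_atts)
    return pred_ok and subj_ok

print(can_access(atts, 1, "isPaid", "MichaelScott"))   # True
print(can_access(atts, 1, "isPaid", "DwightSchrute"))  # False: subject not covered
```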
Access Control Levels (Cont.)
Predicate and Object: this access level permits a principal to extract the names of subjects satisfying a particular predicate and object.
Subject Access: one of the less restrictive access control levels. The subject can be a URI, DataType, or BlankNode.
Object Access: the object can be a URI, DataType, Literal, or BlankNode.
Access Control Levels (Cont.)
Subject Model Level Access: this permits an agent to read all necessary predicate files to obtain all objects of a given subject. The URI objects obtained from the last step are treated as subjects to extract their respective predicates and objects. This iterative process continues until all objects finally become blank nodes or literals. Agents may generate models on a given subject.
Access Token Assignment
Each agent has an Access Token list (AT-list), which contains zero or more ATs assigned to the agent along with their issuing timestamps.
These timestamps are used to resolve conflicts (explained later).
The set of triples accessible to an agent is the union of the result sets of the ATs in the agent’s AT-list.
Conflict
A conflict arises when the following three conditions occur:
an agent possesses two ATs, 1 and 2;
the result set of AT 2 is a proper subset of that of AT 1; and
the timestamp of AT 1 is earlier than the timestamp of AT 2.
The later, more specific AT supersedes the former, so AT 1 is discarded from the AT-list to resolve the conflict.
Conflict Type
Subset Conflict: It occurs when AT 2 (later issued) is a conjunction of ATT’s that refine AT 1. For example, AT 1 is defined by <1, Subject, URI, Sam> and AT 2 is defined by <2, Subject, URI, Sam> and <2, Predicate, HasAccounts, _> ATT’s. If AT 2 is issued to the possessor of AT 1 at a later time, then a conflict will occur and AT 1 will be discarded from the agent’s AT-list.
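The supersede rule can be sketched directly. Result sets are modeled as plain Python sets and timestamps as issue order, both simplifying assumptions; the triples below are invented stand-ins for the Sam/HasAccounts example.

```python
# Sketch of conflict resolution: a later AT whose result set is a proper
# subset of an earlier AT's result set causes the earlier AT to be discarded.
def resolve_conflicts(at_list):
    """at_list: [(at_id, timestamp, result_set), ...] -> ids of surviving ATs."""
    keep = []
    for at_id, ts, rs in at_list:
        discard = False
        for other_id, other_ts, other_rs in at_list:
            # a later, more specific AT: its result set is a proper subset
            if other_ts > ts and other_rs < rs:
                discard = True
        if not discard:
            keep.append(at_id)
    return keep

triples_sam_all = {("Sam", "HasAccounts", "A1"), ("Sam", "WorksAt", "B")}
triples_sam_accounts = {("Sam", "HasAccounts", "A1")}  # proper subset

# AT 2 (issued later) refines AT 1, so AT 1 is discarded
print(resolve_conflicts([(1, 0, triples_sam_all), (2, 1, triples_sam_accounts)]))
```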
Conflict Type
Subtype conflict: Subtype conflicts occur when the ATT’s in AT 2 involve data types that are subtypes of those in AT 1. The data types can be those of subjects, objects or both.
Conflict Resolution Algorithm
Experiment
Dataset and queries
Cluster description
Comparison with Jena In-Memory, SDB and BigOWLIM frameworks
Experiments with number of Reducers
Algorithm runtimes: Greedy vs. Exhaustive
Some query results
Dataset And Queries
LUBM: dataset generator and 14 benchmark queries
Generates data for some imaginary universities
Used for query execution performance comparison by many researchers
Our Clusters
10-node cluster in SAIAL lab: each node with 4 GB main memory, Intel Pentium IV 3.0 GHz processor, 640 GB hard drive
OpenCirrus HP labs test bed
Results
Scenario 1: “takesCourse”
A list of sensitive courses cannot be viewed by a normal user for any student
Results
Scenario 2: “displayTeachers”
A normal user is allowed to view information about the lecturers only
Future Works
Build a generic system that incorporates tokens and resolves policy conflicts.
Implement Subject Model Level Access that recursively extracts objects of subjects and treats these objects as subjects as long as these objects are URIs. An agent with proper access level can construct a model on that subject.
References
[1] Apache. Hadoop. http://hadoop.apache.org/.
[2] D. Beckett. RDF/XML syntax specification (revised). Technical report, W3C, February 2004.
[3] T. Berners-Lee. Semantic web road map. http://www.w3.org/DesignIssues/Semantic.html, 1998.
[4] L. Bouganim, F. D. Ngoc, and P. Pucheral. Client based access control management for XML documents. In Proc. 20èmes Journées Bases de Données Avancées (BDA), pages 65–89, Montpellier, France, October 2004.
References
[5] J. Broekstra, A. Kampman, and F. van Harmelen. Sesame: A generic architecture for storing and querying RDF. In Proc. 1st International Semantic Web Conference (ISWC), pages 54–68, Sardinia, Italy, June 2002.
[6] H. Choi, J. Son, Y. Cho, M. K. Sung, and Y. D. Chung. SPIDER: a system for scalable, parallel / distributed evaluation of large-scale RDF data. In Proc. 18th ACM Conference on Information and Knowledge Management (CIKM), pages 2087–2088, Hong Kong, China, November 2009.
[7] J. Grant and D. Beckett. RDF test cases. Technical report, W3C, February 2004.
[8] Y. Guo, Z. Pan, and J. Heflin. An evaluation of knowledge base systems for large OWL datasets. In In Proc. 3rd International Semantic Web Conference (ISWC), pages 274–288, Hiroshima, Japan, November 2004.
[9] Y. Guo, Z. Pan, and J. Heflin. LUBM: A benchmark for OWL knowledge base systems. Journal of Web Semantics, 3(2–3):158–182, 2005.
References
[10] S. Harris and N. Shadbolt. SPARQL query processing with conventional relational database systems. In Proc. Web Information Systems Engineering (WISE) International Workshop on Scalable Semantic Web Knowledge Base Systems
(SSWS), pages 235–244, New York, New York, November 2005.
[11] L. E. Holmquist, J. Redström, and P. Ljungstrand. Token based access to digital information. In Proc. 1st International Symposium on Handheld and Ubiquitous Computing (HUC), pages 234–245, Karlsruhe, Germany, September 1999.
[12] M. F. Husain, P. Doshi, L. Khan, and B. M. Thuraisingham. Storage and retrieval of large RDF graph using Hadoop and MapReduce. In Proc. 1st International Conference on Cloud Computing (CloudCom), pages 680–686, Beijing, China, December 2009.
References
[13] M. F. Husain, L. Khan, M. Kantarcioglu, and B. Thuraisingham. Data intensive query processing for large RDF graphs using cloud computing tools. In Proc. IEEE 3rd International Conference on Cloud Computing (CLOUD), pages 1–10, Miami, Florida, July 2010.
[14] A. Jain and C. Farkas. Secure resource description framework: an access control model. In Proc. 11th ACM Symposium on Access Control Models and Technologies (SACMAT), pages 121–129, Lake Tahoe, California, June 2006.
[15] Joseki. http://www.joseki.org.
References
[16] J. Kim, K. Jung, and S. Park. An introduction to authorization conflict problem in RDF access control. In Proc. 12th International Conference on Knowledge-Based Intelligent Information and Engineering Systems (KES), pages 583–592, Zagreb, Croatia, September 2008.
[17] Kowari. http://kowari.sourceforge.net.
[18] P. Mika and G. Tummarello. Web semantics in the clouds. IEEE Intelligent Systems, 23(5):82–87, 2008.
[19] E. Prud’hommeaux and A. Seaborne. SPARQL query language for RDF. Technical report, W3C, January 2008.
[20] P. Reddivari, T. Finin, and A. Joshi. Policy based access control for an RDF store. In Proc. Policy Management for the Web Workshop, 2005.
CLOUD TOOLS OVERVIEW
HADOOP
Outline
Hadoop - Basics
HDFS: Goals, Architecture, Other functions
MapReduce: Basics, Word Count Example, Handy tools, Finding shortest path example
Related Apache sub-projects (Pig, HBase, Hive)
Hadoop - Why?
Need to process huge datasets on large clusters of computers
Very expensive to build reliability into each application
Nodes fail every day: failure is expected, rather than exceptional
The number of nodes in a cluster is not constant
Need a common infrastructure: efficient, reliable, easy to use; open source, Apache License
Who uses Hadoop? Amazon/A9, Facebook, Google, New York Times, Veoh, Yahoo!, and many more
Commodity Hardware
Typically a 2-level architecture: nodes are commodity PCs, 30-40 nodes per rack
Uplink from the rack is 3-4 gigabit; rack-internal is 1 gigabit
[Figure: aggregation switch above per-rack switches]
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
Original Slides by
Dhruba Borthakur
Apache Hadoop Project Management Committee
Goals of HDFS
Very large distributed file system: 10K nodes, 100 million files, 10 PB
Assumes commodity hardware: files are replicated to handle hardware failure; detect failures and recover from them
Optimized for batch processing: data locations exposed so that computations can move to where the data resides; provides very high aggregate bandwidth
Distributed File System
Single namespace for the entire cluster
Data coherency: write-once-read-many access model; clients can only append to existing files
Files are broken up into blocks: typically 64 MB block size, each block replicated on multiple DataNodes
Intelligent client: the client can find the location of blocks and accesses data directly from the DataNode
HDFS Architecture
Functions of a NameNode
Manages the file system namespace: maps a file name to a set of blocks; maps a block to the DataNodes where it resides
Cluster configuration management
Replication engine for blocks
NameNode Metadata
Metadata in memory: the entire metadata is in main memory; no demand paging of metadata
Types of metadata: list of files; list of blocks for each file; list of DataNodes for each block; file attributes, e.g. creation time, replication factor
A Transaction Log records file creations, file deletions, etc.
DataNode
A block server: stores data in the local file system (e.g. ext3); stores metadata of a block (e.g. CRC); serves data and metadata to clients
Block report: periodically sends a report of all existing blocks to the NameNode
Facilitates pipelining of data: forwards data to other specified DataNodes
Block Placement
Current strategy: one replica on the local node, second replica on a remote rack, third replica on the same remote rack; additional replicas are placed randomly
Clients read from the nearest replicas
Would like to make this policy pluggable
Heartbeats
DataNodes send heartbeats to the NameNode, once every 3 seconds
The NameNode uses heartbeats to detect DataNode failure
Replication Engine
The NameNode detects DataNode failures: chooses new DataNodes for new replicas; balances disk usage; balances communication traffic to DataNodes
Data Correctness
Uses checksums to validate data (CRC32)
File creation: the client computes a checksum per 512 bytes; the DataNode stores the checksums
File access: the client retrieves the data and checksums from the DataNode; if validation fails, the client tries other replicas
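The per-chunk checksum scheme can be illustrated with Python's zlib. HDFS's real chunk handling and on-disk format differ, so this only shows the validate-or-retry idea described above.

```python
# Sketch of per-chunk CRC32 validation: the writer stores one checksum per
# 512-byte chunk; the reader recomputes and compares before trusting the data.
import zlib

CHUNK = 512

def checksums(data: bytes):
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def validate(data: bytes, sums):
    return checksums(data) == sums

data = bytes(range(256)) * 5           # 1280 bytes -> 3 chunks
sums = checksums(data)                 # computed at file creation
print(validate(data, sums))            # True
corrupted = data[:100] + b"\x00" + data[101:]
print(validate(corrupted, sums))       # False: the client would try another replica
```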
NameNode Failure
A single point of failure
Transaction Log stored in multiple directories: a directory on the local file system; a directory on a remote file system (NFS/CIFS)
Need to develop a real HA solution
Data Pipelining
The client retrieves a list of DataNodes on which to place replicas of a block
The client writes the block to the first DataNode
The first DataNode forwards the data to the next node in the pipeline
When all replicas are written, the client moves on to write the next block in the file
Rebalancer Goal: % disk full on DataNodes should
be similar Usually run when new DataNodes are
added Cluster is online when Rebalancer is
active Rebalancer is throttled to avoid network
congestion Command line tool
Secondary NameNode Copies FsImage and Transaction Log
from Namenode to a temporary directory
Merges FSImage and Transaction Log into a new FSImage in temporary directory
Uploads new FSImage to the NameNode Transaction Log on NameNode is purged
User Interface Commands for HDFS User:
hadoop dfs -mkdir /foodir hadoop dfs -cat /foodir/myfile.txt hadoop dfs -rm /foodir/myfile.txt
Commands for HDFS Administrator hadoop dfsadmin -report hadoop dfsadmin -decommission datanodename
Web Interface
http://host:port/dfshealth.jsp
MAPREDUCE
Original Slides by
Owen O’Malley (Yahoo!)
& Christophe Bisciglia, Aaron Kimball & Sierra Michels-Slettvet
MapReduce - What? MapReduce is a programming model for
efficient distributed computing It works like a Unix pipeline
cat input | grep | sort | uniq -c | cat > output
Input | Map | Shuffle & Sort | Reduce | Output
Efficiency from Streaming through data, reducing seeks Pipelining
A good fit for a lot of applications Log processing Web index building
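The Input | Map | Shuffle & Sort | Reduce | Output pipeline can be modeled in a few lines of plain Java. This is a single-process toy illustrating the dataflow only, not the Hadoop API; all names are ours.

```java
import java.util.*;
import java.util.function.*;

// Toy single-process model of the MapReduce dataflow: map each input
// record to (key, value) pairs, group values by key (the shuffle & sort),
// then reduce each group to one output value.
public class MiniMapReduce {
    public static <I, K, V, R> Map<K, R> run(
            List<I> input,
            Function<I, List<Map.Entry<K, V>>> mapper,
            BiFunction<K, List<V>, R> reducer) {
        // "Shuffle": group all mapper output by key.
        Map<K, List<V>> groups = new HashMap<>();
        for (I record : input)
            for (Map.Entry<K, V> kv : mapper.apply(record))
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
        // "Reduce": one call per distinct key.
        Map<K, R> output = new HashMap<>();
        for (Map.Entry<K, List<V>> g : groups.entrySet())
            output.put(g.getKey(), reducer.apply(g.getKey(), g.getValue()));
        return output;
    }
}
```

Plugging in a tokenizing mapper and a summing reducer gives exactly the word count example that appears later in these slides.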
MapReduce - Dataflow
MapReduce - Features Fine grained Map and Reduce tasks
Improved load balancing Faster recovery from failed tasks
Automatic re-execution on failure In a large cluster, some nodes are always slow or
flaky Framework re-executes failed tasks
Locality optimizations With large data, bandwidth to data is a problem Map-Reduce + HDFS is a very effective solution Map-Reduce queries HDFS for locations of input data Map tasks are scheduled close to the inputs when
possible
Word Count Example Mapper
Input: value: lines of text of input Output: key: word, value: 1
Reducer Input: key: word, value: set of counts Output: key: word, value: sum
Launching program Defines this job Submits job to cluster
Word Count Dataflow
Word Count Mapper

public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
Word Count Reducer

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Word Count Example Jobs are controlled by configuring JobConfs JobConfs are maps from attribute names to string
values The framework defines attributes to control how
the job is executed conf.set(“mapred.job.name”, “MyApp”);
Applications can add arbitrary values to the JobConf conf.set("my.string", "foo"); conf.setInt("my.integer", 12);
JobConf is available to all tasks
Putting it all together Create a launching program for your
application The launching program configures:
The Mapper and Reducer to use The output key and value types (input types
are inferred from the InputFormat) The locations for your input and output
The launching program then submits the job and typically waits for it to complete
Putting it all together

JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));

JobClient.runJob(conf);
Input and Output Formats A Map/Reduce may specify how its input is to
be read by specifying an InputFormat to be used A Map/Reduce may specify how its output is to
be written by specifying an OutputFormat to be used
These default to TextInputFormat and TextOutputFormat, which process line-based text data
Another common choice is SequenceFileInputFormat and SequenceFileOutputFormat for binary data
These are file-based, but they are not required to be
How many Maps and Reduces Maps
Usually as many as the number of HDFS blocks being processed, this is the default
Else the number of maps can be specified as a hint The number of maps can also be controlled by specifying
the minimum split size The actual sizes of the map inputs are computed by:
max(min(block_size, data/#maps), min_split_size)
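The split-size formula can be sanity-checked with a tiny sketch; the method and variable names are ours.

```java
// Sketch of the slide's split-size formula:
//   split = max(min(block_size, data/#maps), min_split_size)
// The #maps hint implies a target size; the block size caps it from above
// and the minimum split size bounds it from below.
public class SplitSize {
    public static long splitSize(long blockSize, long totalData,
                                 int numMaps, long minSplit) {
        long goal = totalData / numMaps;   // size implied by the map-count hint
        return Math.max(Math.min(blockSize, goal), minSplit);
    }
}
```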
Reduces Unless the amount of data being processed is small:
0.95 * num_nodes * mapred.tasktracker.tasks.maximum
Some handy tools Partitioners Combiners Compression Counters Speculation Zero Reduces Distributed File Cache Tool
Partitioners Partitioners are application code that define
how keys are assigned to reduces Default partitioning spreads keys evenly, but
randomly Uses key.hashCode() % num_reduces
Custom partitioning is often required, for example, to produce a total order in the output Should implement Partitioner interface Set by calling conf.setPartitionerClass(MyPart.class) To get a total order, sample the map output keys and
pick values to divide the keys into roughly equal buckets and use that in your partitioner
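The default key.hashCode() % num_reduces partitioning described above can be sketched as below. The sign-masking with Integer.MAX_VALUE mirrors what Hadoop's HashPartitioner does so that keys with negative hash codes still map into range; the class name here is ours.

```java
// Sketch of Hadoop's default hash partitioning: derive the reduce-task
// index from the key's hashCode. Masking with Integer.MAX_VALUE keeps the
// result non-negative even when hashCode() is negative.
public class HashPartition {
    public static int partition(Object key, int numReduces) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduces;
    }
}
```

A custom Partitioner would replace this function, for example with a range partitioner built from sampled keys to get a totally ordered output.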
Combiners When maps produce many repeated keys
It is often useful to do a local aggregation following the map Done by specifying a Combiner Goal is to decrease size of the transient data Combiners have the same interface as Reduces, and often
are the same class Combiners must not side effects, because they run an
intermdiate number of times In WordCount, conf.setCombinerClass(Reduce.class);
Compression Compressing the outputs and intermediate data will often
yield huge performance gains Can be specified via a configuration file or set
programmatically Set mapred.output.compress to true to compress job output Set mapred.compress.map.output to true to compress map
outputs Compression Types
(mapred(.map)?.output.compression.type) “block” - Group of keys and values are compressed together “record” - Each value is compressed individually Block compression is almost always best
Compression Codecs (mapred(.map)?.output.compression.codec) Default (zlib) - slower, but more compression LZO - faster, but less compression
Counters Often Map/Reduce applications have countable
events For example, framework counts records in to
and out of Mapper and Reducer To define user counters:
static enum Counter { EVENT1, EVENT2 };
reporter.incrCounter(Counter.EVENT1, 1);

Define nice names in a MyClass_Counter.properties file:
CounterGroupName=MyCounters
EVENT1.name=Event 1
EVENT2.name=Event 2
Speculative execution The framework can run multiple instances of
slow tasks Output from instance that finishes first is used Controlled by the configuration variable
mapred.speculative.execution Can dramatically shorten the long tail on jobs
Zero Reduces Frequently, we only need to run a filter on the
input data No sorting or shuffling required by the job Set the number of reduces to 0 Output from maps will go directly to OutputFormat and
disk
Distributed File Cache Sometimes need read-only copies of data on
the local computer Downloading 1GB of data for each Mapper is expensive
Define list of files you need to download in JobConf
Files are downloaded once per computer Add to launching program:
DistributedCache.addCacheFile(new URI(“hdfs://nn:8020/foo”), conf);
Add to task:Path[] files = DistributedCache.getLocalCacheFiles(conf);
Tool Handle “standard” Hadoop command line options
-conf file - load a configuration file named file -D prop=value - define a single configuration property
prop Class looks like:
public class MyApp extends Configured implements Tool {
  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyApp(), args));
  }
  public int run(String[] args) throws Exception {
    …. getConf() ….
  }
}
Finding the Shortest Path A common graph search
application is finding the shortest path from a start node to one or more target nodes
Commonly done on a single machine with Dijkstra’s Algorithm
Can we use BFS to find the shortest path via MapReduce?
Finding the Shortest Path: Intuition
We can define the solution to this problem inductively DistanceTo(startNode) = 0 For all nodes n directly reachable from
startNode, DistanceTo(n) = 1 For all nodes n reachable from some other
set of nodes S, DistanceTo(n) = 1 + min(DistanceTo(m), m ∈ S)
From Intuition to Algorithm A map task receives a node n as a key,
and (D, points-to) as its value D is the distance to the node from the
start points-to is a list of nodes reachable from
n For each p ∈ points-to, emit (p, D+1) The reduce task gathers possible
distances to a given p and selects the minimum one
What This Gives Us This MapReduce task can advance the
known frontier by one hop To perform the whole BFS, a non-
MapReduce component then feeds the output of this step back into the MapReduce task for another iteration Problem: Where’d the points-to list go? Solution: Mapper emits (n, points-to) as
well
Blow-up and Termination This algorithm starts from one node Subsequent iterations include many
more nodes of the graph as the frontier advances
Does this ever terminate? Yes! Eventually, routes between nodes will
stop being discovered and no better distances will be found. When distance is the same, we stop
Mapper should emit (n,D) to ensure that “current distance” is carried into the reducer
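The iterative rounds described above (map: emit D+1 to every reachable node; reduce: keep the minimum distance) can be simulated in one process to watch the frontier advance and the loop reach a fixed point. This is a toy Java model with our own names, not Hadoop code.

```java
import java.util.*;

// Toy simulation of BFS-by-MapReduce: each pass is one "map" (emit D+1
// to every node in the points-to list) folded with one "reduce" (keep
// the minimum distance per node). Iterate until no distance improves,
// which is exactly the termination condition on the slide.
public class BfsRounds {
    static final int INF = Integer.MAX_VALUE;

    public static Map<String, Integer> shortestPaths(
            Map<String, List<String>> graph, String start) {
        Map<String, Integer> dist = new HashMap<>();
        for (String n : graph.keySet()) dist.put(n, INF);
        dist.put(start, 0);                    // DistanceTo(startNode) = 0
        boolean changed = true;
        while (changed) {                      // stop when no better distances
            changed = false;
            Map<String, Integer> next = new HashMap<>(dist);
            for (String n : graph.keySet()) {  // the "map" phase
                int d = dist.get(n);
                if (d == INF) continue;        // frontier has not reached n yet
                for (String p : graph.get(n))  // emit (p, D+1)
                    if (d + 1 < next.getOrDefault(p, INF)) {  // "reduce": min
                        next.put(p, d + 1);
                        changed = true;
                    }
            }
            dist = next;
        }
        return dist;
    }
}
```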
HADOOP SUBPROJECTS
Hadoop Related Subprojects Pig
High-level language for data analysis HBase
Table storage for semi-structured data Zookeeper
Coordinating distributed applications Hive
SQL-like Query language and Metastore Mahout
Machine learning
PIG
Original Slides by
Matei Zaharia
UC Berkeley RAD Lab
Pig Started at Yahoo! Research Now runs about 30% of Yahoo!’s jobs Features
Expresses sequences of MapReduce jobs Data model: nested “bags” of items Provides relational (SQL) operators (JOIN, GROUP BY, etc.) Easy to plug in Java functions
An Example Problem Suppose you
have user data in a file, website data in another, and you need to find the top 5 most visited pages by users aged 18-25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
In MapReduce
In Pig Latin

Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, count(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
Ease of TranslationLoad Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load …Fltrd = filter … Pages = load …Joined = join …Grouped = group …Summed = … count()…Sorted = order …Top5 = limit …
Ease of Translation
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load …Fltrd = filter … Pages = load …Joined = join …Grouped = group …Summed = … count()…Sorted = order …Top5 = limit …
Job 1
Job 2
Job 3
HBASE
Original Slides by
Tom White
Lexeme Ltd.
HBase - What? Modeled on Google's Bigtable Row/column store Billions of rows/millions of columns Column-oriented - nulls are free Untyped - stores byte[]
HBase - Data Model

Row        | Timestamp | Column family animal:       | Column family repairs:
           |           | animal:type  | animal:size  | repairs:cost
enclosure1 | t2        | zebra        |              | 1000 EUR
enclosure1 | t1        | lion         | big          |
enclosure2 | …         | …            | …            | …
HBase - Data StorageColumn family animal:
(enclosure1, t2, animal:type)
zebra
(enclosure1, t1, animal:size)
big
(enclosure1, t1, animal:type)
lionColumn family repairs:
(enclosure1, t1, repairs:cost)
1000 EUR
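The (row, column, timestamp) → value cells listed above can be modeled with nested sorted maps, keeping newer timestamps first so a plain lookup returns the latest cell. This sketches the data model only, not the HBase API; the class and method names are ours.

```java
import java.util.*;

// Sketch of one HBase column family: row -> column -> (timestamp -> value),
// with timestamps in reverse order so firstEntry() is the newest version.
public class MiniColumnFamily {
    private final TreeMap<String, TreeMap<String, TreeMap<Long, String>>> cells =
        new TreeMap<>();

    public void put(String row, String column, long ts, String value) {
        cells.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(column, c -> new TreeMap<>(Comparator.reverseOrder()))
             .put(ts, value);
    }

    // Latest value for (row, column), or null if the cell was never written.
    public String get(String row, String column) {
        TreeMap<String, TreeMap<Long, String>> columns = cells.get(row);
        if (columns == null) return null;
        TreeMap<Long, String> versions = columns.get(column);
        return versions == null ? null : versions.firstEntry().getValue();
    }
}
```

Replaying the zoo example (lion and big at t1, then zebra at t2) makes animal:type read back as zebra, which is exactly why the slide's untyped, null-free layout is cheap to store.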
HBase - Code

HTable table = …
Text row = new Text("enclosure1");
Text col1 = new Text("animal:type");
Text col2 = new Text("animal:size");
BatchUpdate update = new BatchUpdate(row);
update.put(col1, "lion".getBytes("UTF-8"));
update.put(col2, "big".getBytes("UTF-8"));
table.commit(update);

update = new BatchUpdate(row);
update.put(col1, "zebra".getBytes("UTF-8"));
table.commit(update);
HBase - Querying Retrieve a cell
Cell = table.getRow(“enclosure1”).getColumn(“animal:type”).getValue();
Retrieve a rowRowResult = table.getRow( “enclosure1” );
Scan through a range of rowsScanner s = table.getScanner( new String[] { “animal:type” } );
HIVE
Original Slides by
Matei Zaharia
UC Berkeley RAD Lab
Hive Developed at Facebook Used for majority of Facebook jobs “Relational database” built on Hadoop
Maintains list of table schemas SQL-like query language (HiveQL) Can call Hadoop Streaming scripts from
HiveQL Supports table partitioning, clustering,
complex data types, some optimizations
Creating a Hive Table
Partitioning breaks table into separate files for each (dt, country) pairEx: /hive/page_view/dt=2008-06-08,country=USA
/hive/page_view/dt=2008-06-08,country=CA
CREATE TABLE page_views(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
A Simple Query

• Find all page views coming from xyz.com in March 2008:

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

• Hive only reads the matching date partitions instead of scanning the entire table
Aggregation and Joins

• Count users who visited each page by gender:

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

• Sample output:
Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_views.userid, page_views.date)
USING 'map_script.py'
AS dt, uid
CLUSTER BY dt
FROM page_views;
STORM
Original Slides by
Nathan Marz
Storm Developed by BackType which was acquired
by Twitter Lots of tools for data (i.e. batch) processing
Hadoop, Pig, HBase, Hive, … None of them are realtime systems which is
becoming a real requirement for businesses Storm provides realtime computation
Scalable Guarantees no data loss Extremely robust and fault-tolerant Programming language agnostic
Before Storm
Before Storm – Adding a worker
Deploy
Reconfigure/Redeploy
Problems Scaling is painful Poor fault-tolerance Coding is tedious
What we want Guaranteed data processing Horizontal scalability Fault-tolerance No intermediate message brokers! Higher level abstraction than message
passing “Just works” !!
Storm Cluster
Master node (similar to Hadoop JobTracker)
Used for cluster coordination
Run worker processes
Concepts Streams Spouts Bolts Topologies
Streams
Tuple Tuple Tuple Tuple Tuple Tuple Tuple
Unbounded sequence of tuples
Spouts
Source of streams
Bolts
Processes input streams and produces new streams:Can implement functions such as filters, aggregation, join, etc
Topology
Network of spouts and bolts
Topology
Spouts and bolts execute asmany tasks across the cluster
Stream Grouping
When a tuple is emitted which task does it go to?
Stream Grouping
• Shuffle grouping: pick a random task
• Fields grouping: consistent hashing on a subset of tuple fields
• All grouping: send to all tasks
• Global grouping: pick task with lowest id
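Fields grouping versus shuffle grouping can be sketched in plain Java. This is illustrative only, not the Storm API; the hash-mod shown here captures the key property of fields grouping (equal field values always reach the same task) without claiming to be Storm's exact hashing scheme.

```java
import java.util.Random;

// Sketch of two Storm stream groupings. Fields grouping hashes the chosen
// field so tuples with equal values are routed to the same task; shuffle
// grouping picks any task at random for load balancing.
public class Groupings {
    public static int fieldsGrouping(Object fieldValue, int numTasks) {
        // Mask keeps the index non-negative for negative hash codes.
        return (fieldValue.hashCode() & Integer.MAX_VALUE) % numTasks;
    }

    public static int shuffleGrouping(Random rng, int numTasks) {
        return rng.nextInt(numTasks);  // any task, chosen uniformly
    }
}
```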