DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS
Mohammad Farhan Husain
Dr. Latifur Khan
Dr. Bhavani Thuraisingham
Department of Computer Science, University of Texas at Dallas
Outline
Semantic Web Technologies & Cloud Computing Frameworks
Goal & Motivation
Current Approaches
System Architecture & Storage Schema
SPARQL Query by MapReduce
Query Plan Generation
Experiment
Future Works
Semantic Web Technologies
Data in machine-understandable format
Infer new knowledge
Standards:
Data representation: RDF (Triples)
Ontology: OWL, DAML
Query language: SPARQL
Example triple (Subject Predicate Object):
http://test.com/s1 foaf:name “John Smith”
Cloud Computing Frameworks
Proprietary: Amazon S3, Amazon EC2, Force.com
Open source tools:
Hadoop: Apache’s open source implementation of Google’s proprietary GFS file system
MapReduce: functional programming paradigm using key-value pairs
Outline
Semantic Web Technologies & Cloud Computing Frameworks
Goal & Motivation
Current Approaches
System Architecture & Storage Schema
SPARQL Query by MapReduce
Query Plan Generation
Experiment
Future Works
Goal
To build efficient storage using Hadoop for large amounts of data (e.g., billions of triples)
To build an efficient query mechanism
Publish as an open source project: http://code.google.com/p/hadooprdf/
Integrate with Jena as a Jena Model
Motivation
Current Semantic Web frameworks do not scale to large numbers of triples, e.g. Jena In-Memory, Jena RDB, Jena SDB, AllegroGraph, Virtuoso Universal Server, BigOWLIM
There is a lack of distributed frameworks with persistent storage
Hadoop uses low-end hardware, providing a distributed framework with high fault tolerance and reliability
Outline
Semantic Web Technologies & Cloud Computing Frameworks
Goal & Motivation
Current Approaches
System Architecture & Storage Schema
SPARQL Query by MapReduce
Query Plan Generation
Experiment
Future Works
Current Approaches
State-of-the-art approach: store RDF data in HDFS and query through MapReduce programming (our approach)
Traditional approach: store data in HDFS and process queries outside of Hadoop; done in the BIOMANTA1 project (details of querying could not be found)
1. http://biomanta.org/
Outline
Semantic Web Technologies & Cloud Computing Frameworks
Goal & Motivation
Current Approaches
System Architecture & Storage Schema
SPARQL Query by MapReduce
Query Plan Generation
Experiment
Future Works
System Architecture
LUBM Data Generator
Preprocessor: N-Triples Converter, Predicate Based Splitter, Object Type Based Splitter
Hadoop Distributed File System / Hadoop Cluster
MapReduce Framework: Query Rewriter, Query Plan Generator, Plan Executor
Data flow: RDF/XML → Preprocessed Data; 1. Query → 2. Jobs → 3. Answer
Storage Schema
Data in N-Triples, using namespaces. Example: http://utdallas.edu/res1 becomes utd:resource1
Predicate based Splits (PS): split data according to Predicates
Predicate Object based Splits (POS): split further according to rdf:type of Objects
Example
Raw N-Triples:
D0U0:GraduateStudent20 rdf:type lehigh:GraduateStudent
lehigh:University0 rdf:type lehigh:University
D0U0:GraduateStudent20 lehigh:memberOf lehigh:University0

PS (Predicate Split):
File: rdf_type
D0U0:GraduateStudent20 lehigh:GraduateStudent
lehigh:University0 lehigh:University
File: lehigh_memberOf
D0U0:GraduateStudent20 lehigh:University0

POS (Predicate Object Split):
File: rdf_type_GraduateStudent
D0U0:GraduateStudent20
File: rdf_type_University
lehigh:University0
File: lehigh_memberOf_University
D0U0:GraduateStudent20 lehigh:University0
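The PS and POS steps above can be sketched in a few lines of Python. This is only an illustration of the splitting idea, not the project's actual code; the file-naming scheme and namespace handling here are simplified assumptions.

```python
# Sketch of the PS and POS splits; triples arrive as (subject, predicate, object).
from collections import defaultdict

def predicate_split(triples):
    """PS: group triples by predicate; the predicate is dropped from each line."""
    files = defaultdict(list)
    for s, p, o in triples:
        files[p.replace(":", "_")].append((s, o))
    return dict(files)

def predicate_object_split(ps_files, rdf_type_file="rdf_type"):
    """POS: split the rdf:type file by object, and split the remaining
    predicate files by the rdf:type of their objects."""
    # type of each resource, taken from the rdf:type split
    type_of = {s: o.split(":")[-1] for s, o in ps_files.get(rdf_type_file, [])}
    pos = defaultdict(list)
    for fname, pairs in ps_files.items():
        for s, o in pairs:
            if fname == rdf_type_file:
                pos[f"{fname}_{o.split(':')[-1]}"].append((s,))
            elif o in type_of:
                pos[f"{fname}_{type_of[o]}"].append((s, o))
            else:
                pos[fname].append((s, o))
    return dict(pos)

triples = [
    ("D0U0:GraduateStudent20", "rdf:type", "lehigh:GraduateStudent"),
    ("lehigh:University0", "rdf:type", "lehigh:University"),
    ("D0U0:GraduateStudent20", "lehigh:memberOf", "lehigh:University0"),
]
ps = predicate_split(triples)
pos = predicate_object_split(ps)
```

Running this on the slide's three triples yields files named `rdf_type_GraduateStudent`, `rdf_type_University`, and `lehigh_memberOf_University`, mirroring the POS stage above.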
Space Gain
Example
Data size at various steps for LUBM1000:

Step                         | Number of Files | Size (GB) | Space Gain
N-Triples                    | 20,020          | 24        | --
Predicate Split (PS)         | 17              | 7.1       | 70.42%
Predicate Object Split (POS) | 41              | 6.6       | 72.5%
Outline
Semantic Web Technologies & Cloud Computing Frameworks
Goal & Motivation
Current Approaches
System Architecture & Storage Schema
SPARQL Query by MapReduce
Query Plan Generation
Experiment
Future Works
SPARQL Query
SPARQL – SPARQL Protocol And RDF Query Language
Example
SELECT ?x ?y WHERE {
  ?z foaf:name ?x .
  ?z foaf:age ?y
}
SPARQL Query by MapReduce
Example query:
SELECT ?p WHERE {
  ?x rdf:type lehigh:Department .
  ?p lehigh:worksFor ?x .
  ?x subOrganizationOf http://University0.edu
}
Rewritten query:
SELECT ?p WHERE {
  ?p lehigh:worksFor_Department ?x .
  ?x subOrganizationOf http://University0.edu
}
Inside Hadoop MapReduce Job
INPUT
File subOrganizationOf_University:
Department1 http://University0.edu
Department2 http://University1.edu
File worksFor_Department:
Professor1 Department1
Professor2 Department2

MAP: filter subOrganizationOf_University tuples (Object == http://University0.edu); emit tagged key-value pairs, e.g. (Department1, SO#http://University0.edu), (Department1, WF#Professor1), (Department2, WF#Professor2)

SHUFFLE & SORT: group values by the join key (the department)

REDUCE: join the SO# and WF# values sharing a key

OUTPUT
Department1 SO#http://University0.edu WF#Professor1
Department2 WF#Professor2
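The join above can be simulated in plain Python. The SO#/WF# tags and the filtering step mirror the slide; the code itself is only a sketch of how the reduce-side join works, not the system's implementation.

```python
# Simulated reduce-side join for the rewritten SPARQL query.
from collections import defaultdict

sub_org = [("Department1", "http://University0.edu"),
           ("Department2", "http://University1.edu")]
works_for = [("Professor1", "Department1"), ("Professor2", "Department2")]

# MAP: filter subOrganizationOf on the bound object, tag each value
mapped = []
for dept, univ in sub_org:
    if univ == "http://University0.edu":   # filtering: Object == University0
        mapped.append((dept, "SO#" + univ))
for prof, dept in works_for:
    mapped.append((dept, "WF#" + prof))

# SHUFFLE & SORT: group values by the join key (the department)
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# REDUCE: a department qualifies only if an SO# value arrived alongside WF# values
result = {}
for dept, values in sorted(groups.items()):
    if any(v.startswith("SO#") for v in values):
        result[dept] = [v[3:] for v in values if v.startswith("WF#")]

print(result)  # {'Department1': ['Professor1']}
```

Department2 drops out in the reduce phase because its subOrganizationOf tuple was filtered away in the map phase, which is exactly the behavior the final SELECT ?p answer needs.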
Outline
Semantic Web Technologies & Cloud Computing Frameworks
Goal & Motivation
Current Approaches
System Architecture & Storage Schema
SPARQL Query by MapReduce
Query Plan Generation
Experiment
Future Works
Query Plan Generation
Challenge: one Hadoop job may not be sufficient to answer a query
In a single Hadoop job, a single triple pattern cannot take part in joins on more than one variable simultaneously
Solution: an algorithm for query plan generation
A query plan is a sequence of Hadoop jobs which answers the query
Exploit the fact that in a single Hadoop job, a single triple pattern can take part in more than one join on a single variable simultaneously
Example
Example query:
SELECT ?X, ?Y, ?Z WHERE {
  ?X pred1 obj1 .
  subj2 ?Z obj2 .
  subj3 ?X ?Z .
  ?Y pred4 obj4 .
  ?Y pred5 ?X
}
Simplified view: 1. X, 2. Z, 3. XZ, 4. Y, 5. XY
Join Graph & Hadoop Jobs
[Figure: join graph over triple patterns 1-5 with edges labeled by the joining variables X, Y, Z; panels show the join graph, two valid jobs, and an invalid job]
Possible Query Plans
A. job1: (x, xz, xy)=yz, job2: (yz, y) = z, job3: (z, z) = done
[Figure: the join graph evolving through Job 1, Job 2, and Job 3 to the final result]
Possible Query Plans
B. job1: (y, xy)=x; (z,xz)=x, job2: (x, x, x) = done
[Figure: the join graph evolving through Job 1 and Job 2 to the final result]
Query Plan Generation
Goal: generate a minimum-cost job plan
Backtracking approach:
Exhaustively generates all possible plans
Uses a two-coloring scheme on a graph to find jobs, with colors WHITE and BLACK; two WHITE nodes cannot be adjacent
User-defined cost model; chooses the best plan according to the cost model
Some Definitions
Triple Pattern, TP: A triple pattern is an ordered collection of subject, predicate and object which appears in a SPARQL query WHERE clause. The subject, predicate and object can each be either a variable (unbounded) or a concrete value (bounded).
Triple Pattern Join, TPJ: A triple pattern join is a join between two TPs on a variable.
MapReduceJoin, MRJ: A MapReduceJoin is a join between two or more triple patterns on a variable.
Some Definitions
Job, JB: A job JB is a Hadoop job where one or more MRJs are done. JB has a set of input files and a set of output files.
Conflicting MapReduceJoins, CMRJ: A conflicting MapReduceJoins is a pair of MRJs sharing a triple pattern, where the joins are on different variables.
NON-Conflicting MapReduceJoins, NCMRJ: A non-conflicting MapReduceJoins is a pair of MRJs either not sharing any triple pattern, or sharing a triple pattern with the MRJs on the same variable.
Example
LUBM Query:
SELECT ?X WHERE {
  1. ?X rdf:type ub:Chair .
  2. ?Y rdf:type ub:Department .
  3. ?X ub:worksFor ?Y .
  4. ?Y ub:subOrganizationOf <http://www.U0.edu>
}
Example (contd.)
Triple Pattern Graph and Join Graph for the LUBM Query
Triple Pattern Graph (TPG)#1
Join Graph (JG)#1
Join Graph (JG)#2
Triple Pattern Graph (TPG)#2
Example(contd.)
The figure shows the TPG and JG for the query.
On the left, we have the TPG, where each node represents a triple pattern in the query; nodes are named in the order they appear.
In the middle, we have the JG. Each node in the JG represents an edge in the TPG.
For the query, an FQP can have two jobs: the first dealing with an NCMRJ between triple patterns 2, 3 and 4, and the second with an NCMRJ between triple pattern 1 and the output of the first join.
An IQP would be a first job having CMRJs between 1, 3 and 4, and a second having an MRJ between triple pattern 2 and the output of the first join.
Query Plan Generation: Backtracking
Drawbacks of the backtracking approach:
Computationally intractable: the search space is exponential in size
Steps a Hadoop Job Goes Through
The executable file (containing MapReduce code) is transferred from the client machine to the JobTracker
The JobTracker decides which TaskTrackers will execute the job
The executable file is distributed to the TaskTrackers over the network
Map processes start by reading data from HDFS
Map outputs are written to disk
Map outputs are read from disk, shuffled (transferred over the network to the TaskTrackers which will run the Reduce processes), sorted, and written to disk
Reduce processes start by reading the input from disk
Reduce outputs are written to disk
MapReduce Data Flow
http://developer.yahoo.com/hadoop/tutorial/module4.html#dataflow
Observations & an Approximate Solution
Observations:
Fixed overheads of a Hadoop job
Multiple reads and writes to disk
Data transferred over the network multiple times
Even a “Hello World” MapReduce job takes a couple of seconds because of the fixed overheads
Approximate solution: minimize the number of jobs
This is a good approximation since the overhead of each job (e.g. jar file distribution, multiple disk reads and writes, multiple network data transfers) and of job switching is huge
Greedy Algorithm: Terms
Joining variable: a variable that is common to two or more triples. Ex: in x, y, xy, xz, za, the variables x, y, z are joining; a is not
Complete elimination: a join operation that eliminates a joining variable. y can be completely eliminated if we join (xy, y)
Partial elimination: a join that partially eliminates a joining variable. After complete elimination of y, x can be partially eliminated by joining (xz, x)
Greedy Algorithm: Terms
E-count: the number of joining variables in the resultant triple after a complete elimination
In the example x, y, z, xy, xz:
E-count of x is 2 (resultant triple: yz)
E-count of y is 1 (resultant triple: x)
E-count of z is 1 (resultant triple: x)
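E-count follows directly from this set-of-variables view of triples (the encoding used in the simplified-view slides); a small sketch:

```python
# Sketch of the e-count computation; triples are sets of joining variables.
def e_count(var, triples):
    """Number of joining variables left in the triple produced by completely
    eliminating `var`, i.e. joining every triple that contains it."""
    resultant = set()
    for t in triples:
        if var in t:
            resultant |= t
    return len(resultant - {var})

triples = [{"x"}, {"y"}, {"z"}, {"x", "y"}, {"x", "z"}]
print(e_count("x", triples))  # 2 (resultant triple: yz)
print(e_count("y", triples))  # 1 (resultant triple: x)
print(e_count("z", triples))  # 1 (resultant triple: x)
```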
Greedy Algorithm: Proposition
Maximum number of jobs required for any SPARQL query:
K, if K <= 1
min(ceil(1.71 * log2 K), N), if K > 1
where K is the number of triples in the query and N is the total number of joining variables
Greedy Algorithm: Proof
If we make just one join with each joining variable, then all joins can be done in N jobs (one join per job)
Special case scenario: suppose each joining variable is common to exactly two triples. Example: ab, bc, cd, de, ef, ... (like a chain)
At each job, we can make K/2 joins, which reduces the number of triples to half (i.e., K/2)
So each job halves the number of triples; therefore the total number of jobs required is log2 K < 1.71 * log2 K
Greedy Algorithm: Proof (Continued)
General case: suppose we sort the variables in decreasing order of their frequency in different triples
Let vi have frequency fi; therefore fi <= fi-1 <= fi-2 <= ... <= f1
Note that if f1 = 2, then it reduces to the special case; therefore f1 > 2 in the general case, and also fN >= 2
Now we keep joining on v1, v2, ..., vN as long as there is no conflict
Greedy Algorithm: Proof (Continued)
Suppose L triples could not be reduced because each of them is left with one or more joining variables that conflict (e.g. try reducing xy, yz, zx)
Therefore M >= L joins have been performed, producing M triples (M + L triples remaining in total)
Since each join involved at least 2 triples:
2M + L <= K
2(L + e) + L <= K (letting M = L + e, e >= 0)
3L + 2e <= K
2L + (4/3)e <= (2/3)K (multiplying both sides by 2/3)
Greedy Algorithm: Proof (Continued)
2L + e <= (2/3)K, so each job reduces the number of triples to at most 2/3
Therefore K * (2/3)^Q >= 1 >= K * (2/3)^(Q+1)
(3/2)^Q <= K <= (3/2)^(Q+1), so Q <= log3/2 K = 1.71 * log2 K <= Q + 1
In most real-world scenarios, more than 100 triples in a query is extremely rare
So the maximum number of jobs required in this case is 12
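The proposition's bound is easy to check numerically; the sketch below restates it as a function (assuming N, the number of joining variables, does not bind for the K = 100 case, as on the slide):

```python
# The job bound from the proposition: K if K <= 1, else min(ceil(1.71*log2 K), N).
import math

def max_jobs(K, N):
    if K <= 1:
        return K
    return min(math.ceil(1.71 * math.log2(K)), N)

print(max_jobs(100, 50))  # 12, matching the slide's "at most 12 jobs" figure
```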
Greedy Algorithm
Greedy algorithm
Early elimination heuristic: make as many complete eliminations in each job as possible
This leaves the fewest number of variables to join in the next job
Must choose first the join with the least e-count (the least number of joining variables in the resultant triple)
Greedy Algorithm
Step I: remove non-joining variables
Step II: sort the variables according to e-count
Step III: choose a variable for elimination as long as complete or partial elimination is possible; these joins make a job
Step IV: continue to Step II if more triples are available
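Steps I-IV can be sketched as a loop over the set-of-variables view of triples. This is an illustration of the heuristic under that simplified encoding, not the authors' implementation; ties between equal e-counts are broken alphabetically here, which is an arbitrary choice.

```python
# Greedy plan sketch: each iteration builds one Hadoop job by eliminating
# variables in increasing e-count order until triples conflict.
def greedy_jobs(triples):
    triples = [set(t) for t in triples]   # Step I assumed done: joining vars only
    jobs = []
    while len(triples) > 1:
        joining = {v for v in set().union(*triples)
                   if sum(v in t for t in triples) >= 2}
        if not joining:
            break
        def ecount(v):
            return len(set().union(*[t for t in triples if v in t]) - {v})
        used, job, produced = set(), [], []
        # Steps II-III: take variables in increasing e-count order; each
        # eliminable variable's triples collapse into one join of this job
        for v in sorted(sorted(joining), key=ecount):
            group = [i for i, t in enumerate(triples) if v in t and i not in used]
            if len(group) >= 2:
                job.append(v)
                used.update(group)
                produced.append(set().union(*[triples[i] for i in group]) - {v})
        if not job:
            break
        triples = [t for i, t in enumerate(triples) if i not in used] + produced
        jobs.append(job)   # Step IV: loop back while triples remain
    return jobs

# Simplified query from the earlier slide: triples x, z, xz, y, xy
print(greedy_jobs([{"x"}, {"z"}, {"x", "z"}, {"y"}, {"x", "y"}]))
```

On the earlier five-triple example this produces two jobs, eliminating y and z first (e-count 1) and then x, the same shape as query plan B.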
Outline
Semantic Web Technologies & Cloud Computing Frameworks
Goal & Motivation
Current Approaches
System Architecture & Storage Schema
SPARQL Query by MapReduce
Query Plan Generation
Experiment
Future Works
Experiment
Dataset and queries
Cluster description
Comparison with Jena In-Memory, SDB and BigOWLIM frameworks
Experiments with number of Reducers
Algorithm runtimes: Greedy vs. Exhaustive
Some query results
Dataset And Queries
LUBM: dataset generator and 14 benchmark queries
Generates data for some imaginary universities
Used for query execution performance comparison by many researchers
Our Clusters
10-node cluster in SAIAL lab: each node with 4 GB main memory, Intel Pentium IV 3.0 GHz processor, 640 GB hard drive
OpenCirrus HP labs test bed
Comparison: LUBM Query 2
Comparison: LUBM Query 9
Comparison: LUBM Query 12
Experiment with Number of Reducers
Greedy vs. Exhaustive Plan Generation
Some Query Results
[Chart: query runtimes in seconds vs. millions of triples]
Outline
Semantic Web Technologies & Cloud Computing Frameworks
Goal & Motivation
Current Approaches
System Architecture & Storage Schema
SPARQL Query by MapReduce
Query Plan Generation
Experiment
Future Works
Future Works
Enable plan generation algorithm to handle queries with complex structures
Ontology driven file partitioning for faster query answering
Balanced partitioning for data set with skewed distribution
Materialization with limited number of jobs for inference
Experiment with non-homogenous cluster
Publications
Mohammad Husain, Latifur Khan, Murat Kantarcioglu, Bhavani M. Thuraisingham: Data Intensive Query Processing for Large RDF Graphs Using Cloud Computing Tools, IEEE International Conference on Cloud Computing, 2010 (acceptance rate 20%)
Mohammad Husain, Pankil Doshi, Latifur Khan, Bhavani M. Thuraisingham: Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce, International Conference on Cloud Computing Technology and Science, Beijing, China, 2009
Mohammad Husain, Mohammad M. Masud, James McGlothlin, Latifur Khan, Bhavani Thuraisingham: Greedy Based Query Processing for Large RDF Graphs Using Cloud Computing, IEEE Transactions on Knowledge and Data Engineering Special Issue on Cloud Computing (submitted)
Mohammad Farhan Husain, Tahseen Al-Khateeb, Mohmmad Alam, Latifur Khan: Ontology based Policy Interoperability in Geo-Spatial Domain, CSI Journal (to appear)
Mohammad Farhan Husain, Mohmmad Alam, Tahseen Al-Khateeb, Latifur Khan: Ontology based policy interoperability in geo-spatial domain. ICDE Workshops 2008
Chuanjun Li, Latifur Khan, Bhavani M. Thuraisingham, M. Husain, Shaofei Chen, Fang Qiu : Geospatial Data Mining for National Security: Land Cover Classification and Semantic Grouping, Intelligence and Security Informatics, 2007
Questions/Discussion
A TOKEN-BASED ACCESS CONTROL SYSTEM FOR RDF DATA IN THE CLOUDS
Arindam Khaled, Mohammad Farhan Husain, Latifur Khan, Kevin Hamlen, Bhavani Thuraisingham
Department of Computer Science, University of Texas at Dallas
Research Funded by AFOSR
CloudCom 2010
Outline
Motivation and Background: Semantic Web, Security, Scalability
Access control
Proposed Architecture
Results
Motivation
The Semantic web is gaining immense popularity.
Resource Description Framework (RDF) is one of the ways to represent data in the Semantic web.
But most of the existing frameworks either lack scalability or do not incorporate security.
Our framework incorporates both.
Semantic Web
Originally proposed by Sir Tim Berners-Lee who envisioned it as a machine-understandable web.
Powerful since it allows relationships between web resources.
Semantic web and Ontologies are used to represent knowledge.
Resource Description Framework (RDF) is used for its expressive power, semantic interoperability, and reusability.
Semantic Web Technologies
Data in machine-understandable format
Infer new knowledge
Standards:
Data representation: RDF (Triples)
Ontology: OWL, DAML
Query language: SPARQL
Example triple (Subject Predicate Object):
http://test.com/s1 foaf:name “John Smith”
Current Technologies
Joseki [15], Kowari [17], 3store [10], and Sesame [5] are a few RDF stores; security is not addressed for these.
In Jena [14, 20], efforts have been made to incorporate security.
But Jena lacks scalability: queries over large data often become intractable [12, 13].
Cloud Computing Frameworks
Proprietary: Amazon S3, Amazon EC2, Force.com
Open source tools:
Hadoop: Apache’s open source implementation of Google’s proprietary GFS file system
MapReduce: functional programming paradigm using key-value pairs
Cloud as RDF Stores
Large RDF graphs can be efficiently stored and queried in the clouds [6, 12, 13, 18], but these stores lack access control.
We address this problem by generating tokens for specified access levels.
Agents are assigned these tokens based on their business requirements and restrictions.
System Architecture
LUBM Data Generator
Preprocessor: N-Triples Converter, Predicate Based Splitter, Object Type Based Splitter
Hadoop Distributed File System / Hadoop Cluster
MapReduce Framework: Query Rewriter, Query Plan Generator, Plan Executor, Access Control
Data flow: RDF/XML → Preprocessed Data; 1. Query → 2. Jobs → 3. Answer
Storage Schema
Data in N-Triples, using namespaces. Example: http://utdallas.edu/res1 becomes utd:resource1
Predicate based Splits (PS): split data according to Predicates
Predicate Object based Splits (POS): split further according to rdf:type of Objects
Example
Raw N-Triples:
D0U0:GraduateStudent20 rdf:type lehigh:GraduateStudent
lehigh:University0 rdf:type lehigh:University
D0U0:GraduateStudent20 lehigh:memberOf lehigh:University0

PS (Predicate Split):
File: rdf_type
D0U0:GraduateStudent20 lehigh:GraduateStudent
lehigh:University0 lehigh:University
File: lehigh_memberOf
D0U0:GraduateStudent20 lehigh:University0

POS (Predicate Object Split):
File: rdf_type_GraduateStudent
D0U0:GraduateStudent20
File: rdf_type_University
lehigh:University0
File: lehigh_memberOf_University
D0U0:GraduateStudent20 lehigh:University0
Space Gain
Example
Data size at various steps for LUBM1000:

Step                         | Number of Files | Size (GB) | Space Gain
N-Triples                    | 20,020          | 24        | --
Predicate Split (PS)         | 17              | 7.1       | 70.42%
Predicate Object Split (POS) | 41              | 6.6       | 72.5%
SPARQL Query
SPARQL – SPARQL Protocol And RDF Query Language
Example
SELECT ?x ?y WHERE {
  ?z foaf:name ?x .
  ?z foaf:age ?y
}
SPARQL Query by MapReduce
Example query:
SELECT ?p WHERE {
  ?x rdf:type lehigh:Department .
  ?p lehigh:worksFor ?x .
  ?x subOrganizationOf http://University0.edu
}
Rewritten query:
SELECT ?p WHERE {
  ?p lehigh:worksFor_Department ?x .
  ?x subOrganizationOf http://University0.edu
}
Inside Hadoop MapReduce Job
INPUT
File subOrganizationOf_University:
Department1 http://University0.edu
Department2 http://University1.edu
File worksFor_Department:
Professor1 Department1
Professor2 Department2

MAP: filter subOrganizationOf_University tuples (Object == http://University0.edu); emit tagged key-value pairs, e.g. (Department1, SO#http://University0.edu), (Department1, WF#Professor1), (Department2, WF#Professor2)

SHUFFLE & SORT: group values by the join key (the department)

REDUCE: join the SO# and WF# values sharing a key

OUTPUT
Department1 SO#http://University0.edu WF#Professor1
Department2 WF#Professor2
Access Control in Our Architecture
MapReduce Framework: Query Rewriter, Query Plan Generator, Plan Executor, Access Control
The access control module is linked to all the components of the MapReduce Framework
Motivation
It is important to keep data safe from unwanted access.
Encryption can be used, but it has little or no semantic value.
By issuing and manipulating different levels of access control, an agent can access the data intended for it, or make inferences.
Access Control Terminology
Access Tokens (AT): denoted by integers; allow agents to access security-relevant data.
Access Token Tuples (ATT): have the form <AccessToken, Element, ElementType, ElementName>, where Element can be Subject, Object, or Predicate, and ElementType can be URI, DataType, Literal, Model (Subject), or BlankNode.
Six Access Control Levels
Predicate Data Access: Defined for a particular predicate. An agent can access the predicate file. For example: An agent possessing ATT <1, Predicate, isPaid, _> can access the entire predicate file isPaid.
Predicate and Subject Data Access: More restrictive than the previous one. Combining one of these Subject ATT’s with a Predicate data access ATT having the same AT grants the agent access to a specific subject of a specific predicate. For example, having ATT’s <1, Predicate, isPaid, _> and <1, Subject, URI , MichaelScott> permits an agent with AT 1 to access a subject with URI MichaelScott of predicate isPaid.
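The ATT layout and the predicate-plus-subject check could be modeled as below. The tuple fields mirror the ATT definition and the isPaid/MichaelScott values are the slide's own example; the `can_access` function and the second subject name are hypothetical illustrations, not the system's API.

```python
# Sketch: combining ATTs that share an AT to check predicate + subject access.
from collections import namedtuple

ATT = namedtuple("ATT", "token element element_type name")

atts = [
    ATT(1, "Predicate", None, "isPaid"),
    ATT(1, "Subject", "URI", "MichaelScott"),
]

def can_access(atts, token, predicate, subject):
    """An AT grants access to (subject, predicate) if its ATTs cover the
    predicate and, when a Subject ATT exists, that subject too."""
    mine = [a for a in atts if a.token == token]
    pred_ok = any(a.element == "Predicate" and a.name == predicate for a in mine)
    subj_atts = [a for a in mine if a.element == "Subject"]
    subj_ok = (not subj_atts) or any(a.name == subject for a in subj_atts)
    return pred_ok and subj_ok

print(can_access(atts, 1, "isPaid", "MichaelScott"))   # True
print(can_access(atts, 1, "isPaid", "DwightSchrute"))  # False: subject not covered
```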
Access Control Levels (Cont.)
Predicate and Object: this access level permits a principal to extract the names of subjects satisfying a particular predicate and object.
Subject Access: one of the less restrictive access control levels. The subject can be a URI, DataType, or BlankNode.
Object Access: the object can be a URI, DataType, Literal, or BlankNode.
Access Control Levels (Cont.)
Subject Model Level Access: this permits an agent to read all necessary predicate files to obtain all objects of a given subject. The URI objects obtained from the last step are treated as subjects to extract their respective predicates and objects. This iterative process continues until all objects finally become blank nodes or literals. Agents may generate models on a given subject.
Access Token Assignment
Each agent has an Access Token list (AT-list), which contains zero or more ATs assigned to the agent along with their issuing timestamps.
These timestamps are used to resolve conflicts (explained later).
The set of triples accessible to an agent is the union of the result sets of the ATs in the agent’s AT-list.
Conflict
A conflict arises when the following three conditions occur:
an agent possesses two ATs, 1 and 2;
the result set of AT 2 is a proper subset of that of AT 1; and
the timestamp of AT 1 is earlier than the timestamp of AT 2.
The later, more specific AT supersedes the former, so AT 1 is discarded from the AT-list to resolve the conflict.
Conflict Type
Subset Conflict: It occurs when AT 2 (later issued) is a conjunction of ATT’s that refine AT 1. For example, AT 1 is defined by <1, Subject, URI, Sam> and AT 2 is defined by <2, Subject, URI, Sam> and <2, Predicate, HasAccounts, _> ATT’s. If AT 2 is issued to the possessor of AT 1 at a later time, then a conflict will occur and AT 1 will be discarded from the agent’s AT-list.
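The supersede rule can be sketched directly. Result sets are modeled as plain Python sets and timestamps as issue order, both simplifying assumptions; the triples below are invented stand-ins for the Sam/HasAccounts example.

```python
# Sketch of conflict resolution: a later AT whose result set is a proper
# subset of an earlier AT's result set causes the earlier AT to be discarded.
def resolve_conflicts(at_list):
    """at_list: [(at_id, timestamp, result_set), ...] -> ids of surviving ATs."""
    keep = []
    for at_id, ts, rs in at_list:
        discard = False
        for other_id, other_ts, other_rs in at_list:
            # a later, more specific AT: its result set is a proper subset
            if other_ts > ts and other_rs < rs:
                discard = True
        if not discard:
            keep.append(at_id)
    return keep

triples_sam_all = {("Sam", "HasAccounts", "A1"), ("Sam", "WorksAt", "B")}
triples_sam_accounts = {("Sam", "HasAccounts", "A1")}  # proper subset

# AT 2 (issued later) refines AT 1, so AT 1 is discarded
print(resolve_conflicts([(1, 0, triples_sam_all), (2, 1, triples_sam_accounts)]))
```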
Conflict Type
Subtype conflict: Subtype conflicts occur when the ATT’s in AT 2 involve data types that are subtypes of those in AT 1. The data types can be those of subjects, objects or both.
Conflict Resolution Algorithm
Experiment
Dataset and queries
Cluster description
Comparison with Jena In-Memory, SDB and BigOWLIM frameworks
Experiments with number of Reducers
Algorithm runtimes: Greedy vs. Exhaustive
Some query results
Dataset And Queries
LUBM: dataset generator and 14 benchmark queries
Generates data for some imaginary universities
Used for query execution performance comparison by many researchers
Our Clusters
10-node cluster in SAIAL lab: each node with 4 GB main memory, Intel Pentium IV 3.0 GHz processor, 640 GB hard drive
OpenCirrus HP labs test bed
Results
Scenario 1: “takesCourse”
A list of sensitive courses cannot be viewed by a normal user for any student
Results
Scenario 2: “displayTeachers”
A normal user is allowed to view information about the lecturers only
Future Works
Build a generic system that incorporates tokens and resolves policy conflicts.
Implement Subject Model Level Access that recursively extracts objects of subjects and treats these objects as subjects as long as these objects are URIs. An agent with proper access level can construct a model on that subject.
References
[1] Apache. Hadoop. http://hadoop.apache.org/.
[2] D. Beckett. RDF/XML syntax specification (revised). Technical report, W3C, February 2004.
[3] T. Berners-Lee. Semantic web road map. http://www.w3.org/DesignIssues/Semantic.html, 1998.
[4] L. Bouganim, F. D. Ngoc, and P. Pucheral. Client based access control management for XML documents. In Proc. 20èmes Journées Bases de Données Avancées (BDA), pages 65–89, Montpellier, France, October 2004.
References
[5] J. Broekstra, A. Kampman, and F. van Harmelen. Sesame: A generic architecture for storing and querying RDF. In Proc. 1st International Semantic Web Conference (ISWC), pages 54–68, Sardinia, Italy, June 2002.
[6] H. Choi, J. Son, Y. Cho, M. K. Sung, and Y. D. Chung. SPIDER: a system for scalable, parallel / distributed evaluation of large-scale RDF data. In Proc. 18th ACM Conference on Information and Knowledge Management (CIKM), pages 2087–2088, Hong Kong, China, November 2009.
[7] J. Grant and D. Beckett. RDF test cases. Technical report, W3C, February 2004.
[8] Y. Guo, Z. Pan, and J. Heflin. An evaluation of knowledge base systems for large OWL datasets. In In Proc. 3rd International Semantic Web Conference (ISWC), pages 274–288, Hiroshima, Japan, November 2004.
[9] Y. Guo, Z. Pan, and J. Heflin. LUBM: A benchmark for OWL knowledge base systems. Journal of Web Semantics, 3(2–3):158–182, 2005.
References
[10] S. Harris and N. Shadbolt. SPARQL query processing with conventional relational database systems. In Proc. Web Information Systems Engineering (WISE) International Workshop on Scalable Semantic Web Knowledge Base Systems
(SSWS), pages 235–244, New York, New York, November 2005.
[11] L. E. Holmquist, J. Redström, and P. Ljungstrand. Token based access to digital information. In Proc. 1st International Symposium on Handheld and Ubiquitous Computing (HUC), pages 234–245, Karlsruhe, Germany, September 1999.
[12] M. F. Husain, P. Doshi, L. Khan, and B. M. Thuraisingham. Storage and retrieval of large RDF graph using Hadoop and MapReduce. In Proc. 1st International Conference on Cloud Computing (CloudCom), pages 680–686, Beijing, China, December 2009.
References
[13] M. F. Husain, L. Khan, M. Kantarcioglu, and B. Thuraisingham. Data intensive query processing for large RDF graphs using cloud computing tools. In Proc. IEEE 3rd International Conference on Cloud Computing (CLOUD), pages 1–10, Miami, Florida, July 2010.
[14] A. Jain and C. Farkas. Secure resource description framework: an access control model. In Proc. 11th ACM Symposium on Access Control Models and Technologies (SACMAT), pages 121–129, Lake Tahoe, California, June 2006.
[15] Joseki. http://www.joseki.org.
References
[16] J. Kim, K. Jung, and S. Park. An introduction to authorization conflict problem in RDF access control. In Proc. 12th International Conference on Knowledge-Based Intelligent Information and Engineering Systems (KES), pages 583–592, Zagreb, Croatia, September 2008.
[17] Kowari. http://kowari.sourceforge.net.
[18] P. Mika and G. Tummarello. Web semantics in the clouds. IEEE Intelligent Systems, 23(5):82–87, 2008.
[19] E. Prud’hommeaux and A. Seaborne. SPARQL query language for RDF. Technical report, W3C, January 2008.
[20] P. Reddivari, T. Finin, and A. Joshi. Policy based access control for an RDF store. In Proc. Policy Management for the Web Workshop, 2005.
CLOUD TOOLS OVERVIEW
HADOOP
Outline
Hadoop - Basics
HDFS: Goals, Architecture, Other functions
MapReduce: Basics, Word Count Example, Handy tools, Finding shortest path example
Related Apache sub-projects (Pig, HBase, Hive)
Hadoop - Why?
Need to process huge datasets on large clusters of computers
Very expensive to build reliability into each application
Nodes fail every day: failure is expected, rather than exceptional
The number of nodes in a cluster is not constant
Need a common infrastructure: efficient, reliable, easy to use; open source, Apache License
Who uses Hadoop? Amazon/A9, Facebook, Google, New York Times, Veoh, Yahoo!, and many more
Commodity Hardware
Typically a 2-level architecture: nodes are commodity PCs, 30-40 nodes per rack
Uplink from the rack is 3-4 gigabit; rack-internal is 1 gigabit
[Figure: aggregation switch above per-rack switches]
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
Original Slides by
Dhruba Borthakur
Apache Hadoop Project Management Committee
Goals of HDFS
Very large distributed file system: 10K nodes, 100 million files, 10 PB
Assumes commodity hardware: files are replicated to handle hardware failure; detect failures and recover from them
Optimized for batch processing: data locations exposed so that computations can move to where the data resides; provides very high aggregate bandwidth
Distributed File System
Single namespace for the entire cluster
Data coherency: write-once-read-many access model; clients can only append to existing files
Files are broken up into blocks: typically 64 MB block size, each block replicated on multiple DataNodes
Intelligent client: the client can find the location of blocks and accesses data directly from the DataNode
HDFS Architecture
Functions of a NameNode
Manages the file system namespace: maps a file name to a set of blocks; maps a block to the DataNodes where it resides
Cluster configuration management
Replication engine for blocks
NameNode Metadata
Metadata in memory: the entire metadata is in main memory; no demand paging of metadata
Types of metadata: list of files; list of blocks for each file; list of DataNodes for each block; file attributes, e.g. creation time, replication factor
A Transaction Log records file creations, file deletions, etc.
DataNode
A block server: stores data in the local file system (e.g. ext3); stores metadata of a block (e.g. CRC); serves data and metadata to clients
Block report: periodically sends a report of all existing blocks to the NameNode
Facilitates pipelining of data: forwards data to other specified DataNodes
Block Placement
Current strategy: one replica on the local node, second replica on a remote rack, third replica on the same remote rack; additional replicas are placed randomly
Clients read from the nearest replicas
Would like to make this policy pluggable
Heartbeats
DataNodes send heartbeats to the NameNode, once every 3 seconds
The NameNode uses heartbeats to detect DataNode failure
Replication Engine
The NameNode detects DataNode failures: chooses new DataNodes for new replicas; balances disk usage; balances communication traffic to DataNodes
Data Correctness
Uses checksums to validate data (CRC32)
File creation: the client computes a checksum per 512 bytes; the DataNode stores the checksums
File access: the client retrieves the data and checksums from the DataNode; if validation fails, the client tries other replicas
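The per-chunk checksum scheme can be illustrated with Python's zlib. HDFS's real chunk handling and on-disk format differ, so this only shows the validate-or-retry idea described above.

```python
# Sketch of per-chunk CRC32 validation: the writer stores one checksum per
# 512-byte chunk; the reader recomputes and compares before trusting the data.
import zlib

CHUNK = 512

def checksums(data: bytes):
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def validate(data: bytes, sums):
    return checksums(data) == sums

data = bytes(range(256)) * 5           # 1280 bytes -> 3 chunks
sums = checksums(data)                 # computed at file creation
print(validate(data, sums))            # True
corrupted = data[:100] + b"\x00" + data[101:]
print(validate(corrupted, sums))       # False: the client would try another replica
```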
NameNode Failure
A single point of failure
Transaction Log stored in multiple directories: a directory on the local file system; a directory on a remote file system (NFS/CIFS)
Need to develop a real HA solution
Data Pipelining
The client retrieves a list of DataNodes on which to place replicas of a block
The client writes the block to the first DataNode
The first DataNode forwards the data to the next node in the pipeline
When all replicas are written, the client moves on to write the next block in the file
Rebalancer Goal: % disk full on DataNodes should
be similar Usually run when new DataNodes are
added Cluster is online when Rebalancer is
active Rebalancer is throttled to avoid network
congestion Command line tool
Secondary NameNode Copies FsImage and Transaction Log
from Namenode to a temporary directory
Merges FSImage and Transaction Log into a new FSImage in temporary directory
Uploads new FSImage to the NameNode Transaction Log on NameNode is purged
User Interface Commands for HDFS User:
hadoop dfs -mkdir /foodir hadoop dfs -cat /foodir/myfile.txt hadoop dfs -rm /foodir/myfile.txt
Commands for HDFS Administrator hadoop dfsadmin -report hadoop dfsadmin -decommission datanodename
Web Interface
http://host:port/dfshealth.jsp
MAPREDUCE
Original Slides by
Owen O’Malley (Yahoo!)
& Christophe Bisciglia, Aaron Kimball & Sierra Michels-Slettvet
MapReduce - What? MapReduce is a programming model for
efficient distributed computing It works like a Unix pipeline
cat input | grep | sort | uniq -c | cat > output
Input | Map | Shuffle & Sort | Reduce | Output
Efficiency from Streaming through data, reducing seeks Pipelining
A good fit for a lot of applications Log processing Web index building
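The Input | Map | Shuffle & Sort | Reduce | Output pipeline can be modeled in a few lines of plain Java. This is a single-process toy illustrating the dataflow only, not the Hadoop API; all names are ours.

```java
import java.util.*;
import java.util.function.*;

// Toy single-process model of the MapReduce dataflow: map each input
// record to (key, value) pairs, group values by key (the shuffle & sort),
// then reduce each group to one output value.
public class MiniMapReduce {
    public static <I, K, V, R> Map<K, R> run(
            List<I> input,
            Function<I, List<Map.Entry<K, V>>> mapper,
            BiFunction<K, List<V>, R> reducer) {
        // "Shuffle": group all mapper output by key.
        Map<K, List<V>> groups = new HashMap<>();
        for (I record : input)
            for (Map.Entry<K, V> kv : mapper.apply(record))
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
        // "Reduce": one call per distinct key.
        Map<K, R> output = new HashMap<>();
        for (Map.Entry<K, List<V>> g : groups.entrySet())
            output.put(g.getKey(), reducer.apply(g.getKey(), g.getValue()));
        return output;
    }
}
```

Plugging in a tokenizing mapper and a summing reducer gives exactly the word count example that appears later in these slides.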
MapReduce - Dataflow
MapReduce - Features Fine grained Map and Reduce tasks
Improved load balancing Faster recovery from failed tasks
Automatic re-execution on failure In a large cluster, some nodes are always slow or
flaky Framework re-executes failed tasks
Locality optimizations With large data, bandwidth to data is a problem Map-Reduce + HDFS is a very effective solution Map-Reduce queries HDFS for locations of input data Map tasks are scheduled close to the inputs when
possible
Word Count Example Mapper
Input: value: lines of text of input Output: key: word, value: 1
Reducer Input: key: word, value: set of counts Output: key: word, value: sum
Launching program Defines this job Submits job to cluster
Word Count Dataflow
Word Count Mapper

public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
Word Count Reducer

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Word Count Example Jobs are controlled by configuring JobConfs JobConfs are maps from attribute names to string
values The framework defines attributes to control how
the job is executed conf.set(“mapred.job.name”, “MyApp”);
Applications can add arbitrary values to the JobConf conf.set("my.string", "foo"); conf.setInt("my.integer", 12);
JobConf is available to all tasks
Putting it all together Create a launching program for your
application The launching program configures:
The Mapper and Reducer to use The output key and value types (input types
are inferred from the InputFormat) The locations for your input and output
The launching program then submits the job and typically waits for it to complete
Putting it all together

JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));

JobClient.runJob(conf);
Input and Output Formats A Map/Reduce may specify how its input is to
be read by specifying an InputFormat to be used A Map/Reduce may specify how its output is to
be written by specifying an OutputFormat to be used
These default to TextInputFormat and TextOutputFormat, which process line-based text data
Another common choice is SequenceFileInputFormat and SequenceFileOutputFormat for binary data
These are file-based, but they are not required to be
How many Maps and Reduces Maps
Usually as many as the number of HDFS blocks being processed, this is the default
Else the number of maps can be specified as a hint The number of maps can also be controlled by specifying
the minimum split size The actual sizes of the map inputs are computed by:
max(min(block_size, data/#maps), min_split_size)
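The split-size formula can be sanity-checked with a tiny sketch; the method and variable names are ours.

```java
// Sketch of the slide's split-size formula:
//   split = max(min(block_size, data/#maps), min_split_size)
// The #maps hint implies a target size; the block size caps it from above
// and the minimum split size bounds it from below.
public class SplitSize {
    public static long splitSize(long blockSize, long totalData,
                                 int numMaps, long minSplit) {
        long goal = totalData / numMaps;   // size implied by the map-count hint
        return Math.max(Math.min(blockSize, goal), minSplit);
    }
}
```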
Reduces Unless the amount of data being processed is small:
0.95 * num_nodes * mapred.tasktracker.tasks.maximum
Some handy tools Partitioners Combiners Compression Counters Speculation Zero Reduces Distributed File Cache Tool
Partitioners Partitioners are application code that define
how keys are assigned to reduces Default partitioning spreads keys evenly, but
randomly Uses key.hashCode() % num_reduces
Custom partitioning is often required, for example, to produce a total order in the output Should implement Partitioner interface Set by calling conf.setPartitionerClass(MyPart.class) To get a total order, sample the map output keys and
pick values to divide the keys into roughly equal buckets and use that in your partitioner
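The default key.hashCode() % num_reduces partitioning described above can be sketched as below. The sign-masking with Integer.MAX_VALUE mirrors what Hadoop's HashPartitioner does so that keys with negative hash codes still map into range; the class name here is ours.

```java
// Sketch of Hadoop's default hash partitioning: derive the reduce-task
// index from the key's hashCode. Masking with Integer.MAX_VALUE keeps the
// result non-negative even when hashCode() is negative.
public class HashPartition {
    public static int partition(Object key, int numReduces) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduces;
    }
}
```

A custom Partitioner would replace this function, for example with a range partitioner built from sampled keys to get a totally ordered output.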
Combiners When maps produce many repeated keys
It is often useful to do a local aggregation following the map Done by specifying a Combiner Goal is to decrease size of the transient data Combiners have the same interface as Reduces, and often
are the same class Combiners must not side effects, because they run an
intermdiate number of times In WordCount, conf.setCombinerClass(Reduce.class);
Compression Compressing the outputs and intermediate data will often
yield huge performance gains Can be specified via a configuration file or set
programmatically Set mapred.output.compress to true to compress job output Set mapred.compress.map.output to true to compress map
outputs Compression Types
(mapred(.map)?.output.compression.type) “block” - Group of keys and values are compressed together “record” - Each value is compressed individually Block compression is almost always best
Compression Codecs (mapred(.map)?.output.compression.codec) Default (zlib) - slower, but more compression LZO - faster, but less compression
Counters Often Map/Reduce applications have countable
events For example, framework counts records in to
and out of Mapper and Reducer To define user counters:
static enum Counter { EVENT1, EVENT2 };
reporter.incrCounter(Counter.EVENT1, 1);

Define nice names in a MyClass_Counter.properties file:
CounterGroupName=MyCounters
EVENT1.name=Event 1
EVENT2.name=Event 2
Speculative execution The framework can run multiple instances of
slow tasks Output from instance that finishes first is used Controlled by the configuration variable
mapred.speculative.execution Can dramatically shorten the long tail on jobs
Zero Reduces Frequently, we only need to run a filter on the
input data No sorting or shuffling required by the job Set the number of reduces to 0 Output from maps will go directly to OutputFormat and
disk
Distributed File Cache Sometimes need read-only copies of data on
the local computer Downloading 1GB of data for each Mapper is expensive
Define list of files you need to download in JobConf
Files are downloaded once per computer Add to launching program:
DistributedCache.addCacheFile(new URI(“hdfs://nn:8020/foo”), conf);
Add to task:Path[] files = DistributedCache.getLocalCacheFiles(conf);
Tool Handle “standard” Hadoop command line options
-conf file - load a configuration file named file -D prop=value - define a single configuration property
prop Class looks like:
public class MyApp extends Configured implements Tool {
  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyApp(), args));
  }
  public int run(String[] args) throws Exception {
    …. getConf() ….
  }
}
Finding the Shortest Path A common graph search
application is finding the shortest path from a start node to one or more target nodes
Commonly done on a single machine with Dijkstra’s Algorithm
Can we use BFS to find the shortest path via MapReduce?
Finding the Shortest Path: Intuition
We can define the solution to this problem inductively DistanceTo(startNode) = 0 For all nodes n directly reachable from
startNode, DistanceTo(n) = 1 For all nodes n reachable from some other
set of nodes S, DistanceTo(n) = 1 + min(DistanceTo(m), m ∈ S)
From Intuition to Algorithm A map task receives a node n as a key,
and (D, points-to) as its value D is the distance to the node from the
start points-to is a list of nodes reachable from
n For each p ∈ points-to, emit (p, D+1) The reduce task gathers possible
distances to a given p and selects the minimum one
What This Gives Us This MapReduce task can advance the
known frontier by one hop To perform the whole BFS, a non-
MapReduce component then feeds the output of this step back into the MapReduce task for another iteration Problem: Where’d the points-to list go? Solution: Mapper emits (n, points-to) as
well
Blow-up and Termination This algorithm starts from one node Subsequent iterations include many
more nodes of the graph as the frontier advances
Does this ever terminate? Yes! Eventually, routes between nodes will
stop being discovered and no better distances will be found. When distance is the same, we stop
Mapper should emit (n,D) to ensure that “current distance” is carried into the reducer
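The iterative rounds described above (map: emit D+1 to every reachable node; reduce: keep the minimum distance) can be simulated in one process to watch the frontier advance and the loop reach a fixed point. This is a toy Java model with our own names, not Hadoop code.

```java
import java.util.*;

// Toy simulation of BFS-by-MapReduce: each pass is one "map" (emit D+1
// to every node in the points-to list) folded with one "reduce" (keep
// the minimum distance per node). Iterate until no distance improves,
// which is exactly the termination condition on the slide.
public class BfsRounds {
    static final int INF = Integer.MAX_VALUE;

    public static Map<String, Integer> shortestPaths(
            Map<String, List<String>> graph, String start) {
        Map<String, Integer> dist = new HashMap<>();
        for (String n : graph.keySet()) dist.put(n, INF);
        dist.put(start, 0);                    // DistanceTo(startNode) = 0
        boolean changed = true;
        while (changed) {                      // stop when no better distances
            changed = false;
            Map<String, Integer> next = new HashMap<>(dist);
            for (String n : graph.keySet()) {  // the "map" phase
                int d = dist.get(n);
                if (d == INF) continue;        // frontier has not reached n yet
                for (String p : graph.get(n))  // emit (p, D+1)
                    if (d + 1 < next.getOrDefault(p, INF)) {  // "reduce": min
                        next.put(p, d + 1);
                        changed = true;
                    }
            }
            dist = next;
        }
        return dist;
    }
}
```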
HADOOP SUBPROJECTS
Hadoop Related Subprojects Pig
High-level language for data analysis HBase
Table storage for semi-structured data Zookeeper
Coordinating distributed applications Hive
SQL-like Query language and Metastore Mahout
Machine learning
PIG
Original Slides by
Matei Zaharia
UC Berkeley RAD Lab
Pig Started at Yahoo! Research Now runs about 30% of Yahoo!’s jobs Features
Expresses sequences of MapReduce jobs Data model: nested “bags” of items Provides relational (SQL) operators (JOIN, GROUP BY, etc.) Easy to plug in Java functions
An Example Problem Suppose you
have user data in a file, website data in another, and you need to find the top 5 most visited pages by users aged 18-25
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
In MapReduce
In Pig Latin

Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, count(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
Ease of TranslationLoad Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load …Fltrd = filter … Pages = load …Joined = join …Grouped = group …Summed = … count()…Sorted = order …Top5 = limit …
Ease of Translation
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load …Fltrd = filter … Pages = load …Joined = join …Grouped = group …Summed = … count()…Sorted = order …Top5 = limit …
Job 1
Job 2
Job 3
HBASE
Original Slides by
Tom White
Lexeme Ltd.
HBase - What? Modeled on Google's Bigtable Row/column store Billions of rows/millions of columns Column-oriented - nulls are free Untyped - stores byte[]
HBase - Data Model

Row        | Timestamp | Column family animal:       | Column family repairs:
           |           | animal:type  | animal:size  | repairs:cost
enclosure1 | t2        | zebra        |              | 1000 EUR
enclosure1 | t1        | lion         | big          |
enclosure2 | …         | …            | …            | …
HBase - Data StorageColumn family animal:
(enclosure1, t2, animal:type)
zebra
(enclosure1, t1, animal:size)
big
(enclosure1, t1, animal:type)
lionColumn family repairs:
(enclosure1, t1, repairs:cost)
1000 EUR
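The (row, column, timestamp) → value cells listed above can be modeled with nested sorted maps, keeping newer timestamps first so a plain lookup returns the latest cell. This sketches the data model only, not the HBase API; the class and method names are ours.

```java
import java.util.*;

// Sketch of one HBase column family: row -> column -> (timestamp -> value),
// with timestamps in reverse order so firstEntry() is the newest version.
public class MiniColumnFamily {
    private final TreeMap<String, TreeMap<String, TreeMap<Long, String>>> cells =
        new TreeMap<>();

    public void put(String row, String column, long ts, String value) {
        cells.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(column, c -> new TreeMap<>(Comparator.reverseOrder()))
             .put(ts, value);
    }

    // Latest value for (row, column), or null if the cell was never written.
    public String get(String row, String column) {
        TreeMap<String, TreeMap<Long, String>> columns = cells.get(row);
        if (columns == null) return null;
        TreeMap<Long, String> versions = columns.get(column);
        return versions == null ? null : versions.firstEntry().getValue();
    }
}
```

Replaying the zoo example (lion and big at t1, then zebra at t2) makes animal:type read back as zebra, which is exactly why the slide's untyped, null-free layout is cheap to store.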
HBase - Code

HTable table = …
Text row = new Text("enclosure1");
Text col1 = new Text("animal:type");
Text col2 = new Text("animal:size");
BatchUpdate update = new BatchUpdate(row);
update.put(col1, "lion".getBytes("UTF-8"));
update.put(col2, "big".getBytes("UTF-8"));
table.commit(update);

update = new BatchUpdate(row);
update.put(col1, "zebra".getBytes("UTF-8"));
table.commit(update);
HBase - Querying Retrieve a cell
Cell = table.getRow(“enclosure1”).getColumn(“animal:type”).getValue();
Retrieve a rowRowResult = table.getRow( “enclosure1” );
Scan through a range of rowsScanner s = table.getScanner( new String[] { “animal:type” } );
HIVE
Original Slides by
Matei Zaharia
UC Berkeley RAD Lab
Hive Developed at Facebook Used for majority of Facebook jobs “Relational database” built on Hadoop
Maintains list of table schemas SQL-like query language (HiveQL) Can call Hadoop Streaming scripts from
HiveQL Supports table partitioning, clustering,
complex data types, some optimizations
Creating a Hive Table
Partitioning breaks table into separate files for each (dt, country) pairEx: /hive/page_view/dt=2008-06-08,country=USA
/hive/page_view/dt=2008-06-08,country=CA
CREATE TABLE page_views(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
A Simple Query

• Find all page views coming from xyz.com in March 2008:

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

• Hive only reads the matching date partitions instead of scanning the entire table
Aggregation and Joins

• Count users who visited each page by gender:

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

• Sample output:
Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_views.userid, page_views.date)
USING 'map_script.py'
AS dt, uid
CLUSTER BY dt
FROM page_views;
STORM
Original Slides by
Nathan Marz
Storm Developed by BackType which was acquired
by Twitter Lots of tools for data (i.e. batch) processing
Hadoop, Pig, HBase, Hive, … None of them are realtime systems which is
becoming a real requirement for businesses Storm provides realtime computation
Scalable Guarantees no data loss Extremely robust and fault-tolerant Programming language agnostic
Before Storm
Before Storm – Adding a worker
Deploy
Reconfigure/Redeploy
Problems Scaling is painful Poor fault-tolerance Coding is tedious
What we want Guaranteed data processing Horizontal scalability Fault-tolerance No intermediate message brokers! Higher level abstraction than message
passing “Just works” !!
Storm Cluster
Master node (similar to Hadoop JobTracker)
Used for cluster coordination
Run worker processes
Concepts Streams Spouts Bolts Topologies
Streams
Tuple Tuple Tuple Tuple Tuple Tuple Tuple
Unbounded sequence of tuples
Spouts
Source of streams
Bolts
Processes input streams and produces new streams:Can implement functions such as filters, aggregation, join, etc
Topology
Network of spouts and bolts
Topology
Spouts and bolts execute asmany tasks across the cluster
Stream Grouping
When a tuple is emitted which task does it go to?
Stream Grouping
• Shuffle grouping: pick a random task
• Fields grouping: consistent hashing on a subset of tuple fields
• All grouping: send to all tasks
• Global grouping: pick task with lowest id
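Fields grouping versus shuffle grouping can be sketched in plain Java. This is illustrative only, not the Storm API; the hash-mod shown here captures the key property of fields grouping (equal field values always reach the same task) without claiming to be Storm's exact hashing scheme.

```java
import java.util.Random;

// Sketch of two Storm stream groupings. Fields grouping hashes the chosen
// field so tuples with equal values are routed to the same task; shuffle
// grouping picks any task at random for load balancing.
public class Groupings {
    public static int fieldsGrouping(Object fieldValue, int numTasks) {
        // Mask keeps the index non-negative for negative hash codes.
        return (fieldValue.hashCode() & Integer.MAX_VALUE) % numTasks;
    }

    public static int shuffleGrouping(Random rng, int numTasks) {
        return rng.nextInt(numTasks);  // any task, chosen uniformly
    }
}
```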