Massive Semantic Web data compression with MapReduce


Massive Semantic Web data compression with MapReduce
Jacopo Urbani, Jason Maassen, Henri Bal
Vrije Universiteit, Amsterdam
HPDC (High Performance Distributed Computing) 2010

20 June 2014, SNU IDB Lab.
Lee, Inhoe

Outline

Introduction
Conventional Approach
MapReduce Data Compression
MapReduce Data Decompression
Evaluation
Conclusions

Introduction

Semantic Web: an extension of the current World Wide Web

Information = a set of statements
Each statement = three different terms: subject, predicate, and object
– Example: <http://www.vu.nl> <rdf:type> <dbpedia:University>

Introduction

The terms consist of long strings
– Most Semantic Web applications compress the statements to save space and increase performance

The standard technique used to compress the data is dictionary encoding

Motivation

The amount of Semantic Web data is steadily growing
Compressing many billions of statements becomes more and more time-consuming
– A fast and scalable compression technique is crucial

This work presents a technique to compress and decompress Semantic Web statements using the MapReduce programming model
– It allowed us to reason directly on the compressed statements, with a consequent increase in performance [1, 2]


Conventional Approach: Dictionary Encoding

– Compress data
– Decompress data
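As a rough illustration, here is a minimal single-machine sketch of dictionary encoding (not code from the paper; the function names and example data are made up): every distinct term is mapped to a numerical ID, so each statement becomes a triple of integers, and decompression is simply the reverse lookup.

```python
# Minimal single-machine sketch of dictionary encoding (illustrative only).
def compress(statements):
    dictionary, encoded = {}, []
    for stmt in statements:  # stmt = (subject, predicate, object)
        encoded.append(tuple(dictionary.setdefault(t, len(dictionary)) for t in stmt))
    return dictionary, encoded

def decompress(dictionary, encoded):
    reverse = {i: t for t, i in dictionary.items()}
    return [tuple(reverse[i] for i in stmt) for stmt in encoded]

# Example with the statement from the introduction:
d, enc = compress([("<http://www.vu.nl>", "<rdf:type>", "<dbpedia:University>")])
# enc == [(0, 1, 2)]; decompress(d, enc) restores the original statement
```

On a single machine this is trivial; the point of the paper is doing it for many billions of statements, which is where the MapReduce jobs below come in.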



MapReduce Data Compression

Job 1: identifies the popular terms and assigns them a numerical ID

Job 2: deconstructs the statements, builds the dictionary table, and replaces all terms with the corresponding numerical ID

Job 3: reads the numerical terms and reconstructs the statements in their compressed form

Job 1: caching of popular terms

Identify the most popular terms and assign them a numerical ID
– Count the occurrences of the terms
– Select the subset of the most popular ones
– Randomly sample the input, so only a fraction of the data has to be counted
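The idea can be sketched in a few lines of Python (a simplified illustration, not the authors' Hadoop job; SAMPLE_RATE and CACHE_SIZE are made-up parameters):

```python
import random
from collections import Counter

SAMPLE_RATE = 0.01   # fraction of the input inspected (illustrative value)
CACHE_SIZE = 100     # number of popular terms to cache (illustrative value)

def map_sample(statements):
    """Map: emit (term, 1) for the terms of randomly sampled statements."""
    for subj, pred, obj in statements:
        if random.random() < SAMPLE_RATE:
            for term in (subj, pred, obj):
                yield term, 1

def reduce_popular(pairs):
    """Reduce: sum the counts and assign small numerical IDs to the most popular terms."""
    counts = Counter()
    for term, one in pairs:
        counts[term] += one
    return {term: i for i, (term, _) in enumerate(counts.most_common(CACHE_SIZE))}
```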


Job 2: deconstruct statements

Deconstruct the statements and compress the terms with a numerical ID
Before the map phase starts, the popular terms are loaded into main memory
The map function reads the statements and assigns each of them a numerical ID
– Since the map tasks are executed in parallel, the numerical range of the IDs is partitioned so that each task is allowed to assign only a specific range of numbers
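A minimal Python sketch of the shape of this job, based on the slide text (not the authors' Hadoop implementation; RANGE_SIZE, the cache contents, and the exact ID layout are illustrative assumptions):

```python
RANGE_SIZE = 1_000_000_000        # illustrative width of each task's ID range
POPULAR_IDS = {"<rdf:type>": 0}   # popular-term cache from Job 1, loaded before the map phase

def map_task(task_id, statements):
    """Deconstruct statements: emit (term, (statement_id, position)).
    Each map task draws statement IDs from its own disjoint range."""
    next_stmt_id = task_id * RANGE_SIZE
    for subj, pred, obj in statements:
        for position, term in enumerate((subj, pred, obj)):
            yield term, (next_stmt_id, position)
        next_stmt_id += 1

def reduce_task(task_id, grouped):
    """Assign each term an ID (building the dictionary table) and produce the
    numerical records (statement_id, position, term_id) consumed by Job 3."""
    dictionary, encoded = {}, []
    next_term_id = (task_id + 1) * RANGE_SIZE   # IDs below RANGE_SIZE reserved for popular terms
    for term, occurrences in grouped:
        term_id = POPULAR_IDS.get(term)         # popular terms keep their cached ID
        if term_id is None:
            term_id, next_term_id = next_term_id, next_term_id + 1
        dictionary[term_id] = term
        for stmt_id, position in occurrences:
            encoded.append((stmt_id, position, term_id))
    return dictionary, encoded
```

The popular-term cache from Job 1 presumably exists so that frequent terms can be handled cheaply in the map phase instead of being shuffled to the reducers; the sketch above handles all terms in the reducer only for brevity.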


Job 3: reconstruct statements

Read the previous job's output and reconstruct the statements using the numerical IDs
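Continuing the sketch above (same illustrative record layout, not the authors' code): the records from Job 2 are grouped by statement ID and each group is emitted as a triple of numerical IDs.

```python
from collections import defaultdict

def reconstruct(encoded_records):
    """encoded_records: (statement_id, position, term_id) triples from Job 2."""
    grouped = defaultdict(dict)        # in MapReduce, the shuffle phase does this grouping
    for stmt_id, position, term_id in encoded_records:
        grouped[stmt_id][position] = term_id
    for terms in grouped.values():
        # positions 0, 1, 2 correspond to subject, predicate, object
        yield (terms[0], terms[1], terms[2])
```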



MapReduce Data Decompression

Join between the compressed statements and the dictionary table

Job 1: identifies the popular terms
Job 2: performs the join between the popular resources and the dictionary table
Job 3: deconstructs the statements and decompresses the terms, performing a join on the input
Job 4: reconstructs the statements in the original format

Example dictionary entries used in the Job 3 join with the compressed input:
(20, www.cyworld.com), (21, www.snu.ac.kr), …, (113, www.hotmail.com), (114, mail)
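The core of the decompression jobs can be sketched as a pair of joins (a simplified in-memory illustration, not the authors' Hadoop jobs; the record layout and the toy data follow the compression sketches above):

```python
from collections import defaultdict

def join_with_dictionary(encoded_records, dictionary):
    """Job 3 (sketch): join on the term ID, replacing each ID with its term.
    encoded_records: (statement_id, position, term_id); dictionary: {term_id: term}."""
    for stmt_id, position, term_id in encoded_records:
        yield stmt_id, position, dictionary[term_id]

def reconstruct_statements(joined_records):
    """Job 4 (sketch): group by statement ID and rebuild (subject, predicate, object)."""
    grouped = defaultdict(dict)
    for stmt_id, position, term in joined_records:
        grouped[stmt_id][position] = term
    for terms in grouped.values():
        yield (terms[0], terms[1], terms[2])

# Toy usage with entries like those above (hypothetical statement and IDs):
dictionary = {20: "www.cyworld.com", 21: "www.snu.ac.kr", 113: "www.hotmail.com"}
records = [(7, 0, 20), (7, 1, 113), (7, 2, 21)]
original = list(reconstruct_statements(join_with_dictionary(records, dictionary)))
```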


Evaluation

Environment
– 32 nodes of the DAS-3 cluster to set up our Hadoop framework

Each node:
– two dual-core 2.4 GHz AMD Opteron CPUs
– 4 GB main memory
– 250 GB storage

Results

The throughput of the compression algorithm is higher for larger datasets than for smaller ones
– the technique is more efficient on larger inputs, where the computation is not dominated by the platform overhead

Decompression is slower than compression

Results

The beneficial effects of the popular-terms cache


Results: Scalability
– Different input sizes
– Varying the number of nodes


Conclusions

Proposed a technique to compress Semantic Web statements using the MapReduce programming model

Evaluated the performance by measuring the runtime
– More efficient for larger inputs

Tested the scalability
– The compression algorithm scales more efficiently than decompression

A major contribution to solving this crucial problem in the Semantic Web

References

[1] J. Urbani, S. Kotoulas, J. Maassen, F. van Harmelen, and H. Bal. OWL reasoning with MapReduce: calculating the closure of 100 billion triples. Under submission, 2010.

[2] J. Urbani, S. Kotoulas, E. Oren, and F. van Harmelen. Scalable distributed reasoning using MapReduce. In Proceedings of ISWC '09, 2009.

Outline

Introduction
Conventional Approach
MapReduce Data Compression
– Job 1: caching of popular terms
– Job 2: deconstruct statements
– Job 3: reconstruct statements
MapReduce Data Decompression
– Job 2: join with dictionary table
– Job 3: join with compressed input
Evaluation
– Runtime
– Scalability
Conclusions

Conventional Approach: Dictionary Encoding (example)

Input: ABABBABCABABBA
Output: 124523461