A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud''

A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud” [1] Speakers: Vasileios Komianos, Georgios Tsoumanis, Eleni Moustaka Supervisor: Spyridon Sioutas Ionian University, Dept. of Informatics, Postgraduate For the course: Advanced Topics in Database Systems



Page 1: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud''

A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud” [1]

Speakers: Vasileios Komianos, Georgios Tsoumanis, Eleni Moustaka

Supervisor: Spyridon Sioutas

Ionian University, Dept. of Informatics, Postgraduate

For the course: Advanced Topics in Database Systems

Page 2: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud''

The focus of this presentation is a distributed architecture for indexing large datasets, from now on called the System. Hadoop, MapReduce, HBase and NoSQL databases are terms used often in this presentation, as they are the key technologies that make such tasks possible.

Page 3: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud''

Why Cloud?

• Cost

• Device and Location Independence

• Virtualization

• Performance

• Scalability

• Infrastructure as a Service

• Platform as a Service

• Software as a Service

Page 4: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud''

Why Web Scale?

• Google

• Facebook

• Wikipedia

• Amazon

• Internet Archive

Page 5: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud''

Why Distributed?

• Huge volumes of data

• Computational problems

• Failure tolerance

• Scalability

Page 6: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud''

What Hadoop[2] is

It is an open-source Java framework for distributed processing of large data sets, built on a distributed file system called HDFS [3] and the MapReduce [4] programming model.
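The MapReduce model named above can be illustrated with a minimal single-process sketch. This is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the two sample documents are assumptions made for illustration, and a real framework would run the map and reduce calls in parallel on many nodes.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit one (word, doc_id) pair per word in the document."""
    for word in text.lower().split():
        yield word, doc_id


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, doc_ids):
    """Reduce: collapse the values of one key into a sorted posting list."""
    return word, sorted(set(doc_ids))


def run_job(documents):
    """Run map, shuffle and reduce in sequence over a dict of documents."""
    pairs = [pair
             for doc_id, text in documents.items()
             for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(word, ids) for word, ids in shuffle(pairs).items())


# Hypothetical sample input: document id -> text.
documents = {"d1": "hadoop stores data", "d2": "hbase stores tables"}
index = run_job(documents)
```

Here `index["stores"]` lists both documents, while `index["hadoop"]` lists only `d1`. The map and reduce functions never share state, which is what allows a framework to distribute them.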

Page 7: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud''

Hadoop Architecture

Hadoop

• HDFS: NameNode, DataNodes

• MapReduce: JobTracker, TaskTrackers

Usually the NameNode also acts as the JobTracker, and the DataNodes also act as TaskTrackers.

Page 8: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud''

What HBase[5] is

An open-source distributed data store belonging to the well-known category of NoSQL databases. HBase can store large data sets that are structured, semi-structured or unstructured, and it also offers rapid query execution.

Page 9: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud''

HBase Architecture

HBase

• HMaster

• Region Servers

HBase runs on top of Hadoop and is modelled after Google’s Bigtable [6]. ACIDity* is sacrificed to improve performance and scalability.

*ACID: Atomicity, Consistency, Isolation and Durability

Page 10: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud''

HBase characteristics

• NoSQL

• Schema free

• Very large tables

• Scalable

• Sharding

• JSON enabled

Page 11: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud''

NoSQL Paradigm: MongoDB [7]

> db.test.insert({id: "123", name: "Vasileios"})
> db.test.insert({id: "123", name: "Vasileios"})
> db.test.insert({Presentation: "NoSQL databases"})
> db.test.find()
{ "_id" : ObjectId("4fbac827f119ef630e74638d"), "id" : "123", "name" : "Vasileios" }
{ "_id" : ObjectId("4fbac835f119ef630e74638e"), "id" : "123", "name" : "Vasileios" }
{ "_id" : ObjectId("4fbac85df119ef630e74638f"), "Presentation" : "NoSQL databases" }

MongoDB is an easy-to-use NoSQL database. It is free and supported by a large community, which makes it suitable for those with no previous NoSQL experience.

NoSQL JSON Schema free
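The shell session above shows two things: documents in one collection need not share fields, and two documents with the same user-defined `id` are both accepted because the store assigns its own `_id`. The following sketch imitates that behaviour in memory. The `Collection` class is a hypothetical stand-in written for illustration and is not the MongoDB driver API; it uses a simple counter where MongoDB generates an ObjectId.

```python
import itertools


class Collection:
    """Hypothetical in-memory stand-in for a schema-free document collection."""

    def __init__(self):
        self._docs = []
        self._next_id = itertools.count(1)  # simple counter instead of ObjectId

    def insert(self, doc):
        """Store any dict; no schema is checked. Returns the generated _id."""
        stored = {"_id": next(self._next_id), **doc}
        self._docs.append(stored)
        return stored["_id"]

    def find(self, query=None):
        """Return documents whose fields equal every field of the query."""
        query = query or {}
        return [doc for doc in self._docs
                if all(key in doc and doc[key] == value
                       for key, value in query.items())]


test = Collection()
test.insert({"id": "123", "name": "Vasileios"})
test.insert({"id": "123", "name": "Vasileios"})   # same user-level id, still accepted
test.insert({"Presentation": "NoSQL databases"})  # completely different fields
```

Calling `test.find()` returns all three documents, and `test.find({"id": "123"})` returns the two that share the user-level `id`.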

Page 12: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud''

System Architecture

Components, in order of data flow: Datasets, Uploader MapReduce task, Content table, Indexer MapReduce task, Index table, Client API (Search, Get).

Consisting of: 1 master and 11 worker nodes.

Having: 66 Mappers and 22 Reducers.

Dataset is composed of: 23GB of structured data, 300GB of semi-structured data and 20GB of unstructured data.
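The data flow through the components named on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: plain dictionaries stand in for the HBase content and index tables, ordinary functions stand in for the two MapReduce tasks, and the names `uploader`, `indexer`, `Client` and the sample records are hypothetical, not taken from the paper [1].

```python
from collections import defaultdict


def uploader(raw_records):
    """Uploader step: store each raw record in the content table under a row key."""
    return {f"row{n}": record for n, record in enumerate(raw_records, start=1)}


def indexer(content_table, indexed_attributes):
    """Indexer step: build an inverted index over the chosen attributes."""
    index_table = defaultdict(list)
    for row_key, record in content_table.items():
        for attribute in indexed_attributes:
            if attribute in record:
                index_table[(attribute, record[attribute])].append(row_key)
    return dict(index_table)


class Client:
    """Client API sketch offering the two operations named on the slide."""

    def __init__(self, content_table, index_table):
        self.content_table = content_table
        self.index_table = index_table

    def search(self, attribute, value):
        """Search: look up matching row keys in the index table."""
        return self.index_table.get((attribute, value), [])

    def get(self, row_key):
        """Get: fetch the full record from the content table."""
        return self.content_table[row_key]


# Hypothetical sample records.
raw = [
    {"title": "Hadoop", "lang": "en"},
    {"title": "HBase", "lang": "en"},
    {"title": "Ionio", "lang": "el"},
]
content = uploader(raw)
index = indexer(content, ["lang"])   # only the "lang" attribute is indexed
client = Client(content, index)
hits = client.search("lang", "en")
```

`hits` holds the row keys of the two English records, which can then be passed to `client.get`. A search on `title` returns nothing here because that attribute was left out of the index.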

Page 13: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud''

The experiment

The purpose was to test the System’s performance in various conditions such as:

• several dataset sizes,

• different dataset types,

• varying number of nodes,

• different index rules.

Page 14: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud''
Page 15: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud''
Page 16: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud''
Page 17: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud''

Index creation time

The TXT dataset requires the most processing to index.

Page 18: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud''
Page 19: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud''

5GB HTML dataset index creation time for different index rules

[Figure: index creation time in minutes (y-axis, 0 to 12) per iteration (x-axis, 1 to 4). Iterations: 1) 7 indexed tags, 2) 14, 3) 19, 4) 27.]

Page 20: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud''

5GB HTML index size for different index rules

[Figure: index size in GB (y-axis, 0 to 1.4) per iteration (x-axis, 1 to 4). Iterations: 1) 7 tags indexed (table, li, p, b, i, u, title), 2) 14 tags, 3) 19, 4) 27.]
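An index rule, as used in the two charts above, selects which HTML tags have their text indexed. The sketch below illustrates the idea with Python's standard `html.parser` module. The `TagIndexer` class, the `build_index` helper and the sample page are assumptions made for illustration; the rule format used by the System in [1] may differ.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagIndexer(HTMLParser):
    """Collects words that appear inside the tags selected by an index rule."""

    def __init__(self, indexed_tags):
        super().__init__()
        self.indexed_tags = set(indexed_tags)
        self.open_tags = []                # stack of currently open indexed tags
        self.postings = defaultdict(set)   # word -> tags it was found in

    def handle_starttag(self, tag, attrs):
        if tag in self.indexed_tags:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching tag, plus any unclosed tags inside it.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        if self.open_tags:
            for word in data.lower().split():
                self.postings[word].add(self.open_tags[-1])


def build_index(html, indexed_tags):
    """Index one page under the given rule (a collection of tag names)."""
    parser = TagIndexer(indexed_tags)
    parser.feed(html)
    parser.close()
    return {word: sorted(tags) for word, tags in parser.postings.items()}


# Hypothetical sample page.
page = ("<html><head><title>Cloud indexing</title></head><body>"
        "<h1>Results</h1><p>HBase stores <b>large</b> tables</p></body></html>")
small = build_index(page, ["title"])
large = build_index(page, ["title", "p", "b"])
```

With only `title` in the rule the index holds two words; adding `p` and `b` raises it to six, while the `h1` text stays unindexed under both rules. This mirrors the general relationship the charts examine: the more tags a rule covers, the more entries there are to create and store.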

Page 21: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud''

System performance under query load

• Client instances were run concurrently on 14 machines sending queries to the system.

• Types of queries: exact match on a specific attribute, exact match on any attribute, range query on any attribute.

• Range query loads above 140 queries/sec failed.

• Tests were run with a load of 14 queries/sec.

Response time per request:

• Exact specific queries: 20 ms.

• Exact any queries: 150 ms.

• Range any queries: 27 secs.
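The three query types can be compared with a small sketch over an in-memory index. It assumes an index keyed by (attribute, value) pairs; the function names and sample entries are hypothetical, and this is not how the System in [1] executes queries on HBase. It only shows why the amount of work grows from the first query type to the third, which is the same ordering as the response times reported above.

```python
# Hypothetical index table: (attribute, value) -> row keys.
index_table = {
    ("year", 2008): ["row1"],
    ("year", 2010): ["row2", "row3"],
    ("pages", 26): ["row1"],
    ("pages", 2010): ["row4"],
}


def exact_specific(index, attribute, value):
    """Attribute and value are both known: a single key lookup."""
    return sorted(index.get((attribute, value), []))


def exact_any(index, value):
    """The value is known but the attribute is not: one lookup per attribute."""
    attributes = {attribute for attribute, _ in index}
    rows = set()
    for attribute in attributes:
        rows.update(index.get((attribute, value), []))
    return sorted(rows)


def range_any(index, low, high):
    """Neither attribute nor exact value is fixed: every entry is scanned."""
    rows = set()
    for (_, value), row_keys in index.items():
        if low <= value <= high:
            rows.update(row_keys)
    return sorted(rows)


specific_hits = exact_specific(index_table, "year", 2010)
any_hits = exact_any(index_table, 2010)
range_hits = range_any(index_table, 2008, 2010)
```

In this sketch the exact specific query touches one key, the exact any query touches one key per attribute, and the range query visits every entry of the index.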

Page 22: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud''

References

[1] Ioannis Konstantinou, Evangelos Angelou, Dimitrios Tsoumakos and Nectarios Koziris: Distributed Indexing of Web Scale Datasets for the Cloud. In MDAC ’10, April 26, 2010 Raleigh, NC, USA.

[2] http://hadoop.apache.org/

[3] KV Shvachko: HDFS Scalability: The limits to growth. The USENIX Magazine, v35 i2, 2010. usenix.org

[4] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008), 107-113.

[5] Ankur Khetrapal, Vinay Ganesh: HBase and Hypertable for large scale distributed storage systems, Dept. of Computer Science, Purdue University

[6] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. 2008. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst. 26, 2, Article 4 (June 2008), 26 pages.

[7] http://www.mongodb.org