A discussion on “Distributed Indexing of Web Scale Datasets for the Cloud” [1]

Speakers: Vasileios Komianos, Georgios Tsoumanis, Eleni Moustaka

Supervisor: Spyridon Sioutas

Ionian University, Dept. of Informatics, Postgraduate

For the course: Advanced Topics in Database Systems

The focus of this presentation is a distributed architecture for indexing large datasets, referred to from now on as the System. Hadoop, MapReduce, HBase and NoSQL databases are terms used often in this presentation, as they are the keystone technologies enabling such tasks.

Why Cloud?

• Cost

• Device and Location Independence

• Virtualization

• Performance

• Scalability

• Infrastructure as a Service

• Platform as a Service

• Software as a Service

Why Web Scale?

• Google

• Facebook

• Wikipedia

• Amazon

• Internet Archive

Why Distributed?

• Huge volumes of data

• Computational problems

• Failure tolerance

• Scalability

What Hadoop[2] is

It is an open-source Java framework capable of distributed processing of large data sets, using a distributed file system called HDFS [3] and the MapReduce [4] model.

Hadoop Architecture

• HDFS: NameNode, DataNodes

• MapReduce: JobTracker, TaskTrackers

Usually the NameNode also acts as the JobTracker, and the DataNodes also act as TaskTrackers.
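The MapReduce model mentioned above can be illustrated with a minimal in-memory sketch. This is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical, and everything runs in one process rather than across TaskTrackers.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit one (word, 1) pair per word, as a mapper would."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


def run_job(documents):
    # Each document stands in for an input split handled by one map task.
    emitted = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(emitted)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = run_job(["web scale data", "web scale indexing"])
```

In a real cluster the map and reduce calls would run on different nodes and the shuffle would move data over the network; the sketch only shows the data flow.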

What HBase[5] is

An open-source distributed data store belonging to the category of NoSQL databases. HBase can store large data sets that are structured, semi-structured or unstructured, and also offers rapid query execution.

HBase Architecture

• HMaster

• Region Servers

HBase runs on top of Hadoop and is modelled after Google’s Bigtable [6]. ACIDity* is sacrificed to improve performance and scalability.

*ACID: Atomicity, Consistency, Isolation and Durability
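Since HBase is modelled after Bigtable, its tables can be pictured as a sparse, sorted map from row key and column to timestamped values. The following is a toy sketch of that idea only; the class `SparseTable` and its methods are hypothetical and are not the HBase client API.

```python
import bisect


class SparseTable:
    """Toy Bigtable-style table: row key -> column -> {timestamp: value}."""

    def __init__(self):
        self._rows = {}          # row key -> {"family:qualifier": {ts: value}}
        self._sorted_keys = []   # row keys kept in lexicographic order

    def put(self, row, column, value, ts):
        if row not in self._rows:
            bisect.insort(self._sorted_keys, row)
            self._rows[row] = {}
        self._rows[row].setdefault(column, {})[ts] = value

    def get(self, row, column):
        """Return the most recent version of a cell, or None if absent."""
        versions = self._rows.get(row, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start, stop):
        """Return the row keys in [start, stop), in sorted order."""
        lo = bisect.bisect_left(self._sorted_keys, start)
        hi = bisect.bisect_left(self._sorted_keys, stop)
        return self._sorted_keys[lo:hi]


table = SparseTable()
table.put("doc2", "content:title", "Cloud", ts=1)
table.put("doc1", "content:title", "Hadoop", ts=1)
table.put("doc1", "content:title", "Hadoop 2", ts=2)
table.put("doc1", "meta:lang", "en", ts=1)
```

Rows need not share columns (here `doc2` has no `meta:lang` cell), and keeping row keys sorted is what makes contiguous key ranges cheap to scan and to split across Region Servers.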

HBase characteristics

• NoSQL

• Schema free

• Very large tables

• Scalable

• Sharding

• JSON enabled

NoSQL Paradigm: MongoDB [7]

> db.test.insert({id: "123", name: "Vasileios"})
> db.test.insert({id: "123", name: "Vasileios"})
> db.test.insert({Presentation: "NoSQL databases"})
> db.test.find()
{ "_id" : ObjectId("4fbac827f119ef630e74638d"), "id" : "123", "name" : "Vasileios" }
{ "_id" : ObjectId("4fbac835f119ef630e74638e"), "id" : "123", "name" : "Vasileios" }
{ "_id" : ObjectId("4fbac85df119ef630e74638f"), "Presentation" : "NoSQL databases" }

MongoDB is an easy-to-use NoSQL database. It is free and supported by a large community, which makes it suitable for those with no previous NoSQL experience.

• NoSQL

• JSON

• Schema free
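The shell session above shows what "schema free" means in practice: documents in one collection may have different fields, and only the generated `_id` is unique. A rough sketch of that behaviour follows; the `Collection` class is hypothetical and uses an integer counter where MongoDB generates an ObjectId.

```python
import itertools


class Collection:
    """Toy schema-free collection: documents are plain dicts."""

    def __init__(self):
        self._docs = []
        self._next_id = itertools.count(1)   # stand-in for ObjectId generation

    def insert(self, doc):
        # Store a copy with a generated _id; no schema is checked.
        stored = {"_id": next(self._next_id), **doc}
        self._docs.append(stored)
        return stored["_id"]

    def find(self, query=None):
        """Return documents whose fields match every key in query."""
        query = query or {}
        return [d for d in self._docs
                if all(d.get(k) == v for k, v in query.items())]


test = Collection()
test.insert({"id": "123", "name": "Vasileios"})
test.insert({"id": "123", "name": "Vasileios"})
test.insert({"Presentation": "NoSQL databases"})
```

As in the session, the two documents with the same user-supplied `id` are both stored, each under its own `_id`.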

System Architecture

• Datasets

• Uploader (MapReduce task)

• Content table

• Indexer (MapReduce task)

• Index table

• Client API: Search, Get

The cluster consists of 1 master and 11 worker nodes, with 66 Mappers and 22 Reducers.

The dataset is composed of:

• 23GB of structured data,

• 300GB of semi-structured data,

• 20GB of unstructured data.
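The flow from datasets to Content table to Index table to Client API can be sketched in a few lines. This is an illustration under assumptions: the slides do not give the table schemas, so the row ids, the `(attribute, value)` index key and all function names below are hypothetical.

```python
from collections import defaultdict


def uploader(raw_records):
    """Build the content table: row id -> {attribute: value}."""
    return {f"row{i}": dict(record) for i, record in enumerate(raw_records)}


def indexer_map(row_id, record):
    """Emit one ((attribute, value), row_id) pair per cell of a record."""
    for attribute, value in record.items():
        yield (attribute, value), row_id


def indexer(content_table):
    """Build the index table: (attribute, value) -> sorted list of row ids."""
    index = defaultdict(list)
    for row_id, record in content_table.items():
        for key, rid in indexer_map(row_id, record):
            index[key].append(rid)   # plays the role of the reduce step
    return {key: sorted(ids) for key, ids in index.items()}


def search(index_table, attribute, value):
    """Client API: return ids of rows whose attribute equals value."""
    return index_table.get((attribute, value), [])


def get(content_table, row_id):
    """Client API: fetch the full record for a row id."""
    return content_table.get(row_id)


content = uploader([
    {"title": "hadoop", "lang": "en"},
    {"title": "hbase", "lang": "en"},
])
index = indexer(content)
hits = search(index, "lang", "en")
first = get(content, hits[0])
```

Search returns row ids from the Index table and Get resolves them against the Content table. If the 66 Mappers and 22 Reducers are spread evenly over the 11 workers (an assumption; the slides do not say), that is 6 map and 2 reduce slots per worker.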

The experiment

The purpose was to test the System’s performance under various conditions, such as:

• several dataset sizes,

• different dataset types,

• varying number of nodes,

• different index rules.

Index creation time

The TXT dataset is the most processing-intensive to index.

[Figure: index creation time for the 5GB HTML dataset under different index rules, by iteration number]

[Figure: index size for the 5GB HTML dataset under different index rules, by iteration number]

Iterations: 1) 7 indexed tags (table, li, p, b, i, u, title), 2) 14 tags, 3) 19 tags, 4) 27 tags.
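An index rule of this kind, "index only the text inside these tags", might look like the sketch below. It uses Python's standard `html.parser`; the class `TagIndexer` and the function `index_terms` are hypothetical and are not the paper's implementation.

```python
from html.parser import HTMLParser


class TagIndexer(HTMLParser):
    """Collect words that appear directly inside a tag listed in the rule."""

    def __init__(self, indexed_tags):
        super().__init__()
        self.indexed_tags = set(indexed_tags)
        self.open_tags = []   # stack of currently open tags
        self.terms = []       # (tag, word) pairs selected by the rule

    def handle_starttag(self, tag, attrs):
        # Sketch only: void elements such as <br> are not special-cased.
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if not self.open_tags:
            return
        innermost = self.open_tags[-1]
        if innermost in self.indexed_tags:
            for word in data.split():
                self.terms.append((innermost, word.lower()))


def index_terms(html, indexed_tags):
    parser = TagIndexer(indexed_tags)
    parser.feed(html)
    parser.close()
    return parser.terms


PAGE = ("<html><title>Web Scale</title><body>"
        "<p>Cloud <b>index</b></p><h1>Intro</h1></body></html>")
narrow = index_terms(PAGE, ["title"])
wide = index_terms(PAGE, ["title", "p", "b"])
```

On the same page, the wider rule yields more index entries than the narrow one, which is the effect the four iterations vary.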

System performance under query load

• Client instances were run concurrently on 14 machines sending queries to the system.

• Types of queries: exact match on a specific attribute, exact match on any attribute, range on any attribute.

• Range query loads above 140 queries/sec failed.

• Tests were run with a load of 14 queries/sec.

Response time per request:

• Exact specific queries: 20 ms.

• Exact any queries: 150 ms.

• Range any queries: 27 s.
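The three query types can be told apart with a small sketch over a value-sorted list of index entries. The entries, the layout and the function names are hypothetical; the sketch shows only what each query selects and does not model the System's timings.

```python
import bisect

# Hypothetical index entries: (attribute, value, row id), sorted by value.
ENTRIES = sorted(
    [
        ("title", "hadoop", "row0"),
        ("title", "hbase", "row1"),
        ("author", "hadoop", "row2"),
        ("title", "mongodb", "row3"),
    ],
    key=lambda entry: entry[1],
)
VALUES = [entry[1] for entry in ENTRIES]


def _between(low, high):
    """Entries with low <= value <= high, located by binary search."""
    lo = bisect.bisect_left(VALUES, low)
    hi = bisect.bisect_right(VALUES, high)
    return ENTRIES[lo:hi]


def exact_specific(attribute, value):
    """Exact match on one named attribute."""
    return [row for attr, _, row in _between(value, value) if attr == attribute]


def exact_any(value):
    """Exact match on the value, whatever attribute holds it."""
    return [row for _, _, row in _between(value, value)]


def range_any(low, high):
    """All rows with low <= value <= high, on any attribute."""
    return [row for _, _, row in _between(low, high)]
```

For example, `exact_specific("title", "hadoop")` selects one row, `exact_any("hadoop")` selects two, and `range_any("hadoop", "hbase")` selects three: each query type in turn touches at least as many entries as the previous one.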

References

[1] Ioannis Konstantinou, Evangelos Angelou, Dimitrios Tsoumakos and Nectarios Koziris: Distributed Indexing of Web Scale Datasets for the Cloud. In MDAC ’10, April 26, 2010 Raleigh, NC, USA.

[2] http://hadoop.apache.org/

[3] K. V. Shvachko: HDFS Scalability: The Limits to Growth. The USENIX Magazine, vol. 35, no. 2, 2010.

[4] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM 51, 1 (January 2008), 107-113.

[5] Ankur Khetrapal, Vinay Ganesh: HBase and Hypertable for large scale distributed storage systems, Dept. of Computer Science, Purdue University

[6] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. 2008. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst. 26, 2, Article 4 (June 2008), 26 pages.

[7] http://www.mongodb.org
