Real time analytics using Hadoop and Elasticsearch

Transcript of Real time analytics using Hadoop and Elasticsearch

Page 1: Real time analytics using Hadoop and Elasticsearch

Real time analytics using Hadoop and Elasticsearch

by ABHISHEK ANDHAVARAPU

Page 2: Real time analytics using Hadoop and Elasticsearch

Thank you Sponsors!

Page 3: Real time analytics using Hadoop and Elasticsearch

About Me

• Currently working as a Software Engineer (Data Platform) at Allegiance Software Inc.

• Passion for distributed systems and data visualization.

• Master's in Distributed Systems.

• abhishek376.wordpress.com

Page 4: Real time analytics using Hadoop and Elasticsearch

Agenda

Use Case.

Architecture.

Elasticsearch 101.

Demo.

Lessons learnt.

Page 5: Real time analytics using Hadoop and Elasticsearch

Legacy Architecture

Page 6: Real time analytics using Hadoop and Elasticsearch

Current Architecture

Page 7: Real time analytics using Hadoop and Elasticsearch

Why Hadoop?

Page 8: Real time analytics using Hadoop and Elasticsearch

Elasticsearch 101

• Document-oriented, JSON-based search engine with Apache Lucene under the covers.

• Schema free.

• It's distributed and supports aggregations, similar to SQL GROUP BY.

• Uses bit sets to cache filters efficiently.

• It's fast. Super fast.

• It has REST and Java-based APIs.
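
To make the "aggregations similar to group by" point concrete, here is a minimal sketch. The sample documents and the aggregation name by_last_name are made up for illustration; the request body follows the standard terms aggregation shape, and the Counter is only a local stand-in for what the cluster would compute.

```python
from collections import Counter

# Hypothetical sample documents, shaped like the "person" example in this deck.
people = [
    {"first_name": "Abhishek", "last_name": "Andhavarapu"},
    {"first_name": "Abhi", "last_name": "Andhavarapu"},
    {"first_name": "Jane", "last_name": "Doe"},
]

# Request body for a terms aggregation: one bucket per distinct last_name,
# roughly SELECT last_name, COUNT(*) ... GROUP BY last_name.
# "size": 0 asks for the buckets only, without the matching documents.
terms_agg_request = {
    "size": 0,
    "aggs": {"by_last_name": {"terms": {"field": "last_name"}}},
}


def group_by_count(docs, field):
    """Local stand-in for the counts a terms aggregation returns."""
    return Counter(doc[field] for doc in docs)


counts = group_by_count(people, "last_name")
```

The request body is what would be sent to the index's _search endpoint; the local function only shows the shape of the answer (one count per distinct value).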

Page 9: Real time analytics using Hadoop and Elasticsearch

Elasticsearch CRUD

Index a person:

curl -XPUT 'localhost:9200/person/1' -d '{
  "first_name" : "Abhishek",
  "last_name" : "Andhavarapu"
}'

Get a person:

curl -XGET 'localhost:9200/person/1'

Delete a person:

curl -XDELETE 'localhost:9200/person/1'

Update a person:

curl -XPOST 'localhost:9200/person/1/_update' -d '{
  "doc" : {
    "first_name" : "Abhi"
  }
}'

Page 10: Real time analytics using Hadoop and Elasticsearch

Elasticsearch data

[Diagram: two nodes, Node1 and Node2, holding shards S0 and S1.]

Page 11: Real time analytics using Hadoop and Elasticsearch

Replicas

[Diagram: Node1 and Node2 each hold a copy of shards S0 and S1. Red = primary, blue = replica.]

Page 12: Real time analytics using Hadoop and Elasticsearch

More nodes

[Diagram: four nodes, Node1 to Node4, each holding one copy of shard S0 or S1. Red = primary, blue = replica.]

Page 13: Real time analytics using Hadoop and Elasticsearch

Node down

[Diagram: the same four-node cluster (Node1 to Node4, one copy of S0 or S1 each) with one node down. Red = primary, blue = replica.]

Page 14: Real time analytics using Hadoop and Elasticsearch

Node down

[Diagram: Node1, Node3 and Node4 remain. A surviving copy of S1 is promoted to primary and S1 is re-replicated to another node. Red = primary, blue = replica.]
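
The node-down sequence (promote a replica, then re-replicate) can be sketched as a toy simulation. This is not Elasticsearch's allocation code: the starting layout, the fail_node helper, and the "least loaded node" placement rule are all assumptions made for illustration.

```python
def fail_node(cluster, dead):
    """Remove a node, promote surviving replicas, then re-replicate.

    cluster maps node name -> {shard name: "primary" | "replica"}.
    """
    lost = cluster.pop(dead)
    for shard, role in lost.items():
        holders = [n for n, shards in cluster.items() if shard in shards]
        # A lost primary is replaced by promoting a surviving replica.
        if role == "primary" and holders:
            cluster[holders[0]][shard] = "primary"
        # Restore the copy count by placing a new replica on a node that
        # does not already hold this shard (least loaded node first).
        spare = [n for n in cluster if shard not in cluster[n]]
        if holders and spare:
            target = min(spare, key=lambda n: len(cluster[n]))
            cluster[target][shard] = "replica"
    return cluster


# Hypothetical four-node layout with one copy of S0 or S1 per node.
cluster = {
    "Node1": {"S0": "primary"},
    "Node2": {"S1": "primary"},
    "Node3": {"S1": "replica"},
    "Node4": {"S0": "replica"},
}
fail_node(cluster, "Node2")
```

After the call, Node3's copy of S1 is the primary and a new S1 replica has been placed on one of the remaining nodes, so every shard is back to two copies.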

Page 15: Real time analytics using Hadoop and Elasticsearch

Elasticsearch 101

• Lucene is under the covers.

• Each index (like a database) is made up of multiple shards (each a Lucene instance).

• Shards are distributed among all nodes in the cluster.

• When a node fails or new nodes are added, shards are automatically moved from one node to another.
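
A minimal sketch of how a document ends up on a shard and a shard on a node. Elasticsearch picks the shard as hash(routing value) modulo the number of primary shards; CRC32 below is only a stand-in for its internal hash function, and the round-robin placement is a simplification of the real allocator. Both function names are hypothetical.

```python
import zlib


def shard_for(doc_id: str, num_primary_shards: int) -> int:
    # Routing rule: hash(routing value) % number_of_primary_shards.
    # CRC32 is only a stand-in here for the hash Elasticsearch uses.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def assign_round_robin(num_shards, nodes):
    # Simplified placement: spread primary shards over the nodes in turn.
    return {f"S{i}": nodes[i % len(nodes)] for i in range(num_shards)}


placement = assign_round_robin(2, ["Node1", "Node2"])
shard = shard_for("1", num_primary_shards=2)
```

Because the shard number depends on the number of primary shards, that number is fixed when the index is created; replicas can be added later without changing where a document routes.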

Page 16: Real time analytics using Hadoop and Elasticsearch

How is it fast?

Distributed execution

[Diagram: a client sends a query to a two-node cluster; Node 1 and Node 2 each hold a copy of shards S0 and S1. Red = primary, blue = replica.]
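
Distributed execution is a scatter-gather: each shard computes its own top hits and the results are merged. The sketch below assumes a toy scoring rule (term frequency) and made-up documents; a real cluster queries the shards in parallel, while this version loops over them for simplicity.

```python
import heapq


def search_shard(shard_docs, term, k):
    """Score documents on one shard and return its local top k."""
    hits = [(doc["text"].count(term), doc["id"]) for doc in shard_docs]
    return heapq.nlargest(k, (h for h in hits if h[0] > 0))


def search(shards, term, k=2):
    """Scatter the query to every shard, then merge the partial results."""
    partial = [search_shard(docs, term, k) for docs in shards.values()]
    merged = heapq.nlargest(k, (hit for hits in partial for hit in hits))
    return [doc_id for _, doc_id in merged]


# Hypothetical documents spread over two shards.
shards = {
    "S0": [{"id": "a", "text": "hadoop hive hadoop"}, {"id": "b", "text": "hive"}],
    "S1": [{"id": "c", "text": "hadoop"}, {"id": "d", "text": "hadoop hadoop hadoop"}],
}
top = search(shards, "hadoop")
```

Here top is ["d", "a"]: each shard only ships its best k hits, so the merge step stays cheap no matter how many documents a shard holds.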

Page 17: Real time analytics using Hadoop and Elasticsearch

DEMO

• Import data from a SQL database into Hive. (Extract)

• Run the necessary computations using Hadoop/Hive. (Transform)

• Push the data into Elasticsearch. (Load)

• Run queries against Elasticsearch.
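
The slides do not say how the load step was implemented. One common option is the _bulk endpoint, which takes newline-delimited JSON: an action line followed by the document, with a trailing newline at the end. The sketch below only builds that body; the rows, the index name survey_summary, and the helper to_bulk_body are hypothetical.

```python
import json


def to_bulk_body(rows, index, id_field):
    """Build a newline-delimited body for the Elasticsearch _bulk endpoint."""
    lines = []
    for row in rows:
        # Action line: index this document under the given index and id.
        lines.append(json.dumps({"index": {"_index": index, "_id": row[id_field]}}))
        # Source line: the document itself.
        lines.append(json.dumps(row))
    # The bulk format requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical rows, standing in for the output of the Hive transform step.
rows = [
    {"id": 1, "region": "west", "responses": 120},
    {"id": 2, "region": "east", "responses": 95},
]
body = to_bulk_body(rows, index="survey_summary", id_field="id")
```

Sending rows in batches like this avoids one HTTP round trip per document, which matters when the transform step produces millions of rows.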

Page 18: Real time analytics using Hadoop and Elasticsearch

Current Elasticsearch Cluster

• 9 bare metal boxes

• 128 GB RAM

• 2X SSD

• 10 Gb Ethernet

• 2X 10 core Xeon Processors

• 2X 30GB Elasticsearch instances per box

• 1 Elasticsearch load balancing instance to handle index requests
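
A quick back-of-the-envelope check of the numbers above. It assumes the "30GB" figure is the JVM heap of each instance, which the slide does not state explicitly, and it leaves the load balancing instance out of the count.

```python
boxes = 9
ram_per_box_gb = 128
instances_per_box = 2
heap_per_instance_gb = 30  # assumption: "30GB" on the slide means JVM heap

heap_per_box_gb = instances_per_box * heap_per_instance_gb   # heap used per box
left_for_os_gb = ram_per_box_gb - heap_per_box_gb            # RAM left outside the JVMs
data_instances = boxes * instances_per_box                   # instances in the cluster
total_heap_gb = data_instances * heap_per_instance_gb        # heap across the cluster
```

Under that assumption each box gives 60 GB to the two JVMs and keeps 68 GB for the operating system and its file system cache, with 18 instances and 540 GB of heap across the cluster.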

Page 19: Real time analytics using Hadoop and Elasticsearch

Zabbix

What's slow?

Any request that takes more than 300 ms is considered slow.

Page 20: Real time analytics using Hadoop and Elasticsearch

Lessons Learnt

Page 21: Real time analytics using Hadoop and Elasticsearch

Concurrency

• More replicas for more concurrency. Updates are costly.

• More shards, much faster.

• SQL 3 to 5k per minute

Page 22: Real time analytics using Hadoop and Elasticsearch

Filter Cache

• All filters have a cache flag that controls whether they are cached.

• Once the filter cache is warmed, requests are served from memory.

• Default: 10% of the heap for the filter cache.

• LRU eviction.

• Cached as bit sets.
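
A toy version of the idea: each cached filter is a bit set with one bit per document, entries are evicted in LRU order, and combining filters is a bitwise AND. The FilterCache class, the cache keys, and the sample documents are hypothetical; Python ints stand in for the bit sets.

```python
from collections import OrderedDict


class FilterCache:
    """LRU cache of filter results stored as bit sets (Python ints)."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.entries = OrderedDict()
        self.misses = 0

    def bitset(self, key, docs, predicate):
        if key in self.entries:
            self.entries.move_to_end(key)      # mark as most recently used
            return self.entries[key]
        self.misses += 1
        bits = 0
        for doc_id, doc in enumerate(docs):
            if predicate(doc):
                bits |= 1 << doc_id            # bit i set = doc i matches
        self.entries[key] = bits
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)   # evict least recently used
        return bits


docs = [
    {"region": "west", "year": 2014},
    {"region": "east", "year": 2014},
    {"region": "west", "year": 2013},
]
cache = FilterCache(max_entries=2)
west = cache.bitset("region:west", docs, lambda d: d["region"] == "west")
y2014 = cache.bitset("year:2014", docs, lambda d: d["year"] == 2014)
both = west & y2014  # combining cached filters is a bitwise AND
cache.bitset("region:west", docs, lambda d: d["region"] == "west")  # cache hit
```

The third lookup does not touch the documents at all, which is the "served from memory" effect once the cache is warm. It also shows why heavy updates hurt: a changed document set invalidates the bit sets.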

Page 23: Real time analytics using Hadoop and Elasticsearch

Field Data

• For sorting, aggregations, etc., all the field values are loaded into memory; this is called field data.

• By default it is unbounded.

• It is expensive to build, so it is recommended to hold it in memory.

• There are circuit breakers to protect against this.

• If a query is going to use more than 60% of the JVM heap, it will be killed.
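
The circuit breaker idea can be sketched as a check made before field data is loaded: estimate the size, and refuse if the total would pass the limit. The function name, the exception class, and the byte figures are hypothetical; only the 60% threshold comes from the slide.

```python
GB = 1024 ** 3


class CircuitBreakingError(Exception):
    """Raised instead of letting the load run the JVM out of memory."""


def check_fielddata(estimated_bytes, used_bytes, heap_bytes, limit_percent=60):
    """Reject a load that would push field data past limit_percent of the heap."""
    budget = heap_bytes * limit_percent // 100
    if used_bytes + estimated_bytes > budget:
        raise CircuitBreakingError(
            f"would use {used_bytes + estimated_bytes} bytes, limit is {budget}"
        )
    return used_bytes + estimated_bytes


# With a 30 GB heap the budget is 18 GB: 10 GB in use plus 5 GB more is fine.
used = check_fielddata(5 * GB, 10 * GB, 30 * GB)
```

Failing one query early is the trade being made here: the request is killed, but the node stays up and keeps serving everything else.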

Page 24: Real time analytics using Hadoop and Elasticsearch

JVM memory - Friend or Foe?

Once a node is down, the remaining nodes, which are still serving requests, have to re-replicate its shards, causing additional heap pressure.

Page 25: Real time analytics using Hadoop and Elasticsearch

Getting Bad

Solution?

More memory.

Not necessarily more boxes.

Page 26: Real time analytics using Hadoop and Elasticsearch

Elasticsearch Cons

• Not commodity hardware: 6K (Hadoop) vs 10K (SSD).

• GC issues.

• Circuit breakers don't protect you against everything.

• No built-in security. Use an nginx proxy with authentication.

• Learning curve.

• A lot of updates hurt: the filter cache has to be rebuilt, segments have to be merged, etc.

Page 27: Real time analytics using Hadoop and Elasticsearch

Thank you

• abhishek376.wordpress.com

[email protected]

• Twitter: abhishek376

We are hiring!