Real time analytics using Hadoop and Elasticsearch
by Abhishek Andhavarapu

Thank you Sponsors!
About Me
• Currently working as a Software Engineer (Data Platform) at Allegiance Software Inc.
• Passionate about distributed systems and data visualization.
• Master's degree in distributed systems.
• abhishek376.wordpress.com
Agenda
Use Case.
Architecture.
Elasticsearch 101.
Demo.
Lessons learnt.
Legacy Architecture
Current Architecture
Why Hadoop?
Elasticsearch 101
• Document-oriented search engine: JSON-based, Apache Lucene under the covers.
• Schema free.
• It's distributed and supports aggregations similar to GROUP BY.
• Uses bit sets to cache filter results efficiently.
• It's fast. Super fast.
• It has REST and Java APIs.
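The "aggregations similar to GROUP BY" point can be sketched in plain Python (this is an analogy, not the Elasticsearch API): a terms aggregation buckets documents by a field value and counts them, much like SQL's GROUP BY ... COUNT(*). The documents below are made up for illustration.

```python
from collections import Counter

# Hypothetical documents, as Elasticsearch would store them (JSON-like dicts).
docs = [
    {"first_name": "Abhishek", "city": "Salt Lake City"},
    {"first_name": "Jane", "city": "Denver"},
    {"first_name": "John", "city": "Denver"},
]

def terms_aggregation(documents, field):
    """Count documents per value of `field` -- what a terms aggregation
    (or SQL's GROUP BY ... COUNT(*)) computes."""
    return Counter(doc[field] for doc in documents)

buckets = terms_aggregation(docs, "city")
print(buckets)  # Counter({'Denver': 2, 'Salt Lake City': 1})
```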
Elasticsearch CRUD

Index a person (paths follow the {index}/{type}/{id} layout; "people" and "person" are example names):

curl -XPUT 'localhost:9200/people/person/1' -d '{
  "first_name" : "Abhishek",
  "last_name" : "Andhavarapu"
}'

Get a person:

curl -XGET 'localhost:9200/people/person/1'

Delete a person:

curl -XDELETE 'localhost:9200/people/person/1'

Update a person:

curl -XPOST 'localhost:9200/people/person/1/_update' -d '{
  "doc" : {
    "first_name" : "Abhi"
  }
}'
Elasticsearch data

[Diagram: an index split into two shards, S0 and S1, distributed across Node1 and Node2. Red marks primary shards, blue marks replicas; each shard's primary sits on one node and its replica on the other.]

More nodes..

[Diagram: Node3 and Node4 join the cluster, and the primaries and replicas of S0 and S1 are rebalanced across all four nodes.]

Node down

[Diagram: a node holding a primary fails. The shard's replica on a surviving node is promoted to primary, and the lost copies are re-replicated to the remaining nodes.]
Elasticsearch 101
• Lucene is under the covers.
• Each index (like a database) is made up of multiple shards (each shard is a Lucene instance).
• Shards are distributed amongst all nodes in the cluster.
• In case of a failure or the addition of new nodes, shards are automatically moved from one node to another.
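Routing a document to a shard can be sketched as a hash of the routing key (the document id by default) modulo the number of primary shards. The exact hash function varies by Elasticsearch version; MD5 stands in here so the sketch is deterministic.

```python
import hashlib

NUM_SHARDS = 2  # fixed at index creation time, which is why it cannot change later

def shard_for(doc_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick the shard for a document: hash(routing_key) % number_of_shards.
    Elasticsearch uses its own hash function; MD5 stands in here."""
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return h % num_shards

# Every node can compute this, so any node can route an index or get request
# to the right shard without a central coordinator.
print(shard_for("1"), shard_for("2"))
```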
How is it Fast?
Distributed execution

[Diagram: a client sends a query to the cluster; the query is executed in parallel against one copy (primary in red or replica in blue) of each shard, S0 and S1, on Node 1 and Node 2.]
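The distributed execution picture corresponds to a scatter-gather pattern: the node that receives the query forwards it to one copy of every shard, each shard evaluates it locally, and the coordinating node merges the partial results. A minimal sketch (the shard contents and scoring here are made up for illustration):

```python
# Each shard holds a slice of the index; a query fans out to all shards.
shards = {
    "S0": [{"name": "Abhishek", "score": 9}, {"name": "Jane", "score": 4}],
    "S1": [{"name": "John", "score": 7}],
}

def query_shard(docs, min_score):
    """Phase 1 (scatter): each shard evaluates the query locally."""
    return [d for d in docs if d["score"] >= min_score]

def search(min_score, top_n=2):
    """Phase 2 (gather): the coordinating node merges and re-sorts shard hits."""
    hits = []
    for docs in shards.values():
        hits.extend(query_shard(docs, min_score))
    return sorted(hits, key=lambda d: d["score"], reverse=True)[:top_n]

print(search(min_score=5))  # [{'name': 'Abhishek', 'score': 9}, {'name': 'John', 'score': 7}]
```

Because replicas can serve reads as well as primaries, adding replicas spreads this fan-out over more machines.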
DEMO
• Import data from a SQL database into Hive. (Extract)
• Run the necessary computations using Hadoop/Hive. (Transform)
• Push the data into Elasticsearch. (Load)
• Run queries against Elasticsearch.
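The Load step typically goes through Elasticsearch's bulk API (or a connector such as es-hadoop from Hive). A sketch of building the newline-delimited body that `POST /_bulk` expects; the index name, type, and row schema below are hypothetical:

```python
import json

# Rows as they might come out of the Hive transform step (hypothetical schema).
rows = [
    {"id": 1, "region": "west", "revenue": 1200},
    {"id": 2, "region": "east", "revenue": 800},
]

def to_bulk_body(rows, index="analytics", doc_type="result"):
    """Build the NDJSON body for the _bulk endpoint: an action line
    followed by the document source, one pair per row."""
    lines = []
    for row in rows:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type, "_id": row["id"]}}))
        lines.append(json.dumps(row))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline

body = to_bulk_body(rows)
# This body would then be sent with something like:
#   curl -XPOST 'localhost:9200/_bulk' --data-binary @body.json
print(body)
```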
Current Elasticsearch Cluster
• 9 bare-metal boxes
• 128 GB RAM
• 2x SSD
• 10 Gb Ethernet
• 2x 10-core Xeon processors
• 2x 30 GB Elasticsearch instances per box
• 1 Elasticsearch load-balancing instance to handle index requests
Zabbix
What's slow?
Any request that takes more than 300 ms is slow.
Lessons Learnt
Concurrency
• More replication for more concurrency. Updates are costly.
• More shards: much faster.
• SQL: 3 to 5k per minute.
Filter Cache
• All filters have a cache flag that controls whether or not they are cached.
• Once the filter cache is warmed, all requests are served from memory.
• Default: 10% of the heap for the filter cache.
• LRU eviction.
• Stored as bit sets.
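The filter-cache behavior can be modeled in a few lines: a cached filter result is a bit set with one bit per document, combining two filters is a bitwise AND, and the cache evicts least-recently-used entries. This is a simplified model, not Elasticsearch's actual implementation; the filter keys and document ids are made up.

```python
from collections import OrderedDict

NUM_DOCS = 8  # documents 0..7 in one segment

class FilterCache:
    """LRU cache of filter results, each stored as a bit set (int bitmap)."""
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.cache = OrderedDict()  # filter key -> bitmap

    def get(self, key, compute):
        if key in self.cache:
            self.cache.move_to_end(key)         # mark as recently used
        else:
            self.cache[key] = compute()         # warm the cache on first use
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict the LRU entry
        return self.cache[key]

def matching_docs(doc_ids):
    """Bit set with one bit per matching document."""
    bits = 0
    for d in doc_ids:
        bits |= 1 << d
    return bits

cache = FilterCache(capacity=2)
status_ok = cache.get("status:ok", lambda: matching_docs([0, 2, 5, 7]))
recent    = cache.get("date:2014", lambda: matching_docs([2, 3, 7]))
# Combining two warmed filters is a single bitwise AND over the bit sets.
both = status_ok & recent
print([d for d in range(NUM_DOCS) if both >> d & 1])  # [2, 7]
```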
Field Data
• For sorting, aggregations, etc., all the field values are loaded into memory; this is called field data.
• By default it's unbounded.
• Expensive to build; it's recommended to hold it in memory.
• There are circuit breakers to protect against this.
• If a query would use more than 60% of the JVM heap, the breaker kills the query.
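The field-data circuit breaker described above can be modeled as: before loading field values, estimate the memory they would need and reject the request if it would push usage past the limit (around 60% of the heap in the version discussed). A toy sketch under those assumptions, not the actual implementation:

```python
HEAP_BYTES = 1_000_000  # pretend JVM heap size
BREAKER_LIMIT = 0.6     # field data breaker limit, as a fraction of heap

class CircuitBreakerError(Exception):
    pass

class FieldDataBreaker:
    def __init__(self, heap=HEAP_BYTES, limit=BREAKER_LIMIT):
        self.limit_bytes = int(heap * limit)
        self.used = 0

    def add_estimate(self, estimated_bytes):
        """Trip (kill the query) if loading this field data would exceed the limit."""
        if self.used + estimated_bytes > self.limit_bytes:
            raise CircuitBreakerError("field data would exceed breaker limit")
        self.used += estimated_bytes

breaker = FieldDataBreaker()
breaker.add_estimate(500_000)      # fine: 50% of heap
try:
    breaker.add_estimate(200_000)  # would reach 70% -> the query is rejected
except CircuitBreakerError as e:
    print("tripped:", e)
```

The key design point is that the estimate happens before the allocation, so a single expensive query fails fast instead of taking the whole node down with an OutOfMemoryError.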
JVM memory: Friend or Foe?
When a node goes down, the other nodes, which are still serving requests, must re-replicate its shards, causing additional heap pressure.
Getting Bad
Solution?
More memory.
Not necessarily more boxes.
Elasticsearch Cons
• Not commodity hardware: 6K (Hadoop) vs 10K (SSD).
• GC issues.
• Circuit breakers don't protect you against everything.
• No built-in security. Use an nginx proxy with authentication.
• Learning curve.
• Lots of updates hurt: the filter cache has to be rebuilt, segments must be merged, etc.
Thank you
• abhishek376.wordpress.com
• Twitter: abhishek376

We are Hiring!!