Post on 18-Jun-2015
Go Big Quick
Jason Scheller, Platform & Content Analytics, Eikon
Pricing & Text Analytics Platform
• Mission - Ingest, enrich, store, analyze everything. Provide a single platform for search and analytics capabilities over any hosted content. Serve as a platform for future innovation.
• Content
• Twitter (~675 Tweets/sec, 15 days history)
• News (~40 articles/sec, 18 months history)
• Research (40 million docs, 3 million/year)
• Filings (29 million docs, 2.5 million/year)
• Trade data (500k RICS, 30K/sec, 10 years)
• Various metadata and derived content sets
Infrastructure
• IBM Streams: 30 servers
• 18 servers, 86 TB
Where to start?
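[Diagram: Data is loaded into an Index with a single shard (Shard 0); Data and JMeter load are applied until the shard reaches its limit, the Max Shard]
Max Shard
• Disk space
• Request load
• RAM usage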
Maximum Shard Size
• This same experiment will also give you the ratio of data to index size, which is great for planning. Just make sure you’re using your real analyzer settings.
• The rest is just math!
• Don’t forget to account for:
• Memory required to facet & sort
• Replica shards
• Data compression
Max Total Index Size / Max Shard Size = # Nodes
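The "just math" step above can be sketched in a few lines of Python. This is a minimal sketch, not the spreadsheet from the talk: the function name `plan_cluster`, its parameters, and every number in the usage line are hypothetical, and it assumes each node hosts one max-size shard, as the formula above implies. Data compression is assumed to be already reflected in the measured data-to-index ratio; memory for faceting and sorting is not modeled.

```python
import math


def plan_cluster(raw_data_gb, index_ratio, max_shard_gb, replicas=1):
    """Estimate shard and node counts from the max-shard experiment.

    raw_data_gb  -- size of the source data (hypothetical input)
    index_ratio  -- index size / data size, measured with real analyzer settings
    max_shard_gb -- largest shard one node handled in the experiment
    replicas     -- replica copies per primary shard
    """
    # Index size for the primary copies only.
    primary_index_gb = raw_data_gb * index_ratio
    # Every replica is a full extra copy of the index.
    total_index_gb = primary_index_gb * (1 + replicas)
    # Number of primary shards needed to stay under the max shard size.
    primary_shards = math.ceil(primary_index_gb / max_shard_gb)
    # Max Total Index Size / Max Shard Size = # Nodes
    # (assumes one max-size shard per node).
    nodes = math.ceil(total_index_gb / max_shard_gb)
    return {
        "total_index_gb": total_index_gb,
        "primary_shards": primary_shards,
        "nodes": nodes,
    }


# Hypothetical numbers: 10 TB of data, index half the size of the data,
# 50 GB max shard, one replica.
print(plan_cluster(10_000, 0.5, 50, replicas=1))
```

Changing `replicas` or `max_shard_gb` shows quickly how sensitive the node count is to each assumption.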
SPREADSHEET
But do I always use Max Shards?
ALLOCATION & HARDWARE
Cluster Allocation
• Elasticsearch will figure out which node should host which shard. Let it! It's better than you at figuring this out and moving shards around.
• Well, mostly…
• Let’s say you have indices A – D, 4 shards each, 0 replicas, 4 nodes. Elasticsearch might arrange your shards like this based on the size of each shard.
[Diagram: the 16 shards A1–A4, B1–B4, C1–C4, and D1–D4 spread across the 4 nodes]
Cluster Allocation
• But what about other considerations?
• Hot spotting
• Access frequency
• Connectivity for River-based ingestion
• Heterogeneous hardware
Cluster Allocation – Heterogeneous Hardware
• Suppose you know that indices A and B get queried 1000s of times per second, but C and D are only hit about once a second. Maybe you bought some better hardware to host A and B and don't want to waste those machines on C and D.
• Is this a good allocation?
[Diagram: four nodes (Slow HW, Slow HW, Fast HW, Fast HW) with the shards of A, B, C, and D mixed across all of them]
• Not really. The slower machines will slow all queries to A & B. And I’m not getting my money’s worth from that better hardware!
Cluster Allocation – Heterogeneous Hardware
• Wouldn't this be better?
• Shard allocation settings allow us to “control” which nodes host which indices without ever specifying specific machines or IPs.
[Diagram: four nodes (Slow HW, Slow HW, Fast HW, Fast HW); the C and D shards sit on the Slow HW nodes and the A and B shards on the Fast HW nodes]
Cluster Allocation – Heterogeneous Hardware
• Node Settings (Slow HW nodes): node.hardware: slow
• Node Settings (Fast HW nodes): node.hardware: fast
• Index Settings (A & B): index.routing.allocation.require.hardware: fast
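As a rough illustration of how the index-level setting above might be applied, the sketch below builds the request for the Elasticsearch update-index-settings endpoint (`PUT /<index>/_settings`) without sending it. The helper name `allocation_request` and the index names `a` and `b` are hypothetical stand-ins for the slide's A and B. The attribute name `hardware` follows the slide; newer Elasticsearch versions declare custom node attributes as `node.attr.<name>` instead, so check the convention for your version.

```python
import json


def allocation_request(index_name, hardware):
    """Build (method, path, body) that pins an index to tagged nodes.

    The request is only constructed here, not sent; pass it to the
    HTTP client of your choice.
    """
    body = json.dumps(
        {"index.routing.allocation.require.hardware": hardware}
    )
    return "PUT", f"/{index_name}/_settings", body


# Pin the heavily queried indices to nodes started with node.hardware: fast.
for name in ("a", "b"):
    method, path, body = allocation_request(name, "fast")
    print(method, path, body)
```

Indices C and D get no such setting, so Elasticsearch remains free to place their shards on any node.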
Cluster Allocation – Heterogeneous Hardware
[Diagram: four nodes (Slow HW, Fast HW, Fast HW, Fast HW) with the shards of A, B, C, and D distributed across them]
• Is this ok? …Sure, why not?!
Cluster Allocation – Archive Example
• We can use the same feature for large data sets of a time-based feed. Say we keep an index for all news ever. People are generally searching the most recent 12 months, not the last 30 years.
[Diagram: a large cluster of many Slow HW nodes plus eight Fast HW nodes]
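One way to picture the archive idea is a small routine that decides which hardware tag each slice of the feed should require. This is only a sketch under assumptions the slides do not state: it assumes the news feed is split into monthly indices named `news-YYYY-MM` (a hypothetical naming scheme), that the most recent 12 months are the "hot" window, and that the `fast` and `slow` tags match the node settings shown earlier.

```python
from datetime import date


def months_between(older, newer):
    """Whole calendar months from `older` to `newer`."""
    return (newer.year - older.year) * 12 + (newer.month - older.month)


def hardware_for(index_name, today, hot_months=12):
    """Return the hardware tag a monthly index should require.

    Indices younger than `hot_months` go to fast nodes, the rest to slow.
    """
    # "news-2015-06" -> year "2015", month "06"
    year, month = index_name.rsplit("-", 2)[-2:]
    age = months_between(date(int(year), int(month), 1), today)
    return "fast" if age < hot_months else "slow"


today = date(2015, 6, 18)
for name in ("news-2015-06", "news-2014-07", "news-2014-06", "news-1990-01"):
    print(name, hardware_for(name, today))
```

Run periodically, the result could feed the `index.routing.allocation.require.hardware` setting of each monthly index, so older months drift onto the slow machines as they age out of the hot window.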