Real-Time Big Data at In-Memory Speed, Using Storm
![Page 1: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/1.jpg)
Real Time Big Data With Storm, Cassandra, and In-Memory Computing
Nati Shalom @natishalom
DeWayne Filppi @dfilppi
![Page 2: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/2.jpg)
Introduction to Real Time Analytics
Homeland Security
Real Time Search
Social
eCommerce
User Tracking & Engagement
Financial Services
® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
![Page 3: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/3.jpg)
® Copyright 2013 Gigaspaces Ltd. All Rights Reserved
The Two Vs of Big Data
Velocity Volume
![Page 4: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/4.jpg)
The Flavors of Big Data Analytics
Counting Correlating Research
![Page 5: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/5.jpg)
It’s All about Timing
• Event driven / stream processing
  • High resolution – every tweet gets counted
• Ad-hoc querying
  • Medium resolution (aggregations)
• Long running batch jobs (ETL, map/reduce)
  • Low resolution (trends & patterns)
This is what we’re here to discuss
![Page 6: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/6.jpg)
Facebook & Twitter Real Time Analytics
![Page 7: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/7.jpg)
FACEBOOK REAL-TIME ANALYTICS SYSTEM
(LOGGING CENTRIC APPROACH)
![Page 8: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/8.jpg)
The actual analytics..
Like button analytics
Comments box analytics
![Page 9: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/9.jpg)
Facebook architecture..
(Diagram: Facebook log streams, Scribe, HDFS, PTail, Puma, HBase; paths labeled Real Time, Long Term and Batch; 1.5 sec)
10,000 writes/sec per server
![Page 10: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/10.jpg)
TWITTER REAL-TIME ANALYTICS SYSTEM
(EVENT DRIVEN APPROACH)
![Page 11: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/11.jpg)
URL Mentions – Here’s One Use Case
![Page 12: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/12.jpg)
Twitter Real Time Analytics based on Storm
![Page 13: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/13.jpg)
Comparing the two approaches..
Facebook
• Relies on Hadoop for both real time and batch
• RT = 10’s of seconds
• Suited to simple processing
• Low parallelization
Twitter
• Uses Hadoop for batch and Storm for real time
• RT = msec to sec
• Suited to complex processing
• Extremely parallel
This is what we’re here to discuss
![Page 14: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/14.jpg)
Introduction to Storm
![Page 15: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/15.jpg)
Popular open source, real time, in-memory, streaming computation platform.
Includes distributed runtime and intuitive API for defining distributed processing flows.
Scalable and fault tolerant. Developed at BackType, and open sourced by Twitter
Storm Background
![Page 16: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/16.jpg)
Storm Cluster
![Page 17: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/17.jpg)
Storm Concepts
• Streams: unbounded sequence of tuples
• Spouts: source of streams (queues)
• Bolts: functions, filters, joins, aggregations
• Topologies
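The concepts above can be sketched in plain Python, without the Storm API. In this minimal sketch a spout is modeled as a generator of tuples, a bolt as a function from one tuple to zero or more tuples, and a topology as the wiring between them. All names (`sentence_spout`, `split_bolt`, `long_word_filter_bolt`, `run_topology`) are hypothetical and only illustrate the data flow; real Storm runs these components distributed and in parallel.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical stand-ins for Storm concepts; this is not the Storm API.
StormTuple = Tuple[str, ...]
Bolt = Callable[[StormTuple], Iterable[StormTuple]]


def sentence_spout(sentences: List[str]) -> Iterator[StormTuple]:
    """Spout: a source that emits a stream of tuples (here, from a list)."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(tup: StormTuple) -> Iterable[StormTuple]:
    """Bolt used as a function: one sentence in, one tuple per word out."""
    for word in tup[0].split():
        yield (word,)


def long_word_filter_bolt(tup: StormTuple) -> Iterable[StormTuple]:
    """Bolt used as a filter: pass only words longer than three characters."""
    if len(tup[0]) > 3:
        yield tup


def apply_bolt(bolt: Bolt, stream: Iterable[StormTuple]) -> Iterator[StormTuple]:
    """Feed every tuple of a stream through one bolt."""
    for tup in stream:
        yield from bolt(tup)


def run_topology(spout: Iterable[StormTuple], bolts: List[Bolt]) -> List[StormTuple]:
    """Topology: chain the spout through each bolt in order."""
    stream: Iterable[StormTuple] = spout
    for bolt in bolts:
        stream = apply_bolt(bolt, stream)
    return list(stream)
```

For example, `run_topology(sentence_spout(["storm is fast"]), [split_bolt, long_word_filter_bolt])` wires one spout to two bolts in sequence.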
![Page 18: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/18.jpg)
Challenge – Word Count
Word:Count
Tweets
Count?
• Hottest topics
• URL mentions
• etc.
![Page 19: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/19.jpg)
Streaming word count with Storm
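A streaming word count can be sketched as a split step followed by several parallel counting tasks. This is a plain-Python illustration, not Storm code: `CountTask`, `word_count` and `merged_counts` are hypothetical names, and the routing by a hash of the word is an assumption that mimics grouping a stream by field so that the same word always reaches the same task.

```python
import zlib
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def split_sentences(sentences: Iterable[str]) -> Iterator[str]:
    """Split step: emit one lowercase word per whitespace-separated token."""
    for sentence in sentences:
        for word in sentence.lower().split():
            yield word


class CountTask:
    """One parallel instance of a counting bolt, holding its own counters."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def execute(self, word: str) -> Tuple[str, int]:
        self.counts[word] += 1
        return word, self.counts[word]  # the running Word:Count pair


def word_count(sentences: Iterable[str], parallelism: int = 3) -> List[CountTask]:
    """Route each word to the task selected by a hash of the word."""
    tasks = [CountTask() for _ in range(parallelism)]
    for word in split_sentences(sentences):
        # crc32 is stable across runs, unlike Python's built-in str hash.
        index = zlib.crc32(word.encode("utf-8")) % parallelism
        tasks[index].execute(word)
    return tasks


def merged_counts(tasks: List[CountTask]) -> Counter:
    """Combine the per-task counters for inspection."""
    total: Counter = Counter()
    for task in tasks:
        total.update(task.counts)
    return total
```

Because each word is owned by exactly one task, the tasks never need to coordinate on a shared counter.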
![Page 20: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/20.jpg)
Computing Reach with Event Streams
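As a rough illustration, and assuming the usual definition of reach (the number of distinct users exposed to a URL, that is, the distinct followers of everyone who tweeted it), the computation can be sketched as below. The function name and the two lookup tables are hypothetical; the slide itself does not spell out the definition.

```python
from typing import Dict, Iterable, Set


def reach(
    url: str,
    tweeters_by_url: Dict[str, Iterable[str]],
    followers_by_user: Dict[str, Iterable[str]],
) -> int:
    """Count distinct followers of all users who tweeted the given URL."""
    exposed: Set[str] = set()
    for tweeter in tweeters_by_url.get(url, ()):
        # A follower of several tweeters is only counted once.
        exposed.update(followers_by_user.get(tweeter, ()))
    return len(exposed)
```

In a streaming setting the two lookups and the distinct step would be spread across parallel tasks; here they run in one process for clarity.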
![Page 21: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/21.jpg)
But where is my
Big Data?
![Page 22: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/22.jpg)
The Big Picture …
(Diagram: data feeds (Kafka, Twitter, ..) carrying Twitter feeds and web activity; Storm with a spout and bolts; Cassandra, MongoDB, HBase, .. holding analytics data, research data, counters and reference data; the whole path is labeled End to End Latency)
![Page 23: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/23.jpg)
Storm performance and reliability
• Assumes success is normal
• Uses batching and pipelining for performance
Storm plug-ins have a significant effect on performance and reliability
• Spout must be able to replay tuples on demand in case of error
Storm uses topology semantics for ensuring consistency through event ordering
• Can be tedious for handling counters
• Doesn’t ensure the state of the counters
You’re only as strong as your weakest link
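Why replay makes counters tedious can be shown with a small sketch. It assumes batches carry increasing ids and arrive in order, and that a failed batch is replayed with the same id. `NaiveCounter` and `TxCounter` are hypothetical classes, not Storm or Trident APIs; `TxCounter` stores the id of the last applied batch next to each count so a replay can be detected and skipped.

```python
from collections import Counter
from typing import Dict, List, Tuple


class NaiveCounter:
    """Adds every batch it sees, so a replayed batch is counted twice."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def apply(self, batch_id: int, words: List[str]) -> None:
        self.counts.update(words)


class TxCounter:
    """Keeps (count, last_batch_id) per word and ignores replayed batches."""

    def __init__(self) -> None:
        self.state: Dict[str, Tuple[int, int]] = {}

    def apply(self, batch_id: int, words: List[str]) -> None:
        for word, increment in Counter(words).items():
            count, last_id = self.state.get(word, (0, -1))
            if batch_id <= last_id:
                continue  # this batch was already applied to this word
            self.state[word] = (count + increment, batch_id)

    def count(self, word: str) -> int:
        return self.state.get(word, (0, -1))[0]
```

The extra bookkeeping only works if the ordering assumption holds, which is why the guarantee depends on every link in the chain, including the spout and the state store.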
![Page 24: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/24.jpg)
Typical user experience…
Now, Kafka is *fast*. When running the Kafka Spout by itself, I easily reproduced Kafka's claim that you can consume "hundreds of thousands of messages per second".
When I first fired up the topology, things went well for the first minute, but then quickly crashed as the Kafka spout emitted too fast for the Cassandra Bolt to keep up. Even though Cassandra is fast as well, it is still orders of magnitude slower than Kafka.
Source: A Big Data Trifecta: Storm, Kafka and Cassandra. Brian O'Neill's blog
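The mismatch described in the quote can be illustrated with a small tick-based simulation. This is a toy model under stated assumptions (fixed emit and drain rates, a single queue), and `simulate` is a hypothetical function. The `max_pending` parameter plays a role similar to Storm's `topology.max.spout.pending` setting, which caps the number of un-acked tuples per spout task.

```python
from typing import Optional


def simulate(
    ticks: int,
    emit_per_tick: int,
    drain_per_tick: int,
    max_pending: Optional[int] = None,
) -> int:
    """Return the peak number of in-flight tuples between spout and bolt."""
    pending = 0
    peak = 0
    for _ in range(ticks):
        emit = emit_per_tick
        if max_pending is not None:
            # Throttle the spout so in-flight tuples never exceed the cap.
            emit = min(emit, max_pending - pending)
        pending += emit
        peak = max(peak, pending)
        # The slow bolt removes at most drain_per_tick tuples per tick.
        pending -= min(pending, drain_per_tick)
    return peak
```

Without a cap the backlog grows linearly with time until memory runs out; with a cap the spout is slowed to the pace of the slowest bolt.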
![Page 25: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/25.jpg)
What if we could put everything In Memory?
An Alternative Approach
![Page 26: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/26.jpg)
Did you know?
Facebook keeps 80% of its data in Memory (Stanford research)
RAM is 100-1000x faster than Disk (Random seek)
• Disk: 5-10 ms
• RAM: ~0.001 ms
![Page 27: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/27.jpg)
In Memory Data Grid Review
• RAM is the new disk
• Data partitioned across a cluster
• Large “virtual” memory space
• Transactional
• Highly available
• Code with Data
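Two of these properties, partitioning and running code with the data, can be sketched in a few lines. This is a single-process illustration with hypothetical `Partition` and `DataGrid` classes; it does not model transactions or high availability, and it is not the API of any particular data grid product.

```python
import zlib
from typing import Any, Callable, Dict, List, Optional


class Partition:
    """One node's share of the data, held in memory."""

    def __init__(self) -> None:
        self.data: Dict[str, Any] = {}

    def execute(self, key: str, fn: Callable[[Optional[Any]], Any]) -> Any:
        """Run fn next to the data and store its result under the same key."""
        self.data[key] = fn(self.data.get(key))
        return self.data[key]


class DataGrid:
    """Routes each key to one partition by a stable hash of the key."""

    def __init__(self, partitions: int = 4) -> None:
        self.partitions: List[Partition] = [Partition() for _ in range(partitions)]

    def _route(self, key: str) -> Partition:
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        return self.partitions[index]

    def put(self, key: str, value: Any) -> None:
        self._route(key).data[key] = value

    def get(self, key: str) -> Optional[Any]:
        return self._route(key).data.get(key)

    def execute(self, key: str, fn: Callable[[Optional[Any]], Any]) -> Any:
        # Ship the function to the owning partition instead of moving the data.
        return self._route(key).execute(key, fn)
```

For a counter, `grid.execute("storm", lambda v: (v or 0) + 1)` increments the value where it lives, avoiding a read and a write over the network.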
![Page 28: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/28.jpg)
Integrating with Storm
(Diagram: web activity feeds; In Memory Data Stream (via Storm Spout plug-in); a spout and bolts; In Memory Data Grid (via Storm Trident State plug-in) holding analytics data, research data, counters and reference data)
![Page 29: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/29.jpg)
In Memory Streaming Word Count with Storm
Storm has a simple builder interface for creating stream processing topologies
Storm delegates persistence to external providers
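Delegating persistence means the counting logic only talks to an abstract state interface, and the provider behind it can be swapped. The sketch below is plain Python with hypothetical names (`BackingMap`, `InMemoryBackingMap`, `count_batch`); the batched `multi_get` / `multi_put` shape is an assumption chosen to resemble how batch-oriented state backends are commonly structured, not a copy of the Storm API.

```python
from abc import ABC, abstractmethod
from collections import Counter
from typing import Dict, List, Optional


class BackingMap(ABC):
    """What the counting logic needs from any persistence provider."""

    @abstractmethod
    def multi_get(self, keys: List[str]) -> List[Optional[int]]:
        ...

    @abstractmethod
    def multi_put(self, keys: List[str], values: List[int]) -> None:
        ...


class InMemoryBackingMap(BackingMap):
    """Provider that keeps state in a local dict (stand-in for a data grid)."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}

    def multi_get(self, keys: List[str]) -> List[Optional[int]]:
        return [self._data.get(key) for key in keys]

    def multi_put(self, keys: List[str], values: List[int]) -> None:
        for key, value in zip(keys, values):
            self._data[key] = value


def count_batch(state: BackingMap, words: List[str]) -> None:
    """Apply one batch of words with a single read and a single write."""
    increments = Counter(words)
    keys = list(increments)
    current = state.multi_get(keys)
    updated = [(old or 0) + increments[key] for key, old in zip(keys, current)]
    state.multi_put(keys, updated)
```

Swapping `InMemoryBackingMap` for a provider backed by a database or a data grid would not change `count_batch`.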
![Page 30: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/30.jpg)
Integrating with Hadoop, NoSQL DB..
(Diagram: web activity feeds; In Memory Data Stream with Storm plug-in; a spout and bolts; In Memory Data Grid holding analytics data, research data, counters and reference data; Hadoop, NoSQL, RDBMS, … behind the grid, connected by Write Behind with an LRU based policy)
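Write-behind with an LRU policy can be sketched as follows, assuming a plain dict stands in for the Hadoop, NoSQL or RDBMS store. `WriteBehindCache` is a hypothetical class: writes land in memory first and reach the store only on `flush()` or when a dirty entry is evicted, and the least recently used entry is evicted once capacity is exceeded.

```python
from collections import OrderedDict
from typing import Any, Dict, Optional, Set


class WriteBehindCache:
    def __init__(self, store: Dict[str, Any], capacity: int) -> None:
        self.store = store            # stand-in for the slower backing store
        self.capacity = capacity
        self.cache: "OrderedDict[str, Any]" = OrderedDict()  # most recent last
        self.dirty: Set[str] = set()  # keys written but not yet persisted

    def put(self, key: str, value: Any) -> None:
        """Write to memory only; persistence is deferred."""
        self.cache[key] = value
        self.cache.move_to_end(key)
        self.dirty.add(key)
        self._evict()

    def get(self, key: str) -> Optional[Any]:
        if key in self.cache:
            self.cache.move_to_end(key)
            return self.cache[key]
        value = self.store.get(key)   # cache miss: load from the store
        if value is not None:
            self.cache[key] = value
            self._evict()
        return value

    def flush(self) -> int:
        """Persist all dirty entries; return how many were written."""
        for key in self.dirty:
            self.store[key] = self.cache[key]
        written = len(self.dirty)
        self.dirty.clear()
        return written

    def _evict(self) -> None:
        while len(self.cache) > self.capacity:
            key, value = self.cache.popitem(last=False)  # least recently used
            if key in self.dirty:
                self.store[key] = value  # never drop an unpersisted write
                self.dirty.discard(key)
```

In a real deployment `flush()` would run asynchronously in batches, so the stream processing path never waits on the slower store.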
![Page 31: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/31.jpg)
Live Demo – Word Count At In Memory Speed
![Page 32: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/32.jpg)
Recent Benchmarks..
Gresham Computing plc achieved over 50,000 equity trade transactions per second of load and match into a database.
![Page 33: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/33.jpg)
![Page 34: Real-Time Big Data at In-Memory Speed, Using Storm](https://reader035.fdocuments.net/reader035/viewer/2022062418/554a0ecbb4c90507558b4ad1/html5/thumbnails/34.jpg)
References
Try the Cloudify recipe
• Download Cloudify: http://www.cloudifysource.org/
• Download the recipe (apps/xapstream, services/xapstream): https://github.com/CloudifySource/cloudify-recipes
• XAP – Cassandra interface details: http://wiki.gigaspaces.com/wiki/display/XAP95/Cassandra+Space+Persistency
• Check out the source for the XAP Spout, a sample state implementation backed by XAP, and a Storm-friendly streaming implementation on GitHub: https://github.com/Gigaspaces/storm-integration
• For more background on the effort, check out my recent blog posts at http://blog.gigaspaces.com/
  • http://blog.gigaspaces.com/gigaspaces-and-storm-part-1-storm-clouds/
  • http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/
  • Part 3 coming soon.