Post on 07-Aug-2015
Big Data Analysis in Java World
by Serhiy Masyutin
Agenda
The Big Data Problem
Map-Reduce
MPP-based Analytical Database
In-Memory Data Grid
Real-Life Project
Q&A
The Big Data Problem
http://www.datameer.com/images/product/big_data_hadoop/img_bigdata.png
The Big Data Problem
Map-Reduce vs. MPP AD vs. IMDG
When do I need it?
  Map-Reduce: In an hour
  MPP AD: In a minute
  IMDG: Now
What do I need to do with it?
  Map-Reduce: Exploratory analytics
  MPP AD: Structured analytics
  IMDG: Singular event processing (some analytics), transactions
How will I query and search?
  Map-Reduce: Unstructured
  MPP AD: Ad hoc SQL
  IMDG: Structured
How do I need to store it?
  Map-Reduce: I do, but not required to
  MPP AD: I must and I am required to
  IMDG: Temporarily
Where is it coming from?
  Map-Reduce: File/ETL
  MPP AD: File/ETL
  IMDG: Event/Stream/File/ETL
http://blog.pivotal.io/pivotal/products/exploring-big-data-solutions-when-to-use-hadoop-vs-in-memory-vs-mpp
The Big Data Problem
Map-Reduce
MPP AD IMDG
Transactions
Customer records
Geo-spatial
Sensors
Social Media
XML, JSON
Raw Logs
Text
Image
Video
more processing
http://blog.pivotal.io/big-data-pivotal/products/exploratory-data-science-when-to-use-an-mpp-database-sql-on-hadoop-or-map-reduce
The Big Data Problem
Data is not Information
- Clifford Stoll
Map-Reduce
http://jeremykun.files.wordpress.com/2014/10/mapreduceimage.gif?w=1800
Map-Reduce
https://anonymousbi.files.wordpress.com/2012/11/hadoopdiagram.png
Map-Reduce
http://hadoop.apache.org/docs/r1.2.1/images/hdfsarchitecture.gif
Map-Reduce
Volume: Medium-Large
Variety: Unstructured data
Velocity: Batch processing
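The map, shuffle, and reduce phases pictured on the Map-Reduce slides can be illustrated with a minimal single-JVM sketch. This is not the Hadoop API: the class `MapReduceSketch` and its methods are hypothetical names chosen for illustration, and word counting is only an assumed example workload. In a real cluster the map and reduce calls would run in parallel on different nodes.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceSketch {

    // Map phase: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleImmutableEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse all values of one key into a single result.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three phases in sequence over the whole input.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {big=2, data=2, in=1, java=1}
        System.out.println(wordCount(Arrays.asList("big data", "Big Data in Java")));
    }
}
```

Because each phase only sees its own input (one line, one key group), the work can be split across machines, which is what makes the model suit batch processing of large unstructured inputs.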
MPP Analytical Database
http://www.ndm.net/datawarehouse/images/stories/greenplum/gp-dia-3-0.png
MPP Analytical Database
http://my.vertica.com/docs/7.1.x/HTML/Content/Resources/Images/K-SafetyServerDiagram.png
MPP Analytical Database
http://my.vertica.com/docs/7.1.x/HTML/Content/Resources/Images/K-SafetyServerDiagramOneNodeDown.png
MPP Analytical Database
http://my.vertica.com/docs/7.1.x/HTML/Content/Resources/Images/K-SafetyServerDiagramTwoNodesDown.png
MPP Analytical Database
http://my.vertica.com/docs/7.1.x/HTML/Content/Resources/Images/DataK-Safety-K2Nodes2And3Failed.png
MPP Analytical Database
JDBC
http://www.ndm.net/datawarehouse/images/stories/greenplum/gp-dia-3-0.png
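From Java, an MPP analytical database is typically reached through plain JDBC, as the slide indicates. The following is a minimal sketch under stated assumptions: the table `sensor_readings` and its columns `device_id`, `reading`, and `recorded_at` are hypothetical, and the caller is assumed to supply an open `Connection` obtained from the vendor's JDBC driver.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.LinkedHashMap;
import java.util.Map;

public class MppJdbcSketch {

    // Hypothetical table and column names, for illustration only.
    static final String AVG_BY_DEVICE_SQL =
          "SELECT device_id, AVG(reading) AS avg_reading "
        + "FROM sensor_readings "
        + "WHERE recorded_at >= ? AND recorded_at < ? "
        + "GROUP BY device_id ORDER BY device_id";

    // Runs an ad hoc aggregate query; the database does the heavy lifting,
    // the Java side only binds parameters and reads the result rows.
    static Map<String, Double> averageByDevice(Connection conn, Timestamp from, Timestamp to)
            throws SQLException {
        Map<String, Double> result = new LinkedHashMap<>();
        try (PreparedStatement stmt = conn.prepareStatement(AVG_BY_DEVICE_SQL)) {
            stmt.setTimestamp(1, from);
            stmt.setTimestamp(2, to);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    result.put(rs.getString("device_id"), rs.getDouble("avg_reading"));
                }
            }
        }
        return result;
    }
}
```

A connection would usually come from `DriverManager.getConnection(url, user, password)` or a pooled `DataSource`; the URL format depends on the database vendor and is left out here on purpose.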
MPP Analytical Database
Volume: Small-Medium-Large
Variety: Structured data
Velocity: Interactive
ASTER DATABASE
Matrix
In-Memory Data Grid
https://ignite.incubator.apache.org/images/in_memory_data.png
In-Memory Data Grid
https://ignite.incubator.apache.org/images/in_memory_compute.png
In-Memory Data Grid
http://hazelcast.com/wp-content/uploads/2013/12/IMDGEmbeddedMode_w1000px.png
In-Memory Data Grid
Volume: Small-Medium
Variety: Structured data
Velocity: (Near) Real-Time
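The core idea behind an in-memory data grid, keeping entries in RAM and spreading them over nodes by key, can be sketched in a few lines. This is not the Apache Ignite or Hazelcast API: `DataGridSketch` is a hypothetical class in which each "node" is simulated by a `ConcurrentHashMap` inside one JVM, and hash-modulo partitioning is an assumed, simplified placement strategy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DataGridSketch<K, V> {

    // One in-memory map per simulated node.
    private final List<Map<K, V>> partitions = new ArrayList<>();

    public DataGridSketch(int nodeCount) {
        if (nodeCount <= 0) {
            throw new IllegalArgumentException("nodeCount must be positive");
        }
        for (int i = 0; i < nodeCount; i++) {
            partitions.add(new ConcurrentHashMap<>());
        }
    }

    // The key's hash decides which node owns the entry.
    int partitionFor(K key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    public void put(K key, V value) {
        partitions.get(partitionFor(key)).put(key, value);
    }

    public V get(K key) {
        return partitions.get(partitionFor(key)).get(key);
    }

    // Total number of entries across all simulated nodes.
    public int size() {
        int total = 0;
        for (Map<K, V> partition : partitions) {
            total += partition.size();
        }
        return total;
    }
}
```

Usage would look like `DataGridSketch<String, Integer> grid = new DataGridSketch<>(3); grid.put("sensor-1", 42);`. Real grids add replication, rebalancing when nodes join or leave, and network transport, none of which is modeled here.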
Real-Life Project
Sensor data
Number of devices currently doubles every year
Data flow: ~200 GB/month
Target data flow: ~500 GB/month
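To put the figures above in perspective, a back-of-the-envelope estimate can convert monthly volume into average throughput and project when the target would be reached. This sketch makes two assumptions that are not stated in the slides: a month is taken as 30 days, and data flow is assumed to grow in proportion to the number of devices (doubling every year). The class and method names are hypothetical.

```java
public class DataFlowEstimate {

    // Assumption: a month is 30 days.
    static final double SECONDS_PER_MONTH = 30 * 24 * 3600;

    // Average sustained throughput for a given monthly volume (1 GB = 1e9 bytes).
    static double bytesPerSecond(double gbPerMonth) {
        return gbPerMonth * 1e9 / SECONDS_PER_MONTH;
    }

    // Years until `current` grows to `target`, assuming it doubles every year.
    static double yearsToReach(double current, double target) {
        return Math.log(target / current) / Math.log(2);
    }

    public static void main(String[] args) {
        System.out.printf("current: %.0f B/s%n", bytesPerSecond(200));
        System.out.printf("target:  %.0f B/s%n", bytesPerSecond(500));
        System.out.printf("years to target: %.2f%n", yearsToReach(200, 500));
    }
}
```

Under these assumptions the averages come to roughly 77 kB/s now and 193 kB/s at the target, and the target would be reached in about 1.3 years. Peak rates are likely higher than the averages, so they should not be used directly for capacity planning.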
Real-Life Project
Requirements
When do I need it? In a minute
What do I need to do with it? Structured analytics
How will I query and search? Ad hoc SQL
How do I need to store it? I must and I am required to
Where is it coming from? XML
Real-Life Project
Time-series data
RESTful API
Extendable analytics
Scalability
Speed to Market
Real-Life Project
Availability Zone C
Availability Zone B
Availability Zone A
Real-Life Project
Processor
Raw message store
Client API
Collector
Analytic Executor Pool
Analytics API
Clients
Devices
3rd Party Services
Post-Processor
UI
Recent data store
Permanent data store
Real-Life Project
Vertica stores time-series data only
Append-only data store
Store organizational data separately
Use Vertica’s ExternalFilter for data load
R analytics as UDFs on Vertica
Scale Vertica cluster accordingly
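The "append-only, time-series only" decision can be illustrated with a small in-memory model. This is not how Vertica is implemented or accessed: `AppendOnlySeries` is a hypothetical class that only shows the contract, where readings can be added and queried by time range but never updated or deleted.

```java
import java.time.Instant;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class AppendOnlySeries {

    // Readings per device, kept sorted by timestamp.
    private final Map<String, NavigableMap<Instant, Double>> byDevice = new HashMap<>();

    // Adds a reading; an existing reading is never overwritten.
    public void append(String deviceId, Instant at, double value) {
        NavigableMap<Instant, Double> series =
            byDevice.computeIfAbsent(deviceId, id -> new TreeMap<>());
        if (series.putIfAbsent(at, value) != null) {
            throw new IllegalStateException("append-only: reading already stored for " + at);
        }
    }

    // Returns a read-only view of readings with from <= timestamp < to.
    public NavigableMap<Instant, Double> range(String deviceId, Instant from, Instant to) {
        NavigableMap<Instant, Double> series = byDevice.get(deviceId);
        if (series == null) {
            return Collections.emptyNavigableMap();
        }
        return Collections.unmodifiableNavigableMap(series.subMap(from, true, to, false));
    }
}
```

Keeping mutable organizational data out of this store, as the slide suggests, means the time-series side never needs an update path at all.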
Real-Life Project
Choose the right tool for the job; late changes are expensive
You can do everything yourself. Should you?
Q&A