BIG DATA: From mammoth to elephant
Transcript of BIG DATA: From mammoth to elephant
![Page 1: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/1.jpg)
Roman Nikitchenko, 10.05.2015
BIG DATA: FROM MAMMOTH TO ELEPHANT
![Page 2: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/2.jpg)
MAMMOTH
The only real truth we know about them comes from their remains. Do you feel your enterprise data infrastructure is going this way?
Come and see in the nearest data center...
2
![Page 3: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/3.jpg)
TWO YEARS AGO
● Our exciting, highly scalable realtime BIG DATA solution with a broad technology stack in production.
3
![Page 4: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/4.jpg)
This is our PRESENT DAY
.. yet is powered by
4
![Page 5: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/5.jpg)
OUR INITIAL STATE: TOP VIEW
[Diagram labels. Inbound: healthcare providers data (labs, cares ...), storage, SQL DB with processed inbound data. Client applications, each with a SQL DB holding application data. Outbound: SQL DB with outbound information, storage, mostly insurance companies.]
5
![Page 6: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/6.jpg)
OUR INITIAL STATE: TOP VIEW
[Diagram: the same top view as on the previous slide, with the annotations below.]
● Inbound data archives (pretty short cycle).
● One SQL DB per application.
● Huge amount of data. Serious amount of duplicates.
● How about retention and data issues investigation?
6
![Page 7: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/7.jpg)
OUR INITIAL STATE: TOP VIEW
YELLOW ALARMS
[Diagram: the same top view as on the previous slides, with the alarms below.]
● Outbound flow is slow because of RDBMS processing.
● The inbound data retention cycle is short, so investigating data over a prolonged period is hard.
● Overall, a huge number of SQL databases and high operational complexity.
● One application DB per service client makes inter-application analytics and monitoring extremely hard.
7
![Page 8: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/8.jpg)
8
![Page 9: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/9.jpg)
BIG DATA
Better ways to store huge data volumes: cheaper, safer and easier.
WHAT TO RUN FOR?
MORE STORAGE
9
![Page 10: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/10.jpg)
BIG DATA
WHAT TO RUN FOR?
Scalable effective distributed processing models to open new opportunities like machine learning.
MORE POWER
10
![Page 11: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/11.jpg)
BIG DATA
WHAT TO RUN FOR?
More flexible data structures closer to subject area and real world.
11
![Page 12: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/12.jpg)
RDBMS LIMITS
● Good for anything
● Not so good for anything in particular
OUR MAIN ENEMY WAS ...
12
![Page 13: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/13.jpg)
WHY SQL IS EVIL
MASSIVE ANALYSIS
Is about massive access to your data objects.
[Diagram: your database, then transformation from database structure into object structure, then subject area objects data handled by four parallel processing branches (distributed parallel processing), then effective results collection (distributed processing results to be joined).]
13
![Page 14: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/14.jpg)
RDBMS LIMITS
SUBJECT AREA OBJECT COLLECTION
When you go to massive processing, object collection gets too complex. Think about scanning the data of 100,000,000 people.

| Address ID | City | Street |
|---|---|---|
| 1 | New York | 1020, Blue lake |
| 2 | Atlanta | 203, Bricks av. |
| 3 | Seattle | 120, Green drv. |

| FirstName | LastName | Address | Payer |
|---|---|---|---|
| John | Smith | 1 | 2 |
| Kate | Davis | 2 | 1 |
| Samuel | Brown | 3 | 2 |

| Payer ID | Name | State |
|---|---|---|
| 1 | SaferLife | GA |
| 2 | YourGuard | CA |

Collected object: Kate Davis; Atlanta, 203, Bricks av.; SaferLife, GA
14
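A minimal sketch of what collecting one subject area object costs in the relational layout above. It uses Python's built-in `sqlite3` module purely for illustration (the deck itself talks about MySQL); the column names `AddressID` and `PayerID` and the helper `collect_patient` are assumptions made for this example, while the rows are the ones from the slide.

```python
import sqlite3

# In-memory database holding the three tables from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Address (AddressID INTEGER PRIMARY KEY, City TEXT, Street TEXT);
    CREATE TABLE Payer   (PayerID INTEGER PRIMARY KEY, Name TEXT, State TEXT);
    CREATE TABLE Patient (FirstName TEXT, LastName TEXT, Address INTEGER, Payer INTEGER);

    INSERT INTO Address VALUES
        (1, 'New York', '1020, Blue lake'),
        (2, 'Atlanta',  '203, Bricks av.'),
        (3, 'Seattle',  '120, Green drv.');
    INSERT INTO Payer VALUES (1, 'SaferLife', 'GA'), (2, 'YourGuard', 'CA');
    INSERT INTO Patient VALUES
        ('John', 'Smith', 1, 2),
        ('Kate', 'Davis', 2, 1),
        ('Samuel', 'Brown', 3, 2);
""")


def collect_patient(first_name, last_name):
    """Assemble one patient object; each object needs two joins."""
    row = conn.execute(
        """
        SELECT p.FirstName, p.LastName, a.City, a.Street, y.Name, y.State
        FROM Patient p
        JOIN Address a ON a.AddressID = p.Address
        JOIN Payer   y ON y.PayerID   = p.Payer
        WHERE p.FirstName = ? AND p.LastName = ?
        """,
        (first_name, last_name),
    ).fetchone()
    if row is None:
        return None
    keys = ("first_name", "last_name", "city", "street", "payer", "payer_state")
    return dict(zip(keys, row))


kate = collect_patient("Kate", "Davis")
print(kate)
```

Every collected object pays for the joins, which is tolerable for single lookups and becomes the dominant cost when the whole population has to be scanned.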
![Page 15: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/15.jpg)
RDBMS LIMITS
TABLE STRUCTURE MODIFICATION
Let it be the Patient table ...

| FirstName | LastName | Address | Payer |
|---|---|---|---|
| John | Smith | 1 | 2 |
| Kate | Davis | 2 | 1 |
| Samuel | Brown | 3 | 2 |

And now let us add a new «Birthday» column. Easy as pie!
ALTER TABLE Patient ADD Birthday ...
Resulting columns: FirstName, LastName, Address, Payer, Birthday
Let's do this with a 2,000,000,000-row MySQL table in production. What do you do if your table grows further?
15
![Page 16: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/16.jpg)
ANY RELATIONAL DATA MODEL
SOONER OR LATER
16
![Page 17: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/17.jpg)
RDBMS LIMITS
HOW TO SCALE?
[Diagram: your SQL database split into four shards, each with its own processing; distributed processing results to be joined.]
● How to partition data? What to do when a new shard is added?
● Need another cluster for processing?
17
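To make the "what to do when a new shard is added?" question concrete, here is a small sketch assuming the simplest possible placement rule: a stable hash of the key modulo the number of shards. The function names and the `patient-N` keys are hypothetical; the deck does not say how its SQL shards were actually partitioned.

```python
import hashlib


def shard_for(key, shard_count):
    """Naive placement: stable hash of the key modulo the number of shards."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def moved_fraction(keys, old_count, new_count):
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for key in keys
        if shard_for(key, old_count) != shard_for(key, new_count)
    )
    return moved / len(keys)


keys = [f"patient-{i}" for i in range(10_000)]
fraction = moved_fraction(keys, old_count=4, new_count=5)
print(f"{fraction:.0%} of keys change shard when going from 4 to 5 shards")
```

With this rule a key stays in place only when its hash gives the same remainder for both shard counts, so the large majority of rows has to be moved on every resize. That is the kind of manual resharding work the slide is warning about.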
![Page 18: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/18.jpg)
If you need to store a plain-text log, a collection of objects for a long time, or current user session attributes, do you really need SQL?
18
![Page 19: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/19.jpg)
OUR INITIAL BIG PLAN WAS
[Diagram: three SQL DBs with application data, each feeding through ETL into a cross-application data storage that serves small realtime requests and batch analytic and reporting load.]
● One-time ETL as an initial step and backup strategy.
● Full migration to Apache HBase.
● As a transition-period solution: realtime synchronization.
19
![Page 20: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/20.jpg)
Why Hadoop (initially 1.x)?
OPEN SOURCE framework for big data. Both distributed storage and processing.
Provides RELIABILITY and fault tolerance by SOFTWARE design (for example, a file system with replication factor 3 as the default).
Horizontal scalability from a single computer up to thousands of nodes.
20
![Page 21: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/21.jpg)
World's first ever DATA OS
A 10,000-node computer ... It can start in production with just 4 servers, 1 of which is for management and coordination. A single server is enough for a development environment.
21
![Page 22: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/22.jpg)
HBase motivation
WHY
LATENCY, SPEED AND ALL HADOOP PROPERTIES
22
![Page 23: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/23.jpg)
WHY YET ?
[Diagram: each of four nodes runs a Region server (database), a TaskTracker (distributed processing) and a DataNode (file system) on top of the node hardware.]
● Good both for OLTP and batch load.
● Natural scaling and reliability with Hadoop.
● Data processing locality, natural sharding with regions.
● Coordination with ZooKeeper.
23
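The "natural sharding with regions" point can be illustrated with a simplified sketch. HBase keeps rows sorted by row key and each region covers a contiguous key range (start key inclusive, end key exclusive), so finding the region for a row is a search over sorted start keys. The split points and server names below are hypothetical, and this is not the HBase client API.

```python
import bisect

# Hypothetical split points: region i holds row keys in
# [region_start_keys[i], region_start_keys[i + 1]); the last region is open-ended.
region_start_keys = ["", "g", "n", "t"]
region_servers = ["rs1", "rs2", "rs3", "rs4"]


def locate_region(row_key):
    """Return (region index, region server) responsible for a row key."""
    idx = bisect.bisect_right(region_start_keys, row_key) - 1
    return idx, region_servers[idx]


for key in ("davis", "g", "smith", "zeta"):
    print(key, locate_region(key))
```

Adding capacity then means splitting one key range in two and moving one half, rather than rehashing every row.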
![Page 24: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/24.jpg)
ZooKeeper
Because coordinating distributed systems is a Zoo.
● Quorum-based service for fast distributed system coordination.
● Came into our stack with Apache HBase, where it was needed for coordination. Now it is part of the core Hadoop infrastructure.
● We also use it for our own applications.
24
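"Quorum based" has a simple arithmetic behind it: a majority of the ensemble must be alive for the service to keep working. The sketch below only shows that arithmetic; the function names are made up for this example and nothing here talks to a real ZooKeeper ensemble.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers may fail while a majority is still available."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (3, 4, 5):
    print(f"ensemble={n} quorum={quorum_size(n)} tolerated failures={tolerated_failures(n)}")
```

Note that 4 servers tolerate no more failures than 3, which is why odd ensemble sizes are the usual choice.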
![Page 25: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/25.jpg)
Finally, we went into initial production with HADOOP 2.0
RESOURCE MANAGEMENT
DISTRIBUTED PROCESSING
FILE SYSTEM
COORDINATION
HADOOP 2.x CORE
25
![Page 26: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/26.jpg)
Real initial approach
[Diagram: each of four nodes runs a Region server (database), a NodeManager (resource management) and a DataNode (file system) on top of the node hardware, with distributed processing & coordination alongside.]
● ZooKeeper instances are distributed across the cluster.
● MapReduce is not a service in Hadoop 2.x, just a YARN application.
26
![Page 27: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/27.jpg)
FIRST REAL RESULT
CLOSE BUT NOT EXACT PLAN
[Diagram: three SQL DBs with application data, each feeding through ETL into a cross-application data storage that serves small realtime requests and batch analytic and reporting load.]
Daily ETL. It satisfied our daily reporting needs with a major SQL infrastructure offload. Direct profit: massive processing is much faster and can handle inter-application data.
DO NOT WEAR ROSE-COLORED GLASSES
27
![Page 28: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/28.jpg)
APPROACH WE HAVE FIXED MUCH LATER
[Diagram: on each SQL server, Table1 to Table4 go through a JOIN and leave as an ETL stream; the ETL streams reach the BIG DATA shards through bulkload.]
28
![Page 29: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/29.jpg)
Hadoop: don't do it yourself
DON'T DO IT YOURSELF
Because of a number of factors, starting with our distributed team's support needs, we have selected
29
![Page 30: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/30.jpg)
HADOOP as INFRASTRUCTURE
[Diagram labels: x MAX, +, =, BIG DATA]
30
![Page 31: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/31.jpg)
WHERE TO GO FROM HERE?
31
![Page 32: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/32.jpg)
The admission of temporary residents into Canada is a privilege, not a right.
http://www.cic.gc.ca/
SEARCH / SECONDARY INDICES
32
![Page 33: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/33.jpg)
NO SEARCH OUT OF THE BOX OTHER THAN LINEAR SCAN OVER THE TABLE AND FILTERS.
SEARCH / SECONDARY INDICES
The same turned out to apply to secondary indices in HBase.
33
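The difference between "linear scan with filters" and a secondary index can be sketched in a few lines. Plain Python dicts stand in for table rows here; the row keys, column names and helper functions are hypothetical and this is not HBase code.

```python
from collections import defaultdict

# Hypothetical table: row key -> columns.
table = {
    "row-001": {"last_name": "Smith", "city": "New York"},
    "row-002": {"last_name": "Davis", "city": "Atlanta"},
    "row-003": {"last_name": "Brown", "city": "Seattle"},
}


def scan_with_filter(column, value):
    """Linear scan: touches every row, so cost grows with table size."""
    return [key for key, cols in sorted(table.items()) if cols.get(column) == value]


def build_index(column):
    """Secondary index: column value -> row keys, built once and kept up to date."""
    index = defaultdict(list)
    for key, cols in sorted(table.items()):
        if column in cols:
            index[cols[column]].append(key)
    return index


city_index = build_index("city")
print(scan_with_filter("city", "Atlanta"))
print(city_index.get("Atlanta", []))
```

Both calls return the same rows; the index answers with one lookup, but it is a second data structure that somebody has to maintain on every write.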
![Page 34: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/34.jpg)
SEARCH / SECONDARY INDICES
HOW WE MADE IT
HBase handles user data changes
Indexes are built on SOLR
NGData Lily indexer transforms data changes into SOLR index updates
34
![Page 35: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/35.jpg)
HBase: Data and search integration
Data update
Client
User just puts (or deletes) data.
Search responses
REPLICATION
Lily HBase NRT indexer
Translates data changes into SOLR index updates.
SOLR cloud
Search requests (HTTP)
Apache ZooKeeper does all coordination
Provides real indexing
Search and indexing together
35
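The idea of translating data changes into index updates can be sketched as follows. This is a toy model, assuming change events arrive as `(operation, row key, value)` tuples; the `SearchIndex` class is hypothetical and mirrors neither the Lily indexer nor the SOLR API.

```python
class SearchIndex:
    """Toy stand-in for a search index: indexed value -> set of row keys."""

    def __init__(self):
        self.postings = {}
        self.indexed_value = {}  # row key -> value currently in the index

    def apply(self, event):
        """Translate one data change event into index updates."""
        op, row_key, value = event
        # Remove whatever was indexed for this row before.
        old = self.indexed_value.pop(row_key, None)
        if old is not None:
            self.postings[old].discard(row_key)
            if not self.postings[old]:
                del self.postings[old]
        # A put re-indexes the row under its new value; a delete stops here.
        if op == "put":
            self.postings.setdefault(value, set()).add(row_key)
            self.indexed_value[row_key] = value

    def search(self, value):
        return sorted(self.postings.get(value, set()))


# Change stream as the client produced it: the user just puts or deletes data.
events = [
    ("put", "row-001", "New York"),
    ("put", "row-002", "Atlanta"),
    ("put", "row-001", "Seattle"),
    ("delete", "row-002", None),
]

index = SearchIndex()
for event in events:
    index.apply(event)

print(index.search("Seattle"))
```

The client never touches the index directly, which matches the slide: it only writes data, and the index follows the change stream.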
![Page 36: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/36.jpg)
● Kafka is a high throughput distributed messaging system.
● Allows true realtime system reaction through publish-subscribe approach.
● New services can subscribe to data events stream.
GOING REALTIME
Batch load
Realtime load
New data
36
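The publish-subscribe approach from this slide can be illustrated with a toy in-memory broker. This is a sketch of the pattern only, not the Kafka API: the `Broker` class, the `new-data` topic and the message fields are all made up for the example.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory publish-subscribe broker (not the Kafka API)."""

    def __init__(self):
        self.log = defaultdict(list)          # topic -> append-only message log
        self.subscribers = defaultdict(list)  # topic -> subscriber callbacks

    def subscribe(self, topic, callback, from_beginning=False):
        """Register a subscriber, optionally replaying retained messages first."""
        if from_beginning:
            for message in self.log[topic]:
                callback(message)
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Append to the log and push the message to every subscriber."""
        self.log[topic].append(message)
        for callback in self.subscribers[topic]:
            callback(message)


broker = Broker()
loaded, alerts = [], []

# The existing loader subscribes first.
broker.subscribe("new-data", loaded.append)
broker.publish("new-data", {"patient": "Kate Davis", "lab": "A1"})

# A new service joins later and catches up from the retained log.
broker.subscribe("new-data", alerts.append, from_beginning=True)
broker.publish("new-data", {"patient": "John Smith", "lab": "B2"})

print(len(loaded), len(alerts))
```

The publisher does not know who is listening, so a new service can be added without changing the code that produces the data.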
![Page 37: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/37.jpg)
● Kafka can be separated from the Hadoop infrastructure or have a backup cluster.
● Data publishers can switch to another cluster.
● Subscribers (including Spark on Hadoop) keep 2 places of subscription.
● So now you are free to put a Kafka cluster into maintenance or to back up subscribers.
GOING REALTIME
GENTLY
MAINTENANCE
37
![Page 38: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/38.jpg)
This is our PRESENT DAY
.. yet is powered by
38
![Page 39: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/39.jpg)
SO WHERE ARE WE GOING?
39
![Page 40: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/40.jpg)
OVER BIG DATA
REACTIVE MANIFESTO
MOTIVATION
… users expect millisecond response times and 100% uptime. Data is measured in Petabytes. Today's demands are simply not met by yesterday’s software architectures.
40
![Page 41: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/41.jpg)
OVER BIG DATA
REACTIVE MANIFESTO
… we want systems that are Responsive, Resilient, Elastic and Message Driven. We call these Reactive Systems. http://www.reactivemanifesto.org/
41
![Page 42: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/42.jpg)
OVER BIG DATA
REACTIVE MANIFESTO
Responsiveness is the cornerstone of usability and utility, but more than that, responsiveness means that problems may be detected quickly and dealt with effectively.
RESPONSIVE
42
![Page 43: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/43.jpg)
OVER BIG DATA
REACTIVE MANIFESTO
The system stays responsive in the face of failure.
… The client of a component is not burdened with handling its failures.
RESILIENT
All services here are located through ZooKeeper, which is quorum based, so resilience is achieved.
43
![Page 44: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/44.jpg)
OVER BIG DATA
REACTIVE MANIFESTO
Reactive Systems can react to changes in the input rate by increasing or decreasing the resources allocated to service these inputs.
ELASTIC
Both HDFS and HBase allow dynamic node addition / removal.
YARN already handles most resource allocation work and keeps making progress.
44
![Page 45: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/45.jpg)
OVER BIG DATA
REACTIVE MANIFESTO
Reactive Systems rely on asynchronous message-passing to establish a boundary between components that ensures loose coupling.
MESSAGE DRIVEN
Asynchronous messages from applications
Any application can subscribe, not only Hadoop services
45
![Page 46: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/46.jpg)
LESSONS LEARNED
● No transition in one step. You enter the Big Data world step by step.
● Change your mind first. You should stop thinking in the old style. Do not simply try to map your existing approaches.
● No silver bullet. Don't ruin your existing infrastructure. Extend it. NoSQL is not always good, and some cases are really better kept on SQL. Use the right tool.
● As you progress, you pay more attention to operations and reactive system properties.
46
![Page 47: BIG DATA: From mammoth to elephant](https://reader030.fdocuments.net/reader030/viewer/2022032715/55adaa041a28aba4748b47d7/html5/thumbnails/47.jpg)
QUESTIONS?
47