Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming …...
![Page 1: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/1.jpg)
Mark Rittman, Independent Analyst, MJR Analytics
DATA INTEGRATION AND DATA WAREHOUSING FOR CLOUD, BIG DATA AND IOT: WHAT’S NEW, WHAT’S COMING … AND WHAT’S MISSING RIGHT NOW
BIG DATA WORLD, LONDON
London, March 2017
![Page 2: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/2.jpg)
•Oracle ACE Director, Independent Analyst
•Past ODTUG Exec Board Member + Oracle Scene Editor
•Author of two books on Oracle BI
•Co-founder & CTO of Rittman Mead
•15+ Years in Oracle BI, DW, ETL + now Big Data
•Host of the Drill to Detail Podcast (www.drilltodetail.com)
•Based in Brighton & work in London, UK
About The Presenter
![Page 3: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/3.jpg)
A BIT OF HISTORY…
![Page 4: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/4.jpg)
•Data warehouses provided a unified view of the business
•Single place to store key data and metrics
•Joined-up view of the business
•Aggregates and conformed dimensions
•ETL routines to load, cleanse and conform data
•BI tools for simple, guided access to information
•Tabular data access using SQL-generating tools
•Drill paths, hierarchies, facts, attributes
•Fast access to pre-computed aggregates
•Packaged BI for fast-start ERP analytics
[Diagram labels: Oracle, MongoDB, Sybase, IBM DB/2, MS SQL Server; Core ERP Platform, Retail, Banking, Call Center, E-Commerce, CRM; Data Warehouse (ODS/Foundation Layer, Access & Performance Layer); Business Intelligence Tools]
Data Warehousing Back in Mid-2000’s
![Page 5: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/5.jpg)
How Traditional RDBMS Data Warehousing Scaled-Up
Shared-Everything Architectures (e.g. Oracle RAC, Exadata)
Shared-Nothing Architectures (e.g. Teradata, Netezza)
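As a rough illustration of the shared-nothing idea, the sketch below spreads rows across nodes by hashing a distribution key, lets each node aggregate only the rows it owns, and merges the per-node results. It is a toy under stated assumptions, not how Teradata or Netezza are implemented: `NUM_NODES`, `node_for` and the sample rows are all hypothetical.

```python
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size


def node_for(key: str) -> int:
    """Pick the owning node from a stable (non-randomised) hash of the key."""
    return sum(key.encode("utf-8")) % NUM_NODES


def distribute(rows):
    """Place each (customer, amount) row on exactly one node's private storage."""
    nodes = [[] for _ in range(NUM_NODES)]
    for customer, amount in rows:
        nodes[node_for(customer)].append((customer, amount))
    return nodes


def local_totals(partition):
    """Each node aggregates only its own rows; nothing is shared between nodes."""
    totals = defaultdict(int)
    for customer, amount in partition:
        totals[customer] += amount
    return dict(totals)


def query_totals(nodes):
    """Coordinator merges per-node results; a key lives on one node only."""
    merged = {}
    for partition in nodes:
        merged.update(local_totals(partition))
    return merged


# Hypothetical sample data
rows = [("acme", 10), ("globex", 5), ("acme", 7), ("initech", 3)]
nodes = distribute(rows)
totals = query_totals(nodes)
```

Because every row for a given customer hashes to the same node, the coordinator can merge results without re-aggregating. In a shared-everything design, by contrast, all nodes would read the same storage.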
![Page 6: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/6.jpg)
•Google needed to store and query their vast amount of server log files
•And wanted to do so using cheap, commodity hardware
•Google File System and MapReduce designed together for this use
Around the Same Time…
![Page 7: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/7.jpg)
•GFS optimised for the particular task at hand: computing PageRank for sites
•Streaming reads for PageRank calcs, block writes for crawler whole-site dumps
•Master node only holds metadata
•Stops client/master I/O being a bottleneck; master also acts as traffic controller for clients
•Simple design, optimised for a specific Google need
•MapReduce focused on simple computations on an abstraction framework
•Select & filter (MAP) and reduce (aggregate) functions, easy to distribute on a cluster
•MapReduce abstracted cluster compute, HDFS abstracted cluster storage
•Projects that inspired Apache Hadoop + HDFS
Google File System + MapReduce Key Innovations
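The map (select & filter) and reduce (aggregate) split described above can be sketched in a few lines of plain Python. This is a single-process illustration only, with no cluster, no GFS/HDFS and no Hadoop API. The log format (`METHOD URL STATUS`) and all function names are assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """MAP: select and filter. Emit (url, 1) for successful requests only."""
    parts = line.split()
    if len(parts) == 3 and parts[2] == "200":
        yield parts[1], 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """REDUCE: aggregate all values seen for one key."""
    return key, sum(values)


def run_job(splits):
    """Run map over every line of every input split, then reduce per key."""
    mapped = chain.from_iterable(
        map_phase(line) for split in splits for line in split
    )
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Two hypothetical input splits, as if stored on two different nodes
splits = [
    ["GET /home 200", "GET /about 404"],
    ["GET /home 200", "GET /pricing 200"],
]
hits = run_job(splits)
```

Each call to `map_phase` depends only on its own input line, and each call to `reduce_phase` only on one key's values. That independence is what makes the work easy to distribute across a cluster.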
![Page 8: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/8.jpg)
•A cheap and easily expandable way of storing (non-relational) data
•Gave us a way of scaling beyond TB-size without paying $$$
•First use-cases were offline storage, active archive of data
Hadoop’s Original Appeal to Data Warehouse Owners
![Page 9: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/9.jpg)
•Driven by pace of business, and user demands for more agility and control
•Traditional IT-governed data loading not always appropriate
•Not all data needed to be modelled right away
•Not all data suited to storing in tabular form
•New ways of analyzing data beyond SQL
•Graph analysis
•Machine learning
Data Warehousing and ETL Needed Some Agility
![Page 10: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/10.jpg)
•Hadoop started out synonymous with MapReduce and Java coding
•But YARN (Yet Another Resource Negotiator) broke this dependency
•Hadoop now just handles resource management
•Multiple different query engines can run against data in-place
•General-purpose (e.g. MapReduce)
•Graph processing
•Machine Learning
•Real-Time Processing
Hadoop 2.0 - Enabling Multiple Query Engines
![Page 11: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/11.jpg)
•Storing data in the format it arrived in, and then applying schema at query time
•Suits data that may be analysed in different ways by different tools
•In addition, some datatypes may have schema embedded in file format
•Key benefit - fast-arriving data of unknown value can get to users earlier
•Made possible by tools such as Apache Hive + SerDes, Apache Drill and self-describing file formats, HDFS storage
Advent of Schema-on-Read, and Data Lakes
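A minimal sketch of schema-on-read, assuming raw events were landed as one JSON document per line. The data is stored untouched, and each reader supplies its own schema when it queries. The event fields, schema names and `read_with_schema` helper are hypothetical. This is plain Python, not the Hive SerDe or Apache Drill API.

```python
import json

# Raw events are kept exactly as they arrived (hypothetical sample data).
RAW_EVENTS = [
    '{"device": "thermostat-1", "temp_c": "21.5", "ts": "2017-03-15T09:00:00"}',
    '{"device": "thermostat-2", "temp_c": "19.0", "ts": "2017-03-15T09:00:05", "fw": "1.2"}',
    '{"device": "doorbell-7", "pressed": true, "ts": "2017-03-15T09:01:00"}',
]

# A schema here is just column name -> casting function, chosen by the
# reader at query time rather than enforced when the data was written.
TEMPERATURE_SCHEMA = {"device": str, "temp_c": float}
ACTIVITY_SCHEMA = {"device": str, "ts": str}


def read_with_schema(raw_lines, schema):
    """Project each raw record onto the schema; skip records lacking a column."""
    for line in raw_lines:
        record = json.loads(line)
        if all(column in record for column in schema):
            yield {column: cast(record[column]) for column, cast in schema.items()}


# Two different "tools" read the same raw data in two different ways.
temperatures = list(read_with_schema(RAW_EVENTS, TEMPERATURE_SCHEMA))
activity = list(read_with_schema(RAW_EVENTS, ACTIVITY_SCHEMA))
```

The doorbell event never had to fit a temperature table to be stored. It is simply invisible to the temperature reader and visible to the activity reader.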
![Page 12: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/12.jpg)
•Data now landed in Hadoop clusters, NoSQL databases and Cloud Storage
•Flexible data storage platform with cheap storage, flexible schema support + compute
•Solves the problem of how to store new types of data + choose best time/way to process it
•Hadoop/NoSQL increasingly used for all store/transform/query tasks
Data Warehousing Circa 2010 : The “Data Lake”
[Diagram labels: Data Transfer, Data Access, Data Factory, Data Reservoir; Business Intelligence Tools; Hadoop Platform; File Based Integration, Stream Based Integration, ETL Based Integration; Data streams; Discovery & Development Labs (safe & secure discovery and development environment), Data sets and samples, Models and programs; Marketing/Sales Applications, Models, Machine Learning, Segments; Operational Data, Transactions, Customer Master Data; Unstructured Data, Voice + Chat Transcripts; Raw Customer Data (data stored in the original format, usually files, such as SS7, ASN.1, JSON etc.); Mapped Customer Data (data sets produced by mapping and transforming raw data)]
![Page 13: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/13.jpg)
DATA WAREHOUSING & BIG DATA TODAY…
![Page 14: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/14.jpg)
•On-premise Hadoop, even with simple resilient clustering, will hit limits
•Clusters can reach 5000+ nodes, need to scale-up for demand peaks etc.
•Scale limits are encountered way beyond those for DWs…
•… but future is elastically-scaled, query and compute-as-a-service
On-Premise Big Data Analytics Hits Its Limits
Oracle Big Data Cloud Compute Edition. Free $300 developer credit at: https://cloud.oracle.com/en_US/tryit
![Page 15: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/15.jpg)
•New generation of big data platform services from Google, Amazon, Oracle
•Combines three key innovations from earlier technologies:
•Organising of data into tables and columns (from RDBMS DWs)
•Massively-scalable and distributed storage and query (from Big Data)
•Elastically-scalable Platform-as-a-Service (from Cloud)
Elastically-Scalable Data Warehouse-as-a-Service
![Page 16: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/16.jpg)
Example Architecture : Google BigQuery
![Page 17: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/17.jpg)
•And things come full-circle … analytics typically requires tabular data
•Google BigQuery based on DremelX massively-parallel query engine
•But stores data in columnar form and provides a SQL interface
•Solves the problem of providing DW-like functionality at scale, as-a-service
•This is the future … ;-)
BigQuery : Big Data Meets Data Warehousing
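To show why columnar storage suits tabular analytics, the toy sketch below pivots row-oriented records into one list per column and then answers a `SELECT country, SUM(amount) ... GROUP BY country` style question by scanning only the two columns involved. It says nothing about BigQuery's actual storage format or engine. The table, column names and helper functions are assumptions for illustration.

```python
# Hypothetical row-oriented order records
rows = [
    {"order_id": 1, "country": "UK", "amount": 120.0},
    {"order_id": 2, "country": "FR", "amount": 80.0},
    {"order_id": 3, "country": "UK", "amount": 50.0},
]


def to_columns(records):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_by(columns, group_col, value_col):
    """Group-and-sum that reads only the two referenced columns.

    Unrelated columns (order_id here) are never touched by the scan.
    """
    totals = {}
    for group, value in zip(columns[group_col], columns[value_col]):
        totals[group] = totals.get(group, 0.0) + value
    return totals


columns = to_columns(rows)
revenue = sum_by(columns, "country", "amount")
```

With wide tables and queries that touch a handful of columns, skipping the unreferenced columns is where most of the scan savings come from.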
![Page 18: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/18.jpg)
DATAFLOW PIPELINES ARE THE NEW ETL…
![Page 19: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/19.jpg)
New ways to do BI
![Page 20: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/20.jpg)
New ways to do BI
![Page 21: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/21.jpg)
MACHINE LEARNING & SEARCH FOR “AUTOMAGIC” SCHEMA DISCOVERY
![Page 22: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/22.jpg)
New ways to do BI
![Page 23: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/23.jpg)
•By definition there's lots of data in a big data system ... so how do you find the data you want?
•Google's own internal solution - GOODS ("Google Dataset Search")
•Uses crawler to discover new datasets
•ML classification routines to infer domain
•Data provenance and lineage
•Indexes and catalogs 26bn datasets
•Other users, vendors also have solutions
•Oracle Big Data Discovery
•Datameer
•Platfora
•Cloudera Navigator
Google GOODS - Catalog + Search At Google-Scale
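The crawl, classify and index steps above can be sketched at toy scale as follows. This is not GOODS itself. The in-memory `STORAGE` dictionary stands in for a real storage system, and the "domain" is simply taken from the first path segment, which is a deliberately crude stand-in for the ML classification the slide mentions. All names are hypothetical.

```python
from collections import defaultdict

# Hypothetical "storage system": dataset path -> sample rows.
STORAGE = {
    "/logs/web/2017-03-15": [{"url": "/home", "status": 200}],
    "/sales/orders": [{"order_id": 1, "customer": "acme", "amount": 10}],
    "/sales/customers": [{"customer": "acme", "country": "UK"}],
}


def crawl(storage):
    """Visit every dataset and record metadata the owner never declared."""
    catalog = {}
    for path, sample in storage.items():
        columns = sorted({name for row in sample for name in row})
        # Crude stand-in for ML domain classification: first path segment.
        catalog[path] = {"columns": columns, "domain": path.split("/")[1]}
    return catalog


def build_index(catalog):
    """Inverted index from keyword (column name or domain) to dataset paths."""
    index = defaultdict(set)
    for path, meta in catalog.items():
        for keyword in meta["columns"] + [meta["domain"]]:
            index[keyword].add(path)
    return index


def search(index, keyword):
    """Return the matching dataset paths in a stable order."""
    return sorted(index.get(keyword, set()))


catalog = crawl(STORAGE)
index = build_index(catalog)
matches = search(index, "customer")
```

Here `search(index, "customer")` finds both sales datasets because the crawler saw that column in each, even though nobody registered them in a catalog by hand.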
![Page 24: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/24.jpg)
A NEW TAKE ON BI…
![Page 25: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/25.jpg)
•Came out of the data science movement, as a way to "show workings"
•A set of reproducible steps that tell a story about the data
•as well as being a better command-line environment for data analysis
•One example is Jupyter, evolution of the IPython Notebook
•supports PySpark, Pandas etc.
•See also Apache Zeppelin
Web-Based Data Analysis Notebooks
![Page 26: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/26.jpg)
AND EMERGING OPEN-SOURCE BI TOOLS AND PLATFORMS
![Page 27: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/27.jpg)
And Emerging Open-Source BI Tools and Platforms
http://larrr.com/wp-content/uploads/2016/05/paper.pdf
![Page 28: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/28.jpg)
![Page 29: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/29.jpg)
And Emerging Open-Source BI Tools and Platforms
![Page 30: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/30.jpg)
… Which Is What I’m Working On Right Now
![Page 31: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/31.jpg)
THANK YOU
![Page 32: Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now](https://reader031.fdocuments.net/reader031/viewer/2022021920/58d135b11a28abe3298b6415/html5/thumbnails/32.jpg)