What is Big Data and why should I care?

What is Big Data and why should I care? James Serra, PDW Technology Solution Professional 08/26/14

description

You may understand what a data warehouse is, but what is big data? And why should I care about it? How will it help me? I'll talk about the things to look at to understand if you have big data, and cover those buzzwords you may have heard but don't know the meaning of (data scientist, Hadoop, Internet of Things, data lake, modern data warehouse). I'll also give examples of how big data is helping companies make better business decisions.

Transcript of What is Big Data and why should i care?

Page 1: What is Big Data and why should i care?

What is Big Data and why should I care?

James Serra, PDW Technology Solution Professional, 08/26/14

Page 2: What is Big Data and why should i care?

About Me

- Business Intelligence Consultant, in IT for 28 years
- Microsoft, PDW Technology Solution Professional (TSP)
- Owner of Serra Consulting Services, specializing in end-to-end Business Intelligence and Data Warehouse solutions using the Microsoft BI stack
- Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW developer
- Been perm, contractor, consultant, business owner
- Presenter at PASS Business Analytics Conference and PASS Summit
- MCSE for SQL Server 2012: Data Platform and BI
- SME for SQL Server 2012 certs
- Contributing writer for SQL Server Pro magazine
- Blog at JamesSerra.com
- SQL Server MVP
- Author of book “Reporting with Microsoft SQL Server 2012”

Page 3: What is Big Data and why should i care?

How can big data help me?

Being able to extract data from various sources across the enterprise and outside the enterprise, and then transform it all into key business insights, can provide a significant competitive advantage through better business decisions.

- It all comes down to: The more data you have, the better business decisions you can make
- First step is to understand the importance of a data warehouse
- You need to understand what big data is
- You need to make sure your data warehouse can handle big data (do you have a big data problem?)
- You need examples of how big data can help you
- You need to understand Hadoop and its use cases with a data warehouse
- You need to understand the difference between scaling up (SMP) and scaling out (MPP)
- Understand the limitations of a traditional data warehouse and then build a modern data warehouse

Page 4: What is Big Data and why should i care?

Why use a Data Warehouse?

Legacy applications + databases = chaos

[Diagram: disconnected legacy systems: Production Control, MRP, Inventory Control, Parts Management, Logistics, Shipping, Raw Goods, Order Control, Purchasing, Marketing, Finance, Sales, Accounting, Management Reporting, Engineering, Actuarial, Human Resources]

Enterprise data warehouse = order

[Diagram: a single Enterprise Data Warehouse. Labels: Continuity, Consolidation, Control, Compliance, Collaboration]

- Single version of the truth
- Every question = decision

Two purposes of a data warehouse: 1) save time building reports; 2) slice and dice in ways you could not do before

Page 5: What is Big Data and why should i care?

What is a Data Warehouse and why use one?

All these reasons are for data warehouses only (not OLTP):

- Reduce stress on production system
- Optimized for read access, sequential disk scans
- Integrate many sources of data
- Keep historical records (no need to save hardcopy reports)
- Restructure/rename tables and fields, model data
- Protect against source system upgrades
- Use Master Data Management, including hierarchies
- No IT involvement needed for users to create reports
- Improve data quality and plug holes in source systems
- One version of the truth
- Easy to create BI solutions on top of it (e.g. SSAS cubes)


Page 6: What is Big Data and why should i care?

The traditional data warehouse


… data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing.

– Gartner, “The State of Data Warehousing in 2012”

[Diagram: Data sources (OLTP, ERP, CRM, LOB) → ETL → Data warehouse → BI and analytics]

Will your current solution handle future needs?

Page 7: What is Big Data and why should i care?

An illustration of the velocity of data created

Kalakota, R. (2012, October 22). Sizing “Mobile + Social” Big Data Stats. Retrieved from http://practicalanalytics.wordpress.com/

Page 8: What is Big Data and why should i care?

Social and web analytics

Live data feeds

Advanced analytics

The three V’s

Page 9: What is Big Data and why should i care?

What is big data and why is it valuable to the business? An evolution in the nature and use of data in the enterprise

[Chart: horizontal axis is data complexity (variety and velocity); vertical axis is volume, from megabytes to petabytes]

Page 10: What is Big Data and why should i care?

What is a data scientist?

Excels at analyzing data, particularly large amounts of data, to help a business gain a competitive edge

- Evolution from data analyst role
- Strong business acumen
- Part analyst, part artist
- Good with data modeling, machine learning, data mining
- Azure ML, SAS and R

Page 11: What is Big Data and why should i care?

Big data defined: Is it the size of the data?

- Volume
- Quantity
- Big data does not just mean the size of the data

Page 12: What is Big Data and why should i care?

Big data defined: Is it the frequency of the data?

- Velocity
- The rate at which the data changes

Page 13: What is Big Data and why should i care?

Big data defined: Is it the type of data?

- Variety
- Different types of data such as audio, video, text
- Structured (from a relational database)
- Unstructured (videos, pictures, PDF documents, email)
- Semi-structured (Twitter feed, Facebook, XML, Excel)
- Variability: The different meanings/contexts associated with a given piece of data

Page 14: What is Big Data and why should i care?

Why do I need data in a relational format?

• Creation of metadata
• To join multiple tables/files via a column
• Referential integrity
• Constraints
• Default values
• Optimizations, indexes
• Transactions
• Use of SQL
• User authentication and access (security)
• Updating and maintenance, consistency, reliability
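Several of the relational features listed above can be seen in a few lines. The following is a minimal sketch using Python's built-in sqlite3 module, not anything taken from the presentation: the `customer` and `sale` tables, their columns, and the sample values are all hypothetical. It shows a default value, a constraint, an index, a transaction, referential integrity rejecting an orphan row, and a join via a column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: metadata, a default value, constraints, and an index.
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    region      TEXT NOT NULL DEFAULT 'Unknown'
);
CREATE TABLE sale (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    amount      REAL NOT NULL CHECK (amount >= 0)
);
CREATE INDEX idx_sale_customer ON sale(customer_id);
""")

# A transaction: all three inserts commit together or not at all.
with conn:
    conn.execute("INSERT INTO customer (customer_id, name) VALUES (1, 'Contoso')")
    conn.execute("INSERT INTO sale (customer_id, amount) VALUES (1, 250.0)")
    conn.execute("INSERT INTO sale (customer_id, amount) VALUES (1, 100.0)")


def try_insert_orphan_sale(connection):
    """Return True if a sale for a non-existent customer is accepted."""
    try:
        with connection:
            connection.execute(
                "INSERT INTO sale (customer_id, amount) VALUES (99, 10.0)"
            )
    except sqlite3.IntegrityError:
        return False  # referential integrity rejected the row
    return True


orphan_accepted = try_insert_orphan_sale(conn)

# Join the two tables via the shared customer_id column.
totals = conn.execute("""
    SELECT c.name, c.region, SUM(s.amount)
    FROM customer AS c
    JOIN sale AS s ON s.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name, c.region
""").fetchall()

print(orphan_accepted, totals)
```

Files sitting in HDFS offer none of these guarantees on their own, which is why data usually has to be converted to a relational format before it gets them.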

Page 15: What is Big Data and why should i care?

Big data defined: Is it the performance of the data?

- Are you using a dashboard (slice and dice) or an operational reporting tool?
- What is the Service Level Agreement (SLA)?

Page 16: What is Big Data and why should i care?

Questions to see if you have a big data problem

Qualification Questions

1. Is your data volume growth becoming unmanageable using currently implemented DW technologies? (>20-30% annually)
2. Is there a specific Big Data business need (e.g. social media analysis, fraud detection) in a high-priority industry (Retail, Financial, Pub Sec)?
3. Is your DW or storage spend consuming a disproportionate and increasing amount of your IT budget?
4. Do your business users need to find, combine, and refine structured and unstructured data? Internal and external sources?
5. In the near future do you expect to need both on-premise and cloud-based BI capabilities?
6. Do you have a need to capture and analyze streaming data? At what scale and velocity?
7. Do you currently (or plan to) collect, store, and analyze multiple forms of unstructured data (XML, JSON, CSV, etc.)?
8. Are you able to serve your business users’ analytics provisioning and data requests in a timely manner?
9. Are you experiencing data management issues such as security or compliance due to business owners (“shadow” IT) creating their own unmanaged data stores?
10. Are you trying to build, grow, and manage your next-generation DW without adding new headcount or talent (data scientists, external consultants, etc.)?
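The growth threshold in the first qualification question (>20-30% annually) is easy to check against your own numbers. The sketch below is only an illustration: the function names, the 10 TB and 16.9 TB sizes, and the 40 TB capacity are made-up assumptions, and it assumes growth compounds at a steady annual rate.

```python
import math


def annual_growth_rate(start_tb, end_tb, years):
    """Compound annual growth rate between two warehouse size measurements."""
    return (end_tb / start_tb) ** (1.0 / years) - 1.0


def years_until_full(current_tb, capacity_tb, growth_rate):
    """Years until current_tb reaches capacity_tb at a steady annual growth rate."""
    if current_tb >= capacity_tb:
        return 0.0
    if growth_rate <= 0:
        return math.inf  # no growth, so capacity is never reached
    return math.log(capacity_tb / current_tb) / math.log(1.0 + growth_rate)


# Hypothetical measurements: 10 TB two years ago, 16.9 TB today.
rate = annual_growth_rate(10.0, 16.9, 2)

# Flag the warehouse if growth is above the lower bound of the 20-30% range.
exceeds_threshold = rate > 0.20

# Hypothetical disk capacity of 40 TB.
runway_years = years_until_full(16.9, 40.0, rate)

print(f"growth per year: {rate:.1%}, flagged: {exceeds_threshold}, "
      f"years of capacity left: {runway_years:.1f}")
```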

Page 17: What is Big Data and why should i care?

Examples of when big data has become a problem?

- When queries are slow
- When you run out of disk space
- When your data warehouse can’t import certain types of data
- When your maintenance window gets overrun
- When you are not able to give the users data more frequently
- When you can’t integrate with cloud data

Page 18: What is Big Data and why should i care?

Using “Big” data to complete the picture

1. Social media: customer sentiment
2. Bike sensors: complete journey
3. Bus GPS: react to traffic
4. Wi-Fi: customer movement in stations

Page 19: What is Big Data and why should i care?

What is Hadoop?


Distributed, scalable system on commodity HW

Composed of a few parts:

HDFS – Distributed file system

MapReduce – Programming model

Other tools: HBase, R, Pig, Hive, Flume, Mahout, Avro, Zookeeper

Main players are Hortonworks and Cloudera

[Diagram: Hadoop core services stack. Labels: Core Services, Operational Services, Data Services, HDFS, Sqoop, Flume, NFS, Load & Extract, WebHDFS, Oozie, Ambari, YARN, MapReduce, Hive & HCatalog, Pig, HBase, Falcon]

[Diagram: Hadoop cluster made up of many compute & storage nodes]

Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware
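To make the MapReduce programming model concrete, here is a minimal word-count sketch in plain Python. It is not Hadoop code and uses no Hadoop API: the function names are made up, and everything runs in one process, whereas a real cluster runs the map and reduce steps on many nodes against blocks of files in HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big insights", "data lake"])
print(counts)
```

Because each map call sees only its own line and each reduce call sees only one key, both steps can be spread across commodity servers without the servers sharing state.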

Page 20: What is Big Data and why should i care?

The “Expanded” Hadoop Ecosystem

Page 21: What is Big Data and why should i care?

Hadoop benefits

• Provides storage for big data at a reasonable cost, since it is built around commodity hardware
• Provides a robust environment, as it was designed to provide a fault-tolerant environment and high throughput for extremely large datasets
• Allows for the capture of new or more data such as unstructured, semi-structured, and structured in batch or real-time
• Data can be stored longer, so you no longer have to purge older data
• Provides scalable analytics via distributed storage and distributed processing
• Provides rich analytics via support for languages such as Java, Mahout, Ruby, Python, and R

Page 22: What is Big Data and why should i care?

Reasons not to use Hadoop as your DW

• Hadoop is slow for reading queries. HDP 2.0 today will not perform anywhere near PDW for interactive querying. This is why PolyBase is so important, as it bridges the gap between the two technologies so customers can take advantage of both the unique features of Hadoop and the benefits of an EDW. Truth be told, users won’t want to wait 20+ seconds for a MapReduce job to start up to execute a Hive query
• Hadoop is not relational, as all the data is in files in HDFS, so there always is a conversion process to convert the data to a relational format
• Hadoop is not a database management system. It does not have functionality such as update of data, referential integrity, statistics, ACID compliance, data security, and the plethora of tools and facilities needed to govern corporate data assets
• There is no metadata stored in HDFS, so another tool needs to be used to store that, adding complexity and slowing performance
• Finding expertise in Hadoop is very difficult: a small number of people understand Hadoop and all its various versions and products, versus the large number of people who know SQL
• Super complex, lots of integration with multiple technologies to make it work
• Many tools/technologies/versions/vendors, no standards
• Some reporting tools don’t work against Hadoop

Page 23: What is Big Data and why should i care?

What is a data lake?

Large object-based storage repository that holds data in its native format until it is needed.

• A place to store unlimited amounts of data in any format inexpensively
• Usually Hadoop
• A way to describe any large data pool in which the schema and data requirements are not defined until the data is queried
• Also called bit bucket or landing zone
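The idea that the schema is not defined until the data is queried (often called schema-on-read) can be sketched in a few lines of Python. Everything here is a hypothetical illustration: `RAW_EVENTS` stands in for files landed in the lake in their native format, and `read_with_schema` is a made-up helper, not part of any Hadoop tool.

```python
import json

# Raw lines kept exactly as they arrived; nothing was validated on write.
RAW_EVENTS = [
    '{"device": "bike-17", "speed_kmh": 21.5, "ts": "2014-08-26T09:00:00"}',
    '{"device": "bus-4", "lat": 51.5, "lon": -0.12}',
    'not json at all',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at query time: keep only records that have every
    requested field, and cast each field to the requested type."""
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unreadable for this query, but still stored in the lake
        if not isinstance(record, dict):
            continue
        if all(field in record for field in schema):
            yield {field: cast(record[field]) for field, cast in schema.items()}


# Two different queries impose two different schemas on the same raw data.
speed_rows = list(read_with_schema(RAW_EVENTS, {"device": str, "speed_kmh": float}))
location_rows = list(
    read_with_schema(RAW_EVENTS, {"device": str, "lat": float, "lon": float})
)

print(speed_rows)
print(location_rows)
```

A data warehouse does the opposite (schema-on-write): the structure is fixed up front and rows that do not fit are rejected at load time.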

Page 24: What is Big Data and why should i care?

Query Hadoop data with T-SQL using PolyBase

Bringing the worlds of big data and the data warehouse together for users and IT

Select… → Result set: provides a single T-SQL query model (“semantic layer”) for APS and Hadoop with rich features of T-SQL, including joins without ETL

[Diagram: PolyBase in SQL Server Parallel Data Warehouse queries relational + non-relational data in:]

- Cloudera CDH Linux 4.6
- Hortonworks HDP 2.1 (Windows, Linux)
- Windows Azure HDInsight 2.4 (HDFS)
- Microsoft HDInsight HDP 1.3 (2.0 in AU2)
- AU1: Windows Azure storage blob (WASB)
- Others (SQL Server, DB2, Oracle)? True federated query engine

Page 25: What is Big Data and why should i care?

Use cases where PolyBase simplifies using Hadoop data

Bringing islands of Hadoop data together

High performance queries against Hadoop data

Archiving data warehouse data to Hadoop (move)

Exporting relational data to Hadoop (copy)

Importing Hadoop data into data warehouse (copy)

Page 26: What is Big Data and why should i care?

Big Data Landscape

Page 27: What is Big Data and why should i care?

Big Data Landscape (Version 2.0)

Page 28: What is Big Data and why should i care?

What is the Internet of Things (IoT)?

Internet-connected devices that can perceive the environment in some way, share their data, and communicate with you

- Has at least one processor and sensor to collect information
- Examples: heart monitoring implants, biochip transponders on farm animals, automobiles with built-in sensors, field operation devices that assist firefighters in search and rescue
- Excludes computers, tablets, and smart phones

Cool possibilities

- When a milk carton is almost empty it will ping you when you are near a store
- An alarm clock that signals your coffee maker to start brewing when you wake up
- An embedded chip that monitors your vital signs and notifies a medical provider if they exceed a limit

Page 29: What is Big Data and why should i care?

What is SMP and MPP?

PDW is a data warehouse and MPP (massively parallel processing) solution, not an OLTP (online transaction processing) and SMP (symmetric multiprocessing) solution. SMP is one server where each CPU in the server shares the same memory, disk, and network controllers (scale-up). MPP means data is distributed among many independent servers running in parallel in a shared-nothing architecture, where each server operates self-sufficiently and controls its own memory and disk (scale-out).
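The shared-nothing, scale-out idea can be sketched in plain Python. This is a simplified simulation under stated assumptions, not how PDW is implemented: the "nodes" are just lists in one process, the function names and the sample `sales` rows are hypothetical, and a CRC32 hash stands in for whatever distribution function a real MPP system uses.

```python
import zlib
from collections import defaultdict


def distribute(rows, key, node_count):
    """Hash-distribute rows so each node owns a disjoint slice of the data."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        node_id = zlib.crc32(str(row[key]).encode("utf-8")) % node_count
        nodes[node_id].append(row)
    return nodes


def local_aggregate(node_rows, group_key, value_key):
    """Each node sums only its own rows, touching no other node's memory or disk."""
    totals = defaultdict(float)
    for row in node_rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


def merge(partials):
    """A control step combines the small partial results from every node."""
    result = defaultdict(float)
    for partial in partials:
        for key, value in partial.items():
            result[key] += value
    return dict(result)


# Hypothetical fact rows.
sales = [
    {"customer_id": 1, "region": "East", "amount": 10.0},
    {"customer_id": 2, "region": "West", "amount": 5.0},
    {"customer_id": 3, "region": "East", "amount": 20.0},
    {"customer_id": 4, "region": "West", "amount": 7.5},
]

nodes = distribute(sales, "customer_id", node_count=4)
# In a real MPP system these per-node aggregations run in parallel.
partials = [local_aggregate(node, "region", "amount") for node in nodes]
totals = merge(partials)
print(totals)
```

Adding nodes shrinks each node's slice, which is the scale-out alternative to buying a bigger single SMP server.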

When do you need an MPP solution?

- We need at least 3x performance improvement
- We are near disk capacity and see a lot of growth in the upcoming years
- We need to support queries during our maintenance window
- We need to load data outside of our maintenance window
- We want to make non-relational data part of our data warehouse
- We will spend a lot of money for FusionIO cards, SSDs, more SAN space, more memory, faster CPUs

Page 30: What is Big Data and why should i care?

How to “break” the traditional data warehouse

[Diagram: Data sources (OLTP, ERP, CRM, LOB) → ETL → Data warehouse → BI and analytics, with four pressures on the traditional design:]

1. Increasing data volumes
2. Real-time performance/data
3. New data sources & types: non-relational data (devices, web, sensors, social)
4. Cloud-born data

Page 31: What is Big Data and why should i care?

Modern data warehouse defined

[Diagram: layers of the modern data warehouse:]

- Infrastructure
- Data management & processing: relational, non-relational, analytical, streaming, internal & external
- Data enrichment and federated query: extract, transform, load; single query model; data quality; master data management
- BI & analytics: self-service, collaboration, corporate, predictive, mobile

Data sources: OLTP, ERP, CRM, LOB

Non-relational data: devices, web, sensors, social

Page 32: What is Big Data and why should i care?

The Microsoft Modern Data Warehouse

[Diagram: Source data → Staging (Hadoop) → Staging (RDBMS) → Data warehouse → OLAP → User presentation]

- Big data sources: unstructured data (Word docs, blobs, logs); semi-structured data (XML, JSON); structured data (.TXT, CSV, delimited); other (social media, sensors, devices)
- Hadoop ecosystem (APS/HDI)
- Staging DB, ODS, EDW (APS/PDW)
- SQL Server Analysis Services
- PolyBase links the Hadoop and relational stages

Page 33: What is Big Data and why should i care?

Introducing the Microsoft Analytics Platform System

Your turnkey modern data warehouse appliance

Next-generation performance at scale

Enterprise-ready big data

Engineered for optimal value

• Relational and non-relational data in a single appliance

• Or, integrate relational data with non-relational data in an external Hadoop cluster on premise or data stored in the Cloud (hot, warm, cold)

• Enterprise-ready Hadoop

• Integrated querying across Hadoop and APS using T-SQL (PolyBase)

• Direct integration with Microsoft BI tools such as Power BI

• Near real-time performance with In-Memory

• Scale-out to accommodate your growing data or to increase performance (2-nodes to 56-nodes)

• Remove SMP DW bottlenecks with MPP SQL Server

• No rip and replace when more performance needed

• No performance tuning required

• Concurrency that fuels rapid adoption

• Industry’s lowest DW price/TB

• Value through a single appliance solution

• Value with flexible hardware options using commodity hardware

• Free up space on SAN (cost averages 10k per TB)

Page 34: What is Big Data and why should i care?

© 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Questions?

James [email protected]

Blog about PDW topics: http://www.jamesserra.com/archive/category/pdw/