1 BIG-05 Introduction to Hadoop: Qlik integration points and ecosystem David Freriks Technology Evangelist February 2017

1

BIG-05

Introduction to Hadoop:

Qlik integration points and ecosystem

David Freriks – Technology Evangelist

February 2017

2

Agenda

• Apache Hadoop basics

• Big Data Vendors

• Hadoop Native Components

• “Big Data” Qlik connection options

• Hadoop Distribution Vendors

3

Background

• History of Hadoop: Doug Cutting and Mike Cafarella created Hadoop around 2005 as part of an open source web crawler and search engine project (Nutch); Yahoo later funded much of its development.

─ What we call Hadoop today (named after a toy elephant belonging to Cutting’s son) started as a fusion of two technologies: HDFS (modeled on GFS, the Google File System) and MapReduce (also based on a Google design).

• Hadoop became a separate Apache project in 2006. The Apache Software Foundation (https://www.apache.org/) is an open source community that incubates and curates over 350 projects; Hadoop and its ancillary components account for a major share of the ongoing work.

4

Apache Big Data Master Map

• There are clearly a lot of components involved in a Big Data implementation

• Very few of these are important for Qlik

• Many of the components have been or will be replaced or superseded as Hadoop evolves

• These are just a sample of the open source projects currently being incubated by Apache

* courtesy Geoffrey Fox

5

Let’s Recap - What is traditionally “Big Data”?

• Big Data is: Nebulous

• Big Data is: Really Big or Not

• Big Data is: Lots of Noise

• Big Data is: Slow

• Big Data is: Difficult

6

Popular “Big Data” Myths

• You need to have Ga-zinga-bytes of data to deploy a Big Data solution

– Typical Cloudera Cluster is 15-20 nodes, < 10TB of data

– Hadoop storage is roughly 3-4x cheaper than an EDW (enterprise data warehouse)

• Hadoop is all you need

– Hadoop is an enabling technology that provides the foundation for Big Data solutions

– Focus today is on data management

• The RDBMS is dead

– RDBMS is still critical – but not for high volume, low quality analytics

• BI tools can’t handle Big Data

– The reality is that a human can’t handle Big Data

– Most tools can’t either, unless they have a platform behind them

7

Big Data vs. Fast Data vs. Right Data

• Big Data is rapidly shifting from how much data you can handle to how quickly you can deliver value

– Volume of data is just one factor, and a less and less critical one

– Context and value are key, but difficult to pinpoint

• Big Data:

– Hadoop is designed to support petabytes and beyond

• Fast Data:

– Teradata, Vertica, JethroData, AtScale, etc.

• Big Data is slow & cheap, Fast Data is neither

• A Big Data Solution requires components that address both

– Hadoop is the data system that combines the Fast and Big platforms

– Qlik is the platform that supports both scenarios simultaneously

8

“Big Data” Refresher - Hadoop Native Interfaces

• HDFS: Hadoop Distributed File System

• MapReduce: Processing framework for writing scalable data applications

• Hue: Built-in SQL interface for the Hive and Impala engines

• YARN: Resource management for core Hadoop

• Hive / Tez / LLAP: System for querying data on top of HDFS (SQL-like query)

• HBase / Cassandra: NoSQL databases used in conjunction with Hadoop

• Impala / Drill / Presto: Hadoop-accelerated SQL query engines that bypass MapReduce

• Spark / Streaming: In-memory large-scale data processing, 100x faster than Hadoop

• Spark SQL: SQL engine on top of Spark

• Solr: Hadoop search and indexing engine (connect via REST)

And on, and on… 350+ components and growing

Core Hadoop

Spark

* Indicates an interface available to Qlik via SQL/REST
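
As a rough illustration of the MapReduce processing model listed on this slide, here is a minimal single-process sketch in Python. It is not Hadoop code and does not use any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are hypothetical, chosen only to show the map, shuffle, and reduce steps that the framework distributes across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework in Hadoop)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence on an in-memory list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only
    sample = ["big data is big", "fast data is fast"]
    print(word_count(sample))
```

In a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data over the network; engines such as Impala, Drill, and Presto avoid this batch model for SQL queries.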

9

Qlik “Big Data” Options

10

Hadoop Distribution Options

• Cloudera

• Hortonworks

• MapR

• Databricks

• Amazon EMR

11

Cloudera

• Founded in 2008 by people from some of Silicon Valley’s leading companies, including Google (Christophe Bisciglia), Yahoo (Amr Awadallah), Oracle (Mike Olson), and Facebook (Jeff Hammerbacher).

• Cloudera has over 1,400 employees across the globe who are committed to excellence in big data management.

• Cloudera has won numerous awards and accolades from industry watchdogs, including the 2014 and 2015 Database Trends and Applications Magazine Readers’ Choice Award for best analytical platform.

*from Cloudera website

12

Hortonworks

• Hortonworks, Inc.® (NASDAQ: HDP) is a leading innovator in the data industry, creating, distributing and supporting enterprise-ready open data platforms and modern data applications. Our mission is to manage the world’s data. We have a single-minded focus on driving innovation in open source communities such as Apache Hadoop, NiFi, and Spark.

• We along with our partners provide the expertise, training and services that allow our customers to unlock transformational value for their organizations across any line of business. Our connected data platforms power modern data applications that deliver actionable intelligence from all data: data-in-motion and data-at-rest.

• We are Powering the Future of Data™.

*from Hortonworks website

13

MapR

• MapR provides the industry’s only converged data platform that uniquely allows applying analytical insights to operational processes in real-time to create competitive advantage for our customers.

• Our vision is of a platform that converges historically separate product segments/categories in order to enable extraordinary new value never before possible.

• The MapR Platform is powered by the industry’s fastest, most reliable, secure, and open data infrastructure that dramatically lowers TCO and enables global real-time data applications.

*from MapR website

14

Databricks

• Databricks was founded out of the UC Berkeley AMPLab by the team that created Apache Spark. We believe that Big Data is a huge opportunity that is still largely untapped, and we’re working to revolutionize what you can do with it.

• Apache Spark is 100% open source, hosted at the vendor-independent Apache Software Foundation. At Databricks, we are fully committed to maintaining this open development model. We believe that no computing platform will win in the Big Data space unless it is fully open.

• Spark has one of the largest open source communities in Big Data, with over 1000 contributors from 250+ organizations. Databricks works within the open source community to maintain this momentum.

*from Databricks website

• Note: Databricks is not really a distribution vendor at all, but a cloud provider of Spark and other components as a hosted service. Its unique offering does, however, sometimes compete against the other distros.

15

Amazon EMR

• Amazon EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. You can also run other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink in Amazon EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.

• Amazon EMR is based on Hadoop, a Java-based programming framework that supports the processing of large data sets in a distributed computing environment. MapReduce is a software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers. Amazon EMR processes data across a Hadoop cluster of virtual servers on the Amazon Elastic Compute Cloud (EC2). The “elastic” in EMR’s name refers to its dynamic resizing ability, which allows it to ramp up or reduce resource use depending on the demand at any given time.

*from Amazon’s website

• Amazon is comparable to Databricks in that it is a hosted environment; however, you can host other mainline distributions (e.g., MapR) or run Amazon’s own version of Hadoop.

Unique Offerings related to Qlik: Amazon provides ODBC drivers to connect to standard Hadoop SQL interfaces such as Hive, Impala, and Spark SQL.
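
To make the ODBC connection point concrete, here is a minimal sketch in Python of the "Key=Value;Key=Value" connection-string form that ODBC clients pass to a driver. Everything specific in it is an assumption for illustration: the helper name build_odbc_connection_string, the driver name, the host, and the set of keys are hypothetical, and the keys a real Hive, Impala, or Spark SQL driver accepts are defined by that driver's own documentation.

```python
def build_odbc_connection_string(settings):
    """Join key/value pairs into the 'Key=Value;Key=Value' form used by ODBC."""
    return ";".join(f"{key}={value}" for key, value in settings.items())


# Hypothetical settings: the driver name, host, and keys are placeholders,
# not taken from any vendor's documentation.
hive_settings = {
    "Driver": "Example Hive ODBC Driver",
    "Host": "hadoop-master.example.com",
    "Port": 10000,  # assumed here; HiveServer2 commonly listens on this port
    "Schema": "default",
}

if __name__ == "__main__":
    print(build_odbc_connection_string(hive_settings))
```

A client such as Qlik would then send ordinary SQL over that connection, and the SQL engine on the cluster (Hive, Impala, or Spark SQL) executes it against data in Hadoop.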

16

Summary & Additional Resources

• Hadoop is a very complicated system, but for Qlik there are only a few SQL components we need to use

• Each distribution offers its own unique set of advantages and disadvantages for the SQL engines used by Qlik

• The next topic, BIG-06, will discuss the performance of the individual native/open source SQL engines for Hadoop.

• Resources:

─ “Hadoop: The Definitive Guide” (4th edition) by Tom White