1
BIG-05
Introduction to Hadoop:
Qlik integration points and ecosystem
David Freriks – Technology Evangelist
February 2017
2
Agenda
• Apache Hadoop basics
• Big Data Vendors
• Hadoop Native Components
• “Big Data” Qlik connection options
• Hadoop Distribution Vendors
3
Background
• History of Hadoop: Hadoop was created jointly by Doug Cutting and
Mike Cafarella in 2005 as a project, later funded by Yahoo, to crawl web data for
a search engine.
─ What we call Hadoop today (named after Cutting’s son’s toy elephant)
started as a fusion of two separate technologies: HDFS (modeled on GFS, the Google
File System) and MapReduce (also based on a Google design).
• The project was donated to the Apache Software Foundation in 2006. The Apache
Software Foundation (https://www.apache.org/) is an open source community that
collaborates on, incubates, and curates over 350 projects, of which Hadoop
and its ancillary components make up a major share of the ongoing work
within the organization.
4
Apache Big Data Master Map
• There are clearly a lot of components
involved with a Big Data implementation
• Very few of these are important for Qlik
• Many of the components have been or will be
replaced/superseded as Hadoop evolves
• These are just a sample of the open
source projects currently being incubated by
Apache
* courtesy Geoffrey Fox
5
Let’s Recap - What is traditionally “Big Data”?
• Big Data is: Nebulous
• Big Data is: Really Big or Not
• Big Data is: Lots of Noise
• Big Data is: Slow
• Big Data is: Difficult
6
Popular “Big Data” Myths
• You need to have Ga-zinga-bytes of data to deploy a Big Data solution
– Typical Cloudera Cluster is 15-20 nodes, < 10TB of data
– Hadoop storage is roughly 3-4x cheaper than an enterprise data warehouse (EDW)
• Hadoop is all you need
– Hadoop is an enabling technology that provides the foundation for Big Data solutions
– Focus today is on data management
• The RDBMS is dead
– RDBMS is still critical – but not for high volume, low quality analytics
• BI tools can’t handle Big Data
– The reality is that a human can’t handle Big Data
– Most tools can’t either, unless they have a platform behind them.
7
Big Data vs. Fast Data vs. Right Data
• Big Data is rapidly shifting from how much data you can handle to how quickly you can deliver value
– Volume of data is just one factor, and a less and less critical one
– Context and value are key, but difficult to pinpoint
• Big Data:
– Hadoop is designed to support petabytes and beyond
• Fast Data:
– Teradata, Vertica, JethroData, AtScale, etc.
• Big Data is slow & cheap, Fast Data is neither
• A Big Data Solution requires components that address both
– Hadoop is the data system that combines Fast and Big platforms
– Qlik is the platform that supports both scenarios simultaneously
8
“Big Data” Refresher - Hadoop Native Interfaces
• HDFS: Hadoop Distributed File System
• MapReduce: Processing framework for writing scalable data applications
• Hue: Built-in SQL interface for Hive and Impala engines
• YARN: Resource management for core Hadoop
• Hive / Tez / LLAP: System for querying data on top of HDFS (SQL-like query)
• HBase / Cassandra: NoSQL databases used in conjunction with Hadoop
• Impala / Drill / Presto: Hadoop accelerated SQL query engines that bypass MapReduce
• Spark / Streaming: In-memory large-scale data processing, up to 100x faster than Hadoop MapReduce
• Spark SQL: SQL engine on top of Spark
• Solr: Hadoop search and indexing engine (connect via REST)
And on, and on… 350+ components and growing
Core Hadoop
Spark
* Indicates available interface via SQL/REST for Qlik
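To make the MapReduce entry in the list above more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using the classic word-count example. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, and a real cluster would distribute these steps across many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data is slow", "fast data is not big data"]
    print(word_count(sample))
```

The SQL engines on the slide (Hive, Impala, Spark SQL, and so on) exist largely so that users can express this kind of aggregation as a query instead of hand-writing the map and reduce functions.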
11
Cloudera
• Founded in 2008 by engineers from some of Silicon
Valley’s leading companies, including
Google (Christophe Bisciglia), Yahoo
(Amr Awadallah), Oracle (Mike Olson),
and Facebook (Jeff Hammerbacher).
• Cloudera has over 1,400 employees
across the globe that are committed to
excellence in big data management.
• Cloudera has won numerous awards
and accolades from industry
watchdogs—including the 2014 and
2015 Database Trends and
Applications Magazine Readers’
Choice Award for best analytical
platform.
*from Cloudera website
12
Hortonworks
• Hortonworks, Inc.® (NASDAQ: HDP) is a leading
innovator in the data industry, creating, distributing
and supporting enterprise-ready open data
platforms and modern data applications. Our
mission is to manage the world’s data. We have a
single-minded focus on driving innovation in open
source communities such as Apache Hadoop,
NiFi, and Spark.
• We along with our partners provide the expertise,
training and services that allow our customers to
unlock transformational value for their
organizations across any line of business. Our
connected data platforms power modern data
applications that deliver actionable intelligence
from all data: data-in-motion and data-at-rest.
• We are Powering the Future of Data™.
*from Hortonworks website
13
MapR
• MapR provides the industry’s only converged
data platform that uniquely allows applying
analytical insights to operational processes in
real-time to create competitive advantage for
our customers.
• Our vision is of a platform that converges
historically separate product
segments/categories in order to enable
extraordinary new value never before possible.
• The MapR Platform is powered by the
industry’s fastest, most reliable, secure, and
open data infrastructure that dramatically
lowers TCO and enables global real-time data
applications.
*from MapR website
14
Databricks
• Databricks was founded out of the UC Berkeley
AMPLab by the team that created Apache Spark.
We believe that Big Data is a huge opportunity
that is still largely untapped, and we’re working to
revolutionize what you can do with it.
• Apache Spark is 100% open source, hosted at
the vendor-independent Apache Software
Foundation. At Databricks, we are fully committed
to maintaining this open development model. We
believe that no computing platform will win in the
Big Data space unless it is fully open.
• Spark has one of the largest open source
communities in Big Data, with over 1000
contributors from 250+ organizations. Databricks
works within the open source community to
maintain this momentum.
*from Databricks website
• Note: Databricks is not really a distribution vendor at all,
but a cloud provider of Spark and other components as a
hosted service. The unique offering does, however,
sometimes compete against the other distros.
15
Amazon EMR
• Amazon EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to
process vast amounts of data across dynamically scalable Amazon EC2 instances. You can also run
other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink in Amazon EMR,
and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.
• Amazon EMR is based on Hadoop, a Java-based programming framework that supports the processing
of large data sets in a distributed computing environment. MapReduce is a software framework that
allows developers to write programs that process massive amounts of unstructured data in parallel
across a distributed cluster of processors or stand-alone computers. Amazon EMR processes data
across a Hadoop cluster of virtual servers on the Amazon Elastic Compute Cloud (EC2). The elastic in
EMR's name refers to its dynamic resizing ability, which allows it to ramp up or reduce resource use
depending on the demand at any given time.
• Amazon EMR is comparable to Databricks in that it is a hosted environment; however, you can host other
mainline distributions (e.g. MapR) or run Amazon’s own version of Hadoop.
*from Amazon’s website
Unique Offerings related to Qlik: Amazon provides ODBC drivers to connect to standard Hadoop SQL
interfaces such as Hive, Impala, and Spark SQL.
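As a rough illustration of what connecting through such an ODBC driver involves, the sketch below assembles a DSN-less connection string for one of the SQL interfaces named above. Everything specific in it is an assumption: the driver names, the `Host`/`Port`/`Schema` keys, and the example host are hypothetical placeholders, and the real values come from the documentation of whichever driver is installed.

```python
from typing import Dict

# Hypothetical driver names for illustration only; the real name is whatever
# the installed ODBC driver registers itself as on the client machine.
HYPOTHETICAL_DRIVERS: Dict[str, str] = {
    "hive": "Example Hive ODBC Driver",
    "impala": "Example Impala ODBC Driver",
    "sparksql": "Example Spark SQL ODBC Driver",
}


def build_connection_string(engine: str, host: str, port: int,
                            schema: str = "default") -> str:
    """Assemble a DSN-less ODBC connection string for a Hadoop SQL engine.

    The key names (Driver, Host, Port, Schema) are assumed here; actual
    drivers may use different keys and need authentication settings too.
    """
    if engine not in HYPOTHETICAL_DRIVERS:
        raise ValueError(f"unsupported engine: {engine!r}")
    parts = {
        "Driver": "{" + HYPOTHETICAL_DRIVERS[engine] + "}",
        "Host": host,
        "Port": str(port),
        "Schema": schema,
    }
    return ";".join(f"{key}={value}" for key, value in parts.items())


if __name__ == "__main__":
    # emr-master.example.com is a placeholder host; 10000 is the usual
    # default port for HiveServer2.
    print(build_connection_string("hive", "emr-master.example.com", 10000))
```

A string of this shape is what an ODBC-capable client would hand to the driver manager; in practice the same settings are often stored in a named DSN instead.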
16
Summary & Additional Resources
• Hadoop is a very complicated system, but for Qlik
there are only a few SQL components we need to use
• Each distribution offers its own unique set of
advantages and disadvantages for SQL engines used
by Qlik
• The next topic, BIG-06, will discuss the performance of
the individual native/open source SQL engines for
Hadoop.
• Resources:
─ “Hadoop: The Definitive Guide” (4th edition) by Tom White