HOW TO LIVE WITH THE ELEPHANT IN THE SERVER ROOM – APACHE HADOOP WORKSHOP

Transcript of the "How to Live with the Elephant in the Server Room" Apache Hadoop workshop.

Page 1:

HOW TO LIVE WITH THE ELEPHANT IN THE SERVER ROOM

APACHE HADOOP WORKSHOP

Page 2:

AGENDA

• Introduction

• What is Hadoop and the rationale behind it

• Hadoop Distributed File System (HDFS) and MapReduce

• Common Hadoop use cases

• How Hadoop integrates with other systems like Relational Databases and Data Warehouses

• The other components in a typical Hadoop “stack” such as: Hive, Pig, HBase, Sqoop, Flume and Oozie

• Conclusion

Page 3:

ABOUT TRIFORCE

Triforce provides critical, reliable IT infrastructure solutions and services to Australian and New Zealand listed corporations and government agencies. Triforce has qualified and experienced technical and sales consultants and demonstrated experience in designing and delivering enterprise Apache Hadoop solutions.

Page 4:

TRIFORCE BIG DATA PARTNERSHIP

Partners: NetApp and Cloudera

• Cloudera is the market leader in Hadoop enterprise solutions. Cloudera’s 100% open-source distribution including Apache Hadoop (CDH), combined with Cloudera Enterprise, comprises the most reliable and complete Hadoop solution available.

• The NetApp Open Solution for Hadoop provides customers with flexible choices for delivering enterprise-class Hadoop.

Page 5:

WHAT IS HADOOP?

• “a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.” (http://hadoop.apache.org/)

• “Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data.” (http://en.wikipedia.org/wiki/Hadoop/)

Page 6:

THE RATIONALE FOR HADOOP

• “Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and can scale without limits. With Hadoop, no data is too big.” (http://www.cloudera.com)

• Hadoop processes petabytes of unstructured data in parallel across potentially thousands of commodity boxes using an open source file-system and related tools

• Hadoop has been all about innovative ways to process, store, and eventually analyse huge volumes of multi-structured data.

Page 7:

EXAMPLES

• 2.7 Zettabytes of data exist in the digital universe today. (Gigabyte, Terabyte, Petabyte, Exabyte, Zettabyte)

• Facebook stores, accesses, and analyses 30+ Petabytes of user generated data.

• Decoding the human genome originally took 10 years to process; now it can be achieved in one week.

• YouTube users upload 48 hours of new video every minute of the day.

• 100 terabytes of data uploaded daily to Facebook

Page 8:

HADOOP

• Handles all types of data: structured, unstructured, log files, pictures, audio files, communications records, email

• No prior need for a schema: you don’t need to know how you intend to query your data before you store it

• Makes all of your data usable: by making all of your data usable, not just what’s in your databases, Hadoop lets you see relationships that were hidden before and reveals answers that have always been just out of reach. You can start making more decisions based on hard data instead of hunches, and look at complete data sets, not just samples.

• Two parts to Hadoop:
– MapReduce
– Hadoop Distributed File System (HDFS) — a short shell session is sketched below
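As a minimal sketch of HDFS in use (the directory and file names are purely illustrative), any type of file can be stored and read back with simple shell commands:

$ hadoop fs -mkdir /data/logs          # create a directory in HDFS
$ hadoop fs -put access.log /data/logs/  # load a local log file into it
$ hadoop fs -ls /data/logs             # list the directory
$ hadoop fs -cat /data/logs/access.log # read the file back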

Page 9:

What is this Big Elephant? HADOOP

Geever Paul Pulikkottil

Big Data Solutions Architect (CCAH, CCDH)

Page 10:

CASE FOR BIGDATA

Databases:
– Here for more than 20 years
– Continue to store structured transactional data on:
• Large server(s)
• Multiple CPUs
• Huge memory buffers
• SAN disks
– Relatively low-latency queries over indexed data

Page 11:

TYPICAL WORKLOADS – DATABASE

OLTP (online transaction processing)

• Typical Use: e-commerce, banking

• Nature: User facing, real-time, low latency, highly-concurrent

• Job: relatively small set of “standard” transactional queries

• Data access pattern: random reads, updates, writes (relatively small data)

OLAP (online analytical processing)

• Typical Use: BI, Data Mining

• Nature: Back-end processing, Batch workloads

• Job: complex analytical queries, often ad hoc

• Data access pattern: table scans, large queries


Page 12:

CASE FOR BIGDATA

Data warehouse:
– Consolidated database loaded from CRM, ERP, and OLTP systems
– Process: staging, cleansing, loading
– Purpose: BI reporting, forecasts, quarterly reporting
– Size: larger servers, multiple CPUs, SAN disks, many TBs

• Challenges:
– As the data grows over time, things get slower
– Batch jobs must fit within the daily or weekly loading cycle
– Relatively expensive to license, store, and manage

Page 13:

CASE FOR BIGDATA

New objective: businesses want to “connect” with the customer
• We are generating lots of data, and most of it is discarded
• Likes and dislikes: Facebook, Twitter, LinkedIn
• Predictable outcomes: you can predict them when you know the customer
• React quickly: time missed = opportunity lost!

Question: can the DW provide that?
• Where can you store TBs or PBs of unstructured data more economically?
• How can you scale out easily, rather than doing forklift upgrades?
• How can you finish batch jobs when the data grows beyond TBs?
• We need a scalable, distributed system that can store and process large amounts of data

Page 14:

CASE FOR BIGDATA

• Distributed systems are not NEW:
– Common frameworks include MPI and PVM
– Focus on distributing the processing workload
– Powerful compute nodes, with separate systems for data storage
– Fast network connections (InfiniBand)

• Typical processing pattern:
– Step 1: Copy input data from storage to a compute node
– Step 2: Perform the necessary processing
– Step 3: Copy the output data back to storage
– Often hundreds to thousands of nodes, with GPUs

Page 15:

CASE FOR BIGDATA

Distributed HPC:
– Handles relatively small amounts of data
– Doesn’t scale with large amounts of data
– More time is spent copying data than actually processing it
– Getting data to the processors is the bottleneck
– Gets worse as more compute nodes are added
– Each node competes for the same bandwidth
– Compute nodes become starved for data

“Distributed systems pay for compute scalability by adding programming complexity (CUDA Fortran, PGI compilers).”

Page 16:

BIGDATA SOLUTION: HADOOP

What is Hadoop?
– An open-source distributed computing platform
– Its file system is based on Google’s GFS
– Commodity hardware: no SAN, no InfiniBand
– Scales from single servers up to thousands of machines
– Each machine offers local computation and storage
– Designed to detect and handle failures at the application layer
– Adding more nodes increases performance and capacity with no penalty
– Commodity hardware is prone to failures, and Hadoop knows that!

Page 17:

HADOOP CLUSTER STACK

Master nodes (1st rack):
- NameNode
- Standby NameNode
- JobTracker

Slave nodes (all racks):
- DataNodes with direct-attached, large-capacity disks (SATA)

Plus:
- Management or admin node
- Hadoop client node(s)

This is the typical setup.

Page 18:

MAPREDUCE PROGRAMMING

Hadoop is great for large-data processing!
- MapReduce code requires you to write a Java class and driver code (a minimal sketch follows this list)
- It is complicated to write MapReduce jobs, so we need a simpler method
- The answer: develop a higher-level language to facilitate large-data processing
- Hive: an SQL-like language for Hadoop, called HQL
- Pig: Pig Latin is a scripting language, a bit like Perl
- Both translate into, and run as, a series of Map-only or MapReduce jobs
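To make the point concrete, here is a minimal sketch of the classic word-count job using the org.apache.hadoop.mapreduce API; even this simplest of jobs needs a mapper class, a reducer class, and driver code:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: emit (word, 1) for every word in the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
  // Reducer: sum the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }
  // Driver: configure and submit the job
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}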

Page 19:

ECOSYSTEM TOOLS: HIVE AND PIG

Hive:
- Data warehousing application on Hadoop
- Query language is HQL, a variant of SQL
- Tables are stored on HDFS as flat files
- Developed by Facebook, now open source

Pig:
- Large-scale data processing system
- Scripts are written in Pig Latin, a dataflow language
- Developed by Yahoo!, now open source

Objective:
- A higher-level language to facilitate large-data processing
- The higher-level language “compiles down” to Hadoop jobs

Page 20:

HIVE AND PIG EXAMPLE CODE

Hive example:
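(The code on the original slide is an image and is not reproduced in this transcript; the following is a representative HQL session of the kind shown, with illustrative table and file names.)

CREATE TABLE users (id INT, name STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/data/users.tsv' INTO TABLE users;

-- this query runs as one or more MapReduce jobs under the covers
SELECT country, COUNT(*) AS num_users
FROM users
GROUP BY country;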

Pig example:
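(Likewise a representative Pig Latin script, with illustrative paths and field names; the original slide image is not reproduced.)

-- each statement below compiles down to MapReduce stages
users    = LOAD '/data/users.tsv' USING PigStorage('\t')
           AS (id:int, name:chararray, country:chararray);
by_ctry  = GROUP users BY country;
counts   = FOREACH by_ctry GENERATE group AS country, COUNT(users) AS num_users;
STORE counts INTO '/data/user_counts';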

Page 21:

ECOSYSTEM TOOLS: SQOOP

- Imports data from an RDBMS into Hadoop

- Individual tables, portions of tables (via a WHERE clause), or entire databases

- Stored in HDFS as delimited text files or SequenceFiles

- Provides the ability to import from SQL databases straight into your Hive data warehouse

- Uses JDBC to connect to the RDBMS; additional connectors are available for BI/DW systems

- Sqoop automatically generates a Java class to import the data into Hadoop

- Sqoop provides an incremental import mode (see the sketch after this list)

- Exports tables from Hadoop back to the RDBMS
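As a sketch of the incremental import mode (the connection string, table, and check column are illustrative), Sqoop can pull only the rows added since the last run:

$ sqoop import --connect jdbc:mysql://db.example.com/website --table ORDERS \
    --username test --password **** \
    --incremental append --check-column order_id --last-value 12345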

Page 22:

SQOOP IMPORT EXAMPLES

> Importing Data into HDFS as Hive table using SQOOP

user@dbserver$> sqoop --connect jdbc:mysql://db.example.com/website --table USERS --local \

--hive-import

> Importing Data to HDFS as compressed sequence files (No Hive) using SQOOP

user@dbserver$>sqoop --connect jdbc:mysql://db.example.com/website --table USERS \

--as-sequencefile

> Importing Data into HBase using SQOOP:

$ sqoop import --connect jdbc:mysql://localhost/acmedb \

--table ORDERS --username test --password **** \

--hbase-create-table --hbase-table ORDERS --column-family mysql

>Exporting Data to RDBMS using SQOOP:

$ sqoop export --connect jdbc:mysql://localhost/acmedb \

--table ORDERS --username test --password **** \

--export-dir /user/arvind/ORDERS

• The first command connects to the MySQL database on the server and imports the USERS table into HDFS.

• The --local option instructs Sqoop to take advantage of a local MySQL connection.

• With the --hive-import option, after reading the data into HDFS, Sqoop connects to the Hive metastore, creates a table named USERS with the same columns and types (translated into their closest analogues in Hive), and loads the data into the Hive warehouse directory on HDFS (instead of a subdirectory of your HDFS home directory).

Page 23:

SQOOP CUSTOM CONNECTORS

Sqoop works with standard JDBC connections to common databases; custom, faster, tuned connectors are available for:

– Cloudera Connector for Teradata

– Cloudera Connector for Netezza

– Cloudera Connector for MicroStrategy

– Cloudera Connector for Tableau

– Quest Data Connector for Oracle and Hadoop

Page 24:

ECOSYSTEM TOOLS: FLUME

Flume gathers data and logs from multiple systems, inserting them into HDFS as they are generated. It is typically used to ingest log files from real-time systems such as web servers, firewalls, and mail servers into HDFS.

Each Flume agent has a source, a channel, and a sink (a sample agent configuration follows):

Source
– Tells the node where to receive data from

Sink
– Tells the node where to send data to

Channel
– A queue between the source and sink
– Can be in-memory only, or ‘durable’
– Durable channels will not lose data if power is lost
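As a sketch of how these pieces are wired together (the agent name, log path, and HDFS path are illustrative), a Flume NG agent is defined in a simple properties file:

# one source, one channel, one sink
agent1.sources  = weblog
agent1.channels = mem
agent1.sinks    = hdfsout

# source: tail a web server log as lines are written
agent1.sources.weblog.type     = exec
agent1.sources.weblog.command  = tail -F /var/log/httpd/access_log
agent1.sources.weblog.channels = mem

# channel: in-memory queue (use type = file for a durable channel)
agent1.channels.mem.type = memory

# sink: write the events into HDFS
agent1.sinks.hdfsout.type      = hdfs
agent1.sinks.hdfsout.hdfs.path = /flume/weblogs
agent1.sinks.hdfsout.channel   = mem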

Page 25:

ECOSYSTEM TOOLS: FUSE

FUSE :“ Filesystem in Userspace “– Allows HDFS to be mounted as a UNIX file system

– User can operate 'ls', 'cd', 'cp', 'mkdir', 'find', 'grep', or use

standard Posix libraries like open, write, read, close.

– You can export a fuse mount using NFS,
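A minimal sketch, assuming the CDH-style hadoop-fuse-dfs binary and an illustrative NameNode host and mount point:

$ mkdir -p /mnt/hdfs
$ hadoop-fuse-dfs dfs://namenode.example.com:8020 /mnt/hdfs
$ ls /mnt/hdfs                              # browse HDFS like a local disk
$ grep ERROR /mnt/hdfs/data/logs/access.log # ordinary UNIX tools just work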

Page 26:

ECOSYSTEM TOOLS: OOZIE

Oozie:
– Oozie is a ‘workflow engine’

– Runs workflows of Hadoop jobs

– Including Pig, Hive, and Sqoop jobs

– Jobs can be run at specific times, one-off or recurring

– Jobs can also be run when data is present in a directory
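Workflows themselves are defined in an XML file; as a sketch (the Oozie server URL and properties file name are illustrative), submitting and checking a workflow from the command line looks like this:

$ oozie job -oozie http://oozie.example.com:11000/oozie \
    -config nightly-etl.properties -run
$ oozie job -oozie http://oozie.example.com:11000/oozie -info <job-id>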

Page 27:

ECOSYSTEM TOOLS: MAHOUT

Mahout:
- Mahout is a machine learning library for Hadoop

- Contains many pre-written ML algorithms (a command-line run is sketched below)

- R is another open-source library used by data scientists
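As a rough sketch of driving a pre-written algorithm such as k-means from the command line (the input/output paths are illustrative, and the exact flags vary between Mahout releases):

$ mahout kmeans -i /data/vectors -c /data/seed-centroids -o /data/clusters \
    -k 10 -x 20   # 10 clusters, at most 20 iterations, run as MapReduce jobs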

Page 28:

ECOSYSTEM TOOLS: IMPALA (CDH4.1)

IMPALA:
– Brings real-time, ad hoc query to Hadoop

– Queries data stored in HDFS or HBase

– SELECT, JOIN, and aggregate functions in real time

– Uses the same Hive metadata

– SQL syntax (Hive SQL), ODBC driver

– Same user interface (Hue Beeswax) as Hive, plus the Impala shell (sketched below)

– Released 26 Oct 2012 with CDH4.1
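A minimal sketch: because Impala shares the Hive metastore, an existing Hive table (here the hypothetical users table from the Hive sketch earlier) can be queried directly from the Impala shell:

$ impala-shell -q "SELECT country, COUNT(*) FROM users GROUP BY country;"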

Page 29:

HBASE – REAL-TIME DATA WITH UPDATES

HBase is a distributed, sparse, column-oriented data store

– Real-time read/write access to data on HDFS

– Modeled after Google’s Bigtable data store

– Designed to use multiple machines to store and serve data

– Leverages HDFS to store its data

– Each row may or may not have values for all columns

– Data is stored grouped by column, rather than by row

– Columns are grouped into ‘column families’, which define which columns are physically stored together

– Scales to provide very high write throughput: hundreds of thousands of inserts per second

– Has a constrained access model: NO SQL
• Insert a row, retrieve a row, do a full or partial table scan

• Only one column (the ‘row key’) is indexed

– Based on a key/value store: [rowkey, column family, column qualifier, timestamp] -> cell value

• [TheRealMT, info, password, 1329088818321] -> abc123

• [TheRealMT, info, password, 13290888321289] -> newpass123
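The password cells above map directly onto HBase shell operations; a minimal sketch using the slide’s own row key and column family (the table name is illustrative):

$ hbase shell
hbase> create 'users', 'info'                               # table with one column family
hbase> put 'users', 'TheRealMT', 'info:password', 'abc123'
hbase> put 'users', 'TheRealMT', 'info:password', 'newpass123'  # stored as a newer timestamped version
hbase> get 'users', 'TheRealMT'                             # returns the latest version
hbase> scan 'users'                                         # full table scan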

Page 30:

HBASE

HBase:
– Indexed by [rowkey + column qualifier + timestamp]

• HBase is not a relational database:
– No SQL query language (GET/PUT/SCAN instead)

– No joins, no secondary indexing, no transactions

– A table is split into regions

– Regions are served by RegionServers

– RegionServers are Java processes running on the DataNodes

– Two special catalog tables: -ROOT- and .META.

– Writes go to a MemStore and are persisted as HFiles

– Every MemStore flush creates one HFile per column family

– Major/minor compactions consolidate HFiles, reducing their number

Page 31:

DATA HAS CHANGED

Page 32:

HADOOP USE CASES:

• What do we know today?

• We love to be connected and to collaborate

• We love to share emotions, likes, and dislikes

• Digital marketing is focused on social media

• We want more insight across collections of data

• We need to store and analyse all sorts of data

• Real-time recommendation engines

• Predictive modelling with data science

Page 33:

COMMON HADOOP USE CASES

• Financial Services –

– Consumer & market risk modelling

– Personalization & recommendations

– Fraud detection & anti-money laundering

– Portfolio valuations

Page 34:

COMMON HADOOP USE CASES

• Government –

– Cyber security & fraud detection,

– Geospatial image & video processing

Page 35:

COMMON HADOOP USE CASES

• Media & Entertainment –

– Search & recommendation optimization,

– User engagement & digital content analysis,

– Ad/offer targeting,

– Sentiment & social media analysis

Page 36:

HADOOP USE CASES: DATA STORES

OLTP database

• For user-facing transactions; retains records

Extract-Transform-Load (ETL)

• Periodic ETL (e.g., nightly) extracts records from the source

• Transform: clean data, check integrity, aggregate, etc.

• Load into the OLAP database

OLAP database for Data Warehousing (DW)

• Business Intelligence: reporting, ad hoc queries, data mining

Page 37:

HADOOP USE CASES: REPLACE DW ?

Reporting is often a nightly task

• ETL is often slow and runs after the day ends

• What happens if processing 24 hours of data takes longer than 24 hours?

Hadoop is a perfect fit:

• Most likely, you already have some DW

• Ingest is limited only by the speed of HDFS

• Scales out with more nodes

• Massively parallel

• Ability to use any processing tool

• Much cheaper than parallel databases

• ETL is a batch process anyway!

Page 38:

CLOUDERA DISTRIBUTION HADOOP 4.1

Cloudera Enterprise Subscription Options:

• Cloudera Enterprise Core

• Cloudera Enterprise RTD (Real-Time Delivery)

• Cloudera Enterprise RTQ (Real-Time Query)

Page 39:

WHERE TO FROM HERE?

Understand use cases → Build a business case → Design a solution → Deploy Hadoop infrastructure → Confirm data sources → Use Hadoop to answer questions

Page 40:

CONTACT TRIFORCE

• Call 1300 664 667

• Email: [email protected]

• View our Big Data Resources page at www.triforce.com.au

• Follow us on LinkedIn: http://www.linkedin.com/company/triforce-australia