An Introduction to Big Data: A Beginner's Overview and Tutorial


Big Data: An Introduction to Big Data, Hadoop and Spark

DEFINITIONS

By definition, big data refers to large volumes of data – both structured and unstructured – that inundate a business on a day-to-day basis. But it's not the amount of data that's important; it's what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.

Hadoop and Spark are programming and processing technologies for big data.

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.

Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics.
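To make "ease of use" concrete, here is a minimal word count using Spark's RDD API in Scala. This is a sketch, not from the original deck: the application name and the input path "input.txt" are illustrative placeholders.

import org.apache.spark.sql.SparkSession

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("word-count").getOrCreate()

    // "input.txt" is a hypothetical path; any local or HDFS file works.
    val counts = spark.sparkContext.textFile("input.txt")
      .flatMap(_.split("\\s+")) // split each line into words
      .map(word => (word, 1))   // pair every word with a count of 1
      .reduceByKey(_ + _)       // sum the counts per word, in parallel

    counts.take(10).foreach(println)
    spark.stop()
  }
}

Packaged as a jar, this would be launched on a cluster with spark-submit; the same logic takes several classes in classic MapReduce, as shown later in the deck.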

Current Scenario

Traditionally, enterprise applications can be broadly categorized into Operational and Decision Support systems.

Lately, a new set of applications such as Customer Analytics is gaining momentum (e.g., a YouTube channel for different categories of users).

Current Scenario – Architecture (Typical Enterprise Application)

[Diagram: multiple clients (browsers) connect to app servers, which in turn connect to a database.]

Current Scenario - Architecture

Recent trends Standardization and consolidation of hardware (servers,

storage, network) etc., to cut down the costs Storage is physically separated from servers and connected

with high speed fiber optics

Current Scenario - Architecture

[Diagram: database servers connected through redundant network switches to a storage cluster. *Typical database architecture in an enterprise]

Oracle Architecture

[Diagram: database servers connected by network switches (interconnect) to shared storage.]

Current Scenario - Architecture

Databases are clustered (Oracle – RAC):
• High availability
• Fault tolerance
• Load balancing
• Scalable (not linear)

Common network storage:
• File abstraction – a file can be of any size
• Fault tolerance (using RAID)

Current Scenario - Architecture

Almost all these applications follow a similar n-tier architecture:
• Core applications (operational)
• EAI (Enterprise Application Integration)
• CRM
• ERP
• DW/BI tools like Informatica, Cognos, Business Objects, etc.

However, there are exceptions – legacy (mainframe-based) applications which use a closed architecture.

Current Scenario - Architecture

[Diagram: application servers, database servers, and storage servers. *Bird's-eye view – after standardization and consolidation using cloud architecture]

Current Scenario - Challenges

Almost all operational systems use relational databases (RDBMS such as Oracle). RDBMS were originally designed for operational and transactional workloads:
• Not linearly scalable
• Transactions
• Data integrity
• Expensive
• Predefined schema

Data processing does not happen where the data is stored (the storage layer):
• Some processing happens at the database server level (SQL)
• Some processing happens at the application server level (Java/.NET)
• Some processing happens at the client/browser level (JavaScript)

Evolution of Databases
• Relational databases (Oracle, Informix, Sybase, MySQL, etc.)
• NoSQL databases (Cassandra, HBase, MongoDB, etc.)
• In-memory databases (GemFire, Coherence, etc.)
• Search-based databases (Elasticsearch, Solr, etc.)
• Batch processing frameworks (MapReduce, Spark, etc.)

* Modern applications need to be polyglot (different modules need different categories of databases)

Big Data eco system – Advantages
• Distributed storage
• Fault tolerance (RAID is replaced by replication; see the HDFS sketch below)
• Distributed computing/processing
• Data locality (code goes to the data)
• Scalability (almost linear)
• Low-cost (commodity) hardware
• Low licensing costs
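As a concrete illustration of replication replacing RAID, below is a small sketch against the HDFS FileSystem API in Scala. The NameNode address hdfs://namenode:8020 and the /tmp path are assumptions for the example, not values from the deck.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsReplicationDemo {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Hypothetical NameNode address; substitute your cluster's fs.defaultFS.
    conf.set("fs.defaultFS", "hdfs://namenode:8020")
    val fs = FileSystem.get(conf)

    // Write a small file; HDFS splits larger files into blocks across DataNodes.
    val path = new Path("/tmp/replication-demo.txt")
    val out = fs.create(path)
    out.writeBytes("fault tolerance via replication, not RAID\n")
    out.close()

    // Ask HDFS to keep 3 copies of every block of this file.
    fs.setReplication(path, 3.toShort)
    println(s"replication factor: ${fs.getFileStatus(path).getReplication}")

    fs.close()
  }
}

Because each block lives on several DataNodes, losing a disk or a node does not lose data; the cluster simply re-replicates the affected blocks.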

Hadoop eco system
• Evolution of the Hadoop eco system
• Use cases that can be addressed using the Hadoop eco system
• Hadoop eco system tools/landscape

Evolution of Hadoop eco system
• GFS to HDFS
• Google MapReduce to Hadoop MapReduce
• Bigtable to HBase

Use cases that can be addressed using the Hadoop eco system
• ETL
• Real-time reporting
• Batch reporting
• Operational, but not transactional, workloads

Hadoop eco system tools/landscape
• Operational and real-time data integration: HBase
• ETL: MapReduce, Hive/Pig, Sqoop, etc.
• Reporting: Hive (batch; see the sketch below), Impala/Presto (real time)
• Analytics: API, MapReduce, other frameworks
• Miscellaneous/complementary tools: ZooKeeper (coordination service for masters), Oozie (workflow/scheduler), Chef/Puppet (automation for administrators), and vendor-specific management tools (Cloudera Manager, Hortonworks Ambari, etc.)
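Since the deck pairs Hadoop with Spark and Scala, here is a sketch of a Hive-style batch report run through Spark's Hive integration rather than the hive/beeline CLI. It assumes a Hive metastore is configured on the cluster (hive-site.xml on the classpath); the web_logs table and its status column are hypothetical.

import org.apache.spark.sql.SparkSession

object BatchReport {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() lets Spark query tables registered in the Hive metastore.
    val spark = SparkSession.builder()
      .appName("batch-report")
      .enableHiveSupport()
      .getOrCreate()

    // `web_logs` is a hypothetical Hive table with a `status` column.
    spark.sql(
      """SELECT status, COUNT(*) AS hits
        |FROM web_logs
        |GROUP BY status
        |ORDER BY hits DESC""".stripMargin
    ).show()

    spark.stop()
  }
}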

EDW (Current Architecture)

[Diagram: sources – OLTP systems, closed mainframes, and XML/external apps – feed Data Integration (ETL/real time) into an ODS and EDW (data warehouse), which drive visualization/reporting for decision support.]

Use Case – EDW (Current Architecture)

An Enterprise Data Warehouse is built for enterprise reporting for a selected audience in executive management; hence the user base who view the reports will typically number in the tens or hundreds.

Data Integration – ODS (Operational Data Store):
• Sources – disparate
• Real time – tools/custom (GoldenGate, SharePlex, etc.)
• Batch – tools/custom
• Uses – compliance, data lineage, reports, etc.

Enterprise Data Warehouse:
• Sources – ODS or other sources
• ETL – tools/custom (Informatica, Ab Initio, Talend)

Reporting/Visualization:
• ODS (compliance-related reporting)
• Enterprise Data Warehouse
• Tools (Cognos, Business Objects, MicroStrategy, Tableau, etc.)

EDW (Big Data eco system)

[Diagram: the same sources – OLTP, closed mainframes, XML/external apps – are loaded in real time/batch (no ETL) into a multi-node Hadoop cluster (EDW/ODS); ETL output feeds a reporting database for visualization/reporting and decision support.]

Hadoop eco system

[Diagram: Hadoop core components surrounded by ecosystem tools – Hive, Pig, Flume, Sqoop, Oozie, Mahout – plus non-MapReduce components.]

Hadoop eco system

[Diagram: Hadoop components – the Distributed File System (HDFS) and MapReduce – alongside non-MapReduce engines: Impala, Presto, HBase, Spark.]

Hadoop eco system

Hadoop core components:
• Distributed File System (HDFS)
• MapReduce

Components and their roles:
• Hive – T and L (transform and load); batch reporting
• Impala (non-MapReduce) – interactive/ad hoc reporting
• Sqoop – E and L (extract and load)
• Oozie – workflows
• Custom MapReduce – E, T and L
• HBase – real-time data integration or reporting

Disadvantages of MapReduce

Disadvantages of MapReduce-based solutions:
• Designed for batch; not meant for interactive and ad hoc reporting
• I/O bound, and processing of micro batches can be an issue
• Too many tools/technologies (MapReduce, Hive, Pig, Sqoop, Flume, etc.) to build applications (see the word-count sketch below for a sense of the verbosity)
• Not suitable for enterprise hardware where storage is typically network mounted
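To make the verbosity point concrete, here is the classic word count as a Hadoop MapReduce job, sketched in Scala to stay consistent with the other examples (Scala 2.13 assumed for CollectionConverters). Compare it with the few-line Spark version shown earlier.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import scala.jdk.CollectionConverters._

// Map phase: emit (word, 1) for every token in the input line.
class TokenizerMapper extends Mapper[Object, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()
  override def map(key: Object, value: Text,
                   context: Mapper[Object, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      context.write(word, one)
    }
}

// Reduce phase: sum the counts emitted for each word.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit =
    context.write(key, new IntWritable(values.asScala.map(_.get).sum))
}

// Driver: wire the job together and submit it to the cluster.
object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(classOf[TokenizerMapper])
    job.setMapperClass(classOf[TokenizerMapper])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}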

Apache Spark
• Spark can work with any file system, including HDFS
• Processing is done in memory – hence I/O is minimized
• Suitable for ad hoc or interactive querying or reporting
• Streaming jobs can be done much faster than with MapReduce
• Applications can be developed using Scala, Python, Java, etc. Choose one programming language and:
– Perform data integration from an RDBMS using JDBC (no need for Sqoop; see the sketch below)
– Stream data using Spark Streaming
– Leverage data frames and SQL embedded in the programming language
• As processing is done in memory, Spark works well with enterprise hardware using network file systems
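A sketch of the "JDBC instead of Sqoop" point, again in Scala. The MySQL URL, credentials, table name, and output path below are all illustrative assumptions, and the matching JDBC driver must be on the classpath (e.g., passed via --jars to spark-submit).

import org.apache.spark.sql.SparkSession

object JdbcToWarehouse {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("edw-load").getOrCreate()

    // Pull a table straight from the RDBMS over JDBC -- no separate Sqoop job.
    val orders = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/sales") // placeholder URL
      .option("dbtable", "orders")                     // placeholder table
      .option("user", "etl")
      .option("password", "secret")
      .load()

    // SQL embedded in the program: aggregate in memory, then land the result.
    orders.createOrReplaceTempView("orders")
    val daily = spark.sql(
      "SELECT order_date, SUM(amount) AS total FROM orders GROUP BY order_date")

    daily.write.mode("overwrite").parquet("hdfs:///warehouse/daily_totals")
    spark.stop()
  }
}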

EDW (Big Data eco system - Spark)

[Diagram: the same sources – OLTP, closed mainframes, XML/external apps – are loaded by Spark in real time/batch (no ETL) into a multi-node Hadoop cluster (EDW/ODS), which feeds visualization/reporting for decision support.]

Role of Apache
• Each of these is a separate project incubated under Apache:
– HDFS and MapReduce/YARN
– Hive
– Pig
– Sqoop
– HBase
– Etc.

Installation (plain vanilla)

In plain vanilla mode, depending upon the architecture, each tool/technology needs to be manually downloaded, installed, and configured. Typically, people use Puppet or Chef to set up clusters from the plain vanilla tools.

Advantages:
• You can set up your cluster with the latest versions from Apache directly

Disadvantages:
• Installation is tedious and error prone
• Need to integrate with monitoring tools

Hadoop Distributions

Different vendors pre-package the Apache suite of big data tools into their distributions to facilitate:
• Easier installation/upgrade using wizards
• Better monitoring
• Easier maintenance, and many more

Leading distributions include, but are not limited to:
• Cloudera
• Hortonworks
• MapR
• AWS EMR
• IBM BigInsights, and many more

Hadoop Distributions

[Diagram: Apache Foundation projects – HDFS/YARN/MapReduce, Hive, Pig, Sqoop, Impala, Tez, Flume, Spark, Ganglia, HBase, ZooKeeper – as packaged by the Cloudera, Hortonworks, MapR, and AWS distributions.]

Be a Big Data Expert with SpringPeople
• Administrator: Apache Hadoop + Hadoop Administration
• Developer: Apache Hadoop + Apache Spark with Scala
• Data Scientist: Apache Hadoop + Analytics with R / Machine Learning

Become a Big Data Expert in 10 days. World-class training by Certified Subject Matter Experts.

Big Data Bundled Training

Become a Big Data Expert in 8 days. World-class training by Certified Subject Matter Experts.
• Suggested audience: Developers & Architects
• Duration: 8 days
• Prerequisites: Familiarity with Linux/Unix & Hadoop, and some database experience

Become a Hadoop Guru

Become an overall Hadoop Expert in 5 days. World-class training by Certified Subject Matter Experts.
• Suggested audience: Developers & Architects
• Duration: 5 days
• Prerequisites: Familiarity with Linux/Unix and some database experience

How To Become A Big Data Analyst?

Join the bundled Big Data training program. World-class training by Certified Subject Matter Experts.
• Suggested audience: Developers & Architects
• Duration: 10 days
• Prerequisites: Hands-on experience in Java and some database experience

Get Certified & #BeTheExpert

Our Certified Partners

For further info/assistance contact: training@springpeople.com
+91 80 6567 9700 | www.springpeople.com