Introduction to Big Data, Hadoop and Spark
Definitions
Big data refers to large volumes of data – both structured and unstructured – that inundate a business on a day-to-day basis. But it is not the amount of data that is important; it is what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves. Hadoop and Spark are programming and processing technologies for Big Data.
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.
Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics.
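As a taste of that ease of use, below is a minimal word-count sketch in PySpark. This is a hedged example, assuming a local Spark installation; the input file name "sample.txt" is an illustrative assumption.

# Minimal PySpark word count; assumes a local Spark installation.
# "sample.txt" is an illustrative input file.
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-sketch")
counts = (sc.textFile("sample.txt")               # read the file as an RDD of lines
            .flatMap(lambda line: line.split())   # split each line into words
            .map(lambda word: (word, 1))          # pair every word with a count of 1
            .reduceByKey(lambda a, b: a + b))     # sum the counts per word

for word, count in counts.take(10):
    print(word, count)
sc.stop()

The same job written as classic Java MapReduce typically takes several dozen lines plus a driver class; that contrast is what "ease of use" means here.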
Current Scenario
Traditionally, enterprise applications can be broadly categorized into Operational and Decision Support systems. Lately a new set of applications, such as Customer Analytics, is gaining momentum (e.g., a YouTube channel tailored to different categories of users).
Current Scenario – Architecture (Typical Enterprise Application)
[Diagram: multiple clients (browsers) connect to load-balanced app servers, which in turn connect to a database.]
Current Scenario – Architecture
Recent trends:
- Standardization and consolidation of hardware (servers, storage, network) to cut down costs
- Storage is physically separated from servers and connected with high-speed fiber optics
Current Scenario – Architecture
[Diagram: typical database architecture in an enterprise – database servers connected through network switches to a shared storage cluster.]
[Diagram: Oracle architecture – database servers connected to storage through redundant network switches (interconnect).]
Current Scenario – Architecture
Databases:
- Databases are clustered (e.g., Oracle RAC)
- High availability
- Fault tolerance
- Load balancing
- Scalable (but not linearly)
Common network storage:
- File abstraction – a file can be of any size
- Fault tolerance (using RAID)
Current Scenario – Architecture
Almost all of these applications follow a similar n-tier architecture:
- Core applications (operational)
- EAI (Enterprise Application Integration)
- CRM
- ERP
- DW/BI tools such as Informatica, Cognos, Business Objects, etc.
However, there are exceptions – legacy (mainframe-based) applications that use a closed architecture.
Current Scenario – Architecture
[Diagram: bird's-eye view after standardization and consolidation using cloud architecture – application servers, database servers, and storage servers.]
Current Scenario – Challenges
Almost all operational systems use relational databases (RDBMS such as Oracle). RDBMS were originally designed for operational and transactional workloads:
- Transactions
- Data integrity
But they come with drawbacks:
- Not linearly scalable
- Expensive
- Predefined schema
- Data processing does not happen where the data is stored (the storage layer):
  - Some processing happens at the database server level (SQL)
  - Some processing happens at the application server level (Java/.NET)
  - Some processing happens at the client/browser level (JavaScript)
Evolution of Databases
• Relational databases (Oracle, Informix, Sybase, MySQL, etc.)
• NoSQL databases (Cassandra, HBase, MongoDB, etc.)
• In-memory databases (GemFire, Coherence, etc.)
• Search-based databases (Elasticsearch, Solr, etc.)
• Batch processing frameworks (MapReduce, Spark, etc.)
* Modern applications need to be polyglot – different modules need different categories of databases.
Big Data Ecosystem – Advantages
- Distributed storage
- Fault tolerance (RAID is replaced by replication)
- Distributed computing/processing
- Data locality (code goes to the data – see the sketch after this list)
- Scalability (almost linear)
- Low-cost (commodity) hardware
- Low licensing costs
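To make "code goes to the data" concrete, here is a minimal word-count sketch as a pair of Hadoop Streaming scripts in Python: the framework ships these small scripts to the nodes holding the HDFS input blocks instead of moving the data to a central server. File names and paths are illustrative assumptions.

#!/usr/bin/env python
# mapper.py - Hadoop Streaming runs this on the nodes that hold the input
# blocks; each instance reads only its local split from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        # Emit "word<TAB>1"; the framework sorts by key before reducing.
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py - input arrives sorted by key, so each word's counts are contiguous.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, int(value)
if current_word is not None:
    print("%s\t%d" % (current_word, count))

A typical (illustrative) submission looks like: hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out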
Hadoop Ecosystem
- Evolution of the Hadoop ecosystem
- Use cases that can be addressed using the Hadoop ecosystem
- Hadoop ecosystem tools/landscape

Evolution of the Hadoop Ecosystem
- GFS to HDFS
- Google MapReduce to Hadoop MapReduce
- BigTable to HBase
Use Cases That Can Be Addressed Using the Hadoop Ecosystem
- ETL
- Real-time reporting
- Batch reporting
- Operational (but not transactional) workloads
Hadoop Ecosystem Tools/Landscape
- Operational and real-time data integration: HBase
- ETL: MapReduce, Hive/Pig, Sqoop, etc.
- Reporting: Hive (batch), Impala/Presto (real time)
- Analytics APIs: MapReduce and other frameworks
- Miscellaneous/complementary tools: ZooKeeper (coordination service for masters), Oozie (workflow/scheduler), Chef/Puppet (automation for administrators), vendor-specific management tools (Cloudera Manager, Hortonworks Ambari, etc.)
EDW (Current Architecture)
[Diagram: sources (OLTP systems, closed mainframes, XML, external apps) feed a data integration layer (ETL/real time) into an ODS and Enterprise Data Warehouse; visualization/reporting on top supports decision making.]
Use Case – EDW (Current Architecture)
An Enterprise Data Warehouse is built for enterprise reporting for a selected audience in executive management; the user base viewing the reports is therefore typically in the tens or hundreds.
Data Integration – ODS (Operational Data Store):
- Sources – disparate
- Real time – tools/custom (GoldenGate, SharePlex, etc.)
- Batch – tools/custom
- Uses – compliance, data lineage, reports, etc.
Enterprise Data Warehouse:
- Sources – ODS or other sources
- ETL – tools/custom (Informatica, Ab Initio, Talend)
Reporting/Visualization:
- ODS (compliance-related reporting)
- Enterprise Data Warehouse
- Tools (Cognos, Business Objects, MicroStrategy, Tableau, etc.)
EDW (Big Data Ecosystem)
[Diagram: the same sources (OLTP, closed mainframes, XML, external apps) feed a Hadoop cluster of nodes (EDW/ODS) via real-time/batch ingestion (no separate ETL layer); ETL runs on the cluster, with a reporting database and visualization/reporting on top for decision support.]
Hadoop Ecosystem
[Diagram: Hadoop core components (Distributed File System (HDFS), MapReduce) surrounded by ecosystem tools (Hive, Pig, Flume, Sqoop, Oozie, Mahout) and non-MapReduce components (Impala, Presto, HBase, Spark).]
Hadoop Ecosystem
Hadoop core components: Distributed File System (HDFS), MapReduce
- Hive – T and L (transform and load); batch reporting
- Impala (non-MapReduce) – interactive/ad-hoc reporting
- Sqoop – E and L (extract and load)
- Oozie – workflows
Hadoop Ecosystem
- Custom MapReduce – E, T and L (extract, transform, and load)
- HBase – real-time data integration or reporting (see the sketch below)
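As a hedged sketch of HBase's real-time role, the example below uses the third-party happybase Python client, which talks to HBase through its Thrift gateway. The host, table, row keys, and column names are illustrative assumptions (the table is assumed to exist with a column family named cf).

# Real-time writes and reads against HBase from Python via the Thrift gateway.
# Host, table, and column names below are illustrative assumptions.
import happybase

connection = happybase.Connection('hbase-thrift-host', port=9090)
table = connection.table('customer_events')

# Real-time write: one row keyed by customer id and timestamp.
table.put(b'cust42|2015-06-01T10:00:00', {
    b'cf:event': b'page_view',
    b'cf:url': b'/products/123',
})

# Real-time read: fetch a single row by key.
row = table.row(b'cust42|2015-06-01T10:00:00')
print(row[b'cf:event'])

# Range scan over one customer's events via the row-key prefix.
for key, data in table.scan(row_prefix=b'cust42|'):
    print(key, data[b'cf:event'])

connection.close()

Key design (a customer id plus timestamp) is what makes both the single-row lookup and the prefix scan fast, since HBase stores rows sorted by key.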
Disadvantages of MapReduce
Disadvantages of MapReduce-based solutions:
- Designed for batch; not meant for interactive and ad-hoc reporting
- I/O bound; processing of micro-batches can be an issue
- Too many tools/technologies (MapReduce, Hive, Pig, Sqoop, Flume, etc.) needed to build applications
- Not well suited to enterprise hardware where storage is typically network mounted
Apache Spark
- Spark can work with any file system, including HDFS
- Processing is done in memory, so I/O is minimized
- Suitable for ad-hoc or interactive querying and reporting
- Streaming jobs can be done much faster than with MapReduce
- Applications can be developed using Scala, Python, Java, etc.
- Choose one programming language and perform:
  - Data integration from an RDBMS using JDBC (no need for Sqoop)
  - Streaming using Spark Streaming
  - Transformations with DataFrames and SQL embedded in the programming language (see the sketch after this list)
- As processing is done in memory, Spark works well with enterprise hardware using a network file system
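Below is a minimal PySpark sketch of the pattern described above: pull a table from an RDBMS over JDBC (no Sqoop), transform it with DataFrames and embedded SQL, and write the result out. It assumes the Spark 2.x SparkSession API; the JDBC URL, credentials, table, and column names are illustrative assumptions.

# E/T/L in one language with Spark, as sketched in the bullets above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("edw-etl-sketch").getOrCreate()

# E and L: read straight from the source database over JDBC (no Sqoop).
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")  # illustrative
          .option("dbtable", "sales.orders")                      # illustrative
          .option("user", "etl_user")
          .option("password", "secret")
          .load())

# T: SQL embedded in the program, executed in memory on the cluster.
orders.createOrReplaceTempView("orders")
daily = spark.sql(
    "SELECT order_date, SUM(amount) AS total_amount "
    "FROM orders GROUP BY order_date")

# L: write the transformed result to HDFS (or any file system) for reporting.
daily.write.mode("overwrite").parquet("hdfs:///edw/daily_order_totals")
spark.stop()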
EDW (Big Data Ecosystem – Spark)
[Diagram: as before, sources (OLTP, closed mainframes, XML, external apps) feed the Hadoop cluster of nodes (EDW/ODS) via real-time/batch ingestion (no separate ETL layer), with Spark handling ETL on the cluster and visualization/reporting on top for decision support.]
Role of Apache
• Each of these is a separate project incubated under the Apache Software Foundation:
– HDFS and MapReduce/YARN
– Hive
– Pig
– Sqoop
– HBase
– Etc.
Installation (Plain Vanilla)
In plain vanilla mode, depending on the architecture, each tool/technology needs to be manually downloaded, installed, and configured. Typically, people use Puppet or Chef to set up clusters from the plain vanilla tools.
Advantages:
- You can set up your cluster with the latest versions directly from Apache
Disadvantages:
- Installation is tedious and error prone
- You need to integrate with monitoring tools yourself
Hadoop Distributions
Different vendors pre-package the Apache suite of big data tools into their own distributions to facilitate:
- Easier installation/upgrades using wizards
- Better monitoring
- Easier maintenance, and more
Leading distributions include, but are not limited to:
- Cloudera
- Hortonworks
- MapR
- AWS EMR
- IBM BigInsights
Hadoop Distributions
[Diagram: Apache Foundation components (HDFS/YARN/MR, Hive, Pig, Sqoop, Flume, Spark, HBase, ZooKeeper, Impala, Tez, Ganglia) as packaged by the Cloudera, Hortonworks, MapR, and AWS distributions.]
Be a Big Data Expert with SpringPeople
- Administrator: Apache Hadoop + Hadoop Administration
- Developer: Apache Hadoop + Apache Spark with Scala
- Data Scientist: Apache Hadoop + Analytics with R / Machine Learning
Become a Big Data Expert in 10 days.
World-class training by Certified Subject Matter Experts.
Big Data Bundled Training
Become a Big Data Expert in 8 days.
World-class training by Certified Subject Matter Experts.
Suggested Audience & Other Details
- Suggested audience: Developers & Architects
- Duration: 8 days
- Prerequisites: Familiarity with Linux/Unix & Hadoop, and some database experience
Become a Hadoop Guru
Become an overall Hadoop Expert in 5 days.
World-class training by Certified Subject Matter Experts.
Suggested Audience & Other Details
- Suggested audience: Developers & Architects
- Duration: 5 days
- Prerequisites: Familiarity with Linux/Unix and some database experience
How To Become A Big Data Analyst?
Join the bundled Big Data training program.
World-class training by Certified Subject Matter Experts.
Suggested Audience & Other Details
- Suggested audience: Developers & Architects
- Duration: 10 days
- Prerequisites: Hands-on experience in Java and some database experience
Get Certified & #BeTheExpert
Our Certified Partners
For further info/assistance contact: [email protected]
+91 80 6567 9700 | www.springpeople.com