2. Big Data: An Overview
Big Data
- High-volume
- High-velocity
- High-variety information assets
- High-veracity
- Requires new forms of processing, such as NoSQL, MapReduce, and machine learning
Examples
- Large Hadron Collider
  - 150 million sensors delivering data 40 million times per second
  - If all sensor data were recorded, the data flow would exceed 150 million petabytes per year, or roughly 500 exabytes per day
- Tipp24 (European lotteries)
  - Analyzes billions of transactions and hundreds of customer attributes
  - Led to a 90% decrease in the time it took to build predictive models
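As a quick sanity check on the LHC figures above, the annual rate can be converted to a daily rate. This is a minimal sketch assuming decimal units (1 exabyte = 1,000 petabytes) and a 365-day year; the function name is made up for illustration. The result lands in the same order of magnitude as the rounded "~500 exabytes per day" quoted on the slide.

```python
# Sanity check of the LHC data-flow figures quoted on the slide.
# Assumptions: decimal units (1 EB = 1,000 PB) and a 365-day year.
PB_PER_EB = 1_000
DAYS_PER_YEAR = 365


def annual_pb_to_daily_eb(petabytes_per_year: float) -> float:
    """Convert an annual data volume in petabytes to exabytes per day."""
    return petabytes_per_year / DAYS_PER_YEAR / PB_PER_EB


# 150 million petabytes per year, as stated on the slide.
daily_eb = annual_pb_to_daily_eb(150e6)
print(f"{daily_eb:.0f} EB/day")
```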
3. DATA: ON A BIG SCALE
4. Hadoop: Elephant in the Room
Apache Hadoop
- Open-source, Java-based software framework
- Distributed processing of large data sets
- Runs on clusters of computers built from commodity hardware
Hadoop's Benefits (Historical context)
- Don't rely on hardware to provide HA (Big Iron)
- Failures are expected and assumed
- Framework handles failures to provide an HA computing service
- Scale up vs. scale out
Key Components
- Hadoop Distributed File System (HDFS): the file system
- Hadoop MapReduce: the programming model
- Hadoop (v2) YARN: the resource manager
Year  Activity
2002  Nutch started
2003  GFS white paper published
2004  Google MapReduce white paper
2005  First MR implementation
2006  Hadoop project in Apache
2008  Hadoop in Y! production
2009  Wins 500GB sort contest
5. What's the Hadoop Arch., Kenneth? (1/2)
6. What's the Hadoop Arch., Kenneth? (2/2)
7. Hadoop: FAQs
What is a MapReduce job and why do I care?
- The data-processing paradigm in Hadoop
- Batch mode or real time
- In Java or in a variety of other languages (see below)
- Higher-level frameworks such as Pig and Hive help too
I don't drink Java anymore. What do I do?
- Hadoop is Java-based, but Hadoop Streaming supports Python, Ruby, R, etc.
- I/O-bound jobs: no difference
- CPU-bound jobs: Java is better
What is Hadoop 2 and how will it affect my big data needs? (See slide #14)
- Much more scalable
- Programming models vs. cluster and resource management
Under what scenarios should I not use Hadoop?
- You need answers in a hurry
- Queries are complex and need optimization
- You require random, interactive access to data
- You store sensitive data
- You are replacing a data warehouse
What are the differences between Hadoop and a traditional database?
- Hadoop is not a DB
- ACID properties
- Unstructured data / mixture of data sources
- SQL access
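To make the MapReduce paradigm from the first question concrete, here is a minimal in-memory sketch of its three phases (map, shuffle, reduce) applied to word counting. It is not Hadoop code: the function names and sample lines are assumptions for illustration, and a real cluster would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce sequentially in one process."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input.
    sample = ["big data big cluster", "data on a big scale"]
    print(run_job(sample))
```

The map and reduce functions hold no shared state, which is what lets a framework distribute them and rerun any piece that fails.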
8. Hadoop Stack: Snapshot
Technology   Domain                     Description
HDFS         File Storage               Java-based file storage; reliable and scalable access
MapReduce    Programming Framework      Original framework for distributed processing of data
Hadoop YARN  Resource Mgmt              Next-generation framework; MR and non-MR models
Pig          ETL / Data Flow            Allows high-level analysis of large data; generates MR
Hive         SQL Interface              DW; allows data summarization and ad-hoc queries
HBase        Columnar NoSQL Storage     Column-oriented NoSQL data storage system
Sqoop        Data Exchange              Easy data import/export to and from Hadoop clusters
ZooKeeper    Process Coordination       Highly available system for process coordination
Oozie        Workflow Scheduler         Helps manage complex DAG job workflows
Ambari       Cluster Monitoring         Installation, admin & monitoring for Hadoop clusters
Avro         Serializer                 Serializes data in an efficient binary format; schemas are defined in JSON
Spark        Real-time data processing  Powerful processing engine: speed, ease of use, and sophisticated analytics (using ML)
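The Avro row is easier to picture with a schema in hand: Avro schemas are written in JSON, while the records themselves are stored in binary. The sketch below only parses a schema with Python's standard `json` module; the record name, namespace, and fields are hypothetical, and actually serializing data would need an Avro library that is not shown here.

```python
import json

# Hypothetical Avro record schema; the name, namespace, and fields are
# assumptions for illustration, not taken from the demo data.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "StockDividend",
  "namespace": "example.demo",
  "fields": [
    {"name": "symbol", "type": "string"},
    {"name": "date", "type": "string"},
    {"name": "dividend", "type": "double"}
  ]
}
"""

# The schema is plain JSON, so the standard library can read it.
schema = json.loads(SCHEMA_JSON)
field_names = [field["name"] for field in schema["fields"]]
print(schema["name"], field_names)
```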
9. Data Science: The Scoop
What is Data Science, or a Data Scientist?
- To understand data, to process it, to extract value from it, to visualize it, to communicate it
- Single source vs. disparate sources
- Mine data for insight to extract business/competitive value
What is Machine Learning then?
- The science of getting computers to act without being explicitly programmed.
- Machine learning and statistics may be the stars, but DS orchestrates the whole show.
Practical Uses
- Product recommendation
- Medical diagnosis
- Stock trading
- Face detection
10. Demo: Let's get dirty!
Hadoop running on a single-node pseudo-cluster (Linux VM)
- Start Hadoop
HelloWorld, Hadoop style
- Run a MapReduce job (wordcount)
No Java here
- Use Python scripts to run a MapReduce job
Lipstick on a Pig
- Perform ETL on some stocks/dividend data
Give me Hive
- Calculate top batter scores
Can you feel the HBase
- Dump sales data into HBase and then access it via Hive
Use AWS to show a real cluster
- Connect to AWS and start up the cluster
- Demo performance using the wordcount example
* All demos, installation guide, and references available @ GitHub
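The "No Java here" demo relies on Hadoop Streaming, which passes records to external scripts over standard input and output as tab-separated lines. The following is a minimal sketch of what such a wordcount script might look like, not the demo's actual code from GitHub: the single-file layout and the `map`/`reduce` command-line argument are assumptions made here so that both roles fit in one example.

```python
import sys
from itertools import groupby
from typing import Iterable, Iterator, List


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one tab-separated 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


def reducer(sorted_records: Iterable[str]) -> Iterator[str]:
    """Sum counts per word; the input must already be sorted by key."""
    parsed = (
        record.rstrip("\n").split("\t", 1)
        for record in sorted_records
        if record.strip()  # skip blank lines
    )
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def main(argv: List[str]) -> None:
    # Hypothetical CLI: the first argument selects the role of this script.
    roles = {"map": mapper, "reduce": reducer}
    role = argv[1] if len(argv) > 1 else ""
    if role in roles:
        for record in roles[role](sys.stdin):
            print(record)
    else:
        print("usage: wordcount_streaming.py map|reduce", file=sys.stderr)


if __name__ == "__main__":
    main(sys.argv)
```

Saved under a hypothetical name such as `wordcount_streaming.py`, the script can be tried locally without a cluster by piping a text file through the map role, then `sort`, then the reduce role. The `sort` step stands in for the shuffle that Hadoop performs between the two phases.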