© Orzota, Inc. 2013
Big Data, Hadoop, NoSQL and more …
Varad Meru
Software Development Engineer, Orzota, Inc.
[email protected]/in/vmeru
@vrdmr
About Orzota
• Mission: Make big data easy for consumption
• Offers Big Data/Hadoop Solutions and Software Services to companies
• Develops Software to help companies consume Big Data
• Founded in March 2012
• Headquartered in Silicon Valley, California
• Offshore offices in Chennai, India
About Orzota (contd.)
We work on
• Big Data
• Hadoop
• Cloud Technologies
• Data Science
• Products and Services
• Everything that it takes to be a valued player.
About Orzota (contd.)
Community Development
• Occasional seminars by Architects, Engineers, Managers.
• We invite professionals and aspiring professionals to join Big Data / Hadoop communities in their geographies.
• Pune Hadoop User Group – Participant + Organizer.
• Chennai Hadoop User Group – Participant + Sponsor.
About Me
• Orzota, Inc.
• Currently working with Hadoop, Mahout, Cloud, etc.
• Past Work Experience
• Persistent Systems – Search, Recommendation Engines and User Behavior Analytics.
• Area of Interest
• Data Science, Information Retrieval
• Distributed Systems
Some of the Innovation Centers in the Technological World
Agenda
• Introduction to Big Data
  • Technologies and Domain
• Hadoop Eco-System
  • Introduction to MapReduce
  • Architecture – HDFS + MapReduce.
• NoSQL Databases
  • CAP Theorem
  • Different NoSQL Databases
• Other Trends
Big Data
• What is Big Data?
• What does it mean to me?
• Why so much fuss in the industry?
• Who uses these technologies?
• How are they used in the Industry and Academia?
• When to start using them?
• How to learn them?
Big Data
Big Data – 3 Vs
• Volume - Amassing terabytes, even petabytes, of information.
  • 12 terabytes of Tweets created each day.
  • 350 billion annual meter readings.
• Velocity - Sometimes 2 minutes is too late.
  • Scrutinize 5 million trade events.
  • 500 million daily call detail records.
• Variety - Big data is any type of data.
  • 80% data growth in images, video and documents.
“Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”
– Douglas Laney, "The Importance of 'Big Data': A Definition"
Problem
• Store and Process Data for -
  • Search Engines,
  • Recommendation Engines,
  • Fraud Detection,
  • Aadhaar (Govt. of India),
  • Spam Detection, etc.
• Also, in some cases, in real time (e.g. Facebook)
Solutions?
• Classical Solutions
  • Database + Programming Language (Java-Oracle, C#-SQL Server)
  • Data Warehouses – Teradata, Netezza, Microsoft PDW
• Legacy Network Systems
  • Novell
  • CORBA
  • Java RMI – RPC
Problems of the Solutions
• Problems with Classical Solutions
  • CAP Theorem, by Prof. Eric Brewer (Berkeley) – choose any 2 of Consistency, Availability and Partition tolerance.
  • ACID Properties
    • For a small number of transactions, the cumulative overhead is still manageable.
    • For a very large number of transactions – Facebook posts?
  • Very High Licensing Fees.
  • Closed Source – stick with the company’s eco-system.
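The CAP trade-off listed above can be illustrated with a toy model of two replicas and a network partition. This is a heavily simplified sketch, not how any real database implements replication: `Replica`, `write`, and the `"CP"`/`"AP"` mode strings are hypothetical names introduced only for this illustration.

```python
class Replica:
    """One copy of a single stored value (hypothetical toy replica)."""

    def __init__(self):
        self.value = None


def write(replicas, reachable, value, mode):
    """Write `value` to the replicas listed in `reachable`.

    mode "CP": refuse the write unless every replica is reachable.
    mode "AP": accept the write on whichever replicas are reachable.
    Returns True if the write was accepted.
    """
    if mode == "CP" and len(reachable) < len(replicas):
        return False  # stays consistent, but is unavailable during the partition
    for replica in reachable:
        replica.value = value
    return True  # stays available, but replicas may now disagree


a, b = Replica(), Replica()
write([a, b], [a, b], "v1", mode="CP")       # no partition: both replicas get v1

# Network partition: only replica `a` can be reached.
cp_ok = write([a, b], [a], "v2", mode="CP")  # rejected to keep replicas consistent
ap_ok = write([a, b], [a], "v2", mode="AP")  # accepted on `a` only
print(cp_ok, ap_ok, a.value, b.value)        # False True v2 v1
```

Once the partition occurs, the sketch has to give up either availability (the rejected write) or consistency (the two replicas holding different values), which is the choice the bullet describes.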
Solution to the Problems of the Solutions
• Focus on Problem Domain
  • What’s more important for your Solution?
    • Consistency, Availability, or Partition tolerance
  • Which Industries/Companies already face similar Problems?
  • How/Where to Collect Data?
• Technology Fields – Internet Companies
  • Hadoop, NoSQL Datastores
  • Open Source, Free and with Friendly Licenses.
Hadoop Eco-System
Introduction
• Started by Doug Cutting and Mike Cafarella for Nutch, an open-source search engine.
• Further developed at Yahoo! and Facebook, with contributions from people at many companies.
• Named after a Little Toy Elephant owned by Doug’s Son.
• Inspired by 2 research papers from Google
• The Google File System – 2003
• MapReduce – 2004
Introduction (contd.)
• Contains 3 modules
  • Distributed File System
  • MapReduce
  • Common (A Java library containing common functions used by both DFS and MapReduce)
• Apache Top Level Project
  • Hadoop’s Website – hadoop.apache.org
  • Two Parallel Release Cycles – 1.x and 2.x
Introduction (contd.)
• A Rich Eco-System built around Hadoop
  • Hive – Large Scale Data Warehouse
  • HBase – NoSQL Database
  • Pig – A Data-flow language on top of Hadoop
  • Flume – Log Management for Hadoop
  • Oozie – Workflow framework
  • Mahout – Machine Learning Library on top of Hadoop
  • Vaidya – Performance benchmarking framework.
  • MRUnit – Unit testing framework for MapReduce Programs.
  • And many more …
MapReduce in 2 minutes
Problem Statement – Sum of the doubles of a set of numbers.
Input array: 1 3 4 5 6 8 9 11 17 21
The intermediate array after processing: 2 6 8 10 12 16 18 22 34 42
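The problem statement above can be sketched in plain Python, with no Hadoop involved. It is only an illustration: `double` and `add` are hypothetical names standing in for the map function f(x) and the pairwise step of the reduce function.

```python
from functools import reduce

# Input array from the slide.
numbers = [1, 3, 4, 5, 6, 8, 9, 11, 17, 21]


def double(x):
    """Map step: applied independently to every entry."""
    return 2 * x


def add(a, b):
    """Reduce step: combines two partial results into one."""
    return a + b


# The intermediate array after the map step.
intermediate = list(map(double, numbers))

# The reduce step folds the intermediate array into a single value.
total = reduce(add, intermediate, 0)

print(intermediate)  # [2, 6, 8, 10, 12, 16, 18, 22, 34, 42]
print(total)         # 170
```

Because `double` looks at one entry at a time, the map step can run on many machines at once; only the reduce step needs to see all the intermediate values.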
Introduction – contd.
Mapping Phase
• Splitting the input.
• Sending the slaves (datanodes) the mapping code, f(x).
• Applying f(x) to each data split.
Figure: Mapping Phase
• The Master Node: contains the code of the function to be applied to individual entries of the array, written in the map() method in Hadoop.
• The code f(x) is sent to the slave nodes, which apply the logic to each data piece. In our case a data piece is an entry from the array.
• Slave Nodes: hold the entries 1, 1, 9, 8, 6, 11, 4, 3, 17, 21.
Introduction – contd.
Spill Phase
• The master node directs the Mappers to send the processed f(x) output data to an intermediate location.
• Shuffle and Sorting.
Figure: Spill Phase (Shuffle and Sort)
• Slave Nodes: hold the mapped entries 2, 2, 18, 16, 12, 22, 8, 6, 34, 42.
• The Master Node: the results of the processed data from the slave nodes are given to a specific node where the reducer function runs.
Introduction – contd.
Reduce Phase
• The master node (JobTracker) invokes the Reduce task once the spilling is over.
• The reducer gets the location of the spill output from the master node (JobTracker).
Figure: Reducer Phase
• The results of the processed data from the slave nodes are given to a specific node where the reducer function runs.
• The reducer’s output: g(x) = 162.
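The three phases walked through on these slides can be simulated in a single Python process. This is a minimal sketch under stated assumptions: the function names (`split_input`, `map_phase`, `spill_phase`, `reduce_phase`) and the choice of three slaves are hypothetical, and the input entries are taken as 1, 1, 9, 8, 6, 11, 4, 3, 17, 21, an assumed reading of the slide figures that reproduces the g(x) = 162 shown for the reduce phase.

```python
from itertools import chain


def split_input(data, num_slaves):
    """Split the input into contiguous chunks, roughly one per slave."""
    size = max(1, -(-len(data) // num_slaves))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def map_phase(splits, f):
    """Each slave applies f to every entry of its own split."""
    return [[f(x) for x in split] for split in splits]


def spill_phase(mapped):
    """Gather every slave's output into one sorted intermediate list."""
    return sorted(chain.from_iterable(mapped))


def reduce_phase(intermediate, g):
    """A single reducer applies g to the whole intermediate list."""
    return g(intermediate)


data = [1, 1, 9, 8, 6, 11, 4, 3, 17, 21]  # assumed reading of the slide figures
splits = split_input(data, num_slaves=3)
mapped = map_phase(splits, lambda x: 2 * x)
intermediate = spill_phase(mapped)
result = reduce_phase(intermediate, sum)
print(result)  # 162
```

In a real cluster each split lives on a different machine and the intermediate data moves over the network; here the lists only mimic that flow.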
MapReduce Programming
Steps involved in writing a MapReduce program
• Write the Mapper
• Write the Reducer
• Write the Driver
Life is simple until you start customizing and working on data cleansing.
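The Mapper, Reducer and Driver roles can be sketched as three small Python functions. In Hadoop itself these are typically Java classes; the version below is only an in-memory illustration, and `mapper`, `reducer`, `driver` and the `"total"` key are hypothetical names chosen for this sketch. The mapper also drops malformed records, a small taste of the data cleansing mentioned above.

```python
from collections import defaultdict


def mapper(line):
    """Emit (key, value) pairs for one input line; skip non-integer lines."""
    try:
        number = int(line.strip())
    except ValueError:
        return  # data cleansing: drop malformed records
    yield ("total", 2 * number)


def reducer(key, values):
    """Combine all values that share a key into one output pair."""
    return (key, sum(values))


def driver(lines):
    """Wire the job together: run the mapper, group by key, run the reducer."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


print(driver(["1", "3", " 4 ", "n/a", "", "5"]))  # {'total': 26}
```

Keeping the three roles separate means the mapper and reducer can be tested on their own, while the driver holds only the wiring.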
Hadoop – Bird’s Eye View
Figure: a Name Node and a Job Tracker, with many slave nodes each running a DataNode (DN) and a TaskTracker (TT). Legend: DFS Message Path; MapReduce Processing Msg.
NoSQL – Not Only SQL
Introduction
Non-Relational Databases
• Data Model not bound by a Schema.
• No Predetermined Schema, Run-Time Columns
• Sample Data
  • Twitter Streams
  • Web Forms
  • Sensor Networks
Schema-less Systems
Entry 1
{"name": "emp1"}
Entry 2
{"name": "emp2", "e_id": "1", "e_addr": "Cupertino"}
Entry 3
{"name": "emp3", "e_id": "3"}
Entry 4
{"name": "emp4", "e_id": "6", "dob": "03-Sep-1964"}
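The four entries above can be handled in Python as plain dictionaries, which shows what "schema-less" means in practice. This is an illustrative sketch only; the `"<no address>"` default and the added `e_id` value are made up for the example.

```python
# The four entries from the slide, each with its own set of fields.
entries = [
    {"name": "emp1"},
    {"name": "emp2", "e_id": "1", "e_addr": "Cupertino"},
    {"name": "emp3", "e_id": "3"},
    {"name": "emp4", "e_id": "6", "dob": "03-Sep-1964"},
]

# The "schema" is only discoverable at run time: the union of all keys seen.
columns = sorted({key for entry in entries for key in entry})
print(columns)  # ['dob', 'e_addr', 'e_id', 'name']

# Readers must tolerate missing fields instead of relying on fixed columns.
addresses = [entry.get("e_addr", "<no address>") for entry in entries]
print(addresses)

# A new field can be added to one entry without touching the others.
entries[0]["e_id"] = "9"
```

The flexibility comes at a cost: every reader has to decide what a missing field means, a job a relational schema would otherwise do once.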
Business Requirements
• High Writes, Low Reads – Sensor Networks, Large Hadron Collider, Click Logging.
• High Reads, Low Writes – Archival Storage.
• Don’t have any fixed Schema.
Open Question - Where Else?
NoSQL Types
• Key-Value Pair
  • Riak, Voldemort, etc.
• Document Oriented
  • CouchDB, MongoDB, etc.
• BigTable Implementations
  • Cassandra, Hypertable, HBase, etc.
• Graph Oriented
  • Neo4j, etc.
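The four families listed above differ mainly in the shape of the data they store. The sketch below mimics each shape with ordinary Python structures; it is not the API of any of the named products, and every key, field and edge label in it is made up for illustration.

```python
# Key-value: an opaque value looked up by a single key.
key_value = {"session:42": b"serialized-session-bytes"}

# Document: the value is a structured document whose fields can be inspected.
documents = {"emp2": {"name": "emp2", "address": {"city": "Cupertino"}}}

# BigTable-style: row key -> column family -> column -> value.
wide_column = {
    "emp2": {"info": {"name": "emp2"}, "contact": {"city": "Cupertino"}},
}

# Graph: nodes plus labelled edges between them.
edges = [("emp2", "REPORTS_TO", "emp1"), ("emp3", "REPORTS_TO", "emp1")]


def reports_of(manager):
    """Follow REPORTS_TO edges backwards to find a manager's direct reports."""
    return [src for src, label, dst in edges
            if label == "REPORTS_TO" and dst == manager]


print(documents["emp2"]["address"]["city"])    # Cupertino
print(wide_column["emp2"]["contact"]["city"])  # Cupertino
print(reports_of("emp1"))                      # ['emp2', 'emp3']
```

The same fact (an employee in Cupertino) fits several shapes; which family suits a workload depends on how the data will be queried.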
Introduction
Source: http://techcrunch.com/2012/10/27/big-data-right-now-five-trendy-open-source-technologies/
Wake up - Conclusion Time
• BigData on the Rise
• Technology and the Domain
• Smart Engineers needed, with BigData skills
• Chance to develop niche areas of Expertise even before stepping into the Industry
• 3rd Year Students – Select your final year projects very carefully, with the tools mentioned in this Seminar
• 4th Year Students – Equip yourself with the necessary skills for better industry opportunities.
Recommendations
• I recommend aspiring professionals and young professionals read:
• How to Solve it by Computer – R. G. Dromey
• Code Complete 2 – Steve McConnell
• Advanced Programming in the Unix Environment – W. Richard Stevens
• Many Books on Hadoop, NoSQL Datastores, and Big Data in general.
… and many more
Questions?
Thank You
Contact Us at –
Linkedin.com/company/orzota-inc-
Twitter.com/orzota