Big Data and NoSQL BUS 782. What is Big Data? Employee-generated data User-generated data...

Big Data and NoSQL
BUS 782

What is Big Data?

• https://www.youtube.com/watch?v=c4BwefH5Ve8
• Employee-generated data
• User-generated data
• Machine-generated data
• Big Data Analytics: 11 Case Histories and Success Stories
  https://www.youtube.com/watch?annotation_id=annotation_3535169775&feature=iv&src_vid=c4BwefH5Ve8&v=t4wtzIuoY0w

Big Data

• Data Size:
  – Gigabyte
  – Terabyte: e.g., a terabyte USB drive
  – Petabyte: Wal-Mart handles more than 1m customer transactions every hour, feeding databases estimated at more than 2.5 petabytes
  – Exabyte: the traffic flowing over the internet amounts to about 700 exabytes annually
  – Zettabyte
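
To make the unit ladder above concrete, here is a small Python sketch. It assumes decimal (SI) units, where each step up is a factor of 1,000; the helper name `to_bytes` is illustrative only.

```python
# Decimal (SI) storage units: each step up is a factor of 1,000.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte",
         "terabyte", "petabyte", "exabyte", "zettabyte"]

def to_bytes(value, unit):
    """Convert a quantity expressed in `unit` to bytes."""
    return value * 1000 ** UNITS.index(unit)

# How many gigabytes fit in one petabyte, and in one zettabyte?
gb_per_pb = to_bytes(1, "petabyte") // to_bytes(1, "gigabyte")
gb_per_zb = to_bytes(1, "zettabyte") // to_bytes(1, "gigabyte")
print(gb_per_pb)  # 1000000
print(gb_per_zb)  # 1000000000000
```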

Big Data: Some Facts

• The world's information is doubling every two years
• The world generated 1.8 ZB of information in 2011
• Cisco predicts that by 2016 global IP traffic will reach 1.3 zettabytes
• There will be 19 billion networked devices by 2016
• 70% of this data is being generated by individuals as opposed to enterprises and organizations

Big Data Sources

• Web sites
• Social media
• Machine generated
• RFID
• Image, video, and audio
• Etc.

Big Data Challenges

• Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.
• "3Vs":
  – Volume: Size >= 30-50 TBs
  – Velocity: Processing speed
  – Variety:
    • Structured: able to fit in a database table
    • Unstructured data

Do Companies care about Data?

• Not really. What they care about are Key Performance Indicators (KPIs)
• Some examples of KPIs are:
  – Revenue
  – Profit
  – Revenue per customer/employee
  – Customer attrition: the loss of clients or customers

• Big Data is only useful if it helps drive KPIs

Big Data to KPIs

Applications

• Text mining: deriving high-quality information from text.
  – Text categorization, text clustering, concept/entity extraction, sentiment analysis, etc.
• Web mining:
  – Web usage mining
  – Web content mining
• Social media mining
  – Salesforce Radian6 Social Marketing Cloud
    http://www.youtube.com/watch?v=EH1dcFh_-I4

Advantages of Relational Databases

• Well-defined database schema
• Flexible query language
• Maintain database consistency in business transactions:
  – Concurrent database processing with multiple users
    • Reading/updating
    • Locking

Transaction ACID Properties

• Atomic
  – Transaction cannot be subdivided
  – All or nothing
• Consistent
  – Constraints hold both before and after the transaction
  – A transaction transforms a database from one consistent state to another consistent state.
• Isolated
  – Transactions execute independently of one another.
  – Database changes are not revealed to users until after the transaction has completed
• Durable
  – Database changes are permanent and must not be lost.
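
The atomic and consistent properties above can be illustrated with Python's built-in `sqlite3` module. This is a minimal sketch, not a production pattern: the `account` table, the `transfer` helper, and the balances are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint is the consistency rule: no negative balances.
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, "
             "balance INTEGER NOT NULL CHECK (balance >= 0))")
conn.execute("INSERT INTO account VALUES (1, 100), (2, 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between two accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, 1, 2, 30)       # both updates are committed together
failed = transfer(conn, 1, 2, 500)  # would overdraw account 1, so both updates are undone
balances = dict(conn.execute("SELECT id, balance FROM account"))
print(ok, failed, balances)  # True False {1: 70, 2: 80}
```

In the failed transfer the credit to account 2 has already been applied when the debit violates the constraint; the rollback removes it again, so no partial transaction is ever visible.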

Problems with relational databases in managing Big Data

• High overhead in maintaining database consistency
• Do not support unstructured data search very well (e.g., Google-type searching)
• Do not handle data in unexpected formats well
• Don't scale well to very large databases:
  – Expensive "scale up": adding processors and storage
  – Slow query response time
  – Data must move to the server
  – Server failure

• Organizations such as Facebook, Yahoo, Google, and Amazon were among the first to decide that relational databases were not good solutions for the volumes and types of data that they were dealing with.

What is needed in new approach

• Deal with data sizes never imagined before.
• Hardware failure should be expected.
• Data has gravity; compute has to move to the data.

What is Hadoop?

• Open source project by the Apache Software Foundation
• Based on papers published by Google
  – Google File System (Oct. 2003)
  – MapReduce (Dec. 2004)
• Consists of two core components
  – Hadoop Distributed File System (Storage)
  – MapReduce (Compute)

How Hadoop fits in the new approach

• Runs on a cluster of low-cost commodity servers, so it can accommodate petabytes of data cost-effectively.
• Embraces partial failures
• Data locality (computation on the local node where the data resides)
• Scales horizontally
  – Scale out
• A Hadoop file is:
  – Distributed: a file is stored across many servers
  – Replicated: a file is replicated with many copies
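
The "distributed" and "replicated" ideas above can be sketched as a toy model in Python. The round-robin placement, the node names, the 128 MB block size, and the replication factor of 3 are assumptions chosen for illustration; this is not Hadoop's actual placement policy.

```python
def place_blocks(file_size_mb, block_size_mb, nodes, replication):
    """Split a file into blocks and assign each block to `replication` distinct nodes.

    Round-robin placement is a simplification; it needs replication <= len(nodes).
    """
    num_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + r) % len(nodes)]
                            for r in range(replication)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]  # hypothetical cluster
layout = place_blocks(file_size_mb=300, block_size_mb=128,
                      nodes=nodes, replication=3)
for block, holders in layout.items():
    print(block, holders)

# If one node fails, every block still has surviving copies elsewhere.
failed_node = "node2"
survivors = {block: [n for n in holders if n != failed_node]
             for block, holders in layout.items()}
```

Because each block lives on several nodes, the loss of a single server removes at most one copy of any block, which is why hardware failure can be treated as routine.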

Hadoop HDFS: Hadoop Distributed File System

• Based on GFS
• Designed to store very large amounts of data (TBs and PBs) and much larger file sizes
• Write-once, read-many-times access pattern
• Designed to run on clusters of commodity hardware and does replication for reliability
• Allows data to be read and processed locally
• Supports limited operations on files: write, delete, append, and read, but no updates

MapReduce: a programming model for distributed processing of data

• Rather than take the conventional step of moving data over a network to be processed by software, MapReduce moves the processing software to the data.
• Each node both stores data and computes, and does its best to process local data.
• MapReduce has two main phases:
  – Map
  – Reduce

Example: Word Count
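
The word count example can be sketched in plain Python as follows. This is a single-process illustration of the map, shuffle, and reduce steps, not Hadoop code; the function names and the input lines are made up.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group the emitted values by key, as the framework does between the two phases."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]  # made-up input
mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(w, c) for w, c in shuffle(mapped).items())
print(word_counts)  # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```

In a real cluster each input split would be mapped on the node that stores it, and only the small (word, count) pairs would travel over the network to the reducers.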

Hadoop Ecosystem

• HBase: a column-oriented data store
• Hive: provides a SQL-like query capability
• Pig: a high-level language for creating MapReduce jobs
• HCatalog: takes Hive's metadata and makes it available across the Hadoop ecosystem
• Mahout: a library of algorithms for clustering, classification, and filtering
• Sqoop: accelerates bulk loads of data between Hadoop and RDBMSs
• Flume: streams large volumes of log data from multiple sources into Hadoop

NoSQL Database

• NoSQL ("Not Only SQL") is a broad class of database management systems identified by non-adherence to the widely used relational database management system model.

• They are useful when working with a huge quantity of data whose nature does not require a relational model.

Types of NoSQL Databases

• Column-oriented database
  – Example: Cassandra
• Document-oriented database:
  – Example: MongoDB, CouchDB
  – Data stored in JSON (JavaScript Object Notation) format

JSON, JavaScript Object Notation
http://www.w3schools.com/json/default.asp

JSON Example

{"employees":[
  {"firstName":"John", "lastName":"Doe"},
  {"firstName":"Anna", "lastName":"Smith"},
  {"firstName":"Peter", "lastName":"Jones"}
]}
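
As a brief illustration, the JSON example above can be read with Python's standard `json` module. The variable names are illustrative only.

```python
import json

text = ('{"employees":['
        '{"firstName":"John", "lastName":"Doe"},'
        '{"firstName":"Anna", "lastName":"Smith"},'
        '{"firstName":"Peter", "lastName":"Jones"}]}')

data = json.loads(text)  # JSON object -> dict, JSON array -> list
names = [e["firstName"] + " " + e["lastName"] for e in data["employees"]]
print(names)  # ['John Doe', 'Anna Smith', 'Peter Jones']

# Serializing and parsing again yields the same structure.
round_trip = json.loads(json.dumps(data))
```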

Cassandra is essentially a key-value store. This means that all data is stored in one 'table', each row of which is uniquely identified by a key. Rows need not share the same columns, and the data can be pictured in a JSON representation:

https://blog.safaribooksonline.com/2012/12/11/modeling-data-in-cassandra/

{
  "user1": {
    "Bio": { "name": "Shaneeb Kamran", "age": 23 }
  },
  "user2": {
    "Bio": { "name": "Salman ul Haq", "profession": "Developer" },
    "Education": { "bachelors": "NUST" }
  }
}

Column Data Model
http://www.sinbadsoft.com/blog/cassandra-data-model-cheat-sheet/
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/

• A column is a key-value pair consisting of three elements:
  – 1: Unique name: used to reference the column
  – 2: Value: the content of the column
  – 3: Timestamp: used to determine the valid content

• Column Family: A container for columns sorted by their names. Column Families are referenced and sorted by row keys.

• Super Column: A sorted associative array of columns
  – Example: Multi-value attribute

• Super column family: A container for super columns sorted by their names. Super Column Families are referenced and sorted by row keys.

• Keyspace: Top level element. Container for column families.
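
The definitions above can be modeled with a few lines of Python. This is a toy in-memory model for illustration, not Cassandra's API; the class names, the `Users` column family, and the sample values are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str        # unique name used to reference the column
    value: object    # content of the column
    timestamp: int   # used to decide which write holds the valid content

@dataclass
class ColumnFamily:
    rows: dict = field(default_factory=dict)  # row key -> {column name -> Column}

    def write(self, row_key, column):
        row = self.rows.setdefault(row_key, {})
        current = row.get(column.name)
        # Keep the column with the newest timestamp (last write wins).
        if current is None or column.timestamp >= current.timestamp:
            row[column.name] = column

    def read(self, row_key):
        # Columns are returned sorted by name.
        row = self.rows.get(row_key, {})
        return [(name, row[name].value) for name in sorted(row)]

keyspace = {"Users": ColumnFamily()}  # keyspace: container for column families
users = keyspace["Users"]
users.write("user1", Column("name", "Shaneeb Kamran", timestamp=1))
users.write("user1", Column("age", 23, timestamp=1))
users.write("user1", Column("age", 24, timestamp=2))  # newer write replaces the old value
users.write("user1", Column("age", 22, timestamp=0))  # stale write is ignored
print(users.read("user1"))  # [('age', 24), ('name', 'Shaneeb Kamran')]
```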

Column Family

Super column family

Migrate a Relational Database Structure into a NoSQL Cassandra Structure
http://www.divconq.com/2010/migrate-a-relational-database-structure-into-a-nosql-cassandra-structure-part-i/

{
  "biologicalfeatures": {
    "forests": {
      "forest003": { "name": "Black Forest", "trees": "two million", "bushes": "three million" },
      "forest045": { "name": "100 Acre Woods", "trees": "four thousand", "bushes": "five thousand" },
      "forest127": { "name": "Lonely Grove", "trees": "none", "bushes": "one hundred" }
    },
    "famoustrees": {
      "tree12345": { "forestID": "forest003", "name": "Der Tree", "species": "Red Oak" },
      "tree12399": { "forestID": "forest045", "name": "Happy Hunny Tree", "species": "Willow" },
      "tree32345": { "forestID": "forest003", "name": "Das Ubertree", "species": "Blue Spruce" }
    }
  }
}
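
In this structure `forestID` plays the role of a foreign key, but the store does not enforce it or join on it; the application has to resolve it. A minimal Python sketch using a subset of the data above (the `trees_in` helper is hypothetical):

```python
# A subset of the two column families from the example, as plain dicts.
forests = {
    "forest003": {"name": "Black Forest"},
    "forest045": {"name": "100 Acre Woods"},
    "forest127": {"name": "Lonely Grove"},
}
famous_trees = {
    "tree12345": {"forestID": "forest003", "name": "Der Tree"},
    "tree12399": {"forestID": "forest045", "name": "Happy Hunny Tree"},
    "tree32345": {"forestID": "forest003", "name": "Das Ubertree"},
}

def trees_in(forest_id):
    """Application-side 'join': scan the trees and match on forestID."""
    return sorted(t["name"] for t in famous_trees.values()
                  if t["forestID"] == forest_id)

print(forests["forest003"]["name"], trees_in("forest003"))
# Black Forest ['Das Ubertree', 'Der Tree']
```

Scanning every tree is acceptable for a toy example; with large data one would more likely store the trees in a layout keyed for this query, at the cost of duplicating data.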

Document database: MongoDB
http://docs.mongodb.org/manual/core/data-modeling-introduction/

• MongoDB stores business subjects in documents.
• A document is the basic unit of data in MongoDB. Documents are analogous to JSON objects but exist in the database in a more type-rich format known as BSON (Binary JSON), a binary-encoded serialization of JSON-like documents.
• The key decision in structuring MongoDB documents is how the application represents relationships between data:
  – References
  – Embedded documents

Example using reference
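
A reference-based model can be sketched with plain Python dicts standing in for documents. The collections, field names, and values below are hypothetical, and no MongoDB driver is used.

```python
# Two hypothetical collections, modeled as lists of documents.
users = [{"_id": 1, "username": "123xyz"}]
contacts = [{"_id": 10, "user_id": 1,  # user_id refers to a user's _id
             "phone": "123-456-7890", "email": "xyz@example.com"}]

def contacts_for(user_id):
    """Resolve the reference with a second lookup (an application-side join)."""
    return [c for c in contacts if c["user_id"] == user_id]

user = users[0]
user_contacts = contacts_for(user["_id"])
print(user["username"], user_contacts[0]["email"])  # 123xyz xyz@example.com
```

The data is normalized, much as in a relational design, so reading a user together with the contact details takes two lookups.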

Embedded Data Models
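
An embedded model keeps related data inside a single document. The sketch below uses a plain Python dict with hypothetical field names and values; no MongoDB driver is involved.

```python
import json

# One document holds the user and the related contact details.
user = {
    "_id": 1,
    "username": "123xyz",
    "contact": {"phone": "123-456-7890", "email": "xyz@example.com"},
}

# A single lookup returns everything; no second query is needed.
email = user["contact"]["email"]
print(email)  # xyz@example.com

# The document serializes directly to JSON text.
as_json = json.dumps(user)
```

Embedding trades some duplication for fewer reads, which suits data that is usually fetched together.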

CouchDB

• A CouchDB document is a JSON object that consists of named fields. Field values may be strings, numbers, dates, or even ordered lists and associative maps. An example of a document would be a blog post:

{
  "Subject": "I like Plankton",
  "Author": "Rusty",
  "PostedDate": "5/23/2006",
  "Tags": ["plankton", "baseball", "decisions"],
  "Body": "I decided today that I don't like baseball. I like plankton."
}

Problems with NoSQL Databases

• They do not support transaction consistency the way relational database systems do.

• There is no standard query language for NoSQL databases

NewSQL Databases
http://en.wikipedia.org/wiki/NewSQL

• NewSQL is a class of modern relational database management systems that seek to provide the same scalable performance of NoSQL systems for online transaction processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional database system.

Approaches of NewSQL Systems

• 1. Distributed cluster of shared-nothing nodes:
  – Each node owns a subset of the data. These databases include components such as distributed concurrency control and distributed query processing.
• 2. Transparent sharding:
  – These systems provide a sharding middleware layer to automatically split databases across multiple nodes.
• 3. Highly optimized SQL engines
• 4. In-memory database
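
The sharding idea in approach 2 can be sketched as a small routing layer. This is a toy illustration under stated assumptions: the `ShardRouter` class, the shard names, and hash-modulo routing are made up, and real middleware also handles rebalancing, replication, and cross-shard queries.

```python
import hashlib

class ShardRouter:
    """Routes each key to one shard so callers never pick a node themselves."""

    def __init__(self, shard_names):
        self.shard_names = list(shard_names)
        self.shards = {name: {} for name in self.shard_names}  # stand-ins for nodes

    def shard_for(self, key):
        # A stable hash keeps the key-to-shard mapping the same across runs.
        digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
        return self.shard_names[int(digest, 16) % len(self.shard_names)]

    def put(self, key, row):
        self.shards[self.shard_for(key)][key] = row

    def get(self, key):
        return self.shards[self.shard_for(key)].get(key)

router = ShardRouter(["shard0", "shard1", "shard2"])
for customer_id in range(100):
    router.put(customer_id, {"customer_id": customer_id})

print(router.get(42))
print({name: len(rows) for name, rows in router.shards.items()})
```

The application calls `put` and `get` as if there were one database; the split across nodes stays hidden, which is what "transparent" refers to.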

In-Memory Database

• An in-memory database is a database management system that primarily relies on main memory for computer data storage. It is contrasted with database management systems that employ a disk storage mechanism.
• Main memory databases are faster than disk-optimized databases.
• Good for Big Data analytics.
• Use non-volatile main memory modules that retain data even when electrical power is removed.
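
For a small-scale feel of the idea, SQLite can keep a whole database in RAM via the special `:memory:` name. This sketch uses Python's built-in `sqlite3` module; the `sales` table and its rows are made up. Unlike the non-volatile memory described above, this store is volatile: the data disappears when the connection closes.

```python
import sqlite3

# ":memory:" asks SQLite to hold the whole database in RAM instead of a file.
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
mem.executemany("INSERT INTO sales VALUES (?, ?)",
                [("east", 120), ("west", 80), ("east", 50)])

# An analytic query runs entirely against main memory.
totals = dict(mem.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))
print(totals)
mem.close()

# A new in-memory connection starts empty: nothing was persisted.
fresh = sqlite3.connect(":memory:")
tables = fresh.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
print(tables)
```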

SAP HANA, High-Performance Analytic Appliance

• SAP HANA is an in-memory, column-oriented, relational database management system developed and marketed by SAP. HANA's architecture is designed to handle both high transaction rates and complex query processing on the same platform.

• HANA's performance is claimed to be up to 10,000 times faster than that of standard disk-based systems, which allows companies to analyze data in a matter of seconds instead of hours.