BigData NoSQL Hadoop Part I: What? How? What for? Kacper Szkudlarek Openlab fellow CERN - European...


Page 1: BigData NoSQL Hadoop Part I: What? How? What for? Kacper Szkudlarek Openlab fellow CERN - European Organisation for Nuclear Research EN-ICE-SCD Industrial.

BigData NoSQL Hadoop
Part I: What? How? What for?

Kacper Szkudlarek
Openlab fellow
CERN - European Organisation for Nuclear Research
EN-ICE-SCD Industrial Controls & Engineering, SCADA Systems
Email: [email protected]
Supervised by: Piotr Golonka, Manuel Gonzalez Berges

Page 2:

What we are going to talk about:

• Today:
  – BigData
  – NoSQL – Not Only SQL
  – Hadoop – what is it all about?
  – HDFS/MapR – distributed file systems, the base of everything

• Next ICETea:
  – MapReduce as a new paradigm for data processing
  – Hadoop ecosystem tools
  – Other NoSQL systems

Page 3:

BigData

• Combination of old and new technologies that makes it possible to:
  – Manage huge volumes of data
  – Reach the right processing speed
  – Within the right time frame to allow real-time analysis and reactions
• Designed for all types of data:

Page 4:

BigData

• Structured
  – Pre-defined schema
  – Example: relational database
• Semi-structured
  – Inconsistent structure, cannot be stored in rows and tables
  – Example: logs, tweets, sensor feeds
• Unstructured
  – Full or partial lack of structure
  – Example: free-form text, reports

Page 5:

BigData characteristics

• The so-called 3 “V”s:
  – Volume
    • petabytes and exabytes of data (limited number of files)
  – Variety
    • any imaginable type of data
  – Velocity
    • speed at which data is collected

Page 6:

NoSQL

Not only SQL

Page 7:

What is NoSQL

• Next-generation databases addressing new features:
  – Non-relational
  – Distributed
  – Open-source
  – Horizontally scalable
• Systems providing mechanisms for Big Data processing
• New approach to storing huge amounts of data
  – Not necessarily structured data
  – Kept in many formats (e.g. key-value pairs, objects, trees …)
• Fast processing focused on data analytics

Page 8:

NoSQL examples

Divided by Data Model

Page 9:

1: Key-value

• Hash-map-like data layout, persisted to a distributed file system.
• Example: Project Voldemort, Riak

[Diagram: keys mapped to values]
  12345      → Some data
  ABCD       → Other data
  2014.06.19 → Yet another data
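
The key-value model on this slide can be sketched with an in-memory Python dictionary. This is only an illustration: the `KeyValueStore` class and its `put`/`get` methods are hypothetical names, not the API of Project Voldemort or Riak, and a real store would persist and distribute the data.

```python
class KeyValueStore:
    """Toy in-memory key-value store (hypothetical, for illustration only)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The store treats the value as opaque; it never inspects its structure.
        self._data[key] = value

    def get(self, key):
        # Lookup is by exact key only; there is no query on the value's content.
        return self._data.get(key)


store = KeyValueStore()
store.put("12345", "Some data")
store.put("ABCD", "Other data")
store.put("2014.06.19", "Yet another data")
```

The only access path is the key, which is what makes this model easy to partition across many nodes.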

Page 10:

2: Document

• Database as storage for a mass of documents.
• Each document can have a different data structure
  – No set schema
• Example: MongoDB, CouchDB.

  {
    _id: 101,
    type: "fruit",
    item: "jkl",
    qty: 10,
    price: 4.25,
    memos: [
      { memo: "on time", by: "payment" },
      { memo: "delayed", by: "shipping" }
    ]
  }
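
To illustrate "no set schema", the sketch below keeps two differently shaped documents in one collection, using plain Python dictionaries. The second document and the `find` helper are made up for this example; this is not the MongoDB or CouchDB API.

```python
# Two documents in the same collection; they do not share a schema.
collection = [
    {
        "_id": 101,
        "type": "fruit",
        "item": "jkl",
        "qty": 10,
        "price": 4.25,
        "memos": [
            {"memo": "on time", "by": "payment"},
            {"memo": "delayed", "by": "shipping"},
        ],
    },
    # Hypothetical second document with different fields.
    {"_id": 102, "type": "tool", "item": "xyz", "warranty_years": 2},
]


def find(docs, **criteria):
    """Return documents whose fields match all given criteria (hypothetical helper)."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


fruits = find(collection, type="fruit")
```

A query on a field that some documents lack simply skips those documents instead of failing.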

Page 11:

3: Column-family

• Stores multiple aggregates
  – Identified by row id and column family name,
  – More complex data model,
  – Gain on data retrieval.
• Example: Apache HBase, Cassandra.

[Diagram: row 12345 with two column families]
  Row id: 12345
  Column family:
    Name: Kacper
    Surname: Szkudlarek
  Column family:
    City: Saint-Genis-Pouilly
    Street: Rue du Bordeau
    Postal code: 01630
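
A minimal sketch of the column-family layout, using nested Python dictionaries (row id, then column family, then column). The family names `personal` and `address` are assumptions, since the slide only labels both groups "Column family"; this is not the HBase or Cassandra API.

```python
# row id -> column family -> column -> value
table = {
    "12345": {
        "personal": {"Name": "Kacper", "Surname": "Szkudlarek"},
        "address": {
            "City": "Saint-Genis-Pouilly",
            "Street": "Rue du Bordeau",
            "Postal code": "01630",
        },
    }
}


def read_family(table, row_id, family):
    """Fetch one column family of one row without touching the others."""
    return table[row_id][family]


address = read_family(table, "12345", "address")
```

Addressing data by row id plus column family is what the slide calls the "gain on data retrieval": a reader that needs only the address never loads the other family.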

Page 12:

4: Graph

• Modeling of relations between data
  – Data decomposition.
• Example: Neo4j

Page 13:

Relaxed data consistency

• No ACID (atomicity, consistency, isolation, durability) in the sense known from relational databases
  – Exception: graph databases, because they decompose the data
• No real need for transactions
  – Data is kept aggregated,
  – An aggregate update is atomic.

Page 14:

Want more information?

• https://www.youtube.com/watch?v=qI_g07C_Q5I

Page 15:
Page 16:

[Hadoop logo] = Distributed FS, clustering, job scheduler, MapReduce

Page 17:

What is Hadoop?

• Apache-licensed software
• Batch processing system for a cluster of nodes
• Underpinnings of Big Data processing systems
  – Storing huge amounts of data
  – Fast local processing, split into chunks
• Can work on any modern desktop PC as a node
  – Decent, automatic scalability
• Core and main API written in Java (unfortunately)

Page 18:

Who uses Hadoop? (in one form or another)

Page 19:

New Hadoop paradigms

• Process data locally
• Reduce dependence on bandwidth
• Expect/accept failure
  – Handle failover elegantly
• Duplicate finite blocks of data to small groups of nodes (rather than the entire database)
• Reduce elapsed seek time
• Data processing cost reduction

Page 20:

The Hadoop Approach

• Distribute large amounts of data across thousands of commodity hardware nodes
  – Process data in parallel
  – Replicate data across the cluster for reliability
• Analysis moved to data
  – Avoid data copy
• Scanning of data
  – Avoids random seeks
  – Easiest way to process

Source: http://bitquill.net/blog/?tag=hadoop

Page 21:

The Ecosystem of Projects associated with Hadoop

[Diagram: Hadoop ecosystem, grouped by function]
• Data Management
  – HDFS (Hadoop Distributed File System)
  – YARN (NextGen MapReduce)
• Data Access
  – Batch: MapReduce
  – Script: Pig
  – SQL: Hive
  – NoSQL: HBase
  – Stream: Storm
  – Others
• Integration
  – Sqoop, Flume
  – NFS, WebHDFS
• Operations
  – Monitor: Zookeeper
  – Scheduling: Oozie

Page 22:

Hadoop and Java

• The Hadoop core and the base projects are developed in Java
• All APIs for Mapper, Reducer, HDFS and so on are based on Java interfaces
• Other languages can be used to define certain jobs or parts of jobs

Page 23:

HDFS and other distributed file systems

Page 24:

What is HDFS?

• Standard Hadoop Distributed File System
• Logical file system
• Primary storage system for Hadoop
• Specialized for read access
• Can handle enormous files (> 100 TB)
• Currently deployed only on Linux

Page 25:

HDFS Characteristics

• Persistent
• Replicated
• Linearly scalable
• Applications sequentially stream reads
  – Often from very large files
• Optimized for read performance
  – Avoids random disk seeks
• Write once and read many times
• Files are append-only
• Data stored in blocks
  – Distributed over many nodes
  – Block size often ranges from 128 MB to 1 GB
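
As a rough worked example of what block storage and replication mean for capacity, the sketch below counts the blocks of a file and the raw disk space it occupies. The 10 GB file size and the replication factor of 3 are assumptions chosen for illustration; they do not appear on the slide.

```python
MB = 1024 * 1024
GB = 1024 * MB


def block_count(file_size, block_size):
    """Number of blocks needed for a file; the last block may be partly filled."""
    return (file_size + block_size - 1) // block_size


def raw_storage(file_size, replication):
    """Bytes used across the cluster when every block is stored `replication` times."""
    return file_size * replication


blocks = block_count(10 * GB, 128 * MB)        # 80 blocks
stored = raw_storage(10 * GB, replication=3)   # 30 GB of raw disk (assumed factor 3)
```

Larger blocks mean fewer entries for the NameNode to track, which is one reason the block sizes quoted above are so much bigger than in a local file system.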

Page 26:

HDFS Architecture

[Diagram: HDFS architecture]
• NameNode
  – Holds the NameSpace and the Block Map
  – Namespace metadata: image (checkpoint) and edit journal log
• Secondary NameNode
  – Checkpoint image and edit journal log (backup)
• DataNodes
  – Store the blocks, e.g. one DataNode holds BL1, BL7, BL8, BL11 and another holds BL1, BL6, BL2, BL7

Page 27:

Logical File System

• A file’s disk blocks are not physically contiguous
  – Distributed across many DataNodes
• Data is only logically contiguous
• Read/write mechanism is transparent to the user

Page 28:

Data Organization

• Metadata
  – Organized into files and directories
  – Linux-like permissions prevent accidental deletions
• Files
  – Divided into uniformly sized blocks
  – Default 64 MB
  – Distributed across the cluster
• Rack-aware (HA, minimization of out-of-rack data transfers)
• Checksumming
  – Corruption detection
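
The idea of uniform blocks plus per-block checksums can be sketched as follows. This is a toy illustration, not how HDFS is implemented: the block size is 64 bytes rather than 64 MB, and `zlib.crc32` from the Python standard library stands in for whatever checksum the file system actually uses.

```python
import zlib


def split_into_blocks(data, block_size):
    """Cut a byte string into uniform blocks; only the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def checksum(block):
    # CRC32 from the standard library, used here as a stand-in checksum.
    return zlib.crc32(block)


data = b"sensor readings " * 10            # 160 bytes of toy data
blocks = split_into_blocks(data, 64)        # toy block size: 64 bytes, not 64 MB
stored_sums = [checksum(b) for b in blocks]

# Simulate corruption of one byte in the second block.
corrupted = bytearray(blocks[1])
corrupted[0] ^= 0xFF
blocks[1] = bytes(corrupted)

# Re-verify every block against the checksum recorded at write time.
bad_blocks = [i for i, b in enumerate(blocks) if checksum(b) != stored_sums[i]]
```

Because checksums are kept per block, a mismatch identifies exactly which block has to be re-read from another replica.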

Page 29:

HDFS Cluster (I)

• HDFS runs in Hadoop distributed mode
• 3 main components:
  – NameNode (optionally with a Secondary NameNode):
    • Manages DataNodes
    • Keeps metadata for all nodes and blocks
    • No automatic failover (even with a Secondary NameNode)
    • Backups of logs

Page 30:

HDFS Cluster (II)

• DataNodes
  – Hold data blocks
  – Slave in hierarchy
  – Manage blocks for HDFS
  – If heartbeat fails:
    • Removed from cluster
    • Replicated blocks take over
• Client
  – Talks directly to the NameNode, then to DataNodes

[Diagram: the NameNode (with its fsimage and editlog) receives heartbeats from the DataNode daemons of the HDFS cluster]
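
The heartbeat rule above can be sketched as a small bookkeeping class. Everything here is hypothetical (the `NameNodeMonitor` name, the 30-second timeout, the node and block ids) and only illustrates the idea; it is not the Hadoop implementation.

```python
class NameNodeMonitor:
    """Toy heartbeat tracker (hypothetical; not the Hadoop implementation)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # DataNode id -> time of last heartbeat

    def heartbeat(self, node_id, now):
        self.last_seen[node_id] = now

    def dead_nodes(self, now):
        # A node counts as dead once it has been silent longer than the timeout.
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout_s)

    def under_replicated(self, block_map, now, wanted):
        # Blocks whose live replica count dropped below the wanted replication.
        dead = set(self.dead_nodes(now))
        return sorted(
            block for block, nodes in block_map.items()
            if len([n for n in nodes if n not in dead]) < wanted
        )


monitor = NameNodeMonitor(timeout_s=30.0)   # assumed timeout, for illustration
for node in ("dn1", "dn2", "dn3"):
    monitor.heartbeat(node, now=0.0)
monitor.heartbeat("dn1", now=40.0)
monitor.heartbeat("dn3", now=40.0)           # dn2 stays silent

block_map = {"BL1": ["dn1", "dn2"], "BL2": ["dn1", "dn3"]}
dead = monitor.dead_nodes(now=45.0)
todo = monitor.under_replicated(block_map, now=45.0, wanted=2)
```

Reads of `BL1` can still be served from `dn1`, which is the sense in which "replicated blocks take over"; the `todo` list is what would have to be copied again to restore the wanted replica count.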

Page 31:

File Access – RPC

[Diagram: client code (Pig, Hive, HBase, fs shell) runs in a JVM and uses the Distributed File System and FSData Output Stream to talk to the NameNode (NameSpace, Block Map) and to a DataNode; the arrows are numbered 1 to 6]

1. Request (create/open/delete)
   • Provide name of file or directory
2. Approval
3. Request for block
4. Block ID and list of DataNodes
5. Operation on DataNode
   • Read
   • Write
   • Delete
6. Return

Note:
• NameNode is not in the data path
• NameNode only stores metadata
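
The read path described in the steps above can be sketched with two toy classes. All names (`NameNode`, `DataNode`, `get_block_locations`, `read_block`, `read_file`) are hypothetical and chosen for readability; this is not the Hadoop Java API, and the RPC layer is replaced by direct method calls.

```python
class NameNode:
    """Holds metadata only: which blocks make up a file and where replicas live."""

    def __init__(self):
        self.namespace = {}   # file name -> ordered list of block ids
        self.block_map = {}   # block id -> list of DataNode ids

    def get_block_locations(self, filename):
        # Steps 1-4: the client asks by name and gets block ids plus DataNode lists.
        return [(b, self.block_map[b]) for b in self.namespace[filename]]


class DataNode:
    """Holds the actual block contents."""

    def __init__(self):
        self.blocks = {}      # block id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(namenode, datanodes, filename):
    """Client side: metadata from the NameNode, data straight from DataNodes."""
    chunks = []
    for block_id, locations in namenode.get_block_locations(filename):
        # Step 5: read from the first listed replica; the NameNode never sees the data.
        chunks.append(datanodes[locations[0]].read_block(block_id))
    return b"".join(chunks)   # Step 6: return the reassembled content


dn1, dn2 = DataNode(), DataNode()
dn1.blocks = {"BL1": b"hello "}
dn2.blocks = {"BL1": b"hello ", "BL2": b"world"}

nn = NameNode()
nn.namespace["/demo.txt"] = ["BL1", "BL2"]
nn.block_map = {"BL1": ["dn1", "dn2"], "BL2": ["dn2"]}

content = read_file(nn, {"dn1": dn1, "dn2": dn2}, "/demo.txt")
```

Note that `NameNode` never stores or returns file bytes, which mirrors the two points in the note: it is not in the data path and only stores metadata.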

Page 32:

MapR

• Alternative to HDFS
• Built for business-critical production applications.
  – Commercial product
  – Free-to-use version available
• New container architecture, different from HDFS
• Implements normal files, visible in the operating system as soon as they are written, accessible via NFS
• Solves the synchronization problem on commodity hardware
• Reliable

Page 33:

Container architecture

• Chops the data on each node into thousands of pieces (containers)
• Replicates containers across the cluster
• If a node dies, the other nodes re-replicate the missing data at higher speed

Page 34:

HDFS vs MapR

Disclaimer: Source: http://www.mapr.com/why-hadoop/why-mapr/architecture-matters

Page 35:

MapR advantages

• High Availability cluster
• Better performance than HDFS
  – Data from the HDFS NameNode moved into the cluster
  – No file count limitation
  – Lower costs, less hardware in the cluster
• NFS interface for cluster access; behaves like a giant NFS server with full HA
• Replicated, ultra-reliable solution available in the M7 option
• Holder of the TeraSort world record (time to sort 1 TB of data) -> 55 seconds (youtube link…)

Page 36:

Other distributed file systems

• GFS – Google File System, a proprietary file system developed by Google for its own use.
• GridFS – distributed file system used by MongoDB

Page 37:

es-hadoop

• Hadoop extension to work with Elasticsearch data.
• Near real-time responses (think milliseconds).
• Dedicated Input/Output classes to read data into Hadoop MapReduce.
• Uses the Hadoop paradigm of local data processing:
  – Each node works on the shards stored on it.
• Integration with Hadoop tools (Pig, Hive, etc.).
• Horizontal scaling of the cluster

Page 38:

Distributions of Hadoop

• Many different distributions available
  – Cloudera (under testing @CERN)
    • Free VM images/Online Live Service
  – Hortonworks
    • Free VM images
  – MapR
    • Many free and paid virtual machines
  – Spring for Apache Hadoop
• Where to read about it?
  – Online training by Hortonworks and Cloudera

Page 39:

To be continued…

• MapReduce – a new paradigm for data processing
• Hive – SQL-like interface and data access tool
• Pig – high-level scripting tool for data processing
• HBase – NoSQL system, the new way of thinking about databases