On Storing Big Data

Post on 18-Aug-2015

Ilias Flaounas

Intelligent Systems Lab

30 October 2012

I. Flaounas (Intelligent Systems Lab) 30 October 2012 1 / 16

Storing Big Data

Data play an increasingly important role in business and science.

Storing, searching, sharing, analysing and visualising big data has become a challenge.

Storage in particular is often disregarded as an issue.

Note that sometimes a MySQL database is not enough.

Hadoop offers an out-of-the-box distributed filesystem for storing data files. However, the challenge appears when someone needs DB capabilities, frequent updates or real-time processing.

The Problems

Nowadays traditional relational databases can reach their performance limits.

Data keep arriving at high velocity, in high volumes, and in high variety.

Common practices to increase performance fail after a while: buying a faster server, getting more RAM, using materialised views, fine-tuning queries...

Furthermore, “alter table” doesn’t really work with lots of data: changing the schema of a very large table can take a very long time.

Backups and data availability become an issue.

NoSQL Movement

The term is too broad and too new to define precisely.

Wikipedia: “NoSQL (Not only SQL) DB systems are often highly optimized for retrieve and append operations and often offer little functionality beyond record storage.”

No schema

No joins between tables

No common query language (like SQL)

No ACID guarantees (atomicity, consistency, isolation, durability)

On the other hand, you gain horizontal scalability and high performance. Also, most NoSQL systems are Map/Reduce ready and/or bind with Hadoop.
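
To make "Map/Reduce ready" concrete, here is a minimal single-process sketch of the map/reduce pattern in plain Python. It is an illustration only, not the API of Hadoop or of any NoSQL system; the sample records and the helper names (map_tags, reduce_counts, map_reduce) are assumptions, with field names borrowed from the example document shown later in the talk.

```python
from collections import defaultdict

# Hypothetical sample records (not real data).
articles = [
    {"Title": "A", "Tags": ["News", "Scotland"]},
    {"Title": "B", "Tags": ["News", "International"]},
    {"Title": "C", "Tags": ["Scotland"]},
]


def map_tags(doc):
    """Map step: emit one (tag, 1) pair for every tag of a document."""
    for tag in doc.get("Tags", []):
        yield tag, 1


def reduce_counts(key, values):
    """Reduce step: combine all values that were emitted for one key."""
    return sum(values)


def map_reduce(docs, mapper, reducer):
    """Group the mapper output by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


tag_counts = map_reduce(articles, map_tags, reduce_counts)
print(tag_counts)
```

Because each document is mapped independently and each key is reduced independently, both steps can in principle run on different machines, which is what makes the pattern a good fit for horizontally scaled stores.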

NoSQL DBs

There are lots of different systems under the NoSQL ‘umbrella’. Each one is optimised with different application scenarios in mind, and makes different trade-offs.

Document based: CouchDB, MongoDB, ...

Key-value: Cassandra, Dynamo, Riak, ...

Tabular based: BigTable, HBase, ...

Memory based: Memcached, Redis, others optimised for solid-state disks, ...

Specialised for graphs: Neo4j, InfiniteGraph, ...

Specialised for full-text search: Lucene, Solr, ...

Understand your requirements and then make a choice.

Oracle response

May 2011: Oracle issues a white paper titled “Debunking the NoSQL Hype”.

The conclusion:

“Go for the tried and true path. Don’t be risking your data on NoSQL databases.”

October 2011: Oracle releases the “Oracle NoSQL Database”. The white paper is now reachable only via Google archives.

Example: MongoDB

MongoDB (from “humongous”) is an open source, high-performance, schema-free, document-oriented database.

Document-Oriented storage

No predefined schema

High Performance

Easy to add new “columns” in data rows

No joins between tables

Easy to scale horizontally: Auto-Sharding

Automatic fail-over: invisible to applications

Full Index Support

Map/Reduce ready - Can bind with Hadoop

Eventually consistent

Open source, but developed and maintained by the company “10gen”

Document based DB

A document is represented in JSON format:

{
  "_id" : 12345678,
  "Link" : "http://news.scotsman.com/abc.html",
  "Title" : "Blah blah blah",
  "Content" : "More blah blah",
  "OutletID" : 14,
  "Date" : ISODate("2011-11-17T20:33:15.097Z"),
  "_Hash" : 550973592,
  "Tags" : [ "International", "News", "Scotland" ]
}
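
As a rough sketch of what "schema-free" means in practice, the example document can be modelled as a plain Python dict. A list stands in for a collection here; this is not the MongoDB driver API. The second record, its "Language" field and the find helper are hypothetical and exist only for illustration.

```python
import json
from datetime import datetime, timezone

# Stand-in for a collection: a list of dicts. No schema is declared anywhere.
collection = []

article = {
    "_id": 12345678,
    "Link": "http://news.scotsman.com/abc.html",
    "Title": "Blah blah blah",
    "Content": "More blah blah",
    "OutletID": 14,
    "Date": datetime(2011, 11, 17, 20, 33, 15, 97000, tzinfo=timezone.utc),
    "Tags": ["International", "News", "Scotland"],
}
collection.append(article)

# A later document may carry a field the first one lacks ("Language" is a
# made-up field); nothing like "alter table" is needed to store it.
newer = {"_id": 12345679, "Title": "Another story", "Language": "en"}
collection.append(newer)


def find(docs, **criteria):
    """Return the documents whose fields equal all the given criteria."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


english = find(collection, Language="en")
print(json.dumps(english, default=str))
```

Documents that lack a field simply do not match a query on it, which is why adding a new "column" does not require touching existing records.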

Single Server

A single machine stores the DB, e.g. MySQL.

Master/Slave

Two machines in Master/Slave configuration.

MongoDB - Replication

Automatic Fail Over - The Master is elected among servers.
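
The idea of electing a new master can be sketched as follows. This is a deliberately simplified model, not MongoDB's actual election protocol: it assumes a new master needs a strict majority of the configured servers to be reachable, and that the most up-to-date reachable server wins. The Member class, its last_applied counter and the server names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    reachable: bool    # can this server currently be contacted?
    last_applied: int  # hypothetical counter: higher means more recent data


def elect_primary(members):
    """Return the name of the new master, or None if no majority is reachable."""
    alive = [m for m in members if m.reachable]
    # Require a strict majority of the full set, so that two halves of a
    # partitioned cluster can never both elect a master.
    if len(alive) * 2 <= len(members):
        return None
    # Prefer the most up-to-date server; ties go to the greatest name so the
    # result is deterministic.
    best = max(alive, key=lambda m: (m.last_applied, m.name))
    return best.name


replica_set = [
    Member("db1", reachable=False, last_applied=120),  # the failed master
    Member("db2", reachable=True, last_applied=118),
    Member("db3", reachable=True, last_applied=119),
]
new_primary = elect_primary(replica_set)
print(new_primary)
```

The majority rule is the reason such clusters are usually deployed with an odd number of voting servers: with only two, the loss of either one leaves no majority.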

MongoDB - Sharding

Data is spread horizontally.
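
Spreading data horizontally means that a key decides which machine stores a record. A minimal sketch, assuming a hash-modulo placement rule: this is a simplification, since MongoDB itself partitions the shard key space into chunks. The shard names and the shard_for helper are hypothetical.

```python
import hashlib

SHARDS = ["shard0", "shard1", "shard2"]  # hypothetical shard names


def shard_for(key, shards=SHARDS):
    """Pick a shard from a stable hash of the key."""
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


# Route a handful of consecutive document ids and count them per shard.
placement = {}
for doc_id in range(12345678, 12345678 + 9):
    placement.setdefault(shard_for(doc_id), []).append(doc_id)

for shard, ids in sorted(placement.items()):
    print(shard, len(ids))
```

Hashing spreads consecutive ids across machines, which helps write throughput, but it makes range queries touch every shard; and with a plain modulo rule, changing the number of shards would move most keys.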

MongoDB

If a new shard is added, data is balanced automatically.
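
Automatic balancing can be pictured as moving chunks of data from the fullest shard to the emptiest one until the counts are roughly even. The sketch below is a toy model under that assumption, not MongoDB's balancer; the balance function, its threshold parameter and the cluster layout are hypothetical.

```python
def balance(shards, threshold=1):
    """Move chunks from the fullest to the emptiest shard until their
    chunk counts differ by at most `threshold`.

    `shards` maps a shard name to a list of chunk ids and is modified in
    place. Returns the migrations as (chunk, source, destination) tuples.
    """
    if threshold < 1:
        # With a threshold of 0 an odd chunk would bounce back and forth forever.
        raise ValueError("threshold must be at least 1")
    migrations = []
    while True:
        fullest = max(shards, key=lambda s: len(shards[s]))
        emptiest = min(shards, key=lambda s: len(shards[s]))
        if len(shards[fullest]) - len(shards[emptiest]) <= threshold:
            return migrations
        chunk = shards[fullest].pop()
        shards[emptiest].append(chunk)
        migrations.append((chunk, fullest, emptiest))


cluster = {"shard0": [0, 1, 2, 3], "shard1": [4, 5, 6, 7]}
cluster["shard2"] = []  # a new, empty shard joins the cluster
moves = balance(cluster)
print({name: len(chunks) for name, chunks in cluster.items()})
```

Moving whole chunks rather than re-hashing every record keeps the amount of data copied proportional to the imbalance, not to the size of the database.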

MongoDB

No single point of failure, distributed read/writes.

Big Data come with Big Problems

Maintenance of infrastructure - It is easier to manage one server than 10

Need to adapt legacy software

Training people on the new technologies

Designing the DB – splitting data among machines for maximum I/O

Bugs, ‘simple’ features may be missing, new versions come out too often...

Security

Thank you!
