Road Map for Careers in Big Data

Architecture Design Series Mich Talebzadeh London August 17, 2016 Road Map for Careers in Big Data

Transcript of Road Map for Careers in Big Data

Page 1: Road Map for Careers in Big Data

Architecture Design Series

Mich Talebzadeh

London August 17, 2016

Road Map for Careers in Big Data

Page 2: Road Map for Careers in Big Data

Author

About the Author:

Mich Talebzadeh
Big Data and RDBMS Senior Technical Architect

E-mail: [email protected]
https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2?
https://talebzadehmich.wordpress.com/

All rights reserved. No part of this publication may be reproduced in any form, or by any means, without the prior written permission of the copyright holder.

Page 3: Road Map for Careers in Big Data

The Big Picture

− Get familiar with Big Data in general
− The buzzwords and the emerging new products are expanding at an amazing pace
− Gain an understanding of the Big Data space
− Relate your existing (often relational) knowledge to Big Data
− Build upon your existing knowledge and experience
− Discover where Big Data fits
− Gain an insight into the Hadoop core
− Make it easier to move to Big Data if you wish

Page 4: Road Map for Careers in Big Data

© 2016 Mich Talebzadeh Roadmap for Careers in Big Data 4

Trends

Back in 2007, I responded to an article “The Oracle DBMS is a ‘legacy technology.’” on techtarget.com.

I stated that the argument is not so much that relational database management systems “should be considered legacy technology” because they are more than a quarter of a century old, but has more to do with the philosophy and the design approach that they were created for.

Simply put, today’s systems and users deal much more with non-transactional (read) activity than with transactional (write/update) activity.

Does that mean that RDBMSs are legacy systems? I don’t think so. All it means is that there is a limit to how far one can take the relational model.

Page 5: Road Map for Careers in Big Data

Trends

So what is the solution in a business world that is growing increasingly distributed and has to deal with the increased use of heterogeneous systems and proprietary databases across different levels of business?

We need to deploy tools fit for the purpose; relying on a one-engine philosophy (say, an RDBMS) is no longer viable.

Move from internal company data to information from multiple sources.

Deal with both transactional and non-transactional data.

The RDBMS landscape is changing. The newer versions are adapting themselves to the new landscape.

Page 6: Road Map for Careers in Big Data

Trends

− Shift from data at rest to data in motion
− The world is saving more data than before
− The ability to collect and analyze such large amounts of data has given rise to Big Data technology
− Big Data is a trendy term these days; if you are in I.T. and don't know about it, you are missing the buzz
− Just as we have tools to collect and analyze transactional data, we need tools to collect and analyze vast quantities of data
− If time is of the essence in the transactional world, it is equally becoming of the essence when dealing with vast amounts of data

Page 7: Road Map for Careers in Big Data

Big Data Meaning and Terminology

Big Data is a term for data sets that are so large or complex that traditional data processing applications are inadequate.

The challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization and querying.

The term often refers to the use of predictive analytics, user behaviour analytics (sentiment analysis), or certain other advanced data analytics methods that extract value from the vast amount of data.

Page 8: Road Map for Careers in Big Data

Big Data Meaning and Terminology

Big Data is not only about the volume.

Much of the data is received in real time and is most valuable at the time of arrival.

Examples:
− Detect share prices as soon as possible
− A service operator wants to detect failures from logs within seconds
− A news site may want to train its model to show users content in which they are interested

Page 9: Road Map for Careers in Big Data

Big Data Benefits

Big Data can lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction and reduced risk.

Analysis of such data sets can find new correlations to "spot business trends, prevent diseases, combat crime and so on".

Some common sources of Big Data are shown below

Page 10: Road Map for Careers in Big Data

What sort of Data are we dealing with

Page 11: Road Map for Careers in Big Data

The current estimate is that around 20-30% of available data is stored in structured format and the rest is unstructured.

Unstructured refers to data that either does not have a pre-defined data model or is not organized in a pre-defined manner.

Big Data deployment and usage:
− Primarily built on open source
− Distributed data - data physically distributed across a network using commodity hardware, aka a cluster
− Business Intelligence - structured queries

What sort of Data are we dealing with

Page 12: Road Map for Careers in Big Data

Big Data deployment and usage:
− Analytics and data mining - for data from multiple sources and of variable types
− Cloud Computing - access to large pools of computing power available as needed
− Networked devices
− Data from Sensors - more sensors producing more data more frequently
− The Internet of Things - machine-generated data read and used by other machines
− Deploying both relational and NoSQL (not only SQL) type storage

What sort of Data are we dealing with

Page 13: Road Map for Careers in Big Data

Big Data Fundamentals:

Hadoop core consists of
− Hadoop Distributed File System (HDFS) on a cluster
− MapReduce - the default engine to search and retrieve data across the HDFS cluster
− YARN - general purpose resource manager to manage resources in the cluster

Data Warehouses
− Hive - the most versatile and capable of the many SQL or SQL-like ways of accessing data on Hadoop
− Pig - similar to Hive but supports a procedural language

NoSQL Databases - In general, a NoSQL database relaxes the rigid requirements of an RDBMS (meaning ACID properties), so one can implement new usage models that require lower latency and higher levels of scalability. NoSQL databases come in four major types:

Big Data Databases
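
As an aside on the Hadoop core listed on this slide, the map-reduce idea can be sketched in a few lines of plain Python. This is only a single-process illustration of the map, shuffle and reduce steps, not Hadoop's API; the function names and the sample input are assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical input: each string stands in for a block of a file on HDFS.
blocks = ["big data is big", "data in motion"]
counts = word_count(blocks)
```

On a real cluster the map calls would run in parallel on the nodes holding each block, and the shuffle would move data across the network.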

Page 14: Road Map for Careers in Big Data

Key/Value Database - the simplest type of database, able to store any kind of digital object. Each object is assigned a key and is retrieved using the same key. However, analytic capabilities are limited. Examples of key/value databases include Redis, Riak and Oracle NoSQL Database.

− There is no schema for the value returned by the key; it is a heap of data
− Query only by key; you cannot query based on the rest of the content
− All queries return the value as a lump of data
− There is no way to return just part of the value

Use Cases:
− Session information in web applications
− The shopping cart for an online buyer
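
The behaviour described above can be illustrated with a toy in-memory store. This is a minimal sketch, not the API of Redis, Riak or any other product; the class name, the key format and the session contents are all assumptions.

```python
import json


class KeyValueStore:
    """Toy in-memory key/value store: every value is an opaque blob of bytes."""

    def __init__(self):
        self._data = {}

    def put(self, key, value: bytes):
        self._data[key] = value

    def get(self, key):
        # The only lookup available is by key; a missing key yields None.
        return self._data.get(key)


store = KeyValueStore()

# Hypothetical web session holding a shopping cart.
session = {"user": "alice", "cart": ["book", "pen"]}
store.put("session:42", json.dumps(session).encode("utf-8"))

# The store hands back the whole blob; the application must decode it
# itself to reach the cart, because the store knows nothing of its schema.
blob = store.get("session:42")
cart = json.loads(blob.decode("utf-8"))["cart"]
```

Note that there is no way to ask the store for "all sessions with a pen in the cart"; that kind of query would need a scan by the application or a different type of database.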

NoSQL Database types

Page 15: Road Map for Careers in Big Data

Document Stores - These are a type of key/value store in which documents are stored in recognizable formats and are accompanied by metadata. Because of the consistent formats and the metadata, you can perform search and analysis without first retrieving the documents.

Examples of document stores include MongoDB, MarkLogic and Couchbase.

 

NoSQL Database types

Page 16: Road Map for Careers in Big Data

Brief walk-through

The Relational vs. Document-Oriented Data Model

Page 17: Road Map for Careers in Big Data

The Relational vs. Document-Oriented Data Model

Page 18: Road Map for Careers in Big Data

In the relational model, records are normalised across multiple tables, using primary key and foreign key constraints.

The upside of normalisation is that there should be no duplication of data in the database.

The downside is that a change in a single record can mean holding locks on a number of records on different tables as defined by the transaction to ensure a change does not leave the database in an inconsistent state. This complex web of interrelationships is what makes it so difficult to distribute relational data across multiple servers and can lead to performance challenges both reading and writing data.

Big data is all about scaling out (horizontally). Granted, this may result in duplication of data. However, using more storage in exchange for better application performance and the ability to easily distribute workloads across machines is now a better choice for many applications.
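
To make the normalised side of this comparison concrete, here is a small sketch using Python's built-in sqlite3 module. The table names, columns and rows are hypothetical; the point is only that each fact is stored once and a query has to join tables to reassemble a complete record.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        item        TEXT NOT NULL
    );
""")

# The customer's name is stored exactly once (no duplication).
conn.execute("INSERT INTO customer VALUES (1, 'Alice')")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, "book"), (11, 1, "pen")],
)

# Reading a complete picture requires a join across both tables.
query = """
    SELECT c.name, o.item
    FROM customer c
    JOIN orders o ON o.customer_id = c.id
    ORDER BY o.id
"""
rows = conn.execute(query).fetchall()

# A single-row update is reflected in every joined result.
conn.execute("UPDATE customer SET name = 'Alice Smith' WHERE id = 1")
renamed = conn.execute(query).fetchall()
```

If the two tables lived on different servers, every such join would cross the network, which is the distribution difficulty described above.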

The Relational vs. Document-Oriented Data Model

Page 19: Road Map for Careers in Big Data

What is a Document Data Model?

Use of the term “document” is a bit confusing. A document-oriented database really has nothing to do with “documents” in the classical sense of the word. It does not mean books, letters or articles. Rather, a document in this case refers to a data record that is self-describing as to the data elements it contains. XML documents, HTML documents and JSON documents are examples of “documents” in this context. An example of a record looks like the one below.

The Relational vs. Document-Oriented Data Model

Page 20: Road Map for Careers in Big Data

The Relational vs. Document-Oriented Data Model

As can be seen, the data is de-normalized. Each record contains a complete set of information related to the error without external reference. The records are self-contained. This makes it very easy to move the entire record to another server – all the information simply comes along with it. There is no concern about having parts of the record in other tables left behind.
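
A self-contained record of this kind can be sketched as a nested Python dictionary that serialises to JSON. The record below is hypothetical and is not the one from the original slide; the field names, the document ID and the two "servers" (plain dictionaries) are assumptions for illustration.

```python
import json

# Hypothetical de-normalised error record: host and application details
# are embedded in the document instead of living in separate tables.
error_doc = {
    "_id": "err-0001",  # assumed document ID, the only key to the record
    "timestamp": "2016-08-17T10:15:00Z",
    "severity": "ERROR",
    "host": {"name": "node1", "rack": "r1"},
    "application": {"name": "billing", "version": "1.2"},
    "message": "connection timed out",
}

# Two "servers", each holding a collection of documents keyed by ID.
server_a = {"errors": {error_doc["_id"]: error_doc}}
server_b = {"errors": {}}


def move_document(doc_id, source, target, collection="errors"):
    """Move one document between servers; nothing is left behind."""
    doc = source[collection].pop(doc_id)
    target[collection][doc_id] = doc


move_document("err-0001", server_a, server_b)
moved = server_b["errors"]["err-0001"]
as_json = json.dumps(moved, sort_keys=True)
```

Because the document carries its own host and application details, moving it needs no joins and no foreign keys to follow.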

Page 21: Road Map for Careers in Big Data

Only the self-contained record (document) needs to be updated when changes are made. In the NoSQL world the document ID is the one and only key to a document.

Document ID is roughly equivalent to a primary key in a relational database. Usually an ID can only appear once in a database. Different NoSQL solutions have different names for these: buckets, collections, tables, etc. A Collection is equivalent to a Table and a Document to a Row

Use Cases:
− Enterprise data
− Event logging applications
− Content management systems
− Web analytics and real-time analytics
− E-Commerce applications

The Relational vs. Document-Oriented Data Model

Page 22: Road Map for Careers in Big Data

Columnar Databases – These provide varying degrees of row and column structure. Each row in the column family has a unique key plus as many columns as needed to hold relevant information. Unlike relational columnar databases such as Teradata or SAP Sybase IQ, there is no requirement for every row to have the same columns. You can use a columnar database to perform more advanced queries on big data. Examples of columnar databases include HBase, Cassandra and Accumulo.

Use Cases:
− Distribution of data across multiple nodes
− Count and categorize data; sorting and counting for a time frame, for example Web Analytics
− Event logging applications
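
The column-family idea and the "counting for a time frame" use case can be sketched with ordinary Python dictionaries. This is not the HBase or Cassandra API; the row-key layout (timestamp, then user) and the sample rows are assumptions.

```python
from collections import Counter

# Column family as: row key -> {column name: value}.
# Rows share a key scheme but need not share the same columns.
page_views = {
    "2016-08-17T10:00:01|u1": {"url": "/home", "browser": "Firefox"},
    "2016-08-17T10:00:07|u2": {"url": "/cart", "referrer": "/home"},
    "2016-08-17T11:30:00|u1": {"url": "/home"},
}


def scan(table, start_key, stop_key):
    """Yield rows whose key lies in [start_key, stop_key), in key order."""
    for key in sorted(table):
        if start_key <= key < stop_key:
            yield key, table[key]


# Count page hits per URL for the hour starting at 10:00.
hits = Counter(
    columns["url"]
    for _, columns in scan(page_views, "2016-08-17T10", "2016-08-17T11")
)
```

Putting the timestamp first in the row key is what makes a time-frame query a cheap range scan over sorted keys.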

NoSQL Database types

Page 23: Road Map for Careers in Big Data

Graph Databases – These store networks of objects that are linked using relationship attributes. For example, the objects could be people in a social network who are related as friends, colleagues or strangers. You can use a graph database to map and quickly analyze very complex networks. Examples of graph databases include AllegroGraph and Neo4j.

Use Cases:
− Who is associated with whom
− Fraud detection
− Criminal or terrorist cell identification
− Profile analysis in social media
− Routing and dispatch systems
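
The "who is associated with whom" question comes down to finding a path between two nodes. The sketch below uses a breadth-first search over a small in-memory graph; it does not use Neo4j or any graph database API, and the people and relationships are made up.

```python
from collections import deque

# Hypothetical relationships: (person, person, relationship attribute).
edges = [
    ("Ann", "Bob", "friend"),
    ("Bob", "Cat", "colleague"),
    ("Cat", "Dan", "friend"),
    ("Eve", "Fay", "friend"),
]


def build_graph(edge_list):
    """Build an undirected adjacency map: node -> {neighbour: relationship}."""
    graph = {}
    for a, b, relationship in edge_list:
        graph.setdefault(a, {})[b] = relationship
        graph.setdefault(b, {})[a] = relationship
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the shortest chain of people or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, {}):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


graph = build_graph(edges)
path = shortest_path(graph, "Ann", "Dan")
unrelated = shortest_path(graph, "Ann", "Eve")
```

In a relational schema the same question would need one self-join per hop, which is why graph stores suit this kind of query.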

(public domain image hosted by wikipedia)

NoSQL Database types

Page 24: Road Map for Careers in Big Data

Mich’s Journey: It takes a while to unlearn habits. But do not worry. You need to adjust the way you think about data and modelling. By understanding alternatives you will be able to make more efficient use of your existing knowledge as well. After all, the tool best suited for the job will leave you with the least headache. If you know more tools, you can choose better!

Advice

Page 25: Road Map for Careers in Big Data

As Big Data becomes bigger, companies will increasingly rely on Hadoop and the leading NoSQL databases.

You are not going to see NoSQL databases unseating RDBMSs for established workloads so much as taking control of an increasing share of new workloads.

Big Data in some form and shape is gradually creeping into organisations.

Some NoSQL shape

Whether you are a developer, architect, analyst or Database Administrator, you are going to come across it and get involved in it.

The rise of NoSQL databases

Page 26: Road Map for Careers in Big Data

As Big Data databases find a space, it is becoming important to have the tools to deal with these distributed databases.

Most of these tools can be used against RDBMS as well.

Fibre optics and Gigabit networks have by and large resolved the latency issues.

I will be talking about one star tool in the Big Data landscape called Apache Spark.

The Rise of tools

Page 27: Road Map for Careers in Big Data

Apache Spark is a fast and general purpose engine for large-scale data processing.

It is written mostly in Scala, and provides APIs for Scala, Java, Python and R.

It is fully compatible with the Hadoop Distributed File System (HDFS) and can run in the cloud.

It extends Hadoop’s core functionality by providing in-memory cluster computation, among other things.

It has its own standalone cluster manager and can work with the YARN resource manager as well.

It is heavily used in finance (VaR), health care, fraud detection and streaming data.

Apache Spark

Page 28: Road Map for Careers in Big Data

Spark combines SQL, streaming, and complex analytics.

Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.

Apache Spark

Page 29: Road Map for Careers in Big Data

Spark supports a dialect similar to ANSI SQL, including analytic functions.

You can also write the same code using Scala and functional programming.

This code tells me what I spent over the past six years with my debit card (from bank transactions downloaded in CSV format), for each hashtag (Sainsbury, John Lewis, Waitrose etc.).
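
The slide's Spark SQL code is not part of this transcript. As a stand-in, the following minimal sketch shows the same kind of per-hashtag aggregation using Python's built-in csv and sqlite3 modules rather than Spark. The CSV layout, the amounts, the hashtag list and the helper name `hashtag_for` are all assumptions.

```python
import csv
import io
import sqlite3

# Hypothetical bank export; the columns and values are made up.
CSV_TEXT = """date,description,amount
2016-01-05,SAINSBURY LONDON,25.50
2016-01-09,WAITROSE 123,40.00
2016-02-11,SAINSBURY ONLINE,10.25
2016-03-02,JOHN LEWIS OXFORD ST,120.00
"""

HASHTAGS = ["SAINSBURY", "JOHN LEWIS", "WAITROSE"]


def hashtag_for(description):
    """Map a free-text transaction description to a known hashtag."""
    for tag in HASHTAGS:
        if tag in description.upper():
            return tag
    return "OTHER"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn (hashtag TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO txn VALUES (?, ?)",
    [
        (hashtag_for(row["description"]), float(row["amount"]))
        for row in csv.DictReader(io.StringIO(CSV_TEXT))
    ],
)

# Total spend per hashtag, largest first.
spent = conn.execute(
    """
    SELECT hashtag, ROUND(SUM(amount), 2) AS total
    FROM txn
    GROUP BY hashtag
    ORDER BY total DESC
    """
).fetchall()
```

In Spark the GROUP BY would be distributed across the cluster, but the shape of the query would be much the same.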

Apache Spark

Page 30: Road Map for Careers in Big Data

Apache Spark

I have used Scala and functional programming in Spark-shell to produce the same results below:
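
The Scala code from the slide is not included in this transcript. The sketch below illustrates the functional style in plain Python instead (map plus a fold), not Scala or the Spark API. The CSV content, hashtag list and function names are assumptions for illustration.

```python
import csv
import io
from functools import reduce

# Hypothetical bank export; the columns and values are made up.
CSV_TEXT = """date,description,amount
2016-01-05,SAINSBURY LONDON,25.50
2016-01-09,WAITROSE 123,40.00
2016-02-11,SAINSBURY ONLINE,10.25
2016-03-02,JOHN LEWIS OXFORD ST,120.00
"""

HASHTAGS = ["SAINSBURY", "JOHN LEWIS", "WAITROSE"]


def hashtag_for(description):
    """Map a free-text transaction description to a known hashtag."""
    return next((t for t in HASHTAGS if t in description.upper()), "OTHER")


def add_to_totals(totals, txn):
    """Fold step: return a new totals map with one transaction added."""
    tag, amount = txn
    return {**totals, tag: totals.get(tag, 0.0) + amount}


# map: turn each CSV row into a (hashtag, amount) pair.
transactions = map(
    lambda row: (hashtag_for(row["description"]), float(row["amount"])),
    csv.DictReader(io.StringIO(CSV_TEXT)),
)

# reduce: fold the pairs into per-hashtag totals, then rank by spend.
totals = reduce(add_to_totals, transactions, {})
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

The map and fold steps here correspond loosely to the transformations one would chain on a Spark collection, where they would run in parallel.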

Page 31: Road Map for Careers in Big Data

SQL Programming, Hadoop, Linux, Scala, Java, Python and R are the most in-demand skills in positions that mention big data as a requirement.

In 2015 and 2016 we are seeing an exponential rise in demand for Spark with Scala and R skills used in functional programming and Machine Learning.

Big Data Skills in Demand

Page 32: Road Map for Careers in Big Data

Page 33: Road Map for Careers in Big Data

Page 34: Road Map for Careers in Big Data

Tons of information on the Internet
− Big Data University: http://bigdatauniversity.com
− My articles on LinkedIn: https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2?
− Apache Software Foundation: http://www.apache.org
− Other available sources

Further Information
