Road Map for Careers in Big Data
Architecture Design Series
Mich Talebzadeh
London August 17, 2016
Author
About the Author:
Mich Talebzadeh, Big Data and RDBMS Senior Technical Architect
E-mail: [email protected]
https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2?
https://talebzadehmich.wordpress.com/
All rights reserved. No part of this publication may be reproduced in any form, or by any means, without the prior written permission of the copyright holder.
The Big Picture
Get familiar with Big Data in general
The buzzwords and the emerging new products, expanding at an amazing pace
Gain an understanding of the Big Data space
Relate your existing (often relational) knowledge to Big Data
Build upon your existing knowledge and experience
Discover where Big Data fits
Gain an insight into the Hadoop core
Make it easier to move to Big Data if you wish
© 2016 Mich Talebzadeh Roadmap for Careers in Big Data 4
Trends
Back in 2007, I responded to an article “The Oracle DBMS is a ‘legacy technology.’” in techtarget.com.
I stated that the argument is not so much that relational database management systems “should be considered legacy technology” because they are more than a quarter of a century old, but has more to do with the philosophy and design approach they were created for.
Simply put, today’s systems and users deal much more with non-transactional (read) activity than with transactional (write/update) activity.
Does that mean that RDBMSs are legacy systems? I don’t think so. All it means is that there is a limit to how far one can take the relational model.
So what is the solution in a business world that is growing increasingly distributed and has to deal with the increased use of heterogeneous systems and proprietary databases across different levels of business?
We need to deploy tools that fit the purpose; relying on a one-engine philosophy (say, an RDBMS-only solution) is no longer viable.
Move from internal company data to information from multiple sources
Deal with both transactional and non-transactional data
The RDBMS landscape is changing. The newer versions are adapting themselves to the new landscape.
Shift from data at rest to data in motion
The world is saving more data than before
The ability to collect and analyze such large amounts of data has given rise to Big Data technology
Big Data is a trendy term these days, and if you work in I.T. and don't know about it, you are missing the buzz.
Just as we have tools to collect and analyze transactional data, we need tools to collect and analyze vast quantities of data.
If time is of the essence in the transactional world, it is equally becoming of the essence when dealing with vast amounts of data.
Big Data Meaning and Terminology
Big Data is a term for data sets that are so large or complex that traditional data processing applications are inadequate.
The challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization and querying.
The term often refers to the use of predictive analytics, user behaviour analytics (such as sentiment analysis), or certain other advanced data analytics methods that extract value from vast amounts of data.
Big Data is not only about the volume.
Much of the data is received in real time and is most valuable at the time of arrival.
Examples:
− Detect changes in share prices as soon as possible
− A service operator wants to detect failures from logs within seconds
− A news site may want to train its model to show users content they are interested in
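The log-monitoring example can be sketched as a sliding-window check that looks at each line the moment it arrives. This is only an illustration under stated assumptions: the `FailureDetector` class, the five-second window, the threshold of three errors and the sample log lines are all hypothetical, not taken from the slides.

```python
from collections import deque


class FailureDetector:
    """Raise an alert when too many ERROR lines arrive within a short window."""

    def __init__(self, window_seconds=5.0, threshold=3):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._error_times = deque()  # arrival times of recent ERROR lines

    def observe(self, timestamp, line):
        """Process one log line as it arrives; return True if an alert fires."""
        if "ERROR" not in line:
            return False
        self._error_times.append(timestamp)
        # Drop errors that have fallen out of the sliding window.
        while timestamp - self._error_times[0] > self.window_seconds:
            self._error_times.popleft()
        return len(self._error_times) >= self.threshold


# Hypothetical log stream: (arrival time in seconds, log line).
stream = [
    (0.0, "INFO service started"),
    (1.0, "ERROR disk timeout"),
    (2.5, "ERROR disk timeout"),
    (3.0, "INFO heartbeat"),
    (4.0, "ERROR disk timeout"),
    (20.0, "ERROR disk timeout"),
]

detector = FailureDetector(window_seconds=5.0, threshold=3)
alerts = [t for t, line in stream if detector.observe(t, line)]
print(alerts)
```

The point of the sketch is that the decision is made per arriving line, rather than by a batch job run after the data has come to rest.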
Big Data Benefits
Big Data can lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction and reduced risk.
Analysis of these data sets can find new correlations to "spot business trends, prevent diseases, combat crime and so on".
Some common sources of Big Data are shown below
What sort of Data are we dealing with
The current estimate is that around 20-30% of available data is stored in a structured format and the rest is unstructured.
Unstructured refers to data that either does not have a pre-defined data model or is not organized in a pre-defined manner.
Big Data deployment and usage:
− Primarily built on open source
− Distributed data - data physically distributed across a network using commodity hardware, aka a cluster
− Business Intelligence - structured queries
− Analytics and data mining - for data from multiple sources and of variable types
− Cloud Computing - access to large pools of computing power available as needed
− Networked devices
− Data from sensors - more sensors producing more data more frequently
− The Internet of Things - machine-generated data read and used by other machines
− Deploying both relational and NoSQL (not only SQL) type storage
Big Data Databases
Big Data Fundamentals: Hadoop core consists of
− Hadoop Distributed File System (HDFS) on a cluster
− MapReduce - the default engine to search and retrieve data across the HDFS cluster
− YARN - general-purpose resource manager to manage resources in the cluster
Data Warehouses
− Hive - the most versatile and capable of the many SQL or SQL-like ways of accessing data on Hadoop
− Pig - similar to Hive but supports a procedural-style language (Pig Latin)
NoSQL Databases - In general, a NoSQL database relaxes the rigid requirements of an RDBMS (meaning ACID properties), so one can implement new usage models that require lower latency and higher levels of scalability. NoSQL databases come in four major types:
NoSQL Database types
Key/Value Database - the simplest type of database, which can store any kind of digital object. Each object is assigned a key and is retrieved using the same key. However, analytic capabilities are limited. Examples of key/value databases include Redis, Riak and Oracle NoSQL Database.
− There is no schema for the value returned by the key; it is a heap of data
− Query only by key; you cannot query based on the rest of the content
− All queries return the value as a lump of data
− There is no way to return just part of the value
Use Cases:
− Session information in web applications
− The shopping cart for an online buyer
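The key/value behaviour described above can be sketched with a toy in-memory store. The `KeyValueStore` class, the `cart:u42` key and the shopping-cart layout are hypothetical illustrations and do not reflect the API of Redis, Riak or any other product.

```python
import json


class KeyValueStore:
    """Toy in-memory key/value store: values are opaque bytes."""

    def __init__(self):
        self._data = {}

    def put(self, key, value: bytes):
        self._data[key] = value

    def get(self, key):
        # The only supported lookup is by key; the whole value comes back.
        return self._data.get(key)


store = KeyValueStore()

# Hypothetical shopping cart, serialized by the application before storing.
cart = {"user": "u42", "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}
store.put("cart:u42", json.dumps(cart).encode("utf-8"))

# The store cannot answer "which carts contain sku B7?"; the application
# must fetch the whole value by key and decode it itself.
blob = store.get("cart:u42")
restored = json.loads(blob.decode("utf-8"))
print(restored["items"][1]["sku"])
```

Because the store never looks inside the value, any filtering or partial retrieval has to happen in the application after the full lump of data has been fetched.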
Document Stores - These are a type of key/value store in which documents are stored in recognizable formats and are accompanied by metadata. Because of the consistent formats and the metadata, you can perform search and analysis without first retrieving the documents.
Examples of document stores include MongoDB, MarkLogic and Couchbase.
Brief walk-through
The Relational vs. Document-Oriented Data Model
In the relational model, records are normalised across multiple tables, using primary key and foreign key constraints.
The upside of normalisation is that there should be no duplication of data in the database.
The downside is that a change in a single record can mean holding locks on a number of records in different tables, as defined by the transaction, to ensure the change does not leave the database in an inconsistent state. This complex web of interrelationships is what makes it so difficult to distribute relational data across multiple servers, and it can lead to performance challenges when both reading and writing data.
Big Data is all about scaling out (horizontally). Granted, this may result in duplication of data. However, using more storage in exchange for better application performance and the ability to easily distribute workloads across machines is now a better choice for many applications.
What is a Document Data Model?
Use of the term “document” is a bit confusing. A document-oriented database really has nothing to do with “documents” in the classical sense of the word. It does not mean books, letters or articles. Rather, a document in this case refers to a data record that is self-describing as to the data elements it contains. XML documents, HTML documents and JSON documents are examples of “documents” in this context. An example of a record will look like below
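A minimal sketch of such a self-describing record, assuming a hypothetical error-log document rendered as JSON. Every field name and value here is illustrative and is not taken from the original slide.

```python
import json

# Hypothetical error-log record; every field name here is illustrative.
error_record = {
    "_id": "err-000123",
    "timestamp": "2016-08-17T10:15:30Z",
    "severity": "ERROR",
    "application": {"name": "payments", "version": "2.4.1"},
    "host": {"name": "app-server-07", "datacentre": "London"},
    "message": "Connection to database timed out",
    "tags": ["database", "timeout"],
}

# The field names travel with the data, so the record describes itself.
document = json.dumps(error_record, indent=2)
print(document)

# Any server receiving the text can rebuild the full record from it alone.
rebuilt = json.loads(document)
```

Note that the application and host details are embedded in the record itself instead of being referenced through foreign keys to separate tables.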
As can be seen, the data is de-normalized. Each record contains a complete set of information related to the error without external reference. The records are self-contained. This makes it very easy to move the entire record to another server – all the information simply comes along with it. There is no concern about having parts of the record in other tables left behind.
Only the self-contained record (document) needs to be updated when changes are made. In the NoSQL world the document ID is the one and only key to a document.
Document ID is roughly equivalent to a primary key in a relational database. Usually an ID can only appear once in a database. Different NoSQL solutions have different names for the containers that hold documents: buckets, collections, tables, etc. A Collection is equivalent to a Table and a Document to a Row.
Use Cases:
− Enterprise data
− Event logging applications
− Content management systems
− Web analytics and real-time analytics
− E-Commerce applications
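The collection/document analogy can be sketched with a toy in-memory collection keyed by document ID. The `Collection` class and its `insert`, `update` and `find` methods are hypothetical and do not mirror the API of MongoDB, Couchbase or any other product.

```python
class Collection:
    """Toy document collection: document ID -> self-contained document."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc_id, document):
        # An ID may appear only once, much like a primary key.
        if doc_id in self._docs:
            raise KeyError(f"duplicate document ID: {doc_id}")
        self._docs[doc_id] = dict(document)

    def update(self, doc_id, changes):
        # Only this one document is touched; no other record is involved.
        self._docs[doc_id].update(changes)

    def get(self, doc_id):
        return self._docs[doc_id]

    def find(self, **criteria):
        # Unlike a plain key/value store, fields inside documents are searchable.
        return [
            doc_id
            for doc_id, doc in self._docs.items()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]


events = Collection()  # plays the role of a table
events.insert("e1", {"severity": "ERROR", "app": "payments"})  # a "row"
events.insert("e2", {"severity": "INFO", "app": "payments"})
events.insert("e3", {"severity": "ERROR", "app": "search", "retries": 3})

events.update("e2", {"severity": "ERROR"})
matches = events.find(severity="ERROR", app="payments")
print(matches)
```

Document `e3` carries a `retries` field that the others lack, which a fixed relational schema would not allow without altering the table.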
NoSQL Database types
Columnar Databases - These provide varying degrees of row and column structure. Each row in the column family has a unique key plus as many columns as needed to hold the relevant information. Unlike relational columnar databases like Teradata or SAP Sybase IQ, there is no requirement for every row to have the same columns. You can use a columnar database to perform more advanced queries on big data. Examples of columnar databases include HBase, Cassandra and Accumulo.
Use Cases:
− Distribution of data across multiple nodes
− Count and categorize data; sorting and counting for a time frame, for example Web Analytics
− Event logging applications
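The sparse row-and-column structure can be sketched with nested dictionaries. The row-key layout `<page>#<hour>`, the column names and the figures are hypothetical, and the `put` helper is an illustration, not the client API of HBase or Cassandra.

```python
from collections import Counter, defaultdict

# Column family: row key -> {column name -> value}.
# Rows are sparse: each one holds only the columns it needs.
page_views = defaultdict(dict)


def put(row_key, column, value):
    page_views[row_key][column] = value


# Hypothetical web-analytics rows keyed by "<page>#<hour>".
put("/home#2016-08-17T10", "views", 120)
put("/home#2016-08-17T10", "referrer:google", 80)
put("/home#2016-08-17T11", "views", 95)
put("/pricing#2016-08-17T10", "views", 30)
put("/pricing#2016-08-17T10", "signups", 4)

# Count views per page for a time frame by scanning rows and filtering on hour.
views_per_page = Counter()
for row_key, columns in page_views.items():
    page, hour = row_key.split("#")
    if "2016-08-17T10" <= hour <= "2016-08-17T11":
        views_per_page[page] += columns.get("views", 0)

print(sorted(views_per_page.items()))
```

The `/home` and `/pricing` rows hold different column sets, which is the property that separates this model from a fixed relational schema.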
Graph Databases - These store networks of objects that are linked using relationship attributes. For example, the objects could be people in a social network who are related as friends, colleagues or strangers. You can use a graph database to map and quickly analyze very complex networks. Examples of graph databases include AllegroGraph and Neo4j.
Use Cases:
− Who is associated with whom
− Fraud detection
− Criminal or terrorist cell identification
− Profile analysis in social media
− Routing and dispatch systems
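The "who is associated with whom" use case can be sketched as a breadth-first search over an adjacency list whose edges carry a relationship attribute. The people, relationships and the `associates` function are hypothetical, and a real graph database would expose its own query language for this.

```python
from collections import deque

# Edges carry a relationship attribute, as in a property graph.
# All names and relationships are hypothetical.
edges = [
    ("alice", "bob", "friend"),
    ("bob", "carol", "colleague"),
    ("carol", "dave", "friend"),
    ("erin", "frank", "colleague"),
]

# Build an undirected adjacency list: person -> [(neighbour, relationship)].
graph = {}
for a, b, relationship in edges:
    graph.setdefault(a, []).append((b, relationship))
    graph.setdefault(b, []).append((a, relationship))


def associates(start, max_hops):
    """Breadth-first search: everyone reachable within max_hops links."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if hops[person] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour, _relationship in graph.get(person, []):
            if neighbour not in hops:
                hops[neighbour] = hops[person] + 1
                queue.append(neighbour)
    del hops[start]
    return hops


print(associates("alice", max_hops=2))
```

Raising `max_hops` widens the circle of associates, which is the kind of traversal that becomes expensive as a chain of self-joins in a relational database.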
Advice
Mich’s Journey: It takes a while to unlearn habits, but do not worry. You need to adjust the way you think about data and modelling. By understanding alternatives you will be able to make more efficient use of your existing knowledge as well. After all, the tool best suited for the job will leave you with the least headache. If you know more tools, you can choose better!
The rise of NoSQL databases
As Big Data becomes bigger, companies will increasingly rely on Hadoop and the leading NoSQL databases.
You are not going to see NoSQL databases unseating RDBMSs for established workloads so much as taking control of an increasing share of new workloads.
Big Data in some form or shape is gradually creeping into organisations.
− Some NoSQL shape
Whether you are a developer, architect, analyst or Database Administrator, you are going to come across it and get involved in it.
The Rise of tools
As Big Data databases find a space, it is becoming important to have the tools to deal with these distributed databases.
Most of these tools can be used against an RDBMS as well.
Fibre optics and Gigabit networks have by and large resolved the latency issues.
I will be talking about one star tool in the Big Data landscape called Apache Spark.
Apache Spark
Apache Spark is a fast and general-purpose engine for large-scale data processing.
It is written mostly in Scala, and provides APIs for Scala, Java, Python and R.
It is fully compatible with the Hadoop Distributed File System (HDFS) and can run in the cloud.
It extends Hadoop’s core functionality by providing in-memory cluster computation, among other things.
It has its own Standalone cluster manager and can work with the YARN resource manager as well.
It is heavily used in finance (Value at Risk, VaR), health care, fraud detection and streaming data.
Spark combines SQL, streaming, and complex analytics.
Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
Spark supports a dialect similar to ANSI SQL, including analytic functions.
You can also write the same code using Scala and functional programming.
This code tells me what I spent over the past six years through my debit card at my bank (bank transactions downloaded in CSV format), for each hashtag (Sainsbury, John Lewis, Waitrose etc.).
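As a rough illustration of the kind of SQL involved, here is a sketch that uses Python's built-in sqlite3 module as a stand-in for Spark SQL. It is not the author's code: the CSV layout, the `transactions` table, its column names and the amounts are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical bank export; the real file layout is not given in the slides.
csv_text = """transaction_date,hashtag,debit_amount
2015-01-03,SAINSBURY,42.10
2015-01-09,WAITROSE,18.50
2015-02-11,SAINSBURY,30.00
2016-03-05,JOHN LEWIS,120.00
2016-04-01,WAITROSE,11.50
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (transaction_date TEXT, hashtag TEXT, debit_amount REAL)"
)
rows = [
    (r["transaction_date"], r["hashtag"], float(r["debit_amount"]))
    for r in csv.DictReader(io.StringIO(csv_text))
]
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)

# Total spend per hashtag, largest first.
query = """
    SELECT hashtag, ROUND(SUM(debit_amount), 2) AS total_spent
    FROM transactions
    GROUP BY hashtag
    ORDER BY total_spent DESC
"""
spend_by_hashtag = conn.execute(query).fetchall()
for hashtag, total in spend_by_hashtag:
    print(f"{hashtag:<12}{total:>8.2f}")
```

The GROUP BY query is plain ANSI-style SQL, so the same statement shape should carry over to a SQL-on-Hadoop engine once the CSV has been loaded as a table.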
I have used Scala and functional programming in Spark-shell to produce the same results below:
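The functional style can be illustrated in plain Python with filter, reduce and sorted. This is a stand-in sketch, not the author's spark-shell code, and it does not use the Spark API; the transaction pairs and the `add_to_totals` helper are hypothetical.

```python
from functools import reduce

# Hypothetical transactions as (hashtag, debit_amount) pairs.
transactions = [
    ("SAINSBURY", 42.10),
    ("WAITROSE", 18.50),
    ("SAINSBURY", 30.00),
    ("JOHN LEWIS", 120.00),
    ("WAITROSE", 11.50),
]


def add_to_totals(totals, pair):
    """Reduce step: return a new totals dict with one more pair folded in."""
    hashtag, amount = pair
    return {**totals, hashtag: totals.get(hashtag, 0.0) + amount}


# filter -> reduce by key -> sort, with no mutation of shared state.
debits = filter(lambda pair: pair[1] > 0, transactions)
totals = reduce(add_to_totals, debits, {})
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)

for hashtag, total in ranked:
    print(f"{hashtag:<12}{total:>8.2f}")
```

Each step takes a collection and returns a new one, which is the property that lets an engine like Spark split the same pipeline across many machines.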
Big Data Skills in Demand
SQL programming, Hadoop, Linux, Scala, Java, Python and R are the most in-demand skills in positions that mention Big Data as a requirement.
In 2015 and 2016 we have seen an exponential rise in demand for Spark with Scala and R skills, used in functional programming and machine learning.
Further Information
Tons of information on the Internet
Big Data University
− http://bigdatauniversity.com
My articles on LinkedIn
− https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2?
Apache Software Foundation
− http://www.apache.org
Other available sources