Introduction to Reporting and Analysis on Hadoop - Transcript


Welcome, all, to this module of BigDataUniversity. My name is Ben Connors, and I am the Worldwide Head of Alliances for Jaspersoft; we are a Business Intelligence software provider and a close partner of BigDataUniversity. We will be providing this Reporting and Analysis on Hadoop course module for BigDataUniversity. Jaspersoft has integrated our software with a wide variety of SQL and NoSQL data sources, including multiple integrations with Hadoop.

In this lesson I will explain the need for reporting and analysis on Hadoop, and I will discuss several architectural approaches to address various use cases for Hadoop reporting and analysis. Let's look briefly through the agenda for this lesson. As I proceed through this lesson, we will discuss how the Hadoop platform is powerful, but by itself is not sufficient to provide fast and meaningful insights on big data stored in Hadoop. Next we will see different approaches to Big Data reporting and analysis, along with various data access technologies. And finally I will give a quick overview of how a Business Intelligence (or BI) suite can operate using Hadoop. Throughout the course you will engage in hands-on lab exercises. You will also get free BI and Hadoop software to use during the course, and the software is even yours to keep and use for free after you complete the course.

Well, let's review the defining characteristics of Big Data. In particular, let's use Hadoop for our discussion. As you recall, massive Volume, Velocity, and Variety, or "the Three Vs," are the defining properties of Big Data. The Hadoop framework addresses all of these requirements. It provides a framework to scale out horizontally to arbitrarily large data sets to address Volume; it handles a ferocious rate of incoming data from very large systems for Velocity; and it supports complex jobs to handle any Variety of unstructured data. But despite this ability to manage the Three Vs, the big challenge remains in figuring out how to analyze the data and find insights in this ocean of data. Hadoop does embrace the Three Vs, but the actual value of big data lies in deriving insights from these huge data sets to enable key decisions. So there needs to be an easy way to leverage Hadoop's power to further crunch through these data and make them more presentable and digestible. That's what drives reporting and analysis on Hadoop.

Reporting and analysis on Hadoop addresses the need to derive meaningful insights quickly. So with reporting and analysis on Hadoop we can make Big Data small, to facilitate insights from the sea of data; reduce the time to unlock the business value of Big Data; and simplify the representation of complex insights. Also, BI suites can open up possibilities of integrating a variety of external data sources, as we'll examine later.

So, given the need for churning through the data and then putting it into the hands of business users who can take action, let's have a look at how much of this process is actually handled by Hadoop. What we see here is the traditional data funnel; we have adapted it to the newer technologies used in the Big Data world.


We start with the data production phase. Most of the big data is generated by servers in the form of logs, transaction records, and web applications such as Web 2.0 and 3.0 applications, as well as sensor data and enterprise data from applications like CRM, ERP, and billing systems, and so forth. The data produced in enterprises today constitutes structured, unstructured, and multi-structured data, but it is all growing at astronomical rates.

The next phase in the data funnel pertains to the storage and manipulation of these herculean datasets. The users at this level of the data funnel are primarily data architects who design the enterprise-wide systems. The technology options for data architects at this level include Hadoop's distributed file system with MapReduce, HBase used as a data store, and the Hadoop data warehouse, Hive.

Further down the data funnel is Data Analysis and Data Visualization. This is the phase where the big data is crunched through to make it available to various enterprise user groups, which helps them make better decisions. You may find data analysts and data scientists working at this level. The typical job functions for users at this level include creating cubes with facts and dimensions to perform Online Analytical Processing (OLAP) analysis, and visually representing data in a way that helps in spotting new trends and patterns. Not surprisingly, the technologies or tools for this functionality in the Hadoop platform are not really sophisticated enough to support the job functions required at this level.

The final layer of the data funnel is where the decision makers look for interactive, contextual reports, charts, and dashboards, based on which they can make important operational and strategic decisions. We notice that there isn't much support for the Data Visualization/Viewing and Data Delivery layers in the existing Hadoop platform. So as you can see, this leaves Hadoop incomplete. Perhaps with the addition of a business intelligence suite that can leverage Hadoop we can provide the much-needed reporting and analysis on Hadoop.
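The cube-building job function mentioned above, creating cubes with facts and dimensions for OLAP analysis, can be illustrated with a small sketch. This is not a Hadoop or OLAP-engine API; it is a minimal, self-contained Python illustration of the idea, in which a fact table is aggregated over chosen dimensions and then sliced. All field names and figures here are invented for the example.

```python
from collections import defaultdict

# A tiny "fact table": each row has two dimensions (region, product)
# and one measure (amount). In a real deployment these rows would come
# from a warehouse such as Hive.
facts = [
    {"region": "East", "product": "widget", "amount": 120},
    {"region": "East", "product": "gadget", "amount": 80},
    {"region": "West", "product": "widget", "amount": 200},
    {"region": "West", "product": "widget", "amount": 50},
]

def build_cube(rows, dimensions, measure):
    """Aggregate the measure over every combination of dimension values."""
    cube = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        cube[key] += row[measure]
    return dict(cube)

def slice_cube(cube, dim_index, value):
    """'Slice' the cube: keep only cells where one dimension is fixed."""
    return {k: v for k, v in cube.items() if k[dim_index] == value}

cube = build_cube(facts, ["region", "product"], "amount")
print(cube)
# {('East', 'widget'): 120, ('East', 'gadget'): 80, ('West', 'widget'): 250}

print(slice_cube(cube, 0, "West"))
# {('West', 'widget'): 250}
```

A real OLAP engine adds roll-up, drill-down, and storage optimizations on top of exactly this kind of dimensional aggregation.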
That makes it an end-to-end, enterprise-ready solution.

Well, let's discuss the various approaches to Big Data analysis, based on the various use cases they can address. Here are three latency-based approaches, or use cases, to draw insights from Big Data and Hadoop.

The first approach is Live Exploration. An example of this approach could be a financial firm that is trying to hedge its investments and needs to look for real-time trends to make its decisions on financial market trading. That would mean discovering real-time patterns as they emerge from their Big Data content. So this use case has the least latency of the three approaches. Live Exploration can be supported by a UI controlled by an in-memory engine designed to reduce any delays or latency.

The next approach is Direct Batch Reporting. This use case describes scenarios where executives and operational managers look for summarized, pre-built, periodic reports on Big Data content. The reports generated here are typically of medium latency, and this approach is targeted at a user group of executives and operational managers who might want to get reports on, for example, which web content was most popular over the last several days.

Lastly, Indirect Batch Analysis. These are situations where business users would want to analyze historical trends, like perhaps customer buying patterns by region and demographics, or maybe huge environmental data sets to determine the placement of wind turbine installations for green energy. These analyses are mostly based upon slicing-and-dicing questions on the Big Data content. The business users here could be data analysts and operational managers.

OK. Having discussed the various use cases for Big Data analysis, let's look at the various technologies that can be put together to implement the individual approaches to big data (the ones we discussed in the previous slide).

As you recall, the first approach was Live Exploration of big data, that is, discovering real-time patterns as they emerge from the Big Data content. So an implementation stack for this approach may include native connectors to draw data from HBase or the Hadoop distributed file system, and a user interface controlled by an in-memory engine for faster access to the insights.

The second approach was Direct Batch Reporting, where the user views summarized, pre-built, periodic reports on Big Data content. The typical implementation stack here includes Hive connectors, via which a report design tool, like Jaspersoft's iReport for example, can draw upon data to design, generate, and schedule reports.

Finally, Indirect Batch Analysis. The expectation here is to analyze historical trends based on exploratory questions on the historical Big Data content. So the data here is brought, using Extract-Transform-Load (ETL) technology, into either an analytical relational database or an Online Analytical Processing (OLAP) engine that can be used to gain insights into the historical trends.

The three approaches are latency-based, with Live Exploration taking the least time, typically on the order of seconds. Direct Batch Reporting has medium latency, where the reports can perhaps be run in a few minutes. And finally, Indirect Batch Analysis has the highest latency among the three use cases, where a data load and analysis could end up taking hours or even running overnight.

This is an architectural diagram representing our previous discussion on the various data access technologies that can make BI reporting and analysis operate seamlessly on Hadoop. In the following lessons we will look at this in detail and use the Jaspersoft Business Intelligence Suite to illustrate the concepts. You will note here that Jaspersoft uses ETL, HiveQL (SQL), and also a connector to HBase. So ETL goes to a relational database, and Hive uses a SQL-like language. But, unlike Hive, there is no HBase query language; rather, data is queried using one of HBase's APIs: either the Java API directly, or through the Thrift or REST server interfaces. We will discuss all of this in detail in later lessons.
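To make the three access routes above a little more concrete, here is a hedged sketch. A real deployment would use an ETL tool, a Hive connection issuing HiveQL, and the HBase Java, Thrift, or REST APIs; as a self-contained stand-in, this Python example uses the standard library's sqlite3 as the relational database, plain SQL in the spirit of HiveQL, and a dictionary keyed by row key to mimic HBase-style API access. All table, column, and row-key names are invented for illustration.

```python
import sqlite3

# --- Route 1: ETL into an analytical relational database ------------
# Extract raw "log" records, transform them, and load them into a
# relational table (sqlite3 stands in for an analytical database).
raw_logs = ["2019-07-01,page_a,3", "2019-07-01,page_b,5", "2019-07-02,page_a,7"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (day TEXT, page TEXT, views INTEGER)")
for line in raw_logs:
    day, page, views = line.split(",")  # transform: parse each record
    conn.execute("INSERT INTO page_views VALUES (?, ?, ?)",
                 (day, page, int(views)))

# --- Route 2: a SQL-like query, in the spirit of HiveQL -------------
# HiveQL closely resembles SQL, so a batch report might issue an
# aggregate query like this (run here against sqlite3, not Hive).
report = conn.execute(
    "SELECT page, SUM(views) FROM page_views GROUP BY page ORDER BY page"
).fetchall()
print(report)  # [('page_a', 10), ('page_b', 5)]

# --- Route 3: HBase-style access: no query language, only API calls -
# HBase is read through API calls (get/scan by row key), mimicked here
# with a dictionary; a real client would go through the Java API or a
# Thrift/REST gateway.
hbase_like_store = {
    "row1": {"cf:page": "page_a", "cf:views": "3"},
    "row2": {"cf:page": "page_b", "cf:views": "5"},
}

def get(table, row_key):
    """A get() in the style of an HBase client API."""
    return table.get(row_key)

print(get(hbase_like_store, "row1"))
# {'cf:page': 'page_a', 'cf:views': '3'}
```

The contrast the lesson draws is visible in the shapes of routes 2 and 3: the SQL route expresses the question declaratively, while the HBase route retrieves cells by key and leaves any aggregation to the application.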


Let's review for a moment what a Business Intelligence Suite typically comprises. This will provide some context for our upcoming lessons. The capabilities of a BI suite include: Reporting, which is designing interactive, pixel-perfect and/or ad hoc reports for the web, the printer, or a mobile device; Dashboards, where you'd build multi-report dashboards with internal or external data for executives and knowledge workers; Analysis, where you will explore data with powerful relational OLAP or in-memory analysis against any data source; and Data Integration, where you will build data marts or warehouses from several disparate relational or non-relational data sources. Later in this course, we will examine each in more detail.

So here is a preview of the upcoming lessons. For each use-case approach, we will discuss in more depth what it is, the importance of the approach, and guidelines to help you decide when to use the method, along with a demonstration of creating reports and analysis with it, including a hands-on lab where you can create your own and gain experience.

At the completion of this BigDataUniversity course, you will have an understanding of the various ways to create reports and analytics on Hadoop, as well as hands-on experience in using a BI tool. Jaspersoft, the world's most widely installed business intelligence software, will be used as the example environment, together with IBM BigInsights for the Hadoop component. Both products are free to use not only during this training, but also for your continued use after you complete this course.

We hope that you find this training to be enjoyable and useful, and we welcome your comments. When you are ready, please proceed to the next lesson. My colleagues are waiting there for you. THANK YOU for your interest!
