OBIEE 12c Advanced Analytic Functions

www.redpillanalytics.com

Abstract

Oracle Business Intelligence Enterprise Edition 12c has enhanced analytical capabilities due to an (optional) integration with the statistical software R. These new functions include the following: Trendline, Bin and Width Bucket, Forecast, Clustering, Outlier and Regression. This document will provide a comprehensive review of these newly available functions, and provide examples of them in action. For ease of understanding and reproducibility, the sample data set is Oracle’s Sample Sales Lite (see footnote 1).

1 This data set is available with every install of OBIEE 12c. Alternatively, a similar set can be found within the Oracle BI SampleApp Virtual Machine.


The Trendline Function

Trendline is part of the Advanced Analytics Internal Logical SQL Functions, meaning it belongs to the group of functions that are computed internally rather than in R. The function fits a linear or exponential model and returns either the fitted values or the model itself. The numeric_expr represents the Y value for the trend, and the series (time columns) represents the X value. A trendline is a model: it asserts that the data can be described by a simple function of the ordered series. The TRENDLINE function measures data across time and draws a line through a metric by ordered records, using either linear or exponential regression.

Figure 1: The Trendline function is found under the Aggregate folder by clicking on the “Insert Function” button in the Formula section of the column editor.


Trendline Function Syntax

TRENDLINE( <numeric_expr>, ( [<series>] ) BY ( [<partitionBy>] ), <model_type>, <result_type>, [number_of_degrees] )

Where:

o numeric_expr—represents the data to trend

▪ This is the Y-Axis and is a measure column.

o series—indicates the X-axis. This is a list of <valueExp> <orderByDirection>, where <valueExp> is a dimension column and <orderByDirection> is ASC (ascending) or DESC (descending).

▪ The default is ASC. Note that this cannot be an arbitrary combination of numeric columns.

▪ It is possible to use more than one Trendline column in the same analysis, but the Trendline columns must have the same X-Axis.

o partitionBy—A list of dimension attributes that are not on the X-Axis.

o model_type— A model type may be one of the following types:

▪ LINEAR—a function with a constant rate of change and a straight line graph.

▪ EXPONENTIAL—a function in which a constant base is raised to the power of the variable, so the value changes at a constant percentage rate.

o result_type— A results type may be one of the following types:

▪ VALUE - returns the fitted Y value of the regression for each X in the fit.

▪ MODEL - returns all the model parameters as a string in JSON (JavaScript Object Notation, a lightweight data-interchange format).

Figure 2: Example formula to display result_type of MODEL.


Figure 3: Results of using ‘MODEL’ as the result type; it returns the parameters in a JSON (JavaScript Object Notation) format string.

Example Syntax

TRENDLINE("Base Facts"."Revenue", ("Time"."Calendar Date"), 'LINEAR', 'VALUE')
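To make the two model types and the two result types concrete, here is a minimal Python sketch of what a trendline fit involves. It is not how OBIEE computes TRENDLINE internally: the function names, the use of the row position (1..n) as the X value, and the JSON key names are all assumptions made for illustration.

```python
import json
import math


def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def trendline(ys, model_type="LINEAR", result_type="VALUE"):
    """Fit a trend to ys ordered along the series; x is the row position 1..n."""
    xs = list(range(1, len(ys) + 1))
    if model_type == "EXPONENTIAL":
        # Fit log(y) linearly, which gives y = a * b**x (requires every y > 0).
        log_a, log_b = fit_linear(xs, [math.log(y) for y in ys])
        a, b = math.exp(log_a), math.exp(log_b)
        fitted = [a * b ** x for x in xs]
        params = {"a": a, "b": b}  # hypothetical key names
    else:
        intercept, slope = fit_linear(xs, ys)
        fitted = [intercept + slope * x for x in xs]
        params = {"intercept": intercept, "slope": slope}  # hypothetical key names
    if result_type == "MODEL":
        return json.dumps(params)  # parameters as a JSON string
    return fitted                  # one fitted Y value per X


revenue = [10.0, 12.0, 15.0, 13.0, 18.0]  # made-up measure values
print(trendline(revenue))
print(trendline(revenue, result_type="MODEL"))
```

With 'VALUE' the result has one fitted number per row, which is what gets plotted as the trend line; with 'MODEL' the same fit is reported once as a parameter string.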

Figure 4: Selected dimensions and fact columns for a sample trendline analysis.

Figure 5: Note the Trendline (in green); depicting these types of subtle changes is what this function is best at.


Figure 6: If the graph is set to vary color by ‘Per Name Year’, the results are displayed for each year. Note the differences between each year that otherwise would not be apparent.

Figure 7: Segmentation of the trends could continue to smaller subsets. Above, the 2009 data has been split by semester.


The BIN and WIDTH_BUCKET Functions

Both BIN and WIDTH_BUCKET are included in the Advanced Analytics Internal Logical SQL Functions, meaning they belong to the group of functions that are computed internally rather than in R. The syntax of the two functions differs, however, and each is covered in its own section below.

About BIN

In the BIN function, the user can select any numeric attribute (INT, FLOAT, DOUBLE, NUMERIC) from a dimension or fact table/measure containing the data values and place the values into a discrete number of bins. The reason to bin a measure is to separate its results into groups (see BIN syntax). An example would be store sales, binning the revenue into less than $200, between $200 and $500, and so on. Each sale is then placed into the group whose criteria its revenue meets. The BIN function classifies a given numeric expression into a specified number of equal-width buckets. The function can return either the bin number or one of the two end points of the bin interval. The output of the BIN function is used as a GROUP BY expression for other measures included in the query. The BIN function is treated like a new dimension attribute for purposes such as aggregation, filtering, and drilling. All of these operations are supported on BIN expressions.

BIN Syntax

BIN(numeric_expr [BY grain_expr1, …, grain_exprN] [WHERE condition] INTO number_of_bins BINS [BETWEEN min_value AND max_value] [RETURNING { NUMBER | RANGE_LOW | RANGE_HIGH }])

Where:

o numeric_expr—indicates the measure or numeric attribute to bin

o BY grain_expr1, …, grain_exprN—indicates a list of expressions that define the grain at which the numeric_expr is calculated before the numeric values are assigned to bins.

▪ This clause is required for measure expressions and is optional for attribute expressions

▪ The BY clause of the BIN function defines the grain at which the binned expression is evaluated prior to binning.

• If the binned expression is a measure, then the BY clause is mandatory, and the measure is grouped at the grain specified in the BY clause before being binned.

• Otherwise, for non-measure expressions, the BY clause is optional.

o WHERE condition—indicates a filter condition to apply to the numeric_expr before the numeric values are assigned to bins

o INTO number_of_bins—indicates the number of bins to return. The default is 10.

o BETWEEN min_value AND max_value—indicates the minimum and maximum values used for the end points of the outermost bins

o RETURNING—indicates which value the function returns for each bin. Note the following options:

▪ RETURNING NUMBER—indicates the return value should be the bin number (for example: 1,2,3,4). This is the default condition

▪ RETURNING RANGE_LOW—indicates the lower value of the bin interval

▪ RETURNING RANGE_HIGH—indicates the higher value of the bin interval
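The equal-width classification and the three RETURNING options described above can be sketched in a few lines of Python. This is an illustration only, not OBIEE's implementation: the function name is hypothetical, and two behaviors are assumptions (a value exactly on a boundary goes to the higher bin, and values outside the BETWEEN range are clamped into the outermost bins).

```python
def equal_width_bin(value, num_bins, min_value, max_value, returning="NUMBER"):
    """Classify value into one of num_bins equal-width bins over [min_value, max_value]."""
    if max_value <= min_value:
        raise ValueError("max_value must be greater than min_value")
    width = (max_value - min_value) / num_bins
    index = int((value - min_value) / width)
    # Assumption: clamp so that out-of-range values land in the outermost bins.
    index = max(0, min(index, num_bins - 1))
    low = min_value + index * width
    if returning == "RANGE_LOW":
        return low            # lower end point of the bin interval
    if returning == "RANGE_HIGH":
        return low + width    # higher end point of the bin interval
    return index + 1          # bin number, starting at 1


# Made-up revenue figure, 4 bins between 0 and 60,000 (width 15,000 each).
print(equal_width_bin(20000, 4, 0, 60000))                # bin number
print(equal_width_bin(20000, 4, 0, 60000, "RANGE_LOW"))   # lower end point
print(equal_width_bin(20000, 4, 0, 60000, "RANGE_HIGH"))  # higher end point
```

All three return types describe the same bin; RANGE_LOW and RANGE_HIGH are useful when the bin should be labeled by its interval rather than by its ordinal number.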

Figure 8: The Bin Function is found under the Aggregate folder in the column formula editor.


About Width Buckets

The WIDTH_BUCKET function is known as a “secret function”, meaning it is not available in the function menu, but the user can type the formula to use it. The syntax of WIDTH_BUCKET is also comma-based, which is not consistent with most Advanced Analytics functions in OBIEE. Similar to BIN, WIDTH_BUCKET classifies a given numeric expression into a specified number of equal-width buckets. It operates on top of a base query result set as a display function. The function can return either the bucket number or one of the two end points of the bucket interval. Unlike the BIN function, the WIDTH_BUCKET function is not treated as a new dimension attribute for the purposes of aggregation. It is applied on top of the query result, similar to other display functions such as RANK, TOPN, BOTTOMN, NTILE, PERCENTILE, MAVG, and MEDIAN. Use the WIDTH_BUCKET function when you want to compute a discrete set of buckets on top of an already aggregated query result set. The syntax for WIDTH_BUCKET is much simpler than that of the BIN function.

WIDTH_BUCKET Syntax

WIDTH_BUCKET(numeric_expr, {NUMBER | RANGE_LOW | RANGE_HIGH }, number_of_bins, [min_value, max_value] [BY expr1, …, exprN])

Where:

o numeric_expr—indicates the measure or numeric attribute to bin

o NUMBER—indicates that the return value should be the bin number (ex: 1,2,3,4).

o RANGE_LOW—indicates the lower value of the bin interval

o RANGE_HIGH—indicates the higher value of the bin interval

o number_of_bins—indicates the number of bins to return. The default is 10.

o min_value, max_value—indicates the minimum and maximum values used for the end points of the outermost bins. If the min_value and max_value conditions are omitted, then the function determines the end points automatically.

o BY expr1, …, exprN—indicates an optional list of expressions that define the groups in the query result set over which the WIDTH_BUCKET calculation is applied. The bucket intervals within different groups are calculated independently.

▪ The BY clause is always optional in the WIDTH_BUCKET function.

• If the BY clause is omitted from the WIDTH_BUCKET function, then the function operates over the entire result set.

BIN and WIDTH BUCKET: Defining Grouping

The goal of both functions is to define the bin/bucket that each data entry belongs to. This is accomplished by specifying:

o The column on which the binning should be done (that is, the binned expression).

§ Remember, this is a numeric expression (and usually a measure).

o The attributes by which the data should be arranged.

§ Remember, the BY clause does not have the same meaning in both functions!

o The number of bins/buckets and the type of data returned.

§ Remember, it is one of three options: the bin or bucket number, its minimum point, or its maximum point.

o The WHERE condition, an option found in the BIN function.

BIN and WIDTH_BUCKET Function Example

The dimensions and measures being used for this example are:

• LOB

• Per Name Month

• Revenue

• BIN Formula: BIN("Base Facts"."Revenue" BY "Products"."LOB", "Time"."Per Name Month" into 4 bins)

• WIDTH_BUCKET Formula: WIDTH_BUCKET("Base Facts"."Revenue", NUMBER, 4)

o (Define the number of bins for each to be the same or there will be an error)


Figure 9: Above are the results of the binning and buckets of revenue. The table shows that it is binning the monthly revenue of the LOB in columns “BIN” and “WIDTH_BUCKET” in bins of 1-4. It is sorting or binning the revenue into specific numbered groups.

Figure 10: A linear graph where Bin #1 contains the month and year when the revenue was less than $15,000.


Figure 11: A linear graph where Bin #2 contains the month and year when the revenue was between $15,000 and $30,000.

Figure 12: A linear graph where Bin #3 contains the month and year when the revenue was between $30,000 and $45,000.


Figure 13: A linear graph where Bin #4 contains the month and year when the revenue was greater than $45,000.

Be careful when using the BY clause in BOTH functions: BY has a different meaning in each, so the two results will not match.

• BIN: BIN("Base Facts"."Revenue" BY "Time"."Per Name Month" into 4 bins)

• The meaning of BY "Month" in BIN is: take sum("Revenue" by "Month") and arrange the monthly sums in 4 bins. Rows of the same month therefore have the same BIN result.

• WIDTH_BUCKET: WIDTH_BUCKET("Base Facts"."Revenue", NUMBER, 4 by "Time"."Per Name Month")

• The meaning of BY "Month" in WIDTH_BUCKET is: take the individual rows of data in each month and arrange them in 4 buckets.
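The difference between the two BY clauses is easier to see on a tiny data set. The following Python sketch mimics the two interpretations described above; the helper names and the sample rows are made up, and the end points are taken from the data because no explicit minimum and maximum are given.

```python
from collections import defaultdict


def bucket(value, low, high, num_buckets):
    """Equal-width bucket number (1-based) of value within [low, high]."""
    if high == low:
        return 1
    index = int((value - low) / (high - low) * num_buckets)
    return min(index, num_buckets - 1) + 1


def bin_by_month(rows, num_bins):
    """BIN ... BY month: sum revenue per month first, then bin the monthly sums."""
    totals = defaultdict(float)
    for month, _lob, revenue in rows:
        totals[month] += revenue
    low, high = min(totals.values()), max(totals.values())
    return [bucket(totals[month], low, high, num_bins) for month, _lob, _rev in rows]


def width_bucket_by_month(rows, num_buckets):
    """WIDTH_BUCKET ... BY month: bucket individual rows within each month."""
    groups = defaultdict(list)
    for month, _lob, revenue in rows:
        groups[month].append(revenue)
    return [
        bucket(revenue, min(groups[month]), max(groups[month]), num_buckets)
        for month, _lob, revenue in rows
    ]


# Hypothetical rows: (month, line of business, revenue)
rows = [("Jan", "A", 10), ("Jan", "B", 30), ("Feb", "A", 50), ("Feb", "B", 70)]
print(bin_by_month(rows, 4))           # rows of the same month share one bin
print(width_bucket_by_month(rows, 4))  # rows are ranked inside their own month
```

In the first result both January rows receive the same number because the month as a whole is binned; in the second, each month is its own group and its rows are spread across that group's buckets.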


Figure 14: The Bin and Width Bucket do not match due to both functions using the BY clause.

Using the WHERE Option in the BIN Function

Figure 15: BIN Function Criteria edited to include the WHERE option.

BIN Formula: BIN("Base Facts"."Revenue" BY "Products"."Product Type","Time"."Per Name Month" where "Time"."Per Name Year"='2010' into 4 bins)


The Forecast Function

A Forecast creates a time-series model of the specified measure over the series using either Exponential Smoothing or ARIMA (autoregressive integrated moving average). The function outputs a forecast for the number of periods specified by numPeriods. Forecasting is a very useful tool for predictive analytics: it reveals potential trends for different dimensions and measures.
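As a rough illustration of the exponential smoothing idea, the Python sketch below implements simple exponential smoothing, the most basic member of that family (no trend, no seasonality). It is not the model OBIEE or R actually fits. The function name, the fixed smoothing weight alpha, and the interval formula (the one for the simplest additive-error model, under a normality assumption) are all assumptions for illustration.

```python
from statistics import NormalDist


def ses_forecast(series, num_periods, alpha=0.5, prediction_interval=70):
    """Simple exponential smoothing with a normal-theory prediction band.

    series needs at least two values; prediction_interval must be strictly
    between 0 and 100.
    """
    level = series[0]
    errors = []
    for value in series[1:]:
        errors.append(value - level)                   # one-step-ahead error
        level = alpha * value + (1 - alpha) * level    # update the smoothed level
    sigma = (sum(e * e for e in errors) / len(errors)) ** 0.5
    z = NormalDist().inv_cdf(0.5 + prediction_interval / 200)
    rows = []
    for h in range(1, num_periods + 1):
        # The band widens as the forecast horizon h grows.
        spread = z * sigma * (1 + (h - 1) * alpha ** 2) ** 0.5
        rows.append({"forecast": level, "low": level - spread, "high": level + spread})
    return rows


monthly_revenue = [10.0, 20.0, 10.0, 20.0]  # made-up history
for row in ses_forecast(monthly_revenue, num_periods=3):
    print(row)
```

The three keys in each row correspond to the idea of a point forecast with a lower and a higher bound at a chosen confidence level; a higher prediction_interval produces a wider band.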

Forecast Syntax

Figure 16: The Forecast function can be found under the “Time Series Calculations” folder within the column formula editor.


FORECAST(numeric_expr, ([series]), output_column_name, options, [runtime_binded_options])

Where:

o numeric_expr —indicates the measure to forecast.

o series —indicates the time grain at which the forecast model is built. This is a list of one or more time dimension columns.

▪ If you omit series, then the time grain is determined from the query.

▪ The series must fit the date columns in the Analysis.

o output_column_name —indicates the output column. Valid values are ‘forecast’, ‘low’, ‘high’, and ‘predictionInterval.’

▪ forecast —This column is the forecasted output

▪ low —This column is the forecasted lower bound number

▪ high —This column is the forecasted higher bound number

• The upper and lower limits of the prediction at the given confidence level might be important.

▪ predictionInterval —This is an available option that is the confidence for the prediction.

• The predictionInterval ranges from 0 to 100, where higher values specify a higher confidence.

o options —indicates a string list of name/value pairs separated by a semi-colon.

▪ The value can include %1…%N, which can be specified in runtime_binded_options.

▪ View the table below for the available options

o runtime_binded_options—indicates a comma separated list of runtime-binded columns and options


Forecast also has many available options that can be used with the function. Below is a list of the options (value type in parentheses):

• numPeriods —The number of periods to forecast (integer)

• predictionInterval —The confidence for the prediction (0 to 100, where higher values specify higher confidence)

• modelType —The model to use for forecasting (ARIMA or ETS). ARIMA is an Autoregressive Integrated Moving Average model, fitted to time series data either to better understand the data or to predict future points in the series. ETS (Error, Trend, Seasonal) is an exponential smoothing state space model that is applied to the ‘y’.

• useBoxCox —If TRUE, then use the Box-Cox transformation, a method used to normalize a data set so that statistical tests can be performed to evaluate it properly. Many real-world raw data sets do not conform to the normality assumptions used in statistics, so transformation functions can sometimes be used to normalize the data. (TRUE, FALSE)

• lambdaValue —The Box-Cox transformation parameter. Ignored if NULL or when useBoxCox is FALSE. Otherwise the data is transformed before the model is estimated.

• trendDamp —This is a parameter for the ETS (Error, Trend, Seasonal) model. If TRUE, then use a damped trend. If NULL, then try both damped and non-damped trends and choose the one that is optimal.

• errorType —This is a parameter for the ETS model. (additive (“A”), multiplicative (“M”), automatically selected (“Z”))

• trendType —This is a parameter for the ETS model. (none (“N”), additive (“A”), multiplicative (“M”), automatically selected (“Z”))

• seasonType —This is a parameter for the ETS model. (none (“N”), additive (“A”), multiplicative (“M”), automatically selected (“Z”))

• modelParamIC —The information criterion (IC) to be used in model selection. (“ic_auto”, “ic_aicc”, “ic_bic”; “ic_auto” is the default)


Figure 17: “Per Name Year” has been filtered to be “equal to/ is in” ‘2008’ to allow forecasting for ‘2009’.

Forecast Example

The formula used in the FORECAST column is as follows:

FORECAST("Base Facts"."Revenue", ("Time"."Per Name Year", "Time"."Per Name Month"),'forecast','modelType=arima;numPeriods=%1;predictionInterval=70;', 12)

Figure 18: Forecast for 2009 based on 2008 data.


The Clustering Function

This function groups a set of records into clusters based on one or more input expressions using K-Means or Hierarchical Clustering, the two modes of clustering analysis available in the advanced analytics clustering model provided in 12c.

K-Means: Given a set of observations (x1, x2, …, xn), k-means clustering attempts to partition them into a user-specified number of clusters (k) so as to minimize the sum of the distances of each individual point from the center of its cluster. This allows for an overview of similarities along the given dimensions.

Hierarchical Clustering: Generally, this form of clustering is an attempt to build a sort of pecking order in which the data filters down into distinct groups along the prompted dimensions. Hierarchical clustering can be thought of as a sort of “top-down” approach of structuring an overview for viewing contextual differences/similarities amongst user-defined dimensions.

Syntax for Clustering Analysis:

CLUSTER( (dimension_expr), (expr), output_column_name, options, [runtime_binded_options])

Where:

• dimension_expr— represents a list of dimensions to be clustered (K).

• expr— represents a list of dimension attributes or measures to be used (x1, x2, …, xn) to cluster the dimension_expr (K)

• output_column_name— is the output to be printed in the column header; this portion of the syntax is only part of the aesthetic interaction in the platform and does not perform any analytics. The valid values include:

o clusterID – This column is the cluster number or ID.

o clusterName – This column is synonymous with clusterID.

o clusterDescription – The description can be added by the end user after the cluster dataset is persisted into DSS.

o clusterSize – This column is the number of elements in the current cluster.

o distanceFromCenter – This column indicates how far the current cluster element is from the center of the current cluster.

o centers – This column indicates the center of the current cluster in a format

• options — is a string list of name=value pairs separated by ';'. The value can include %1 ... %N, which can be specified using runtime_binded_options.

• runtime_binded_options — indicates a comma separated list of binded columns or literal expressions that supply a specification to an unrepresented value in the options list.


This portion of the syntax is optional. It merely supplies values for option parameters that have been left as placeholders. For example, in the clustering analysis, you might have the options numClusters=%1 and maxIter=%2. Suppose you want 5 clusters and a maximum of 10 iterations for this particular analysis. Your runtime_binded_options would then be 5, 10, which corresponds to 5 clusters and 10 iterations. Order matters: %1 in options is replaced by the first runtime-bound value, %2 by the second, and %N by the Nth. Here is the entire syntax for this example (the areas of focus are numClusters=%1, maxIter=%2, and the trailing 5, 10):

CLUSTER(("Sales"."Products"."Product", "Sales"."Offices"."Company"), ("Sales"."Facts"."Billed Quantity","Sales"."Facts"."Revenue"), 'clusterName', 'algorithm=k-means;numClusters=%1;maxIter=%2;useRandomSeed=FALSE;enablePartitioning=TRUE', 5, 10)

Remember that runtime_binded_options is not required. Parameters can be specified in the function without the use of this option. This means that the following code is equivalent to the example given above:

CLUSTER(("Sales"."Products"."Product", "Sales"."Offices"."Company"), ("Sales"."Facts"."Billed Quantity","Sales"."Facts"."Revenue"), 'clusterName', 'algorithm=k-means;numClusters=5;maxIter=10;useRandomSeed=FALSE;enablePartitioning=TRUE')

Clustering Example Analysis

An example of a clustering analysis could check how the dimensions of offices and companies within the data set are clustered along the measures of revenue and discount amount. One hypothesis for this analysis might be that offices under their respective companies behave very similarly with regard to discount amount and revenue.

Formula Syntax (see footnote 2)

CLUSTER(("Offices"."Office", "Offices"."Company"), ("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'clusterName', 'algorithm=k-means;numClusters=@{numClusters};maxIter=@{numIter};useRandomSeed=FALSE;enablePartitioning=TRUE')

Methodology: The example uses K-means clustering rather than hierarchical clustering. See the Syntax for Clustering section above for details on how the numClusters and maxIter options can take user inputs.

2 The highlighted text refers to presentation variables. See Appendix I for more information.
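The positional binding of %1 ... %N described in the Syntax for Clustering section can be mimicked with a short Python sketch. This only illustrates the substitution rule ("order matters"); it says nothing about how OBIEE parses the options string, and both helper names are hypothetical.

```python
import re


def bind_options(options, *runtime_values):
    """Replace %1 ... %N in an options string with the N-th runtime value."""
    def substitute(match):
        position = int(match.group(1))            # %1 -> first value, %2 -> second
        return str(runtime_values[position - 1])
    return re.sub(r"%(\d+)", substitute, options)


def parse_options(options):
    """Split 'name=value;name=value' into a dictionary of strings."""
    pairs = (item.split("=", 1) for item in options.split(";") if item.strip())
    return {name.strip(): value.strip() for name, value in pairs}


template = ("algorithm=k-means;numClusters=%1;maxIter=%2;"
            "useRandomSeed=FALSE;enablePartitioning=TRUE")
bound = bind_options(template, 5, 10)
print(bound)
print(parse_options(bound)["numClusters"])
```

Binding 5 and 10 at run time yields exactly the same options string as writing numClusters=5 and maxIter=10 directly, which is why the two CLUSTER calls shown earlier are equivalent.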


With a user input of 3 clusters and 20 iterations, one would receive the following output:

Figure 19: Cluster Visualization for 3 Clusters, with 20 Iterations.

The clusters are depicted via color and shape, Discount Amount and Revenue are on the axes, and each point represents one of the 20 offices in the data set. We can see how this graph changes after doubling the number of clusters.

Figure 20: Cluster Visualization of 6 Clusters with 20 Iterations.


Notice how some clusters are larger than others. This is because, in this clustering method, the objects of the data set are grouped in such a way that the clusters are very different from each other while the objects within the same cluster are very similar to each other. As a result, some clusters might contain many highly similar points along the measures of discount amount and revenue, while others are highly varied or contain only one data point, such as cluster number 1 in this analysis.

There is no ‘perfect number’ of clusters. The number depends on the data set in use, the amount of data, and user preference. 3 and 6 were used here merely as examples.

If the data is shown in a tabular format, one gets a fairly informative depiction of the exact amounts within the selected data set. This allows for a more precise view of the data within the clusters. It would be poor practice to display all of this information on the scatterplot. The visualization is a way of viewing the data that makes visible what might otherwise not be apparent. The tabular version is important alongside the visual so that the user can see the precise results of the underlying algorithm. Here is a snippet of the tabular information, sorted in ascending order by cluster number:

Figure 21: Tabular View of Cluster Analysis.

The last important thing to note is that within the clustering function in 12c there are a few variant methods for clustering. These are subsets within the K-means and Hierarchical methods. For the visual comparison, K-means will be used because K-means is the default method for clustering in OBIEE. New variables (as compared to the previous analysis) will also be used to get more data points and to see how the different methods differ.


Figure 22: New Columns for Methodology Comparison.

Notice below the added option, clusterNamePrefix, in the options portion of the syntax for all 3 of the following comparisons. Also notice that useRandomSeed is set to FALSE because we are comparing methods. In the runtime-bound section of the function, both %1 and %2 are set to the method name (“INSERT METHOD”): %1 selects the method, and %2 displays the method name in the legend of the visualization. Also note that 5 clusters are used in each analysis, which allows for a more telling comparison along our input dimensions.

K-MEANS CLUSTERING METHODS:

1) Hartigan-Wong Method

CLUSTER(("Offices"."Office", "Products"."Product"), ("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'clusterName', 'algorithm=k-means;method=%1;numClusters=5;useRandomSeed=FALSE;clusterNamePrefix=%2', '@{P_Method}{Hartigan-Wong}', '@{P_Method}{Hartigan-Wong}')

Figure 23: Output from Hartigan-Wong Method.


2) Lloyd Method

CLUSTER(("Offices"."Office", "Products"."Product"), ("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'clusterName', 'algorithm=k-means;method=%1;numClusters=5;useRandomSeed=FALSE;clusterNamePrefix=%2', '@{P_Method}{Lloyd}', '@{P_Method}{Lloyd}')

Figure 24: Output from Lloyd Method.


3) MacQueen Method

CLUSTER(("Offices"."Office", "Products"."Product"), ("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'clusterName', 'algorithm=k-means;method=%1;numClusters=5;useRandomSeed=FALSE;clusterNamePrefix=%2', '@{P_Method}{MacQueen}', '@{P_Method}{MacQueen}')

Figure 25: Output from MacQueen Method.

Looking closely at these visualizations, it is apparent that the boundaries of each cluster differ slightly between the 3 methods. Also of note: H-Clustering has its own set of methods. Besides the default, there are also ward.D, ward.D2, Single, Average, Median, McQuitty, and centroid.
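For readers who want to see what a k-means method does step by step, here is a minimal Python sketch of the Lloyd variant: assign every point to its nearest center, recompute each center as the mean of its members, and repeat until nothing changes. It is a teaching sketch, not the R routine that OBIEE calls. The function names are hypothetical, the sample points are made up, and starting from the first k points is an assumption chosen so that the result is repeatable.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_kmeans(points, k, max_iter=20):
    """Lloyd's algorithm: alternate assignment and center-update steps."""
    centers = [tuple(p) for p in points[:k]]  # deterministic start (assumption)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers


# Hypothetical (discount amount, revenue) pairs
points = [(1, 1), (1.5, 2), (10, 10), (11, 10.5), (1, 0.5)]
labels, centers = lloyd_kmeans(points, k=2)
print(labels)
print(centers)
```

The other two methods pursue the same objective with a different update schedule: MacQueen's variant updates the centers as individual points are reassigned, and Hartigan-Wong moves a point only when doing so lowers the total within-cluster sum of squares. That difference in schedule is one reason the three outputs above are similar but not identical.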


The Outlier Function

This function classifies a record as an outlier based on one or more input expressions using K-Means, Hierarchical Clustering or Multi-Variate Outlier Detection algorithms (the 3 methods for outlier detection in the Advanced Analytics tools in OBIEE 12c). Each method serves different purposes, and the user can choose the algorithm according to their specific needs. In statistics, an outlier is a data point that diverges from the norm of the data set as a whole to a statistically significant extent. Outliers can be thought of as data anomalies: the black sheep within the data. Outlier detection can be thought of as clustering data along a logical metric, where a normal point is FALSE (not an outlier) and an abnormal point is TRUE (an outlier). Here is a brief description of the 3 methods mentioned above:

K-Means: Given a set of observations (x1, x2, …, xn), k-means clustering attempts to partition them into a user-specified number of clusters (k) so as to minimize the sum of the distances of each individual point from the center of its cluster. This allows for an overview of similarities along the given dimensions. For outlier detection, there will be two clusters in a logical format, one of TRUE and one of FALSE, with TRUE denoting an outlier and FALSE denoting normal data.

Hierarchical Clustering: Generally, this form of clustering is an attempt to build a sort of pecking order in which the data filters down into distinct ‘groups’ along the prompted dimensions. Hierarchical clustering can be thought of as a “top-down” approach of structuring an overview for viewing contextual differences/similarities amongst user-defined dimensions.

Multivariate Outlier Detection (default outlier detection for 12c): One way to check for multivariate outliers is with Mahalanobis’ distance.3 Mahalanobis’ distance can be thought of as a metric for estimating how far each case is from the center of all the variables’ distributions (i.e. the centroid in multivariate space). Mahalanobis’ distance accounts for the different scale and variance of each of the variables in a set in a probabilistic way.

3 (Mahalanobis, 1927; 1936 ).
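The Mahalanobis distance itself is straightforward to compute for two variables. The Python sketch below uses the classical sample mean and covariance; it is not the mvoutlier routine that OBIEE invokes through R, which may use different (for example, robust) estimates. The function names, the sample points, and the 0.975 chi-square cutoff are assumptions for illustration.

```python
import math


def mahalanobis_distances(points):
    """Mahalanobis distance of each (x, y) point from the sample centroid."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample variances and covariance (divide by n - 1).
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # d^2 = [dx dy] * inverse(covariance) * [dx dy]^T for a 2x2 matrix
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances


def flag_outliers(points, quantile=0.975):
    """TRUE where d^2 exceeds the chi-square quantile with 2 degrees of freedom."""
    cutoff = -2.0 * math.log(1.0 - quantile)  # closed form for 2 degrees of freedom
    return [d * d > cutoff for d in mahalanobis_distances(points)]


# Hypothetical (discount amount, revenue) pairs; the last one is off-trend.
points = [(1, 1), (2, 2.1), (3, 2.9), (4, 4.2), (5, 5.1), (3, 9)]
print(mahalanobis_distances(points))
print(flag_outliers(points))
```

Because the distance is scaled by the covariance, a point can be flagged for breaking the relationship between the two measures even when neither of its individual values is extreme. With very small samples the classical distance is bounded, so a strict cutoff may flag nothing at all.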


Syntax for Outlier Analysis:

OUTLIER( (dimension_expr1, ... dimension_exprN), (expr1, ... exprN), output_column_name, options, [runtime_binded_options])

Where:

• dimension_expr— represents a list of dimensions to be clustered (K)

• expr— represents a list of dimension attributes or measures (x1, x2, …, xn) to be used in order to find outliers.

• output_column_name— is the output column. The valid values are:

o ’isOutlier’: returns a logical value, TRUE or FALSE, indicating whether each data point is an outlier.

o ’distance’: returns the “distance from normality” (the higher this number, the ‘more’ of an outlier the data point is).

• options — is a string list of name=value pairs separated by ';'. The value can include %1 ... %N, which can be specified using runtime_binded_options.

• runtime_binded_options — is an optional comma separated list of run-time binded columns or literal expressions that supply a specification to an unrepresented value in the options list.

This portion of the syntax is optional. It merely supplies values for option parameters that have been left as placeholders. For example, in an outlier analysis, the user might have an option output_column_name=%1. If they wanted to use the distance for this particular analysis, their runtime_binded_options would then be equal to ‘distance’. Order matters: %1 in options is replaced by the first runtime-bound value, %2 by the second, and %N by the Nth. Here would be the entire syntax for this example (the areas of focus are highlighted).

Remember that runtime_binded_options is optional. You can specify parameters to your options without using it, which implies that runtime_binded_options is more of an organizational tool than a functional one. Using it versus not using it does not impact performance, but the option is nice to have for organizational purposes.

Outlier Function Example Analysis:

For the analysis, observe how the dimensions of offices and companies within the data set are classified along the measures of revenue and discount amount. One hypothesis for this analysis might be that offices under their respective companies behave very similarly with regard to discount amount and revenue.

Figure 26: Columns used in example analysis.


New Columns for Methodology Comparison

For this example, the multivariate outlier algorithm (mvoutlier) will be used to start, rather than K-means or hierarchical clustering, for no particular reason other than mvoutlier being the default algorithm. That said, it could be wagered that mvoutlier is the default because it is the most favorable algorithm. Observe the variance between algorithms below.

Function Syntax:

OUTLIER(("Offices"."Company", "Offices"."Office"), ("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=_________')

Outlier Observations

Thus far, the syntax has not allowed a specific number of outliers to be entered. When using the multivariate algorithm, entering numClusters into the syntax in order to change the result produces an error in the results tab. After experimenting with the sample sales data, the conclusion is that there is no way to set a specific number of outliers to be detected. The number of outliers is contingent upon each data set and how it interacts with the underlying algorithm in R.

Setting an “is not equal to” filter on the two outlying data points (the Eiffel and Spring offices) in order to see whether there would still be outliers does not remove the outliers. Rather, there are two new outliers (the next two most northeasterly points on the graph). This is counterintuitive. If the function were finding only truly, significantly variant data, then the result after this filter was applied should be all green (FALSE) points on the scatter plot. On the other hand, sometimes a user might have a data set with very similar points but still want to find the point(s) that are most variant. In that sense the outlier detection algorithm is dependable: it will return outliers in all situations. It is important to keep these contingencies in mind when analyzing data.

When the scatter plot involving these variables of analysis is made, the following graphs and accompanying tables are returned:


Multivariate Outlier Detection Method:

OUTLIER(("Offices"."Company", "Offices"."Office"), ("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=mvoutlier')

Figure 27: Multivariate Outlier Detection output.

Hierarchical or H-clustering Outlier Detection Method:

OUTLIER(("Offices"."Company", "Offices"."Office"), ("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=h-clustering')

Figure 28: Hierarchical Clustering Outlier Detection output.


K-means Outlier Detection Method:

OUTLIER(("Offices"."Company", "Offices"."Office"), ("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=k-means')

Figure 29: K-Means Outlier Detection output.

Notice that the h-clustering and multivariate algorithms return consistent outliers (the Eiffel and Spring offices of Tescare Ltd.), whereas the K-means algorithm returns very different ones: the Blue Bell and Teller offices of Stockpiles Inc. This variation between outlier detection methods raises the question of how reliable each algorithm is. For this reason, the variables of analysis were altered to produce a visualization with more data points, and hence more outliers, to see whether the variation was an anomaly specific to these variables. The syntax in use is:

OUTLIER(("Products"."Product", "Offices"."Office"), ("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=h-clustering')

OUTLIER(("Products"."Product", "Offices"."Office"), ("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=Mvoutlier')

OUTLIER(("Products"."Product", "Offices"."Office"), ("Base Facts"."Discount Amount","Base Facts"."Revenue"), 'isOutlier', 'algorithm=K-means')

Place each of the above in its own column (with the same variables in each) in order to view these outliers on the same scatter plot. New variables (Product and Office) were used in this analysis to obtain a larger number of data points. As is visible after these small changes, some outliers overlap and others do not.


Legend Translation:

• Blue Squares: Only h-clustering viewed these as outliers
• Green Circles: All algorithms viewed these as outliers
• Yellow Rhombi: No algorithm viewed these as outliers
• Red Plus Signs: The MV Outlier and K-means algorithms viewed these as outliers; h-clustering did not

In the graph editor, the corresponding order of methodological reference is:

• Hierarchical Clustering
• MV Outlier
• K-Means
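The legend amounts to combining three TRUE/FALSE outlier columns, one per algorithm, into a single label per point. A minimal sketch of that logic follows, in Python for illustration only. The function name legend_category and the sample rows are hypothetical and are not Sample Sales results; the argument order follows the graph editor order listed above.

```python
def legend_category(h_clustering: bool, mvoutlier: bool, k_means: bool) -> str:
    """Map three isOutlier flags to the legend label for one data point."""
    flags = (h_clustering, mvoutlier, k_means)
    if all(flags):
        return "all algorithms"            # green circles
    if not any(flags):
        return "no algorithm"              # yellow rhombi
    if h_clustering and not (mvoutlier or k_means):
        return "h-clustering only"         # blue squares
    if mvoutlier and k_means and not h_clustering:
        return "mvoutlier and k-means"     # red plus signs
    return "other combination"             # not covered by the legend

# Made-up flags for four hypothetical points.
rows = [
    ("point A", True, False, False),
    ("point B", False, True, True),
    ("point C", False, False, False),
    ("point D", True, True, True),
]
for name, h, mv, km in rows:
    print(name, "->", legend_category(h, mv, km))
```

Any combination the legend does not name (for example, flagged by K-means alone) falls through to "other combination" rather than being forced into one of the four groups.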

Figure 30: Visualization comparing all three outlier methods

Analyzing Methodology Consistency

There are no points flagged by both the h-clustering algorithm and either of the other two algorithms. This is interesting. Other data sets may well contain points where h-clustering agrees with the other methods, but in this particular data set the lack of overlap probably comes down to how each algorithm goes about ‘defining’ an outlier. Remember, h-clustering builds a hierarchy through which the data filters down, whereas the other methods are distance based, driven by the input criteria or dimensionality. These differences could account for the variation in the visualization here.

Also notice that, with more data, the ‘behind the scenes’ R statistics appear more consistent in their outlier detection. In the first analysis, K-means was somewhat off compared to the other two algorithms. Documentation on K-means clustering suggests that the method becomes more reliable as data sets grow larger. In the first analysis there were few data points; in the second there are many. The fact that there were only 20 points in the first analysis might be the reason for the discrepancy among strategies. Perhaps as the data set size increases, more consistency between the algorithms will be noticed. Keep this in mind when choosing algorithms.


The Regression Function

This function fits a linear model and returns the fitted values or the model. It can be used to fit a linear curve on two measures. In statistics, regression analysis is a process that estimates the relationship between variables within a data set. The focus is to measure the relationship between one or more independent variables and a dependent variable. More specifically, regression allows for a deeper understanding of how the dependent value changes when an independent variable is adjusted.

It might help to think of regression in ‘mathy’ f(x), or f of x, notation, where x is the independent variable, or input value. The dependent variable (the output) can be thought of as the value on the y-axis. It might also help to think of these two variables linguistically. The y-axis measure is the dependent variable: it literally depends on some other value changing before it does. The x-axis measure(s) are literally independent of any other factor(s); they are fixed. This is important to understand before getting into the syntax.

In layman’s terms, regression is a measure of how good a predictor one measure is of another. Linear regression is also widely used for forecasting trends and in predictive analytics, and has strong ties to machine learning. Also, understand that regression does not imply causation; rather, it suggests the extent to which two measures are correlated.

Dummy Variables in Categorical Regression

It is not possible to directly regress a categorical variable against a numerical variable, nor is it possible to regress a numerical variable against a categorical variable. There is a solution for this, though. It is called a dummy variable.
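As a rough sketch of the idea, a dummy variable replaces a category label with a 0/1 indicator for a single category. The Python below is for illustration only and is not part of OBIEE; the pet list and the helper name dummy are hypothetical.

```python
# Hypothetical category values; nothing here is OBIEE output.
pets = ["Cat", "Dog", "Bird", "Cat", "Dog"]

def dummy(values, category):
    """Return 1 where the value equals `category`, otherwise 0."""
    return [1 if value == category else 0 for value in values]

is_cat = dummy(pets, "Cat")
is_dog = dummy(pets, "Dog")
print(is_cat)  # [1, 0, 0, 1, 0]
print(is_dog)  # [0, 1, 0, 0, 1]
```

Each indicator column is numeric, so it can take part in a regression without implying any ordering or magnitude between the categories.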
Suppose an analysis calls for a regression model involving a categorical variable that contains the names of pets (Cats, Dogs, and Birds), to see how good a predictor these pets are of (fill in the blank). It would not make sense to assign Cats, Dogs, and Birds the values 1, 2, and 3, respectively, unless, for some reason, a Dog was twice as much of a pet as a Cat and a Bird three times as much of a pet as a Cat. Since regression works with numerical variables, interpretations are only valid when storing 100 for some variable literally equates to having 100 times the characteristic that a stored value of 1 represents. For the pet example, since it would be illogical to assign 1, 2, and 3, an alternative (with a regression model in mind) is to assign binary values, such as 1 = Cat and 0 = not a Cat.

Syntax for Regression Analysis

REGR(y_axis_measure_expr, (x_axis_expr), (category_expr1, ..., category_exprN), output_column_name, options, [runtime_binded_options])

Where:

• y_axis_measure_expr represents the measure for which the regression model is to be computed. This is your dependent variable.


• x_axis_expr represents the measure to be used to determine the regression model for the y_axis_measure_expr. This is your independent variable.

• category_expr1, ..., category_exprN represents the dimension/dimension attributes to be used to determine the category for which the regression model for the y_axis_measure_expr is to be computed. One or more dimensions or dimension attributes, up to five, may be provided as category columns.

• output_column_name is the output column:

o fitted - returns the points on the regression line (y=ax+b)
o intercept - returns the value of y where x is zero (b from y=ax+b)
o modelDescription - returns the model in JSON format

• options is a string list of name=value pairs separated by ';'. The value can include %1 ... %N, which can be specified using runtime_binded_options.

• runtime_binded_options is an optional comma-separated list of run-time bound columns and options.
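To make the fitted and intercept outputs concrete, the sketch below fits y=ax+b by textbook ordinary least squares. It is written in Python for illustration and is not a description of how OBIEE or R computes REGR internally; the function name fit_line and the numbers are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx              # slope
    b = mean_y - a * mean_x    # intercept: value of y where x is zero
    return a, b

# Made-up measures: x is the independent variable, y the dependent one.
x_values = [1.0, 2.0, 3.0, 4.0]
y_values = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(x_values, y_values)
fitted = [slope * x + intercept for x in x_values]
print(slope, intercept)  # 2.0 1.0
print(fitted)            # [3.0, 5.0, 7.0, 9.0]
```

Here the data lies exactly on a line, so the fitted values equal the observed values. With real data the fitted values are the points on the line, and the observed values scatter around them.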

Regression Example Analysis

In this particular analysis, a comparison is made to reveal how good a predictor the independent variable, billed quantity, is of the dependent variable, revenue. The question to be answered is: if the quantity of billed items changes, how is revenue altered? Based on the column names alone, it could be predicted that the data will cluster fairly closely around an upward-sloping regression line. This would mean that billed quantity is a good predictor of revenue, which is fairly intuitive. What can also be seen below, however, is that billed quantity is not a perfect predictor of revenue; if it were, fewer data points would lie away from the regression line. In a regression scatter plot like the one below, the more tightly the ‘green dots’ hug the ‘blue dots’, the higher the correlation between the two variables.

Function Syntax Used

REGR("Base Facts"."Revenue", ("Base Facts"."Billed Quantity"), ("Time"."Per Name Month", "Time"."Per Name Year"), 'fitted', '')
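How tightly the points hug the line can be expressed as a single number, the Pearson correlation coefficient. The sketch below is in Python for illustration only, with made-up values standing in for billed quantity and revenue. It is not something the REGR function returns, and the helper name pearson_r is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: one series that hugs a line, one that scatters around it.
quantity = [1.0, 2.0, 3.0, 4.0, 5.0]
tight_revenue = [2.0, 4.1, 5.9, 8.0, 10.1]
loose_revenue = [2.0, 7.0, 3.0, 9.0, 6.0]

print(round(pearson_r(quantity, tight_revenue), 3))
print(round(pearson_r(quantity, loose_revenue), 3))
```

A value close to 1 corresponds to points clustered tightly around an upward-sloping line; a value nearer 0 corresponds to a loose scatter.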


Figure 31: Regression Analysis of Billed Quantity as a Predictor of Revenue

If the user were to check the table below and look under the column heading “Regression”, he or she would see the regression function’s output, shown in Figure 32.

Figure 32: Regression Output in Tabular View


It may be interesting to see which data in this regression did not fit the trend. The visualization below was created by using this syntax:

OUTLIER(("Time"."Per Name Year", "Time"."Per Name Month"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"), 'isOutlier', 'algorithm=mvoutlier')

This displays outlying values for the same variables used in the regression above.

Figure 33: Visual of Data Points where Billed Quantity is not a Predictor of Revenue

Concentrate on the red plus signs rather than the yellow rhombi. The red plus signs are the outliers for this regression analysis, whereas the yellow rhombi are merely the corresponding points plotted on the regression line for these 4 outlying data points. By sorting the outlier portion of this data set, one could create a table that shows the year and month where billed quantity was not necessarily a great predictor of revenue.

Figure 34: Tabular View of Outliers Within a Regression Analysis


What is noticeable is that, for the 6th and 7th months of 3 consecutive years, billed quantity was not a great predictor of revenue. With this sort of information, it is possible to drill down into why this might be the case. Unveiling these quantitative and visual ‘hints’ within the data in an aesthetic way is the epitome of these advanced analytics tools. Statistics can tell a lot about why things are the way they are and can, ultimately, provide insight for moving forward in a way that allows the building of a sustainable organization.


Appendix I: Creating Presentation Variables and Prompts

Presentation Variable and Prompting the User for Function Options

Above, the function code varies slightly from the original syntax: @{numClusters};maxIter=@{numIter} appears in the options portion of the function input. The @{} notation adds a presentation variable, which is tied to a dashboard prompt that asks the user for the number of clusters and the number of iterations for the algorithm to perform. In many cases it is a good idea to prompt the user for the number of clusters and iterations, because it allows for a more interactive dashboard. It is also valuable because this simple change shows how a large sample changes as the data set is segmented into varying numbers of clusters. A developer who wants to do the same should highlight, in the syntax, the portion that would typically contain (%1…%N) for whichever variable needs a prompt, and then perform the following tasks:

Figure A1: Highlight the %N.

Figure A2: Click “Variable”, then “Presentation”.


Figure A3: Input a variable expression. Be careful before clicking OK here: this Variable Expression must match the corresponding dashboard prompt exactly, including case. Click OK.

Figure A4: Click “New”, then “Dashboard Prompt”.

Figure A5: Click the green arrow, then “Variable Prompt”.


Prompt for = Presentation Variable:

*Label (this must equal the presentation variable that was set in the column function) = numClusters:

Expand the options window: Variable Data Type = Number:

A Note of Defaults

The user can set a default value here. Also note that there appears to be an undocumented default value of 5 clusters. For example, the syntax

CLUSTER(("Products"."Product", "Offices"."Office"), ("Base Facts"."Discount Amount","Base Facts"."Revenue"),'clusterName', 'algorithm=k-means;')

returns a visualization of:

Figure A6: Default Visualization of Discount Amount versus Revenue.


Figure A7: Complete the process again for the iteration variable. Save these prompts.

Now, on the dashboard where the dashboard prompt and the analysis have been placed, this presentation variable can be seen in action.
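Conceptually, each @{name} placeholder in the options string is replaced by the value the user enters in the matching dashboard prompt. The Python sketch below illustrates that substitution only; it is not how OBIEE implements presentation variables. The full options string and the helper name expand_presentation_variables are assumptions, and only the @{numClusters} and @{numIter} placeholders come from the text.

```python
import re

def expand_presentation_variables(template, prompt_values):
    """Replace each @{name} placeholder with the matching prompt value."""
    def substitute(match):
        name = match.group(1)
        # Names are matched case-sensitively, as the prompt setup requires.
        if name not in prompt_values:
            raise KeyError(f"no prompt value for presentation variable {name!r}")
        return str(prompt_values[name])
    return re.sub(r"@\{(\w+)\}", substitute, template)

# Assumed options string; the exact option layout is hypothetical.
options = "algorithm=k-means;numClusters=@{numClusters};maxIter=@{numIter}"
print(expand_presentation_variables(options, {"numClusters": 4, "numIter": 10}))
# algorithm=k-means;numClusters=4;maxIter=10
```

A prompt defined as numclusters (lower case) would not match @{numClusters} in this sketch, which mirrors the case-sensitivity warning in Figure A3.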


Document History

Created By: Brendan Doyle, Mike Perhats
Edited By: Phil Goerdt
Creation Date: 8/8/16
Last Edit Date: 8/8/16
