Data Mining

Name: KRIENGSAK CHANINCHOMPOONUT

Date: December 10th, 2010

1

As technology spreads into virtually every area of data mining research, good decision making has become a key to a successful organizational strategy. Data mining gives you access to the information you need to make intelligent decisions about difficult business problems: it identifies rules and patterns in data, so that you can determine why things happen and predict what will happen in the future. The top-down technique can be used when the data behaves like a function that can be calculated with an equation. In real-world scenarios, however, complex data does not always yield an accurate outcome, because many cases cannot be solved by a mathematical formula that tries to map the unknown factors into an algorithm. The bottom-up technique offers another solution, and the results from both directions, top-down and bottom-up, can be used to cross-validate each other.

2

Top-Down technique

Bottom-Up technique

As a result, the next numbers of this dataset are likely to be 0, 4, 7 and so on, because we are able to map the known factors into an equation.

The dataset at the bottom is different: its unknown factors have to be learned from the bottom up, because it shows no linear proportion that can be solved with an equation. Instead, the data is spread over the graph with no known direction. If we still used an equation on this dataset, we would hardly ever detect any pattern or relationship.

That is why the bottom-up technique is the efficient way: it learns the data and recognizes a pattern once a similar pattern appears again in the dataset.

3

To answer various types of business questions, data mining helps you find patterns and relations in data that are not apparent to the human eye. The datasets are analyzed with mathematical algorithms such as decision trees, segmentation, clustering, association and time series through Microsoft SQL Server technologies, and the discovered patterns are confirmed before they are used for predictions based on the historical data. The valuable information found can then be used in applications such as financial applications, marketing and sales forecasts, CRM and ERP.

The main topic of this project is using the database as the foundation for choosing appropriate models and algorithms, based on pattern recognition or detection in the historical data.

4

The following development tools are used in this project.

Application
• Microsoft SQL Database Server (MSSQL)
• Microsoft SQL Server Analysis Services (SSAS)
• Microsoft SQL Server Integration Services (SSIS)
• Microsoft Visual Studio C#
• Microsoft Decision Tree Algorithm
• Microsoft Naïve-Bayes Algorithm
• Neural Network Algorithm

Hardware
• Server running the SQL Database engine and Analysis Services
• PC for gathering the data sources daily and supplying them to MSSQL
• Server running SSIS for updating the SSAS server daily
• PC for C# coding, database, SSAS and data mining design

5

There are 5 phases to implement in this project:

Phase I : Identify the business problems
Phase II : Data source collection
Phase III : Database transformation
Phase IV : Data mining model building
Phase V : Model Assessment

6

Diagram (system architecture). Components: Data source; Data Converting / SSIS; MSSQL Database Server; SSAS Database Server; Neural Network (NN); Data mining. Flow labels: convert and supply data to MSSQL; query data from database; produce data mining; NN produces data mining.

7

To identify the business need, the experiment in this project involves a financial application that asks the following questions:

To help the financial department manage a currency swap: which factors most affect the US Dollar and Thai Baht exchange rate? And what is the next day's exchange rate likely to be?

For the rest of this presentation, this inquiry is referred to as follows:
Fundamental : the questions asked by the financial department, as stated above.

8

To answer the questions from the first phase, the appropriate data needs to be collected in this phase. Ideas can come from people with experience in the particular field, which helps narrow the huge volume of raw data down to meaningful data instead of gathering everything.

However, data mining techniques tend to require more historical data than standard models and, in the case of neural networks, can be difficult to interpret.

9

Contents | Data Source
Economic statistical indicators | Bank of Thailand
Daily Thai stock index | The Stock Exchange of Thailand
Daily Thai bank interest rate | Bank of Thailand
Daily exchange rates | Bank of Thailand
Daily gold trading price | Bloomberg, Thai Gold Trader
Daily crude oil prices | Bloomberg
Daily world stock index | Bloomberg

10

Database Tables

Once all the expected data sources are available, the data transformation begins. I wrote scripts in C# that grab the data from the raw sources and feed it into the MSSQL database server, which is updated automatically every day.

• 32 tables
• Only the selected, appropriate tables are included in this project.
• A view, usdVSVariables, is created over the selected tables.

Fundamental Database

11

12

SELECT DISTINCT TOP (100) PERCENT
    dbo.ExchangeRates.DateKey,
    dbo.GoldMarket.DollarPerOunce,
    dbo.Energy.Value AS CrudeOil,
    dbo.ExchangeRates.BuyingSightBill,
    StockValue.SETValue,
    StockValue.DJValue,
    InterestMRR.MRR,
    DepositRate.OneYearMax
FROM dbo.ExchangeRates
INNER JOIN dbo.Energy
    ON dbo.ExchangeRates.DateKey = dbo.Energy.DateKey
INNER JOIN dbo.GoldMarket
    ON dbo.Energy.DateKey = dbo.GoldMarket.DateKey
INNER JOIN (
    -- Put the SET and Dow Jones rows of StockMarket side by side, one row per date
    SELECT T.DateKey, T.Value AS SETValue, D.Value AS DJValue
    FROM dbo.StockMarket AS T
    INNER JOIN dbo.StockMarket AS D ON T.DateKey = D.DateKey
    WHERE (T.Symbol = 'SET') AND (D.Symbol = 'DowJones')
) AS StockValue
    ON dbo.GoldMarket.DateKey = StockValue.DateKey
INNER JOIN (
    -- Loan interest rate (MRR) of a single bank
    SELECT DateKey, BankName, MRR
    FROM dbo.LoanInterestRate
    WHERE (BankName = 'Bangkok Bank')
) AS InterestMRR
    ON StockValue.DateKey = InterestMRR.DateKey
INNER JOIN (
    -- One-year deposit rate of the same bank
    SELECT DateKey, BankName, OneYearMax
    FROM dbo.DepositInterestRate
    WHERE (BankName = 'Bangkok Bank')
) AS DepositRate
    ON InterestMRR.DateKey = DepositRate.DateKey
WHERE (dbo.ExchangeRates.DateKey > 19991231)
    AND (dbo.ExchangeRates.Currency = 'USD')
    AND (dbo.GoldMarket.DollarPerOunce > 0)

SQL Code

At this point, I divide the demonstration into two sections:
• Fundamental : predict the USD-Thai Baht exchange rate
• Customers : identify prospective customers

Let us start with the Fundamental data mining implementation. The standard approach is to model all the associated attributes as input variables, predict Thai Baht per dollar as the result, and analyze which factors are most influential.

Mining Structure
• Data source from the SSAS server
• Data split for training and testing is 70:30
• Data type is discretized
• Key : DateKey
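The 70:30 split and the discretized data type listed above can be sketched in a few lines of Python. This is only an illustration of the idea, not how SSAS implements it: the function names `split_cases` and `discretize`, the random shuffling, the seed and the sample values are all assumptions made for the example.

```python
import random

def split_cases(cases, train_fraction=0.70, seed=0):
    """Shuffle the cases reproducibly, then split them into training and testing sets."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def discretize(value, boundaries):
    """Return the bucket index of value, given sorted cut points.

    Bucket 0 holds values below the first cut point; each cut point that
    value reaches or exceeds moves it one bucket higher.
    """
    return sum(1 for boundary in boundaries if value >= boundary)

# Hypothetical cases: the keys and rates are illustrative, not real data.
cases = [{"DateKey": key, "BuyingSightBill": 30.0 + key * 0.1} for key in range(100)]
training, testing = split_cases(cases)
buckets = [discretize(case["BuyingSightBill"], [38.32]) for case in training]
```

With a single cut point, every case lands in bucket 0 or bucket 1, which mirrors a two-state discretized column.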

16

To show which variables are most important for predicting Thai Baht per dollar, I use a hybrid approach that exploits the advantages of each algorithm: a Decision Tree and Naïve Bayes classify which variables to use as input for the Neural Network. The decision tree is capable of detecting rules like “if A then B”. It does not work as well with continuous values, however: it cannot express “if A then 2.5” and instead splits the node as “if A > 20 then B”. That is why the Neural Network takes over and produces numeric outcomes, which can be compared against the Decision Tree results. With this approach the forecast of Thai Baht per dollar should be more accurate, because it is based on the associated variables, and the next day's rate can be approximated more efficiently.

Diagram (hybrid approach): input Variables 1 to 6 go to the Decision Tree and Naïve Bayes, which classify the variables (Variable 2 and Variable 6 in the illustration) used as input to the Neural Network.

17

All associated variables can be obtained by survey, by external data research, or by talking to people with experience in the field.

The advantage of forecasting with several factors instead of depending on only one is that they cross-validate the result, which gives a more precise, higher-quality interpretation of the data.

Variable | Description | Usage
SETValue | Thai stock index (SET) | Input
DJValue | Dow Jones index | Input
CrudeOil | Crude oil, dollars per barrel | Input
DollarPerOunce | Gold price, dollars per ounce | Input
BuyingSightBill | Thai Baht per USD currency rate | Output (predicted)
DateKey | Date dimension | Key column

18

To get the whole picture of how each attribute relates to the predicted value, we typically retrieve the full history of those attributes from the database. This gives an idea of the main pattern in the big cycle and determines the ceiling and floor of the data range. Later we can narrow the data range down to look for a pattern in a small cycle within the big cycle.

10 Years Data range

1 Year Data range

19

10 Years Gold Price Dollar Per Ounce and Baht Per USD Currency Relationship Graph

From Jan-01-2000 To Dec-31-2010

DollarPerOunce

20

CrudeOil

10 Years Crude Oil USD/Barrel and Baht Per USD Currency Relationship Graph

From Jan-01-2000 To Dec-31-2010

21

10 Years Thai SET Index and Baht Per USD Currency Relationship Graph

From Jan-01-2000 To Dec-31-2010

SETValue

22

10 Years Dow Jones Index and Baht Per USD Currency Relationship Graph

From Jan-01-2000 To Dec-31-2010

DJValue

23

A decision tree can help identify which factors to consider and how each factor has historically been associated with different outcomes.

Concept : The Decision Tree is a classification algorithm that makes predictions based on the relationships between input columns in a dataset by creating a series of splits, or nodes, in the tree. The algorithm adds a node to the model every time an input column is found to be significantly correlated with the predictable column. To capture the big cycle of the data range, in this scenario the algorithm discretizes the predictable attribute into 2 buckets, as follows.

Attribute : Baht per USD
Bucket 1 : < 38.32
Bucket 2 : >= 38.32

After the decision tree is processed, it helps determine which variables most affect values under 38.32 and values of 38.32 and above.
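To make the idea of a split concrete, the sketch below picks the threshold on one input column that best separates the two Baht/USD buckets. It uses entropy-based information gain, a common textbook criterion, which is an assumption here and not necessarily the score the Microsoft Decision Trees algorithm applies. The gold and Baht values are hypothetical.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for label in set(labels):
        p = labels.count(label) / total
        result -= p * math.log2(p)
    return result

def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting the cases at value < threshold."""
    left = [l for v, l in zip(values, labels) if v < threshold]
    right = [l for v, l in zip(values, labels) if v >= threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

def best_threshold(values, labels):
    """Try the midpoints between sorted distinct values; return (threshold, gain)."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(((t, information_gain(values, labels, t)) for t in candidates),
               key=lambda pair: pair[1])

# Hypothetical cases: gold price (USD per ounce) and the Baht/USD rate on the same day.
gold = [300.0, 420.0, 500.0, 600.0, 900.0, 1200.0]
baht = [41.0, 40.2, 39.5, 37.0, 34.5, 31.0]
labels = [">= 38.32" if b >= 38.32 else "< 38.32" for b in baht]
threshold, gain = best_threshold(gold, labels)
```

On this toy data the chosen threshold separates the two buckets perfectly, so the gain equals the full entropy of the labels.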

24

Dependency Network

The dependency network displays the relationships between the attributes, from the least to the most important factor for the predictable attribute. The center node of the chart represents the predictable attribute and the nodes around it represent the input attributes.

In the diagram, number 1 marks the most important factor and 4 the least.

According to the diagram, SET Value is the least influential factor, so it disappears first when the view is adjusted, followed by Crude Oil, DJ Value and Dollar Per Ounce, in that order. The decision tree then automatically creates tree nodes ordered from most to least important.

25

Trees Nodes

Typically, a decision tree is a classification model that holds all cases at the root node and then splits on the most influential factor into child nodes, here Value – vEnergy. Each child node splits again on the next most important factor, and so on until no more cases can be split. The final, least important nodes are called leaf nodes, as in the diagram below.

In the diagram, the pink part of the histogram represents values < 38.32 and the green part represents values >= 38.32. Each node splits into 3 DollarPerOunce nodes, shown with their data range and a color that indicates the category.

26

Histogram

Each node may contain a single pure factor or several factors, and it reports statistics, supporting cases and probability, represented by the histogram. The histogram indicates the percentage of cases in the node that fall into each category. For example, if we travel from the root node to the node DollarPerOunce < 543.445, the histogram is almost entirely green, with 906 cases and a probability of 92.65%, which implies that this node determines a Baht/USD value greater than 38.32.

Even though DJValue is split into greater than 10532 and less than 10532, both nodes support Baht/USD > 38.32 as well. The only difference is that the cases are grouped into two categories, and a case can fall into either node.

The DJValue and Baht/USD relationship chart helps make this clearer.

27

Figure: 10 Years Dow Jones Index and Baht Per USD Currency Relationship Graph, from Jan-01-2000 to Dec-31-2010, with the Baht/USD level 38.32 marked and the Dow Jones range divided into a DJValue >= 10532 zone and a DJValue < 10532 zone.

28

After the decision tree is processed, nodes with a low histogram share do not influence the predicted value; only the nodes with the purest color are included in the interpretation.

As a result, the gold price is the most influential factor in determining the Baht/USD direction. If the gold price goes up, Baht/USD is likely to go down; conversely, if the gold price goes down, Baht/USD goes up.

The dependency network helps confirm that the gold price is the most important attribute in the tree algorithm. This can be checked at the next level, the gold price node 543.44 - 862.84, which splits into 3 Thai SET index nodes. Although they all have high histogram shares, they are likely meaningless, because the process repeats recursively for each child and the nodes cover the whole range of SET values, that is, any zone of the SET range. However, for Baht/USD under 38.32 with a gold price of 543.44 - 862.84, there are 3 SET nodes that support this scenario.

29

Figure: DollarPerOunce against Baht/USD, with the 38.32 level and the gold price > 543.44 zone marked.

30

The same observation applies to the nodes under a gold price below 543.44, which is explained by the figures on pages 27-28.

For instance, if the gold price drops below 543.44, Baht/USD is likely to go up for any range of the Dow Jones.

Figure: SETValue against Baht/USD, with the 38.32 level marked and the SET range divided into Zone 1, Zone 2 and Zone 3.

31

A decision tree can classify the dataset into segments and point out which variable has the most impact on the predicted value. Its disadvantage is that it is built from a univariate split at the root and at each node: every split divides the data recursively from root to leaf, so there is usually very little data left at the leaf to base a decision on. For instance, recall from the previous figure that the gold price node 543.44 - 862.84 splits into 3 SET value nodes, but those nodes cannot specify exactly which SET range applies; they cover every zone, because the 3 nodes make their decision recursively on top of their parent node.

Naïve-Bayes is different: each attribute is evaluated independently, directly against the predicted value, and not recursively through other nodes. A classifier can be built at the leaf nodes. For instance: are small companies with annual profits of more than $500K a bad credit risk? Are large companies with negative annual profits still a good credit risk? Unlike a decision tree, Naïve-Bayes does not consider combinations of attributes. So if the decision tree segments the data that forms the essential part of the big picture, each segment represented by a leaf can then be described through Naïve-Bayes.

Which to use depends on how the business problem is defined. If we are only looking for the big picture of the data, a decision tree provides enough information. If we need to focus on, or explore, attributes that do not depend on the big picture, we need Naïve-Bayes.

In this case the gold price node is the big picture, since every path from the root through the tree to a leaf includes it. The leaf nodes contain little data, but that data might still be important if we process it with Naïve-Bayes at the leaf.
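The independence assumption described above can be illustrated with a minimal Naïve-Bayes sketch over discretized attributes. It is a generic textbook version, not the Microsoft Naïve-Bayes implementation; the attribute states ("low", "high"), the Laplace smoothing and the six training rows are assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, target):
    """Count class frequencies and, per class, the states seen for each attribute."""
    class_counts = Counter(row[target] for row in rows)
    state_counts = defaultdict(Counter)  # (attribute, class) -> Counter of states
    for row in rows:
        for attribute, state in row.items():
            if attribute != target:
                state_counts[(attribute, row[target])][state] += 1
    return class_counts, state_counts

def predict(model, case, smoothing=1.0, states_per_attribute=2):
    """Return normalized class probabilities for one case."""
    class_counts, state_counts = model
    total = sum(class_counts.values())
    scores = {}
    for cls, count in class_counts.items():
        score = count / total  # prior P(class)
        for attribute, state in case.items():
            seen = state_counts[(attribute, cls)][state]
            # Each attribute contributes independently; Laplace smoothing
            # keeps unseen states from producing a zero probability.
            score *= (seen + smoothing) / (count + smoothing * states_per_attribute)
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}

# Hypothetical discretized cases (illustrative only).
rows = [
    {"Gold": "low", "CrudeOil": "low", "Baht": ">=38.32"},
    {"Gold": "low", "CrudeOil": "low", "Baht": ">=38.32"},
    {"Gold": "low", "CrudeOil": "high", "Baht": ">=38.32"},
    {"Gold": "high", "CrudeOil": "high", "Baht": "<38.32"},
    {"Gold": "high", "CrudeOil": "high", "Baht": "<38.32"},
    {"Gold": "high", "CrudeOil": "low", "Baht": "<38.32"},
]
model = train_naive_bayes(rows, "Baht")
probs = predict(model, {"Gold": "low", "CrudeOil": "low"})
```

Each attribute multiplies into the score on its own, so no combination of attributes is ever evaluated, which is the contrast with the recursive splits of the tree.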

32


Dependency Network

After Naïve-Bayes is executed, the Dependency Network ranks the attributes differently from the Decision Tree: Crude Oil is the second most important attribute instead of Dow Jones. That is because Crude Oil, like every other attribute, is classified independently and directly against Baht/USD.

The gold price, however, is still the most important attribute.

The attribute profiles on the next page show the states of each attribute by data range, represented by color.

Baht/USD is split into two cases, >= 38.32 and < 38.32. The case >= 38.32 appears more reliable than the case < 38.32 because it has less segmentation. The input attributes therefore have a meaningful relationship to Baht/USD.

33

Attribute Profiles

The figure on the left shows each attribute against Baht/USD. A pure color indicates the state with the highest probability. The gold price is a very confident indicator: the blue state, values below 543.44, supports Baht/USD >= 38.32 with 96% probability. In contrast, the same attribute and data range has only 0.83% probability in the case < 38.32, where the values greater than 543.44 are split in a 50:41 proportion instead.

Analyzing the result

Notably, the gold price and crude oil move in the opposite direction to Baht/USD: when the gold price and crude oil price drop, Baht/USD goes up. Dow Jones and SET, by contrast, do not have a clear linear relationship with Baht/USD (figures on pages 28 and 30), so they can fall in either the zone under or the zone above 38.32. For instance, Dow Jones below and above 1053.85 falls in the case >= 38.32 with probability 68:32, and in the case < 38.32 with probability 34:66. Dow Jones and SET value are therefore not confident indicators of the Baht/USD direction in the Naïve Bayes algorithm, which is why they rank low in the dependency network.
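One common way to quantify an "opposite direction" linear relationship like the one described above is the Pearson correlation coefficient. The presentation itself does not compute it, so this is an added illustration; the function name `pearson` and the sample values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: gold price rising while Baht/USD falls.
gold = [300.0, 500.0, 700.0, 900.0, 1100.0]
baht = [42.0, 40.0, 37.0, 34.0, 32.0]
r = pearson(gold, baht)
```

A value near -1 indicates a strong inverse linear relationship, a value near 0 indicates no linear relationship, which is how the Dow Jones and SET charts are characterized in the text.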

34

Figure: CrudeOil against Baht/USD, with the 38.32 level marked.

In this phase, I use tools to determine the accuracy of the models that were created, and examine the models to determine the meaning of the discovered patterns and how they apply to the business. For example, a model may determine that Baht/USD drops if the gold price or crude oil goes up.

A dataset with a linear relationship is more meaningful than random data. Although the 10 years of gold price and crude oil history can be the most appropriate input attributes for data mining, the same attribute may not contain any useful pattern over a different data range. For example, 1 year of crude oil history might be a non-linear dataset, while SET might contain a useful pattern instead. So the choice depends on what the business needs to achieve. If the focus is only on the main scope, algorithms with discretized content over a large historical dataset are the best fit. On the other hand, a small historical dataset with numeric content may be the best solution for an application that focuses on real numeric calculation, such as daily stock forecasting, because a large dataset takes a lot of time to process. Even on a high-performance computer, a Neural Network in particular might take a whole month to learn and search for just a small pattern across multiple input attributes.

35

Figure: One year Baht/USD - Crude Oil Historical

Therefore, a good approach for a general result is to build several models using different algorithms and then compare the accuracy of these models.

One year Baht/USD - SET Historical

36

The accuracy of an algorithm depends on the nature of the data, the data range and the choice of an appropriate algorithm.

Figure: Classification Matrix

You may need to repeat the data cleaning and transformation in order to derive more meaningful variables, and then determine the big picture of the dataset with the created algorithms. However, if the relationships among attributes are complicated, a neural network may perform better.
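A classification matrix counts how often each actual bucket was predicted as each bucket, and accuracy is the share of cases on its diagonal. The sketch below shows the calculation on hypothetical test cases; the function names and the labels are assumptions for illustration and do not reproduce the SSAS viewer.

```python
from collections import Counter

def classification_matrix(actual, predicted):
    """Count the cases for every (actual, predicted) pair of labels."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Share of cases whose predicted label equals the actual label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

# Hypothetical test cases (illustrative only).
actual = [">=38.32", ">=38.32", "<38.32", "<38.32", "<38.32"]
predicted = [">=38.32", "<38.32", "<38.32", "<38.32", ">=38.32"]
matrix = classification_matrix(actual, predicted)
score = accuracy(actual, predicted)
```

Running the same two functions on the test set of each model gives comparable numbers for choosing between algorithms.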

It is essential to work with business analysts who have the proper domain knowledge to validate the discoveries before the patterns found by data mining are deployed to production.

In this experiment, the big-picture pattern found by the Decision Tree and Naïve-Bayes algorithms, with gold price and crude oil as input attributes, likewise needs to be validated before we move to the next step.

To complete this project, however, I will assume those attributes are the most important for determining the Baht/USD direction as a big picture. In the next step, a Neural Network learns and searches the dataset derived from the output of the previous algorithms, attempting to express the patterns found as a linear relationship.

37

Recall from the beginning of this presentation that an unknown dataset pattern can be solved with the bottom-up technique. A Neural Network is a good approach for complicated data, as long as the input attributes are the right ones.

CONCEPT

A neural network (NN) is an algorithm based on the operation of biological neurons; in other words, it emulates the human brain. It is designed to work like a human brain: it learns from problems and later solves other, similar problems.

In the human brain, action potentials are the electric signals that neurons use to convey information, and they travel through the net across what is called the synapse. As these signals are identical, the brain determines what type of information is being received from the path the signal took: it analyzes the patterns of signals being sent and interprets the information from them.

To emulate that behavior, the artificial neural network has several components. The node plays the role of the neuron, and the weights are the links between the nodes, so they are what the synapse is in the biological net. The input signals are multiplied by the weights and summed to obtain the total input value for a specific node (diagram on the next page). There are three kinds of layers in a NN: the input layer, which holds one node for each input variable; the hidden layer, of which there can be several; and the output layer, which holds the result set. An activation function is applied to that total input to obtain the value of a particular node.

38

Neuron scheme

Node scheme

The diagram illustrates a neuron scheme: the neuron receives information from other neurons as input via synapses, and the connections between neurons form branches, or a network. Once the input is larger than a certain threshold, the neuron fires according to the information received.

The node scheme works similarly. The perceptron takes a weighted sum of its inputs and sends the output to other nodes if the sum is greater than some adjustable threshold value. The inputs x1, x2, x3 .. xm and connection weights w1, w2, w3 .. wm are typically real values. If a feature xi tends to cause the perceptron to fire, the weight wi will be positive; if the feature xi inhibits the perceptron, the weight wi will be negative.
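The weighted sum and threshold described above fit in a few lines. This is a minimal sketch; the function name `perceptron_output` and the example inputs, weights and thresholds are hypothetical.

```python
def perceptron_output(inputs, weights, threshold):
    """Fire (return 1) when the weighted sum of the inputs exceeds the threshold."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum > threshold else 0

# Hypothetical inputs x1..x3 and weights w1..w3: the weighted sum is about 0.5.
inputs = [1.0, 0.5, -1.0]
weights = [0.6, 0.4, 0.3]
fires = perceptron_output(inputs, weights, threshold=0.4)
stays_silent = perceptron_output(inputs, weights, threshold=0.6)
```

The same inputs fire the node or not depending only on the threshold, which is why the threshold is described as adjustable.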

The perceptron consists of the weights, the summation processor and an adjustable threshold processor, or bias input. A bias input may carry more weight than the regular inputs and so affects whether the activation function fires.

39

There are several learning algorithms used in neural networks. Backpropagation is one of the most popular and is the one used in this project.

The backpropagation algorithm propagates the error obtained in the output layer backwards, where the error is the difference between the value calculated in the nodes and the real or desired value. The propagation distributes the error and modifies the weights, or links, between the previous and the present nodes. Going backwards, the values of the nodes in the hidden layer can change, and so can the weights between the input and hidden layers, but not the values of the input nodes, as those are the values of the variables we are using. Once the algorithm reaches the input layer, it goes forward again with the modified weights and calculates the results in the output layer again. This process is repeated until a minimum error is reached.
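The forward pass, backward error propagation and weight update can be sketched with a small network in plain Python. This is an illustrative sketch under assumptions, not the code used in the project: one hidden layer of 3 nodes, a bipolar sigmoid activation, a learning rate of 0.1, a fixed number of epochs instead of a stopping threshold, and hypothetical training samples already scaled to the range -1 to 1.

```python
import math
import random

def bipolar_sigmoid(x):
    """Activation with outputs in (-1, 1); its derivative is (1 - f(x)^2) / 2."""
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

def train(samples, hidden=3, rate=0.1, epochs=2000, seed=1):
    """Train a one-hidden-layer network by backpropagation; return the error per epoch."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Each node has one weight per input plus a bias weight (last entry).
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)]
    history = []
    for _ in range(epochs):
        total_error = 0.0
        for inputs, target in samples:
            x = list(inputs) + [1.0]  # bias input fixed at 1
            # Forward pass: input layer -> hidden layer -> output node.
            h = [bipolar_sigmoid(sum(w * v for w, v in zip(ws, x))) for ws in w_hidden]
            hb = h + [1.0]
            out = bipolar_sigmoid(sum(w * v for w, v in zip(w_out, hb)))
            # Backward pass: distribute the output error to the hidden nodes.
            error = target - out
            total_error += abs(error)
            delta_out = error * (1.0 - out * out) / 2.0
            delta_hidden = [delta_out * w_out[j] * (1.0 - h[j] * h[j]) / 2.0
                            for j in range(hidden)]
            # Weight update: the input values themselves are never changed.
            for j in range(hidden + 1):
                w_out[j] += rate * delta_out * hb[j]
            for j in range(hidden):
                for i in range(n_in + 1):
                    w_hidden[j][i] += rate * delta_hidden[j] * x[i]
        history.append(total_error)
    return history

# Hypothetical scaled samples: ((input 1, input 2), target output).
samples = [((-1.0, -1.0), 0.2), ((-1.0, 1.0), 0.8), ((1.0, -1.0), -0.8),
           ((1.0, 1.0), -0.2), ((0.0, 0.0), 0.0)]
history = train(samples)
```

The returned list plays the role of a learning error history: the total error per pass over the training samples, which should shrink as the weights are adjusted.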

Diagram (one node scheme): the inputs GOLD and SET and a BIAS input, with weights w1, w2 and w3, feed a perceptron whose activation function f (BipolarSigmoidFunction) produces the Output.

As the diagram shows, two input attributes and one bias in the first layer pass their weighted values forward to the perceptron, which sums the inputs and sends the sum to the output layer.

The output layer fires through the activation function. The entire process runs with 20 nodes in the first layer to produce one output layer.

The following steps describe how it works.
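The one node scheme can be written out as a single forward calculation. The sketch below assumes the usual definition of a bipolar sigmoid, f(x) = 2 / (1 + e^(-alpha x)) - 1; the steepness parameter `alpha`, the function names and the example weights are assumptions for illustration, and the inputs are assumed to be scaled beforehand.

```python
import math

def bipolar_sigmoid(x, alpha=2.0):
    """Squash x into the open interval (-1, 1); alpha is an assumed steepness."""
    return 2.0 / (1.0 + math.exp(-alpha * x)) - 1.0

def node_output(gold, set_index, w1, w2, w3, bias=1.0):
    """Weighted sum of GOLD, SET and BIAS, passed through the activation function f."""
    net = gold * w1 + set_index * w2 + bias * w3
    return bipolar_sigmoid(net)

# Hypothetical scaled inputs and weights.
output = node_output(gold=0.4, set_index=-0.2, w1=0.7, w2=0.3, w3=0.1)
```

Because the output always lies between -1 and 1, a prediction in Baht has to be mapped back from that interval to the real value range.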

40

Learning Process
• Split the data into 2 sets: 85% for training and 15% for validating.
• Randomly generate 20 values for each of the gold price and SET weights from the training set.
• Generate the weights between the nodes.
• Compare the accuracy of the outputs against the actual data (validating set).
• Calculate the learning errors.
• Adjust for the output errors to improve the results.
• Supply a new batch of the training set and repeat the process until a minimum learning error is reached.

Implementation
• Gold price data range : 1062 – 1413
• SET data range : 684 – 1047
• 1 year data range : Jan-01-2010 to Dec-31-2010
• 24 hours total learning process time
• Query statement from SQL Server
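The data ranges above matter because an activation function such as the bipolar sigmoid only outputs values between -1 and 1, so raw prices are usually mapped into that interval first. The presentation does not state how the inputs were prepared, so the min-max scaling below is an assumption, as are the function names; only the two ranges come from the list above.

```python
def scale(value, low, high):
    """Map value from the range [low, high] to [-1, 1] (min-max scaling)."""
    return 2.0 * (value - low) / (high - low) - 1.0

def unscale(scaled, low, high):
    """Map a value in [-1, 1] back to the original range [low, high]."""
    return (scaled + 1.0) / 2.0 * (high - low) + low

# Ranges taken from the Implementation list.
GOLD_RANGE = (1062.0, 1413.0)
SET_RANGE = (684.0, 1047.0)

gold_input = scale(1200.0, *GOLD_RANGE)  # hypothetical gold price
set_input = scale(800.0, *SET_RANGE)     # hypothetical SET index value
```

Values outside the stated ranges would map outside the interval, which is one reason the learning range has to be chosen to cover the data the network is expected to see.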

The video shows how the learning process works: it keeps trying to recognize the pattern against the actual values and to solve the problem with an equation. (Internet connection required.) Or follow this link:

http://www.youtube.com/watch?v=7ghfX6kK5bo

41

Performance

Because the learning process takes so long, it was limited to 24 hours for this experiment, which gave a total error of 33.43 and an average error of 0.14.

It takes only a few minutes to generate a result if the data range is a month or 10 days, but the performance drops as a result.

Figures: One year Baht/USD – Gold Price Result; One year Baht/USD – SET Result

This validation gave a predicted Baht/USD of 33.01, an error of 0.16 compared to the actual rate of 33.17.

42

Even though the 2009 values, gold price 1091.50 and SET 681.91, were not included in the data range used for learning, the NN still recognizes the similar pattern that occurred in 2010 and tries to generate a similar output.

The pattern does not rely on the gold price alone; SET helps the NN classify the pattern as well. For instance, 2009 and 2010 had the same gold price of 1091.40 but different SET values, 686.41 and 784.38, so the Baht/USD result varies with the SET input too.

Figure: Predicted Result vs Actual

The learning error history shows that the closer the error gets to zero, the more accurate the NN result. As the NN algorithm goes back and forth to find the weights that allow it to predict the output variable, the weights change from their randomly generated initial values to the final ones, which give a total error of 33.43. Across the pairs of predicted and actual values, the learning history shows an average error of 0.14, a minimum of 0.0002 and a maximum of 0.58.
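Total, average, minimum and maximum error figures of this kind can be derived from the predicted and actual pairs as follows. The sketch assumes the error of a pair is the absolute difference between the two values, which the presentation does not state explicitly. The first pair (33.01 predicted, 33.17 actual) is the validation example from the text; the other two pairs are hypothetical.

```python
def error_summary(predicted, actual):
    """Summarize the absolute differences between predicted and actual values."""
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    return {
        "total": sum(errors),
        "average": sum(errors) / len(errors),
        "min": min(errors),
        "max": max(errors),
    }

predicted = [33.01, 32.50, 31.90]
actual = [33.17, 32.45, 31.90]
summary = error_summary(predicted, actual)
```

The total grows with the number of pairs, so the average is the figure to compare between runs over data ranges of different length.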

43

Neural Network learning implementation video (Internet connection required). Or follow this link: http://www.youtube.com/watch?v=VRiMbG6XIpk

44

Summary

To answer the financial department's request to predict the Thai Baht against USD exchange rate, a Neural Network is the final step of this experiment. It takes the input attributes classified by the Decision Tree and Naïve Bayes, with the analysis done through SQL Database and SSAS, to predict the Baht/USD movement as numeric data. The experiment also covers data pattern recognition with several approaches, i.e. classification, segmentation, approximation and backpropagation.

References

1. Neural Network on C#, by Andrew Kirillov
2. Delivering Business Intelligence, by Brian Larson
3. Neural Network, from Wikipedia
4. Back Propagation, from Wikipedia
5. Decision Tree, from Wikipedia
6. Naïve Bayes, from Wikipedia
