
AN INTRODUCTORY REVIEW OF MACHINE LEARNING ALGORITHMS AND THEIR APPLICATION TO DATA MINING

IMPERIAL COLLEGE LONDON


DEPARTMENT OF MECHANICAL ENGINEERING

GUY RIESE

19/12/2014


Abstract

This review aims to provide an introduction to machine learning by reviewing literature on the subject of supervised and unsupervised machine learning algorithms, the development of applications and data mining. In supervised learning, the focus is on regression and classification approaches. ID3, bagging, boosting and random forests are explored in detail. In unsupervised learning, hierarchical and K-means clustering are studied. The development of a machine learning application starts by collecting and preparing data, then choosing and training an algorithm and finally using the application. Large data sets are on the rise with the growing use of the World Wide Web, opening up opportunities in data mining, where it is possible to extract knowledge from raw data. It is found that machine learning has a vast range of applications in everyday life and industry. The elementary introduction provided by this review offers the reader a sound foundation with which to begin experimentation and exploration of machine learning applications in more depth.


Contents

1 Introduction
  1.1 Objectives
2 Supervised Machine Learning Algorithms
  2.1 Regression
  2.2 Classification Decision Tree Learning
    2.2.1 ID3
    2.2.2 Bagging and Boosting
    2.2.3 Random Forests
3 Unsupervised Machine Learning Algorithms
  3.1 Clustering
    3.1.1 Hierarchical Clustering: Agglomerative and Divisive
    3.1.2 K-means
4 Steps in developing a machine learning application
  4.1 Collect Data
  4.2 Choose Algorithm
  4.3 Prepare Data
  4.4 Train Algorithm
  4.5 Verify Results
  4.6 Use Application
5 Data Mining
6 Discussion
  6.1 Literature
  6.2 Future Developments
7 Conclusion
8 References
9 Acknowledgements


1 Introduction

Computers solve problems using algorithms: step-by-step instructions that the computer follows sequentially to process a set of inputs into a set of outputs. These algorithms are typically written line-by-line by computer programmers. But what if we don't have the expertise or fundamental understanding to be able to write the algorithm for a program? For example, consider filtering spam emails from genuine emails (Alpaydin, 2010). For this problem, we know the input (an email) and the output (identifying it as spam or genuine) but we don't know what actually classifies it as a spam email. This lack of understanding often arises when there is some intellectual human involvement in the problem we are trying to solve. In this example, the human involvement is that a human wrote the original spam email. Similarly, humans are involved in handwriting recognition, natural language processing and facial recognition. These problems are something that our subconscious handles effortlessly, yet we don't consciously understand the fundamentals of the process. For sequential logical tasks, like sorting a list alphabetically, we consciously understand the fundamental process and can therefore program a solution (an algorithm). But this isn't possible for more complex tasks where the process is more of an unknown 'black box'.

Machine learning gives us the tools to solve these 'black box' problems. "What we lack in knowledge, we make up for in data" (Alpaydin, 2010). Using the spam example, we can use a data set of millions of emails, some of which are spam, in order to 'learn' what defines a spam email. The learning principles are derived from statistical approaches to data analysis. In this way, we do not need to understand the process, but we can construct an accurate and functional model (a 'black box') to approximate it. Whilst this doesn't explain the fundamental process, it can identify patterns and regularities that allow us to reach solutions.

Artificial intelligence was conceived in the mid-20th century, but it was not until the 1980s that its more statistical branch, machine learning, began to separate off and become a field in its own right (Russell, 2010). Machine learning developed a scientific approach to solving problems of prediction and finding patterns in data. This quickly had value in industry, which fuelled the academic exploration further. Entering the 21st century, we have seen a rapid rise in machine learning's popularity, largely due to the emergence of large data sets and the demand for data mining processes to extract knowledge from them. Machine learning has since established itself as a leading field of computer science with applications ranging from detecting credit card fraud to medical diagnosis.

Data mining is the process of extracting knowledge from data (Kamber, 2000). With the rise of large data sets ('big data'), data mining has thrived. Data mining tasks can be categorised as either descriptive or predictive. A descriptive task involves extracting qualitative characteristics of data; for example, segmenting a database of customers into groups in order to find trends within those groups. A predictive task involves using the existing data to make predictions on future data inputs; for example, how can we learn from our existing customers which products might be favoured by a new customer?


Machine learning is a vast subject with masses of literature, and one of the main challenges in understanding it is knowing where to start. This review will introduce the two main approaches of machine learning: supervised and unsupervised learning. We consider some of the more general and flexible machine learning algorithms in these categories that are relevant to data mining, and introduce some methods of optimising them. Additionally, this review will set out the steps needed to develop a machine learning application to solve a specific problem. Finally, we relate this theoretical and practical understanding to the application of data mining. With this knowledge, the reader will have a strong machine learning foundation enabling them to approach problems and interpret relevant research themselves.

1.1 Objectives

1. Understand the background of Machine Learning. What are some of the key approaches and applications?

2. Understand some of the different mechanisms behind Machine Learning processes.

3. Explore machine learning algorithms and the decision making process of a machine learning program.

4. How do you develop a machine learning application?

5. Case/Application Focus: Investigate machine learning in relation to data mining.

6. Briefly discuss key areas for future development of this technology.

2 Supervised Machine Learning Algorithms

The aim of a supervised machine learning algorithm is to learn how inputs relate to outputs in a data set and thereby produce a model able to map new inputs to inferred outputs (Ayodele, 2010). A complete set of labelled training data is therefore a prerequisite for any supervised learning task. A general equation for this can be defined as follows (Alpaydin, 2010):

$y = h(x \mid \theta)$  (Eq. 2.1)

Where the output, $y$, is given by the function $h$, which depends on the inputs, $x$, and the parameters, $\theta$. The role of the supervised machine learning algorithm is to optimise the parameters ($\theta$) by minimising the approximation error and thereby producing the most accurate outputs. In layman's terms, this means that existing 'right answers' are used to predict new answers to the problem; the algorithm learns from examples (Russell, 2010). We are unequivocally telling the algorithm what we want to know and actively training it to solve our problem. Supervised learning consists of two fundamental stages: i) training and ii) prediction.

Building a bird classification system is a problem that can be solved with a supervised machine learning algorithm (Harrington, 2012). Start by taking characteristics of the object you are trying to classify, called features or attributes. For a bird classification system, these could be weight, wingspan, whether the feet are webbed and the colour of its back. In reality, you can have a very large number of features rather than just four (Ng, 2014). The features can be of different types. In this example, weight and wingspan are numeric (decimal), whether the feet are webbed is simply yes or no (binary) and, if you choose a selection of say 7 different colours, then each 'back colour' is just an integer. According to Eq. 2.1, we want to find a function ($h$) which we can use to determine the bird species ($y$) given inputs of particular features ($x$). To achieve this, we require training data (i.e. data on the weight, wingspan, etc. of a number of bird species). The training data is used (stage (i)) to determine the parameters ($\theta$) which define the function $h$. It is unlikely this will be perfectly accurate, so we can compare the outputs of our function on a test set (where we already know the true outputs) in order to measure the accuracy. Provided the function is accurate, we can use our model to predict bird species given new inputs of weight, wingspan, etc., perhaps entered by users trying to identify a bird (stage (ii)).

This example is extremely simplistic and leaves many questions unanswered: how do we choose the features, how do we reach a definition for the model/function $h$, how do we optimise our algorithm for maximum accuracy, and how could we deal with imperfect training data (noise)? The sections which follow seek to answer these questions. Regression and classification are both supervised learning tasks where a model is defined with a set of parameters. A regression solution is appropriate when the output is continuous, whereas a classification solution is used for discrete outputs (Ng, 2014; Harrington, 2012).
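The two stages can be sketched in a few lines of Python. The toy feature vectors, the species labels and the use of scikit-learn's decision tree classifier below are illustrative assumptions rather than part of the bird example above.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features: [weight (kg), wingspan (cm), webbed feet (0/1), back colour (0-6)]
X = [[1.2, 80, 1, 3],
     [1.1, 77, 1, 3],
     [4.5, 210, 0, 1],
     [4.8, 205, 0, 1],
     [1.3, 82, 1, 3],
     [4.6, 208, 0, 1]]
y = ["gull", "gull", "albatross", "albatross", "gull", "albatross"]

# Stage (i): training on labelled examples, holding some back as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# Stage (ii): predicting the species for a new, unlabelled feature vector
print("prediction:", model.predict([[1.0, 75, 1, 3]]))
```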

2.1 Regression

In regression analysis the output is a random variable ($y$) and the input the independent variable ($x$). We seek to find the dependence of $y$ on $x$. The mean dependence of $y$ on $x$ gives us the function and model ($h$) that we are seeking to define (Kreyszig, 2006). The most basic form of regression, using just one independent variable, is called univariate linear regression. This can be used to produce a straight line function:

$h(x) = \theta_0 + \theta_1 x$  (Eq. 2.2)

By finding $\theta_0$ and $\theta_1$ it is therefore possible to fully define the model. In seeking to choose $\theta_0$ and $\theta_1$ so that $h$ is as close to our $(x, y)$ values as possible, we must minimise the sum of squared errors (Stigler, 1981; Freitas, 2013; Beyad & Maeder, 2013):

$J(\theta_0, \theta_1) = \sum_{i=1}^{n} \left( h(x_i) - y_i \right)^2$  (Eq. 2.3)

To minimise this function, we can apply the gradient descent algorithm known as the method of steepest descent (Ng, 2014; Bartholomew-Biggs, 2008; Kreyszig, 2006; Snyman, 2005; Akaike, 1974). Gradient descent is a numerical method used to minimise a multivariable function by iterating away from a point along the direction which causes the largest decrease in the function (the direction with the most negative gradient or ‘downwards steepness’). The equation for gradient descent is as follows:

$\theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$  (Eq. 2.4)

Figure 2.1. Gradient descent. (Kreyszig, 2006)


Where $j = 0, 1$ for this case of two unknowns. $\alpha$ is the step size taken and is known as the learning rate. The value of the learning rate determines a) whether gradient descent converges to the minimum or not and b) how quickly it converges. If the learning rate is too small, gradient descent can be slow. On the other hand, if the learning rate is too large, the steps taken may overshoot and miss the minimum. Figure 2.1 illustrates gradient descent from a starting point of $x_0 = \theta_j^{(0)}$ iterating to $x_1 = \theta_j^{(1)}$ and $x_2 = \theta_j^{(2)}$. Eventually this will reach the minimum, which lies at the centre of the innermost circle. An analogy for gradient descent is walking on the side of a hill in a valley surrounded by thick fog, aiming to get to the bottom of the valley. Even though you cannot see where the bottom of the valley is, as long as each step you take slopes downwards, you will eventually reach the bottom. Gradient descent is not the fastest minimisation method; however, it offers a distinct approach which is used repeatedly in many machine learning optimisation problems. Furthermore, it scales well with larger data sets (Ng, 2014), which is a significant factor in real life applications. Sub-gradient projection is a possible alternative to the descent method; however, it is typically slower than gradient descent (Kiwiel, 2001). With an appropriate learning rate, gradient descent serves as a reliable and effective tool for minimisation problems.

Hence, by finding values for the parameters ($\theta_j$), we are able to find an equation for the model ($h$). If this model can predict values of $y$ for novel examples, we say that it 'generalises' well (Russell, 2010). In this example, we have applied only linear regression (a 1-degree polynomial). It is possible to increase the hypothesis ($h$) to a polynomial of a higher degree, whereby the fit is more accurate (curved). However, as you increase the degree of the polynomial, you increase the risk of over-fitting the data; there is a balance to be struck between fitting the training data well and producing a model that generalises better (Sharma, Aiken & Nori, 2014). The main approach for dealing with this problem is to use the principle of Ockham's razor: use the simplest hypothesis consistent with the data (Allaby, 2010). For example, a 1-degree polynomial is simpler than a 7-degree polynomial, so although the latter may fit training data better, the former should be preferred. It is possible to further simplify models by reducing the number of features being considered; this is achieved by discarding features which do not appear relevant (Ng, 2014; Russell, 2010). Regression is a simple yet powerful tool which can be used to teach a program to understand data inputs and accurately predict data outputs through machine learning processes.
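As a concrete illustration, a minimal NumPy sketch of univariate linear regression fitted by batch gradient descent (Eqs. 2.2-2.4) follows; the data points and the learning rate are arbitrary choices for demonstration, not values from the literature.

```python
import numpy as np

# Univariate linear regression h(x) = theta0 + theta1*x fitted by batch gradient descent.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])          # roughly y = 2x, with a little noise

theta0, theta1 = 0.0, 0.0
alpha = 0.01                                      # learning rate (step size)
for _ in range(5000):
    h = theta0 + theta1 * x                       # current hypothesis (Eq. 2.2)
    # Partial derivatives of J(theta0, theta1) = sum_i (h(x_i) - y_i)^2  (Eq. 2.3)
    grad0 = 2 * np.sum(h - y)
    grad1 = 2 * np.sum((h - y) * x)
    theta0 -= alpha * grad0                       # step in the direction of steepest descent (Eq. 2.4)
    theta1 -= alpha * grad1

print(theta0, theta1)                             # should approach the least-squares line
```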


2.2 Classification Decision Tree Learning

Decision trees are a flowchart-like method of classifying a set of data inputs. The input is a vector of features and the output is a single 'decision' (Russell, 2010). In the simplest case the output is binary: it can either be true (1) or false (0). A decision tree performs a number of tests on the data by asking questions about the input in order to filter and categorise it. This is a natural way to model how the human brain thinks through solving problems; many troubleshooting tools and "How-To" manuals are structured like decision trees. The tree begins at the root node, extends down branches through nodes of classification tests (decision nodes) and finally ends at 'leaf' (terminal) nodes (Criminisi & Shotton, 2013). The aim is to develop a decision tree using training data which can then be used to interpret and classify novel data for which the classification is unknown.

The first step in the decision tree learning process is to induce or 'grow' a decision tree from initial training data. We take input features/attributes and transform these into a decision tree based on example outputs provided in the training data. In the example in Figure 2.2, the features are Patrons (how many people are currently sitting in the restaurant), WaitEstimate (the wait estimated by the front of house), Alternate (whether there is another restaurant option nearby), Hungry (whether the customer is already hungry) and so on. The output is a decision on whether to wait for a table or not. The decision tree learning algorithm employs a 'greedy' strategy of testing the most divisive attribute first (Russell, 2010). Each test divides the problem up further into sub-problems which will eventually classify the data. It is important that the training data set is as complete as possible in order to prevent decision trees being induced with mistakes. If the algorithm does not have an example for a particular scenario (e.g. a WaitEstimate of 0-10 minutes when Patrons is full) then it could output a tree which consistently makes the wrong decision for this scenario.

One of the mathematical ways in which decision tree divisions are quantifiably scored is with the measure of information gain ($\mathit{InfoGain}$) (Myles et al., 2004; Mingers, 1989). $\mathit{InfoGain}$ measures how effectively a decision node divides the example data. It is based on the concept of information ($\mathit{Info}$) defined by Eq. 2.5 (Myles et al., 2004):

$\mathit{Info} = - \sum_{j} \dfrac{N_j(t)}{N(t)} \log_2 \dfrac{N_j(t)}{N(t)}$  (Eq. 2.5)

Where $N_j(t)$ is the number of examples in category $j$ at the node $t$ and $N(t)$ is the number of examples at the node $t$. The maximum change in information from being processed by a decision node is defined by Eq. 2.6 (Myles et al., 2004):

$\mathit{InfoGain} = \mathit{Info}(\mathrm{Parent}) - \sum_{k} p_k \, \mathit{Info}(\mathrm{Child}_k)$  (Eq. 2.6)

Figure 2.2. A decision tree for deciding whether to wait for a table. (Russell, 2010)


Where $p_k$ is the proportion of examples that are filtered into the $k$th category. The optimal decision node is therefore the node which maximises this 'change in information'. Despite this quantification, there are usually several decision trees capable of classifying the data. To choose the optimal decision tree, an inductive bias is employed (Mitchell, 1997). The inductive bias depends on the particular type of decision tree algorithm and will be explored in Section 2.2.1.

Once a decision tree has been grown, the decision tree algorithm may prune the tree (Russell, 2010; Myles et al., 2004). This combats overfitting whilst dealing with noisy data by removing irrelevant decision nodes (Quinlan, 1986). The algorithm must also separately identify and remove features which do not aid the division of examples. The statistical method employed for this is the chi-squared significance test (supported by both Quinlan (1986) and Russell (2010)), known as chi-squared pruning. The data is analysed under the null hypothesis of 'no underlying pattern'. The degree to which the deviation observed in the data departs from what this hypothesis would predict is calculated, and a cut-off of, say, 5% significance is applied. In this way, noise in the training data is handled and the tree design is optimised.

Multiple decision tree algorithms exist, exhibiting a variety of approaches. However, the most effective use of them is often to combine them into an ensemble in order to obtain better predictive performance than any of the individual algorithms alone. Section 2.2.1 will explore the ID3 decision tree learning algorithm, which aims to induce the simplest possible tree. Sections 2.2.2 and 2.2.3 explore some ensemble methods.
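Before turning to specific algorithms, a short numeric illustration of Eqs. 2.5 and 2.6, with made-up example counts, shows how a candidate split would be scored:

```python
from math import log2

def info(counts):
    # Eq. 2.5: Info = -sum_j (N_j/N) log2 (N_j/N), taken over non-empty categories
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

# Hypothetical node holding 6 "wait" and 6 "leave" examples, split by an attribute
# into three children containing (4, 0), (2, 2) and (0, 4) examples respectively.
parent = [6, 6]
children = [[4, 0], [2, 2], [0, 4]]

# Eq. 2.6: parent information minus the proportion-weighted child information
info_gain = info(parent) - sum(sum(c) / sum(parent) * info(c) for c in children)
print(round(info_gain, 3))   # 1.0 - 1/3 = 0.667; larger values indicate a more divisive test
```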

2.2.1 ID3

The majority of classification decision tree learning algorithms are variations on an original central methodology first proposed as the ID3 algorithm (Quinlan, 1986) and later refined to the C4.5 algorithm (Quinlan, 1993). The characteristics of decision tree algorithms discussed previously apply to ID3, but it has some subtleties and limitations too. One of these is that pruning does not apply to ID3, as it does not re-evaluate decision tree solutions after it has selected one.

Instead, the approach taken by the ID3 algorithm is to iterate with a top-down greedy search through the possible decision tree outputs, from the simplest possible solution and gradually increasing complexity until the first valid solution. Each decision tree output is known as a hypothesis; these are effectively different possible solutions for the model or function $h$. This unidirectional approach reaches a consistently satisfactory decision tree without expensive computation (Quinlan, 1986). However, it implies the algorithm never backtracks to reconsider earlier choices (Mitchell, 1997). The core decision making lies in deciding which attribute makes the optimal decision node at each point. This is solved using the statistical property $\mathit{InfoGain}$ discussed earlier. ID3's approach is known as a hill-climbing search, starting with an empty tree and building the decision tree from the top down.

Figure 2.3. Searching through decision tree hypotheses from simplest to increasing complexity as directed by information gain. (Mitchell, 1997)

This approach has advantages and disadvantages (Mitchell, 1997; Quinlan, 1986). It can be considered a positive capability that ID3 in theory considers all possible decision tree permutations. Some other algorithms take the major risk of evaluating only a portion of the search space in order to gain speed, but this can lead to inaccuracy. On the other hand, a problem with ID3 is its 'goldfish memory' approach of only considering the current decision tree hypothesis at any one time. This means that it does not actually assess how many different viable decision trees there are; it simply picks the first it reaches, making post-selection pruning redundant. We consider ID3 an important algorithm to understand because it serves as a core algorithm from which many extensions have been developed. It can easily be modified to utilise pruning and handle noisy data, as well as optimised for less common conditions.

It is important to consider why ID3's inductive bias towards simpler decision trees is reasonable. The Ockham's razor approach (Allaby, 2010) advises giving preference to the simplest hypothesis that fits the data, but stating this does not make it optimal. Why is the simplest solution the best choice? It can be argued that scientists tend to follow this bias, possibly because there are far fewer simple hypotheses than complex ones, so a simple hypothesis that fits the data is less likely to do so by coincidence and more likely to be a genuinely accurate generalisation (what we aim to reach in machine learning) (Mitchell, 1997). Also, there is evidence that this approach will be consistently faster at reaching a solution, due to only considering a portion of the data set (Quinlan, 1986). On the other hand, there are contradictions in this approach. It is entirely possible to obtain two different solutions from the exact same data, simply because the iterations by ID3 take two different paths. This is likely to be acceptable in most applications but may be a crucial complication for others (Mitchell, 1997).

The C4.5 algorithm (Quinlan, 1993) extended the original ID3 algorithm with increased computational efficiency, the ability to handle training data with missing attributes, the ability to handle continuous attributes (rather than just discrete ones) and various other improvements. One of the most significant modifications allowed a new approach to determining the optimal decision tree solution. Choosing the first simple valid solution can be problematic if there is noise in the data. This is solved by allowing the production of trees which overfit the data and then pruning them post-induction. Despite sounding like a longer process, this new approach was found to be more successful in practice (Mitchell, 1997).

The ID3 algorithm can be considered a basic but effective algorithm for building decision trees. With refinement to the C4.5 algorithm, it is competent at producing an adequate solution without requiring vast computing resources. For this reason, it is extremely well supported and commonly implemented across numerous programming languages. It is considered a highly credible algorithm used in engineering (Shao et al., 2001), aviation (Yan, Zhu & Qiang, 2007) and wherever automated or optimal decision making processes are required.
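To make the greedy, top-down construction concrete, the following is a minimal ID3-style sketch in Python. It assumes discrete attributes, a dictionary representation of examples and no pruning or noise handling; it illustrates the search strategy rather than reproducing Quinlan's full algorithm.

```python
from collections import Counter
from math import log2

def entropy(labels):
    # Eq. 2.5 applied to a list of class labels
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, attr):
    # Eq. 2.6: parent entropy minus the weighted entropy of the children produced by attr
    n = len(labels)
    children = {}
    for x, y in zip(examples, labels):
        children.setdefault(x[attr], []).append(y)
    return entropy(labels) - sum(len(ys) / n * entropy(ys) for ys in children.values())

def id3(examples, labels, attrs):
    if len(set(labels)) == 1:                       # all examples agree -> leaf
        return labels[0]
    if not attrs:                                   # no attributes left -> majority-vote leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(examples, labels, a))   # greedy choice
    rest = [a for a in attrs if a != best]
    partitions = {}
    for x, y in zip(examples, labels):
        partitions.setdefault(x[best], ([], []))
        partitions[x[best]][0].append(x)
        partitions[x[best]][1].append(y)
    # Recurse on each branch; the tree is a nested dict keyed by attribute values
    return {best: {value: id3(xs, ys, rest) for value, (xs, ys) in partitions.items()}}

# Toy restaurant-style examples (hypothetical values, in the spirit of Figure 2.2)
examples = [{"patrons": "full", "hungry": "yes"},
            {"patrons": "some", "hungry": "no"},
            {"patrons": "none", "hungry": "no"},
            {"patrons": "full", "hungry": "no"}]
labels = ["wait", "wait", "leave", "leave"]
print(id3(examples, labels, ["patrons", "hungry"]))
```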

2.2.2 Bagging and Boosting

Bagging and boosting are ensemble techniques, which means they use multiple learning algorithms to improve the overall performance of the machine learning system (Banfield et al., 2007). In decision tree learning, this helps to produce the optimal decision tree (rather than just a valid one). The optimal decision tree is one that has the lowest error rate in predicting outputs $y$ for data inputs $x$ (Dietterich, 2000a). Bagging and boosting improve the performance by manipulating the training data before it is fed into the algorithm.


Bagging is an abbreviation of 'Bootstrap AGGregatING' (Pino-Mejas et al., 2004) and was first developed by Leo Breiman in 1994. Bagging takes subset samples from the full training set to produce groups of training sets called "bags" (Breiman, 1996). The key methodology of bagging is to take $m$ examples, with replacement, from the original training set; each bag ends up containing approximately 63.2% of the original training set (Dietterich, 2000a). Boosting was first developed by Freund and Schapire in 1995 and similarly manipulates the example training data in order to improve the performance of the decision tree learning algorithm (Freund & Schapire, 1996; Freund & Schapire, 1995; Freund, 1995). The key differentiator of boosting is that it assigns a weight to each example proportional to the error in predicting that example (Banfield et al., 2007). Misclassified examples are given an incrementally greater weighting in each iteration of the algorithm. In subsequent iterations, the algorithm focuses on examples with a greater weighting (favouring examples which are harder to classify over those which are consistently classified correctly).

Breiman (1996) identified that bagging improves the performance of unstable learning algorithms but tends to reduce the performance of more stable algorithms. Decision tree learning algorithms, neural networks and rule learning algorithms are all unstable, whereas linear regression and K-nearest neighbour (Larose, 2005) algorithms are very stable. The improvements offered by bagging and boosting are therefore very relevant to decision tree learning. But why do bagging and boosting improve the performance of unstable algorithms whilst degrading stable ones? The main components of error in machine learning algorithms can be summarised as noise, bias and variance. An unstable learning algorithm is one where small changes in the training data cause significant fluctuation in the response of the algorithm (i.e. high variance) (Dietterich, 2000a). In both bagging and boosting, the training data set is perturbed to reduce the variance of the data by adding linear classifying models, hence making the algorithm more stable (Skurichina & Duin, 1998). The effect of this is to shift the focus of the algorithm to the most relevant region of the training data. On the other hand, adding linear models to an already stable model will make no difference, except that fewer examples will be considered to reach the same solution.

A machine learning algorithm is considered accurate if it produces a model $h$ with an accuracy greater than ½ (i.e. the decision tree results in greater accuracy than if each decision made was a 50/50 split). Algorithms are tested to this limit by adding noise to the training data. Noisy data is training data which contains mislabelled examples. Noise is problematic for boosting and has been shown to considerably reduce its classification performance (Long & Servedio, 2009; Dietterich, 2000b; Dietterich, 2000a; Freund & Schapire, 1996). This poor performance is intuitive, given that the boosting method converges on the examples that are hardest to classify: mislabelled data is the hardest to classify and fruitless to focus on, hence the fatal flaw of boosting. Critically, Long & Servedio (2009) showed that the most common boosting algorithms, such as AdaBoost and LogitBoost, reduced to accuracies of less than ½ on high-noise data, rendering them meaningless.
Conversely, when directly comparing the effectiveness of bagging and boosting, Dietterich (2000b) found that bagging was "clearly" the better method. Bagging actually uses the noise to generate a more diverse collection of decision tree hypotheses, and therefore introducing noise to the training data can even improve its accuracy. However, experimental results have shown that when there is no noise in the training data, boosting gives the best results (Banfield et al., 2007; Lemmens & Croux, 2006; Dietterich, 2000b; Freund & Schapire, 1996). In conclusion, when deciding between machine learning algorithms, an important factor to consider is confidence in the consistency of the training data being provided: boosting is ideal when the data is clean, but bagging is more consistent in the presence of noise.
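This behaviour can be checked empirically. The sketch below assumes scikit-learn is available and uses its stock bagging and AdaBoost implementations on a synthetic data set; the data set and parameter choices are arbitrary, and `flip_y` simply injects label noise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

for noise in (0.0, 0.2):  # fraction of mislabelled training examples
    X, y = make_classification(n_samples=1000, n_features=20, flip_y=noise, random_state=0)
    bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
    boosting = AdaBoostClassifier(n_estimators=50, random_state=0)
    for name, model in (("bagging", bagging), ("boosting", boosting)):
        score = cross_val_score(model, X, y, cv=5).mean()   # cross-validated accuracy
        print(f"noise={noise:.0%} {name}: {score:.3f}")
```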


A possible solution to this dilemma is explored by Alfaro, Gamez & Garcia (2013), where features of both bagging and boosting are combined in the design of a new classification decision tree learning algorithm: adabag. The common goal of both bagging and boosting is to improve accuracy by modifying the training data. Based on the AdaBoost algorithm, adabag allows analysis of the error as the ensemble is grown, reducing the problem with noise. Bagging and boosting are effective techniques for improving the predictive performance of machine learning algorithms when applied to decision tree learning. By generating an ensemble of decision trees and finding the optimal hypothesis analytically, accuracy is increased.

2.2.3 Random Forests

Random forests are another ensemble learning technique used to improve the performance of algorithms in decision tree learning. The technique was originally developed by Breiman (2001), who trademarked the term. Random forests were an improvement on his previous technique, bagging (Breiman, 1996). Instead of choosing one optimal decision tree, a random forest builds multiple trees and takes the modal (majority-vote) hypothesis as the result. Although there is no single best algorithm for every situation (Wolpert & Macready, 1997), random forests have proved to be a generally top performer without requiring tuning or adjustment, and notably outperform both bagging and boosting on accuracy and speed (Banfield et al., 2007; Svetnik et al., 2003; Breiman, 2001).

Breiman (2001) found that random forests favourably share the noise-proof properties of bagging. When compared against AdaBoost, random forests showed little deterioration with 5% noise, whereas AdaBoost's performance dropped markedly. This is because the random forest technique does not increase weights on specific subsets, so the added noise has negligible effect, whilst AdaBoost's convergence towards mislabelled examples causes its accuracy to spiral. This being said, there is always room for improvement and the random forest technique is by no means perfect. The mechanism of voting by the decision trees in the forest is one possible area for improvement (Robnik-Sikonja, 2004). Margin is a measure of how much a particular hypothesis is favoured over other hypotheses by the forest's decision trees. By weighting each hypothesis vote with the margin, Robnik-Sikonja (2004) found that the prediction accuracy of random forests improves significantly.

Decision trees are a natural choice in the development of machine learning programs. Within decision tree learning, there are a number of different algorithms and techniques, including the ones explored here plus others such as CART, CHAID and MARS. Decision trees are important because they perform well with large data sets and are intuitive to use. Furthermore, techniques such as random forests can improve the robustness, accuracy and speed of the learning method.
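In practice, a random forest of the kind described above is straightforward to try. The brief scikit-learn sketch below (synthetic data, default settings) grows each tree on a bootstrap sample, considers a random subset of features at each split, and predicts by majority vote.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each trained on a bootstrap sample with random feature subsets at each split
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))   # majority vote of the trees
```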

3 Unsupervised Machine Learning Algorithms

In unsupervised learning, the onus of learning falls even more heavily on the computer program than on the developer. Where in supervised learning you have a full set of inputs and outputs in your data, in unsupervised learning you only have inputs. The machine learning algorithm must use this input data alone to extract knowledge. In statistics, the equivalent problem is known as density estimation: the problem of finding any underlying structure in the unlabelled data (Alpaydin, 2010).


3.1 Clustering

The main unsupervised learning method is clustering: finding groups within the input data set. For example, a company may want to group their current customers in order to target groups with relevant new products and services. To do this, the company could take their database of customers and use an unsupervised clustering algorithm to divide it into customer segments. The company can then use the results to build better relationships with their customers. In addition to identifying groups, the algorithm will identify outliers who sit outside of these groups. These outliers might reveal a niche that wouldn't otherwise have been noticed. There are over 100 published clustering algorithms. This review will focus on the two most used approaches to clustering: hierarchical clustering and K-means clustering.

3.1.1 Hierarchical Clustering: Agglomerative and Divisive

As suggested by the name, hierarchical clustering clusters in hierarchies. Each level of clusters in the hierarchy is a combination of the clusters below it, whereby the 'clusters' at the bottom of the hierarchy are single observations and the top cluster contains the entire data set (Hastie, 2009). Hierarchical clustering is split into two sub-approaches: agglomerative (bottom-up) and divisive (top-down), as in Figure 3.1. In the agglomerative approach, clusters start out as individual data inputs and are merged into larger clusters until one cluster containing all the inputs is reached. Divisive is the reverse, starting with the cluster containing all data inputs and subdividing into smaller clusters until reaching individual inputs or a termination condition, such as the distance between the two closest clusters exceeding a certain amount (Kamber, 2000). The most common form of hierarchical clustering is agglomerative. Dendrograms provide a highly comprehensible way of interpreting the structure of a hierarchical clustering algorithm in a graphical format, as illustrated in Figure 3.2. Agglomerative hierarchical methods are broken down into single-link methods, complete-link methods, centroid methods and more; the difference between these methods is how the distance between clusters/groups is measured.

The single-link method, also known as nearest neighbour clustering (Rohlf, 1982), can be defined by the following distance linkage function $D$ (Gan, 2007):

$D(C, C') = \min_{x \in C,\, y \in C'} d(x, y)$  (Eq. 3.1)

Figure 3.2. Dendrogram from agglomerative (bottom up) clustering technique based on data on human tumors. (Hastie, 2009)

Figure 3.1. Agglomerative and divisive hierarchical clustering. (Gan, 2007)


Where $C$ and $C'$ are two nonempty and non-overlapping clusters. The Euclidean distance (Gan, 2007) in $n$ dimensions is:

$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}$  (Eq. 3.2)

This is used in the agglomerative approach to find clusters/groups with the minimum Euclidean distance between them to join for the next level up in the hierarchy. This procedure repeats until all clusters are encompassed by one cluster of the entire data set.

One of the main reasons hierarchical clustering is such a popular approach is the easily human-interpretable dendrogram format with which it can be represented (Hastie, 2009). Additionally, any reasonable method of measuring the distance between clusters can be used provided it can be applied to matrices. However, hierarchical clustering occasionally encounters difficulty with merge/split points (Kamber, 2000). In a hierarchical structure, this is critical as every point following a merge/split is derived from that decision. Therefore, if this decision is made poorly, the entire output will be low-quality. A number of hierarchical methods built from the fundamentals of this approach have been designed to solve the typical issues it is prone to, including BIRCH (Zhang, Ramakrishnan & Livny, 1997) and CURE (Yun-Tao Qian, Qing-Song Shi & Qi Wang, 2002).

Hierarchical clustering is a simple but extremely flexible approach for applying unsupervised learning to any data set. It can be used as an assistive tool to allow specialists to make best use of their skill. For example, in medical applications such as analysis of EEG graphs, hierarchical clustering is used to identify and group sections that are alike whilst the neurologist can evaluate the medical meaning of these areas (Guess & Wilson, 2002). In this way, the work is delegated to make best use of each individual/component: the computer does the systematic analysis and the neurologist provides the medical insight.
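A minimal sketch of single-link agglomerative clustering, assuming SciPy is available and using made-up two-dimensional observations, is shown below; `linkage` builds the merge hierarchy using the single-link rule of Eq. 3.1 with the Euclidean distance of Eq. 3.2, and `fcluster` cuts the resulting dendrogram into a chosen number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical observations drawn from two well-separated groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

Z = linkage(X, method="single", metric="euclidean")   # agglomerative merge hierarchy
labels = fcluster(Z, t=2, criterion="maxclust")        # cut the hierarchy into 2 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw a dendrogram like Figure 3.2.
```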

3.1.2 K-means

K-means is one of the most common approaches to clustering. First demonstrated by MacQueen (1966), it is designed for quantitative data and defines clusters by a centre point (the mean). The algorithm begins with the initialisation phase, where the number of clusters/centres is fixed. Then the algorithm enters the iteration phase, iterating the positions of these centres until they reach a final central rest position (Gan, 2007). The final rest position occurs when the error function does not change significantly for further iterations. The algorithm is as follows (Hastie, 2009):

1. For a given set of $k$ clusters, $C$, minimise the total cluster variance of all data inputs with respect to $\{m_1, \ldots, m_k\}$, yielding the means of the current clusters.

2. Given the means of the current clusters $\{m_1, \ldots, m_k\}$, assign each data input to the closest (current) cluster mean.

3. Repeat until the assignments no longer change.

The function being minimised is as follows (Hastie, 2009):

$C^{*} = \min_{C,\, \{m_k\}_1^K} \sum_{k=1}^{K} N_k \sum_{C(i)=k} \lVert x_i - m_k \rVert^2$  (Eq. 3.3)


Where $x$ represents the data inputs and $N_k = \sum_{i=1}^{N} I(C(i) = k)$. Therefore $N$ data inputs are assigned to the $k$ clusters so that the distance between the data inputs and their cluster mean is minimised.
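A minimal NumPy sketch of the two alternating steps is given below; the random initialisation, the synthetic two-blob data and the convergence test are simplifying assumptions (for example, empty clusters are not handled).

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means sketch following the two alternating steps above."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]   # simple random initialisation
    for _ in range(n_iter):
        # Assignment step: attach each input to its closest current mean
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: recompute each mean from the inputs assigned to it
        new_centres = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centres, centres):                 # assignments/means have settled
            break
        centres = new_centres
    return centres, assign

# Hypothetical two-blob data set
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
centres, assign = kmeans(X, k=2)
print(centres)
```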

A key advantage of using K-means is that it is computationally efficient even with large data sets. The computational complexity is linearly proportional to the size of the data set, rather than exponential (Hastie, 2009). Despite this linear scaling, however, it can be slow on high-dimensional data beyond a critical size (Harrington, 2012; Hastie, 2009).

The performance of K-means is heavily dependent on the initialisation phase. Not only must the number of clusters $k$ be defined but also the initial positions of the centres. The number of clusters $k$ depends on the goal you are trying to achieve in the analysis and is usually well defined in the problem, for example, creating $k$ customer segments, employing $k$ sales people, etc. Alternatively, if this information is unavailable, a common "rule of thumb" is to set $k$ in proportion to the number of inputs in the data set (Mardia, 1979):

$k \approx \sqrt{N/2}$  (Eq. 3.4)

For the algorithm to perform well, it is important to take a reliable approach to defining the initial cluster means. Fortunately, popular solutions to this problem have been proposed: the Forgy approach (Anderberg, 1973), the MacQueen approach (MacQueen, 1966) and the Kaufman approach (Kaufman, 1990). In comparing these, it has been found that the Kaufman approach generally produces the best clustering results (Peña, Lozano & Larrañaga, 1999). In the Kaufman approach, the initial cluster means are found iteratively. The starting point is the input data point closest to the centre of the data set; following this, centres are chosen by selecting input data points with the highest number of other data points around them.

One of the earliest applications of K-means was in signal and data processing. For example, it is used for image compression, where a 24-bit image with up to 16 million colours can be compressed to an 8-bit image with only 256 (Alpaydin, 2010). The problem is finding the optimal 256 colours out of the 16 million in order to retain image quality in compression. This is a problem of vector quantisation. K-means is still used for this application today.

Figure 3.3. A demonstration of iterations by the K-means clustering algorithm for simulated input data points. (Hastie, 2009)
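As a sketch of this vector quantisation idea (using scikit-learn's K-means and a random stand-in array rather than a real photograph), each pixel is replaced by the nearest of 256 learned palette colours:

```python
import numpy as np
from sklearn.cluster import KMeans

image = np.random.randint(0, 256, size=(64, 64, 3))       # stand-in for a real RGB photo
pixels = image.reshape(-1, 3).astype(float)                # one row per pixel

km = KMeans(n_clusters=256, n_init=1, random_state=0).fit(pixels)
palette = km.cluster_centers_.astype(np.uint8)             # the learned 256-colour palette
compressed = palette[km.labels_].reshape(image.shape)      # each pixel -> nearest palette colour
print(compressed.shape, palette.shape)
```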


The standard K-means algorithm serves its purpose well, but suffers from some limitations and drawbacks. For this reason, it has been modified, extended and improved in numerous publications (Chen, Ching & Lin, 2004; Wagstaff et al., 2001). The techniques employed include a) finding better initial solutions (as discussed above), b) modifying the original algorithm and c) incorporating techniques from other algorithms into K-means. Wagstaff et al. (2001) recognised that the experimenter running the algorithm is likely to have some background knowledge on the data set being analysed. By communicating this knowledge to the algorithm, through adding additional constraints in the clustering process, Wagstaff et al. (2001) improved the performance of K-means from 58% to 98.6%. In a separate experiment, Chen, Ching & Lin (2004) found that incorporating techniques from hierarchical methods into K-means increased clustering accuracy. This literature shows that K-means is a versatile approach to clustering which can be tailored to specific problems in order to significantly improve its accuracy.

4 Steps in developing a machine learning application

So far this review has focused on the theoretical background of machine learning techniques. This section considers practically applying this theoretical knowledge to data related problems in any field of work, from collecting data through to use of the application (Harrington, 2012).

4.1 Collect Data

The first step is to collect the data you wish to analyse. Sources of data may include scraping a website for data, extracting information from an RSS feed or API, existing databases, running an experiment to collect data and other sources of publicly available data.

4.2 Choose Algorithm

There are a huge number of machine learning algorithms out there, so how do we choose the right one? The first decision is between supervised learning and unsupervised learning. If you are attempting to predict or forecast, then you should use supervised learning; you will also need training data with a set of inputs connected to outputs. Otherwise, you should consider unsupervised learning. At the next level, choose between regression or classification (supervised learning) and clustering or density estimation (unsupervised learning). Finally, at the last level, there are tens of different algorithms you could use under each of these categories. There is no single best algorithm for all problems (Harrington, 2012; Wolpert & Macready, 1997). Understanding the properties of the algorithms is helpful, but to find the best algorithm for your problem, the practical strategy is to test different algorithms and choose by trial and error (Salter-Townshend et al., 2012).

4.3 Prepare Data

The next step is to prepare the data in a usable format. Certain algorithms require the features/training data to be formatted in a particular way, but this is usually trivial. The data first needs to be cleaned, integrated and selected (Zhang, Zhang & Yang, 2003; Kamber, 2000). Data cleaning involves filling in any missing values in features of the training data, removing noise, filtering out outliers and correcting inconsistent data.


To fill in missing values, you can take a biased or unbiased approach: a biased approach uses a probable value to fill in the missing entry, whereas an unbiased approach simply removes the feature/example completely. The biased approach is popular when a large proportion of values are missing. Noise causes random error and variance in the data; it is reduced by binning (Shi & Yu, 2006) or clustering the data in order to isolate and remove outliers. Data integration is simply merging data from multiple sources. Data selection is the problem of selecting the right data from the sample to use as the training data set. Generally, the method of selecting data is heavily dependent on the type of data being filtered; however, Sun et al. (2013) explored an innovative generalised approach for classification using dynamic weights, putting greater weight on the data associated with the most features and eliminating redundant ones, with promising results.
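Returning to the cleaning steps, a small pandas sketch on a hypothetical customer table illustrates two of them: a biased fill of a missing value with a probable (mean) value, and binning of a numeric feature to smooth noise.

```python
import pandas as pd

# Hypothetical customer table with a missing age value
df = pd.DataFrame({"age": [34, None, 29, 51],
                   "income": [42_000, 38_500, 61_200, 58_000]})

df["age"] = df["age"].fillna(df["age"].mean())                 # biased fill with a probable value
df["income_band"] = pd.cut(df["income"], bins=3,               # binning smooths a noisy feature
                           labels=["low", "mid", "high"])
print(df)
```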

4.4 Train Algorithm

Now that all the data is cleaned and optimised, we can proceed to train the algorithm (for supervised learning). For unsupervised learning, this stage is simply running the algorithm on the data, as we don't have target values to train with. For both learning types, this is where the artificially intelligent 'machine learning' occurs and where the real value of machine learning algorithms is exploited (Russell, 2010). The output of this step is raw 'knowledge'.

4.5 Verify Results

Before using the new-found 'knowledge', it is important to verify/test it. In supervised learning, you can test the model you have created against your existing real data set to measure its accuracy. If it is not satisfactory, you can go back to the initial data preparation stages and optimise. Verifying the accuracy of unsupervised learning algorithms is significantly more challenging and beyond the scope of this review.
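For a supervised model, this verification step can be sketched as follows (synthetic data and a decision tree chosen purely for illustration): hold back part of the labelled data, train on the remainder and measure accuracy on the held-out portion.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))    # accuracy on unseen examples
print(f"held-out accuracy: {accuracy:.2f}")
# If this is unsatisfactory, return to the data preparation or algorithm choice steps.
```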

4.6 Use Application

Finally, you can use the knowledge produced by your algorithm. Depending on the nature of your machine learning problem, the raw data output may be sufficient or you may choose to produce visualisations of the results (Leban, 2013). The beauty of machine learning is that we do not need to program a solution to the problem line by line; the machine learning algorithm will learn from data using statistical analysis instead. But the machine learning algorithm still needs to be developed itself. Fortunately, there is no single piece of software or programming language that you must use to prepare your machine learning application. The most commonly used languages and environments are Python, Octave, R and Matlab (Ng, 2014; Freitas, 2013; Alfaro, Gamez & Garcia, 2013; Harrington, 2012). Python is one of the most widely used because of its clear syntax, simple text manipulation and established use throughout industries and organisations (Harrington, 2012). With this information, you are now equipped with the knowledge and practical know-how to develop a machine learning application.


5 Data Mining

In the last few centuries, human innovation has accelerated rapidly. With the invention of the World Wide Web and the adoption of new technologies on a global scale, we are using technology like never before. The by-product of the Information Age is vast amounts of data, growing from terabytes to petabytes and exabytes, with immense hidden value (Goodman, Kamath & Kumar, 2007). The sheer size of these databases and data sets makes it impossible for a human to comprehend or analyse them manually. Data mining is the use of machine learning approaches to extract underlying information and knowledge from data (Kamber, 2000). The knowledge can contribute greatly to business strategies or to scientific and medical research.

The format of the knowledge extracted depends on the machine learning algorithm used. If supervised learning approaches are applied, it is possible to identify patterns in the data that can be used to model it (Kantardzic, 2011). Pattern recognition and learning is one of the most widely applied uses of data mining and machine learning. Unsupervised approaches are also used in data mining; unsupervised learning makes it possible to identify natural groupings in data. The main application of this in data mining is feature learning, whereby useful features are extracted from a large data set which can then be used for classification (Coates & Ng, 2012).

Applications of data mining can be seen in medicine, telecommunications, finance, science, engineering and more. For example, in medicine, machine learning is frequently being used to improve the diagnosis of medical conditions such as cancer and schizophrenia. Data mining of clinical data such as MRI scans allows computers to learn how to recognise cancers and underlying conditions in new patients more reliably than doctors (Savage, 2012; Ryszard S Michalski, Ivan Bratko & Miroslav Kubat, 1998). In finance, data mining is now being used to assist the evaluation of the credit risk of individuals and companies ahead of providing financial support through loans (Correia et al., 1993). This is arguably the most important stage in the process of offering a loan, but firms have previously struggled to accurately predict the risk of default. With the large data sets that have been accumulated in this domain, data mining is providing new insights and patterns to help financial organisations accurately manage these risks.

Data mining does not yet have any great social stigma attached to it. However, there are ethical issues and social impacts to consider. For example, web mining involves scraping data from the internet and mining it for knowledge (Etzioni, 1996). This data can often include personal data from web users, which is used for the profit of organisations (the web miners) (Van Wel & Royakkers, 2004). Current research suggests that no harm is currently being done to web users as a result of this, but with the rise of 'big data' there is growing demand for regulation and for ensuring that the power of data mining is used for 'good' (Etlinger, 2014). As long as users remain in control and fully understand the data they offer when using the web, the threat to privacy can be neutralised. However, the risk of this line of consent and understanding becoming blurred is high. It is important for governments and organisations to acknowledge this and take a pro-active approach to regulation.


6 Discussion

6.1 Literature

In writing this review it has become clear that supervised machine learning algorithms essentially apply statistical approaches to data analysis in a scalable way. In fact, one of the best technical sources of information on regression and gradient descent was a mathematics textbook (Kreyszig, 2006). It provided a clear explanation of the techniques despite not directly relating them to machine learning. This demonstrates how far machine learning has come in its scientific and mathematical approach since originally branching out of artificial intelligence. The separation was originally caused by statistical analysis no longer being supported within artificial intelligence; however, it turned out that within these statistical analysis approaches (machine learning) lay the most practical discoveries and applications of all.

Unsupervised learning is perhaps more closely related to artificial intelligence. The frequently

cited textbook by Russell (2010) titled “Artificial Intelligence” actually served as an excellent

source of insight into unsupervised machine learning algorithms, particularly hierarchical

algorithms and the K-means approach. This is probably because unsupervised learning deals

with the more mysterious (affiliated with artificial intelligence) type of data: unlabelled data.

Additionally, it seeks to extract knowledge or ‘intelligence’ from this data. Unsupervised

learning is particularly applicable to data mining through the application of feature learning.

With feature learning, it is possible to take a huge set of data uninterpretable by humans and

turn it into something that you can perform intricate data analysis on and obtain realised value.
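
To make the feature-learning idea concrete, the sketch below clusters unlabelled examples with K-means and then uses each example's distances to the learned centroids as features for a classifier, loosely in the spirit of Coates & Ng (2012). It is a minimal illustration only; the scikit-learn library and its built-in digits data set are assumptions made here and are not drawn from the works reviewed.

    # Hypothetical sketch of K-means feature learning, loosely in the spirit of
    # Coates & Ng (2012); the library and data set choices are illustrative only.
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # A built-in set of handwritten digit images stands in for a large mined data set.
    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Unsupervised step: learn 50 centroids from the training data (labels unused).
    kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(X_train)

    # Feature learning: represent each example by its distances to the centroids.
    F_train = kmeans.transform(X_train)
    F_test = kmeans.transform(X_test)

    # Supervised step: classify using the learned feature representation.
    clf = LogisticRegression(max_iter=1000).fit(F_train, y_train)
    print("Accuracy on unseen data:", clf.score(F_test, y_test))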

It was surprising to find that with just the elementary principles covered in this review it is possible to get started on real machine learning applications, as became apparent when discussing the review with professionals in industry.

6.2 Future Developments

Machine learning is still a young scientific field with huge opportunities for growth and development. Rather than working only on large static data sets, it is important to devise methods of applying machine learning to transient data and data streams (Gama, 2012). There are significant challenges in maintaining an accurate decision model when the data used to develop that model is continually changing.

It has also become clear that a bias-variance trade-off exists in supervised learning problems (Sharma, Aiken & Nori, 2014). Bias and variance are both sources of error: ideally the model should closely fit the training data but also generalise effectively to new data. Past research has focused on reducing variance-related error. However, as data sets grow larger (Cambria et al., 2013), it becomes important to produce models that fit those larger data sets closely, so there is a need to focus more specifically on bias-related error.
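
As a concrete illustration of the trade-off, the sketch below fits polynomial regression models of increasing flexibility to noisy samples of a sine curve: a low-degree model tends to underfit (high bias), while a very high-degree model fits the training data closely but tends to generalise poorly (high variance). The scikit-learn library and the synthetic data are assumptions made for this sketch and are not taken from the cited work.

    # Hypothetical illustration of the bias-variance trade-off (not from the cited papers).
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 1, (40, 1))
    y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 40)  # noisy observations of a sine curve
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    for degree in (1, 4, 15):
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X_train, y_train)
        train_err = mean_squared_error(y_train, model.predict(X_train))
        test_err = mean_squared_error(y_test, model.predict(X_test))
        # Degree 1 typically underfits (high bias); degree 15 typically overfits (high variance).
        print(f"degree {degree:2d}: training MSE {train_err:.3f}, test MSE {test_err:.3f}")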

We now have access to more computational power than ever before. However, when comparing computing technology to the human brain, there is a clear discrepancy between the two in terms of how fast data is processed and how much energy is consumed to do so (Norvig, 2012). A computer can process data 100 million times faster than the brain but requires 20,000 watts of power to do so, whereas the brain consumes just 20 watts to do the same. Yet machine learning systems are still only just managing to become as effective as the brain. We need to allocate resources to understanding the brain and using it to inspire circuit and machinery design, in order to make artificial intelligence and learning processes more efficient.

7 Conclusion

There are two main approaches to machine learning: supervised learning and unsupervised learning. These can be further broken down by the algorithms used to complete supervised and unsupervised learning tasks. In supervised learning, the main types are regression and classification, with algorithms such as gradient descent, ID3, bagging, boosting and random forests. In unsupervised learning, the algorithms covered include hierarchical and K-means clustering.

Machine learning can be applied to facial recognition, medical diagnosis, search engines, shopping cart recommendation systems and much more. The common indicator of a good application is that a large source of data exists related to the problem. Machine learning algorithms can then use their tailored decision making to translate that data into usable knowledge, producing value.

The process of developing a machine learning application is summarised as follows: start by collecting data, choose an appropriate algorithm, prepare the data, train the algorithm with sample data, verify the results and finally apply the knowledge produced by the algorithm.
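
A minimal sketch of this workflow is given below. It assumes the scikit-learn library, its built-in iris data set as a stand-in for collected data, and a random forest as the chosen algorithm; these are illustrative assumptions rather than choices made anywhere in this review.

    # Minimal sketch of the development process summarised above (illustrative only).
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    # 1. Collect data (a built-in sample data set stands in for collected data).
    X, y = load_iris(return_X_y=True)

    # 2. Choose an appropriate algorithm (a random forest classifier in this sketch).
    model = RandomForestClassifier(n_estimators=100, random_state=0)

    # 3. Prepare the data: scale features and hold out a test set for verification.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    scaler = StandardScaler().fit(X_train)
    X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

    # 4. Train the algorithm with the sample (training) data.
    model.fit(X_train, y_train)

    # 5. Verify the results on unseen data.
    print("Verification accuracy:", accuracy_score(y_test, model.predict(X_test)))

    # 6. Apply the knowledge: classify a new, previously unseen measurement.
    new_sample = scaler.transform([[5.1, 3.5, 1.4, 0.2]])
    print("Predicted class:", model.predict(new_sample)[0])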

Data mining is a growing application of machine learning, as the World Wide Web and the Information Age have introduced data sets on a scale like never before. Going forward, it is important that data mining is used ethically and not to the detriment of web users.

As most of the development in machine learning has happened in the past 30 years, there is still much to be done. We should continue to use the human brain as a North Star to guide further research. The goal is to realise true artificial intelligence by improving machine learning algorithms, which may one day compete with the performance of our own brains.

8 References

Akaike, H. (1974) A new look at the statistical model identification. IEEE Transactions on Automatic Control. AC-19 (6), 716-723.

Alfaro, E., Gamez, M. & Garcia, N. (2013) adabag: An R Package for Classification with Boosting and Bagging. Journal of Statistical Software. 54 (2), 1-35.

Allaby, M. (2010) Ockham's razor, A Dictionary of Ecology. Oxford University Press.


Alpaydin, E. (2010) Introduction to machine learning. 2nd edition. Cambridge, Mass. ; London, MIT Press.

Anderberg, M. R. (1973) Cluster analysis for applications. New York ; London, Academic Press.

Ayodele, T. O. (2010) Types of Machine Learning Algorithms, New Advances in Machine Learning, Yagang Zhang (Ed.), ISBN: 978-953-307-034-6, InTech.

Banfield, R. E., Hall, L. O., Bowyer, K. W. & Kegelmeyer, K. W. (2007) A comparison of decision tree ensemble creation techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence. 29 (1), 173-180.

Bartholomew-Biggs, M. (2008) Nonlinear Optimization with Engineering Applications. Dordrecht, Springer.

Beyad, Y. & Maeder, M. (2013) Multivariate linear regression with missing values. Analytica Chimica Acta. 796 (0), 38-41.

Breiman, L. (1996) Bagging predictors. Machine Learning. 24 (2), 123-140.

Breiman, L. (2001) Random Forests. Machine Learning. 45 (1), 5-32.

Cambria, E., Huang, G., Zhou, H., Vong, C., Lin, J., Yin, J., Cai, Z., Liu, Q., Li, K., Feng, L., Ong, Y., Lim, M., Akusok, A., Lendasse, A., Corona, F., Nian, R., Miche, Y., Gastaldo, P., Zunino, R., Decherchi, S., Yang, X., Mao, K., Oh, B., Jeon, J., Toh, K., Kim, J., Yu, H., Chen, Y. & Liu, J. (2013) Extreme Learning Machines. IEEE Intelligent Systems. 28 (6), 30-59.

Chen, J., Ching, R. K. H. & Lin, Y. (2004) An extended study of the K-means algorithm for data clustering and its applications. Journal of the Operational Research Society. 55 (9), 976-987.

Coates, A. & Ng, A. Y. (2012) Learning feature representations with K-means. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 7700, 561-580.

Correia, J., Costa, E., Ferreira, J. & Jamet, T. (1993) An Application of Machine Learning in the Domain of Loan Analysis. Lecture Notes in Computer Science. 667, 414-419.

Criminisi, A. & Shotton, J. (2013) Decision Forests for Computer Vision and Medical Image Analysis.

Dietterich, T. (2000a) Ensemble methods in machine learning. Multiple Classifier Systems. 1857, 1-15.

Dietterich, T. (2000b) An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning. 40 (2), 139-157.

Etlinger, S. (2014) What do we do with all this big data? TED.com, https://www.ted.com/talks/susan_etlinger_what_do_we_do_with_all_this_big_data.


Etzioni, O. (1996) The World-Wide Web: Quagmire or Gold Mine? Communications of the ACM. 39 (11), 65-68.

Freitas, N. d. (2013) Machine Learning Lecture Course. University of British Columbia; Oxford University.

Freund, Y. & Schapire, R. E. (1996) Experiments with a new boosting algorithm. ICML. pp.148-156.

Freund, Y. (1995) Boosting a weak learning algorithm by majority. Information and Computation. 121 (2), 256-285.

Freund, Y. & Schapire, R. E. (1995) A decision-theoretic generalization of on-line learning and an application to boosting. Lecture Notes in Computer Science. 904, 23-37.

Gama, J. (2012) A survey on learning from data streams: current and future trends. Progress in Artificial Intelligence. 1 (1), 45-55.

Gan, G. (2007) Data clustering: theory, algorithms, and applications. Philadelphia, PA, Society for Industrial and Applied Mathematics.

Goodman, A., Kamath, C. & Kumar, V. (2007) Statistical analysis and data mining: Data analysis in the 21st century. Statistical Analysis and Data Mining.

Guess, M. J. & Wilson, S. B. (2002) Introduction to hierarchical clustering. Journal of Clinical Neurophysiology. 19 (2), 144-151.

Harrington, P. (2012) Machine learning in action. Shelter Island, N.Y., Manning Publications.

Hastie, T. (2009) The elements of statistical learning: data mining, inference, and prediction. 2nd edition. New York, Springer.

Kamber, M. (2000) Data mining: concepts and techniques. San Francisco; London, Morgan Kaufmann.

Kantardzic, M. (2011) Data Mining Concepts, Models, Methods, and Algorithms. 2nd edition. Hoboken, Wiley.

Kaufman, L. (1990) Finding groups in data: an introduction to cluster analysis. Wiley.

Kiwiel, K. C. (2001) Convergence and efficiency of subgradient methods for quasiconvex minimization. Mathematical Programming, Series B. 90 (1), 1-25.

Kreyszig, E. (2006) Advanced engineering mathematics. 9th, International edition. Hoboken, N.J., Wiley.

Larose, D. T. (2005) k-Nearest Neighbor Algorithm. Hoboken, NJ, USA.

Leban, G. (2013) Information visualization using machine learning. Informatica (Slovenia). 37 (1), 109-110.


Lemmens, A. & Croux, C. (2006) Bagging and boosting classification trees to predict churn. Journal of Marketing Research.

Long, P. M. & Servedio, R. A. (2009) Random classification noise defeats all convex potential boosters. Machine Learning, 1-18.

MacQueen, J. B. (1966) Some methods for classification and analysis of multivariate observations.

Mardia, K. V. (1979) Multivariate analysis. London, Academic Press.

Mingers, J. (1989) An empirical comparison of selection measures for decision-tree induction. Machine Learning. 3 (4), 319-342.

Mitchell, T. M. (1997) Machine learning. Boston, Mass., WCB/McGraw-Hill.

Myles, A. J., Feudale, R. N., Liu, Y., Woody, N. A. & Brown, S. D. (2004) An introduction to decision tree modeling. Journal of Chemometrics. 18 (6), 275-285.

Ng, A. (2014) Machine Learning (Coursera), Stanford University. coursera.org.

Norvig, P. (2012) Artificial intelligence: A new future. New Scientist. 216 (2889), vi-vii.

Peña, J. M., Lozano, J. A. & Larrañaga, P. (1999) An empirical comparison of four initialization methods for the K-Means algorithm. Pattern Recognition Letters. 20 (10), 1027-1040.

Pino-Mejías, R., Cubiles-de-la-Vega, M., López-Coello, M., Silva-Ramírez, E. & Jiménez-Gamero, M. (2004) Bagging Classification Models with Reduced Bootstrap. In: Fred, A., Caelli, T., Duin, R. W., Campilho, A. & de Ridder, D. (eds.). Springer Berlin Heidelberg. pp. 966-973.

Quinlan, J. R. (1993) C4.5: programs for machine learning. Amsterdam, Morgan Kaufmann.

Quinlan, J. R. (1986) Induction of decision trees. Machine Learning. 1 (1), 81-106.

Robnik-Sikonja, M. (2004) Improving random forests. Machine Learning: ECML 2004, Proceedings. 3201, 359-370.

Rohlf, F. J. (1982) Single-link clustering algorithms. Handbook of Statistics. 2, 267-284.

Russell, S. J. (2010) Artificial intelligence: a modern approach. 3rd, International edition. Boston, Mass.; London, Pearson.

Michalski, R. S., Bratko, I. & Kubat, M. (1998) Machine learning and data mining: methods and applications. Chichester, Wiley.

Salter-Townshend, M., White, A., Gollini, I. & Murphy, T. B. (2012) Review of statistical network analysis: models, algorithms, and software. Statistical Analysis and Data Mining. 5 (4), 243-264.

Savage, N. (2012) Better Medicine Through Machine Learning. Communications of the ACM. 55 (1), 17-19.


Shao, X., Zhang, G., Li, P. & Chen, Y. (2001) Application of ID3 algorithm in knowledge acquisition for tolerance design. Journal of Materials Processing Tech. 117 (1), 66-74.

Sharma, R., Aiken, A. & Nori, A. V. (2014) Bias-variance tradeoffs in program analysis.

Shi, T. & Yu, B. (2006) Machine Learning and Data Mining - Binning in Gaussian kernel regularization. Statistica Sinica. 16 (2), 541-568.

Skurichina, M. & Duin, R. P. W. (1998) Bagging for linear classifiers. Pattern Recognition. 31 (7), 909-930.

Snyman, J. A. (2005) Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-based Algorithms. Dordrecht, Springer-Verlag New York Inc.

Stigler, S. M. (1981) Gauss and the Invention of Least Squares. The Annals of Statistics. 9 (3), 465-474.

Sun, X., Liu, Y., Chen, H., Han, J., Wang, K. & Xu, M. (2013) Feature selection using dynamic weights for classification. Knowledge-Based Systems. 37, 541-549.

Svetnik, V., Liaw, A., Tong, C., Culberson, J., Sheridan, R. & Feuston, B. (2003) Random forest: A classification and regression tool for compound classification and QSAR modeling. Journal of Chemical Information and Computer Sciences. 43 (6), 1947-1958.

Van Wel, L. & Royakkers, L. (2004) Ethical issues in web data mining. Ethics and Information Technology. 6 (2), 129-140.

Wagstaff, K., Cardie, C., Rogers, S. & Schrödl, S. (2001) Constrained k-means clustering with background knowledge. ICML. pp.577-584.

Wolpert, D. H. & Macready, W. G. (1997) No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation. 1 (1), 67-82.

Yan, K., Zhu, J. & Qiang, S. (2007) The application of ID3 algorithm in aviation marketing.

Qian, Y., Shi, Q. & Wang, Q. (2002) CURE-NS: a hierarchical clustering algorithm with new shrinking scheme.

Zhang, S. C., Zhang, C. Q. & Yang, Q. (2003) Data preparation for data mining. Applied Artificial Intelligence. 17 (5-6), 375-381.

Zhang, T., Ramakrishnan, R. & Livny, M. (1997) BIRCH: A New Data Clustering Algorithm and Its Applications. Data Mining and Knowledge Discovery. 1 (2), 141-182.


9 Acknowledgements

The author would like to acknowledge and thank Dr Frederic Cegla (Senior Lecturer at Imperial College London) for his supervision of this literature review project. Thanks also go to Shaun Dowling (Co-founder at Interpretive.io), Barney Hussey-Yeo (Data Scientist at Wonga), Ferenc Huszar (Data Scientist at Balderton Capital) and Joseph Root (Co-founder at Permutive.com) for sharing their insights on machine learning.