Deep Learning in Big Data Analytics: Approaches, Applications and Tools

Abstract

Current trends in technology strive to automate the acquisition of new information, new skills and new ways of organising existing information. This gave rise to machine learning, and with it a new era in the field of AI was born. Many learning algorithms have since taken inspiration from the human brain, trying to emulate the way it processes complex input data, learns facts, and takes decisions to solve computationally complex tasks. Mapping these features of the human brain into a model led to the development of traditional neural networks; their early success in building models with input and output layers, weight vectors and biases provided the groundwork to proceed, but is by no means a complete and sufficient approach. Further, the large volume, variety, velocity and veracity of data available today demand much faster and more intelligent processing and learning. The initial neural networks and learning algorithms went through several iterations; multilayer perceptrons and Support Vector Machines each made their own contributions, and finally the concept of “Deep Learning” came into being. Big Data and Deep Learning are two major focuses of data science today and are often dealt with in parallel.

Our report aims to make an extensive study of the various aspects, approaches and applications of machine learning. It gives an overview of the learning tools available today and their relevance to learning from complicated big data. We take the KDD 99 Intrusion Detection Dataset to understand the impact of training an Artificial Neural Network on the data (using a ten-fold cross-validation technique). Its performance in terms of time and accuracy for various numbers of instances is analysed. It has been observed that the time taken with respect to the number of instances decreases as the number of layers is increased; the 2-, 3- and 6-layer MLPs show a super-linear increase, the 7-layer MLP shows a linear increase, while the 4- and 5-layer MLPs show a sub-linear increase in the time taken. The 4-layer MLP gives the best combined performance (in terms of accuracy and time taken) with a mean accuracy of 95.43%. However, it cannot be ignored that the accuracy decreases with the increase in the number of instances for each layer, without exception. This unsatisfactory real-time performance experienced in our experiment motivated us to migrate to the Deep Learning framework “Theano”.

Keywords: Big Data, Artificial Neural Networks, Deep Learning, Theano.


1. Introduction

Studies of neural networks have branched into diverse domains of network design, structure modelling, and performance improvement for faster learning and more accurate results [23]. Continuous efforts are made to develop useful models and relevant algorithms for data mining, image processing, weather forecasting, stock exchange prediction, etc. Shibata and Ikeda showed that the number of neurons and the number of hidden layers in the network can affect performance [24], because a network with a small number of layers can be processed faster than a large one. Their work focuses on the structural level of the neural network. Generally, increasing the number of hidden layers can increase the accuracy of learning, but it affects the learning time far more than a smaller network does. In addition, a large data set is difficult to learn in one pass; it requires both resources and time. Although deep architectures with multiple layers seemed the most enticing solution, researchers could not directly shift their paradigm because developing appropriate algorithms for them was a challenging task. Moreover, they lacked proper hardware support. Today, the momentum of large-scale deep networks has ignited Artificial Intelligence efforts in software and product giants like Google, Amazon and Facebook. However, the progress did not occur overnight. Perceptrons, with their ability to ‘learn’ and ‘sense’, were the first to come into the picture but were soon ruled out owing to fundamental limitations in their learning abilities. Later neural networks with multiple hidden layers can learn more complicated functions, but they lacked a good learning algorithm. The appearance of SVMs enlightened the field within a short time, since they simplify the learning procedure and perform well in many practical problems, but SVMs also encounter bottlenecks due to their shallow architectures [22]. With time, efficient algorithms, tools and hardware such as GPUs and FPGAs were developed that support deep architectures. The sheer size of data available today offers big opportunities and transformative potential on one hand, while it also presents unprecedented challenges to harnessing data and information on the other. As the data keeps getting bigger, Deep Learning is coming to play a key role in predictive data analysis.[1]

1.1 Machine Learning: Approaches and Applications

Machine Learning is a subset that emerged from the field of artificial intelligence; it aims to build intelligent systems that imitate human behaviour in learning and decision making, and that adapt to changes in the pattern and volume of data, or to changes in the system, without being explicitly programmed. Over the last two decades, Machine Learning has become one of the inevitable ingredients of information technology and, with that, a rather central, albeit usually hidden, part of our life. With ever-increasing amounts of data becoming available there is good reason to believe that smart data analysis will become even more pervasive as a necessary component for technological progress.[19] Machine Learning is the dream child of contributions from a number of disciplines: computational statistics for prediction analysis, pattern recognition to analyse the regularities of occurrence in data, and computational learning theory to characterise their computational complexities. It shares strong ties to mathematical optimization, which delivers methods, theory and application domains to the field. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms is infeasible. A well-known quote of Tom Mitchell provides a more formal definition of Machine Learning: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E".[19]

The big data revolution of the past 4-5 years has led to proposals of several methodologies to classify, store and, most importantly, train data of varying volume, velocity and veracity. This has increased the importance of machine learning manyfold. The trade-off is to employ a technique that accurately trains on the data without compromising on performance.

1.2 Applications of Machine Learning:

From NLP to speech processing, image processing, data mining, bioinformatics and network security, Machine Learning is used in diverse applications and areas.

We look into a few specific examples of applications of Machine Learning:

a. Web Page Ranking: Here, the search engine returns a sorted list of webpages given a query. To achieve this goal, a search engine needs to ‘know’ which pages are relevant and which pages match the query. Such knowledge can be gained from several sources: the link structure of webpages, their content, the frequency with which users follow the suggested links for a query, or from examples of queries in combination with manually ranked webpages. Increasingly, machine learning, rather than guesswork and clever engineering, is used to automate the process of designing a good search engine.

b. Collaborative Filtering: As before, we want to obtain a sorted list (in this case of articles). The key difference is that an explicit query is missing; instead we can only use past purchase and viewing decisions of the user to predict future viewing and purchase habits. The key side information here is the decisions made by similar users, hence the collaborative nature of the process. It is clearly desirable to have an automatic system to solve this problem, thereby avoiding guesswork and saving time.


c. Automatic Translation: Using a machine learning approach, we can simply use examples of translated documents to learn how to translate between two languages, instead of undertaking the arduous task of fully understanding a text before translating it using a curated set of rules crafted by a computational linguist well versed in the two languages we would like to translate.

d. Security Applications: If a system uses face recognition as one of the parameters to facilitate access control, it is desirable to have a system which learns which features are relevant for identifying a person, such as lighting conditions, facial expressions, whether the person is wearing glasses, hairstyle, etc. A machine learning approach proves very useful in this regard.

e. Named Entity Recognition: This is the problem of identifying entities, such as places, titles, names, actions, etc., in documents. While systems using hand-crafted rules can lead to satisfactory results, it is far more efficient to use examples of marked-up documents to learn such dependencies automatically, in particular if we want to deploy the system in many languages.

There are several approaches to Machine Learning, including Decision Tree Learning, Association Rule Learning, Artificial Neural Networks, Inductive Logic Programming, Support Vector Machines, Hidden Markov Models, Clustering, Bayesian Networks, Representation Learning, Similarity and Metric Learning, Sparse Dictionary Learning and Genetic Algorithms.

We will primarily focus on Artificial Neural Networks, Hidden Markov Models and Support Vector Machines in our discussion here. We will study their response to training on test data of various dimensions, various numbers of instances for each dimension, and their combinations.


2 Learning Approaches: Supervised, Unsupervised and Hybrid

2.1 Supervised learning

Supervised learning is characterised by having a training dataset or a supervisor that instructs or teaches the learning system the labels to associate with training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyses the training data and produces an inferred function, which is called a classifier (if the output is discrete) or a regression function (if the output is continuous). The inferred function should predict the correct output value for any valid input object. The learning algorithm must generalize concepts from the given training data and apply them to classify test data whose results are already known.

Steps for supervised learning:

Step 1: Determine the type of training examples.

Step 2: Gather a training set that is representative of the real-world use of the function.

Step 3: Determine the input feature representation of the learned function. The number of features should not be too large, because of the curse of dimensionality, but should contain enough information to accurately predict the output.

Step 4: Determine the structure of the learned function and the corresponding learning algorithm.

Step 5: Complete the design and then run the learning algorithm on the gathered training set.

Step 6: Evaluate the accuracy of the learned function. After parameter adjustment and learning, the performance of the resulting function should be measured on a test set that is separate from the training set.
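As a minimal illustration of these steps (assuming scikit-learn is available; the dataset, classifier choice and parameters are placeholders, not the configuration used later in this report):

```python
# A small sketch of the supervised-learning steps above.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                       # gather a representative training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=0)  # choose the structure of the learned function
clf.fit(X_train, y_train)                               # run the learning algorithm on the training set

print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))  # evaluate on a separate test set
```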

2.2 Unsupervised Learning

Unsupervised learning studies how systems can learn to represent particular input patterns in a way that reflects the statistical structure of the overall collection of input patterns. Contrary to supervised learning there are no supervisors or separation of training set and test data set; there are no explicit target outputs or environmental evaluations associated with each input, rather the unsupervised learner brings to bear prior biases as to what aspects of the structure of the input should be captured in the output. Owing to the “unlabelled” nature of data in unsupervised learning there is no error signal for a potentially feasible solution. The network itself decides what output is best for a given input and reorganizes accordingly.


Maximum likelihood density estimations encompass most of the principles of unsupervised learning. This incorporates notions like outputs conveying most of the information in the inputs, proper reconstruction of the inputs, perhaps subject to constraints such as being independent or sparse as well as reporting underlying causes of the input. Several different mechanisms have been proposed to put the above into effect; these include forms of Hebbian learning, the Boltzmann and Helmholtz machines, sparse-coding, various other mixture models, and independent components analysis.

2.3 Hybrid Learning

It is a combination of supervised and unsupervised learning that exploits the advantages of both. The balance between supervised and unsupervised learning depends on the type of experiment used and also on the nature of the data sets. Examples of hybrid learning include Apprenticeship Learning, Bandit Problems with Events, etc.

3 Learning Tools

Machine learning has spread its wings outside its cloister of academia and high-end programming circles in recent years. It has stirred up an ever-increasing interest worldwide, and very soon it will be the heart and soul of every industry and business. This sudden rise in demand cannot be attributed only to hardware growing cheaper and more powerful; the proliferation of free software and the increasing diversity of machine learning libraries deserve equal merit. Below we discuss machine learning tools that provide functionality for individual frameworks and development.

(i). Caffe: Developed by the Berkeley Vision and Learning Centre (BVLC) and community contributors, Caffe is a deep learning framework built with expression, speed and modularity in mind. Its expressive architecture encourages application and innovation, whilst its extensible code promotes development. With the speed to process 60M images per day on a single NVIDIA K40 GPU, Caffe is well suited to research and industrial deployment. It has set its benchmark as a power hub for many academic research projects, start-up prototypes, etc.

(ii). Cuda-convnet: A high-performance C++/CUDA implementation of convolutional neural networks. It trains using the back-propagation algorithm and can model arbitrary layer connectivity and network depth. Its other features include:


Efficient implementation of convolution in CUDA.

Supports arbitrary stride size at zero loss of efficiency (except that which comes from reducing the problem size).

Implicitly pads your images with an arbitrary-sized border of zeros without using any extra memory.

Supports block-random sparse connectivity at no performance cost.

Modular design makes it easy to add new layer types, neuron activation functions, or objectives if you should need them.

Mostly avoids use of temporary memory where it isn't strictly needed.

Optimizes multiple objectives simultaneously.

Saves checkpoints to disk as python pickled objects, so you can write python scripts to probe the mind of your neural net.

Capable of training on one batch of data while simultaneously loading the next from disk (or pre-processing it in some way, if necessary).

Numerically tests gradient computations for correctness.

(iii). Theano: A Python library that allows one to efficiently define, optimize and evaluate mathematical expressions involving multi-dimensional arrays. Theano's features include transparent use of a GPU to perform data-intensive calculations, efficient symbolic differentiation for functions of one or many inputs, dynamic C code generation, speed and stability optimizations, as well as extensive unit testing and self-verification.[7]
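As a small illustration (assuming Theano is installed), the snippet below defines a symbolic expression, compiles it into a callable function and uses symbolic differentiation; the values fed in are arbitrary:

```python
import theano
import theano.tensor as T

x = T.dmatrix('x')                      # symbolic 2-D array
y = 1 / (1 + T.exp(-x))                 # element-wise logistic function
logistic = theano.function([x], y)      # compile the expression graph (C code, optionally GPU)

print(logistic([[0.0, 1.0], [-1.0, -2.0]]))

# Symbolic differentiation comes essentially for free:
s = T.dscalar('s')
grad = theano.function([s], T.grad(s ** 2, s))
print(grad(3.0))                        # derivative of s^2 at s = 3 -> 6.0
```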

(iv). Torch 7: Torch is a scientific computing framework with wide support for machine learning algorithms. An underlying C/CUDA implementation with the fast scripting language LuaJIT makes Torch efficient and easy to use. Torch aims to give its users maximum flexibility and speed in building scientific algorithms while keeping the process extremely simple. A summary of core features includes a powerful N-dimensional array, routines for indexing, slicing and transposing, an amazing interface to C via LuaJIT, linear algebra routines, neural network and energy-based models, numeric optimization routines, fast and efficient GPU support, and embeddability, with ports to iOS, Android and FPGA backends. Right from its inception, Torch and Theano have been in constant competition with each other in terms of performance and speed.[10][11][12]


4 ANN as a Learning Tool and its Limitations in the context of Big Data Analytics

4.1 ANN and its variations

Artificial neural networks are relatively crude electronic networks of neurons whose information-processing paradigm is influenced by the way biological neurons in the brain process information. The brain is capable of computationally demanding perceptual acts (e.g. recognition of faces, speech) and control activities (e.g. body movements and body functions). The advantage of the brain is its effective use of massive parallelism, its highly parallel computing structure, and its imprecise information-processing capability. With the intention to mimic the computational capabilities, processing and learning of the human brain, ANNs are structured to work in unison to deal with certain problem domains. In other words, we can draw an analogy with biological neurons: processing elements correspond to the neurons, combining functions to the dendrites, the transfer function to the cell body, the element output to the axons, and the weights associated with each input to the synapses. The firing rule is given by the linear weighted sum of the input signals, and the neuron is said to fire (or achieve action potential) when the weighted sum exceeds a given threshold. However, this analogy is not claimed to be accurate in all respects, as biological systems are much more elaborate: real neurons do not simply sum the weighted inputs, they do not stay on until the inputs change, dendritic mechanisms play a role, and the outputs may encode information in complex pulse arrangements.

Several learning methodologies have been incorporated in ANNs; however, we will confine our discussion to Hebbian Learning, Perceptron Learning and Back-propagation Learning.

a. Hebbian Learning: The work of Hebb, wherein he proposed a theory basing behaviour as much as possible on the physiology of the nervous system, is undoubtedly the most inspiring work on connectionism to date. Learning is based on the modification of the synaptic connections between neurons. Specifically, when an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased. The principle underlying this statement is known as Hebbian Learning. In a nutshell, positive correlation leads to synaptic strengthening and negative correlation leads to synaptic weakening. A mathematical model of Hebbian modification is:

Δw_kj(n) = F(y_k(n), x_j(n))   … (1)

where Δw_kj(n) is the change in the weight connecting neuron j to neuron k, expressed as a function of the postsynaptic signal y_k(n) and the presynaptic signal x_j(n).

Hebb’s hypothesis introduces a learning-rate parameter η and is given by:

Δw_kj(n) = η y_k(n) x_j(n)   … (2)


Equation (2) is also known as the Activity Product Rule. Keeping x_j constant, the graph of Δw_kj against y_k is a straight line passing through the origin with slope η x_j.
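As a toy illustration of the Activity Product Rule, the following numpy sketch applies Equation (2) repeatedly to a single neuron; the signals, learning rate and initial weights are arbitrary choices:

```python
import numpy as np

eta = 0.1                                        # learning rate (eta)
x = np.array([0.5, 1.0, -0.3])                   # presynaptic signals x_j(n)
w = np.random.default_rng(0).normal(scale=0.1, size=3)   # weights w_kj, small random start

for n in range(10):
    y = np.dot(w, x)                             # postsynaptic activity y_k(n)
    w += eta * y * x                             # Hebbian update: delta w_kj = eta * y_k * x_j

print(w)   # weights grow where pre- and postsynaptic activity are positively correlated
```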

b. The Perceptron Learning

The error-correction learning procedure is simple enough in conception. The procedure is as follows: during training, an input is fed into the network and flows through it, generating a set of values on the output units. Then the actual output is compared with the desired target and the mismatch is computed. If the output and target match, no change is made to the net. However, if the output differs from the target, a change must be made to some of the connections. The perceptron learning rule, introduced by Rosenblatt, is a typical error-correction learning algorithm for single-layer feed-forward networks with a linear threshold activation function. Perceptrons are especially suited for problems in pattern classification.

The perceptron learning algorithm is given in the following steps. Given are K training pairs arranged in the training set (x^1, y^1), …, (x^K, y^K).

Step 1: A learning rate η > 0 is chosen.

Step 2: The weights w_i are initialized to small random values, the running error E is set to 0, and k := 1.

Step 3: Training starts here. x^k is presented, x := x^k, y := y^k, and the output o = o(x) is computed:

o_i = 1 if <w_i, x> > 0, and o_i = 0 if <w_i, x> < 0.

Step 4: The weights are updated: w_i := w_i + η(y_i − o_i)x, i = 1, …, m.

Step 5: The cumulative cycle error is computed by adding the present error to E: E := E + ½ ||y − o||².

Step 6: If k < K then k := k + 1 and we continue the training by going back to Step 3; otherwise we go to Step 7.

Step 7: The training cycle is completed. If E = 0, terminate the training session. If E > 0, then E is set to 0, k := 1 and we initiate a new training cycle by going to Step 3.


The procedure is very similar to the Hebb rule; the only difference is that when the network responds correctly, no connection weights are modified.
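The steps above can be sketched in a few lines of numpy; the toy AND-gate data, learning rate and number of cycles below are illustrative choices only:

```python
import numpy as np

X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)  # first column is a bias input
y = np.array([0, 0, 0, 1], dtype=float)                                  # AND of the last two columns

eta = 0.5                                                 # Step 1: choose eta > 0
w = np.random.default_rng(0).normal(scale=0.1, size=3)    # Step 2: small random weights

for cycle in range(20):                                   # training cycles
    E = 0.0                                               # running error
    for xk, yk in zip(X, y):                              # Steps 3-6: present each training pair
        o = 1.0 if np.dot(w, xk) > 0 else 0.0             # threshold output
        w += eta * (yk - o) * xk                          # Step 4: weight update (only when output is wrong)
        E += 0.5 * (yk - o) ** 2                          # Step 5: cumulative cycle error
    if E == 0:                                            # Step 7: stop once a full cycle makes no error
        break

print(w, [1.0 if np.dot(w, xk) > 0 else 0.0 for xk in X])
```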

c. The Back Propagation Learning

Although the Perceptron algorithm described earlier works well for linearly separable data, it does not command dominance in the field of learning because real-life data are more complicated. We take the partial derivative of the error of the network with respect to each weight to get an idea of the direction in which the error is moving.

In fact, if we take the negative of this derivative (i.e. the rate of change of the error as the value of the weight increases) and then add it to the weight, the error will decrease until it reaches a local minimum. This makes sense because if the derivative is positive, the error is increasing as the weight increases; the obvious thing to do then is to add a negative value to the weight, and vice versa if the derivative is negative. Because these partial derivatives are taken and applied to each of the weights starting from the output-layer-to-hidden-layer weights and then the hidden-layer-to-input-layer weights (as it turns out, this order is necessary since changing these sets of weights requires that we know the partial derivatives calculated in the layer downstream), this algorithm has been called the back-propagation algorithm.

There are basically two ways to train a neural network using back propagation: the online mode and the batch mode, each of which performs a different number of weight updates for the same presentations. In the online mode the weights are computed and modified after each input sample, whereas in the batch mode weight updates are computed for each input sample but accumulated during one pass through the training set (epoch). All the contributions are added at the end of the epoch and only then are the weights updated with the composite value. This approach follows the gradient more closely and is referred to as the batch training mode.

The training using the back-propagation method can be briefly summarised as:

Step 1: Feed training samples as input vectors through the neural network.

Step 2: Calculate the error at the output layer.

Step 3: Adjust the weights of the network to minimize the error. The average of all the squared errors for the outputs is computed to make the derivative easier.

Step 4: Continue Step 3 until the errors reach an acceptable level.

A good choice of momentum and learning rate is vital for the training success and speed of neural network learning. A back-propagation network with sufficient hidden units can approximate non-linear functions with very close accuracy.
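The following compact numpy sketch illustrates batch-mode back-propagation for a single-hidden-layer network with sigmoid units; the XOR data, layer sizes, learning rate and number of epochs are illustrative choices, not the configuration used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(scale=0.5, size=(2, 4)); b1 = np.zeros(4)   # input -> hidden
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)   # hidden -> output
eta = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(5000):
    h = sigmoid(X @ W1 + b1)                 # Step 1: forward pass through the network
    o = sigmoid(h @ W2 + b2)
    err = o - y                              # Step 2: error at the output layer
    delta_o = err * o * (1 - o)              # Step 3: output-layer gradient...
    delta_h = (delta_o @ W2.T) * h * (1 - h) # ...back-propagated to the hidden layer
    # Batch mode: gradients over the whole epoch are applied in one composite update
    W2 -= eta * h.T @ delta_o; b2 -= eta * delta_o.sum(axis=0)
    W1 -= eta * X.T @ delta_h; b1 -= eta * delta_h.sum(axis=0)

print(np.round(o, 2))   # inspect the learned outputs for the four XOR patterns
```

The online mode would instead apply the update inside a loop over individual samples, once per sample.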

Although second-generation neural networks are capable of learning more complex functions with the aid of multilayer Perceptrons and the famous back-propagation learning algorithm (which back-propagates the error signal computed at the output layer to obtain derivatives for learning, in order to update the weight vectors until convergence is reached), the learning scope so obtained is not optimal: it comes with the disadvantages of being unable to train on unlabelled data while in practice most data is unlabelled, of the correcting signal weakening as it passes through multiple layers together with the associated slow rate of learning, and of getting stuck at local optima.

With a view to combating the deficiencies identified, the research community divided into two groups: the first tried to improve upon Hinton's method by, for example, increasing the training data set and estimating initial weight vectors, while the second group focused on improving Perceptron Learning, giving birth to a new family called the "Support Vector Machines" (SVMs).

d. Support Vector Machines

Support Vector Machines, introduced by Vladimir N. Vapnik et al. in 1995, are based on the concept of decision planes that define decision boundaries. Adopting the core of statistical learning theory, an SVM turns the hand-crafted feature layer of the original Perceptron into a feature layer following a fixed recipe. This recipe is called the kernel function, whose job is to map the input data into another, high-dimensional space, after which an optimization technique is deployed to learn the weights combining the features and relating the data to the output.
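As a small illustration (assuming scikit-learn is available), the RBF kernel below plays the role of the fixed feature-mapping recipe described above; the dataset and hyper-parameters are arbitrary:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)   # a simple non-linear toy problem
clf = SVC(kernel='rbf', C=1.0, gamma='scale')                 # fixed kernel maps inputs to a high-dimensional space
clf.fit(X, y)                                                 # optimization learns the combining weights
print("training accuracy:", clf.score(X, y))
```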

SVMs give the benefit of fast and easy learning on simple structures, especially those with a smaller number of features and without features distributed across massive hierarchical structures. However, the same is not the case with complicated features. Despite the conversion of the input data into a high-dimensional feature space by the kernel, the "fixed" nature of the kernel functions already determines the mapping, owing to which the information contained in the data structure is not fully utilised. [15][16]

This problem can be solved by adding prior knowledge to the SVM model to obtain a better feature layer. However, this step cannot be regarded as a direct improvement because it involves human intervention and is highly dependent on the prior knowledge we add, and hence is not perfectly aligned with the "intelligent machine" we desire.

The SVM is thus not a good trend for AI and Machine Learning, owing to the fatal deficiency of its shallow architecture. This definitely rings a bell for the need of a multilevel, deep, hierarchical learning structure and relevant algorithms, and a further probe into the multilayer neural network trying to exploit its advantages related to "depth" and overcome the limitations thereof.[16]

4.2 Big Data Analytics

Big data is a collection of data sets that are large and complex in nature. They constitute both structured and unstructured data that grow so fast that they are not manageable by traditional RDBMSs or by conventional statistical tools. Big data analytics is the process of collecting, organising and analysing large sets of data to discover patterns and other useful information with low latency, which is beyond the scope of traditional relational databases, and big data in turn requires tools and methods that are able to achieve this. The rise of Big Data has been caused by increased data storage capabilities, increased computational processing power, and the availability of increasing volumes of data, which give organizations more data than they have computing resources and technologies to process. Using advanced analytics techniques such as text analytics, machine learning, predictive analytics, data mining, statistics, and natural language processing, businesses can analyse previously untapped data sources, independently or together with their existing enterprise data, to gain new insights resulting in significantly better and faster decisions. The three common attributes of big data, viz. volume, variety and velocity, have grown phenomenally in recent years to reach unprecedented levels; this has led to the inevitable need for understanding big data analytics. With the increase in volume and variety there is a possibility of losing streaming data if it is not immediately processed and analysed. Although the option to save fast-moving data into bulk storage for batch processing at a later time gives some relief from a storage perspective, one cannot neglect the importance of a quick feedback loop for translating input data into useful information; this is because volume and variety are always coupled with velocity. Sustaining trust in Big Data coming from a number of sources is often seen as a research challenge (veracity).

Big Data Analytics faces a number of challenges beyond those implied by the four Vs. While not meant to be an exhaustive list, some key problem areas include: data quality and validation, data cleansing, feature engineering, high-dimensionality and data reduction, data representations and distributed data sources, data sampling, scalability of algorithms, data visualization, parallel and distributed data processing, real-time analysis and decision making, crowdsourcing and semantic input for improved data analysis, tracing and analyzing data provenance, data discovery and integration, parallel and distributed computing, exploratory data analysis and interpretation, integrating heterogeneous data, and developing new models for massive data computation.

4.2.1 Effects of Artificial Neural Networks in Big Data Analytics

Conventional neural networks tend to get trapped in the local optima of a non-convex objective function, which often leads to poor performance [1][27]. Moreover, they cannot take advantage of unlabelled data, which is present in abundance in Big Data. Traditional Artificial Neural Networks and conventional training methodologies fail to extract complex patterns, nor can they take advantage of raw data, making feature selection an inherent, tedious and inevitable task. Learning is very specific and cannot be generalized to diverse domains. Neural networks on CPU implementations cannot process parallel data efficiently, and their algorithms are not scalable. Clearly these deficiencies pose a big research challenge in dealing with "the four V's" of Big Data. The large and unprecedentedly growing volume of non-traditional information calls for the development of advanced technologies, better algorithms, more efficient learning architectures and interdisciplinary teams working in close collaboration. It also demands a dramatic paradigm shift in our scientific research towards data-driven discovery. The intention to combat these deficiencies and solve the research challenges led to the development of hierarchical, layer-wise, greedy training algorithms, known collectively as "Deep Learning", which will be discussed in the following sections. In the next section we show a detailed performance analysis of Artificial Neural Networks on the KDD 99 dataset, taking into account the time taken for varying numbers of instances as well as the accuracy on test data.

4.2.2 Analysing the nature of the execution time for the KDD data set using 10-fold cross-validation

No. of Instances    Time Taken (s)
1032                108.45
2055                176.21
3148                279.8
5060                443.49
10265               805.29
20204               1751.87
30012               2099.78
40035               2667.91
50657               4838.91

Fig 1: Table showing the execution time in seconds for the KDD 99 Intrusion Detection data set for various numbers of instances.

13

Page 14: read.pudn.comread.pudn.com/downloads721/doc/2888925/Deep...  · Web viewDeep Learning in Big Data Analytics: Approaches, Applications and Tools. Abstract

Figure 2: Graph showing the Time Taken in seconds for a 3 layer-MLP

We use a supervised training technique to train on the instances of the KDD dataset. Our problem is an n-class classification problem: after training and testing, we expect the artificial neural network to correctly classify each instance as normal or as belonging to one of the n attack classes.

Using supervised training with a Multilayer Perceptron as the learning tool for our Artificial Neural Network, we train on 90% of the dataset and test on the remaining 10%. The classification accuracy and time to completion are noted. The process is repeated on the same dataset, taking another 90% for training and a different 10% for testing (the previous 10% used for testing now becomes part of the training set). The process is iterated until each of the ten slots has been used for testing exactly once. The averages of the accuracy and time values are taken as the final result.
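The ten-fold procedure described above could be scripted roughly as below (assuming scikit-learn). A synthetic dataset stands in for the preprocessed KDD 99 instances, and the MLP layer sizes shown are illustrative, not the exact configuration used in our experiments:

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Stand-in for the numeric KDD 99 feature matrix and attack/normal labels.
X, y = make_classification(n_samples=2000, n_features=41, random_state=0)

accs, start = [], time.time()
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    clf = MLPClassifier(hidden_layer_sizes=(40, 40), max_iter=300)
    clf.fit(X[train_idx], y[train_idx])                                  # train on 90% of the instances
    accs.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))   # test on the held-out 10%

print("mean accuracy: %.4f  total time: %.1f s" % (np.mean(accs), time.time() - start))
```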


We carry on our experiment taking orders of 1000, 2000, 3000, 5000, 10000, 20000, 30000, 40000 and 50000 instances.

When the number of instances increases from 1032 to 2055 (1.99, or nearly 2 times), the time of execution increases from 108.45 s to 176.21 s (1.62 times, i.e. less than 2 times). So initially, with a very small number of instances, the slope of the graph is 1.62/1.99 = 0.814 < 1 (< tan 45°). Hence the graph exhibits a sub-linear nature initially.
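The slope used throughout this analysis is simply the ratio of the relative increase in execution time to the relative increase in the number of instances, as in the small sketch below (the first interval from Fig 1 is used as the example):

```python
def growth_slope(n1, n2, t1, t2):
    """Relative increase in time divided by relative increase in instances."""
    return (t2 / t1) / (n2 / n1)

# Interval 1032 -> 2055 instances, 108.45 -> 176.21 seconds (from Fig 1).
print(round(growth_slope(1032, 2055, 108.45, 176.21), 3))   # about 0.8, i.e. sub-linear (< 1)
```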

i). Instances 2055 – 3148, time taken 176.21 – 279.8 s. The number of instances increases 1.53 times while the time taken increases 1.58 times. The slope of the graph over this interval is 1.58/1.53 = 1.03 > 1, so the graph is slightly super-linear here.

ii). Instances 3148 – 5060 and 5060 – 10265, time taken 279.8 – 443.49 s and 443.49 – 805.29 s. Slope of the graph ≈ 0.998 ≈ 1, therefore linear.

iii). Instances 10265 – 20204, time taken 805.29 – 1751.87 s. Slope = 1.105 > 1, therefore super-linear.

iv). Instances 20204 – 30012, time taken 1751.87 – 2667.91 s. Slope of the graph = 1.02, therefore slightly super-linear.

v). Instances 30012 – 50657, time taken 2667.91 – 4838.91 s. Slope = 1.07, therefore super-linear.

Observation: The graph starts by exhibiting sub-linear growth, which later becomes linear and then super-linear. We expect the growth to become more extreme as the number of instances keeps growing, and at some point it may show exponential growth.


4.2.3 Execution Time Analysis of the KDD 99 Intrusion Detection Data Set using the 10-fold cross-validation technique for multiple layers.

We perform our experiment for instances of the order of 1×10^4, 2×10^4, 3×10^4, 4×10^4 and 5×10^4, for MLPs with 2 to 7 layers. The performance in terms of time taken to execute the model and the accuracy of correctly classifying the instances is given in the tables below. They are accompanied by individual graphs for better visualisation and analysis.

No. of Instances   2-Layer MLP   3-Layer MLP   4-Layer MLP   5-Layer MLP   6-Layer MLP   7-Layer MLP
10265              317.22        238.86        196.85        189.6         167.23        137.82
20204              634.53        481.39        392.48        416.48        340.01        307.1
30012              955.77        699.57        560.12        506.86        545.31        428.59
40035              1327.11       995.31        788.25        684.77        680.48        572.53
50657              1860.38       1397.41       943.62        747.59        931.09        738.44

Fig 3: Table showing the execution time in seconds for an Artificial Neural Network having multiple hidden layers.


Figure 4: Graph showing the Time Taken in seconds for a 3 layer-MLP

Figure 5: Graph showing the Time Taken in seconds for a 4 layer-MLP


Figure 6: Graph showing the Time Taken in seconds for a 5 layer-MLP

Figure 7: Graph showing the Time Taken in seconds for a 6 layer-MLP


Figure 8: Graph showing the Time Taken in seconds for a 7 layer-MLP

Figure 9: Graph showing All – Layer Time Analysis in a single diagram.


Analysis of the above graphs leads us to the conclusion that the execution time increases significantly with the number of instances for each layer configuration. However, the variation and the rate of increase are neither the same nor consistent across different layers. The 2-layer MLP and 3-layer MLP graphs show that the increase is consistently superlinear (slope > 1) as the number of instances increases; they are expected to take an exponential turn with a sheer increase in the number of instances. The 6-layer MLP alternates between superlinear and sublinear behaviour; however, as the number of instances reaches a certain threshold it shows a sharp increase in the time taken to execute the model. The 7-layer MLP shows a near-linear increase in the time taken with respect to the number of instances.

Interestingly, the 4-layer MLP starts with a linear nature, tends to show superlinear behaviour over a range, but stabilises to a sublinear increase (slope < 0.83) as the number of instances grows. The 5-layer MLP, on the other hand, fluctuates between linear and sublinear behaviour and ultimately stabilises to sublinear (slope < 0.72) behaviour as the number of instances is increased.

This shows that the ANN works best with a 5-layer MLP in terms of execution time. This is consistent with Hinton's analysis, which revealed that the number of layers giving the best performance in terms of time is an optimisation over the number of input and output neurons. However, the time of execution cannot be the only criterion for judging the performance of a model, which is why we consider the accuracy of prediction in the next section.


4.2.4 Accuracy Analysis of the KDD 99 Intrusion Detection Data Set using the 10-fold cross-validation technique for multiple layers.

The accuracy of correctly classifying the instances is tabulated below.

No. of Instances   2-Layer MLP   3-Layer MLP   4-Layer MLP   5-Layer MLP   6-Layer MLP   7-Layer MLP
10265              99.92%        97.02%        98.30%        98.28%        56.89%        56.89%
20204              99.80%        94.90%        96.26%        92.74%        56.97%        56.97%
30012              99.78%        96.40%        95.87%        77.69%        56.88%        56.88%
40035              99.92%        97.08%        88.48%        82.60%        56.89%        56.89%
50657              99.85%        96.39%        98.27%        83.08%        56.87%        56.87%

Fig 10: Table showing the accuracy of prediction on the KDD 99 Intrusion Detection data set for an Artificial Neural Network having multiple hidden layers.

Analysis of the above table shows a decreasing accuracy of correctly classifying the instances of the KDD 99 Intrusion Detection data set as the number of layers increases. However, we have already ruled out the 2-layer MLP and 3-layer MLP owing to their superlinear nature and expected exponential growth.

In the battle between the 4-layer MLP and the 5-layer MLP, the 5-layer MLP was favoured in terms of time taken. However, it does not satisfactorily meet the accuracy requirements, with a mean accuracy of 86.67%. The 4-layer MLP shows a whopping 95.23% mean accuracy and hence wins in terms of combined performance. The 6-layer MLP and the 7-layer MLP fall out of the race owing to an unacceptable mean accuracy of 56.93%.


To get a better view of the performance, let us look at the graph below.

Figure 11: Graph showing All – Layer Performance Analysis in a single diagram.

The best performance observed in an ANN, however, will not hold when the data volume and variety increase significantly. Research has shown that Deep Learning has significant potential to do away with the deficiencies faced by ANNs. In the subsequent sections we primarily focus on a survey of Deep Learning and on Big Data Analytics.

5. Deep Learning and its Applications

Machine Learning is primarily focused on the representation of input data and on generalizing the patterns already learnt for use on future data. It is known that a poor data representation will most likely reduce the performance of even a complex, advanced machine learner, whilst a good data representation can lead to high performance for a comparatively simple one. Feature engineering therefore holds a major share of the effort in machine learning. To mitigate the effort spent in constructing features and data representations from raw data, a more automated and generalised feature extraction approach would be a major quantum leap in machine learning, as it would allow researchers to extract features automatically without human input. Deep Learning algorithms are one phenomenal route of research in this direction. Such algorithms develop a layered, hierarchical architecture for learning and representing data, where higher-level (more abstract) features are defined in terms of lower-level (less abstract) features. The hierarchical learning architecture of Deep Learning algorithms is motivated by the artificial-intelligence goal of emulating the deep, layered learning process of the primary sensory areas of the neocortex in the human brain, which automatically extracts features and abstractions from the underlying data. [4][6]

Deep Learning is based on the theory of connectionism. While an individual neuron in biology or an individual feature in a machine learning model is not intelligent, a large population of these neurons and features acting collectively is capable of exhibiting intelligent behaviour.[2]

Deep Learning, today, has found applications in multifarious domains at both small and large scales. Using supervised/unsupervised machine learning techniques to automate hierarchical learning, it has drawn the attention of the academic community owing to its state-of-the-art performance in domains such as speech recognition, collaborative filtering and computer vision. [1] Top research and production companies like Google, Intel, Facebook, Apple and Microsoft take advantage of the massive volume of digital data at our disposal today by collecting, storing and analysing it, and this has aggressively pushed forward the deep learning revolution and related projects. For example, Apple's Siri, the virtual personal assistant in iPhones, offers a wide variety of services including weather reports, sports news, answers to users' questions, reminders, etc. by utilizing deep learning and the ever-growing data collected by Apple services [17]. Google applies deep learning algorithms to massive chunks of messy data obtained from the Internet for Google's translator, Android's voice recognition, Google's street view, and image search [18]. Other industry giants are not far behind either: for example, Microsoft's real-time language translation in Bing voice search [19] and IBM's brain-like computer [18], [20] use techniques like deep learning to leverage Big Data for competitive advantage.

6. Existing Approaches for Deep Learning

As data keeps getting bigger in terms of its dimensionality, volume, variety, velocity and veracity, deep learning, with its hierarchical (deep architecture) approach, coupled with efficient hardware, mainly the advent of graphics processors and the increased processing power of machines, has played a key role in predictive analytics solutions.[1]

The increase in accuracy, speed and the level of complexity of problems a neural net can solve is much higher today than was observed three decades ago, and the reason behind this is attributed to the dramatic increase in the size of today's neural networks.


Neural network sizes have grown exponentially over the last thirty years. Because the size of the network is of paramount importance, deep learning requires high-performance hardware and software architectures.[2] We discuss some basic approaches to Deep Learning below.

6.1 Fast CPU Implementations

The traditional methodology of training neural networks using the CPU of a single machine is by no means a fair match for the massive volume of data at our disposal and hence is insufficient. In the lookout for a better option, GPU computing and the use of the CPUs of many machines together were proposed and experimented with. Thorough research in this field has shown that careful implementation for specific CPU families can yield tremendous improvements in performance. For example, in 2011 the best CPUs available could run neural network workloads faster when using fixed-point arithmetic than when using floating point. By creating a carefully tuned fixed-point implementation, Vanhoucke et al. (2011) obtained a three-times speed-up over a strong floating-point baseline. However, the performance characteristics vary with the model of CPU used, so sometimes floating-point operations can be faster too. Besides careful specialization of numerical computations, other strategies include optimizing data structures to avoid cache misses and using vector instructions. If a high degree of precision is to be achieved, one cannot neglect these implementation details. Hence, when machine learning researchers restrict the size of the network, they in turn restrict the machine learning capabilities of the network.[2]

6.2 GPU Implementations

Graphics Processing Units (GPUs) are specialized hardware components originally developed for graphics applications such as video games. The performance characteristics needed for good video gaming turned out to be beneficial for neural network implementations as well. In contrast to CPU workloads, which involve a considerably higher computational variety and heavy branching, GPU computations are fairly, and necessarily, simple. Independence of individual computations, ease of parallelization and high memory bandwidth are features of a GPU that make it an indispensable consideration in building intelligent deep architectures, as opposed to its CPU counterpart. However, before making a paradigm shift in terms of processing units, researchers worked hard to establish the inefficiency of CPUs for this purpose.

Neural networks benefit from these same performance characteristics. Neural networks usually involve large and numerous buffers of parameters, activation values and gradient values, each of which must be completely updated during every step of training. These buffers are large enough to fall outside the cache of a traditional desktop computer, so the memory bandwidth of the system often becomes the rate-limiting factor. GPUs offer a compelling advantage over CPUs due to their high memory bandwidth. Neural network training algorithms typically do not involve much branching or sophisticated control, so they are appropriate for GPU hardware. Since neural networks can be divided into multiple individual "neurons" that can be processed independently of the other neurons in the same layer, neural networks easily benefit from the parallelism of GPU computing. [2]

6.3 Large Scale Distributed Applications

This involves distributing training and inference across many machines. It is, however, an arduous task when both data parallelism and model parallelism are to be achieved in distributed systems. For instance, data parallelism gives sublinear returns in terms of optimization when simply increasing the minibatch size for a single SGD step. This can be addressed using asynchronous stochastic gradient descent (Bengio et al., 2001a; Recht et al., 2011). In this approach, several processor cores share the memory representing the parameters. Each core reads parameters without a lock, then computes a gradient, then increments the parameters without a lock. This reduces the average amount of improvement that each gradient descent step yields, because some of the cores overwrite each other's progress, but the increased rate of production of steps causes the learning process to be faster overall. Dean et al. (2012) pioneered the multi-machine implementation of this lock-free approach to gradient descent, where the parameters are managed by a parameter server rather than stored in shared memory. To date, distributed asynchronous gradient descent remains the primary strategy for training deep neural architectures.[2]
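A toy single-machine analogue of this lock-free scheme can be sketched with threads and shared memory; the linear-regression objective, data and step size are illustrative only, and in CPython threads do not give real parallel speed-up, so the sketch only illustrates the unlocked update pattern rather than the performance benefit:

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 20))
true_w = rng.normal(size=20)
y = X @ true_w + 0.01 * rng.normal(size=10000)

w = np.zeros(20)        # shared parameters: no lock protects them
eta = 0.01

def worker(rows):
    for i in rows:
        grad = (X[i] @ w - y[i]) * X[i]   # gradient of 0.5 * (x.w - y)^2 on one example
        w[:] = w - eta * grad             # lock-free update; workers may overwrite each other

threads = [threading.Thread(target=worker, args=(range(k, 10000, 4),)) for k in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print("parameter error:", np.linalg.norm(w - true_w))
```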

6.4 Dynamic Structures

With a view to accelerating data processing systems, dynamic structures are built which enable these systems to dynamically determine which subset of many neural networks should be run on a given input. Individual neural networks can also exhibit dynamic behaviour internally by determining which subset of features to compute given information from the input; this is known as conditional computation (Bengio, 2013; Bengio et al., 2013b).[2] A prominent strategy for accelerating the inference of classifiers is to use a cascade of them. Decision trees themselves are an example of dynamic structure, because each node in the tree determines which of its subtrees should be evaluated for each input. A simple way to accomplish the union of deep learning and dynamic structure is to train a decision tree in which each node uses a neural network to make the splitting decision (Guo and Gelfand, 1992), though this has typically not been done with the primary goal of accelerating inference computations.[2] In the same spirit, one can use a neural network, called the gater, to select which one of several expert networks will be used to compute the output, given the current input. One major setback to this approach is the decreased degree of parallelism that results from the system following different code branches for different inputs. Although one may write specialised subroutines that convolve each example with different kernels, one cannot escape the associated implementation difficulty: CPU implementations will be slow due to the lack of cache coherence, and GPU implementations will be slow due to the lack of coalesced memory transactions and the need to serialize warps when members of a warp take different branches. Partitioning the examples into groups that take the same branch and processing them simultaneously can mitigate these issues in some cases, but in real-time settings where continuous processing is demanded, workload partitioning can give rise to load-balancing issues.[2]
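The gater idea above can be sketched in a few lines of numpy; the gating rule, expert networks and random weights below are placeholders rather than trained models:

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_in, d_hidden, d_out = 3, 8, 16, 4
experts = [(rng.normal(size=(d_in, d_hidden)), rng.normal(size=(d_hidden, d_out)))
           for _ in range(n_experts)]          # several small expert networks
gater_W = rng.normal(size=(d_in, n_experts))   # gater scores the experts for a given input

def forward(x):
    k = int(np.argmax(x @ gater_W))            # dynamic structure: pick one expert per input
    W1, W2 = experts[k]
    return k, np.tanh(x @ W1) @ W2             # only the chosen expert's computation runs

x = rng.normal(size=d_in)
chosen, out = forward(x)
print("expert", chosen, "output", np.round(out, 2))
```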

7. Motivation

7.1 Why Deep Learning in Big Data Analytics

Mining and extracting meaningful patterns from massive input data for decision making, prediction and inference is the heart of Big Data Analytics. The four V's of Big Data present a massive collection of raw, unlabelled data to its users. Deep Learning inherently exploits the availability of massive amounts of data, i.e. the Volume in Big Data, where its shallow-learning counterparts fail to explore and understand the higher complexities of data patterns. Moreover, since Deep Learning deals with data abstraction and representations, it is quite likely suited for analysing raw data presented in different formats and/or from different sources, i.e. the Variety in Big Data, and may minimize the need for input from human experts to extract features from every new data type observed in Big Data. The key benefit of Deep Learning is the analysis and learning of massive amounts of unsupervised data, making it a valuable tool for Big Data Analytics, where raw data is largely unlabelled and uncategorized. [1][5] Once the hierarchical data abstractions have been learnt from unsupervised data with Deep Learning, more conventional discriminative models can be trained with the aid of relatively few supervised/labelled data points, where the labelled data is typically obtained through human/expert input. Automatic abstraction of features, from lower-level features to higher-level concepts, through a series of processing stages, together with a distributed representation of learning, gives deep-learning networks a stronger capacity for learning and can produce much better generalizations. Furthermore, an architecture with multiple levels based on a distributed representation of data allows deep-learning networks to learn intermediate representations, which can be shared across different problem areas.[6]

Deep Learning algorithms have been shown to perform better at extracting non-local and global relationships and patterns in data than relatively shallow learning architectures, making them desirable for Big Data Analytics solutions [5].

Although Big Data Analytics has taken the forefront in diverse application domains, challenging traditional data analysis approaches, and although Deep Learning research is still in its infancy, this situation has provided a stage for future researchers to develop novel algorithms and models that exploit its benefits. Deep Learning concepts provide one such solution venue. For example, the representations extracted by Deep Learning can be considered a practical source of knowledge for decision-making, semantic indexing, information retrieval, and other purposes in Big Data Analytics; in addition, simple linear modeling techniques can be considered for Big Data Analytics when complex data is represented in higher forms of abstraction [1][2][5].


7.2 Issues and Research Challenges.

7.2.1 Incremental learning for non-stationary, high-velocity online data

Data generation at an explosive rate, and the inherent need for its timely processing, has given rise to an emerging challenge in the field of Big Data learning. Such data analysis is useful in monitoring tasks such as fraud detection. Incremental feature learning and extraction [5][49], denoising autoencoders [5][50], and deep belief networks [5][51] throw some light on this area. Denoising autoencoders are a variant of autoencoders which extract features from corrupted input, where the extracted features are robust to noisy data and good for classification purposes. Deep Learning algorithms in general use hidden layers to contribute towards the extraction of features or data representations. In a denoising autoencoder, there is one hidden layer which extracts features, with the number of nodes in this hidden layer initially being the same as the number of features that would be extracted. Incrementally, the samples that do not conform to the given objective function (for example, their classification error is above a threshold, or their reconstruction error is high) are collected and used for adding new nodes to the hidden layer, with these new nodes being initialized based on those samples. Subsequently, incoming new data samples are used to jointly retrain all the features [5]. Incremental feature learning and mapping, though it improves the discriminative or generative objective function, entails many redundant features and overfitting of the data owing to the monotonic addition of features. Consequently, similar features are merged to form a more compact feature set. Zhou et al. [49] demonstrate that the incremental feature learning method quickly converges to the optimal number of features in a large-scale online setting. This kind of incremental feature extraction is useful in applications where the distribution of data changes with respect to time in massive online data streams. Incremental feature learning and extraction can be generalized to other Deep Learning algorithms, such as RBMs [7], and makes it possible to adapt to new incoming streams of large-scale online data. Moreover, it avoids expensive cross-validation analysis in selecting the number of features in large-scale datasets. Calandra et al. [51] introduce adaptive deep belief networks, which demonstrate how Deep Learning can be generalized to learn from online, non-stationary and streaming data. Their study exploits the generative property of deep belief networks to mimic samples from the original data, where these samples and the newly observed samples are used to learn a new deep belief network that has adapted to the newly observed data. However, a downside of adaptive deep belief networks is the requirement for constant memory consumption [5].
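
The node-addition step described above can be sketched as follows. This is a toy NumPy illustration rather than the algorithm of Zhou et al.; the corruption level, error threshold and initialisation scheme are all arbitrary. Samples that the current hidden layer reconstructs poorly trigger the creation of new hidden units initialised from those samples.

```python
import numpy as np

rng = np.random.default_rng(2)
n_in = 10
W = rng.standard_normal((n_in, 5)) * 0.1      # start with 5 hidden features

def corrupt(x, level=0.3):
    """Denoising-autoencoder style corruption: randomly zero out some inputs."""
    mask = rng.random(x.shape) > level
    return x * mask

def reconstruct(x):
    h = np.tanh(corrupt(x) @ W)
    return h @ W.T

def add_features_for_hard_samples(X, threshold=5.0, max_new=3):
    """Incremental step: grow the hidden layer using samples the current
    features reconstruct poorly (high reconstruction error)."""
    global W
    errors = ((reconstruct(X) - X) ** 2).sum(axis=1)
    hard = X[errors > threshold][:max_new]
    if len(hard):
        # initialise the new hidden units from the hard samples themselves
        new_cols = (hard / np.linalg.norm(hard, axis=1, keepdims=True)).T
        W = np.concatenate([W, new_cols], axis=1)
    return len(hard)

X_stream = rng.standard_normal((100, n_in)) * 3.0
added = add_features_for_hard_samples(X_stream)
print("hidden units now:", W.shape[1], "(added", added, ")")
```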

However, to deal with the high velocity of data, 'online learning' provides a fairly appropriate solution. It is a sequential learning procedure in which one instance is learnt at a time; the true label for each instance soon becomes available and is used for refining the model [71]-[76]. This sequential, one-at-a-time learning is advantageous for Big Data because current machines cannot hold the entire dataset in memory. It has been observed that very limited progress has been made in the field of online learning using conventional neural networks. To speed up learning, instead of following a strictly sequential approach one can perform updates on a mini-batch basis, which provides a good balance between running time and memory. Online learning often scales naturally, is readily parallelizable, is memory bounded, and comes with theoretical guarantees [98]; it is thus a good candidate for research in the Deep Learning domain [1].
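
A minimal sketch of such mini-batch online learning follows (synthetic stand-in data and arbitrary sizes; the 41 features merely echo the KDD 99 feature count used later). The model is updated one mini-batch at a time and never needs the full dataset in memory.

```python
import numpy as np

rng = np.random.default_rng(3)
n_features = 41          # e.g. the KDD 99 feature count
w = np.zeros(n_features)
b = 0.0
lr = 0.05

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def minibatch_stream(n_batches, batch_size):
    """Stand-in for an unbounded data stream: yields (X, y) one mini-batch at a
    time, so the whole dataset never has to fit in memory."""
    for _ in range(n_batches):
        X = rng.standard_normal((batch_size, n_features))
        y = (X[:, 0] > 0).astype(float)        # synthetic labels, arriving with the data
        yield X, y

for X, y in minibatch_stream(n_batches=200, batch_size=64):
    p = sigmoid(X @ w + b)
    grad_w = X.T @ (p - y) / len(y)
    grad_b = (p - y).mean()
    w -= lr * grad_w                            # one update per mini-batch, then the
    b -= lr * grad_b                            # batch can be discarded

print("trained on 200 mini-batches without storing the full stream")
```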

Another challenge associated with highly dynamic data is that its distribution changes over time. To mitigate this challenge, data streams are divided into chunks, where elements within a chunk are separated by a very small time interval; they exhibit a high degree of correlation and thus follow the same distribution, so each chunk can be treated as a single unit. Deep Learning can also leverage both the high variety and velocity of Big Data through transfer learning or domain adaptation, where training and test data may be sampled from different distributions [99]-[107]. The empirical results of Glorot et al. [100] demonstrate that deep learning is able to extract a meaningful, high-level representation that is shared across different domains. The intermediate high-level abstraction is general enough to uncover the underlying factors of domain variation and is transferable across domains. Although Deep Learning has shown significant potential for transfer learning, extensive research is required to bring its performance to a desirable level [1][2][5].
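
The chunking idea can be illustrated with a small sketch (purely illustrative; the drift model, chunk size and shift threshold are arbitrary): the stream is grouped into chunks treated as single units, and a simple statistic flags when the distribution appears to have shifted so that the model should be adapted or retrained.

```python
import numpy as np

rng = np.random.default_rng(4)

def chunked(stream, chunk_size):
    """Group a stream into chunks; elements in a chunk are close in time and
    are treated as a single unit drawn from one distribution."""
    chunk = []
    for x in stream:
        chunk.append(x)
        if len(chunk) == chunk_size:
            yield np.array(chunk)
            chunk = []

# Synthetic stream whose distribution drifts over time (mean slowly increases).
stream = (rng.standard_normal(5) + t / 500.0 for t in range(2000))

prev_mean = None
for i, chunk in enumerate(chunked(stream, chunk_size=200)):
    mean = chunk.mean(axis=0)
    if prev_mean is not None and np.linalg.norm(mean - prev_mean) > 0.5:
        print(f"chunk {i}: distribution shift detected, model should be adapted")
    prev_mean = mean
```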

7.2.2 Issues related to High Dimensional data

High computational cost, processing complexity and a slow rate of learning are some limitations attributable to data of many dimensions. Two approaches for handling the drawbacks of high-dimensional data are marginalized stacked denoising autoencoders (mSDAs), Chen et al. [52], and convolutional neural networks. mSDAs scale effectively to high-dimensional data and are faster than regular SDAs. Learning the parameters is simplified because the noise is marginalised out during SDA training, so no complex optimisation or stochastic gradient descent algorithm is required for learning. Because the noise is marginalised analytically, each marginalized denoising autoencoder layer admits a closed-form solution, which yields substantial speed-ups. Furthermore, the model selection process is simplified by the fact that mSDAs have only two free parameters, controlling the amount of noise and the number of layers to be stacked [1].
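
The following sketch conveys the flavour of a closed-form denoising layer. It is not Chen et al.'s exact mSDA, which marginalises the corruption analytically instead of sampling it; the noise level, number of corrupted copies and regularisation constant are arbitrary. Each layer is a regularised least-squares map from corrupted to clean inputs, stacked with a non-linearity.

```python
import numpy as np

rng = np.random.default_rng(5)

def sda_layer(X, noise=0.3, n_copies=5, reg=1e-3):
    """One denoising layer solved in closed form: learn W that maps corrupted
    inputs back to the clean inputs by regularised least squares, then apply
    a non-linearity.  mSDA obtains this kind of solution by marginalising
    over the corruption instead of sampling it explicitly."""
    d = X.shape[1]
    X_clean = np.tile(X, (n_copies, 1))
    mask = rng.random(X_clean.shape) > noise
    X_corr = X_clean * mask
    # W = argmin ||X_clean - X_corr W||^2 + reg*||W||^2  (closed form)
    A = X_corr.T @ X_corr + reg * np.eye(d)
    B = X_corr.T @ X_clean
    W = np.linalg.solve(A, B)
    return np.tanh(X @ W), W

X = rng.standard_normal((1000, 50))
H1, W1 = sda_layer(X)          # stack layers by feeding each output to the next
H2, W2 = sda_layer(H1)
print(H1.shape, H2.shape)
```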

Convolutional neural networks (CNNs) are the second approach to combating this drawback. CNNs scale effectively to high-dimensional data because the neurons in the hidden layers do not need to be connected to all of the nodes in the previous layer, only to the neurons that lie in the same spatial area. Moreover, the resolution of the image data is reduced when moving toward higher layers in the network. However, the results from these solutions are only initial efforts and by no means complete, which points to the need for further innovations in large-scale models for Deep Learning algorithms and architectures [1][2].
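
To illustrate why local connectivity and weight sharing help with high-dimensional inputs, here is a naive NumPy sketch (illustrative only; the image and kernel sizes are arbitrary) comparing the parameter count of a fully connected mapping with that of a single shared convolution kernel, followed by a pooling step that reduces the resolution.

```python
import numpy as np

rng = np.random.default_rng(6)

def conv2d_valid(image, kernel):
    """Naive 2-D 'valid' convolution: every output neuron looks only at a small
    local patch of the input, and all of them share the same kernel weights."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = rng.standard_normal((64, 64))
kernel = rng.standard_normal((5, 5))
feature_map = conv2d_valid(image, kernel)

# Pooling then reduces the resolution as we move to higher layers.
pooled = feature_map[:60, :60].reshape(30, 2, 30, 2).max(axis=(1, 3))

fully_connected_params = 64 * 64 * 60 * 60   # one weight per input-output pair
convolutional_params = 5 * 5                 # a single shared kernel
print(feature_map.shape, pooled.shape, fully_connected_params, convolutional_params)
```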

7.2.3 Issues related to High Variety of data

The huge volumes of data at our disposal today, used in images, video, audio, speech recognition, graphics, text and so on, come from different sources and in different formats. They are inherently unstructured and come with different characteristics. A key to dealing with high variety is data integration, and Deep Learning has proved very useful in this regard. For example, Ngiam et al. [69] developed a novel application of deep learning algorithms to learn representations by integrating audio and video data. They demonstrated that deep learning is generally effective in (1) learning single-modality representations through multiple modalities with unlabeled data and (2) learning shared representations capable of capturing correlations across multiple modalities.

Although recent experiments, e.g. Salakhutdinov [70], demonstrate the effectiveness of Deep Learning methodologies in utilising heterogeneous sources for high system performance, numerous questions remain open. For example, how effective is the learning of a joint distribution in the multimodal input space, and how efficient is the learning when some modalities are missing? While current deep learning methods are mainly tested on bi-modal data (i.e., data from two sources), will system performance benefit from significantly more modalities? Furthermore, which levels in a deep learning architecture are appropriate for feature fusion of heterogeneous data? [1] In-depth research into these questions will probably reveal the real usefulness of Deep Learning in the coming years.

7.3 Availability of Tools to support Deep Learning in Big Data Analytics

As the data world undergoes its Cambrian explosion phase, our data tools need to become more advanced to keep pace. Deep Learning has emerged as a key tool in the non-linear arms race of machine learning. Text, sensor processing (IoT), image processing, and audio processing have all emerged as prime deep learning applications. In the 2015 KDnuggets software poll, a new set of data tools was added, especially in relation to the emerging Big Data revolution. Below we discuss emerging tools that support Deep Learning in Big Data Analytics and their key roles and usefulness in this area.

7.3.1.1 Theano

Theano is a Python library that lets you define, optimize, and evaluate mathematical expressions, especially ones involving multi-dimensional arrays (numpy.ndarray). Using Theano it is possible to attain speeds rivalling hand-crafted C implementations for problems involving large amounts of data. It can also surpass C on a CPU by many orders of magnitude by taking advantage of recent GPUs. Theano combines aspects of a computer algebra system (CAS) with aspects of an optimizing compiler. It can also generate customized C code for many mathematical operations. This combination of CAS with optimizing compilation is particularly useful for tasks in which complicated mathematical expressions are evaluated repeatedly and evaluation speed is critical. For situations where many different expressions are each evaluated once, Theano can minimize the amount of compilation/analysis overhead while still providing symbolic features such as automatic differentiation. Theano's compiler applies many optimizations of varying complexity. These include (non-exhaustively) use of the GPU for computations, constant folding, avoidance of redundancy by merging similar subgraphs, memory aliasing to avoid calculation, loop fusion for elementwise sub-expressions, arithmetic simplifications, and numerical stability optimizations [7].
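
A minimal sketch of this style of use, assuming Theano (release 0.7 as cited) is installed: a logistic-regression loss is defined symbolically, its gradient is obtained by automatic differentiation, and compiling with theano.function applies the graph optimisations listed above. The data here is a random stand-in.

```python
import numpy
import theano
import theano.tensor as T

# Symbolically define a logistic loss, compile it, and get the gradient for free.
x = T.dmatrix('x')
y = T.dvector('y')
w = theano.shared(numpy.zeros(3), name='w')

p = T.nnet.sigmoid(T.dot(x, w))
loss = -T.mean(y * T.log(p) + (1 - y) * T.log(1 - p))
grad_w = T.grad(loss, w)                       # automatic differentiation

# Compiling applies Theano's graph optimisations (constant folding, fusion, GPU use, ...).
train = theano.function(
    inputs=[x, y],
    outputs=loss,
    updates=[(w, w - 0.1 * grad_w)],
)

X = numpy.random.randn(100, 3)
Y = (X[:, 0] > 0).astype('float64')
for _ in range(50):
    current_loss = train(X, Y)
print("loss after training:", current_loss)
```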

7.3.1.2 Deeplearning4j

Deeplearning4j is the first commercial-grade, open-source, distributed deep-learning library written for Java and Scala, with Skymind as its supporting arm. It is integrated with big data open-source frameworks like Hadoop and Spark and is designed to be used at a business level rather than in academic and research domains. Deeplearning4j aims to be cutting-edge and plug-and-play, with more convention than configuration, which allows fast prototyping for non-researchers. DL4J is customizable at scale [9].

Deeplearning4j includes both a distributed, multi-threaded deep-learning framework and a normal single-threaded deep-learning framework. Training takes place in the cluster, which means it can process massive amounts of data quickly. Nets are trained in parallel via iterative reduce, and they are equally compatible with Java, Scala and Clojure. Deeplearning4j’s role as a modular component in an open stack makes it the first deep-learning framework adapted for a micro-service architecture. [9]

Deeplearning4j lets us compose deep neural nets from various shallow nets, each of which forms a so-called layer. This flexibility lets us combine restricted Boltzmann machines, other autoencoders, convolutional nets and recurrent nets as needed in a distributed, production-grade framework that works with Spark and Hadoop on top of distributed CPUs or GPUs [9].

DL4J's enriching features, besides GPU integration and scalability on Hadoop, also include a versatile n-dimensional array class, a general vectorization tool for machine learning libraries called Canova, and a linear algebra library claimed to be roughly twice as fast as Numpy [8].

Fig 12: An overview of the different libraries and how the architecture fits into a larger ecosystem [9]


7.3.1.3 Torch

Torch is an open-source machine learning library, a scientific computing framework, and a script language based on the Lua programming language. It provides a wide range of algorithms for deep machine learning, uses the extremely fast LuaJIT scripting language, and has an underlying C/CUDA implementation [11][12]. In a nutshell, its core features include a powerful n-dimensional array, many routines for slicing, indexing and transposing, an amazing interface to C via LuaJIT, linear algebra and numeric optimization routines, deep neural network and energy-based models, fast and efficient GPU support, and embeddability with ports to iOS, Android and FPGA backends. In addition to these basic features, FAIR [14] has provided some additional modules to improve its performance [12]. They include:

Containers that allow the user to parallelize training on multiple GPUs using either the data-parallel model (mini-batch split over GPUs) or the model-parallel model (network split over multiple GPUs).

An optimized Lookup Table that is often used when learning embeddings of discrete objects (e.g. words) and neural language models.

A Hierarchical SoftMax module to speed up training over an extremely large number of classes.

Cross-map pooling (sometimes known as MaxOut) often used for certain types of visual and text models.

A GPU implementation of 1-bit SGD based on the paper by Frank Seide et al.

A significantly faster Temporal Convolution layer, which computes the 1-D convolution of an input with a kernel, typically used in ConvNets for speech recognition and natural language applications. This version improves upon the original Torch implementation by utilizing the same BLAS primitives in a significantly more efficient regime; observed speedups range from 3x to 10x on a single GPU, depending on the input sizes, kernel sizes, and strides [14].

7.4 Implementing and Analysing Deep Learning for KDD 99 Data Set

We take the same dataset and the same numbers of instances and implement Deep Learning using Theano. We limit ourselves to a single layer (i.e. no hidden layers). Excerpts from our experiment are given below as screenshots.
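
The single-layer model is essentially multinomial logistic regression expressed in Theano. The sketch below shows this kind of model under stated assumptions; it is not our exact experiment code. The array shapes, class count, learning rate and epoch count are illustrative, and loading/encoding of the raw KDD 99 records into a numeric matrix is replaced here by random stand-in data.

```python
import numpy
import theano
import theano.tensor as T

# Hypothetical pre-processed KDD 99 arrays: 41 numeric features, integer class labels.
# (Loading and encoding of the raw dataset is omitted here.)
n_features, n_classes = 41, 5
X_train = numpy.random.randn(10265, n_features).astype('float64')
y_train = numpy.random.randint(0, n_classes, size=10265).astype('int32')

x = T.dmatrix('x')
y = T.ivector('y')
W = theano.shared(numpy.zeros((n_features, n_classes)), name='W')
b = theano.shared(numpy.zeros(n_classes), name='b')

# "Single layer" model: softmax directly on the inputs, no hidden layer.
p_y_given_x = T.nnet.softmax(T.dot(x, W) + b)
nll = -T.mean(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
errors = T.mean(T.neq(T.argmax(p_y_given_x, axis=1), y))

g_W, g_b = T.grad(nll, [W, b])
train = theano.function([x, y], nll, updates=[(W, W - 0.13 * g_W), (b, b - 0.13 * g_b)])
test = theano.function([x, y], errors)

for epoch in range(20):
    train(X_train, y_train)
print("training error:", test(X_train, y_train))
```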


Figure 13: Screenshot showing the time taken in seconds and the accuracy for the best model (without hidden layers) for 10265 instances of the KDD 99 Intrusion Detection dataset in Theano.


Figure 14: Screenshot showing the time taken in seconds and the accuracy for the best model (without hidden layers) for 20204 instances of the KDD 99 Intrusion Detection dataset in Theano.


Figure 15: Screenshot showing the time taken in seconds and the accuracy for the best model (without hidden layers) for 30012 instances of the KDD 99 Intrusion Detection dataset in Theano.


Figure 16: Screenshot showing the time taken in seconds and the accuracy for the best model (without hidden layers) for 40035 instances of the KDD 99 Intrusion Detection dataset in Theano.


Figure 17: Screenshot showing the time taken in seconds and the accuracy for the best model (without hidden layers) for 50657 instances of the KDD 99 Intrusion Detection dataset in Theano.


No. of Instances    Time Taken (s)
10265               25.2
20204               50.4
30012               81.0
40035               96.6
50657               132.0

Fig 18: Table showing the execution time in seconds for the KDD 99 Intrusion Detection dataset (without hidden layers) in Theano.

Analysing the results of running the same KDD 99 Intrusion Detection dataset in Theano (without hidden layers), we find that it starts with a very low execution time of 25.20 s, in contrast with its neural network counterpart, which gave an execution time of 805.29 s.

Further, the execution time in Theano grows roughly linearly with the number of instances. This is in contrast to the traditional neural network implementation for the same dataset, which shows consistently super-linear growth and approaches an exponential increase in execution time towards the end. The traditional ANN takes 4838.91 seconds to execute on 50657 instances, whereas Theano, for the same number of instances of the same dataset, executes in 132 seconds, a speed-up of approximately 36.66.
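
These figures can be re-derived directly from Fig 18 and the ANN timing quoted above; the short computation below (plain Python, no special libraries) simply recomputes the time per instance and the speed-up.

```python
instances    = [10265, 20204, 30012, 40035, 50657]
theano_times = [25.2, 50.4, 81.0, 96.6, 132.0]    # seconds, from Fig 18

# Time per instance in milliseconds: it stays roughly flat, i.e. growth is
# close to linear in the number of instances.
for n, t in zip(instances, theano_times):
    print(f"{n:6d} instances: {1000 * t / n:.2f} ms per instance")

ann_time_50657 = 4838.91                           # seconds, traditional ANN
print("speed-up at 50657 instances:", round(ann_time_50657 / theano_times[-1], 2))
```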

Implementation in Theano without hidden layers gives a mean accuracy of 78.98%, well below the 96.7% of its traditional neural network counterpart. However, Theano supports heterogeneous parallel programming and achieves maximum performance (less execution time and high accuracy) when run on multiple cores simultaneously. We can therefore conclude that using a Deep Learning framework for high-dimensional, large-instance data can give desirable performance.

8. Future Scope


In the previous sections we have dealt with various applications, aspects and scope of Deep Learning in Big Data Analytics. Although Deep Learning has been moderately explored for image processing, speech recognition, pattern recognition and the like, some areas such as Network Security and Bioinformatics remain largely unexplored. Considering the low maturity of Deep Learning, we note that considerable work remains to be done. In this section, we discuss our insights on some remaining questions in Deep Learning research, especially on the work needed for improving machine learning and the formulation of high-level abstractions and data representations for Big Data.

An important problem is whether to utilize the entire Big Data input corpus available when analyzing data with Deep Learning algorithms. The general focus is to apply Deep Learning algorithms to train the high-level data representation patterns on a portion of the available input corpus, and then to utilize the remaining input corpus with the learnt patterns for extracting the data abstractions and representations. In the context of this problem, a question to explore is what volume of input data is generally necessary to train useful (good) data representations with Deep Learning algorithms, which can then be generalized to new data in the specific Big Data application domain.

Exploring this problem further, we recall the Variety characteristic of Big Data Analytics, which focuses on the variation of the input data types and domains in Big Data. Here, by considering the shift between the input data source (for training the representations) and the target data source (for generalizing the representations), the problem becomes one of domain adaptation for Deep Learning in Big Data Analytics. Glorot et al. [26] demonstrate that Deep Learning is able to discover intermediate data representations in a hierarchical learning manner, and that these representations are meaningful to, and can be shared among, different domains. However, their study does not explicitly encode the distribution shift of the data between the source domain and the target domains, and hence needs further research.

Another key area of interest is the question of what criteria are necessary, and should be defined, for allowing the extracted data representations to provide useful semantic meaning to the Big Data. In some Big Data domains the input corpus consists of a mix of both labeled and unlabeled data, e.g., cyber security [27], fraud detection [28], and computer vision [29]. In such cases, Deep Learning algorithms can incorporate semi-supervised training methods towards the goal of defining criteria for good data representation learning. Active learning methods, a variation of semi-supervised learning in data mining, could also be applicable towards obtaining improved data representations, where input from crowdsourcing or human experts can be used to obtain labels for some data samples, which can then be used to better tune and improve the learnt data representations.

9. Project Proposal for the next semester.


We have implemented and analysed the traditional Artificial Neural Network with several layers and found its performance unsatisfactory, especially when the number of instances increases significantly. This encouraged us to move to a better platform in terms of hardware (GPU), software (Theano, Torch 7) and architecture (Deep Learning). We have implemented the Deep Learning tool Theano without hidden layers this semester and have found a promising leap in terms of execution time.

In the next semester we will make a thorough analysis by introducing hidden layers into our program and making a comparative study of Artificial Neural Networks and Theano in terms of performance. We will generate Big Data, implement Deep Learning on it, and record the results obtained.

We would also like to explore two comparatively untouched areas of Deep Learning, viz. Network Security and Bioinformatics. Deep Learning algorithms are mostly domain specific; for example, Convolutional Neural Networks handle images efficiently because the individual neurons are tiled in such a way that they respond to overlapping regions in the visual field. Likewise, we would like to develop an efficient algorithm each for the Network Security and Bioinformatics domains in order to scale up their performance.

10. Conclusion

In contrast to more conventional machine learning and feature engineering algorithms, Deep Learning has the advantage of potentially providing a solution to the data analysis and learning problems found in massive volumes of input data. More specifically, it aids in automatically extracting complex data representations from large volumes of unsupervised data. This makes it a valuable tool for Big Data Analytics, which involves data analysis from very large collections of raw data that is generally unsupervised and uncategorized. The hierarchical learning and extraction of different levels of complex data abstractions in Deep Learning provides a certain degree of simplification for Big Data Analytics tasks, especially for analysing massive volumes of data, semantic indexing, data tagging, information retrieval, and discriminative tasks such as classification and prediction.

In the context of discussing key works in the literature and providing our insights on those specific topics, this study focused on two important areas related to Deep Learning and Big Data: (1) the application of Deep Learning algorithms and architectures for Big Data Analytics, and (2) how certain characteristics and issues of Big Data Analytics pose unique challenges for adapting Deep Learning algorithms to those problems. A targeted survey of important literature in Deep Learning research and its application to different domains is presented in the paper as a means to identify how Deep Learning can be used for different purposes in Big Data Analytics.

The low maturity of the Deep Learning field warrants extensive further research. In particular, more work is necessary on how we can adapt Deep Learning algorithms for problems associated with Big Data, including high dimensionality, streaming data analysis, scalability of Deep Learning models, improved formulation of data abstractions, distributed computing, semantic indexing, data tagging, information retrieval, criteria for extracting good data representations, and domain adaptation. Yann LeCun, a pioneer of Deep Learning, paints an unexpected portrait of what future architectures sit between current deep learning capabilities and the next stage of far smarter, much larger neural nets. He states that current architectures do not offer enough performance to stand up to the next crop of deep learning algorithms, which overextend current acceleration tools and run into other programmatic limitations [20]. Future work should focus on addressing one or more of these problems often seen in Big Data, thus contributing to the Deep Learning and Big Data Analytics research corpus [26].

References

[1] Xue-Wen Chen (Senior Member, IEEE) and Xiaotong Lin, "Big Data Learning: Challenges and Perspectives."
[2] Yoshua Bengio, Ian Goodfellow and Aaron Courville, Deep Learning, MIT Press, in preparation.
[3] Maryam M. Najafabadi, Flavio Villanustre, Taghi M. Khoshgoftaar, Naeem Seliya, Randall Wald and Edin Muharemagic, "Deep Learning Applications and Challenges in Big Data Analytics."
[4] Big Data: The Next Frontier for Innovation, Competition, and Productivity (May 2011).
[5] Najafabadi et al., Journal of Big Data (2015) 2:1.
[6] SAP (2015), An Introduction to Deep Learning: Examining the Advantages of Hierarchical Learning.
[7] Theano Documentation, Release 0.7, LISA Lab, University of Montreal.
[8] Josh Patterson, Enterprise Deep Learning with DL4J, Skymind, Hadoop Summit 2015.
[9] Skymind (2015), Deeplearning4j, deeplearning4j.org. DL4J is distributed under an Apache 2.0 licence (GitHub).
[10] "Torch: a modular machine learning software library", 30 October 2002. Retrieved 24 April 2014.
[11] Ronan Collobert, "Torch", GitHub.
[12] Torch iOS GitHub repository.
[13] Mikio L. Brown (28 January 2014), What is going on with DeepMind and Google.
[14] Soumith Chintala, FAIR open sources deep learning modules for Torch.
[15] Honglak Lee, University of Michigan, Tutorial on deep learning and applications.
[16] Dandan Mo, A survey on deep learning: one small step towards AI, December 4, 2012.
[17] D. O. Hebb (1949), The Organization of Behavior.
[18] Peter Dayan, Unsupervised Learning, MIT.
[19] Alex Smola and S. V. N. Vishwanathan, Introduction to Machine Learning.
[20] Nicole Hemsoth (August 25, 2015), A Glimpse into the Future of Deep Learning Hardware.
[21] A. Smola and S. Narayanamurthy, "An Architecture for Parallel Topic Models", Proc. VLDB Endowment, vol. 3, no. 1, pp. 703-710, 2010.
[22] Li Deng and Dong Yu, Deep Learning: Methods and Applications.
[23] Y. Bengio, N. Boulanger and R. Pascanu, "Advances in optimizing recurrent networks", in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
[24] Y. Bengio, A. Courville and P. Vincent, "Representation learning: A review and new perspectives", IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 38:1798-1828, 2013.
