Multimedia data mining using deep learning

Peter Wlodarczak

Agenda

Aims

Multimedia Data Mining

Artificial Neural Networks

Deep learning

Challenges

Discussion

Aims

Analyze multimedia data for:

Object/face recognition

Voice commands

Natural Language Processing

Classification

Automatic caption generation

Record linkage (entity resolution)

Multimedia Data Mining I

Multimedia data mining:

Unprecedented amount of multimedia data since Web 2.0 and social media

Prosumer data

Uses algorithms to extract useful patterns and relations from image, audio and video data

Traditional methods are often not satisfactory

Unsuitable for high-dimensional data

Multimedia Data Mining II

Multimedia data mining has been improved using deep learning in:

Visual data mining

Natural Language Processing

Deep learners are:

Machine learning schemes

Usually multi-layered artificial neural networks

Artificial Neural Networks I

Artificial Neural Networks:

Give good approximations for complex problems

Consist of perceptrons (neurons) and weighted connections (axons)

Artificial Neural Networks II

Perceptron (Neuron)

Linear classifier

Data linearly separable using a hyperplane

Binary classifier f(a) that maps its input vector a to a single binary output value

Decision boundary: $w_0 a_0 + w_1 a_1 + w_2 a_2 + \dots + w_k a_k = 0$

where w = weights, a = real-valued feature vector, a_0 = bias input (always 1)
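The decision rule of a single perceptron can be sketched in a few lines. This is a minimal illustration, assuming the bias input a0 is fixed at 1; the function name `perceptron_predict` and the weight values are made up for the example and do not come from the slides.

```python
def perceptron_predict(weights, features):
    """Return 1 if the point lies on the positive side of the hyperplane, else 0.

    weights  -- [w0, w1, ..., wk], where w0 is the bias weight
    features -- [a1, ..., ak]; the bias input a0 = 1 is prepended here
    """
    inputs = [1.0] + list(features)
    activation = sum(w * a for w, a in zip(weights, inputs))
    return 1 if activation > 0 else 0


# Hypothetical weights: the hyperplane is -1 + a1 + a2 = 0
weights = [-1.0, 1.0, 1.0]
print(perceptron_predict(weights, [0.2, 0.3]))  # point below the line
print(perceptron_predict(weights, [0.9, 0.8]))  # point above the line
```

Points on one side of the line a1 + a2 = 1 get class 0, points on the other side get class 1.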

Artificial Neural Networks III

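[Figure: perceptron with a bias input fixed at 1 (weight w0) and attribute inputs a1, a2, a3 (weights w1, w2, w3)]

$f(a) = \sum_k w_k a_k + b$

The class is decided by the sign: f(a) > 0 or f(a) < 0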

Artificial Neural Networks III

Training data

name      sex     mask  cape  tie  ears  smokes  class
Batman    male    yes   yes   no   yes   no      Good
Robin     male    yes   yes   no   no    no      Good
Alfred    male    no    no    yes  no    no      Good
Penguin   male    no    no    yes  no    yes     Bad
Catwoman  female  yes   no    no   yes   no      Bad
Joker     male    no    no    no   no    no      Bad

Test data

name      sex     mask  cape  tie  ears  smokes  class
Batgirl   female  yes   yes   no   yes   no      ?
Riddler   male    yes   no    no   no    no      ?

Supervised learning
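To make the supervised setting concrete, the sketch below trains a single perceptron on the six labelled characters and then classifies the two unlabelled ones. The 0/1 encoding of the attributes, the classic perceptron update rule, and all function names are assumptions chosen for illustration; the slides do not say which learner was used on this table.

```python
# Attribute order: sex (male=1), mask, cape, tie, ears, smokes; label +1 = Good, -1 = Bad
TRAIN = {
    "Batman":   ([1, 1, 1, 0, 1, 0], +1),
    "Robin":    ([1, 1, 1, 0, 0, 0], +1),
    "Alfred":   ([1, 0, 0, 1, 0, 0], +1),
    "Penguin":  ([1, 0, 0, 1, 0, 1], -1),
    "Catwoman": ([0, 1, 0, 0, 1, 0], -1),
    "Joker":    ([1, 0, 0, 0, 0, 0], -1),
}
TEST = {
    "Batgirl": [0, 1, 1, 0, 1, 0],
    "Riddler": [1, 1, 0, 0, 0, 0],
}


def score(weights, bias, x):
    """Weighted sum of the attributes plus the bias."""
    return sum(w * a for w, a in zip(weights, x)) + bias


def train_perceptron(examples, max_epochs=1000):
    """Perceptron rule: on each mistake, shift the hyperplane toward the example."""
    weights, bias = [0.0] * len(examples[0][0]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in examples:
            if y * score(weights, bias, x) <= 0:
                weights = [w + y * a for w, a in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:  # every training example is classified correctly
            break
    return weights, bias


def classify(weights, bias, x):
    return "Good" if score(weights, bias, x) > 0 else "Bad"


weights, bias = train_perceptron(list(TRAIN.values()))
for name, x in TEST.items():
    print(name, classify(weights, bias, x))
```

With this encoding the training set is linearly separable, so the loop stops once all six training examples are classified correctly. The predictions for Batgirl and Riddler depend on the hyperplane the learner happens to find.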

Artificial Neural Networks IV

Not all data is linearly separable

Artificial Neural Networks V

Multilayer Perceptron

Perceptrons organized in several layers

Each layer is fully connected to the next layer

All nodes except the input nodes are perceptrons

Feedforward neural network

Uses backpropagation for training

The error is propagated back to minimize the loss function

Artificial Neural Networks VI

A multilayer perceptron can be used for non-linear, multiclass classification
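XOR is a standard example of data that no single perceptron can separate, but that a two-layer network handles easily. The sketch below uses hand-picked weights and a step activation purely for illustration; in practice the weights would be learned, and the names `step` and `xor_mlp` are hypothetical.

```python
def step(z):
    """Threshold activation of a classic perceptron."""
    return 1 if z > 0 else 0


def xor_mlp(a1, a2):
    # Hidden layer: two perceptrons with hand-picked weights
    h_or = step(a1 + a2 - 0.5)    # fires if at least one input is 1
    h_and = step(a1 + a2 - 1.5)   # fires only if both inputs are 1
    # Output layer: "OR but not AND"
    return step(h_or - h_and - 0.5)


for a1 in (0, 1):
    for a2 in (0, 1):
        print(a1, a2, xor_mlp(a1, a2))
```

The hidden layer maps the inputs into a space where the two classes become linearly separable, which is what the output perceptron then exploits.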

Artificial Neural Networks VII

Gradient descent: optimization method for learning the weights
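As a minimal sketch of gradient descent, the example below fits a single weight w so that w * x approximates y, by repeatedly stepping against the gradient of the mean squared error. The toy data, the learning rate, and the function name `gradient_descent` are assumptions for illustration; backpropagation applies the same idea to every weight in a multilayer network.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=100):
    """Fit y ~ w * x by minimizing the mean squared error with gradient descent."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # dL/dw for L(w) = (1/n) * sum((w*x - y)^2)
        grad = (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
        w -= learning_rate * grad  # step downhill
    return w


# Toy data generated by y = 2 * x
w = gradient_descent([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(round(w, 4))
```

A learning rate that is too large makes the updates overshoot and diverge, while one that is too small makes convergence slow.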

Artificial Neural Networks VIII

Model complexity has to be appropriate: neither too simple nor too complex (Occam’s razor)

Schapire 2004

Artificial Neural Networks IX

Schapire 2004

Artificial Neural Networks X

For building an accurate classifier:

Enough training examples

Good performance on the training set

A classifier that is not too complex, to avoid overfitting

ANNs give approximate solutions for very complex problems

Support Vector Machines (SVMs) are often a simpler alternative to ANNs

Deep learning I

Deep learning

No clear-cut distinction from shallow learners

Multiple layers of non-linear processing units

Each layer represents features at a higher level

Forms a hierarchical representation

The majority of deep learners are ANNs

Deep learning II

Deep learning neural networks

Use Rectified Linear Units (ReLU)

Learn faster

Half-wave rectifier

f(z) = max(z, 0)

Use backpropagation for adjusting the weights
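The rectifier and the derivative that backpropagation needs can be written directly from f(z) = max(z, 0). This is a small sketch; the function names are hypothetical, and setting the derivative to 0 at z = 0 is a common convention rather than something stated in the slides.

```python
def relu(z):
    """Rectified linear unit: f(z) = max(z, 0)."""
    return max(z, 0.0)


def relu_grad(z):
    """Derivative used during backpropagation (0 is chosen at z = 0 by convention)."""
    return 1.0 if z > 0 else 0.0


print([relu(z) for z in (-2.0, 0.0, 3.5)])
print([relu_grad(z) for z in (-2.0, 0.0, 3.5)])
```

For positive inputs the gradient is constant at 1, so it does not shrink as it is passed back through many layers.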

Deep learning III - ConvNet

LeNet 2015

Deep learning IV - ConvNet

Convolutional neural networks

Inspired by the animal visual cortex

The visual cortex is the most powerful visual processing system in existence

Typically two stages:

Convolutional stage

Pooling stage

Characterized by

sparse connectivity

shared weights

Deep learning V - ConvNet

Shared weights

Subsets of units share the same weights and bias to form a feature map

Replicated across entire visual field
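Weight sharing can be illustrated with a plain-Python convolution: one small kernel is slid over the whole image, and each output value of the feature map is computed from the same weights and bias. The toy image, the 2x2 edge kernel, and the function name `conv2d` are assumptions for illustration only.

```python
def conv2d(image, kernel, bias=0.0):
    """Slide one kernel over the image ("valid" positions only).

    Every output unit reuses the same kernel weights and bias (shared weights)
    and only sees a small patch of the input (sparse connectivity).
    Like most deep learning libraries, this computes a cross-correlation,
    i.e. the kernel is not flipped.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = bias
            for di in range(kh):
                for dj in range(kw):
                    total += kernel[di][dj] * image[i + di][j + dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Toy 4x4 image: dark left half, bright right half
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical 2x2 kernel that responds to a dark-to-bright vertical edge
kernel = [
    [-1, 1],
    [-1, 1],
]
for row in conv2d(image, kernel):
    print(row)
```

The feature map is non-zero only in the column where the edge lies, which is the sense in which a learned filter can behave like an edge detector.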

Deep learning VI - ConvNet

Each layer accepts a 3D input volume and transforms it into a 3D output volume

Filters activate when a specific feature is detected

CS231n 2015

Deep learning VII - ConvNet

Receptive field spans all feature maps

LeNet 2015

Deep learning VIII - ConvNet

MaxPooling

Non-linear down-sampling

Partitions the input into non-overlapping rectangles

Outputs the maximum value for each sub-region

Reduces computation for the next layer

Reduces the dimensionality of intermediate representations
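Max pooling as described above can be sketched in a few lines. This assumes square, non-overlapping blocks and a feature map whose sides are multiples of the block size; the function name `max_pool` and the toy values are illustrative.

```python
def max_pool(feature_map, size=2):
    """Split the map into non-overlapping size x size blocks and keep each block's maximum."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            block = [feature_map[i + di][j + dj]
                     for di in range(size) for dj in range(size)]
            row.append(max(block))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 4, 6],
]
print(max_pool(feature_map))
```

A 4x4 map becomes a 2x2 map, so the next layer has a quarter of the values to process.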

Deep learning IX - ConvNet

Convolutional and sampling sublayers

UFLDL 2015

Deep learning X - ConvNet

Image after cascading max-pooling with a convolutional layer

Similar to edge detector

Deep learning XI - RNN

Recurrent neural networks

Contain directed cycles

Take sequences as input: no fixed-size input and output vectors, e.g. natural speech

Deep learning XII - RNN

No fixed size of computations

Much simpler than ConvNets

Maintain an inner state that exhibits dynamic temporal behavior

Optimized through backpropagation

Can be extended with long-term memory, e.g. long short-term memory (LSTM) units

Don’t necessarily need sequences as input
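The inner state and the directed cycle can be sketched with a one-unit recurrent network. The update rule h_t = tanh(w_in * x_t + w_rec * h_{t-1} + bias) is the usual simple (Elman-style) recurrence; the weights, the scalar state, and the function name `rnn_forward` are assumptions chosen to keep the example small.

```python
import math


def rnn_forward(sequence, w_in, w_rec, bias, h0=0.0):
    """Run a one-unit recurrent network over a sequence of numbers.

    h_t = tanh(w_in * x_t + w_rec * h_{t-1} + bias)
    The recurrent weight w_rec forms the directed cycle: the state at step t
    depends on the state at step t-1.
    """
    h = h0
    states = []
    for x in sequence:
        h = math.tanh(w_in * x + w_rec * h + bias)
        states.append(h)
    return states


# Hypothetical weights; the same parameters handle sequences of any length
print(rnn_forward([1.0, 0.0, 0.0], w_in=1.0, w_rec=0.5, bias=0.0))
print(rnn_forward([1.0, 0.0, 0.0, 0.0, 0.0], w_in=1.0, w_rec=0.5, bias=0.0))
```

The input at the first step keeps influencing later states even though the later inputs are zero, which is the "memory" that a feedforward network lacks.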

Deep learning XIII - RNN

Training an RNN is a non-linear global optimization problem

Trained using stochastic gradient descent

Non-linear, differentiable activation function, e.g. rectifier

Trained through backpropagation through time (BPTT)

Genetic algorithms can be used for training

Deep learning XIV - RNN

Many different architectures for RNNs

Examples: Elman simple recurrent network (SRN), spiking neural network

Deep learning XV - RNN

RNN learns to read house numbers

RNN learns to paint house numbers

Karpathy 2015

Deep learning XVI - RNN

RNNs are used for

Transcribing speech to text

Speech synthesis

Machine translation

Deep learning XVII

Combining ConvNets and RNNs for image descriptions

Regions described using language as label space, using a ConvNet

Language synthesis using an RNN

Karpathy & Fei-Fei 2014

Deep learning XVIII

ConvNet and RNN can be combined

Automated caption generation

Deep learning XIX

Automatic feature extraction

No closed vocabulary set

Alignment of sentence segments to regions of the image

Karpathy & Fei-Fei 2014

Deep learning XX

Other applications

Object recognition

Movie classification

Handwriting recognition

Record linkage

Challenges I

Main disadvantage: large volumes of training data needed

Overfitting if there is not enough training data

Optimization is difficult

Finding relevant information

Privacy-preserving data mining

Challenges II

Describing actions

Discussion

Future research in

Attention based models

Finding relevant information

Data democratization and Internet of Things

Unsupervised learning

Semantic data modeling

Reasoning

Thank you for your attention

Questions?

References

Zhao, X, Li, X & Zhang, Z 2015, 'Multimedia Retrieval via Deep Learning to Rank', IEEE Signal Processing Letters, vol. 22, no. 9, pp. 1487-91, <http://ieeexplore.ieee.org.ezproxy.usq.edu.au/xpls/abs_all.jsp?arnumber=7054452>.

Yu, W, Zhuang, F, He, Q & Shi, Z 2015, 'Learning deep representations via extreme learning machines', Neurocomputing, vol. 149, Part A, pp. 308-15, <http://www.sciencedirect.com/science/article/pii/S0925231214011461>.

Xu, K, Ba, J, Kiros, R, Cho, K, Courville, A, Salakhutdinov, R, Zemel, R & Bengio, Y 2015, 'Show, Attend and Tell: Neural Image Caption Generation with Visual Attention', Proceedings of the 32nd International Conference on Machine Learning from Data: Artificial Intelligence and Statistics, vol. 37.

Xin, J, Wang, Z, Qu, L & Wang, G 2015, 'Elastic extreme learning machine for big data classification', Neurocomputing, vol. 149, Part A, pp. 464-71, <http://www.sciencedirect.com/science/article/pii/S0925231214011503>.

Weston, J, Chopra, S & Bordes, A 2015, 'Memory Networks', in 3rd International Conference on Learning Representations: proceedings of the 3rd International Conference on Learning Representations, San Diego, viewed <http://arxiv.org/pdf/1410.3916v10.pdf>.

Weilong, H, Xinbo, G, Dacheng, T & Xuelong, L 2015, 'Blind Image Quality Assessment via Deep Learning', Neural Networks and Learning Systems, IEEE Transactions on, vol. 26, no. 6, pp. 1275-86.

Wang, Y, Li, D, Du, Y & Pan, Z 2015, 'Anomaly detection in traffic using L1-norm minimization extreme learning machine', Neurocomputing, vol. 149, Part A, pp. 415-25, <http://www.sciencedirect.com/science/article/pii/S0925231214011382>.

Vinyals, O, Toshev, A, Bengio, S & Erhan, D 2015, 'Show and Tell: A Neural Image Caption Generator', Google, <http://arxiv.org/pdf/1411.4555v1.pdf>.

Noda, K, Yamaguchi, Y, Nakadai, K, Okuno, H & Ogata, T 2015, 'Audio-visual speech recognition using deep learning', Applied Intelligence, vol. 42, no. 4, pp. 722-37, <http://dx.doi.org/10.1007/s10489-014-0629-7>.

Mao, W, Zhao, S, Mu, X & Wang, H 2015, 'Multi-dimensional extreme learning machine', Neurocomputing, vol. 149, Part A, pp. 160-70, <http://www.sciencedirect.com/science/article/pii/S0925231214011540>.

Liu, X, Wang, L, Huang, G-B, Zhang, J & Yin, J 2015, 'Multiple kernel extreme learning machine', Neurocomputing, vol. 149, Part A, pp. 253-64, <http://www.sciencedirect.com/science/article/pii/S0925231214011199>.

LeCun, Y, Bengio, Y & Hinton, G 2015, 'Deep learning', Nature, vol. 521, no. 7553, pp. 436-44, <http://dx.doi.org/10.1038/nature14539>.

Srivastava, N, Hinton, G, Krizhevsky, A, Sutskever, I & Salakhutdinov, R 2014, 'Dropout: a simple way to prevent neural networks from overfitting', J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929-58.

Karpathy, A & Fei-Fei, L 2014, 'Deep visual-semantic alignments for generating image descriptions', arXiv preprint arXiv:1412.2306.
