Learning Large-Scale Multimodal Data Streams€¦ · Why learning with multimodal deep neural...

Learning Large-Scale Multimodal Data Streams

– Ranking, Mining, and Machine Comprehension

Hung-Yi LEE (李宏毅)National Taiwan University

@GTC 2017, May 8, 2017

http://speech.ee.ntu.edu.tw/~tlkagk/http://winstonhsu.info/

Winston H. HSU (徐宏民)National Taiwan University &

IBM TJ Watson Ctr., New York

@GTC, May 2017 – Winston Hsu

2


The First AI-Generated Movie Trailer – Identifying the

“Horror” Factors by Multimodal Learning

▪ The first movie trailer generated by AI system (Watson)

(tender)

(scary)

(suspenseful)

https://www.ibm.com/blogs/think/2016/08/cognitive-movie-trailer/

1


Detecting Activities of Daily Living (ADL) from

Egocentric Videos

▪ Activities of daily living – used in healthcare to refer to

people's daily self care activities

– Enabling technologies for exciting applications

▪ Very challenging!!

4

ADL: brushing teeth

2

https://www.advancedrm.com/measuring-adls-to-assess-needs-and-

improve-independence/


Our Proposal: Beyond Objects – Leveraging More

Contexts by Multimodal Learning

tap

cup

toothbrush

Objects [1]:

• tap

• cup

• toothbrush

Scene:

bathroom …Scenes:

• Bathroom: 0.8• Kitchen: 0.1

• Living room: 0.01

• ….

Sensors:

• Accelerometer

• Mic.

• Heartrate

Sensors

[1] Ramanan et al., Detecting Activities of Daily Living in First-person Camera Views, CVPR 2012

[2] Hsieh et al., Egocentric activity recognition by leveraging multiple mid-level representations, ICME 2016

CNN for

scene recognition (67)

5

[Hsieh et al., ICME’16]


Experimental Results for ADL

– Multimodal Learning Matters!

▪ Egocentric videos collected of 20 people (by Google Glass, GeneActiv)

6

0%

10%

20%

30%

40%

50%

60%

70%Accuracy

[1] Ramanan et al., Detecting Activities of Daily Living in First-person Camera Views, CVPR 2012

[2] Hsieh et al., Egocentric activity recognition by leveraging multiple mid-level representations, ICME 2016

@GTC, May 2017 – Winston Hsu 7

Perception/understanding is multimodal.

How to design multimodal (end-to-end)

deep learning frameworks?


Outlines

▪ Why learning with multimodal deep neural networks

▪ Requiring techniques for multimodal learning

▪ Sample projects

– Medical segmentation by cross-modal and sequential

learning

– Cross domain and cross-view learning for 3D retrieval

– Speech Summarization

– Speech Question Answering

– Audio Word to Vector

8


3D Medical Segmentation by Deep Neural Networks

▪ Motivations – 3D biomedical segmentation plays a vital

role in biomedical analysis.

▪ Brain tumors have different kinds of shapes, and can

appear anywhere in the brain very challenging to

localize the tumors

▪ Goal – To perform 3D segmentation

with deep methods and segment by

stacking all the 2D slices (sequences).

▪ Observing oncologists leverage the

multi-modal signals in tumor

diagnosis

9

3

[Tseng et al., CVPR 2017]


Multi-Modal Biomedical Images

▪ 3D multi-modal MRI images

– Different modalities used to distinguish the boundary of

different tumor tissues (e.g., edema, enhancing core,

non-enhancing core, necrosis)

– Four modalities: Flair, T1, T1c, T2

10

T1T1c T2Flair


Related Work – SegNet (2D Image)

▪Structured as encoder and decoder with multi-

resolution fusion (MRF)

▪But

– Ignoring multi-modalities

– lacking sequential learning11Badrinarayanan, et al., SegNet: A Deep Convolutional Encoder-Decoder Architecture for

Image Segmentation, 2015


3D Medical Segmentation by Deep Neural Networks

▪ Our proposal – (first-ever) utilizing cross-modal learning

in the (end-to-end) sequential and convolutional neural

networks and effectively aggregating multiple resolutions

12

[Tseng et al., CVPR 2017]

Kuan-Lun Tseng, Yen-Liang Lin, Winston Hsu and Chung-Yang Huang. Joint Sequence

Learning and Cross-Modality Convolution for 3D Biomedical Segmentation. CVPR 2017


ConvLSTM – Temporally Augmented Convolutional

Neural Networks

▪ Convolutional + sequential networks, e.g., convLSTM

– Modeling spatial cues in temporal (sequential) evolvements

▪ LSTM vs. convLSTM: Traditional LSTM

employs the dot-product; Conv-LSTM

replaces the dot-product by convolution.

13Shi, et al., Convolutional LSTM Network: A Machine Learning Approach for Precipitation

Nowcasting, NIPS 2015


Cross Modality Convolution (CMC)

– For Each Slice

T1c

T2

Flair

Chan C

Chan C

Chan C

Chan 2

Chan 1

Chan 2

Chan 1

Chan 2

Chan 1

C

w

……

…

Chan C

Chan 2

Chan 1

…

Chan 1

Chan 1

Chan 1

Chan 1w

w

K

…

h

…

T1

h

…

Chan 2

Chan 2

Chan 2

Chan 2

Chan C

Chan C

Chan C

Chan C

…

Multi-Modal Encoder

Cross-Modality Convolution

Decoder

Convolution

LSTM

: Conv + Batch Norm + ReLU

: Max pooling

: Deconv

: Conv + Batch Norm + ReLU

Encoder:

Decoder:

Flair

T2

T1c

…

…

T1

Convolution

LSTM

Cross-Modality

Convolution

Multi-Modal

Encoder

Convolution

LSTM

Cross-Modality

Convolution

Multi-Modal

Encoder

slice 1

slice 2

slice n

Convolution

LSTM

Cross-Modality

ConvolutionMulti-Modal

Encoder

slice 1

slice 2

…

slice n

slice 1

slice 2

…

slice n

slice 1

slice 2

…

slice n

slice 1

slice 2

…

slice n

…

…

…

Decoder

Decoder

Decoder

Detailed structure in Figure 2

(a) (b) (c) (d) (e) (f) (g) (h)

convolution with

kernel 4x1x1xC

Tensor(C * h * w * 4)

14


Comparing with the State-of-The-Art in BRATS-2015

(b) Ground truth (c) U-Net (e) CMC +

convLSTM (ours)

(d) CMC (ours)(a) MRI slices

15

▪ MRF is effective

▪ MME + CMC is better than regular encoder + decoder

▪ Two phase is an important training strategy for

imbalanced data

▪ convLSTM, sequential modeling, helps slightly


Sketch/Image-Based 3D Model Search

▪Speeding up 3D design and printing

– Current 3D shape search engines take text inputs only

– Leveraging large-scale freely available 3D models

▪Various applications in 3D models: 3D printing,

AR, 3D game design, etc.

16

[Liu et al., ACMMM’15]

demo4

[Lee et al., 2017]


▪ To retrieve 3D shapes based on photo inputs

▪ Challenges:

– Effective feature representations of 3D shapes (with CNNs)

– Image to 3D cross-domain similarity learning

Image-based 3D Shape Retrieval

17

Query

[Lee et al., 2017]


Our Proposal – Cross-Domain 3D Shape Retrieval with

View Sequence Learning

▪ Novel proposal – End-to-end deep neural networks for cross-domain

and cross-view learning and efficient triplet learning

▪ A brand-new problem

18

[Lee et al., 2017]

Image-CNNAdaptation

Layer

Cross-View

Convolution

Rank by

L2 distanceView-CNN…

Query Image

3D Shapes

Image

representation

Shape

representation

Top Ranked 3D Shapes:

Rendered

Views

…

View-CNN


Cross-Domain (Distance Metric) Learning: Siamese vs.

Triplet Networks

19Wang, Jiang, et al. "Learning fine-grained image similarity with deep ranking." CVPR 2014.

Contrastive

Loss

image1 image2

Triplet

Loss

positive

image

anchor

image

negative

image

identical,

weights sharedidentical,

weights shared

Neural Networks

(CNN / DNN..)


▪ Straightforward but ignoring view sequences

– Each view is passed to the same CNN (shared weights)

– View-pooling is a MAX POOLING operation

Baseline: MVCNN, 3D Shape Feature by Max Pooling

– Ignoring Sequences

……

View-Pooling

conv1 → pool5

fc6 fc7 fc8

airplane

carbed

…

feature

(4096D)(same size as pool5)

Pool 5

(4096D)

Su, Hang, et al. "Multi-view convolutional neural networks for 3d shape recognition.” CVPR 2015 20


Our Proposal: Cross-Domain Triplet NN with View

Sequence Learning

21

▪ Cross-View Convolution aggregates multi-view features

▪ The adaptation layer adapts image features to the joint embedding space

▪ Late triplet sampling speeds up the training of cross-domain triplet learning


Cross-View Convolution (CVC)

22

▪ Stack the feature maps from V views by channel:

V x ( H x W x C ) → H x W x V x C

▪ Convolve the new tensor with K kernels (1 x 1 x V x C)

– Assign K == C → #output channel == #input channel (for comparisons)

– K = C = 256 = AlexNet pool5 feature map #channels

▪ CVC works as a weighted summation across views and channels

reshape

from CNN

features


Late Triplet Sampling (Fast-CDTNN) – Speeding Up

Cross-Domain Learning

▪ Naive cross-domain triplet neural networks (CDTNN) has three streams

▪ Fast-CDTNN has two streams. It forward sampled image/3D shape, and

enumerates the triplets (combinations) at the integrated triplet loss layer

▪ In our experiments, Fast-CDTNN is ~4x - 5x faster.

23


Comparisons and Datasets

▪ New image/3D shape: 12,311 3D shapes, 10,000 images,

across 40 categories

24

Methods mAP

AlexNet pool5 [1] 7.16%

MVCNN [2] 7.92%

Joint Embedding [3] 3.44%

CDTNN + view pooling [2] 40.85%

CDTNN + Adaptation Layer 47.84%

CDTNN + Adaptation Layer + CVC 52.67%

[1] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in

neural information processing systems. 2012.

[2] Su, Hang, et al. "Multi-view convolutional neural networks for 3d shape recognition." Proceedings of the IEEE International Conference

on Computer Vision. 2015.

[3] Li, Yangyan, et al. "Joint embeddings of shapes and images via cnn image purification." ACM Trans. Graph 5 (2015).


Sample Results and Demo

25

Bathtub

Person

Stool

Bed

Car

Bookshelf

Keyboard

Guitar

(a)

(b)

demo

Speech Summarization5

Summarization

Audio File

to be summarized

This is the summary.

➢ Select the most informative segments to form a compact version

Extractive Summaries

…… deep learning is powerful …… …… ……

[Lee & Lee, Interspeech 12][Lee & Lee, ICASSP 13][Shiang & Lee, Interspeech 13]

➢ Machine does not write summaries in its own words

Abstractive Summarization

• Now machine can do abstractive summary (write summaries in its own words)

• Title generation: abstractive summary with one sentence

Title 1

Title 2

Title 3

Training Data

title generated by machine

without hand-crafted rules

(in its own words)

Sequence-to-sequence

• Input: transcriptions of audio, output: title

ℎ1 ℎ2 ℎ3 ℎ4

RNN Encoder: read through the input

w1 w4w2 w3

transcriptions of audio from automatic speech recognition (ASR)

𝑧1 𝑧2 ……

……wA wB

RNN generator

Sequence-to-sequence

• Training data: 2M story-headline pairsROUGE-1 ROUGE-L

Manual Transcription as input 26.8 23.9

ASR Transcription as input 21.3 20.0

ASR Transcription as input+ Special Structure

22.9 20.9

Learn to ignore the words which are misrecognized when generating the title

[Yu, Lee, Lee, SLT 2016]

Demo

• http://140.112.30.37:2401/

• https://www.youtube.com/watch?v=X3BapMl7Wv8

• https://www.youtube.com/watch?v=hFVKpVMB-Rc

• https://www.youtube.com/watch?v=hYf3fARyNvg

• https://www.youtube.com/watch?v=usi8EUabU7Y

• https://www.youtube.com/watch?v=FUmd6EnVeWw

• From SONG TUYEN NEWS: https://www.youtube.com/channel/UC-P4mEcWZVrFfdZIuiODiTg

http://140.112.30.37:2401/

Speech Question Answering

6

Speech Question Answering

What is a possible origin of Venus’ clouds?

Speech Question Answering: Machine answers questions based on the information in spoken content

Gases released as a result of volcanic activity

New task for Machine Comprehension of Spoken Content

• TOEFL Listening Comprehension Test by Machine

Question: “ What is a possible origin of Venus’ clouds? ”

Audio Story:

Choices:

(A) gases released as a result of volcanic activity

(B) chemical reactions caused by high surface temperatures

(C) bursts of radio energy from the plane's surface

(D) strong winds that blow dust into the atmosphere

(The original story is 5 min long.)

New task for Machine Comprehension of Spoken Content

• TOEFL Listening Comprehension Test by Machine

“what is a possible origin of Venus‘ clouds?"

Question:

Audio Story:Neural

Network

4 Choices

e.g. (A)

answer

Using previous exams to train the network

ASR transcriptions

Model Architecture

“what is a possible origin of Venus‘ clouds?"

Question:

Question Semantics

…… It be quite possible that this be due to volcanic eruption because volcanic eruption often emit gas. If that be the case volcanism could very well be the root cause of Venus 's thick cloud cover. And also we have observe burst of radio energy from the planet 's surface. These burst be similar to what we see when volcano ……

Audio Story:

Speech Recognition

Semantic Analysis

Semantic Analysis

Attention

Answer

Select the choice most similar to the answer

Attention

The whole model learned end-to-end.

More Details

(A)

(A) (A) (A) (A)

(B) (B) (B)

Experimental ResultsA

ccu

racy

(%

)

random

Naïve approaches

Example Naïve approach:1. Find the paragraph containing most key terms in

the question.2. Select the choice containing most key terms in

that paragraph.

not easy to get high score without understanding

Experimental ResultsA

ccu

racy

(%

)

random

42.2% [Tseng, Shen, Lee, Lee, Interspeech’16]

Naïve approaches

48.8% [Fan, Hsu, Lee, Lee, SLT’16]

Analysis

• There are three types of questions

Type 3: Connecting Information➢Understanding Organization➢Connecting Content➢Making Inferences

Analysis

• There are three types of questions

Type 3: Pragmatic Understanding➢Understanding the Function of What Is Said➢Understanding the Speaker’s Attitude

Corpus & Codehttps://github.com/sunprinceS/Hierarchical-Attention-Model

Audio Word to Vector7

Framework of Spoken Language Understanding Tasks

Spoken Content

Text

Spoken Content Retrieval

Dialogue

Speech Summarization

SpeechQuestion Answering

SpeechRecognition

Can we bypass speech recognition? …

Why? Need the manual transcriptions of lots of audio to learn.

Most languages have little transcribed data.

New Research Direction: Audio Word to Vector

Typical Word to Vector

• Machine represents each word by a vector representing its meaning

• Learning from lots of text without supervision

dog

cat

rabbit

jumprun

flower

tree

Audio Word to Vector

• Machine represents each audio segment also by a vector

audio segment

vector

(word-level)

Learn from lots of audio without supervision

[Chung, Wu, Lee, Lee, Interspeech 16)

Sequence-to-sequence Auto-encoder

audio segment

acoustic featuresx1 x2 x3 x4

RNN Encoder

vector

The vector we want

We use sequence-to-sequence auto-encoder here

The training is unsupervised.

Sequence-to-sequence Auto-encoder

RNN Generatorx1 x2 x3 x4

y1 y2 y3 y4

x1 x2 x3x4

RNN Encoder

audio segment

acoustic features

The RNN encoder and generator are jointly trained.

Input acoustic features

What does machine learn?

• Typical word to vector:

• Audio word to vector (phonetic information)

𝑉 𝑅𝑜𝑚𝑒 − 𝑉 𝐼𝑡𝑎𝑙𝑦 + 𝑉 𝐺𝑒𝑟𝑚𝑎𝑛𝑦 ≈ 𝑉 𝐵𝑒𝑟𝑙𝑖𝑛

𝑉 𝑘𝑖𝑛𝑔 − 𝑉 𝑞𝑢𝑒𝑒𝑛 + 𝑉 𝑎𝑢𝑛𝑡 ≈ 𝑉 𝑢𝑛𝑐𝑙𝑒

V( ) - V( ) + V( ) = V( )

GIRL PEARL PEARLS

V( ) - V( ) + V( ) = V( )

GIRLS

Next Step ……

• Audio word to vector with semantics

flower tree

dog

cat

cats

walk

walked

run

One day we can build all spoken language understanding applications directly from audio word to vector.


Take-Home Messages

▪ Multimodal deep learning is ”must” for practical

applications

– Perception/understanding is multimodal

– Sensors are complementary and low-cost

▪ Dealing with issues including

– proper networks with domain knowledge, imbalanced training data,

cross-modality/cross-domain learning, training strategies, etc.

▪ Promising in many tasks

– Segmentation, Ranking, Summarization, Question Answering,

Unsupervised Representation

53


Thanks and Comments!

54

Hung-Yi LEE (李宏毅)

National Taiwan UniversityWinston H. HSU (徐宏民)

National Taiwan University

& IBM TJ Watson Ctr., New York

Learning Large-Scale Multimodal Data Streams€¦ · Why learning with multimodal deep neural...

Documents

Transcript of Learning Large-Scale Multimodal Data Streams€¦ · Why learning with multimodal deep neural...