
Lecture 9: RNNs (part 2) + Applications

Topics in AI (CPSC 532S): Multimodal Learning with Vision, Language and Sound

BackProp Through Time

Forward through the entire sequence to compute the loss, then backward through the entire sequence to compute the gradient.

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Truncated BackProp Through Time

Run forwards and backwards through (fixed-length) chunks of the sequence, instead of the whole sequence.

Carry hidden states forward across chunks, but only backprop through some smaller number of steps.

Implementation: Relatively Easy

… you will have a chance to experience this in Assignment 3.
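In code, truncated BPTT mostly amounts to detaching the carried hidden state between chunks. A minimal sketch, assuming PyTorch; the model, readout, loss, chunk length, and data names are all illustrative (for an LSTM, the hidden state is a tuple and both tensors would be detached):

```python
def train_truncated_bptt(rnn, readout, loss_fn, optimizer, seq_x, seq_y, chunk_len=25):
    """Carry the hidden state forward across chunks, but backprop only within each chunk."""
    h = None
    for t in range(0, seq_x.size(0), chunk_len):
        x_chunk = seq_x[t:t + chunk_len]            # (chunk, batch, input_dim)
        y_chunk = seq_y[t:t + chunk_len]            # (chunk, batch) next-token targets
        if h is not None:
            h = h.detach()                          # keep the value, cut the gradient history
        out, h = rnn(x_chunk, h)                    # forward through this chunk only
        logits = readout(out).flatten(0, 1)         # (chunk * batch, vocab)
        loss = loss_fn(logits, y_chunk.flatten())   # e.g., cross-entropy on next tokens
        optimizer.zero_grad()
        loss.backward()                             # gradients flow back at most chunk_len steps
        optimizer.step()
    return h
```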

Learning to Write Like Shakespeare

[Figure: a character-level language model (x → RNN → y) is trained on Shakespeare; samples are gibberish at first, start to look like words after training a bit more, and resemble Shakespeare after training]

Learning Code

Trained on the entire source code of the Linux kernel

DopeLearning: A Computational Approach to Rap Lyrics Generation

[ Malmi et al., KDD 2016 ]

Sunspring: First movie generated by AI

Multilayer RNNs

[Figure: an RNN unrolled along the horizontal axis (time) and stacked along the vertical axis (depth)]
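For concreteness, a multilayer RNN simply feeds each layer's hidden states to the layer above at every timestep; in PyTorch this is the num_layers argument. A minimal sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

# Depth is stacked along one axis, time along the other: at each timestep,
# layer k consumes the hidden state of layer k-1 at that same timestep.
deep_rnn = nn.RNN(input_size=128, hidden_size=256, num_layers=3, nonlinearity='tanh')

x = torch.randn(50, 8, 128)      # (time, batch, input_dim)
outputs, h_n = deep_rnn(x)       # outputs: top layer's hidden state at every timestep
print(outputs.shape, h_n.shape)  # torch.Size([50, 8, 256]) torch.Size([3, 8, 256])
```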

Vanilla RNN Gradient Flow [ Bengio et al., 1994 ] [ Pascanu et al., ICML 2013 ]

[Figure: a single RNN cell: h_{t-1} and x_t are stacked, multiplied by W, and passed through tanh to produce h_t]

Backpropagation from h_t to h_{t-1} multiplies by W (actually W_hh^T).

[Figure: the unrolled chain h_0 → h_1 → h_2 → h_3 → h_4 with inputs x_1, …, x_4; every step multiplies by W and applies tanh]

Computing the gradient of h_0 involves many factors of W (and repeated tanh):
— Largest singular value > 1: exploding gradients → fix with gradient clipping (scale the gradient if its norm is too big)
— Largest singular value < 1: vanishing gradients → fix by changing the RNN architecture
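Gradient clipping is a one-line addition to the training step. A minimal sketch assuming PyTorch; the model, placeholder loss, and threshold are illustrative (clipping addresses exploding gradients, while vanishing gradients motivate the architectural changes below):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=20)
optimizer = torch.optim.SGD(rnn.parameters(), lr=0.01)

x = torch.randn(30, 4, 10)        # (time, batch, input_dim)
out, h = rnn(x)
loss = out.pow(2).mean()          # placeholder loss, just to produce gradients

optimizer.zero_grad()
loss.backward()
# If the global gradient norm exceeds max_norm, rescale it to max_norm before stepping.
torch.nn.utils.clip_grad_norm_(rnn.parameters(), max_norm=5.0)
optimizer.step()
```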

Long Short-Term Memory (LSTM) [ Hochreiter and Schmidhuber, NC 1997 ]

[Figure: the vanilla RNN update shown side by side with the LSTM gate and cell-state updates]

Long Short-Term Memory (LSTM)

[Figure: the LSTM cell diagram, with the cell state / memory highlighted]

Image Credit: Christopher Olah (http://colah.github.io/posts/2015-08-Understanding-LSTMs/) * slide from Dhruv Batra

LSTM Intuition: Forget Gate

Should we continue to remember this “bit” of information or not?

Intuition: the memory and the forget gate output are multiplied; the output of the forget gate can be thought of as binary (0 or 1)

anything × 0 = 0 (forget)
anything × 1 = anything (remember)

LSTM Intuition: Input Gate

Should we update this “bit” of information or not? If yes, then what should we remember?


LSTM Intuition: Memory Update

Forget what needs to be forgotten + memorize what needs to be remembered


LSTM Intuition: Output Gate

Should we output this bit of information (e.g., to “deeper” LSTM layers)?

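Putting the four gates together, one LSTM step looks roughly like the following. This is a minimal PyTorch-style sketch, not the actual nn.LSTM/cuDNN implementation; the weight names and shapes are illustrative:

```python
import torch

def lstm_cell_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. x_t: (B, D), h_prev/c_prev: (B, H), W: (4H, D), U: (4H, H), b: (4H,)."""
    H = h_prev.size(-1)
    gates = x_t @ W.T + h_prev @ U.T + b            # all four gates from one affine map
    f = torch.sigmoid(gates[:, 0*H:1*H])            # forget gate: keep this "bit" of memory?
    i = torch.sigmoid(gates[:, 1*H:2*H])            # input gate: update this "bit"?
    g = torch.tanh(gates[:, 2*H:3*H])               # candidate values to write
    o = torch.sigmoid(gates[:, 3*H:4*H])            # output gate: expose this "bit"?
    c_t = f * c_prev + i * g                        # forget what needs forgetting + memorize the new
    h_t = o * torch.tanh(c_t)                       # gated output / hidden state
    return h_t, c_t
```

Note that c_t is updated additively: backpropagating through it multiplies only by the elementwise forget gate f, which is the point of the next slide.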

LSTM Intuition: Additive Updates

Backpropagation from c_t to c_{t-1} involves only elementwise multiplication by f, with no matrix multiply by W

Uninterrupted gradient flow! Similar to ResNet.

[Figure: gradient flowing backward through a chain of LSTM cells along the cell-state path, shown next to a ResNet architecture diagram (7x7 conv, stacks of 3x3 conv layers, pooling, FC 1000, softmax)]

LSTM Variants: with Peephole Connections

Lets the gates see the cell state / memory

LSTM Variants: with Coupled Gates

Only memorize new information when you’re forgetting old


Gated Recurrent Unit (GRU)

No explicit memory; memory = hidden output

z = memorize new and forget old

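A minimal sketch of one GRU step (PyTorch-style tensor code; weight names and shapes are illustrative, and the sign convention for z varies across write-ups): the update gate z simultaneously forgets the old state and memorizes the new candidate.

```python
import torch

def gru_cell_step(x_t, h_prev, W_g, U_g, b_g, W_h, U_h, b_h):
    """One GRU step. x_t: (B, D), h_prev: (B, H); W_g: (2H, D), U_g: (2H, H), b_g: (2H,),
    W_h: (H, D), U_h: (H, H), b_h: (H,). No separate cell state: memory = hidden output."""
    H = h_prev.size(-1)
    gates = torch.sigmoid(x_t @ W_g.T + h_prev @ U_g.T + b_g)
    z, r = gates[:, :H], gates[:, H:]                                # update and reset gates
    h_tilde = torch.tanh(x_t @ W_h.T + (r * h_prev) @ U_h.T + b_h)   # candidate state
    h_t = (1 - z) * h_prev + z * h_tilde                             # z trades old memory for new
    return h_t
```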

RNNs: Review

Key Enablers:
— Parameter sharing in computational graphs
— "Unrolling" in computational graphs
— Allows modeling arbitrary-length sequences!

[Figure: x → RNN → y]

Vanilla RNN: vanishing or exploding gradients
Long Short-Term Memory (LSTM): uninterrupted gradient flow!

Loss functions: often cross-entropy (for classification); could be max-margin (as in an SVM) or squared loss (for regression)

LSTM/RNN Challenges

— LSTM can remember some history, but not too long
— LSTM assumes data is regularly sampled

Phased LSTM

Gates are controlled by phased (periodic) oscillations

[ Neil et al., 2016 ]

Bi-directional RNNs/LSTMs

For reference, the uni-directional (vanilla) RNN:

h_t = f_W(h_{t-1}, x_t)
h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)
y_t = W_{hy} h_t + b_y

A bi-directional RNN runs one recurrence forward in time and one backward in time, and combines both hidden states in the output:

\overrightarrow{h}_t = f_{\overrightarrow{W}}(\overrightarrow{h}_{t-1}, x_t)
\overleftarrow{h}_t = f_{\overleftarrow{W}}(\overleftarrow{h}_{t+1}, x_t)

\overrightarrow{h}_t = \tanh(\overrightarrow{W}_{hh}\,\overrightarrow{h}_{t-1} + \overrightarrow{W}_{xh}\, x_t + \overrightarrow{b}_h)
\overleftarrow{h}_t = \tanh(\overleftarrow{W}_{hh}\,\overleftarrow{h}_{t+1} + \overleftarrow{W}_{xh}\, x_t + \overleftarrow{b}_h)

y_t = W_{hy}\,[\overrightarrow{h}_t, \overleftarrow{h}_t]^T + b_y
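The equations above translate directly into two recurrences over the same input, one left-to-right and one right-to-left. A minimal PyTorch-style sketch (in practice one would use nn.RNN/nn.LSTM with bidirectional=True; weight names and shapes here are illustrative):

```python
import torch

def bidirectional_rnn(xs, W_f, U_f, b_f, W_b, U_b, b_b, W_hy, b_y):
    """xs: (T, B, D). Vanilla-RNN weights per direction: W_*: (H, D), U_*: (H, H), b_*: (H,).
    Output weights W_hy: (O, 2H), b_y: (O,)."""
    T, B, _ = xs.shape
    H = U_f.size(0)
    h_fwd, h_bwd = [None] * T, [None] * T
    h = xs.new_zeros(B, H)
    for t in range(T):                                       # forward-in-time recurrence
        h = torch.tanh(xs[t] @ W_f.T + h @ U_f.T + b_f)
        h_fwd[t] = h
    h = xs.new_zeros(B, H)
    for t in reversed(range(T)):                             # backward-in-time recurrence
        h = torch.tanh(xs[t] @ W_b.T + h @ U_b.T + b_b)
        h_bwd[t] = h
    # y_t combines both directions: y_t = W_hy [h_fwd_t, h_bwd_t]^T + b_y
    ys = [torch.cat([h_fwd[t], h_bwd[t]], dim=-1) @ W_hy.T + b_y for t in range(T)]
    return torch.stack(ys)                                   # (T, B, O)
```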

Attention Mechanisms and RNNs

Consider a translation task: this is one of the first neural translation models.

[Figure: English encoder → summary vector → French decoder]

https://devblogs.nvidia.com/introduction-neural-machine-translation-gpus-part-3/

Attention Mechanisms and RNNs

Consider a translation task with a bi-directional encoder of the source language.

[Figure: bi-directional English encoder feeding a French decoder]

Build a small neural network (one layer) with a softmax output that takes (1) everything decoded so far (encoded by the previous decoder state z_i) and (2) the encoding of the current source word (the encoder hidden state h_j), and predicts the relevance of every source word to the next translated word.

The resulting attention weights \alpha_j give the context vector:

c_i = \sum_{j=1}^{T} \alpha_j h_j

[ Cho et al., 2015 ]
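A minimal sketch of that scoring-and-averaging step (additive attention, PyTorch-style; the one-layer scorer's parameters W_z, W_h, v and all shapes are illustrative assumptions):

```python
import torch

def attention_context(z_prev, enc_states, W_z, W_h, v):
    """z_prev: (B, Hz) previous decoder state; enc_states: (T, B, Hh) encoder states h_j.
    W_z: (A, Hz), W_h: (A, Hh), v: (A,) form the small one-layer scoring network."""
    scores = torch.tanh(z_prev @ W_z.T + enc_states @ W_h.T) @ v   # (T, B): relevance of each source word
    alpha = torch.softmax(scores, dim=0)                           # softmax over source positions
    context = (alpha.unsqueeze(-1) * enc_states).sum(dim=0)        # c_i = sum_j alpha_j h_j, shape (B, Hh)
    return context, alpha
```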


Regularization in RNNs

Standard dropout in recurrent layers does not work because it causes loss of long-term memory!

Alternatives:
— Dropout in the input-to-hidden or hidden-to-output layers [ Zaremba et al., 2014 ]
— Apply dropout at the sequence level (same zeroed units for the entire sequence; see the sketch below) [ Gal, 2016 ]
— Dropout only at the cell update (for LSTM and GRU units) [ Semeniuta et al., 2016 ]
— Enforce the norm of the hidden state to be similar along time [ Krueger & Memisevic, 2016 ]
— Zoneout: some hidden units copy their state to the next timestep [ Krueger et al., 2016 ]

* slide from Marco Pedersoli and Thomas Lucas
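A minimal sketch of the sequence-level variant (one dropout mask sampled per sequence and reused at every timestep), assuming PyTorch; the function name and shapes are illustrative:

```python
import torch

def sequence_level_dropout(x, p=0.3, training=True):
    """x: (T, B, D). Sample one mask per sequence element and reuse it for all T steps,
    instead of resampling each step (which is what erodes long-term memory)."""
    if not training or p == 0.0:
        return x
    mask = x.new_empty(1, x.size(1), x.size(2)).bernoulli_(1 - p) / (1 - p)  # inverted-dropout scaling
    return x * mask                    # the same mask broadcasts over the time dimension
```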

Teacher Forcing

[Figure: a character-level RNN on "hello": a softmax over characters at every timestep; the sampled characters are "e", "l", "l", "o"]

Training objective: predict the next word (cross-entropy loss)

Testing: sample the full sequence

Training and testing objectives are not consistent!

Teacher Forcing

Slowly move from teacher forcing to sampling: the probability of sampling from the ground truth decays over the course of training

[ Bengio et al., 2015 ]

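A minimal sketch of scheduled sampling with an inverse-sigmoid decay of the teacher-forcing probability (PyTorch-style; the cell/readout/embedding modules, the decay constant k, and the start-token convention are illustrative assumptions):

```python
import math
import random
import torch

def teacher_forcing_prob(step, k=1000.0):
    """Inverse sigmoid decay of the probability of feeding the ground-truth token."""
    return k / (k + math.exp(min(step / k, 500.0)))   # clamp to avoid overflow at very large steps

def decode_with_scheduled_sampling(cell, readout, embed, targets, h, step):
    """targets: (T, B) token ids. With probability eps feed the ground truth,
    otherwise feed the model's own sampled prediction."""
    eps = teacher_forcing_prob(step)
    inp = targets[0]                                   # start tokens
    logits_seq = []
    for t in range(1, targets.size(0)):
        h = cell(embed(inp), h)                        # e.g., an nn.GRUCell
        logits = readout(h)                            # (B, vocab)
        logits_seq.append(logits)
        if random.random() < eps:
            inp = targets[t]                           # teacher forcing
        else:
            inp = torch.distributions.Categorical(logits=logits).sample()  # model sample
    return torch.stack(logits_seq), h
```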

Teacher Forcing: comparison on image captioning

— Baseline: Google NIC captioning model
— Baseline with Dropout: regularized RNN version
— Always sampling: use sampling from the beginning of training
— Scheduled sampling: sampling with inverse sigmoid decay
— Uniform scheduled sampling: scheduled sampling, but with a uniform (fixed) sampling probability

Sequence Level Training

During training the objective is different than at test time:
— Training: generate the next word given the previous ones
— Test: generate the entire sequence given an initial state

Directly optimize the evaluation metric (e.g., BLEU score for sentence generation).

Cast the problem as Reinforcement Learning:
— the RNN is an Agent
— the Policy is defined by the learned parameters
— an Action is the selection of the next word based on the policy
— the Reward is the evaluation metric

[ Ranzato et al., 2016 ]
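Under this framing, the policy gradient is typically estimated with REINFORCE. A minimal sketch of the resulting loss (the baseline and all names are illustrative; the reward, e.g., sentence-level BLEU, is computed outside this function):

```python
import torch

def reinforce_sequence_loss(log_probs, rewards, baseline=0.0):
    """log_probs: (T, B) log-probabilities of the sampled words (the actions).
    rewards: (B,) evaluation metric (e.g., BLEU) of each completed sampled sequence."""
    advantage = rewards - baseline                    # subtracting a baseline reduces variance
    # REINFORCE: push up the log-probability of sequences with above-baseline reward.
    return -(log_probs.sum(dim=0) * advantage).mean()
```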