
Generalized Tensor Models for RNNs

Valentin Khrulkov, Oleksii Hrinchuk, Ivan Oseledets

(2019)

Tianyu Li, Bhairav Mehta and Koustuv Sinha

IFT 6760A

March 21, 2019


Overview

1 Motivation

2 Tensor Decomposition and Neural Networks

3 Nonlinear Generalization

4 Main Results

5 Experiments

6 Conclusion


Motivation

RNNs have been widely applied in many fields

The theoretical understanding of RNNs is lacking

There is a natural relationship between tensor decompositions and linear neural networks

Idea: work with the tensor instead of the network for the analysis


Why Depth?

It has recently been shown that depth allows neural networks to express rich functions with relatively few parameters.

The theory is not well understood, due to the difficulty of incorporating nonlinearities into the analysis.


Basics - Data representation

Suppose we are given a dataset of sequential structure:

X = (x^(1), x^(2), ..., x^(T)),   x^(t) ∈ R^N

Transform the dataset into a feature tensor Φ(X), which is an outer product of the feature vectors.

fθ(x) = σ(Ax + b)

Φ(X) = fθ(x^(1)) ⊗ fθ(x^(2)) ⊗ ··· ⊗ fθ(x^(T))
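As a concrete illustration, here is a minimal NumPy sketch of this construction; the dimensions and the choice σ = tanh are assumptions made for the example, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

N, M, T = 4, 3, 5            # input dim, feature dim, sequence length (arbitrary)
A = rng.normal(size=(M, N))  # parameters theta = (A, b) of the feature map
b = rng.normal(size=M)

def f_theta(x):
    # f_theta(x) = sigma(Ax + b); sigma = tanh is an assumption for this sketch
    return np.tanh(A @ x + b)

X = [rng.normal(size=N) for _ in range(T)]   # sequence x^(1), ..., x^(T)

# Phi(X) = f_theta(x^(1)) (outer) ... (outer) f_theta(x^(T)): an order-T tensor
Phi = f_theta(X[0])
for x in X[1:]:
    Phi = np.multiply.outer(Phi, f_theta(x))

print(Phi.shape)   # (M,) * T = (3, 3, 3, 3, 3): M**T entries, exponential in T
```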


Basics - Generalized Score function

To get an estimate (such as MLE), we can use a tensor W of the same order as our feature tensor Φ(X)

The estimate or score function can be expressed as:

L(X) = 〈W, Φ(X)〉 = vec(W)^T vec(Φ(X))
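A small sketch of the score computation, with a random W and random stand-ins for the feature vectors, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
M, T = 3, 5

# stand-ins for the feature vectors f_theta(x^(t)) from the previous sketch
feats = [rng.normal(size=M) for _ in range(T)]

Phi = feats[0]
for v in feats[1:]:
    Phi = np.multiply.outer(Phi, v)

W = rng.normal(size=(M,) * T)    # trainable weight tensor (random here)

# L(X) = <W, Phi(X)> = vec(W)^T vec(Phi(X))
score = np.vdot(W, Phi)          # flattens both and takes the dot product
assert np.isclose(score, W.ravel() @ Phi.ravel())
print(score)
```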


Representing the core tensor

W ∈ R^{M×M×···×M} is a trainable weight tensor.

The inner product on the previous slide is just the total sum of the entry-wise product of Φ(X) and W. Storing the full tensor W requires an exponential amount of memory (M^T entries).

We therefore use tensor decompositions to represent this weight tensor efficiently.

The rank of the decomposition determines the complexity of the architecture.


Tensor Decomposition

CP Decomposition:

W = ∑_{r=1}^{R} λ_r v_r^(1) ⊗ v_r^(2) ⊗ ··· ⊗ v_r^(T)

L(X) = ∑_{r=1}^{R} λ_r ∏_{t=1}^{T} 〈fθ(x^(t)), v_r^(t)〉

Tensor Train Decomposition:

W = ∑_{r_1=1}^{R_1} ··· ∑_{r_{T-1}=1}^{R_{T-1}} g^(1)_{r_0 r_1} ⊗ g^(2)_{r_1 r_2} ⊗ ··· ⊗ g^(T)_{r_{T-1} r_T}

L(X) = ∑_{r_1=1}^{R_1} ··· ∑_{r_{T-1}=1}^{R_{T-1}} ∏_{t=1}^{T} 〈fθ(x^(t)), g^(t)_{r_{t-1} r_t}〉


CP Decomposition and Shallow Networks

L(X) = ∑_{r=1}^{R} λ_r ∏_{t=1}^{T} 〈fθ(x^(t)), v_r^(t)〉
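A sketch contrasting the factored CP evaluation of L(X) with the naive full-tensor evaluation; all values are random and M, T are kept small so that the M^T tensor fits in memory:

```python
import numpy as np

rng = np.random.default_rng(0)
M, T, R = 3, 4, 2                                # small enough to materialize M**T
feats = [rng.normal(size=M) for _ in range(T)]   # stand-ins for f_theta(x^(t))
lam = rng.normal(size=R)                         # lambda_r
V = rng.normal(size=(R, T, M))                   # v_r^(t)

# factored CP evaluation: sum_r lam_r * prod_t <f_theta(x^(t)), v_r^(t)>;
# costs O(R*T*M) and never builds the full weight tensor
cp_score = sum(lam[r] * np.prod([feats[t] @ V[r, t] for t in range(T)])
               for r in range(R))

# sanity check against the explicit rank-R tensor W (M**T entries)
W = np.zeros((M,) * T)
for r in range(R):
    term = V[r, 0]
    for t in range(1, T):
        term = np.multiply.outer(term, V[r, t])
    W += lam[r] * term
Phi = feats[0]
for v in feats[1:]:
    Phi = np.multiply.outer(Phi, v)
assert np.isclose(cp_score, np.vdot(W, Phi))
```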


Tensor Trains and RNNs

Idea: show that the TT decomposition exhibits the same recurrent structure as an RNN.

h^(t)_k = ∑_{i,j} G^(t)_{ijk} fθ(x^(t))_i h^(t-1)_j = ∑_{i,j} G^(t)_{ijk} [fθ(x^(t)) ⊗ h^(t-1)]_{ij}

Combining the core tensors and weights into a single variable, we can rewrite the above equation in a general RNN formulation:

h^(t) = g(h^(t-1), x^(t); Θ_G^(t)),   h^(t) ∈ R^{R_t}
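A minimal sketch of this recurrence; the core shapes and the boundary ranks r_0 = r_T = 1 are assumptions matching the standard TT format, and all values are random:

```python
import numpy as np

rng = np.random.default_rng(0)
M, T, R = 3, 4, 2
feats = [rng.normal(size=M) for _ in range(T)]   # stand-ins for f_theta(x^(t))

# TT cores G^(t)[i, j, k]: input index i, previous state j, next state k;
# boundary ranks r_0 = r_T = 1, as usual in the TT format
ranks = [1] + [R] * (T - 1) + [1]
G = [rng.normal(size=(M, ranks[t], ranks[t + 1])) for t in range(T)]

h = np.ones(1)   # dummy boundary state h^(0)
for t in range(T):
    # h^(t)_k = sum_{i,j} G^(t)_{ijk} f_theta(x^(t))_i h^(t-1)_j
    h = np.einsum('ijk,i,j->k', G[t], feats[t], h)

print(h.item())  # the score L(X) computed by this TT / multiplicative RNN
```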


Generalized Outer Product

TTs → NNs of a specific structure, simpler than the ones used in practice:

Only multiplicative nonlinearities allowed

Idea: Change the nonlinearity

⊗ → ⊗ξ

Generalized outer product: define ξ as an associative and commutative operator:

C = A ⊗ξ B,   C_{i_1···i_N j_1···j_M} = ξ(A_{i_1···i_N}, B_{j_1···j_M})
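A possible NumPy implementation of ⊗ξ via broadcasting; the helper name generalized_outer is ours:

```python
import numpy as np

def generalized_outer(A, B, xi):
    # C = A (x)_xi B with C[i..., j...] = xi(A[i...], B[j...]);
    # xi must be a broadcasting binary function, associative and commutative
    A, B = np.asarray(A), np.asarray(B)
    return xi(A.reshape(A.shape + (1,) * B.ndim), B)

# with xi(x, y) = x * y this reduces to the ordinary outer product
a, b = np.arange(2.0), np.arange(3.0)
assert np.allclose(generalized_outer(a, b, np.multiply),
                   np.multiply.outer(a, b))
```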


Generalized Outer Product

Replace the outer product in the previous (multiplicative) RNNs with the new operator, instantiated with different choices of ξ:

ξ(x, y) = max(x, y, 0)   (ReLU)

ξ(x, y) = ln(e^x + e^y)   (SoftPlus)

ξ(x, y) = xy   (multiplicative; recovers the ordinary ⊗)
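The three instances as code, with a numerical spot check of the associativity and commutativity required of ξ:

```python
import numpy as np

def xi_relu(x, y):       # rectifier: max(x, y, 0)
    return np.maximum(np.maximum(x, y), 0.0)

def xi_softplus(x, y):   # ln(e^x + e^y), computed stably
    return np.logaddexp(x, y)

def xi_mult(x, y):       # ordinary multiplicative case
    return x * y

rng = np.random.default_rng(0)
x, y, z = rng.normal(size=3)
for xi in (xi_relu, xi_softplus, xi_mult):
    assert np.isclose(xi(x, y), xi(y, x))                 # commutativity
    assert np.isclose(xi(xi(x, y), z), xi(x, xi(y, z)))   # associativity
```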


Generalized Shallow Network with ξ-nonlinearity

Score function:

L(X) = ∑_{r=1}^{R} λ_r [〈fθ(x^(1)), v_r^(1)〉 ⊗ξ ··· ⊗ξ 〈fθ(x^(T)), v_r^(T)〉]

     = ∑_{r=1}^{R} λ_r ξ(〈fθ(x^(1)), v_r^(1)〉, ..., 〈fθ(x^(T)), v_r^(T)〉)

Parameters of the network:

Θ = ({λ_r}_{r=1}^{R} ∈ R, {v_r^(t)}_{r=1,t=1}^{R,T} ∈ R^M)

The same can be done with RNNs to obtain a generalized RNN
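A sketch of the generalized shallow network score with the rectifier ξ; the T-argument ξ is computed by folding the binary one, which is valid because ξ is associative (parameters random, for illustration only):

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(0)
M, T, R = 3, 4, 2

def xi(x, y):   # rectifier nonlinearity max(x, y, 0)
    return np.maximum(np.maximum(x, y), 0.0)

feats = [rng.normal(size=M) for _ in range(T)]   # stand-ins for f_theta(x^(t))
lam = rng.normal(size=R)                         # lambda_r
V = rng.normal(size=(R, T, M))                   # v_r^(t)

# L(X) = sum_r lam_r * xi(<f(x^(1)), v_r^(1)>, ..., <f(x^(T)), v_r^(T)>)
score = sum(lam[r] * reduce(xi, [feats[t] @ V[r, t] for t in range(T)])
            for r in range(R))
print(score)
```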


Great, and we are done?

Switching ⊗ → ⊗ξ allows us to analyze more complex RNNs

But it makes the connection between RNNs and their TT decompositions harder to understand

A weight tensor W satisfying the relation below no longer exists for every generalized tensor network:

L(X) = 〈W,Φ(X)〉


Grid Tensors

Cohen and Shashua (2016) introduced grid tensors: M fixed template vectors X → a grid tensor of order T with dimension M in each mode:

Γ^L(X)_{i_1 i_2 ··· i_T} = L(X),   X = (x^(i_1), x^(i_2), ..., x^(i_T))

Evaluate the score function on every possible combination of the template vectors, instead of on all possible input sequences.
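A brute-force sketch of this construction; the template feature vectors and the toy score function are stand-ins, since any score function works here:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
M, T = 3, 2                           # M templates, sequences of length T (tiny)
templates = rng.normal(size=(M, M))   # stand-in template feature vectors

def score(seq):
    # stand-in for an arbitrary score function L(X)
    return float(np.prod([v.sum() for v in seq]))

# Gamma^L(X)[i_1, ..., i_T] = L(x^(i_1), ..., x^(i_T))
Gamma = np.empty((M,) * T)
for idx in product(range(M), repeat=T):
    Gamma[idx] = score([templates[i] for i in idx])

print(Gamma.shape)                    # order-T tensor, dimension M in each mode
```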


Grid Tensors

Define a feature matrix F ∈ R^{M×M}

Run the representation function fθ : R^N → R^M on each template x^(i) ∈ X:

F = [fθ(x^(1)), fθ(x^(2)), ..., fθ(x^(M))]

Each generalized tensor network has a corresponding grid tensor (shown below for the generalized shallow network):

Γ^L(X) = ∑_{r=1}^{R} λ_r (F v_r^(1)) ⊗ξ (F v_r^(2)) ⊗ξ ··· ⊗ξ (F v_r^(T))
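A sketch checking this closed form against brute-force evaluation over all template combinations, for a generalized shallow network with rectifier ξ (tiny sizes, random parameters):

```python
import numpy as np
from functools import reduce
from itertools import product

rng = np.random.default_rng(0)
M, T, R = 3, 2, 2                      # tiny so brute force stays cheap

def xi(x, y):                          # rectifier: max(x, y, 0), broadcasts
    return np.maximum(np.maximum(x, y), 0.0)

def gen_outer(A, B):                   # (x)_xi via broadcasting
    return xi(A.reshape(A.shape + (1,) * B.ndim), B)

F = rng.normal(size=(M, M))            # row i holds f_theta(x^(i))
lam = rng.normal(size=R)
V = rng.normal(size=(R, T, M))

# closed form: Gamma = sum_r lam_r (F v_r^(1)) (x)_xi ... (x)_xi (F v_r^(T))
Gamma = sum(lam[r] * reduce(gen_outer, [F @ V[r, t] for t in range(T)])
            for r in range(R))

# brute force: evaluate the shallow network on every template combination
brute = np.empty((M,) * T)
for idx in product(range(M), repeat=T):
    brute[idx] = sum(lam[r] * reduce(xi, [F[i] @ V[r, t]
                                          for t, i in enumerate(idx)])
                     for r in range(R))

assert np.allclose(Gamma, brute)
```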


Overview of the main results

Two problems need to be considered:

Universality: can every tensor be realized by a (generalized) shallow network / RNN?

Expressivity: to represent the same function, which model needs fewer parameters?


Universality

Regular case (linear outer product): Holds automatically

L(X) = 〈W,Φ(X)〉

Generalized case (nonlinear outer product): we can no longer work with W. Instead, work with the grid tensor:

Γ^L(X)_{i_1 i_2 ··· i_T} = L(X),   X = (x^(i_1), x^(i_2), ..., x^(i_T))


Universality

Theorem 1

Given an arbitrary tensor H ∈ R^{M×M×···×M} and templates X, let the grid tensors for a:

generalized* shallow network S̃ be Γ^S(X)

generalized* RNN G̃ be Γ^G(X)

Then we can find S̃ and G̃ such that:

H = Γ^S(X) = Γ^G(X)

*All results are for the rectifier nonlinearity.


Expressivity

Goal: compare the models' representational power in terms of their number of parameters

Linear case: simply compare the rank of the tensor W

Generalized case: compare in terms of the grid tensor Γ^L(X)


Expressivity

Theorem 2

Given a generalized RNN of rank at most R and its grid tensor Γ^G(X), its realization as a generalized shallow network can be written as:

Γ^G(X) = Γ^S(X) = ∑_{r=1}^{R̂} λ_r (F v_r^(1)) ⊗ξ (F v_r^(2)) ⊗ξ ··· ⊗ξ (F v_r^(T))

There exists a G̃_1 such that R̂ ≥ min(M, R)^{T/2} / (2MT).


Expressivity

Theorem 3

Given a generalized RNN of rank R and its grid tensor Γ^G(X), its realization as a generalized shallow network can be written as:

Γ^G(X) = Γ^S(X) = ∑_{r=1}^{R̂} λ_r (F v_r^(1)) ⊗ξ (F v_r^(2)) ⊗ξ ··· ⊗ξ (F v_r^(T))

There exists a G̃_2 such that R̂ = 1, i.e., the exponential separation of Theorem 2 does not hold for every generalized RNN.


Experiment on IMDB sentiment analysis


Experiment on Synthetic Data


Conclusion

Draw links between RNNs and the TT decomposition

Introduce a nontrivial nonlinearity into the tensor framework

Provide theoretical analysis of universality and expressivity under the rectifier nonlinearity

Future work: extend to LSTMs and attention? Other nonlinearities?


Thank You
