Transcript of VISUALIZING AND UNDERSTANDING RECURRENT NEURAL NETWORKS

Page 1:

VISUALIZING AND UNDERSTANDING RECURRENT NEURAL NETWORKS
Presented By: Collin Watts
Written By: Andrej Karpathy, Justin Johnson, Li Fei-Fei

Page 2:

PLAN OF ATTACK

What we’re going to cover:

• Overview

• Some Definitions

• Experimental Analysis

• Lots of Results

• The Implications of the Results

• Case Studies

• Meta-Analysis

Page 3:

SO, WHAT WOULD YOU SAY YOU DO HERE...

• This paper set out to analyze both the most efficient implementation of an RANN (we’ll get there) as well as to identify what internal mechanisms achieve their results.

• Chose 3 different variants of RANNs:
  • Basic RANNs
  • LSTM RANNs
  • GRU RANNs

• Did character-level language modeling as their test problem, as it is apparently strongly representative of other analyses.

Page 4:

DEFINITIONS

• RECURRENT NEURAL NETWORK
  • Subset of Artificial Neural Networks
  • Still uses feedforward and backpropagation
  • Allows nodes to form cycles, creating the potential for storage of information within the network
  • Used in applications such as handwriting analysis, video analysis, translation, and other interpretation of various human tasks
  • Difficult to train

Page 5:

DEFINITIONS

Page 6:

DEFINITIONS

• RECURRENT NEURAL NETWORK (Cont.)
  • Uses a 2-dimensional node setup, with time as one axis and depth of the nodes as the other
  • Hidden vectors are referred to as h_t^l, with l = 0 being the input nodes and l = L being the output nodes
  • Intermediate vectors are calculated as a function of both the previous time step and the previous layer. This results in the following recurrence:
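The recurrence image from the slide is not in the transcript; reconstructed from the paper (assuming its notation, where W^l is the weight matrix at layer l and h_t^0 = x_t is the input character vector), the vanilla RNN recurrence is approximately:

h_t^l = \tanh\left( W^l \begin{pmatrix} h_t^{l-1} \\ h_{t-1}^l \end{pmatrix} \right)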

Page 7:

MORE DEFINITIONS!

• LONG SHORT-TERM MEMORY VARIANT
  • Variant of the RANN designed to mitigate problems with backpropagation within a RANN.
  • Adds a memory vector to each node.
  • Every time step, an LSTM can choose to read, write to, or reset the memory vector, following a series of gating mechanisms.
  • Has the effect of preserving gradients across memory cells for long periods.
  • i, f, and o are the gates controlling whether the memory cell is updated, reset, or read, respectively, while g allows for additive updates to the memory cell (the update equations are reconstructed below).
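The equation image on the slide is not in the transcript; reconstructed from the paper's formulation (sigm is the sigmoid, \odot is elementwise multiplication, and W^l is the layer-l weight matrix), the LSTM update is approximately:

\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} =
\begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix}
W^l \begin{pmatrix} h_t^{l-1} \\ h_{t-1}^l \end{pmatrix}

c_t^l = f \odot c_{t-1}^l + i \odot g

h_t^l = o \odot \tanh(c_t^l)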

Page 8:

HALF A DEFINITION...

• GATED RECURRENT UNIT
  • Not well elaborated on in the paper...
  • Given explanation is that “The GRU has the interpretation of computing a candidate hidden vector and then smoothly interpolating towards it, as gated by z.”
  • My interpretation: rather than having explicit access & control gates, this follows a more analog approach (see the reconstructed update below).
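The GRU equations are likewise not in the transcript; reconstructed from the standard formulation the paper follows (r is the reset gate, z the update gate, \tilde{h} the candidate hidden vector, \odot elementwise multiplication), the update is approximately:

r = \mathrm{sigm}(W_r^{x} h_t^{l-1} + W_r^{h} h_{t-1}^l)

z = \mathrm{sigm}(W_z^{x} h_t^{l-1} + W_z^{h} h_{t-1}^l)

\tilde{h}_t^l = \tanh(W^{x} h_t^{l-1} + W^{h} (r \odot h_{t-1}^l))

h_t^l = (1 - z) \odot h_{t-1}^l + z \odot \tilde{h}_t^l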

Page 9:

EXPERIMENTAL ANALYSIS (SCIENCE!)

• As previously stated, the researchers used character-level language modelling as a basis of comparison.
• Trained each network to predict the following character in a sequence.
• Used a Softmax classifier at each time step.
• The hidden vectors in the last layer of the network were used to produce one score for every possible next character.
• These outputs represented log probabilities of each character being the next character in the sequence (a small sketch follows below).
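As an illustration of the per-time-step prediction (this is not the authors' code; the vocabulary, scores, and names below are made up for the example), a minimal NumPy sketch:

import numpy as np

def next_char_log_probs(logits):
    # Convert one time step's output scores (one per character in the
    # vocabulary) into log probabilities with a numerically stable softmax.
    logits = logits - np.max(logits)          # shift for numerical stability
    log_z = np.log(np.sum(np.exp(logits)))    # log of the partition function
    return logits - log_z                     # log-softmax

# Hypothetical 5-character vocabulary and output scores at one time step.
vocab = ['a', 'b', 'c', 'd', 'e']
logits = np.array([2.0, 0.5, -1.0, 0.0, 1.0])
log_p = next_char_log_probs(logits)
print(dict(zip(vocab, np.round(np.exp(log_p), 3))))   # predicted distribution

# The test metric reported later ("cross-entropy loss") is the mean of
# -log p(true next character); for a single step with true character 'a':
loss = -log_p[vocab.index('a')]
print(round(float(loss), 3))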

Page 10:

EXPERIMENTAL ANALYSIS (SCIENCE!)

• Rejected the use of two other datasets (the Penn Treebank dataset and the Hutter Prize 100MB Wikipedia dataset) on the basis of them containing both standard English language and markup.
• Stated intention for rejecting them was to use a controlled setting for all types of neural networks, rather than compete for best results on these data sets.
• Decided on Leo Tolstoy’s War and Peace, consisting of 3,258,246 characters, and the source code of the Linux Kernel (randomized across files and then concatenated into a single 6,206,996 character file).

Page 11:

EXPERIMENTAL ANALYSIS (SCIENCE!)

• War and Peace was split 80/10/10 into training/validation/test sets.

• The Linux Kernel was split 90/5/5 into training/validation/test sets.

• Tested the following properties for each of the 3 RANNs:
  • Number of layers (1, 2, or 3)
  • Number of cells per layer (64, 128, 256, or 512)

Page 12:

RESULTS (AND THE WINNER IS...)

• Test set cross-entropy loss:

Page 13:

RESULTS (AND THE WINNER IS...)

Page 14:

RESULTS (AND THE WINNER IS...)

Page 15:

IMPLICATIONS OF RESULTS (BUT WHY...)

• The researchers paid attention to several characteristics beyond just the results of their findings. One of their stated goals was to arrive at why these emergent properties exist.

• Interpretable, long-range LSTM cells
  • Have been theorized to exist, but never proven.
  • They proved them.
  • Truncated backpropagation (used for performance gains as well as combating overfitting) limits learning of dependencies more than X characters away, where X is the truncation depth of the backpropagation.
  • These LSTM cells have been able to overcome that challenge while retaining performance and fitting characteristics.

Page 16:

VISUALIZATIONS OF RESULTS (BUT WHY...)

• Text color is a visualization of tanh(c) where -1 is red and +1 is blue.

Page 17:

VISUALIZATIONS OF RESULTS (BUT WHY...)

Page 18:

VISUALIZATIONS OF RESULTS (BUT WHY...)

Page 19:

VISUALIZATIONS OF RESULTS (BUT WHY...)

Page 20:

IMPLICATIONS OF RESULTS (BUT WHY...)

• Also paid attention to gate activations (remember, the gates are what cause interactions with the memory cell) in LSTMs.
• Defined the ideas of “left saturated” and “right saturated”:
  • Left saturated: the gate’s activation is less than 0.1.
  • Right saturated: the gate’s activation is more than 0.9.
  • The paper then looks at the fraction of time each gate spends in either regime (a short sketch of that calculation follows this slide).
• Of particular note:
  • There are right-saturated forget gate cells (cells remembering their values).
  • There are no left-saturated forget gate cells (no cells acting purely feed-forward).
  • Found that activations in the first layer are diffuse (this is unexplainable by the researchers, but found to be very strange).
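A hedged sketch of how those saturation fractions could be computed from recorded gate activations (the array names, shapes, and example data are illustrative, not the paper's code):

import numpy as np

def saturation_fractions(gate_activations, low=0.1, high=0.9):
    # For each gate unit, return the fraction of time steps it spends
    # left-saturated (activation < low) and right-saturated (activation > high).
    # gate_activations: array of shape (time_steps, num_units), values in [0, 1].
    left = (gate_activations < low).mean(axis=0)
    right = (gate_activations > high).mean(axis=0)
    return left, right

# Hypothetical forget-gate activations: 1000 time steps, 4 cells.
rng = np.random.default_rng(0)
forget_gate = rng.beta(0.5, 0.5, size=(1000, 4))   # U-shaped, often near 0 or 1
left_frac, right_frac = saturation_fractions(forget_gate)
print("left-saturated fractions: ", np.round(left_frac, 2))
print("right-saturated fractions:", np.round(right_frac, 2))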

Page 21:

VISUALIZATIONS OF RESULTS (BUT WHY... LSTMS)

Page 22:

VISUALIZATIONS OF RESULTS (BUT WHY... GRUS)

Page 23:

ERROR ANALYSIS OF RESULTS

• Compared against two standard n-gram models for analysis of the LSTM’s effectiveness.
• An error was defined to occur when the probability assigned to the character that actually came next was less than 0.5 (a small sketch follows this list).
• Found that while the models shared many of the same errors, there were distinct segments that each one failed on differently.
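A minimal sketch of that error criterion, assuming we have the model's predicted probability of the true next character at each position (the numbers are made up):

import numpy as np

# Hypothetical predicted probabilities of the character that actually came next.
p_true_next = np.array([0.91, 0.42, 0.77, 0.08, 0.55])

errors = p_true_next < 0.5            # an error wherever the probability < 0.5
print("error positions:", np.flatnonzero(errors))   # [1 3]
print("error rate:", float(errors.mean()))          # 0.4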

Page 24:

ERROR ANALYSIS OF RESULTS

Linux Kernel

War and Peace

Page 25:

ERROR ANALYSIS OF RESULTS

• Found that the LSTM has significant advantages over standard n-gram models when computing the probability of special characters. In the Linux Kernel model, brackets and whitespace are predicted significantly better than in the n-gram model, because of the LSTM’s ability to keep track of relationships between opening and closing brackets.
• Similarly, in War and Peace, the LSTM was able to more correctly predict carriage returns, because the relevant dependency lies outside of the n-gram models’ effective range of relationship prediction.

Page 26:

CASE STUDY { LOOK, BRACES! }

• When it specifically comes to closing brackets (“}”) in the Linux kernel, the researchers were able to analyze the performance of the LSTM versus the n-gram models.

• Found that LSTM did better than n-gram for distances of up to 60 characters. After that, the performance gains levelled off.

Page 27:

META-ANALYSIS (THE GOOD)

• The researchers were able to very effectively capture and elucidate their point via their visualizations and implications.
• They seem to have proven several until-now only theorized ideas about how RANNs work in data analysis.

Page 28:

META-ANALYSIS (THE BAD)

• I would have appreciated a more in-depth explanation of why they rejected the standard competitive ANN datasets. It would seem to follow that those would be a truer measure of the capabilities, which is why they are chosen in the first place.
• There wasn’t a lot of explanation as to why their parameters were chosen for each RANN, or what their parameters for evaluation were. (What is test set cross-entropy loss?)
• Data was split differently across each of the texts, so that the total count for validation and test was the same. I don’t see what this offers. If anything, you would want the count of training data to be the same.

Page 29:

META-ANALYSIS (THE UGLY)

• This paper does not ease the reader into understanding the ideas involved. It required reading several additional papers to get the implications of things they assumed the reader knew.
• Some ideas were not clearly explained even after researching the related works.

Page 30:

FINAL SLIDE

• Questions?
• Comments?
• Concerns?
• Corrections?