Deep learning - Chatbot


Transcript of Deep learning - Chatbot


Chatbot: Sequence to Sequence Learning. 29 Mar 2017. Presented by: Jin Zhang, Yang Zhou, Fred Qin, Liam Bui


Overview
- Network Architecture
- Loss Function
- Improvement Techniques

Chatbot Concept

Deep Learning for Chatbot: http://www.wildml.com/2016/04/deep-learning-for-chatbots-part-1-introduction/

First of all, let's see a demo. This is a customer service chatbot: you can ask it to find an order as easily as chatting with a person. That's why chatbots are such a hot topic; many companies are working on various kinds of chatbots, including travel search engines, personal health companions, and so on.

There are three dimensions along which to compare chatbots.

Retrieval-based vs. generative: Retrieval-based models (easier) use a repository of predefined responses and some kind of heuristic to pick an appropriate response based on the input and context. The heuristic could be as simple as a rule-based expression match, or as complex as an ensemble of machine learning classifiers. These systems don't generate any new text, so they don't make grammatical mistakes. However, they may be unable to handle unseen cases for which no appropriate predefined response exists, and for the same reason they can't refer back to contextual entity information, such as names mentioned earlier in the conversation. Generative models (harder) don't rely on predefined responses; they generate new responses from scratch. They are smarter: they can refer back to entities in the input and give the impression that you're talking to a human. However, these models are hard to train, are quite likely to make grammatical mistakes (especially in longer sentences), and typically require huge amounts of training data.

Short vs. long conversations: In short-text conversations (easier), the goal is to produce a single response to a single input, as in an FAQ chatbot that answers a specific question. Long conversations (harder) go through multiple turns and need to keep track of what has been said; customer support conversations are typically long threads with multiple questions.

Closed vs. open domain: In a closed-domain (easier) setting, such as this customer service demo, the space of possible inputs and outputs is limited because the system is trying to achieve a specific goal; technical customer support and shopping assistants are examples. In an open-domain (harder) setting, such as Siri, the user can take the conversation anywhere, and there isn't necessarily a well-defined goal or intention. The infinite number of topics, and the amount of world knowledge required to create reasonable responses, make this a hard problem.
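To make the retrieval-based idea concrete, here is a minimal sketch (not from the presentation) of a keyword-overlap heuristic over a handful of made-up canned responses; a real system would use a much larger repository and a learned matching model.

```python
# Minimal retrieval-based responder: pick the canned response whose trigger
# keywords overlap most with the user's message. The repository and the
# scoring heuristic below are illustrative placeholders.

RESPONSES = {
    ("order", "track", "status"): "Let me look up your order. What is the order number?",
    ("refund", "return"): "I can help with returns. Which item would you like to send back?",
    ("hello", "hi", "hey"): "Hi there! How can I help you today?",
}

def respond(message: str) -> str:
    words = set(message.lower().split())
    best_reply, best_overlap = None, 0
    for keywords, reply in RESPONSES.items():
        overlap = len(words & set(keywords))
        if overlap > best_overlap:
            best_reply, best_overlap = reply, overlap
    # No match: the "unseen case" limitation of retrieval-based models.
    return best_reply or "Sorry, I did not understand. Could you rephrase?"

print(respond("hi can you track my order"))
```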



LSTM for Language Model
- Language model: predict the next word given the previous words
- RNN: unable to learn long-term dependencies, so not suitable for language modelling
- LSTM: three sigmoid gates to control information flow

Understanding LSTM Networks: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

The foundation of building a chatbot is language modelling. Generally speaking, a language model takes in a sequence of inputs, looks at each element of the sequence and tries to predict the next element of the sequence.

Predicting the next word often requires long-term dependencies. In theory, RNNs are absolutely capable of handling them: a human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don't seem to be able to learn them.

LSTMs are explicitly designed to avoid the long-term dependency problem. The key to the LSTM is the cell state, along which it is easy for information to flow unchanged. Sigmoid gate layers output numbers between zero and one, describing how much of each component should be let through: a value of zero means "let nothing through", while a value of one means "let everything through". An LSTM has three of these gates, to protect and control the cell state.



LSTM for Language Model

First step: decide which previous information to throw away from the cell state

Second step: decide what new information to store in the cell state
- A sigmoid layer decides which values to update
- A tanh layer creates new candidate values C~_t that could be added to the state
- Combine these two to create an update to the state

Third step: filter C_t and output only what we want to output

Understanding LSTM Networks: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

The first step in our LSTM is to decide what information we're going to throw away from the cell state. This decision is made by a sigmoid layer called the forget gate layer. It looks at h_{t-1} and x_t and outputs a number between 0 and 1 for each number in the cell state C_{t-1}: a 1 represents "completely keep this", while a 0 represents "completely get rid of this". Let's go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.

The next step is to decide what new information we're going to store in the cell state. This has two parts. First, a sigmoid layer called the input gate layer decides which values we'll update. Next, a tanh layer creates a vector of new candidate values, C~_t, that could be added to the state. In the next step, we'll combine these two to create an update to the state. In the example of our language model, we'd want to add the gender of the new subject to the cell state, to replace the old one we're forgetting. It's now time to update the old cell state, C_{t-1}, into the new cell state C_t. The previous steps already decided what to do; we just need to actually do it. We multiply the old state by f_t, forgetting the things we decided to forget earlier. Then we add i_t * C~_t, the new candidate values scaled by how much we decided to update each state value. In the case of the language model, this is where we'd actually drop the information about the old subject's gender and add the new information, as we decided in the previous steps.

Finally, we need to decide what we're going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we're going to output. Then, we put the cell state through tanh (to push the values to be between -1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to. For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that's what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that's what follows.
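To tie the three steps together, here is a minimal NumPy sketch of a single LSTM cell step (a didactic illustration only; the weight layout and random initialisation are placeholders, not the configuration used in the talk):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps the concatenated [h_prev; x_t] to the four gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    f, i, o, g = np.split(z, 4)
    f_t = sigmoid(f)                      # forget gate: what to throw away from the cell state
    i_t = sigmoid(i)                      # input gate: which values to update
    c_tilde = np.tanh(g)                  # candidate values C~_t
    o_t = sigmoid(o)                      # output gate: what parts of the cell state to expose
    c_t = f_t * c_prev + i_t * c_tilde    # new cell state
    h_t = o_t * np.tanh(c_t)              # filtered output
    return h_t, c_t

# Toy usage with random placeholder weights (hidden size 4, input size 3).
hidden, inp = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * hidden, hidden + inp))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.normal(size=inp), h, c, W, b)
```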

RNNs can be used as language models for predicting future elements of a sequence given prior elements of the sequence. However, we are still missing the components necessary for building translation models, since a language model operates on a single sequence, while translation operates on two sequences: the input sequence and the translated sequence.


A Seq2Seq model comprises two language models:
- Encoder: a language model that encodes the input sequence into a fixed-length vector (the thought vector)
- Decoder: another language model that looks at both the thought vector and the previous outputs to generate the next word


Sequence To Sequence Model

Neural Machine Translation by Jointly Learning to Align and Translate: https://arxiv.org/abs/1409.0473

Sequence to sequence models build on top of language models by adding an encoder step and a decoder step. In the encoder step, a model converts the input sequence into a thought vector. In the decoder step, a language model is trained on both the output sequence and the thought vector from the encoder. Since the decoder sees an encoded representation of the input sequence as well as the output sequence, it can make more intelligent predictions about future words based on the current word.


Sequence To Sequence Model

Which "crane"? Example: "I like crane because ..."

For example, in a standard language model, we might see the word "crane" and not be sure whether the next words should be about the bird or about heavy machinery. However, if we also pass in an encoder context, the decoder might realize that the input sequence was about construction, not flying animals. Given the context, the decoder can choose the appropriate next word and provide a more accurate reply.


Sequence To Sequence Model

Sequence Model with Neural Network: https://indico.io/blog/sequence-modeling-neuralnets-part1/

Now that we understand the basics of sequence to sequence modeling, we can consider how to build one. We will use an LSTM as both the encoder and the decoder.

The encoder takes a sequence (sentence) as input and processes one symbol (word) at each time step. Its objective is to convert the sequence of symbols into a fixed-size feature vector that encodes only the important information in the sequence while losing the unnecessary information. Each hidden state influences the next hidden state, and the final hidden state can be seen as a summary of the sequence. This state is called the context or thought vector, as it represents the intention of the sequence. From the context, the decoder generates another sequence, one symbol (word) at a time. At each time step, the decoder is influenced by the context and by the previously generated symbols. We can train the model with a gradient-based algorithm, updating the parameters of the encoder and decoder jointly to maximize the log probability of the output sequence conditioned on the input sequence. Once the model is trained, we can make predictions.

The context can be provided as the initial state of the decoder RNN, or it can be connected to the hidden units at each time step. Our objective is to jointly maximize the log probability of the output sequence conditioned on the input sequence.
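As a rough sketch of this encoder-decoder setup in PyTorch (the vocabulary size, dimensions, and random token IDs below are placeholders; the presentation does not specify an implementation), the encoder LSTM's final state acts as the thought vector that initialises the decoder, and minimising the cross-entropy of the next output token is equivalent to maximising the log probability of the output sequence given the input:

```python
import torch
import torch.nn as nn

VOCAB, EMB, HID = 10_000, 128, 256   # hypothetical sizes

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(VOCAB, EMB)
        self.tgt_emb = nn.Embedding(VOCAB, EMB)
        self.encoder = nn.LSTM(EMB, HID, batch_first=True)
        self.decoder = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, src, tgt_in):
        # Encoder: compress the input sequence into its final (h, c) state,
        # the "thought vector".
        _, thought = self.encoder(self.src_emb(src))
        # Decoder: conditioned on the thought vector and the previous
        # (ground-truth) output tokens, predict the next token at each step.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), thought)
        return self.out(dec_out)           # logits over the vocabulary

model = Seq2Seq()
loss_fn = nn.CrossEntropyLoss()
src = torch.randint(0, VOCAB, (32, 20))    # stand-in for a batch of input sequences
tgt = torch.randint(0, VOCAB, (32, 21))    # stand-in for <sos> + target sequences
logits = model(src, tgt[:, :-1])           # feed all but the last target token
# Maximizing log P(output | input) == minimizing cross-entropy on next tokens.
loss = loss_fn(logits.reshape(-1, VOCAB), tgt[:, 1:].reshape(-1))
loss.backward()
```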


Generating a word is a multi-class classification task over all possible words, i.e. the vocabulary:

W* = argmax_W P(W | previous words)

Example: "I always order pizza with cheese and ___"
- mushrooms: 0.15
- pepperoni: 0.12
- anchovies: 0.01
- ...
- rice: 0.0001
- ...
- and: 1e-100

Loss Function

Whichever model gives us the highest probability for all the words should be our model.

Cross Entropy Loss:

Cross-Entropy:

Cross-Entropy for a sentence w1, w2, ..., wn:
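For reference, these headings presumably refer to the standard definitions: the cross-entropy between a true distribution $p$ and a model distribution $q$,

\[ H(p, q) = -\sum_{x} p(x) \log q(x), \]

and the per-word cross-entropy of a sentence $w_1, w_2, \dots, w_n$ under a language model $P$,

\[ H = -\frac{1}{n} \sum_{i=1}^{n} \log P(w_i \mid w_1, \dots, w_{i-1}). \]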


Evaluating Language Models: https://courses.engr.illinois.edu/cs498jh/Slides/Lecture04.pdf

Perplexity: in practice, a variant called perplexity is usually used as the metric to evaluate language models.

We evaluate per-word perplexity. For the probability, we naturally turn to cross-entropy; by applying the chain rule, we get the perplexity per word.
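The per-word perplexity these notes refer to is, in its standard form, the exponentiated cross-entropy:

\[ \mathrm{PP}(w_1, \dots, w_n) = P(w_1, \dots, w_n)^{-1/n} = 2^{H}, \qquad H = -\frac{1}{n} \sum_{i=1}^{n} \log_2 P(w_i \mid w_1, \dots, w_{i-1}). \]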


Cross-entropy can be seen as a measure of uncertainty. Perplexity can be seen as the number of choices.


Cross-entropy loss vs. perplexity:
- Entropy: ~2.58
- Perplexity: 6 choices

Which statement do you prefer?
- The die has 6 faces
- The die has 2.58 bits of entropy

We can see perplexity as the average number of choices at each step: the higher it is, the more word choices there are, and the more uncertain the language model is. Example: a balanced six-faced die with faces numbered 1 to 6 has entropy log2(6) ≈ 2.58 bits and perplexity 6.
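A quick check of the numbers on the slide (standard entropy and perplexity of a fair die; the code itself is not part of the original deck):

```python
import math

# Fair six-sided die: each face has probability 1/6.
p = [1 / 6] * 6
entropy = -sum(pi * math.log2(pi) for pi in p)    # uncertainty in bits
perplexity = 2 ** entropy                         # effective number of choices

print(round(entropy, 2), round(perplexity, 2))    # 2.58 6.0
```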


Problem:
- The last state of the encoder contains mostly information from the last elements of the input sequence
- Reversing the input sequence helps in some cases

Example: "How are you ?" -> "I am fine ."

Attention Mechanism:
- Allows each decoder step to look at any encoder step
- The decoder understands the input sentence better and looks at suitable positions to generate words

Neural Machine Translation by Jointly Learning to Align and Translate: https://arxiv.org/abs/1409.0473

Compressing an entire input sequence into a single fixed vector is challenging: the last state of the encoder contains mostly information from the last elements of the input sequence. The attention mechanism holds on to all states from the encoder and gives the decoder a weighted average of the encoder states at each element of the decoder sequence. During the decoding phase, we take the state of the decoder network, combine it with the encoder states, and pass this combination to a feedforward network. The feedforward network returns a weight for each encoder state. We multiply the encoder states by these weights and then compute their weighted average.
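A minimal NumPy sketch of that weighted-average step (dimensions and weights are placeholders; the scoring network is a single-layer feedforward net in the spirit of the Bahdanau et al. paper cited above, not necessarily the exact architecture from the talk):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(dec_state, enc_states, W, v):
    """Weighted average of encoder states for one decoder step.

    dec_state:  (H,)      current decoder hidden state
    enc_states: (T, H)    all encoder hidden states
    W: (H, 2H), v: (H,)   parameters of a small feedforward scoring net
    """
    # Score each encoder state against the current decoder state.
    scores = np.array([
        v @ np.tanh(W @ np.concatenate([dec_state, h_enc]))
        for h_enc in enc_states
    ])
    weights = softmax(scores)              # attention weights over encoder steps
    return weights @ enc_states, weights   # context vector and its weights

# Toy usage with placeholder dimensions.
H, T = 8, 5
rng = np.random.default_rng(0)
ctx, w = attention_context(
    rng.normal(size=H), rng.normal(size=(T, H)),
    rng.normal(size=(H, 2 * H)), rng.normal(size=H),
)
print(w.sum())   # the weights sum to 1
```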


Neural Machine Translation by Jointly Learning to Align and Translate: https://arxiv.org/abs/1409.0473

Sentence length    Seq2Seq    Seq2Seq with attention
30                 13.93      21.50
50                 17.82      28.45

BLEU score on English-French Translation corpus

BLEU (bilingual evaluation understudy) measures the correspondence between a machine's output and that of a human. For unigrams, the modified precision is: for each word, take min(word count in the generated sentence, word count in the reference sentence), sum over words, and divide by the total length of the generated sentence.
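As a small illustration of the clipped word-count idea (a simplification: full BLEU also combines higher-order n-grams and a brevity penalty), here is a sketch of unigram precision; the example sentences are made up:

```python
from collections import Counter

def unigram_precision(generated, reference):
    """Clipped unigram precision, the building block of BLEU.

    Full BLEU also combines higher-order n-grams and a brevity penalty;
    this sketch shows only the clipped word-count idea from the slide.
    """
    gen_counts = Counter(generated)
    ref_counts = Counter(reference)
    clipped = sum(min(count, ref_counts[word]) for word, count in gen_counts.items())
    return clipped / max(len(generated), 1)

gen = "I am fine thank you".split()
ref = "I am fine , thanks".split()
print(unigram_precision(gen, ref))   # 3 matching words out of 5 -> 0.6
```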

Problem:
- Maximizing the conditional probability at each stage might not lead to the maximum full-joint probability.
- Storing all possible generated sentences is not feasible due to resource limitations.

Possible output 1: "I am fine"
Possible output 2: "Never been better"

Beam Search:
- At each stage in the decoder, keep the best M possible outputs

Sequence to Sequence Learning with Neural Networks: https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf

Conditional probabilities: 0.6, 0.4, 1 -> full-joint probability: 0.24
Conditional probabilities: 0.4, 0.9, 1 -> full-joint probability: 0.36

Example: "How are you ?" -> "I am fine ."
At each step the decoder keeps Possible Output 1, Possible Output 2, ..., Possible Output M.

Maximizing conditional probabilities at each stage might not lead to the maximum full-joint probability. We could store all possible generated sentences so that we always find the maximum full-joint probability, but that is not feasible. A practical solution, beam search, is something in between.
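A minimal sketch of beam search over a generic next-token distribution (the toy probability table below mirrors the 0.6/0.4 vs. 0.4/0.9 example on the slide; it is a stand-in for the decoder's softmax, not the talk's model):

```python
import math

def beam_search(step_probs, start, beam_size, max_len, eos="<eos>"):
    """Keep the best `beam_size` partial outputs at each decoding stage.

    step_probs(prefix) -> {next_token: conditional probability}; this is a
    stand-in for the decoder's softmax output given the prefix.
    """
    beams = [([start], 0.0)]                          # (tokens, log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            if tokens[-1] == eos:                     # finished hypothesis
                candidates.append((tokens, logp))
                continue
            for tok, p in step_probs(tokens).items():
                candidates.append((tokens + [tok], logp + math.log(p)))
        # Keep only the best M hypotheses by full-joint log-probability.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0]

# Toy distribution mirroring the slide: greedy decoding (beam size 1) commits
# to the 0.6 branch and ends at joint probability 0.24, while beam size 2
# keeps the 0.4 * 0.9 = 0.36 path alive.
def toy_probs(prefix):
    table = {
        ("<sos>",): {"I": 0.6, "Never": 0.4},
        ("<sos>", "I"): {"am fine .": 0.4, "am ok .": 0.35, "guess .": 0.25},
        ("<sos>", "Never"): {"been better .": 0.9, "mind .": 0.1},
    }
    return table.get(tuple(prefix), {"<eos>": 1.0})

tokens, logp = beam_search(toy_probs, "<sos>", beam_size=2, max_len=3)
print(tokens, math.exp(logp))   # ['<sos>', 'Never', 'been better .', '<eos>'] ~0.36
```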


Sequence to Sequence Learning with Neural Networks: https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf

Seq2Seq with beam size = 1:  28.45
Seq2Seq with beam size = 12: 30.59

BLEU score on an English-French translation corpus (max sentence length 50)



APPENDIX

Cross Entropy Loss:

Cross-Entropy:

Cross-Entropy for a sentence w1, w2, ..., wn:


Sum of the log-probabilities over the decoding steps


1. Reinforcement Learning: Longer sentences are usually more interesting, so we can use sentence length as a reward to further train the model (see the sketch below):
- Action: word choice
- State: current generated sentence
- Reward: sentence length
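A rough REINFORCE-style sketch of this idea (hypothetical: the toy "policy" below is a bare categorical distribution over a tiny vocabulary rather than a trained seq2seq decoder, and raw length is a crude reward used only for illustration):

```python
import torch

# Toy "policy": unnormalised scores over a tiny vocabulary; <eos> ends a reply.
vocab = ["<eos>", "ok", "sure", "tell", "me", "more"]
logits = torch.zeros(len(vocab), requires_grad=True)
optimizer = torch.optim.SGD([logits], lr=0.1)

def sample_reply(max_len=8):
    """Sample word by word; return the tokens and the sum of their log-probs."""
    tokens, logp = [], 0.0
    for _ in range(max_len):
        dist = torch.distributions.Categorical(logits=logits)
        idx = dist.sample()
        logp = logp + dist.log_prob(idx)
        if vocab[idx] == "<eos>":
            break
        tokens.append(vocab[idx])
    return tokens, logp

for _ in range(200):
    tokens, logp = sample_reply()
    reward = float(len(tokens))            # reward = sentence length
    loss = -reward * logp                  # REINFORCE: raise log-prob of long replies
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=0))        # the <eos> probability shrinks over training
```

In practice this update would fine-tune the decoder of the trained seq2seq model rather than a bare distribution, and the reward would usually combine length with other measures of response quality.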

2. Adversarial Training: Make generated sentences look real using adversarial training (a minimal sketch of the objective follows the reference below):
- Generative model: generates responses based on inputs
- Discriminative model: tries to tell whether a response is a true human response or a generated one
- Objective: train the generative model to fool the discriminative model

Adversarial Learning for Neural Dialogue Generation: https://arxiv.org/abs/1701.06547
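A compressed sketch of the two adversarial objectives (a hypothetical toy setup: "responses" are fixed-size feature vectors and both models are tiny linear networks; the cited paper works on token sequences and trains the generator with policy gradients):

```python
import torch
import torch.nn as nn

FEAT = 16                                   # placeholder response-feature size
generator = nn.Linear(FEAT, FEAT)           # maps an input "context" to a response vector
discriminator = nn.Sequential(nn.Linear(FEAT, 1), nn.Sigmoid())
bce = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

for _ in range(100):
    context = torch.randn(32, FEAT)
    real_response = torch.randn(32, FEAT)   # stand-in for true human responses

    # Discriminator: label true responses 1, generated responses 0.
    fake_response = generator(context).detach()
    d_loss = bce(discriminator(real_response), torch.ones(32, 1)) + \
             bce(discriminator(fake_response), torch.zeros(32, 1))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator: try to fool the discriminator into outputting 1.
    g_loss = bce(discriminator(generator(context)), torch.ones(32, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```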
