Deep Learning Automated Helpdesk
AUTOMATED HELPDESK
FINAL YEAR PROJECT (7TH SEM)
SUBMITTED BY
Nikhil Pathania
Partha Pratim Kurmi
Pranav Sharma
Rishabh Kumar
Sourav Kumar Paul
PRESENTATION TIMELINE
Theoretical NLP, Knowledge Base Design
By – Pranav Sharma
Practical NLP Application, Forming of Tokens
By – Rishabh Kumar
Clustering
By – Sourav Kr Paul
TensorFlow
By – Nikhil Pathania
Query Model
By – Partha Pratim Kurmi
PROJECT TIMELINE
Problem Formulation – Sep 2016
Literature Survey – Sep–Oct 2016
Design Methodology – Nov 2016
Synchronizing Modules – Nov 2016
Basic Implementation – Jan–Feb 2017
Working Model – Mar 2017
Accuracy Improvements – Mar–Apr 2017
PROBLEM STATEMENT
Automate the task of customer care centers.
AIM – Build a system to answer questions like:
"How to recharge my mobile?" – PayTM
"How to pay my bills?" – PayTM
"Why is my refund not credited?" – Book My Show
Training Model
1.1 Raw Data → 1.2 NLP → 1.3 Preprocessing → 1.4 Knowledge Base → 1.5 Clustering → O/P
INFORMATION RETRIEVAL
• Data Sources
  • FAQs
  • Past forum data
• Proper data extraction model
• Knowledge base
DATA EXTRACTION MODELS – WHY NLP?
• 3-step process.
• Extends with clustering.
• Fast, accurate.
NLP
• 4-step process.
• No extension with clustering.
• Smaller domain.
PATTERN MATCHING
Example
Knowledge Base – “The CEO of IBM is Samuel Palmisano.”
Query – “Who is the CEO of IBM?”
Format – \Q is \A
Training Model
1.1 Raw Data → 1.2 NLP → 1.3 Preprocessing → 1.4 Knowledge Base → 1.5 Clustering → O/P
NATURAL LANGUAGE PROCESSING
• Problem Domain – English.
• Aim.
• Origin – Turing Test.
• Annotating the sentence.
• "Clouds exist on Mars." => <cloud, exist, mars>
• Kernel sentences, T-expressions.
KERNEL SENTENCE, T-EXP
• Kernel Sentences.
• Ternary Expressions: <Subject, Relation, Object>
AN EXAMPLE
KNOWLEDGE BASE
• What is it?
• What to store? Proper data structure.
• Mapping to the original set.
• NLP annotations, parameterized variants.
Training Model
1.1 Raw Data → 1.2 NLP → 1.3 Preprocessing → 1.4 Knowledge Base → 1.5 Clustering → O/P
PREPROCESSING:-
• Tokenization.
• Stop words removal.
• Stemming.
• POS tagging.
NLTK ( NATURAL LANGUAGE TOOLKIT )
• Suite of libraries.
• Python support.
• Libraries we will be using:
  • Lexical analysis.
  • Part-of-speech tagger.
TOKENIZATION:-
Tokenization (word_tokenize)
• Breaking a stream into meaningful elements.
• The stream may or may not be a meaningful sentence.
EXAMPLE:-
"Recharge your mobile by visiting this link"
After tokenization:
['Recharge', 'your', 'mobile', 'by', 'visiting', 'this', 'link']
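The tokenization step above can be sketched as follows. This is a minimal regex-based stand-in, not NLTK's actual implementation – the real nltk.word_tokenize also handles punctuation and contractions:

```python
import re

def word_tokenize(text):
    """Minimal word tokenizer sketch: split on word characters.
    (NLTK's word_tokenize covers far more cases.)"""
    return re.findall(r"\w+", text)

print(word_tokenize("Recharge your mobile by visiting this link"))
# ['Recharge', 'your', 'mobile', 'by', 'visiting', 'this', 'link']
```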
STOP WORDS :-
E.g. "is", "for", "the", "in", etc.
Target: remove the stop words (NLTK provides a built-in English stop word list).
EXAMPLE :-
From tokenization:
['Recharge', 'your', 'mobile', 'by', 'visiting', 'this', 'link']
After stop word removal:
['Recharge', 'mobile', 'visiting', 'link']
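A sketch of the removal step, using a tiny hand-picked stop word set for illustration (NLTK's full English list is available via nltk.corpus.stopwords.words("english")):

```python
# Tiny illustrative stop word set; NLTK ships a much longer English list.
STOP_WORDS = {"is", "for", "the", "in", "your", "by", "this"}

def remove_stop_words(tokens):
    """Keep only tokens that are not stop words (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["Recharge", "your", "mobile", "by", "visiting", "this", "link"]))
# ['Recharge', 'mobile', 'visiting', 'link']
```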
STEMMING:-
Word = Stem + Affixes
Example: playing = play (stem) + ing (affix)
Target: remove affixes from the word (called stemming).
E.g. plays, playing, playful are all reduced to 'play'.
Library in NLTK: PorterStemmer
EXAMPLE :-
From stop word removal:
['Recharge', 'mobile', 'visiting', 'link']
After stemming:
['Recharge', 'mobile', 'visit', 'link']  (the input for clustering is generated)
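The idea can be sketched with a very rough suffix stripper. This is illustrative only – the deck actually uses NLTK's PorterStemmer, whose rewrite rules are far richer:

```python
def stem(word):
    """Very rough suffix stripper (sketch; NLTK's PorterStemmer is the
    real tool). Strips a few common affixes when a stem of length >= 3
    remains."""
    for suffix in ("ing", "ful", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([stem(w) for w in ["plays", "playing", "playful"]])
# ['play', 'play', 'play']
print([stem(w) for w in ["Recharge", "mobile", "visiting", "link"]])
# ['Recharge', 'mobile', 'visit', 'link']
```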
POS TAGGING:-
POS (part of speech) = the category of a token in linguistics, such as verb, noun, etc.
Target: tag the tokens with their POS in a universal format.
EXAMPLE :-
From stemming:
['Recharge', 'mobile', 'visit', 'link']
After POS tagging:
[('Recharge', 'NN'), ('mobile', 'NN'), ('visit', 'VBG'), ('link', 'NN')]
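The output format can be reproduced with a minimal lookup-based tagger. This is purely illustrative – NLTK's pos_tag uses a trained statistical model, and the lookup table below simply mirrors the slide's tags:

```python
# Hypothetical lookup table mirroring the slide; NLTK's pos_tag is a
# trained tagger, not a dictionary.
TAG_LOOKUP = {"recharge": "NN", "mobile": "NN", "visit": "VBG", "link": "NN"}

def pos_tag(tokens):
    """Tag each token via the lookup table, defaulting to 'NN'."""
    return [(t, TAG_LOOKUP.get(t.lower(), "NN")) for t in tokens]

print(pos_tag(["Recharge", "mobile", "visit", "link"]))
# [('Recharge', 'NN'), ('mobile', 'NN'), ('visit', 'VBG'), ('link', 'NN')]
```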
Training Model
1.1 Raw Data → 1.2 NLP → 1.3 Preprocessing → 1.4 Knowledge Base → 1.5 Clustering → O/P
DOCUMENT CLUSTERING – WHAT AND WHY?
• Unsupervised document organization.
• Automatic topic organization.
• Topic extraction.
• Fast information retrieval and filtering.
EXAMPLES
• Web document clustering for search users.
• QA document clustering to solve common problems and questions.
WHY K-MEANS? WHY NOT ANY HIERARCHICAL ALGO?
• Time complexity – k-means runs in roughly linear time in the number of documents per iteration, while hierarchical algorithms typically need O(n²) time and space.
CLUSTERING
• Algorithm:
  • Find the k most dissimilar documents.
  • Assign them as the k centroids.
  • Until no change:
    • For each document, find the most similar cluster (using the cosine similarity function).
    • Recalculate the centroid of each cluster.
    • Stop if no document was reassigned.
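The k-means loop described above can be sketched in a few lines. The document vectors below are toy values for the deck's four documents, not real TF-IDF weights, and the seeds are chosen by hand:

```python
import math

def cosine(u, v):
    """Cosine similarity between two document vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def k_means(vectors, centroids):
    """Assign each document to its most similar centroid, recompute the
    centroids, and stop when no assignment changes."""
    while True:
        labels = [max(range(len(centroids)), key=lambda i: cosine(v, centroids[i]))
                  for v in vectors]
        clusters = [[v for v, l in zip(vectors, labels) if l == i]
                    for i in range(len(centroids))]
        new_centroids = [[sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            return labels
        centroids = new_centroids

# Toy vectors for the four documents, seeded with documents 0, 2 and 3:
docs = [[1, 1, 0, 0], [1, 0.8, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(k_means(docs, [docs[0], docs[2], docs[3]]))  # [0, 0, 1, 2]
```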
K-MEANS USING JACCARD DISTANCE MEASURE
• Problems in the simple k-means procedure:
  • Greedy algorithm.
  • Doesn't guarantee the best solution.
• Jaccard distance measure:
  • Find the k most dissimilar documents.
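A sketch of Jaccard-based seeding, assuming a simple greedy farthest-first strategy (the referenced paper's exact procedure may differ):

```python
def jaccard_distance(a, b):
    """Jaccard distance between two token sets: 1 - |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b) if (a | b) else 0.0

def seed_centroids(docs, k):
    """Greedy sketch: start from doc 0, then repeatedly add the document
    farthest (by total Jaccard distance) from the seeds chosen so far."""
    seeds = [0]
    while len(seeds) < k:
        rest = [i for i in range(len(docs)) if i not in seeds]
        seeds.append(max(rest, key=lambda i: sum(jaccard_distance(docs[i], docs[s])
                                                 for s in seeds)))
    return seeds

docs = ["recharge mobile visit link".split(),
        "recharge landline visit link".split(),
        "cancel ticket process".split(),
        "add money wallet".split()]
print(seed_centroids(docs, 3))  # [0, 2, 3] – the seeds {{0}, {2}, {3}} from the deck
```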
OUTPUT OF PREPROCESSING
• Possible text documents are:
  • Recharge mobile visit link
  • Recharge landline visit link
  • Cancel ticket process
  • Add money wallet
CALCULATING TF-IDF VECTORS
• Term Frequency – Inverse Document Frequency.
• The weight ranks the importance of a term.
• Terms frequent in a document but rare in the set score highest.
• Ex: "College name NITS." – 'name' is frequent but not rare.
TF-IDF VECTOR SPACE
Doc  Add   Cancel  Recharge  landline  link  mobile  money  process  ticket  visit  wallet
0    0.00  0.00    0.17      0.00      0.17  0.35    0.00   0.00     0.00    0.17   0.00
1    0.00  0.00    0.17      0.35      0.17  0.00    0.00   0.00     0.00    0.17   0.00
2    0.00  0.46    0.00      0.00      0.00  0.00    0.00   0.46     0.46    0.00   0.00
3    0.46  0.00    0.00      0.00      0.00  0.00    0.46   0.00     0.00    0.00   0.46
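The weights can be computed as follows. With tf = term count / document length and idf = ln(N / df), the values round to the matrix above; other tf-idf variants would give different numbers:

```python
import math

def tf_idf(docs):
    """TF-IDF sketch: tf = term count / doc length, idf = ln(N / df)."""
    n = len(docs)
    vocab = sorted({t for d in docs for t in d})
    df = {t: sum(t in d for d in docs) for t in vocab}  # document frequency
    return vocab, [[(d.count(t) / len(d)) * math.log(n / df[t]) for t in vocab]
                   for d in docs]

docs = ["recharge mobile visit link".split(),
        "recharge landline visit link".split(),
        "cancel ticket process".split(),
        "add money wallet".split()]
vocab, vectors = tf_idf(docs)
print(round(vectors[0][vocab.index("mobile")], 2))    # 0.35 – rare term, high weight
print(round(vectors[0][vocab.index("recharge")], 2))  # 0.17 – appears in two docs
```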
SELECT K-CLUSTER ( K =3)
• Use Jaccard distance measure – initial seeds: {{0}, {2}, {3}}

Document (I)  Document (J)  Similarity
0             1             0.6
0             2             0.00
0             3             0.00
1             2             0.00
1             3             0.00
2             3             0.00
AFTER FIRST ITERATION
• Assigning of documents to their most similar cluster – {{0, 1}, {2}, {3}}
• Centroid vectors after the 1st iteration:

Cluster  Add   Cancel  Recharge  landline  link  mobile  money  process  ticket  visit  wallet
{0, 1}   0.00  0.00    0.17      0.17      0.17  0.17    0.00   0.00     0.00    0.17   0.00
{2}      0.00  0.46    0.00      0.00      0.00  0.00    0.00   0.46     0.46    0.00   0.00
{3}      0.46  0.00    0.00      0.00      0.00  0.00    0.46   0.00     0.00    0.00   0.46
CLUSTERING OUTPUT
• { { Recharge mobile visit link, Recharge landline visit link },
    { Cancel ticket process },
    { Add money wallet } }
Training Model
1.1 Raw Data → 1.2 NLP → 1.3 Preprocessing → 1.4 Knowledge Base → 1.5 Clustering → O/P
TENSORFLOW
• What
• Why
• Where
PROGRAMMING MODEL AND BASIC CONCEPTS
• Computation graph
• Nodes
• Tensors
• Session
• Extend
• Run
COMPUTATION GRAPH
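The slide's figure is not preserved in the transcript. The core idea – operations as graph nodes, evaluated by walking their input edges – can be sketched in plain Python (illustrative only; real TensorFlow builds such a graph of ops and evaluates it inside a Session):

```python
# Toy dataflow-graph sketch; not TensorFlow's actual API.
class Node:
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

    def run(self):
        # Evaluate all input nodes first, then apply this node's operation.
        return self.op(*(n.run() for n in self.inputs))

const = lambda v: Node(lambda: v)          # leaf node holding a constant
a, b = const(2.0), const(3.0)
add = Node(lambda x, y: x + y, a, b)       # a + b
mul = Node(lambda x, y: x * y, add, const(4.0))  # (a + b) * 4
print(mul.run())  # 20.0
```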
IMPLEMENTATION
• Single Device Execution
• Multi Device Execution
• Cross Device Communication
SINGLE DEVICE EXECUTION
CROSS DEVICE COMMUNICATION
PERFORMANCE
• Data Parallel Training
• Model Parallel Training
• Concurrent Steps for Model Computation Pipelining
DATA PARALLEL TRAINING
MODEL PARALLEL AND CONCURRENT STEPS
CLUSTERING USING TENSOR FLOW
• Training Sets
• Nodes
• Data flow
• Feed as Input
• Output
Query Model
2.1 Query → 2.2 NLP → 2.3 Preprocessing → 2.4 Recommendation Engine → O/P
RECOMMENDATION ENGINE
• The recommendation engine analyzes the available data to answer the questions.
• The steps are:
  1. Data collection
  2. Preprocessing and transformations
  3. Classifier ensemble
PREPROCESSING AND TRANSFORMATIONS
• The training set consists of FAQs, past forum data, etc.
• Given a question, we want to deduce its genre from the text.
• Only the text of the question is extracted.
• Feature selection evaluates the importance of a word using TF-IDF.
PREPROCESSING AND TRANSFORMATIONS
• The training set is derived from the key parts of speech in each sentence.
Example: "How to recharge my mobile"

Part of speech   Verb   Noun/Object
Decision label   Task   Electronics
PREPROCESSING AND TRANSFORMATIONS
• Query after preprocessing: "recharge mobile"
• Find its TF-IDF vector.
• Compare it with the distinct clusters using cosine similarity.
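A sketch of the comparison step, reusing the centroid values from the clustering section as hypothetical inputs (the query vector here is a simple term-presence vector, not a full TF-IDF weighting):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between a query vector and a cluster centroid."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Vocabulary order: [add, cancel, recharge, landline, link, mobile, money,
# process, ticket, visit, wallet]; centroids taken from the clustering slides.
centroids = [
    [0, 0, 0.17, 0.17, 0.17, 0.17, 0, 0, 0, 0.17, 0],  # recharge cluster
    [0, 0.46, 0, 0, 0, 0, 0, 0.46, 0.46, 0, 0],        # cancel-ticket cluster
    [0.46, 0, 0, 0, 0, 0, 0.46, 0, 0, 0, 0.46],        # wallet cluster
]
query = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # "recharge mobile"
best = max(range(3), key=lambda i: cosine_similarity(query, centroids[i]))
print(best)  # 0 – the query lands in the recharge cluster
```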
CLASSIFIER ENSEMBLE
• Ensemble modelling is used for classification with three classifiers:
  • Naïve Bayesian using the FAQ training set
  • POS Naïve Bayesian
  • Threshold biasing classifier
ENSEMBLE STRUCTURE
• A learning algorithm that uses multiple classifiers.
• Classify using a weighted vote over their decisions.
• The classifier with better precision is given more weight.
RESULTS
• Documents are hand-tagged with genres.
• The ensemble uses a bag approach: the count of each genre is tallied.
• The top-tallied genre is used to generate the result.
• Answer: "recharge mobile visit link"
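The "bag" tally can be sketched with a counter. The genre tags below are hypothetical stand-ins for the hand-tagged labels:

```python
from collections import Counter

# Hypothetical genre tags collected across the ensemble's decisions.
genre_bag = ["Task", "Electronics", "Task", "Task"]
top_genre = Counter(genre_bag).most_common(1)[0][0]  # highest-count genre
print(top_genre)  # Task
```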
Query Model
2.1 Query → 2.2 NLP → 2.3 Preprocessing → 2.4 Recommendation Engine → O/P
INNOVATION
• Sections removed.
• User friendly.
• Reduced man-power.
• Future plans to collaborate with the college website.
CONCLUSION AND OUTCOMES
The outcomes of this project can be summarized (but are not limited to) in the following points:
1. Complete designed architecture.
2. Proper modules and their uses defined.
3. A model solution to the problem.
Hence we would like to conclude that the theoretical and survey aspects of the problem are complete. We have selected the best technical solutions after surveying all existing alternatives. Thus, a working model is expected from the team soon.
LITERATURE SURVEY
1. Natural Language Annotations for Question Answering – Boris Katz, Gary Borchardt and Sue Felshin
2. Using English for Indexing and Retrieving – Boris Katz
3. Recommendation engine: Matching individual/group profiles for better shopping experience – Sanjeev Kulkarni, Ashok M. Sanpal, Ravindra R. Mudholkar, Kiran Kumari
4. Recommendation engine for Reddit – Hoang Nguyen, Rachel Richards, C. C. Chan, Kathy J. Liszka
5. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems – Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo
6. Executing a program on the MIT tagged-token dataflow architecture (IEEE Trans. Comput., 1990) – Arvind and Rishiyur S. Nikhil
7. An efficient K-Means Algorithm integrated with Jaccard Distance Measure for Document Clustering – Mushfeq-Us-Saleheen Shameem, Raihana Ferdous
8. An Intelligent Similarity Measure for Effective Text Document Clustering – M. L. Aishwarya, K. Selvi
9. K Means Clustering with Tf-idf Weights – Jonathan Zong
10. Comparison Between K-Mean and Hierarchical Algorithm Using Query Redirection – Manpreet Kaur, Usvir Kaur
11. Question Answering System on Education Acts Using NLP Techniques – Dr. M. M. Raghuwanshi
12. Affective – Hierarchical Classification of Text – An Approach Using NLP Toolkit – Dr. R. Venkatesan
13. Building high-level features using large scale unsupervised learning (ICML 2012) – Quoc Le, Marc'Aurelio Ranzato, Rajat Monga, and Andrew Ng
14. Preprocessing Techniques for Text Mining – An Overview – Dr. S. Vijayarani, Ms. J. Ilamathi, Ms. Nithya
15. Annotating the World Wide Web using Natural Language – Boris Katz
THANK YOU !!