Using NLP to Classify Complaints - Analytics Frontiers 2020


USING NLP TO CLASSIFY COMPLAINTS

Is it complaining or a complaint?


TABLE OF CONTENTS

1 Introduction: Who are you?

2 Definitions: What exactly is NLP? What is the difference between complaining and a complaint?

3 The Problem: How did this project come to be? What is it even about?

4 Data and Features: What dataset? How did you use the data to create a training set?

5 The Model: What models did you test and use?

6 Monitoring: How did you monitor your model’s performance?


MEET THE SPEAKER

Kyra Koch, Data Scientist at TIAA

Graduated with honors and a computer science degree from Clemson University

Currently pursuing a master's degree in computer science with an emphasis in machine learning from Georgia Tech

Has worked on many different proofs of concept and projects, such as anomaly detection, forecasting models, and similarity identification, at TIAA


TERMINOLOGY

nat·u·ral lan·guage proc·ess·ing /ˈnaCH(ə)rəl ˈlaNGgwij ˈprōˌsesiNG/

noun: a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data

com·plaint /kəmˈplānt/

noun: a statement that a situation is unsatisfactory or unacceptable, with an emphasis on regulatory complaints

com·plain·ing /kəmˈplāniNG/

verb: the expression of dissatisfaction or annoyance, often subjective and unimportant from a regulatory perspective


COMPLAINING

“Please take your disgusting commercials off my TV.”
-Complaining Customer

“**** you **** you.”
-Complaining Customer

“Stop sending these god**** ******* emails you ********.”
-Complaining Customer

“You are just a bunch of idiots unable to help me.”
-Complaining Customer

“More ********. Not from you. You're cool. From your idiot bosses. Please tell them to go straight to **** for me, do not pass "Go," do not collect $200. All they did was send me right back to square one, the ********.”
-Complaining Customer

“DON'T LIKE THE NEW WEBSITE.”
-Complaining Customer

“I NEED MY ******* MONEY RIGHT NOW!”
-Complaining Customer


COMPLAINTS

“My only option for access is by telephone, but I don't have telephone service where I am.”
-Customer Complaint

“Somehow there is a failure to communicate.”
-Customer Complaint

“Three weeks and counting, nobody can yet answer my questions about the discrepancies in the loan payment statements I got. I'm also on my 4th call about it now.”
-Customer Complaint

“The current investment representative has not been available nor taken time to plan with me.”
-Customer Complaint

“The font on the actual message is too small. It is hard to read for a person with not the best of sight.”
-Customer Complaint

“Just checked my account and the changes we have discussed have not occurred.”
-Customer Complaint


PROBLEM

Can we use machine learning to classify complaints for quality assurance?


DATA

What is the text?

FINRA Summary

4,336 ROWS

3 REVIEWERS: Alpha/Bravo/Charlie

1 DETERMINATION: Consensus/Disagreement

1 FINAL VOTE

The dataset was cleaned by removing stop words, excess whitespace, punctuation and numbers.
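The cleaning step described above could be sketched roughly as follows. This is a minimal illustration, not the speaker's actual code: the `STOP_WORDS` set and the `clean_text` helper are hypothetical, the stop-word list is a tiny stand-in for whatever list the project used, and lowercasing is an added assumption.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption; the talk does not name the list used).
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "to", "of", "on", "i", "my", "for"}

# Translation table that turns every ASCII punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text: str) -> str:
    """Remove numbers, punctuation, stop words, and excess whitespace."""
    text = text.lower()                    # assumption: case is not preserved
    text = re.sub(r"\d+", " ", text)       # drop numbers
    text = text.translate(PUNCT_TO_SPACE)  # drop punctuation
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)                # split/join collapses extra whitespace


cleaned = clean_text("Three weeks,   and 3 calls!")
print(cleaned)
```

Splitting on whitespace and re-joining with single spaces is what removes the excess whitespace here, so no separate pass is needed for it.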

If there was a discrepancy in the classification between the three reviewers, that entry was marked as ‘disagreement’ and used in the testing set.

If the three different reviewers are in agreement, we can say with confidence that the label is correct.
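One way to derive the determination from the three reviewer labels is sketched below. The `determine` helper is hypothetical, and taking the final vote as the majority label is an assumption; the slides do not say how the final vote was decided.

```python
from collections import Counter


def determine(votes):
    """Return (determination, final_vote) for the three reviewer labels."""
    counts = Counter(votes)
    # Assumption: the final vote is the majority label among the reviewers.
    final_vote, _ = counts.most_common(1)[0]
    determination = "consensus" if len(counts) == 1 else "disagreement"
    return determination, final_vote


# Reviewers Alpha, Bravo, and Charlie agree on the first entry and split on the second.
agreed = determine(["complaint", "complaint", "complaint"])
split = determine(["complaint", "non-complaint", "complaint"])
print(agreed, split)
```

With two possible labels and three reviewers, a majority always exists, so the tie case does not arise in this sketch.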

What is FINRA? FINRA is a private, not-for-profit corporation authorized by Congress to act as a self-regulatory organization for the financial industry.


FEATURE GENERATION AND MODEL DEVELOPMENT

How can we use the data we have to get the computer to solve our problem?


TERMINOLOGY

bag of words /bag əv wərds/

noun: a way of simplifying text to be used in natural language processing and information retrieval

doc·u·ment /ˈdäkyəmənt/

noun: a single text within the corpus

cor·pus /ˈkôrpəs/

noun: a collection of written texts

term fre·quen·cy-in·verse doc·u·ment fre·quen·cy /tərm ˈfrēkwənsē inˈvərs ˈdäkyəmənt ˈfrēkwənsē/

noun: a numerical statistic that is intended to reflect how important or unique a word is to a document in the corpus


BAG OF WORDS

In essence, a bag of words model is quite simply a frequency count of how often each word occurs in a single document in the corpus.

For example, take 2 documents in the corpus: “this is a test test” and “this is another test”. The resulting dataset will look as follows:

message               “this”  “is”  “a”  “test”  “another”
this is a test test   1       1     1    2       0
this is another test  1       1     0    1       1
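The word counts for the two example documents can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the `bag_of_words` helper is hypothetical and splits on whitespace, which is enough for this toy corpus.

```python
from collections import Counter


def bag_of_words(documents):
    """Return the vocabulary (in first-seen order) and one count row per document."""
    tokenized = [doc.split() for doc in documents]
    # dict.fromkeys keeps the first occurrence of each word and preserves order.
    vocabulary = list(dict.fromkeys(tok for tokens in tokenized for tok in tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


vocabulary, rows = bag_of_words(["this is a test test", "this is another test"])
print(vocabulary)  # ['this', 'is', 'a', 'test', 'another']
print(rows)        # [[1, 1, 1, 2, 0], [1, 1, 0, 1, 1]]
```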


TF-IDF

Term frequency-inverse document frequency is a method of identifying unique or important words.

$\text{tf-idf}_{t,d} = \text{tf}_{t,d} \cdot \log_e \frac{N}{\text{df}_t}$

$\text{tf}_{t,d}$ = (number of times term $t$ appears in document $d$) / (total number of terms in document $d$)

$\text{idf}_t = \log_e \frac{N}{\text{df}_t}$, where $N$ is the total number of documents and $\text{df}_t$ is the number of documents with term $t$ in it

For example: Using the same example as the previous slide, the resulting dataset will look as follows:

message               “this”  “is”  “a”     “test”  “another”
this is a test test   0       0     0.1386  0       0
this is another test  0       0     0       0       0.1733
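The TF-IDF definitions on this slide can be applied directly in code. The sketch below is a plain-Python illustration with a hypothetical `tf_idf` helper; it uses the natural logarithm and the term frequency defined as count divided by document length, as on the slide. Library implementations often apply smoothing or normalization and will give different numbers.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return the vocabulary and one row of tf-idf weights per document."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    vocabulary = list(dict.fromkeys(tok for tokens in tokenized for tok in tokens))
    # df_t: number of documents that contain term t at least once.
    doc_freq = {term: sum(term in tokens for tokens in tokenized) for term in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([
            (counts[term] / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, rows


vocabulary, weights = tf_idf(["this is a test test", "this is another test"])
for row in weights:
    print([round(value, 4) for value in row])
```

Words that occur in every document, such as “this”, “is”, and “test”, get a weight of 0 because log(2/2) = 0. The word “a” in the first document gets (1/5) · ln 2 ≈ 0.1386.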


DATASETS

TRAINING

Used for building the model and letting it learn

Is usually representative of the total population

55% of the data

VALIDATION

Used for iterating on the model and fine-tuning parameters

25% of the data

TESTING

Used after fine-tuning the model to ensure that the parameters are not overfitted to the training and validation set

20% of the data
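A 55/25/20 split like the one above can be sketched with a seeded shuffle. The `split_dataset` helper is hypothetical, and the purely random assignment is an assumption: it ignores any rule about which rows must land in a particular set.

```python
import random


def split_dataset(rows, train=0.55, validation=0.25, seed=0):
    """Shuffle the rows and cut them into training, validation, and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    training = shuffled[:n_train]
    validating = shuffled[n_train:n_train + n_val]
    testing = shuffled[n_train + n_val:]   # whatever remains, about 20%
    return training, validating, testing


training, validating, testing = split_dataset(range(4336))
print(len(training), len(validating), len(testing))  # 2385 1084 867
```

Giving the testing set the remainder, rather than rounding a third fraction, guarantees that every row ends up in exactly one set.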


MODEL

ran·dom fo·rest /ˈrandəm ˈfôrəst/

noun: an ensemble of decision trees, each built with a random subset of the features, which reduces a single decision tree’s tendency to overfit the training dataset

de·ci·sion tree /dəˈsiZHən trē/

noun: a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility


VALIDATION

After training a random forest of 100 trees with a minimum leaf size of 2, we run the validation set through the model.
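A forest with these two settings could be configured as in the sketch below. It assumes scikit-learn, which the talk does not name, and maps "100 trees" to `n_estimators=100` and "minimum leaf size of 2" to `min_samples_leaf=2`. The texts and labels are made-up placeholders, not project data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, already-cleaned example messages (1 = complaint, 0 = non-complaint).
texts = [
    "account changes discussed not occurred",
    "loan statement discrepancies unanswered",
    "representative not available plan",
    "hate new website",
    "stupid commercials off tv",
    "bunch idiots",
]
labels = [1, 1, 1, 0, 0, 0]

# Bag-of-words features: one column per word, one row per message.
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(texts)

# 100 trees, each leaf holding at least 2 training samples.
model = RandomForestClassifier(n_estimators=100, min_samples_leaf=2, random_state=0)
model.fit(features, labels)

# New messages must go through the same fitted vectorizer before prediction.
new_features = vectorizer.transform(["statement discrepancies not resolved"])
predictions = model.predict(new_features)
print(predictions)
```

Fixing `random_state` only makes the sketch repeatable; it is not a setting mentioned in the talk.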

Overall Accuracy: 78.51%

                      Predicted Non-Complaint                 Predicted Complaint
Actual Non-Complaint  312 (76.85% of Actual Non-Complaints)   94 (23.15% of Actual Non-Complaints)
Actual Complaint      139 (20.50% of Actual Complaints)       539 (79.50% of Actual Complaints)

The model has somewhat more trouble classifying non-complaints (76.85% correct) than complaints (79.50% correct). However, both percentages are within a reasonable margin of each other, so it is not a cause for concern.
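The percentages on this slide follow directly from the four counts in the validation confusion matrix. The sketch below recomputes them; the `summarize` helper and its key names are hypothetical.

```python
def summarize(tn, fp, fn, tp):
    """Compute overall accuracy and the per-class hit rates from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "non_complaints_correct": tn / (tn + fp),  # share of actual non-complaints predicted correctly
        "complaints_correct": tp / (fn + tp),      # share of actual complaints predicted correctly
    }


# Validation counts: 312 and 94 for actual non-complaints, 139 and 539 for actual complaints.
validation = summarize(tn=312, fp=94, fn=139, tp=539)
for name, value in validation.items():
    print(f"{name}: {value:.2%}")
```

Reporting the two per-class rates next to the overall accuracy shows whether one class is carrying the result.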

* Note: All data provided in this presentation is artificial to minimize risk.


TESTING

After fine-tuning the parameters and adjusting the appropriate thresholds, the model was tested on the final holdout set.

Overall Accuracy: 79.58%

                      Predicted Non-Complaint                 Predicted Complaint
Actual Non-Complaint  291 (75.58% of Actual Non-Complaints)   94 (24.42% of Actual Non-Complaints)
Actual Complaint      83 (17.22% of Actual Complaints)        399 (82.78% of Actual Complaints)

Maintained overall accuracy

Distribution was better

Consistent



DEPLOYMENT

API


MONITORING

V0.4.1

989     489
426     3,651

83.52%

Misclassification Rate: 16.48%



MONITORING



KEY TAKEAWAYS

Text Data
Data Formatting: The data comes in as raw text data that needs to be formatted

Bag of Words
Word Count: The text is transformed into a word count dataset

Random Forest
Ensemble: Put the data through a collection of decision trees

Quality Control
Human Resources: Have a manual review of instances where the model predicts differently than the human

Monitoring
Monitor: Monitor model performance and make regular adjustments

Steps Taken to Create and Utilize an NLP Model