Text Analytics Presentation
Transcript of Text Analytics Presentation
An Introduction to Text Analytics
An Introduction to Text Analytics in IBM SPSS Modeler
Skylar Ritchie
Shawn Bergman
Objectives
- To give a broad overview of text analytics
  - Defining key terms
  - Describing important steps in the process
- To provide a step-by-step tutorial for how to use IBM SPSS Modeler to...
  - Read in source text
  - Extract concepts, sentiment, and text link patterns from records
  - Categorize records
  - Visualize the results
Having read the 225-page User's Guide cover to cover and watched countless videos on Modeler, I can personally attest that the two most difficult aspects of learning the software are
- Distinguishing between terms that look similar but signify very different ideas
- Coming up with an organizational framework for understanding the many things you can do in Modeler
The first half of the presentation is dedicated to the first difficulty, and the second half of the presentation, to the second
The overriding goal of this presentation is for you to feel as though you can explore the software for yourselves
In putting it together, I tried to focus only on the essentials, and even though I only scratched the surface of what the software can do, we will have to hustle to make it through everything
However, we will post this presentation with all of its examples and videos on the Office of Research Consultation website so that you can use it as a resource and refer back to it when you need it
In the interest of time, I am going to cover the first half of the presentation relatively quickly, but if I am moving too quickly, please do not hesitate to ask questions and slow me down; just understand that we may not get to everything and that you may have to watch some of the videos at the end for yourself
Objective #1: Overview of Text Analytics
Text Analytics
"The process of deriving high quality information from text" -- Marisa Peacock, Social Media Strategist
"A technology and process both, a mechanism for knowledge discovery applied to documents, a means of finding value in text. Solutions analyze linguistic structure...discern entities...as well as relationships, concepts, and even sentiments. They...automate classification...of source documents. They exploit visualization for exploratory analysis." -- Seth Grimes, Analytics Strategy Consultant
- Extraction: to discern entities, relationships, concepts, and sentiments
- Categorization: to automate classification
- Visualization
Let's start with a definition of text analytics
One thing both of these definitions have in common is that they both describe text analytics as a process
Furthermore, both definitions describe the outcome of this process in similar terms: the outcome is high quality information, knowledge, and value
The second definition, however, is somewhat more descriptive than the first, since it enumerates the principal steps in this process
Those steps are to
- Discern entities, concepts, and the relationships between them, something IBM calls extraction
- Automate classification, something IBM calls categorization
- Visualize the results
In my presentation today, I will first describe these steps in greater detail and then show you how to perform them for yourself
What does text analytics look like?
Handout provided
So what does this process look like?
On the macro-level, the process involves four primary steps:
- Reading in source text
- Extracting linguistic entities, relationships, and sentiment
- Categorizing records
- Visualizing the results
On more of a micro-level, the primary steps of extracting and categorizing can be broken down further:
- Extraction involves passing the source text through a variety of dictionaries (to be described in greater detail) in order to identify
  - Concepts
  - Types
  - Text Link Analysis patterns
- Categorization involves taking these extraction results and applying a number of grouping techniques in order to create categories and descriptors that classify records
These diagrams depict text analytics as a linear process; however, as the User's Guide repeatedly emphasizes, text analytics is an iterative process, so a more accurate depiction might include a feedback loop
Key Terms
Source text file
Field
Document/record
Let's take a look at the first step in the text analytics process: sourcing
Source text can take the form of either a computer file (such as an Excel file) or a Web feed (such as an RSS feed with various web links)
Since the focus of today's presentation is to demonstrate how to perform extraction, categorization, and visualization, I will use an Excel file as source text
Using a Web feed as a source is a little less straightforward, but if you are interested in that as well, I can make that the topic of a future presentation
Within an Excel file, you have worksheets, whose columns are known as fields and whose cells are referred to as either documents or records, two terms that IBM uses interchangeably
For the sake of simplicity, I will refer to them from here on as records
Key Terms
Types: higher-level concepts
Concepts: lead terms under which similar terms are grouped together
Terms: single words (uni-terms) or word phrases (multi-terms) that are interesting or relevant
Handout provided
Let's turn now to the second main step in text analytics: extraction
Here for the first time we encounter a number of terms that look similar, but signify very different ideas
In fact, these ideas are arranged hierarchically, with terms at the bottom and types at the highest level of abstraction
Terms and concepts are always written as lowercase words or word phrases, and types are always enclosed in brackets
The general types that come with the Core Library (more on that later) include , , , and
But types in other more specific libraries can themselves be more specific:
- Types in the Opinions Library include , , , and among others
- Types in the Employee Satisfaction Library include , , , and among others
Substitution Dictionary: Terms → Concepts
An editable collection of synonymous terms grouped under a target term, or concept
Target Term: Synonyms
university: university, college, school, academy, institute, polytechnic, alma mater, graduate school
student: student, scholar, undergraduate, graduate, grad student, postdoctoral fellow, freshman, sophomore, junior, senior
professor: professor, prof, tenured faculty member, dean, assistant professor, associate professor, lecturer, academic
As I mentioned earlier, there are several linguistic dictionaries that are instrumental in the extraction process
The first of these is known as a substitution dictionary, and it is responsible for grouping terms under what are called target terms or concepts
The computer scans all of the records, and whenever it finds synonymous terms, it essentially rewrites them as the target term
It is important to note that this dictionary (and all the others) are editable
- So if, for example, you want to distinguish between universities and institutes, you can separate the two terms in your substitution dictionary
- And if, on the other hand, you want to use two terms synonymously, you can combine them in this dictionary
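As a rough illustration outside of Modeler, the substitution step can be sketched in a few lines of Python. This is only a sketch of the idea, not how Modeler implements it: the names `SUBSTITUTIONS`, `build_lookup`, and `substitute` are made up for this example, and the dictionary holds just a handful of the synonyms from the slide.

```python
import re

# Hypothetical substitution dictionary: target term -> synonyms (a subset of the slide's example).
SUBSTITUTIONS = {
    "university": ["college", "school", "academy", "institute"],
    "student": ["scholar", "undergraduate", "grad student"],
    "professor": ["prof", "lecturer", "assistant professor"],
}


def build_lookup(substitutions):
    """Invert the dictionary so that every synonym points to its target term."""
    lookup = {}
    for target, synonyms in substitutions.items():
        lookup[target] = target
        for synonym in synonyms:
            lookup[synonym] = target
    return lookup


def substitute(text, substitutions):
    """Lowercase the text and rewrite every synonym as its target term."""
    lookup = build_lookup(substitutions)
    # Try longer entries first so multi-terms such as "grad student" win over "student".
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(a) for a in alternatives) + r")\b")
    return pattern.sub(lambda match: lookup[match.group(1)], text.lower())
```

Editing the dictionary then amounts to editing the mapping: removing "institute" from the "university" entry, for instance, would keep the two terms distinct.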
Type Dictionary: Concepts → Types
An editable collection of concepts grouped under a label known as the type name
Columns: Concept, Type
Concepts shown: 5 star; a lot better; beyond my expectations; abhor; bizarre; can't stand; all about the same; been with it for too little time; can't think of any
The second linguistic dictionary is known as a type dictionary, and as its name implies, it is responsible for grouping concepts under their respective types
Here the computer assigns a higher-level descriptive label to the concepts themselves, and although it is generally pretty good at assigning types when given some kind of context, if it is not given context, it will often assign the type
Exclude Dictionary
An editable collection of terms and types that will be removed from the final extraction results
Exclude List
- any kind of problem
- can't say enough
- can't wait
- i was out of
- if it ain't broke, don't fix it
- prefer not to
- to work with
- went down to
The third and final linguistic dictionary is known as the exclude dictionary, and as its name suggests, whatever it contains is excluded from the final extraction
As you peruse this dictionary, you might find a term or phrase that you do want to extract, and by deselecting it in this dictionary, you can ensure that it shows up in the extraction results
There is also a way to assign unwanted terms and phrases to the exclude dictionary
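The effect of the exclude dictionary can be pictured with a small Python sketch. This is an illustration under assumptions, not Modeler's own mechanism: `EXCLUDE`, `apply_exclude`, and the sample extraction results are invented for the example, with the exclude entries borrowed from the slide.

```python
# Hypothetical exclude list (entries taken from the slide) and made-up extraction results.
EXCLUDE = {"any kind of problem", "to work with", "went down to"}


def apply_exclude(extracted, exclude):
    """Drop every extracted term that appears in the exclude dictionary."""
    return [term for term in extracted if term not in exclude]


extracted = ["university", "to work with", "professor", "went down to"]
kept = apply_exclude(extracted, EXCLUDE)

# Deselecting an entry (here "to work with") lets it back into the extraction results.
kept_after_edit = apply_exclude(extracted, EXCLUDE - {"to work with"})
```

Here `kept` holds only "university" and "professor", while `kept_after_edit` also keeps "to work with".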
Text Link Analysis (TLA)
A pattern-matching technology that is used to extract relationships found between
- Either concepts
- Or types
Handout provided
Text Link Analysis (or TLA) is where text analytics really demonstrates its value
TLA patterns are the fourth and final kind of extraction results
Whereas the other extraction results (terms, concepts, and types) represent a single linguistic unit, TLA patterns represent the relationships between these units and can express the meaning of an entire sentence with a subject, verb, and predicate
As the examples at right indicate
- Patterns can contain 2 or more concepts or types
- Order is important (indicated by the + operator), but sentiments always come last
Key Terms
Categorization: the process of assigning records to a category when the text within them matches a descriptor
Category: higher-level ideas that capture the central message of the text
Descriptor: concepts, types, patterns, and category rules that have been used to define a category
Finally, let's turn to the third main step in text analytics: categorization
Whereas extraction involves bundling the terms, concepts, and types within records, categorization bundles the records themselves on the basis of what they contain
Descriptors determine whether or not a record is assigned to a given category, and descriptors can take the form of concepts, types, TLA patterns, or category rules
Category Rules
Statements that classify records into a category based on a logical expression using extracted concepts, types, and patterns as well as Boolean operators
Operator   Meaning                     Example
+          And (order important)       university + excellent
&          And (order not important)   excellent & university
|          Or                          student | university
!()        Not                         !(student)
Matching Sentence: This is a 5 star university
Handout provided
Since we have already covered concepts, types, and TLA patterns, let's move on and cover category rules
In one way, category rules are like TLA patterns: they often join concepts or categories to describe a record and determine whether or not it belongs in a category
In another way, however, category rules are unlike TLA patterns
- In the first place, they can use operators such as the ampersand or the vertical bar, in which case order is not important
  - (excellent & university) would capture the exact same records as (university & excellent)
- In the second place, category rules can indicate the absence of something, whereas TLA patterns only focus on the presence of things
  - !(student) would capture all of the records that do not contain student, and this might be a considerable number
  - Usually, you would want to use the not operator in conjunction with another operator such as student & !(professor)
Wildcard Operator
The Boolean operator * that acts as a variable and stands in for a missing word or word fragment
Usage                  Example      Matching Phrases
Space after word       graduate *   graduate school, graduate student
Space before word      * graduate   university graduate
No space after word    graduate*    graduates, graduated
No space before word   *graduate    undergraduate
The fifth and final Boolean operator is known as the wildcard, and you can think of it as a variable that represents a missing
- Prefix
- Suffix
- Or word that precedes or comes after a given word
If there is a space either before or after the wildcard, the wildcard represents a missing word
If, on the other hand, there is no space, then the wildcard only represents a part of a word
Wildcards can be useful for generalizing category descriptors, but in some instances, they can overgeneralize
For example, graduated can be either an adjective or a verb, and if it is an adjective, it can refer to an alumnus or to a cylinder, and depending on the context, you may want to capture one concept but not the other with your descriptor
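The space/no-space distinction can be sketched by translating a wildcard descriptor into a regular expression. This is only an approximation under assumptions, not Modeler's matching logic: the function names `wildcard_to_regex` and `matches` are invented, a free-standing * is taken to mean exactly one word, and an attached * is taken to mean zero or more word characters.

```python
import re


def wildcard_to_regex(descriptor):
    """Translate a descriptor containing * into a compiled regular expression."""
    parts = []
    for token in descriptor.split():
        if token == "*":
            # A * separated by a space stands in for one whole missing word.
            parts.append(r"\w+")
        else:
            # A * attached to a word stands in for a missing prefix or suffix.
            parts.append(re.escape(token).replace(r"\*", r"\w*"))
    return re.compile(r"\s+".join(parts))


def matches(descriptor, phrase):
    """True when the whole phrase fits the wildcard descriptor."""
    return wildcard_to_regex(descriptor).fullmatch(phrase) is not None
```

For example, `matches("graduate *", "graduate school")` and `matches("*graduate", "undergraduate")` both hold, whereas `matches("graduate *", "graduates")` does not, because the space in the descriptor calls for a second word.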
Grouping Techniques
The mechanisms underlying the categorization process
Handout provided
Having covered category rules, the fourth kind of category descriptor, let's turn to the grouping techniques that generate both the categories and their descriptors
There are four of these: concept inclusion, concept root derivation, semantic networks, and co-occurrence
Concept Inclusion
What? Grouping based on subsets and supersets
How?
- Breaking concepts into components
- De-inflecting components
When? Text that is somewhat technical
Concept inclusion is a grouping technique that involves breaking concepts into their component sets, de-inflecting these components, and then identifying areas of overlap
For example, let's say you had the multi-term concepts graduate faculty, faculty committees, and tenured faculty members
These concepts would first be broken down into their component sets and then these sets would be de-inflected (e.g., converting nouns from plural to singular)
In the process at right, I have illustrated the de-inflection process by underlining the parts of the word that are removed in a subsequent step
In these component sets, the order of the words is not important; the only thing that is important for the concept inclusion technique is whether or not these component sets have areas of overlap
Concept inclusion is a technique that is relatively robust and works well on text that contains technical jargon
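The steps just described (break into components, de-inflect, look for overlap) can be sketched in Python as follows. The de-inflection here is deliberately naive, stripping only a plural "s", and the names `deinflect`, `components`, `shared_components`, and `is_included` are assumptions for this example rather than anything Modeler exposes.

```python
def deinflect(word):
    """Naive de-inflection: strip a plural "s" (a stand-in for real linguistic rules)."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def components(concept):
    """Break a concept into its unordered set of de-inflected components."""
    return frozenset(deinflect(word) for word in concept.lower().split())


def shared_components(concept_a, concept_b):
    """Components that two concepts have in common (their area of overlap)."""
    return components(concept_a) & components(concept_b)


def is_included(concept_a, concept_b):
    """True when concept_a's component set is a subset of concept_b's."""
    return components(concept_a) <= components(concept_b)
```

With the three concepts from the example, every pair overlaps on "faculty", and `is_included("faculty", "tenured faculty members")` holds because the single-component set is a subset of the larger one.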
Concept Root Derivation
What? Grouping based on morphological relationships
How?
- Breaking concepts into components
- De-inflecting components
- Removing suffixes to find root
When? Any text, but few categories
Concept root derivation employs a very similar process, but goes one step further, stripping words down to their morphological or structural roots so that areas of overlap can be identified
As you can see at right, psychology, psychological, and psychologist all have the same root (psycholog-), and the concepts can be grouped into categories on the basis of this similarity
Semantic Network
What? Grouping based on semantic relationships
How?
- Synonyms: "are" relationship
- Hyponyms: "is a" relationship
When? Text that is not highly technical
Unlike concept root derivation, which categorizes concepts on the basis of morphological relationships, the semantic network technique looks for and categorizes concepts on the basis of semantic relationships, relationships having to do with word meanings
These semantic relationships generally take the form of either synonyms or hyponyms, where the former denotes an "are" relationship, and the latter, an "is a" relationship
- Professors and teachers, for example, might be considered synonyms, since they both are educators
- Psychology and social science, on the other hand, are hyponyms, since psychology is a social science
Co-occurrence
Example: Concepts
Students flock to ASU: students = W, ASU = X
ASU focuses on sustainability: ASU = X, sustainability = Y
Sustainability is the way of the future: sustainability = Y, way of the future = Z
The fourth and final grouping technique is that of co-occurrence
Cxy represents the number of records in which two concepts co-occur; Cx, the number in which the first concept occurs; Cy, the number in which the second occurs
Generally, concepts must co-occur two or more times in order for them to be categorized together; however, this setting can be adjusted either higher or lower
- If your setting is high, you will generate fewer categories, but these categories will contain concepts that are more similar to each other
- If your setting is low, you will generate more categories, but they will be more heterogeneous
Co-occurrence is a relatively straightforward technique, but if you are interested in how it computes a similarity coefficient for two concepts, several sample calculations are illustrated at right
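Since the sample calculations themselves are not reproduced in this transcript, here is a hedged Python sketch of one such calculation. It assumes the coefficient takes the form Cxy squared divided by (Cx times Cy), which is an assumption on my part and should be checked against the User's Guide; the function name `similarity` and the three sample records (borrowed from the co-occurrence slide) are likewise only illustrative.

```python
def similarity(records, x, y):
    """Assumed similarity coefficient: Cxy squared over (Cx times Cy)."""
    c_x = sum(1 for concepts in records if x in concepts)
    c_y = sum(1 for concepts in records if y in concepts)
    c_xy = sum(1 for concepts in records if x in concepts and y in concepts)
    if c_x == 0 or c_y == 0:
        return 0.0  # a concept that never occurs cannot co-occur with anything
    return c_xy ** 2 / (c_x * c_y)


# The three example sentences, reduced to the concepts extracted from each record.
records = [
    {"students", "ASU"},
    {"ASU", "sustainability"},
    {"sustainability", "way of the future"},
]
```

Under this assumed formula, students and ASU score 0.5 (they co-occur once; students occurs once and ASU twice), ASU and sustainability score 0.25, and students and sustainability score 0 because they never share a record.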
Extraction v. Categorization
Ends
- Extraction: To discover what records contain
- Categorization: To classify records based on what they contain
Means
- Extraction: Substitution dictionary, Type dictionary, Exclude dictionary
- Categorization: Concept root derivation, Concept inclusion, Semantic network, Co-occurrence
Output
- Extraction: Concepts, Types, TLA patterns
- Categorization: Categories; Descriptors (Concepts, Types, TLA patterns, Category rules)
To sum up what we have said so far, extraction differs from categorization both in terms of its purpose or end and in terms of its means to that end
The purpose of extraction is to discover what records contain, whereas the purpose of categorization is to classify records on the basis of what they contain
The means used are also different
- Extraction takes place by comparing records against a number of dictionaries
- Categorization, on the other hand, involves applying a variety of algorithms to the extraction results to create categories
In this way, concepts, types, and TLA patterns are both output and input: output for the extraction process and input for the categorization process
They are what gets pulled out of records and what the software then turns around and uses to classify those records
Objective #2: Modeler Tutorial
Now that we have parsed out what the terminology means, let's take a look at the software itself and see how to perform the various tasks associated with sourcing, extracting, categorizing, and visualizing
As I mentioned earlier, one difficulty in learning Modeler is distinguishing between terms that look similar; however, a second difficulty concerns organizing the many different tasks you can perform in Modeler
To surmount this second difficulty, I have provided a number of charts so that you can keep track of what we have done and what we are doing
If you have the data set, you may find it helpful to follow along on your computer
Creating a New Stream
- Open IBM SPSS Modeler 17.1
- Select
- Click Ok
- To create another stream, click
A stream is just your workspace, and it lays out in a visual fashion
- What data you are using
- What processes you are running it through
The data set that you gave us to analyze is a focus group conversation about the strategic direction of the College of Business
Because you are probably less interested in moderator comments than you are in those of participants, you may want to filter out the moderator's remarks in Excel before you start the analysis process
Sourcing an Excel File
- Click the tab
- Double click the node or click and drag it into the stream
- Double click the node within the stream or right click and click Edit
- Click on the tab
- Select the
- Select theA
- Select
- Click Ok
Handout provided
Less information in substitution, type, and exclude dictionariesNo categories
More information in substitution, type, and exclude dictionariesNo categories
More information in substitution, type, and exclude dictionariesPre-built categories
Templates initiate the extraction phase and pull out concepts and types
There are many different kinds of templates, some of which contain more in their substitution, type, and exclude dictionaries than others
There are also what are called text analysis packages (or TAPs) that come
- Not only with a wealth of information in their dictionaries
- But also with a number of pre-built categories that you may be interested in when you are conducting your analysis
For example, there is a TAP for employee satisfaction surveys, and the categories that it comes with include positive and negative sentiment toward
- Coworkers
- Managers
- Communication
- Job security
- Benefits
- Etc.
If you are not interested in all of the pre-built categories, you can delete or modify them to suit your preferences
Starting an Interactive Workbench Session with the Basic Resources Template
- Click the tab
- Double click the node or click and drag it into the stream
- Double click the node within the stream or right click and click Edit
- Click on the tab
- Select the
- Click on the tab
- Select
- Click
Interactive Workbench Categories & Concepts View
- Categories Pane
- Extraction Results Pane
- Data Pane
Interactive Workbench Resource Editor View
- Type Dictionary
- Substitution Dictionary
- Exclude Dictionary
Starting an Interactive Workbench Session with the Opinions Template
- Double click the node within the stream or right click and click Edit
- Click on the tab
- Click
- Select
- Click Ok
- Click
Interactive Workbench Categories & Concepts View
- Concept View
Interactive Workbench Categories & Concepts View
- Type View
Interactive Workbench Resource Editor View
- Type Dictionary
- Substitution Dictionary
- Exclude Dictionary
Starting an Interactive Workbench Session with the Opinions Text Analysis Package
- Double click the node within the stream or right click and click Edit
- Click on the tab
- Select
- Click
- Select
- Click
- Click
Interactive Workbench Categories & Concepts View
- Categories Pane
- Extraction Results Pane
- Data Pane
Interactive Workbench Resource Editor View
- Type Dictionary
- Substitution Dictionary
- Exclude Dictionary
Templates v. Text Analysis Packages
Basic Resources Template
- Libraries: Local, Core, Variations, Nonlinguistic Entities
- Pre-Built Categories: No
Opinions Template
- Libraries: Local, Core, Variations, Nonlinguistic Entities, Opinions, Budget, Slang, Emoticon
- Pre-Built Categories: No
Opinions Text Analysis Package
- Libraries: Local, Core, Variations, Nonlinguistic Entities, Opinions, Budget, Slang, Emoticon
- Pre-Built Categories: Yes
Handout provided
Interactive Workbench Categories & Concepts View
Editing the Substitution Dictionary
- Right click on the concept
- Select Add to Synonym
- Click New
- Create the target term to which you want to assign the synonym
- Click Ok
- Click
Interactive Workbench
- Categories & Concepts View
- Resource Editor View
Interactive Workbench Categories & Concepts View
Editing the Type Dictionary
- Right click on the concept
- Select Add to Type
- Click More
- Select the type to which you want to assign the concept
- Click Ok
- Click Ok again
- Click
Interactive Workbench
- Categories & Concepts View
- Resource Editor View
Interactive Workbench Categories & Concepts View
Editing the Exclude Dictionary
- Right click on the concept
- Click Exclude from Extraction
- Click
Interactive Workbench
- Categories & Concepts View
- Resource Editor View
Extracting TLA Patterns
- In the Text Link Analysis View, click
- Select a type pattern to see the concept patterns that correspond to it
- Click to see the concepts and type webs corresponding to these patterns
Interactive Workbench Text Link Analysis View
Automatically Building Categories
In the Categories & Concepts View, click Click Edit:SelectClick ClickSelectSelectSelectSelectSelectClick OkClick
Interactive Workbench Categories & Concepts View
- Category
- Subcategory
- Descriptor
- Visualization Pane: Category Bar
Interactive Workbench Categories & Concepts View
- Category Web
Interactive Workbench Categories & Concepts View
- Category Web Table
Interactive Workbench Categories & Concepts View
Manually Categorizing Concepts
- Select the concept you want to categorize
- Click
- Select the category to which you want to assign the concept:
- Click Ok
Interactive Workbench Categories & Concepts View
Interactive Workbench Categories & Concepts View
Manually Categorizing Types
- Select the type you want to categorize
- Click
- Select the category to which you want to assign the type or create a new category:
- Click Ok
Interactive Workbench Categories & Concepts View
Interactive Workbench Text Link Analysis View
- Type Patterns
- Concept Patterns
Manually Categorizing TLA Patterns
- Select the TLA pattern you want to categorize
- Click
- Select the category to which you want to assign the pattern or create a new category:
- Click Ok
Interactive Workbench Categories & Concepts View
Manually Creating Category Rules
- Right click on the category for which you want to create a rule
- Click Create Category Rule
- Create your rule by
  - Dragging concepts or types into the Rule Editor
  - Combining them with Boolean operators
- Click to see how many records match
- Click
Interactive Workbench Categories & Concepts View
Handout provided
Now that we have explored the extraction and categorization results with the Opinions Template, let's move to the Opinions Text Analysis Package
As you'll remember from the first part of the tutorial, the difference between a template and a text analysis package is that the former does not come with pre-built categories, whereas the latter does
Because the focus group conversation is not in the proper format, with a question as the field header and each record as one person's response to that question, we will switch to a slightly different data set that is in the proper format so that we can demonstrate the remaining capabilities
This data set is a questionnaire about a company's safety program, and the field that we will be looking at has to do with what employees want the company to stop doing with regard to safety
Because this is an employee opinion questionnaire, we can use the employee opinion text analysis package
Interactive Workbench Categories & Concepts View
Manually Adjusting Categories
- Right click on the category or categories that you want to adjust
- Select either Move to Category or Merge Categories or Edit > Delete
Interactive Workbench Categories & Concepts View
Interactive Workbench Categories & Concepts View
Generating Model
- Once you are satisfied with the categories you have created, click
- Drag the newly created modeling node into your stream
- Right click on your source node
- Click Connect
- Click on your modeling node to connect the two nodes
Stream
Converting Model Categories to Fields
- Right click on your modeling node
- Click Edit
- Click on the tab
- Select
- Change the
- Click Ok
Deriving a Total Negativity Score
- Click on the tab
- Double click the node or click and drag it into the stream
- Double click the node within the stream or right click and click Edit
- Give a descriptive name to your
- Click to create a formula
- In Expression Builder, click on a category that you want to be in your formula
- Click to add it
- Click on an operator such as
- Add another category
- When you are finished, click Ok
- Repeat the process to create additional formulas
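Outside of Modeler, the kind of formula built in these steps can be pictured with a short Python sketch. It rests on several assumptions that the transcript does not confirm: that each category has been converted to a 1/0 flag field, that the operator chosen in Expression Builder is addition, and that the field names (`neg_communication`, `neg_management`, `pos_benefits`) look anything like the ones in the real data set; they are invented here.

```python
# Hypothetical scored records: one 1/0 flag field per model category.
records = [
    {"id": 1, "neg_communication": 1, "neg_management": 1, "pos_benefits": 0},
    {"id": 2, "neg_communication": 0, "neg_management": 0, "pos_benefits": 1},
    {"id": 3, "neg_communication": 1, "neg_management": 0, "pos_benefits": 1},
]

# Assumed list of the negative-sentiment category fields to add together.
NEGATIVE_FIELDS = ["neg_communication", "neg_management"]


def total_negativity(record, fields=NEGATIVE_FIELDS):
    """Sum the negative category flags for one record."""
    return sum(record[field] for field in fields)


scores = {record["id"]: total_negativity(record) for record in records}
```

Each record's score is simply the number of negative categories it was assigned to, so record 1 scores 2, record 2 scores 0, and record 3 scores 1.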
Deriving an Overall Sentiment Score
- Click on the tab
- Double click the node or click and drag it into the stream
- Double click the node within the stream or right click and click Edit
- Give a descriptive name to your
- Select
- Define field settings:
- Click Ok
Stream
Visualizing Model Results
- Click on the tab
- Double click the node or click and drag it into the stream
- Double click the node within the stream or right click and click Edit
- Click on the tab
- Select
- Select overlay:
- Select
- Click
Summary
- To give a broad overview of text analytics
  - Defining key terms
  - Describing important steps in the process
- To provide a step-by-step tutorial for how to use IBM SPSS Modeler to...
  - Read in source text
  - Extract concepts, sentiment, and text link patterns from records
  - Categorize records
  - Visualize the results
Additional Resources
- User's Guide: http://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/17.0/en/ModelerTextAnalytics.pdf
- Introduction to SPSS Text Analytics Webinar: https://www.youtube.com/watch?v=tK-o4MnRScQ&list=WL&index=2