Unstructured Data
James Balamuta STAT 385 @ UIUC
Lecture 23: Apr 03, 2019
• Unstructured Data
• Text Representations
• String Operations
  • Length, Case, Concatenation, Substring, Split String
• Text Mining
Announcements
• hw07 is due Wednesday, April 10th, 2019 at 6:00 PM
• Sign up for Quiz 07.
• Window: Apr 7th - Apr 9th
• Sign up: https://cbtf.engr.illinois.edu/sched
• Want to review your homework or quiz grades? Schedule an appointment.
Last Time
• Connecting to a Database
• Interactively obtaining and updating data
• Structured Query Language
• Declarative domain-specific language that handles data querying, manipulation, access, and definitions.
Topic Outline
Tools for Reproducibility
Working with Data
Simulations
Modeling Data
Visual Exploratory Data Analysis
Interactive Interfaces
Relational Data
Unstructured Data
Packages
High Performance Computing
Lecture Objectives
• Manipulate unstructured data
• Understand where unstructured data is found
• Differentiate between character values
Unstructured Data
Structures of Data… how data is shaped …

• Structured (~5 - 10%): rectangular data. The typical form for scientific experiments and company databases.
• Semistructured (~5 - 10%): key: value pairs, e.g. RMarkdown document properties (YAML), JavaScript Object Notation (JSON), XML. For example:

---
title: "Untitled"
author: "JJB"
date: "1/27/2018"
output: html_document
---

• Unstructured (~80 - 90%): pure text documents, images, social media posts, and so on. No visible relationship. For example:

Pinky said, "Gee, Brain. What are we going to do tonight?" The Brain replied, "The same thing we do every night, Pinky. Try to take over the world."
Previously
E-mails and Spam… common unstructured data problems …
Text on Web Pages… common unstructured data problems …
Free Response in Surveys… common unstructured data problems …
Text Representations
Definition: A character is a single symbol that is displayed.
'a' 'b' 'c' 'D' 'E' 'F'
'1' '2' '3' '4'
' ' '*' ',' ' " '
Definition: A string is multiple characters combined together.
'UIUC' 'STAT' 'Chambana' 'Chicago' 'Illinois'
Character Representation
class("S") # [1] "character"
class("STAT 385") # [1] "character"
double_quote = "Hello World!"
single_quote = 'Hello World!'
complex_string = "It's happening!"
escape_string = 'It\'s happening!'
white_space = " "
empty_string = ""
Characters Welcome… constructing strings …
Escape Characters… using special characters …
Symbol Description
\n newline
\r carriage return
\t tab
\b backspace
\a alert (bell)
\f form feed
\v vertical tab
Symbol Description
\\ backslash \
\' ASCII apostrophe '
\" ASCII quotation mark "
\` ASCII grave accent (backtick) `
\nnn character with given octal code (1, 2 or 3 digits)
\xnn character with given hex code (1 or 2 hex digits)
\unnnn Unicode character with given code (1--4 hex digits)
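A quick way to see these escapes in action is `cat()`, which renders the interpreted characters, whereas `print()` shows the escape sequences literally. A minimal sketch:

```r
# cat() renders escape sequences such as \t (tab) and \n (newline)
cat("Column1\tColumn2\nA\t1\n")

# Escaped double quotes inside a double-quoted string
cat("She said \"hi\"\n")
# She said "hi"

# Unicode escape: Greek letter mu
cat("\u00b5")
# µ
```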
Your Turn
Construct a string that includes the following quote in R:

“Actually, I see it as part of my job to inflict R on people who are perfectly happy to have never heard of it. Happiness doesn't equal proficient and efficient. In some cases the proficiency of a person serves a greater good than their momentary happiness.”

– Patrick Burns, R-help (2005)
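One possible approach (a sketch, not the only answer): because the quote contains apostrophes, wrapping it in double quotes avoids escaping; with single-quote delimiters every apostrophe would need `\'`.

```r
# Double-quote delimiters let the apostrophes pass through unescaped
burns = "Actually, I see it as part of my job to inflict R on people who are perfectly happy to have never heard of it. Happiness doesn't equal proficient and efficient. In some cases the proficiency of a person serves a greater good than their momentary happiness."

cat(burns)
```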
String Operators
Length
# Total number of elements
length("stat")
# [1] 1

# How many letters per element?
nchar("stat")
# [1] 4

# Example String Vector
ex_string = c("stat", "eoh", "r")

length(ex_string)
# [1] 3

nchar(ex_string)
# [1] 4 3 1
Determining the character count of a string
Behind Length… determining size …

• length(x) counts elements: c("stat", "eoh", "R") is one vector containing 3 elements.
• nchar(x) counts characters within each element: "stat" is 1 element containing 4 characters.
Real-World Example… giving a name for a mobile order at Starbucks …
nchar(x)
Your Turn
Count the number of characters used in this sound clip:
https://factba.se/transcript/donald-trump-remarks-steel-aluminum-tariffs-march-8-2018
# Convert all letters to lower case
tolower("sTaT 385 at UiUc")
# [1] "stat 385 at uiuc"

# Convert all letters to upper case
toupper("sTaT 385 at UiUc")
# [1] "STAT 385 AT UIUC"
Modifying Case… UPPER to lower or lower to UPPER …
your_name = "James"

# Adds white space between entries
paste("Hello World to you", your_name, "!")
# [1] "Hello World to you James !"

# Omits white space
paste0("Hello World to you", your_name, "!")
# [1] "Hello World to youJames!"

# Control the separator
paste("Hello World to you", your_name, "!", sep = "--")
# [1] "Hello World to you--James--!"

# Mimic paste0
paste("Hello World to you", your_name, "!", sep = "")
# [1] "Hello World to youJames!"
Concatenating Strings… merging strings together …
Previously
subject_ids = seq_len(5)

# Create Subject IDs
paste0("S", subject_ids)
# [1] "S1" "S2" "S3" "S4" "S5"

# Specify a separator
# (paste0() has no sep argument, so "-" is treated as one more
#  string to concatenate, producing the trailing dash)
paste0("S", subject_ids, sep = "-")
# [1] "S1-" "S2-" "S3-" "S4-" "S5-"

# Reduce entries to a single value
paste0("S", subject_ids, collapse = "")
# [1] "S1S2S3S4S5"
Vectorized Concatenation… merging strings together …
Your Turn
Form the following string by concatenating the following expressions:

x = seq_len(5)
mod = 2
remainder = x %% mod

"Dividing {x} by {mod} gives a remainder of {remainder}"
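One way to form the string (a sketch; `sprintf()` or the `glue` package would also work) is vectorized `paste0()`:

```r
x = seq_len(5)
mod = 2
remainder = x %% mod

# paste0() recycles the scalar mod across the vectors x and remainder
paste0("Dividing ", x, " by ", mod, " gives a remainder of ", remainder)
# [1] "Dividing 1 by 2 gives a remainder of 1"
# [2] "Dividing 2 by 2 gives a remainder of 0"
# ...
```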
Substring… extracting characters from a string …

"stat" indexes by position: "s" [1], "t" [2], "a" [3], "t" [4].
Extracting positions [1] through [2] yields "st".
substr(x = <data>, start = <begin-loc>, stop = <end-loc>)

• x: string data, the strings to extract data from
• start: positional index value to begin at inside the string
• stop: positional index value to go up to in the string
substr("stat", 1, 2)
# [1] "st"

substr("Illinois", 4, 8)
# [1] "inois"

substr("coding", 7, 10)
# [1] ""

substr(c("stat", "Illinois"), 1:2, 3:4)
# [1] "sta" "lli"
Substrings… cutting up a string into smaller strings …
Your Turn
Capitalize the first letter of each element of the following vector:
x = c("mumford", "female", "male", "joe", "pete")
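One possible solution sketch uses the replacement form `substr<-` to overwrite just the first character of each element:

```r
x = c("mumford", "female", "male", "joe", "pete")

# Replace position 1 of each element with its uppercase version
substr(x, 1, 1) = toupper(substr(x, 1, 1))
x
# [1] "Mumford" "Female"  "Male"    "Joe"     "Pete"
```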
Splitting a String… breaking a string in half …
strsplit(x = <data>, split = <pattern>)

• x: text data to be split apart
• split: the pattern (value) to split on
Splitting a String
# List different dishes
dishes = c("Spaghetti and Meatballs", "French Onion Soup")

# Break apart words by space
strsplit(dishes, " ")
# [[1]]
# [1] "Spaghetti" "and"       "Meatballs"
#
# [[2]]
# [1] "French" "Onion"  "Soup"
Text Mining
+0.25 EC
Unofficial
Retrieving Data
# GitHub Web API v3
library("gh")

# Functional programming
library("purrr")

# Sys.setenv("GITHUB_PAT" = "not-telling")

# What are we obtaining?
owner = "stat385-sp2019"
repo = "disc"
number = 71

# Download issue text
submitters = gh(
  "GET /repos/:owner/:repo/issues/:number/comments",
  owner = owner, repo = repo, number = number,
  .limit = 1000
)

# Extract responses
comment_txt = submitters %>% map_chr("body")

# Write to a text file
writeLines(comment_txt, con = "color_data.txt")
… grabbing and extracting text data …
CA: Brianna
Definition: Tokenization is the process of splitting text into individual units such as a word (or "token").
Text: What's your name? Mine is James.
Tokens:
• What's
• your
• name?
• Mine
• is
• James.
What's problematic about some of the words?
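A minimal tokenizer in base R simply splits on white space. Note that it leaves punctuation attached to tokens like "name?" and "James.", which is exactly the problem the question points at:

```r
text = "What's your name? Mine is James."

# Split on single spaces; punctuation stays glued to its word
tokens = strsplit(text, " ")[[1]]
tokens
# [1] "What's" "your"   "name?"  "Mine"   "is"     "James."
```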
Text Files

# In R, read in text line by line
color_responses = readLines("color_data.txt")

# Each line is an element in the vector
color_responses
# [1] "Blue"   "Blue"   "Blue"   "blue"
# [5] "Orange" ""       "Green"

# Count values
table(color_responses)
# Black blue
# 2 5 3
… grabbing and extracting text data …
Empty string
Definition: Pre-processing is a set of rules to clean the unstructured text data prior to applying tokenization.
• converting
  • all letters to lower or upper case
  • numbers into words, or removing numbers
• removing
  • punctuation: !@#$%^&*().,;:
  • white spaces: "", new lines, tabs, and so on
  • stop words: "the", "a", "an", "in"
  • sparse or low-occurring words
Read Text: What's your name? Mine is James.
Converting: what's your name? mine is james.
Removing: what your name mine james
Pre-processing
# Original Data
color_responses
# [1] "Blue"   "Blue"   "Blue"   "blue"
# [5] "Orange" ""       "Green"

# Convert responses to lowercase
lower_resp = tolower(color_responses)
# [1] "blue"   "blue"   "blue"   "blue"
# [5] "orange" ""       "green"

# Remove empty lines
nonempty_resp = lower_resp[
  lower_resp != "" # Empty string
]
# [1] "blue"   "blue"   "blue"   "blue"
# [5] "orange" "green"

# Remove punctuation
no_punct = gsub(
  "[[:punct:]]",  # regex identifier
  "",             # Empty string
  nonempty_resp
)
… cleaning text up …
Collapsing Entries

# Remove empty lines
nonempty_resp = color_responses[
  color_responses != "" # Empty string
]

# Create one large string
combined_str = paste(nonempty_resp, collapse = " ")

# Break the string apart
combined_str_split = tolower(
  strsplit(combined_str, split = " ")[[1]]
)

# Count values
table(combined_str_split)
# black blue
# 5 10
… large to small …
Definition: Text Vectorization is a way of converting text data into numerical values. Common approaches are:
- Bag of Words: Counting the frequency of each word without focusing on appearance order.
- Sentiment Analysis: assigns +1 to positive words and -1 to negative words, based on a "sentiment" lexicon/dictionary.
Bag of Words: To be or not to be.
Counts: to: 2, be: 2, or: 1, not: 1
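A bag-of-words count in base R, sketched on the slide's example sentence: lowercase, strip punctuation, split, and tabulate.

```r
text = "To be or not to be."

# Pre-process: lower case, drop punctuation
clean = gsub("[[:punct:]]", "", tolower(text))

# Tokenize on spaces and count occurrences of each token
words = strsplit(clean, " ")[[1]]
table(words)
# words
#  be not  or  to
#   2   1   1   2
```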
Sentiment Analysis: I'm happy, thrilled, and a bit horrified to be here!
Sentiment = (+1) + (+1) + (-1) = +1, with happy: +1, thrilled: +1, horrified: -1
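A sentiment score can be sketched with a named lookup vector: tokens found in the lexicon contribute their value, unmatched tokens are NA and dropped. The three-word lexicon here is a toy assumption; real lexicons come from curated dictionaries.

```r
# Toy lexicon (an assumption for illustration)
lexicon = c("happy" = 1, "thrilled" = 1, "horrified" = -1)

text = "I'm happy, thrilled, and a bit horrified to be here!"

# Pre-process and tokenize
tokens = strsplit(gsub("[[:punct:]]", "", tolower(text)), " ")[[1]]

# Sum the matched sentiment values; non-lexicon words give NA
sum(lexicon[tokens], na.rm = TRUE)
# [1] 1
```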
Your Turn
Determine the sentiment score for:
• I was on hold for 40 minutes and customer support is a nightmare
• This product is great. It changed my life for the better.
Lexicon: • Positive: Great, Life, Support • Negative: Changed, Hold, Nightmare
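A possible scoring sketch with the given lexicon. Note the lexicon's quirks: "changed" counts as negative, while "life" and "support" count as positive, so "customer support" adds +1.

```r
# Lexicon from the exercise, encoded as a named vector
lexicon = c("great" = 1, "life" = 1, "support" = 1,
            "changed" = -1, "hold" = -1, "nightmare" = -1)

# Hypothetical helper: lowercase, strip punctuation, split, sum matches
score = function(text) {
  tokens = strsplit(gsub("[[:punct:]]", "", tolower(text)), " ")[[1]]
  sum(lexicon[tokens], na.rm = TRUE)
}

score("I was on hold for 40 minutes and customer support is a nightmare")
# [1] -1   (hold: -1, support: +1, nightmare: -1)

score("This product is great. It changed my life for the better.")
# [1] 1    (great: +1, changed: -1, life: +1)
```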
Definition: Document-term matrix provides a structure for counting words by arranging the frequency of how many times the words appear across multiple documents.
         D1  D2  D3  D4  D5
apple    11   2   7   3   0
stat      0   0   0  10   0
user      5   3   8   0   1
survey    3   0   0  12   0
• Terms: tokens or words
• Documents: collections of text
• Entries: frequency of term appearance
# New package!
# install.packages("tm")
library("tm")

# Create a "body" of documents
corpus = Corpus(VectorSource(combined_str))

# Construct the Document Matrix
tdm = TermDocumentMatrix(corpus)

# View counts
inspect(tdm)
DTM… building a document matrix …
Your Turn
What happens to the DTM if we change `combined_str` to `combined_str_split`?
# Create a "body" of documents
corpus = Corpus(VectorSource(combined_str))

# Construct the Document Matrix
tdm = TermDocumentMatrix(corpus)

# View counts
inspect(tdm)
Recap

• Unstructured Data
• Text data such as e-mails, help posts, free response
• Text Representation
• How R thinks about strings
• String Operators
• Ways to modify strings in R
• Text Mining
• Converting unstructured data into structured data.
This work is licensed under the Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International
License