Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found...

49
Unstructured Data James Balamuta STAT 385 @ UIUC Lecture 23: Apr 03, 2019 Unstructured Data Text Representations String Operations Length, Case, Concatenation, Substring, Split String Text Mining

Transcript of Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found...

Page 2: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Announcements

• hw07 is due Wednesday, April 10th, 2019 at 6:00 PM

• Sign up for Quiz 07.

• Window: Apr 7th - Apr 9th

• Sign up: https://cbtf.engr.illinois.edu/sched

• Want to review your homework or quiz grades?Schedule an appointment.

Page 3: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Last Time

• Connecting to a Database

• Interactively obtaining and updating data

• Structured Query Language

• Declarative domain-specific language that handles data querying, manipulation, access, and definitions.

Page 4: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Topic OutlineTools for Reproducibility

Working with Data

Simulations

Modeling Data

Visual Exploratory Data Analysis

Interactive Interfaces

Relational Data

Unstructured Data

Packages

High Performance Computing

Page 5: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Topic Outline

http://r4ds.had.co.nz/introduction.html

Page 6: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Lecture Objectives

• Manipulate unstructured data

• Understand where unstructured data is found

• Differentiate between character values

Page 7: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Unstructured Data

Page 8: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Structures of Data… how data is shaped …

Structured Semistructured Unstructured

* Typical form for scientific experiments and company databases ** RMarkdown Document Properties (YAML), JavaScript Object Notation (JSON), XML*** Pure text documents, images, social media posts, and so on. No visible relationship.

~5 - 10% ~5 - 10% ~80 - 90%

--- title: "Untitled" author: "JJB" date: "1/27/2018" output: html_document ---

Pinky said, "Gee, Brain. What are we going to do tonight?" The Brain replied, "The same thing we do every night, Pinky. Try to take over the world."

key: value ?????????Rectangular

* ** ***

Previously

Page 9: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

E-mails and Spam… common unstructured data problems …

Page 10: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Text on Web Pages… common unstructured data problems …

Page 11: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Free Response in Surveys… common unstructured data problems …

Page 12: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Text Representations

Page 13: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Definition: Character is a single symbol that is displayed

'a' 'b' 'c' 'D' 'E' 'F'

'1' '2' '3' '4'

' ' '*' ',' ' " '

Page 14: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Definition: String multiple characters combined together.

'UIUC' 'STAT' 'Chambana' 'Chicago' 'Illinois'

Page 15: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Character Representation

class("S") # [1] "character"

class("STAT 385") # [1] "character"

Page 16: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

double_quote = "Hello World!"

single_quote = 'Hello World!'

complex_string = "It's happening!"

escape_string = 'It\'s happening!'

white_space = " "

empty_string = ""

Characters Welcome… constructing strings …

Page 17: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Escape Characters… using special characters …

Symbol Description

\n newline

\r carriage return

\t tab

\b backspace

\a alert (bell)

\f form feed

\v vertical tab

Symbol Description

\\ backslash \

\' ASCII apostrophe '

\" ASCII quotation mark "

\` ASCII grave accent (backtick) `

\nnn character with given octal code (1, 2 or 3 digits)

\xnn character with given hex code (1 or 2 hex digits)

\unnnn Unicode character with given code (1--4 hex digits)

Page 18: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Your TurnConstruct a string that includes the following quote in R

– Patrick Burns, R-help (2005)

“Actually, I see it as part of my job to inflict R on people who are perfectly happy to have never heard of it. Happiness doesn't equal proficient and efficient. In some cases the proficiency of a person serves a greater good than their momentary happiness.”

Page 19: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

String Operators

Page 20: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Length

# Total number of elements length("stat") # [1] 1

# How many letters per element? nchar("stat") # [1] 4

# Example String Vector ex_string = c("stat", "eoh", "r")

length(ex_string) # [1] 3

nchar(ex_string) # [1] 4 3 1

Determining the character count of a string

Page 21: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Behind Length… determining size …

3 elements

s t a t" "

e o h" "R" "

length(x)

Page 22: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Behind Length… determining size …

1 element

4 characters

s t a t" "length(x)

nchar(x)

Page 23: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Real-World Example… giving a name for a mobile order at Starbucks …

nchar(x)

Page 24: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Your TurnCount the number of characters used in this sound clip:

https://factba.se/transcript/donald-trump-remarks-steel-aluminum-tariffs-march-8-2018

Page 25: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

# Convert all letters to lower case tolower("sTaT 385 at UiUc") # [1] "stat 385 at uiuc"

# Convert all letters to upper case toupper("sTaT 385 at UiUc") # [1] "STAT 385 AT UIUC"

Modifying Case… UPPER to lower or lower to UPPER …

Page 26: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

your_name = "James" paste("Hello World to you", your_name, "!") # Add white space # [1] "Hello World to you James !"

paste0("Hello World to you", your_name, "!") # Omits white space # [1] "Hello World to youJames!"

paste("Hello World to you", your_name, "!", sep = "--") # Control separator # [1] "Hello World to you--James--!"

paste("Hello World to you", your_name, "!", sep = "") # Mimic paste0 # [1] "Hello World to youJames!"

Concatenating Strings… merging strings together …

Previously

Page 27: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

subject_ids = seq_len(5)

paste0("S", subject_ids) # Create Subject IDs # [1] "S1" "S2" "S3" "S4" "S5"

paste0("S", subject_ids, sep = "-") # Specify a separator # [1] "S1-" "S2-" "S3-" "S4-" "S5-"

paste0("S", subject_ids, collapse = "") # Reduce entries to a single value # [1] "S1S2S3S4S5"

Vectorized Concatenation… merging strings together …

Page 28: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Your TurnForm the following string:

To concatenate the following expressions:

x = seq_len(5) mod = 2

remainder = x %% mod

"Dividing {x} by {mod} gives a remainder of {remainder}"

Page 29: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Substring… extracting characters from a string …

s t a t

[1] [2] [3] [4]

" "s t

[1] [2]

" "

Page 30: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Substring… extracting characters from a string …

substr(x = <data>, start = <begin-loc>, stop = <end-loc>)

String Data Strings to extract data from

Start Positional Index Value to begin at inside the string

End Positional Index Value to go up to in the string

Page 31: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

substr("stat", 1, 2) # [1] "st"

substr("Illinois", 4, 8) # [1] "inois"

substr("coding", 7, 10) # [1] ""

substr(c("stat", "Illinois"), 1:2, 3:4) # [1] "sta" "lli"

Substrings… cutting up a string into smaller strings …

Page 32: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Your Turn

Convert the start of all elements in the following vector to having a capital letter

x = c("mumford", "female", "male", "joe", "pete")

Page 33: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Splitting a String… breaking a string in half …

strsplit(x = <data>, split = <pattern>)

String Data Text data to be split apart

Pattern Value to split on

Page 34: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Splitting a String

# List different dishes dishes = c("Spaghetti and Meatballs", "French Onion Soup")

# Break apart words by space strsplit(dishes, " ") # [[1]] # [1] "Spaghetti" "and" "Meatballs" # # [[2]] # [1] "French" "Onion" "Soup"

… breaking a string in half …

Page 35: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Text Mining

Page 36: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

+0.25 EC

Unofficial

Page 37: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Retrieving Data

# GitHub Web API v3 library("gh")

# Functional programming library("purrr")

# Sys.setenv("GITHUB_PAT" = "not-telling")

# What are we obtaining? owner = "stat385-sp2019" repo = "disc" number = 71

# Download issue text submitters = gh( "GET /repos/:owner/:repo/issues/:number/comments", owner = owner, repo = repo, number = number, .limit = 1000)

# Extract responses comment_txt = submitters %>% map_chr("body")

# Write to a text file writeLines(comment_txt, con = "color_data.txt")

… grabbing and extracting text data …

CA: Brianna

Page 38: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Definition: Tokenization is the process of splitting text into individual units such as a word (or "token").

Text: What's your name? Mine is James.

Tokens: - What's - your - name? - Mine - is - James.

What's problematic about some of the words?

Page 39: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Text Files # In R, read in text by line # number color_responses = readLines("color_data.txt")

# Each line is an element in # the vector color_responses # [1] "Blue" "Blue" "Blue" "blue" # [5] "Orange" "" "Green"

# Count values table(color_responses) # Black blue # 2 5 3

… grabbing and extracting text data …

Empty string

Page 40: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Definition: Pre-processing is a set of rules to clean the unstructured text data prior to applying tokenization.

• converting • all letters to lower or upper case • numbers into words or

removing numbers • removing

• punctuations: !@#$%^&*().,;: • white spaces: "", new lines, tabs,

and so on • stop words: "the", "a", "an", "in" • sparse or low occurring words

Read Text: What's your name? Mine is James.

Converting: what's your name? mine is james.

Removing: what your name mine james

Page 41: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Pre-processing

# Original Data color_responses # [1] "Blue" "Blue" "Blue" "blue" # [5] "Orange" "" "Green"

# Convert response to lowercase lower_resp = tolower(color_responses) # [1] "blue" "blue" "blue" "blue" # [5] "orange" "" "green"

# Remove empty lines nonempty_resp = lower_resp[ lower_resp != "" # Empty string ] # [1] "blue" "blue" "blue" "blue" # [5] "orange" "green" # ^^^

# Remove punctation no_punct = gsub( "[[:punct:]]", # regex Identifier "", # Empty string nonempty_resp )

… cleaning text up …

Page 42: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Collapsing Entries

# Remove empty lines nonempty_resp = color_responses[ color_responses != "" # Empty string ]

# Create large string combined_str = paste(nonempty_resp, collapse = " ")

# Break the string apart combined_str_split = tolower( strsplit( combined_str, split=" " )[[1]] )

# Count values table(combined_str_split) # black blue # 5 10

… large to small …

Page 43: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Definition: Text Vectorization is a way of converting text data into numerical values. Common approaches are:

- Bag of Words: Counting the frequency of each word without focusing on appearance order.

- Sentiment Analysis: Assigns +1 to a positive word and -1 to negative words based on a "sentiment" lexicon/dictionary

Bag of Words: To be or not to be.

Counts to: 2 be: 2 or: 1 not: 1

Sentiment Analysis: I'm happy, thrilled, and a bit horrified to be here!

Sentiment = +1 +1 -1 = +1 happy: +1 thrilled: +1 horrified: -1

Page 44: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Your Turn

Determine the sentiment score for:

• I was on hold for 40 minutes and customer support is a nightmare

• This product is great. It changed my life for the better.

Lexicon: • Positive: Great, Life, Support • Negative: Changed, Hold, Nightmare

Page 45: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Definition: Document-term matrix provides a structure for counting words by arranging the frequency of how many times the words appear across multiple documents.

D1 D2 D3 D4 D5

apple 11 2 7 3 0

stat 0 0 0 10 0

user 5 3 8 0 1

survey 3 0 0 12 0

Terms Tokens or words

Documents Collection of text

Frequency of Term Appearance

Page 46: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

# New package! # install.packages("tm") library("tm")

# Create a "body" of documents corpus = Corpus( VectorSource( combined_str ) )

# Construct the Document Matrix tdm = TermDocumentMatrix( corpus )

# View counts inspect(tdm)

DTM… building a document matrix …

Page 47: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Your TurnWhat happens to the DTM if we change `combined_str` to

`combined_str_split`?

# Create a "body" of documents corpus = Corpus(VectorSource(combined_str))

# Construct the Document Matrix tdm = TermDocumentMatrix(corpus)

# View counts inspect(tdm)

Page 48: Lecture 23: Apr 03, 2019 Unstructured Data€¦ · • Understand where unstructured data is found • Differentiate between character values. Unstructured Data. Structures of Data

Recap• Unstructured Data

• Text data such as e-mails, help posts, free response

• Text Representation

• How R thinks about strings

• String Operators

• Ways to modify strings in R

• Text Mining

• Converting unstructured data into structured data.