Unstructured Data
James Balamuta STAT 385 @ UIUC
Lecture 23: Apr 03, 2019
• Unstructured Data
• Text Representations
• String Operations
  • Length, Case, Concatenation, Substring, Split String
• Text Mining
Announcements
• hw07 is due Wednesday, April 10th, 2019 at 6:00 PM
• Sign up for Quiz 07.
• Window: Apr 7th - Apr 9th
• Sign up: https://cbtf.engr.illinois.edu/sched
• Want to review your homework or quiz grades? Schedule an appointment.
Last Time
• Connecting to a Database
• Interactively obtaining and updating data
• Structured Query Language
• Declarative domain-specific language that handles data querying, manipulation, access, and definitions.
Topic Outline
Tools for Reproducibility
Working with Data
Simulations
Modeling Data
Visual Exploratory Data Analysis
Interactive Interfaces
Relational Data
Unstructured Data
Packages
High Performance Computing
Lecture Objectives
• Manipulate unstructured data
• Understand where unstructured data is found
• Differentiate between character values
Unstructured Data
Structures of Data… how data is shaped …

• Structured (~5 - 10%): rectangular data. The typical form for scientific experiments and company databases.
• Semistructured (~5 - 10%): key: value pairs, e.g. RMarkdown document properties (YAML), JavaScript Object Notation (JSON), XML. For example:

---
title: "Untitled"
author: "JJB"
date: "1/27/2018"
output: html_document
---

• Unstructured (~80 - 90%): pure text documents, images, social media posts, and so on. No visible relationship. For example:

Pinky said, "Gee, Brain. What are we going to do tonight?" The Brain replied, "The same thing we do every night, Pinky. Try to take over the world."
Previously
E-mails and Spam… common unstructured data problems …
Text on Web Pages… common unstructured data problems …
Free Response in Surveys… common unstructured data problems …
Text Representations
Definition: A character is a single symbol that is displayed.
'a' 'b' 'c' 'D' 'E' 'F'
'1' '2' '3' '4'
' ' '*' ',' ' " '
Definition: A string is multiple characters combined together.
'UIUC' 'STAT' 'Chambana' 'Chicago' 'Illinois'
Character Representation
class("S") # [1] "character"
class("STAT 385") # [1] "character"
double_quote = "Hello World!"
single_quote = 'Hello World!'
complex_string = "It's happening!"
escape_string = 'It\'s happening!'
white_space = " "
empty_string = ""
Characters Welcome… constructing strings …
Escape Characters… using special characters …
Symbol Description
\n newline
\r carriage return
\t tab
\b backspace
\a alert (bell)
\f form feed
\v vertical tab
Symbol Description
\\ backslash \
\' ASCII apostrophe '
\" ASCII quotation mark "
\` ASCII grave accent (backtick) `
\nnn character with given octal code (1, 2 or 3 digits)
\xnn character with given hex code (1 or 2 hex digits)
\unnnn Unicode character with given code (1--4 hex digits)
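A quick way to see these escapes in action is `cat()`, which renders the interpreted characters, whereas `print()` shows the escape sequences literally. A minimal sketch:

```r
# cat() renders escape sequences such as \t (tab) and \n (newline)
cat("Column1\tColumn2\nA\t1\n")

# Escaped double quotes inside a double-quoted string
cat("She said \"hi\"\n")
# She said "hi"

# Unicode escape: Greek letter mu
cat("\u00b5")
# µ
```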
Your Turn
Construct a string that includes the following quote in R:

“Actually, I see it as part of my job to inflict R on people who are perfectly happy to have never heard of it. Happiness doesn't equal proficient and efficient. In some cases the proficiency of a person serves a greater good than their momentary happiness.”

– Patrick Burns, R-help (2005)
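One possible approach (a sketch, not the only answer): because the quote contains apostrophes, wrapping it in double quotes avoids escaping; with single-quote delimiters every apostrophe would need `\'`.

```r
# Double-quote delimiters let the apostrophes pass through unescaped
burns = "Actually, I see it as part of my job to inflict R on people who are perfectly happy to have never heard of it. Happiness doesn't equal proficient and efficient. In some cases the proficiency of a person serves a greater good than their momentary happiness."

cat(burns)
```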
String Operators
Length
# Total number of elements
length("stat")
# [1] 1

# How many letters per element?
nchar("stat")
# [1] 4

# Example String Vector
ex_string = c("stat", "eoh", "r")

length(ex_string)
# [1] 3

nchar(ex_string)
# [1] 4 3 1
Determining the character count of a string
Behind Length… determining size …

• length(x) counts elements: c("stat", "eoh", "R") is one vector containing 3 elements.
• nchar(x) counts characters within each element: "stat" is 1 element containing 4 characters.
Real-World Example… giving a name for a mobile order at Starbucks …
nchar(x)
Your Turn
Count the number of characters used in this sound clip:
https://factba.se/transcript/donald-trump-remarks-steel-aluminum-tariffs-march-8-2018
# Convert all letters to lower case
tolower("sTaT 385 at UiUc")
# [1] "stat 385 at uiuc"

# Convert all letters to upper case
toupper("sTaT 385 at UiUc")
# [1] "STAT 385 AT UIUC"
Modifying Case… UPPER to lower or lower to UPPER …
your_name = "James"

# Adds white space between entries
paste("Hello World to you", your_name, "!")
# [1] "Hello World to you James !"

# Omits white space
paste0("Hello World to you", your_name, "!")
# [1] "Hello World to youJames!"

# Control the separator
paste("Hello World to you", your_name, "!", sep = "--")
# [1] "Hello World to you--James--!"

# Mimic paste0
paste("Hello World to you", your_name, "!", sep = "")
# [1] "Hello World to youJames!"
Concatenating Strings… merging strings together …
Previously
subject_ids = seq_len(5)

# Create Subject IDs
paste0("S", subject_ids)
# [1] "S1" "S2" "S3" "S4" "S5"

# Specify a separator
# (paste0() has no sep argument, so "-" is treated as one more
#  string to concatenate, producing the trailing dash)
paste0("S", subject_ids, sep = "-")
# [1] "S1-" "S2-" "S3-" "S4-" "S5-"

# Reduce entries to a single value
paste0("S", subject_ids, collapse = "")
# [1] "S1S2S3S4S5"
Vectorized Concatenation… merging strings together …
Your Turn
Form the following string by concatenating the following expressions:

x = seq_len(5)
mod = 2
remainder = x %% mod

"Dividing {x} by {mod} gives a remainder of {remainder}"
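One way to form the string (a sketch; `sprintf()` or the `glue` package would also work) is vectorized `paste0()`:

```r
x = seq_len(5)
mod = 2
remainder = x %% mod

# paste0() recycles the scalar mod across the vectors x and remainder
paste0("Dividing ", x, " by ", mod, " gives a remainder of ", remainder)
# [1] "Dividing 1 by 2 gives a remainder of 1"
# [2] "Dividing 2 by 2 gives a remainder of 0"
# ...
```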
Substring… extracting characters from a string …

"stat" indexes by position: "s" [1], "t" [2], "a" [3], "t" [4].
Extracting positions [1] through [2] yields "st".
substr(x = <data>, start = <begin-loc>, stop = <end-loc>)

• x: string data, the strings to extract data from
• start: positional index value to begin at inside the string
• stop: positional index value to go up to in the string
substr("stat", 1, 2)
# [1] "st"

substr("Illinois", 4, 8)
# [1] "inois"

substr("coding", 7, 10)
# [1] ""

substr(c("stat", "Illinois"), 1:2, 3:4)
# [1] "sta" "lli"
Substrings… cutting up a string into smaller strings …
Your Turn
Capitalize the first letter of each element of the following vector:
x = c("mumford", "female", "male", "joe", "pete")
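One possible solution sketch uses the replacement form `substr<-` to overwrite just the first character of each element:

```r
x = c("mumford", "female", "male", "joe", "pete")

# Replace position 1 of each element with its uppercase version
substr(x, 1, 1) = toupper(substr(x, 1, 1))
x
# [1] "Mumford" "Female"  "Male"    "Joe"     "Pete"
```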
Splitting a String… breaking a string in half …
strsplit(x = <data>, split = <pattern>)

• x: text data to be split apart
• split: the pattern (value) to split on
Splitting a String
# List different dishes
dishes = c("Spaghetti and Meatballs", "French Onion Soup")

# Break apart words by space
strsplit(dishes, " ")
# [[1]]
# [1] "Spaghetti" "and"       "Meatballs"
#
# [[2]]
# [1] "French" "Onion"  "Soup"
Text Mining
+0.25 EC
Unofficial
Retrieving Data
# GitHub Web API v3
library("gh")

# Functional programming
library("purrr")

# Sys.setenv("GITHUB_PAT" = "not-telling")

# What are we obtaining?
owner = "stat385-sp2019"
repo = "disc"
number = 71

# Download issue text
submitters = gh(
  "GET /repos/:owner/:repo/issues/:number/comments",
  owner = owner, repo = repo, number = number,
  .limit = 1000
)

# Extract responses
comment_txt = submitters %>% map_chr("body")

# Write to a text file
writeLines(comment_txt, con = "color_data.txt")
… grabbing and extracting text data …
CA: Brianna
Definition: Tokenization is the process of splitting text into individual units such as a word (or "token").
Text: What's your name? Mine is James.
Tokens:
• What's
• your
• name?
• Mine
• is
• James.
What's problematic about some of the words?
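A minimal tokenizer in base R simply splits on white space. Note that it leaves punctuation attached to tokens like "name?" and "James.", which is exactly the problem the question points at:

```r
text = "What's your name? Mine is James."

# Split on single spaces; punctuation stays glued to its word
tokens = strsplit(text, " ")[[1]]
tokens
# [1] "What's" "your"   "name?"  "Mine"   "is"     "James."
```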
Text Files

# In R, read in text line by line
color_responses = readLines("color_data.txt")

# Each line is an element in the vector
color_responses
# [1] "Blue"   "Blue"   "Blue"   "blue"
# [5] "Orange" ""       "Green"

# Count values
table(color_responses)
# Black blue
# 2 5 3
… grabbing and extracting text data …
Empty string
Definition: Pre-processing is a set of rules to clean the unstructured text data prior to applying tokenization.
• converting
  • all letters to lower or upper case
  • numbers into words, or removing numbers
• removing
  • punctuation: !@#$%^&*().,;:
  • white spaces: "", new lines, tabs, and so on
  • stop words: "the", "a", "an", "in"
  • sparse or low-occurring words
Read Text: What's your name? Mine is James.
Converting: what's your name? mine is james.
Removing: what your name mine james
Pre-processing
# Original Data
color_responses
# [1] "Blue"   "Blue"   "Blue"   "blue"
# [5] "Orange" ""       "Green"

# Convert responses to lowercase
lower_resp = tolower(color_responses)
# [1] "blue"   "blue"   "blue"   "blue"
# [5] "orange" ""       "green"

# Remove empty lines
nonempty_resp = lower_resp[
  lower_resp != "" # Empty string
]
# [1] "blue"   "blue"   "blue"   "blue"
# [5] "orange" "green"

# Remove punctuation
no_punct = gsub(
  "[[:punct:]]",  # regex identifier
  "",             # Empty string
  nonempty_resp
)
… cleaning text up …
Collapsing Entries

# Remove empty lines
nonempty_resp = color_responses[
  color_responses != "" # Empty string
]

# Create one large string
combined_str = paste(nonempty_resp, collapse = " ")

# Break the string apart
combined_str_split = tolower(
  strsplit(combined_str, split = " ")[[1]]
)

# Count values
table(combined_str_split)
# black blue
# 5 10
… large to small …
Definition: Text Vectorization is a way of converting text data into numerical values. Common approaches are:
- Bag of Words: Counting the frequency of each word without focusing on appearance order.
- Sentiment Analysis: assigns +1 to positive words and -1 to negative words, based on a "sentiment" lexicon/dictionary.
Bag of Words: To be or not to be.
Counts: to: 2, be: 2, or: 1, not: 1
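A bag-of-words count in base R, sketched on the slide's example sentence: lowercase, strip punctuation, split, and tabulate.

```r
text = "To be or not to be."

# Pre-process: lower case, drop punctuation
clean = gsub("[[:punct:]]", "", tolower(text))

# Tokenize on spaces and count occurrences of each token
words = strsplit(clean, " ")[[1]]
table(words)
# words
#  be not  or  to
#   2   1   1   2
```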
Sentiment Analysis: I'm happy, thrilled, and a bit horrified to be here!
Sentiment = (+1) + (+1) + (-1) = +1, with happy: +1, thrilled: +1, horrified: -1
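A sentiment score can be sketched with a named lookup vector: tokens found in the lexicon contribute their value, unmatched tokens are NA and dropped. The three-word lexicon here is a toy assumption; real lexicons come from curated dictionaries.

```r
# Toy lexicon (an assumption for illustration)
lexicon = c("happy" = 1, "thrilled" = 1, "horrified" = -1)

text = "I'm happy, thrilled, and a bit horrified to be here!"

# Pre-process and tokenize
tokens = strsplit(gsub("[[:punct:]]", "", tolower(text)), " ")[[1]]

# Sum the matched sentiment values; non-lexicon words give NA
sum(lexicon[tokens], na.rm = TRUE)
# [1] 1
```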
Your Turn
Determine the sentiment score for:
• I was on hold for 40 minutes and customer support is a nightmare
• This product is great. It changed my life for the better.
Lexicon: • Positive: Great, Life, Support • Negative: Changed, Hold, Nightmare
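A possible scoring sketch with the given lexicon. Note the lexicon's quirks: "changed" counts as negative, while "life" and "support" count as positive, so "customer support" adds +1.

```r
# Lexicon from the exercise, encoded as a named vector
lexicon = c("great" = 1, "life" = 1, "support" = 1,
            "changed" = -1, "hold" = -1, "nightmare" = -1)

# Hypothetical helper: lowercase, strip punctuation, split, sum matches
score = function(text) {
  tokens = strsplit(gsub("[[:punct:]]", "", tolower(text)), " ")[[1]]
  sum(lexicon[tokens], na.rm = TRUE)
}

score("I was on hold for 40 minutes and customer support is a nightmare")
# [1] -1   (hold: -1, support: +1, nightmare: -1)

score("This product is great. It changed my life for the better.")
# [1] 1    (great: +1, changed: -1, life: +1)
```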
Definition: Document-term matrix provides a structure for counting words by arranging the frequency of how many times the words appear across multiple documents.
         D1  D2  D3  D4  D5
apple    11   2   7   3   0
stat      0   0   0  10   0
user      5   3   8   0   1
survey    3   0   0  12   0
• Terms: tokens or words
• Documents: collections of text
• Entries: frequency of term appearance
# New package!
# install.packages("tm")
library("tm")

# Create a "body" of documents
corpus = Corpus(VectorSource(combined_str))

# Construct the Document Matrix
tdm = TermDocumentMatrix(corpus)

# View counts
inspect(tdm)
DTM… building a document matrix …
Your Turn
What happens to the DTM if we change `combined_str` to `combined_str_split`?
# Create a "body" of documents
corpus = Corpus(VectorSource(combined_str))

# Construct the Document Matrix
tdm = TermDocumentMatrix(corpus)

# View counts
inspect(tdm)
Recap

• Unstructured Data
• Text data such as e-mails, help posts, free response
• Text Representation
• How R thinks about strings
• String Operators
• Ways to modify strings in R
• Text Mining
• Converting unstructured data into structured data.
This work is licensed under the Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International
License