Email Data Cleaning (KDD’05) Jie Tang 1 , Hang Li 2 , Yunbo Cao 2 , Zhaohui Tang 3 1 Tsinghua University 2 Microsoft Research Asia 3 Microsoft Corporation


Email Data Cleaning (KDD’05)

Jie Tang 1, Hang Li 2, Yunbo Cao 2, Zhaohui Tang 3

1 Tsinghua University
2 Microsoft Research Asia
3 Microsoft Corporation

Outline

• Motivation and Problem Description
• Related Work
• Our Approach
• Implementation
• Experimental Results
• Summary

Motivation

Email is one of the most common modes of communication

Text mining applications on emails
- Email classification
- Email summarization
- Term extraction from email
- …

Term Extraction

From: SY <[email protected]> - Find messages by this author
Date: Mon, 4 Apr 2005 11:29:28 +0530
Subject: Re: ..How to do addition??

Hi Ranger, Your design of Matrix class is not good. what are you doing with two matrices in a single class?make class Matrix as follows

import java.io.*; class Matrix { public static int AnumberOfRows; public static int AnumberOfColumns; private int matrixA[][];

public void inputArray() throws IOException { InputStreamReader input = new InputStreamReader(System.in); BufferedReader keyboardInput = new BufferedReader(input) }

-- Sandeep Yadav Tel: 011-243600808 Homepage: http://www.it.com/~Sandeep/

On Apr 3, 2005 5:33 PM, ranger <[email protected]> wrote:
> Hi... I want to perform the addtion in my Matrix class. I got the program to
> enter 2 Matricx and diaplay them. Hear is the code of the Matrix class and
> TestMatrix class. I'm glad If anyone can let me know how to do the addition.....Tnx

Errors marked in the message body:
• Extra line break
• Missing space
• Extra space
• Missing period
• Case errors

Hi Ranger, Your design of Matrix class is not good. What are you doing with two matrices in a single class? Make class Matrix as follows:


Related Work -- Data Mining

• Email Cleaning
- Several products provide rule-based email cleaning
- E.g. eClean (2000), WinPure ListCleaner Pro (2004)

• Information Extraction from Email
- Extracting contact information, etc.
- E.g. Kristjansson and Culotta (2004), Culotta, Bekkerman, and McCallum (2004), Viola (2005)

• Web Page Cleaning
- Removing banner ads, decoration pictures
- E.g. Yi and Liu (2003), Lin and Ho (2002)

• Tabular Data Cleaning
- Detecting and removing duplicate information
- E.g. Hernández and Stolfo (1998), Rahm and Do (2000), SQL Server 2005

Related Work -- Language Processing

• Sentence Boundary Detection
- Palmer and Hearst (1997)

• Case Restoration
- Lita and Ittycheriah (2003)
- Mikheev (2002)

• Spelling Error Correction
- Golding and Roth (1996)


Our Approach -- Cascaded Approach

Cleaning = non-text block filtering + text normalization

• Non-text block filtering
- Quotation detection
- Header detection
- Signature detection
- Program code detection

• Text normalization
- Paragraph normalization
  * Extra line break detection
- Sentence normalization
  * Missing period detection
- Word normalization
  * Case restoration

Cascaded Approach

[Figure: cascaded cleaning pipeline. Noisy Email Message → Non-text Block Filtering (Quotation Detection, Header Detection, Signature Detection, Program Code Detection) → Paragraph Normalization (Extra Line Break Detection) → Sentence Normalization (Missing Periods and Missing Spaces Detection, Extra Spaces Detection) → Word Normalization (Case Restoration) → Cleaned Email Message]
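The cascade can be sketched as a chain of functions, each consuming the output of the previous stage. The sketch below is only an illustration: the function names are hypothetical, and every stage uses a crude hand-written rule as a stand-in for the trained detectors described in the slides.

```python
import re

def filter_non_text(lines):
    """Drop quoted lines and header-like lines (rule-based stand-in for the detectors)."""
    kept = []
    for line in lines:
        if line.lstrip().startswith(">"):
            continue  # quotation
        if re.match(r"^(From|Date|Subject|To):", line):
            continue  # header
        kept.append(line)
    return kept

def normalize_paragraphs(lines):
    """Join non-empty lines, i.e. treat every single line break as an extra one."""
    return " ".join(line.strip() for line in lines if line.strip())

def normalize_sentences(text):
    """Remove extra spaces, add missing spaces after punctuation, add a final period."""
    text = re.sub(r" {2,}", " ", text)
    text = re.sub(r"([.?!])(?=[A-Za-z])", r"\1 ", text)
    if text and text[-1] not in ".?!:":
        text += "."
    return text

def normalize_words(text):
    """Capitalize sentence-initial letters (stand-in for case restoration)."""
    return re.sub(r"(^|[.?!] )([a-z])", lambda m: m.group(1) + m.group(2).upper(), text)

def clean_email(raw):
    """Run the four stages in the order of the cascade."""
    lines = filter_non_text(raw.splitlines())
    return normalize_words(normalize_sentences(normalize_paragraphs(lines)))
```

Because each stage is a separate function, an application that needs to keep some blocks (for example signatures) could skip or replace the corresponding step.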

From: SY <[email protected]> - Find messages by this author
Date: Mon, 4 Apr 2005 11:29:28 +0530
Subject: Re: ..How to do addition??

Hi Ranger, Your design of Matrix class is not good.what are you doing with two matrices in a single class?make class Matrix as follows

import java.io.*; class Matrix { public static int AnumberOfRows; public static int AnumberOfColumns; private int matrixA[][];

public void inputArray() throws IOException { InputStreamReader input = new InputStreamReader(System.in); BufferedReader keyboardInput = new BufferedReader(input) }

-- Sandeep Yadav Tel: 011-243600808 Homepage: http://www.it.com/~Sandeep/

On Apr 3, 2005 5:33 PM, ranger <[email protected]> wrote:
> Hi... I want to perform the addtion in my Matrix class. I got the program to
> enter 2 Matricx and diaplay them. Hear is the code of the Matrix class and
> TestMatrix class. I'm glad If anyone can let me know how to do the addition..

Hi Ranger, Your design of Matrix class is not good. what are you doing with two matrices in a single class?make class Matrix as follows

Hi Ranger, Your design of Matrix class is not good. What are you doing with two matrices in a single class? Make class Matrix as follows.

Hi Ranger, Your design of Matrix class is not good. what are you doing with two matrices in a single class? make class Matrix as follows.

In a particular text mining application, we can retain some of the blocks.

[Figure: the example email annotated with the processing steps: Quotation Detection, Header Detection, Signature Detection, Program Code Detection, Extra Line Break Detection, Missing Period and Missing Space Detection, Extra Space Detection, Case Restoration]


Technical Issues

• Non-text filtering
- Quotation detection
- Header detection
- Signature detection
- Program code detection

• Text normalization
- Extra line break detection
- Sentence normalization
- Case restoration

Non-text Filtering Using SVMs

Applied to: header detection, signature detection, program code detection.

[Figure: training data and test data each pass through feature extraction (position feature, positive word feature, ...), producing a start line feature set and an end line feature set. Two SVM models, a start line model and an end line model, are trained on these features and applied to the test data to output the identified blocks.]
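Once the two models have labelled every line, the start and end predictions still have to be combined into blocks. The slides do not say how this pairing is done, so the sketch below assumes the simplest rule: each predicted start line is paired with the next predicted end line. The function names and the boolean input lists are hypothetical stand-ins for the per-line SVM outputs.

```python
def identify_blocks(start_flags, end_flags):
    """Pair each predicted start line with the next predicted end line.

    start_flags[i] / end_flags[i] are per-line decisions of the two classifiers.
    Returns a list of (first_line, last_line) index pairs, both inclusive.
    """
    blocks, start = [], None
    for i, (is_start, is_end) in enumerate(zip(start_flags, end_flags)):
        if is_start and start is None:
            start = i
        if is_end and start is not None:
            blocks.append((start, i))
            start = None
    return blocks

def remove_blocks(lines, blocks):
    """Filter out every line that falls inside an identified block."""
    drop = {i for first, last in blocks for i in range(first, last + 1)}
    return [line for i, line in enumerate(lines) if i not in drop]
```

A start line without a following end line is left unpaired here; a real system would need a policy for that case.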

Features Used in Header Detection

• Position Feature: Is the first line?

• Positive Word Features
- Begins with: “From:”, “Re:”, “In article”, etc.
- Contains: “original message”, “Fwd:”, etc.
- Ends with: “wrote:”, “said:”, etc.

• Negative Word Features: Contains “Hi”, “dear”, “thank you”, “best regards”, etc.

• Number of Words Feature: Number of words in the current line

• Person Name Feature: Contains a person name?

• Ending Character Features: Ends with colon, semicolon, quotation mark, question mark, exclamation mark, etc.

• Special Pattern Features: Contains one type of special pattern: email, date, number, URL, percentage, etc.

• Number of Line Breaks Feature: Number of line breaks before the current line
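As a rough illustration, the header features listed above could be computed per line as follows. This is a sketch under assumptions: the function name, the dictionary keys, and the regular expressions are hypothetical, only the example words quoted on the slide are used, and the person name feature is left out because it would need a name lexicon.

```python
import re

BEGINS = ("From:", "Re:", "In article")
CONTAINS = ("original message", "fwd:")
ENDS = ("wrote:", "said:")
NEGATIVE = re.compile(r"\b(hi|dear|thank you|best regards)\b", re.IGNORECASE)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL = re.compile(r"https?://\S+")

def header_features(line, index, breaks_before):
    """Return the feature values for one line of an email.

    index is the 0-based line position; breaks_before is the number of
    line breaks preceding the line (supplied by the caller).
    """
    text = line.strip()
    return {
        "is_first_line": index == 0,
        "begins_positive": text.startswith(BEGINS),
        "contains_positive": any(p in text.lower() for p in CONTAINS),
        "ends_positive": text.endswith(ENDS),
        "contains_negative": bool(NEGATIVE.search(text)),
        "num_words": len(text.split()),
        "ends_with_mark": text[-1:] in tuple(':;"?!'),
        "has_email": bool(EMAIL.search(text)),
        "has_url": bool(URL.search(text)),
        "breaks_before": breaks_before,
    }
```

The resulting dictionaries would still have to be turned into numeric vectors before being passed to an SVM.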

From: SY <[email protected]> - Find messages by this author
Date: Mon, 4 Apr 2005 11:29:28 +0530
Subject: Re: ..How to do addition??

Hi Ranger, Your design of Matrix class is not good. what are you doing with two matrices in a single class?make class Matrix as follows

import java.io.*; class Matrix { public static int AnumberOfRows; public static int AnumberOfColumns; private int matrixA[][];

public void inputArray() throws IOException { InputStreamReader input = new InputStreamReader(System.in); BufferedReader keyboardInput = new BufferedReader(input) }

-- Sandeep Yadav Tel: 011-243600808 Homepage: http://www.it.com/~Sandeep/

On Apr 3, 2005 5:33 PM, ranger <[email protected]> wrote:
> Hi... I want to perform the addtion in my Matrix class. I got the program to
> enter 2 Matricx and diaplay them. Hear is the code of the Matrix class and
> TestMatrix class. I'm glad If anyone can let me know how to do the addition.....Tnx

Header Detection

Two SVM models are employed to identify the start line and the end line, respectively.

[Figure: the eight header features (Position, Positive Word, Negative Word, Number of Words, Person Name, Ending Character, Special Pattern, Number of Line Breaks) evaluated on two lines of the example email. On the “From:” line, the Positive Word Features (“From:”) and Special Pattern Features (“email”) are marked; on the “Subject:” line, the Positive Word Features (“Subject:”) and Ending Character Features (“??”) are marked.]

Automatic Feature Generation

- Input: An annotated email dataset.
- Output: Discovered features.
- Algorithm:

Step 1: Preprocessing. This step first processes emails using hard rules. It replaces several special patterns with tags. For example, an email address “[email protected]” is replaced by the tag <email>.

Step 2: Learning patterns. This step takes the header lines as positive samples and the other lines as negative samples. It employs a pattern learning tool to discover the patterns. An example of a discovered pattern is: “<begin> Date: <week> <date> <time> <end>”.

Step 3: Generating features. This step generates features from the learned patterns using heuristic rules. For the above example, the corresponding feature can be: “^\s*Date: <week> <date> <time>\s*$”. The feature represents whether or not the current line contains the pattern.
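Steps 1 and 3 can be sketched in a few lines; Step 2 is skipped because the pattern learning tool is not specified. Everything below is an assumption made for illustration: the tag rules, the function names `preprocess` and `pattern_to_regex`, and the way `<begin>` and `<end>` are mapped to regular-expression anchors.

```python
import re

# Step 1 (assumed hard rules): each special pattern is replaced by a tag.
TAG_RULES = [
    ("<email>", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("<week>", re.compile(r"\b(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)\b,?")),
    ("<date>", re.compile(
        r"\b\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}\b")),
    ("<time>", re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?(?: [+-]\d{4})?")),
]

def preprocess(line):
    """Replace special patterns in a line by their tags."""
    for tag, rule in TAG_RULES:
        line = rule.sub(tag, line)
    return line

# Step 3 (assumed heuristic): turn a learned pattern into a regex feature.
def pattern_to_regex(pattern):
    """Map '<begin> ... <end>' to a regex anchored at line start and end."""
    prefix = suffix = ""
    if pattern.startswith("<begin> "):
        pattern, prefix = pattern[len("<begin> "):], r"^\s*"
    if pattern.endswith(" <end>"):
        pattern, suffix = pattern[:-len(" <end>")], r"\s*$"
    return re.compile(prefix + re.escape(pattern) + suffix)

def feature_value(regex, line):
    """The binary feature: does the preprocessed line contain the pattern?"""
    return bool(regex.search(preprocess(line)))
```

With these rules, the learned pattern from Step 2 becomes a feature that fires on the `Date:` line of the example email and on no other line of it.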

Generated Features

From: <email>

Subject: (.*?) Re:

<<email>> wrote in message

Date: <week> <date>

Subject:

<week> <date> <time>

Date:

-----Original Message-----

To: <email>

….

- Feature definition is tedious.
- Can we automate the feature generation?

Example Features Used in Signature Detection

• Position Feature: Is the first line or the last line?

• Positive Word Features: Contains “Best Regards”, “Thanks”, “Sincerely”, “Good luck”, etc.

• Number of Words Feature: Number of words in the current line

• Person Name Feature: Contains a person name?

• Ending Character Features: Ends with colon, semicolon, quotation mark, question mark, exclamation mark…

• Special Symbol Pattern Features: Contains consecutive special symbols such as “--------”, “======”, “******”.

• Case Features: Whether the tokens are all in upper case, all in lower case, all capitalized, or only the first token is capitalized

• Number of Line Breaks Feature: Number of line breaks before the current line
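Two of the signature features, the special symbol pattern and the case pattern, are easy to get subtly wrong, so a small sketch may help. It assumes that "consecutive special symbols" means four or more repetitions of the same symbol; that threshold, the function names, and the category labels are all hypothetical.

```python
import re

POSITIVE_WORDS = ("best regards", "thanks", "sincerely", "good luck")
SYMBOL_RUN = re.compile(r"([-=*_#~])\1{3,}")  # assumed: 4+ identical symbols in a row

def case_pattern(tokens):
    """Classify the casing of the alphabetic tokens in a line."""
    words = [t for t in tokens if t.isalpha()]
    if not words:
        return "none"
    if all(w.isupper() for w in words):
        return "all_upper"
    if all(w.islower() for w in words):
        return "all_lower"
    if all(w[0].isupper() for w in words):
        return "all_capitalized"
    if words[0][0].isupper() and all(w.islower() for w in words[1:]):
        return "first_capitalized"
    return "mixed"

def signature_features(line, index, total_lines):
    """Feature values for one line; index is 0-based."""
    text = line.strip()
    return {
        "is_first_line": index == 0,
        "is_last_line": index == total_lines - 1,
        "positive_word": any(w in text.lower() for w in POSITIVE_WORDS),
        "num_words": len(text.split()),
        "special_symbols": bool(SYMBOL_RUN.search(text)),
        "case": case_pattern(text.split()),
    }
```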

Example Features Used in Program Code Detection

• Position Feature: Position of the current line

• Declaration Keyword Features: Starts with “string”, “char”, “double”, “dim”, “typedef struct”, “#include”, “import”, “#define”, “#undef”, etc.

• Statement Keyword Features: There are four kinds of statement keyword features:
- “i++”
- “if”, “else if”, “switch”, and “case”
- “while”, “do{”, “for”, and “foreach”
- “goto”, “continue;”, “next;”, “break;”

• Equation Pattern Features: There are four kinds of equation pattern features:
- “=”, “<=” and “<<=”
- “a=b+/*-c;”
- “a=B(bb,cc);”
- “a=b;”

• Function Pattern Feature: Contains a function pattern? E.g., a pattern covering “fread(pbBuffer,1, LOCK_SIZE, hSrcFile);”
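The program code features are mostly keyword and pattern tests, which can be approximated with regular expressions. The sketch below is illustrative only: the regular expressions are guesses at what each feature covers (the four equation patterns are collapsed into one assignment test), and the function name and dictionary keys are hypothetical.

```python
import re

DECLARATIONS = ("string", "char", "double", "dim", "typedef struct",
                "#include", "import", "#define", "#undef")

def code_features(line):
    """Return keyword and pattern features for one line."""
    text = line.strip()
    return {
        # Declaration keyword: the line starts with one of the listed keywords.
        "declaration": text.lower().startswith(DECLARATIONS),
        # The four statement keyword groups from the slide.
        "increment": "++" in text,
        "branch": bool(re.search(r"\b(if|else if|switch|case)\b", text)),
        "loop": bool(re.search(r"\b(while|for|foreach)\b|\bdo\s*\{", text)),
        "jump": bool(re.search(r"\b(goto|continue;|next;|break;)", text)),
        # Equation pattern: an identifier followed by =, <= or <<= and an operand.
        "assignment": bool(re.search(r"\w\s*(<<=|<=|=)\s*[\w(]", text)),
        # Function pattern: name(arguments); without nested parentheses.
        "function_call": bool(re.search(r"\b\w+\s*\([^()]*\)\s*;", text)),
    }
```

Plain `startswith` also fires on ordinary words such as "important"; a trained classifier would be expected to weigh that feature against the others.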

Extra Line Break Detection Using SVMs

[Figure: training data and test data each pass through feature extraction (position feature, bullet feature, ...), producing one feature set. One SVM model, the extra line break model, is trained on these features and applied to the test data to output the identified extra line breaks.]

Features Used in Extra Line Break Detection

• Position Feature: Is the first line or the last line?

• Greeting Word Features: Contains “Hi”, “Dear”, etc.

• Ending Character Features: Ends with colon, semicolon, quotation mark, question mark, exclamation mark, etc.

• Case Features: Whether the current line ends with a word in lower case letters and whether or not the next line starts with a word in lower case letters

• Bullet Features: Does the next line start with a list-item bullet such as “1.” or “a)”?

• Number of Line Breaks Feature: Number of line breaks after the current line

Hi Ranger, Your design of Matrix class is not good. what are you doing with two matrices in a single class?make class Matrix as follows

One SVM model is employed to identify whether a line break is an extra one or not.

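The features for extra line break detection look at the pair formed by the current line and the line after it. A minimal sketch, assuming hypothetical names and regular expressions, and assuming the caller supplies the number of line breaks after the current line:

```python
import re

BULLET = re.compile(r"^\s*(?:\d+[.)]|[A-Za-z][.)]|[-*•])\s+")
GREETING = re.compile(r"\b(hi|dear)\b", re.IGNORECASE)

def line_break_features(cur, nxt, index, total, breaks_after=1):
    """Features for the line break between cur and nxt.

    index is the 0-based position of cur among the total lines.
    """
    cur, nxt = cur.rstrip(), nxt.lstrip()
    last_word = re.search(r"([A-Za-z]+)$", cur)
    first_word = re.match(r"[A-Za-z]+", nxt)
    return {
        "is_first_line": index == 0,
        "is_last_line": index == total - 1,
        "has_greeting": bool(GREETING.search(cur)),
        "ends_with_mark": cur[-1:] in tuple(':;"?!'),
        "ends_lower": bool(last_word) and last_word.group(1).islower(),
        "next_starts_lower": bool(first_word) and first_word.group(0).islower(),
        "next_is_bullet": bool(BULLET.match(nxt)),
        "breaks_after": breaks_after,
    }
```

For a break in the middle of a sentence, such as the one between "...doing with two" and "matrices in a single class?", both case features are true and no ending mark is present, which is the kind of evidence the SVM model can use to call the break an extra one.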

Case restoration

tri-gram + sentence level decoding

Jack utilize outlook express to retrieve emails.

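[Figure: candidate lattice for sentence-level decoding; each token has three case variants]

Jack / jack / JACK
Utilize / utilize / UTILIZE
Outlook / outlook / OUTLOOK
Express / express / EXPRESS
To / to / TO
Receive / receive / RECEIVE
Emails / emails / EMAILS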

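$$P(w_i \mid w_{i-2} w_{i-1}) = \frac{C(w_{i-2} w_{i-1} w_i)}{C(w_{i-2} w_{i-1})}$$

Backoff scheme:

$$P(w_i \mid w_{i-2} w_{i-1}) = \begin{cases} \dfrac{C(w_{i-2} w_{i-1} w_i) - D(C(w_{i-2} w_{i-1} w_i))}{C(w_{i-2} w_{i-1})} & \text{if } C(w_{i-2} w_{i-1} w_i) > 0 \\[2ex] \alpha(w_{i-2} w_{i-1})\, P(w_i \mid w_{i-1}) & \text{otherwise} \end{cases}$$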
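Sentence-level decoding over the lattice can be sketched as a Viterbi search whose states are the last two chosen case variants. The code below is a simplified illustration rather than the scheme on the slide: instead of a discount function and normalized back-off weights it uses a fixed back-off factor (`BACKOFF`, an assumption), and the training sentence in the usage note is toy data.

```python
import math
from collections import Counter

BACKOFF = 0.4  # assumed fixed back-off factor, not the discounted scheme

def train(sentences):
    """Count n-grams and their contexts over correctly cased sentences."""
    m = {name: Counter() for name in ("tri", "bi", "uni", "ctx2", "ctx1")}
    for sentence in sentences:
        words = ["<s>", "<s>"] + sentence.split()
        for i in range(2, len(words)):
            a, b, c = words[i - 2], words[i - 1], words[i]
            m["tri"][(a, b, c)] += 1
            m["bi"][(b, c)] += 1
            m["uni"][c] += 1
            m["ctx2"][(a, b)] += 1
            m["ctx1"][b] += 1
    return m

def log_prob(c, a, b, m):
    """Score of word c after context (a, b): trigram, else bigram, else unigram."""
    if m["tri"][(a, b, c)] > 0:
        return math.log(m["tri"][(a, b, c)] / m["ctx2"][(a, b)])
    if m["bi"][(b, c)] > 0:
        return math.log(BACKOFF * m["bi"][(b, c)] / m["ctx1"][b])
    total, vocab = sum(m["uni"].values()), len(m["uni"])
    return math.log(BACKOFF * BACKOFF * (m["uni"][c] + 1) / (total + vocab))

def candidates(token):
    """The three case variants shown in the lattice."""
    low = token.lower()
    return [low, low.capitalize(), low.upper()]

def restore_case(tokens, m):
    """Viterbi search: keep the best path for every (previous, current) pair."""
    best = {("<s>", "<s>"): (0.0, [])}
    for token in tokens:
        step = {}
        for (a, b), (score, path) in best.items():
            for c in candidates(token):
                s = score + log_prob(c, a, b, m)
                if (b, c) not in step or s > step[(b, c)][0]:
                    step[(b, c)] = (s, path + [c])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]
```

Usage: `restore_case("jack utilize outlook express".split(), train(["Jack utilize Outlook Express to retrieve emails ."]))` returns the cased tokens along the highest-scoring path through the lattice.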


Datasets in Experiments

• 73.2% contain extra line breaks
• 85.4% need sentence normalization
• 47.1% contain case errors
• Only 1.6% are absolutely clean

Data Set    # of Emails  Containing Header  Containing Signature  Containing Prog. Code  Text Only
DC          100          1.00               0.87                  0.15                   0.0
Ontology    100          1.00               0.77                  0.02                   0.0
NLP         60           1.00               0.883                 0.0                    0.0
ML          40           1.00               0.975                 0.05                   0.0
Jena        700          0.996              0.97                  0.38                   0.0
Weka        200          0.995              0.975                 0.17                   0.0005
Protégé     500          0.28               0.822                 0.032                  0.168
OWL         500          0.384              0.932                 0.042                  0.048
Mobility    400          0.44               0.745                 0.0                    0.183
WinServer   400          0.449              0.672                 0.0125                 0.221
Windows     1000         0.476              0.653                 0.007                  0.218
PSS         1000         0.492              0.668                 0.01                   0.208
BR          310          0.495              0.643                 0.0                    0.244
J2EE        255          1.00               0.561                 0.094                  0
Total       5565         3256 (0.585)       4229 (0.760)          401 (0.072)            768 (0.138)

Cleaning Results -- 5-fold Cross Validation

Cleaning Task                   Precision  Recall  F1-Measure
Header (Our Method)             0.9695     0.9742  0.9719
Header (Baseline)               0.9981     0.6055  0.7537
Signature (Our Method)          0.9133     0.8838  0.8983
Signature (Baseline)            0.8854     0.2368  0.3736
Quotation                       0.9818     0.9201  0.9500
Program Code                    0.9297     0.7217  0.8126
Extra Line Break (Our Method)   0.8553     0.9765  0.9119
Extra Line Break (Baseline)     0.6355     0.9813  0.7715
Sentence                        0.9493     0.9391  0.9442

Baseline methods
• Header detection (eClean2000)
• Signature detection (rule based)
• Extra line break detection (eClean2000)

For case restoration:
- Our method reaches 98.15% accuracy
- The accuracy of TrueCasing is about 97.7%
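The F1-Measure column is the harmonic mean of precision and recall, so it can be recomputed from the other two columns. A small sketch (the function name is hypothetical) that does this for two rows of the table:

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the header detection rows of the table.
header_ours = f1_measure(0.9695, 0.9742)
header_baseline = f1_measure(0.9981, 0.6055)
```

The baseline's near-perfect precision does not compensate for its low recall, which is why its F1-Measure falls well below that of the SVM-based method.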

Automatic Features vs. Manual Features

Detection Task          Precision  Recall  F1-Measure
Header (Manual)         0.9695     0.9742  0.9719
Header (Automatic)      0.9932     0.9626  0.9777
Signature (Manual)      0.9133     0.8838  0.8983
Signature (Automatic)   0.7616     0.6671  0.7112

Term Extraction Using Email Cleaning

[Figure: two bar charts, one for the BR data set and one for the J2EE data set, showing term extraction Precision, Recall and F1-Measure (percentage) for Original Data, Baseline, and Our Method]

How Cleaning Processing Helps Term Extraction

[Figure: bar chart for the BR data set showing term extraction Precision, Recall and F1-Measure (percentage) for Original Data, +Header, +Signature, +Quotation, +Program, +Paragraph. Annotated gains: +74.2%, +6.4%, +41%]

How Cleaning Processing Helps Term Extraction (cont.)

[Figure: bar chart for the J2EE data set showing term extraction Precision, Recall and F1-Measure (percentage) for Original Data, +Header, +Quotation, +Signature, +Program, +Paragraph. Annotated gains: +42.4%, +2.3%, +24.7%]


Summary

• Formalized email data cleaning as non-text filtering and text normalization
• Conducted email cleaning in a ‘cascaded’ approach
• Used SVM models for header, signature, program code, and extra line break detection
• Our approach significantly outperforms baseline methods
• When applied to term extraction, a significant improvement in extraction accuracy can be obtained

Thanks!