
Ranking Similarity between Political Speeches using Naive Bayes Text Classification

James Ryder and Sen Zhang

Dept. of Mathematics, Computer Science, & Statistics

State University of New York at Oneonta

[email protected] --- [email protected]

Text Classification Training

• Comprised of two subcomponents:

- Concordancer

- Metaconcordancer

• All training is done offline and prior to any classification

• Training generates finished metaconcordance files that are used during classification

• A set of metaconcordance files is a set of categories containing N files

Training - Metaconcordancer

Combine all concordances (C1 - CM) for a single author (Ai) into a single metaconcordance file (MTA)

[Diagram: concordance files C1 - CM feed into the Metaconcordancer, which outputs the MTA file for author Ai]

Create a metaconcordance file for author Ai. This is a complete description of this author's texts and is used at classification time.
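As a minimal sketch of what this merging step might look like in Python (the JSON concordance format matches the sketch under "Training - Concordancer" later on this poster; both are assumptions, not the authors' actual implementation):

    from collections import Counter
    import json

    def build_metaconcordance(concordance_paths, mta_path):
        """Merge the per-text concordances for one author into a single MTA file.

        Assumes each concordance is a JSON map:
            word -> {"frequency": int, "relative_frequency": float}
        """
        totals = Counter()
        for path in concordance_paths:
            with open(path) as f:
                for word, entry in json.load(f).items():
                    totals[word] += entry["frequency"]  # sum counts across texts
        total_words = sum(totals.values())
        # The MTA stores each word's overall relative frequency for this author.
        mta = {word: count / total_words for word, count in totals.items()}
        with open(mta_path, "w") as f:
            json.dump(mta, f)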

Inaugural Speech Experiment Design

• Training Phase

- The Inaugural speeches of the ten most recent U.S. presidents: Barack Obama, George W. Bush, Bill Clinton, George Bush, Ronald Reagan, Jimmy Carter, Richard Nixon, Lyndon Johnson, John Kennedy, and Dwight Eisenhower.

- For those who served two terms, only the second inaugural speech was collected.

• Classification Phase

- Obama's Inaugural speech

- Bush's Second Inaugural speech

- Bush's Farewell speech

Top Frequently Used Words of George W. Bush's Farewell Speech

[Word cloud of the speech's most frequent words; recognizable terms include America, Americans, democracy, liberty, justice, economy, war, terrorists, President, United, future, honor, and thank]

TC Authors Training Example

• Set of N authors

- We are given a snippet of text said to be written by one of the authors in the set of authors (categories)

• This TC system should attempt to predict which author is most likely to have written the snippet of text

• In the training phase, we need to obtain samples of the writing for each author in the category (author) set

Prepare for Classification

• Collect all MTA (category) files into one folder.

• Edit the category list file by inserting the names of all category files to compare against.

• To be ready to classify some unknown snippet of text, one needs (see the sketch below):

- All category files prepared (MTA)

- The category list file

Categories A = {MTA_A1, MTA_A2, ..., MTA_AN} is the set of all category files
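As a rough illustration of this setup (the file formats here are assumptions consistent with the other sketches on this poster, not the authors' actual code), loading the category list might look like:

    import json

    def load_categories(category_list_path):
        """Read the category list file (assumed: one MTA filename per line)
        and load each metaconcordance as a word -> relative-frequency map."""
        categories = {}
        with open(category_list_path) as f:
            for line in f:
                name = line.strip()
                if name:
                    with open(name) as mta_file:
                        categories[name] = json.load(mta_file)
        return categories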

Preliminary Experiment Results {1.1}: Ranking via Comparing Their Speeches with Obama's Inaugural Speech

[Bar chart: the ten training categories ranked by similarity to Obama's Inaugural Speech]

Top Frequently Used Words of Barack Obama's Inaugural Speech

[Word cloud of the speech's most frequent words]

Joint and Conditional Probability

Using variables X and Y, P(X, Y) means the probability that X will take on a specific value x and Y will take on a specific value y. That is, X will occur and Y will occur too.

P(X, Y) = P(Y, X). This idea is known as joint probability.

P(Y = y | X = x) is read "the probability that Y will take on the specific value y GIVEN THAT X has already taken on the specific value x". This is the conditional probability P(Y | X).

P(X, Y) = P(Y, X) and P(Y, X) = P(X, Y)

P(X, Y) = P(X) P(Y | X)

P(Y, X) = P(Y) P(X | Y)

The formula above is read "the probability that X occurs and Y occurs is the same as the probability that X occurs and, given that X has occurred, Y occurs", and vice versa on the right side.

Since P(X, Y) = P(Y, X):

P(X) P(Y | X) = P(Y) P(X | Y)

P(Y | X) = P(Y) P(X | Y) / P(X)

The formula above is read "(the chance of Y given that X occurred) is ((the chance of Y occurring) times (the chance of X occurring given that Y occurred)) divided by (the chance of X occurring)".

This is the standard Bayes Theorem
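A quick numeric check of the theorem, with toy values chosen purely for illustration:

    p_y = 0.10          # P(Y): prior probability of the category (toy value)
    p_x_given_y = 0.03  # P(X | Y): probability of the text given the category
    p_x = 0.006         # P(X): overall probability of the text

    p_y_given_x = p_y * p_x_given_y / p_x
    print(p_y_given_x)  # approx. 0.5, the posterior P(Y | X)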

TC Author Example Training

• Find many works of literature that each author has written

• For each author, create a single concordance for each of this author's M texts (T)

A = {A1, A2, ..., AN} is the set of all N authors. TAi = {T1, T2, ..., TM} is the set of all texts for author Ai.

[Diagram: each text in TAi flows into the Concordancer, which outputs concordance files C1 through CM]

Create a concordance file for each of the author's texts (TAi), C1 through CM.

Mapping Classic Bayes Theorem into the Naïve Bayes Text Classification System

We will use a modified (naïve) version of Bayes Theorem to create an ordered list of the categories to which a given input text may belong. The ordering is based upon the relative likelihood that the text is similar to a category instance, for all categories in the category set.

P(Y | X) = P(Y) P(X | Y) / P(X)

with the terms labeled (a) P(Y | X), (b) P(Y), (c) P(X | Y), and (d) P(X).

a) X is the input text (source) that we attempt to classify. Y is a single instance of a category from among the set of categories being considered. (a) is read "the probability that text X belongs to category Y".

b) is the probability that this is instance i from the category set. If the category set contains 10 categories, then P(Y) is 0.10.

c) is the probability of all words in the input text being found in category instance Yi from the set of categories.

d) is the probability that input text X is in fact the input text X. Clearly, this is 1, and the term is therefore discarded from the final formula without affecting the relative scores between the categories.
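Since P(X) divides every category's score equally, dropping it cannot change the ranking; a tiny demonstration with made-up numbers:

    # P(Y) * P(X | Y) for two hypothetical categories (toy numbers).
    scores = {"category_A": 0.012, "category_B": 0.003}
    p_x = 0.015  # the same P(X) for both

    posteriors = {name: s / p_x for name, s in scores.items()}
    # Dividing by the common P(X) preserves the order of the categories.
    assert max(scores, key=scores.get) == max(posteriors, key=posteriors.get)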

Political Spectrum Experiment Design

• Training Phase

- Ten prominent worldwide political figures:

• George W. Bush, Winston Churchill, Bill Clinton, Adolf Hitler, John Kennedy, Mao Tse-tung, Karl Marx, Barack Obama, Joseph Stalin, Margaret Thatcher.

- For each of them, we randomly select five speeches or written works. By "random" we mean that we simply collected these speeches from the Internet without prior knowledge about them and without reading them.

• Classification Phase

- Obama's Inaugural speech

- Bush's Second Inaugural speech

- Bush's Farewell speech

References and Acknowledgments

• Beautiful Word Clouds: http://www.wordle.net

• Inaugural Speeches of Presidents of the United States: http://www.bartleby.com/124

• Thanks to Dr. William R. Wilkerson for his help in directing us to online political speech repositories.

• Thanks to the TLTC for printing out the Poster.

Naïve Bayes Text Classifier

• Our text classification (TC) system is broken into two main components:

- Training

- Classification

• Training must be done first

• We need to map the standard Bayes Theorem into a formula for quantifying the likelihood that a given text (X) falls into a certain category instance (Y).

Training - Concordancer

• For each text (Tj) in TAi, the concordancer

- counts the number of occurrences of each unique word in the text (frequency)

- counts the total number of words

- calculates the relative frequency of each unique word in the text (frequency / total_words)

- creates an output file concordance (C) containing the above information and the list of unique words (see the sketch below)
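A minimal sketch of such a concordancer in Python; the naive tokenizer and JSON output format are assumptions, not the authors' implementation:

    from collections import Counter
    import json
    import re

    def build_concordance(text_path, out_path):
        """Count each unique word, compute relative frequencies, and write
        the concordance file (C) for one text."""
        with open(text_path) as f:
            words = re.findall(r"[a-z']+", f.read().lower())  # naive tokenizer
        freq = Counter(words)       # frequency of each unique word
        total = sum(freq.values())  # total number of words in the text
        concordance = {
            word: {"frequency": count, "relative_frequency": count / total}
            for word, count in freq.items()
        }
        with open(out_path, "w") as f:
            json.dump(concordance, f)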

c) For a given category Yi, what is the probability that the words in X appear in Yi?

X = {w1, w2, ..., wn} is the set of all words in the snippet of text

P(X | Yi) = P((w1, w2, ..., wn) | Yi)

P(X | Yi) = ∏ j=1..n P(wj | Yi)

The probability of wj is the relative frequency of the jth word as contained in the metaconcordance for category Yi.

If wj from X is not present in Yi, then we use a very small number for the probability, because the probability of a word not found is zero, and multiplying by zero destroys the product.

This product will result in an extremely small number that may be too small for a computer to represent precisely. So, we use a trick: instead of multiplying the probabilities, we add their logarithms.

Trick: log(A · B) = log(A) + log(B)

log(P(X | Yi)) = Σ j=1..n log(P(wj | Yi))
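Putting the pieces together, a sketch of the scoring step; the floor probability for unseen words (1e-10) is an assumed constant, not the authors' exact choice:

    import math

    UNSEEN = 1e-10  # assumed tiny probability for words absent from a category

    def log_likelihood(words, mta):
        """log P(X | Yi): sum of log relative frequencies of the snippet's
        words, looked up in the metaconcordance (word -> relative frequency)."""
        return sum(math.log(mta.get(w, UNSEEN)) for w in words)

    def rank_categories(words, categories):
        """Rank categories by log P(Yi) + log P(X | Yi); with a uniform prior
        P(Yi) = 1/N, the prior term shifts every score equally."""
        n = len(categories)
        scores = {name: math.log(1.0 / n) + log_likelihood(words, mta)
                  for name, mta in categories.items()}
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)

The first element of the returned list is the category ranked most similar to the snippet.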

Preliminary Experiment Results {2.2}: Ranking via Comparing Their Speeches with G. W. Bush's Second Inaugural Speech

[Bar chart: the ten training categories ranked by similarity to G. W. Bush's Second Inaugural Speech]

Future Work

• To improve ranking accuracy, we plan to

- use variants of Naïve Bayes and address the poor independence assumption;

- explore more linguistic, rhetorical and stylistic features, such as metaphors, analogies, similes, opposition, alliteration, antithesis and parallelism;

- select more representative training datasets;

- conduct more intensive experiments.