Text Data Mining

Transcript of "Text Data Mining" (50 pages). Source text: Speech and Language Processing, Daniel Jurafsky and James H. Martin, Prentice Hall, 2000.

Page 1

Text Data Mining

Page 2

Speech and Language Processing
Daniel Jurafsky and James H. Martin,
Prentice Hall, 2000.

Page 3

Introduction

• Dave Bowman: Open the pod bay doors, HAL.
• HAL: I'm sorry Dave, I'm afraid I can't do that.
• From the screenplay of 2001: A Space Odyssey.
• What would it take to create at least the language-related parts of HAL?
• HAL would need to understand humans via speech recognition and natural language understanding (and, of course, lip-reading), and to communicate with humans via natural language generation and speech synthesis. HAL would also need to be able to do information retrieval (finding out where needed textual resources reside), information extraction (extracting pertinent facts from those textual resources) and inference (drawing conclusions based on known facts).

Page 4

Language Processing Systems

• Language processing systems range from mundane applications such as word counting, spelling correction and grammar checking, to cutting-edge applications such as automated question answering on the web and real-time spoken language translation.

• What distinguishes them from other data processing systems is their use of knowledge of language. Even the Unix word count program (wc) has knowledge of what constitutes a word.
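As a concrete illustration, here is a minimal Python sketch of what wc computes (the input file name is hypothetical). Its one piece of linguistic knowledge is the decision that a word is a maximal run of non-whitespace characters.

```python
# Minimal wc-style counter (hypothetical input file). The only
# "knowledge of language" here is the definition of a word as a
# maximal run of non-whitespace characters, which is what wc uses.
with open("hal_transcript.txt") as f:
    text = f.read()

print(text.count("\n"), len(text.split()), len(text))  # lines, words, chars
```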

Page 5

Levels of Language (1)

• To determine what Dave is saying, HAL must be able to analyse the incoming audio signal. Similarly, HAL must be able to generate an audio signal that Dave can understand. These tasks require knowledge of phonetics (how words are pronounced in terms of individual speech units called phones, listed in the International Phonetic Alphabet) and phonology (the systematic way that sounds are realised differently in different environments, e.g. the k sound in cat and cook).

• HAL is capable of producing contractions like I'm and can't. Producing and recognising these and other variations of individual words (e.g. recognising that doors is plural) requires knowledge of morphology.

Page 6

Levels of Language (2)

• HAL has knowledge of syntax, the rules for the combination of words. He knows that the sequence I'm I do, sorry that afraid Dave I'm can't will not make sense to Dave, even though it contains exactly the same words as the original. He also has knowledge of lexical semantics (the meanings of words, e.g. the difference between door and window, or open and shut). Compare this with compile-time and run-time errors in a computer program.

• Next, despite its bad behaviour, HAL knows enough to be polite to Dave, embellishing his responses with I'm sorry and I'm afraid. The appropriate use of polite and indirect language comes under pragmatics. HAL's correct use of the word that in his response to Dave's request provides structure in their conversation, which requires knowledge of discourse conventions.

• See “The scope of linguistics”, p14 of “Teach Yourself Linguistics” by Jean Aitchison.

Page 7

Page 8

To summarize, the knowledge of language needed to engage in complex behaviour can be separated into six distinct categories:

• Phonetics and Phonology - the study of linguistic sounds
• Morphology - the study of the meaningful components of words
• Syntax - the study of the structural relationships between words
• Semantics - the study of meaning
• Pragmatics - the study of how language is used to accomplish goals
• Discourse - the study of linguistic units larger than a single utterance

• Most or all tasks in speech and language processing can be viewed as resolving ambiguity at one of these six levels. How many different meanings can you think of for the sentence I made her duck?

Page 9

Resolving Ambiguity

• Duck can be a verb or a noun, while her can be a dative pronoun or a possessive pronoun. Make can mean create or cook or compel. It can also be transitive, taking a single direct object (I cooked waterfowl belonging to her), or ditransitive, taking two objects, meaning that the first object (her) was made into the second object (duck). In a spoken sentence, there would be another kind of ambiguity. What is it?
• How do we resolve or disambiguate these ambiguities?
• Deciding whether duck is a verb or a noun can be solved by part-of-speech tagging (see the sketch below).
• Deciding whether make means create, cook or compel can be solved by word sense disambiguation.
• Deciding whether make is transitive or ditransitive is an example of syntactic disambiguation and can be addressed by probabilistic parsing.
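As a minimal sketch of part-of-speech tagging applied to this sentence, using NLTK's off-the-shelf tagger (an assumed toolkit choice; the slides do not prescribe one):

```python
# POS-tagging "I made her duck" with NLTK (assumed toolkit).
# Requires: pip install nltk
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

print(nltk.pos_tag(nltk.word_tokenize("I made her duck")))
# A typical tagger reads duck as NN (noun) in this context; in
# "I saw her duck behind the car" the same word would come out VB.
```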

Page 10

Tokenisation and Stemming

• Tokenisation, e.g.:
• Splitting at the word level. Non-trivial, e.g. data base, database, data-base.
• Splitting at the sentence level. Non-trivial, because full stops also occur inside numbers (3.21) and after abbreviations (Dr., Mr.); see the sketch below.
• Stemming rules (see Paice's rules).
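To show why sentence splitting is non-trivial, here is a rough regex-based splitter (illustrative rules and an assumed abbreviation list, not a production tokeniser) that avoids breaking after abbreviations and inside decimal numbers:

```python
# Rough sentence-splitting sketch: split on ., ! or ? followed by
# whitespace, but not after listed abbreviations or after a digit.
import re

ABBREV = {"Dr", "Mr", "Mrs", "Prof", "e.g", "i.e"}  # assumed, incomplete

def split_sentences(text):
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]\s+", text):
        before = text[start:m.start()]
        last = before.rsplit(None, 1)[-1] if before.split() else ""
        # Skip boundaries after abbreviations ("Dr.") or digits ("3.").
        if last in ABBREV or re.search(r"\d$", last):
            continue
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith paid 3.21 pounds. He left. Was it enough?"))
# -> ['Dr. Smith paid 3.21 pounds.', 'He left.', 'Was it enough?']
```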

Page 11

Stemming rules

• Rules for the reduction of different grammatical forms of a word to a common canonical form.

• E.g. terror, terrorism, terrorist, terrorists → terror.

• The most popular set of rules is Porter's rules.

• Look at the handout of Paice's rules.
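Both rule sets mentioned here have standard implementations in NLTK (an assumed toolkit): Porter's rules as PorterStemmer and Paice's rules (the Paice/Husk "Lancaster" stemmer) as LancasterStemmer. A minimal sketch:

```python
# Minimal stemming sketch with NLTK (assumed toolkit). Porter's rules
# and Paice's rules (the Paice/Husk "Lancaster" stemmer) are both
# available off the shelf.
from nltk.stem import PorterStemmer, LancasterStemmer

porter, lancaster = PorterStemmer(), LancasterStemmer()
for word in ["terror", "terrorism", "terrorist", "terrorists"]:
    print(word, porter.stem(word), lancaster.stem(word))
# The two rule sets need not agree, and neither is guaranteed to map
# every form above to exactly "terror"; stemmers trade precision for
# simple suffix-stripping rules.
```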

Page 12

Word Classes and Part-of-Speech Tagging

• No definitive list of word classes, but 146 tags in the C7 tagset (Garside et al., 1997).
• Two broad supercategories: closed class and open class.
• Main open classes are nouns (cat, Daniel), verbs (walk), adjectives (green) and adverbs (slowly).
• Main closed classes are:
• Prepositions: on, under, over, near, by, at, from, to, with
• Determiners: a, an, the
• Pronouns: she, who, I, others
• Conjunctions: and, but, or, as, if, when
• Auxiliary verbs: can, may, should, are
• Particles: up, down, on, off, in, out, at, by
• Numerals: one, two, three, first, second, third
• Tagsets for English, e.g. the Penn Treebank:
• The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
• Exercise: manual CLAWS tagging with disambiguation.

Page 13

Context-free grammars for English

• S → NP + VP
• NP → DET + NOUN
• VP → VERB + NP
• DET → the
• NOUN → man | book
• VERB → took

Page 14

CFG are also called Phrase-Structure Grammars

• They consist of a set of rules or productions, each of which expresses the ways that symbols of the language can be grouped together, and a lexicon of words or symbols. The symbols that correspond to words in the surface form of the language are called terminal symbols.

• The CFG may be thought of in two ways: as a device for generating sentences (top-down parsing), or as a device for assigning a structure to a given sentence (bottom-up parsing). It is sometimes convenient to represent a parse-tree in bracketed notation (e.g. the Penn Treebank).

• [S [NP [DET the] [NOUN man] ] [VP [VERB took] [NP [DET the] [NOUN book] ] ] ]
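The toy grammar from the previous slide is small enough to run directly. A minimal sketch with NLTK's chart parser (an assumed toolkit; NLTK writes -> and quotes terminals), which prints the same bracketed structure as above:

```python
# Parsing "the man took the book" with the toy CFG from the previous
# slide, using NLTK (assumed toolkit).
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DET NOUN
VP -> VERB NP
DET -> 'the'
NOUN -> 'man' | 'book'
VERB -> 'took'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the man took the book".split()):
    print(tree)  # (S (NP (DET the) (NOUN man)) (VP (VERB took) ...))
```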

Page 15

SVO triples

• A parser can show the existence of useful subject-verb-object triples, showing two related entities and the relation between them.

• aspirin (treats) headache

• HBOS (takes-over) Halifax
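A minimal sketch of extracting such triples with spaCy's dependency parser (an assumed library choice; any parser exposing subject and object relations would do):

```python
# Extracting subject-verb-object triples with spaCy (assumed library).
# Install with: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Aspirin treats headache. HBOS takes over Halifax.")

for token in doc:
    if token.pos_ == "VERB":
        subjects = [c for c in token.lefts if c.dep_ in ("nsubj", "nsubjpass")]
        objects = [c for c in token.rights if c.dep_ in ("dobj", "obj")]
        for s in subjects:
            for o in objects:
                print((s.text, token.text, o.text))
# e.g. ('Aspirin', 'treats', 'headache')
```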

Page 16

Word Sense Disambiguation (1)

• The WordNet thesaurus lists the range of different senses a word can have, and also the range of relations between related word senses:

• hypernym, e.g. breakfast → meal (noun)
• hyponym, e.g. meal → lunch
• has-member, e.g. faculty → professor
• member-of, e.g. copilot → crew
• has-part, e.g. table → leg
• part-of, e.g. course → meal

Page 17

Word Sense Disambiguation (2)

• hypernym, e.g. fly → travel (verb)
• troponym, e.g. walk → stroll
• entails, e.g. snore → sleep
• antonym, e.g. increase → decrease
• antonym, e.g. heavy → light (adjective)
• antonym, e.g. quickly → slowly (adverb)
• synsets, e.g. {chump, fish, fool, gull, mark, patsy, fall guy, sucker, schlemiel, soft touch, mug}
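These relations can be explored directly. A minimal sketch using NLTK's WordNet interface (an assumed toolkit choice); the exact synsets returned depend on the WordNet version:

```python
# Browsing WordNet relations with NLTK (assumed interface to the
# WordNet thesaurus described on these slides).
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

meal = wn.synset("meal.n.01")
print(meal.hypernyms())       # hypernym: meal -> a more general term
print(meal.hyponyms())        # hyponyms: breakfast, lunch, ...
print(meal.part_meronyms())   # has-part relations, e.g. course

# Synsets group synonymous lemmas; antonymy links individual lemmas.
print(wn.synsets("chump"))
print(wn.lemma("heavy.a.01.heavy").antonyms())  # expect light
```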

Page 18

The DDF (Derwent Drug File) Thesaurus

• PENICILLANATE h.t. ANTIBIOTICS

• Penicillanate-sulfone use SULBACTAM

• PENICILLATE h.t. ANTIBIOTICS

• Penicillin use BENZYL-PENICILLIN

Page 19

The ACAMRIT semantic tagger

• The SEMTAG semantic tagset was originally loosely based on the categories found in the Longman Lexicon of Contemporary English (McArthur, 1981).

• The categories are arranged in a hierarchy, with 21 major discourse fields denoted by an upper case letter (such as E for “emotional actions, states and processes”), then divided and sometimes even subdivided again.

Page 20

The ACAMRIT semantic tagger

• Subdivision is shown using numeric components of the semantic codes, such as 4.1.

• Antonyms are identified using the symbols + and -. Thus “happy” is normally tagged E4.1+, and “sad” is normally tagged as E4.1-.

• Comparatives are shown with ++ or --, and superlatives with +++ or ---.

• In some cases, a word type can only represent one possible category. Often, however, a word such as “spring” can have a number of different meanings, each requiring a different semantic tag. In such cases, disambiguation is achieved using six types of additional evidence, as follows:

Page 21

Additional Evidence for WSD

• The POS tag assigned by CLAWS. For example, if “spring” is a verb, we know it must mean “jump”.

• The general likelihood of a word taking a particular meaning, as found in certain frequency dictionaries.

• Idiom lists are kept. If an entire idiomatic phrase is found in the text being analysed, it is assumed that the idiomatic meaning of each word in the phrase is more likely than individual interpretations of the words.

• The domain of discourse can be an indicator. For example, if the topic of discussion is footwear, then “boot” is unlikely to refer to the boot of a car.

• Special rules have been developed for the auxiliary verbs “be” and “have”.

• Proximity disambiguation. Are any collocates of the word, suggesting a particular interpretation, found in the immediate vicinity?
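The last kind of evidence, collocates in the immediate vicinity, is the idea behind the classic Lesk algorithm. As an illustration (Lesk is a stand-in here, not ACAMRIT's actual rule set), NLTK ships a simplified version:

```python
# Proximity-based disambiguation with simplified Lesk (illustrative
# stand-in; ACAMRIT's own rules are specific to that tagger).
from nltk.wsd import lesk

context = "he put the shopping in the boot of the car".split()
print(lesk(context, "boot", pos="n"))
# Prints whichever noun synset of "boot" best overlaps the context;
# Lesk is a weak baseline, so the chosen sense should be sanity-checked.
```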

Page 22

ACAMRIT uses SEMTAG tags

• The full set of SEMTAG semantic tags can be found on http://www.comp.lancs.ac.uk/ucrel/acamrit/setags.txt.

Page 23

Some ACAMRIT codes

• G1 Government, Politics and elections
• G1.1 Government etc.
• G1.2 Politics
• G2 Crime, law and order
• G2.1 Crime, law and order: Law and order
• G2.2 General ethics
• G3 Warfare, defence and the army; weapons
• H1 Architecture and kinds of houses and buildings
• H2 Parts of buildings
• H3 Areas around or near houses
• H4 Residence
• H5 Furniture and household fittings
• I1 Money generally
• I1.1 Money: Affluence
• I1.2 Money: Debts
• I1.3 Money: Price
• I2 Business
• I2.1 Business: Generally
• I2.2 Business: Selling
• I3 Work and employment
• I3.1 Work and employment: Generally
• I3.2 Work and employment: Professionalism
• I4 Industry

Page 24

Text tagged with Part Of Speech and Semantic Code

I_PPIS1_Z8mf went_VVD_M1[i3.2.1 down_RP_M1[i3.2.2 yesterday_RT_T1.1.1 to_II_Z5 the_AT_Z5 Peiraeus_NP1_Z99 with_IW_Z5 Glaucon_NP1_Z99 ,_,_PUNC the_AT_Z5 son_NN1_S4m of_IO_Z5 Ariston_NP1_Z99 ,_,_PUNC to_TO_Z5 pay_VVI_I1.2 my_APPGE_Z8 devotions_NN2_Z99 to_II_Z5 the_AT_Z5 Goddess_NN1_S9/S2.1f ,_,_PUNC and_CC_Z5 also_RR_N5++ because_CS_Z5 I_PPIS1_Z8mf wished_VVD_X7+ to_TO_Z5 see_VVI_X3.4 how_RRQ_Z5 they_PPHS2_Z8mfn would_VM_A7+ conduct_VVI_A1.1.1 the_AT_Z5 festival_NN1_K1/S1.1.3+ since_CS_Z5 this_DD1_Z8 was_VBDZ_A3+ its_APPGE_Z8 inauguration_NN1_Z99 ._._PUNC

Page 25

Discourse

• Up to now, we have focussed mainly on language phenomena that operate at the word or sentence level. Of course, language does not normally consist of isolated, unrelated sentences, but instead of related groups of sentences. We refer to such a group of sentences as a discourse (e.g. HCI).

• Coherence and reference are discourse phenomena: consider

• John went to Bill’s car dealership to check out an Acura Integra. He looked at it for about an hour.

• Automatic reference resolution depends mainly on proximity rules and constraints on coreference, e.g. agreement in gender and number.
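As a toy sketch of the proximity-plus-agreement idea (purely illustrative; the mention list is hand-built and real resolvers are far more elaborate): scan backwards from the pronoun for the nearest preceding mention whose gender and number match.

```python
# Toy coreference sketch: nearest preceding antecedent that agrees
# in gender and number with the pronoun. Purely illustrative.
AGREE = {
    "he": {"gender": "masc", "number": "sg"},
    "it": {"gender": "neut", "number": "sg"},
}

# Hypothetical hand-built mention list for the car-dealership example,
# in order of appearance.
mentions = [
    ("John", {"gender": "masc", "number": "sg"}),
    ("Bill's car dealership", {"gender": "neut", "number": "sg"}),
    ("an Acura Integra", {"gender": "neut", "number": "sg"}),
]

def resolve(pronoun):
    feats = AGREE[pronoun.lower()]
    # Proximity rule: prefer the most recent agreeing mention.
    for mention, f in reversed(mentions):
        if f["gender"] == feats["gender"] and f["number"] == feats["number"]:
            return mention
    return None

print(resolve("He"))  # -> John
print(resolve("it"))  # -> an Acura Integra
```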

Page 26

Discoursal Annotation

• (0) The state Supreme Court has refused to release {1[2 Rahway State Prison 2] inmate 1} (1 James Scott 1) on bail. (1 The fighter 1) is serving 30-40 years for a 1975 armed robbery conviction. (1 Scott 1) had asked for freedom while <1 he waits for an appeal decision. Meanwhile {3 <1 his promoter 3], {{ 3 Murad Mohammad 3} said Wednesday <3 he netted only $15,250 for (4 [1 Scott 1] ‘s nationally televised fight against {5 ranking contender 5 } (5 Yaqui Lopez 5) last Saturday 4).

Page 27

Dialogue Acts (Bunt) or Conversational Moves (Power)

• STATEMENT - a claim made by the speaker
• INFO-REQUEST - a question by the speaker
• CHECK - a question for confirming information
• INFLUENCE-ON-ADDRESSEE (= Searle's directives)
• OPEN-OPTION - a weak suggestion or listing of options
• ACTION-DIRECTIVE - an actual command
• INFLUENCE-ON-SPEAKER (= Austin's commissives)
• OFFER - speaker offers to do something (subject to confirmation)
• COMMIT - speaker is committed to doing something
• CONVENTIONAL - other
• OPENING - greetings
• CLOSING - farewells
• THANKING - thanking and responding to thanks.

Page 28

Open Sources Automatic Analysis

Chapter 8 of "Text Mining and its Applications to Intelligence, CRM and Knowledge Management" by Alessandro Zanasi.

Page 29

Introduction (1)

• Information density doubles every 24 months and its costs are halved every 18 months – a lot of information to plough through.

• Terrorism and organised crime will pose hard issues for intelligence because international concern for them will grow just as international capacity to deal with them diminishes. Law enforcement is rooted in the authority of the state and generally defined by geographic units, while international institutions have always been weak.

Page 30

Introduction (2)

• Intelligence in the Cold War defined its business as secrets, where collection was the supreme task. In the future its business will be information, defined as a high-quality understanding of the world using all sources, including open sources, where secrecy matters much less, collection is easier, and analysis is the critical challenge.

• Competitive intelligence (CI) is a systematic and ethical program for gathering, analysing, and managing information that can affect your company’s plans, decisions and operations. Ethical implies that CI operates only on open sources of data / information.

Page 31

Open Sources

• Open source refers to information in the public domain, which may be free (e.g. the Internet) or not (e.g. the Derwent Drug File), but which is not considered confidential and is easily available to anyone.

• To profit fully from such information, two problems must be solved:

• Information overload. There is far too much information to be analysed quickly by one person.

• Specialised language. The knowledge is present at different, sometimes hidden, levels.

• Some of the most well-known online databanks are: World Patent Index Latest, Reuters, Chemical Abstracts, Medline, Science Citation Index.

Page 32

Open Sources Analysis Problems (1)

• Taking into account documents coming from different, multimedia sources, with different languages and formats, written in different styles.

• Reducing the complexity by dividing the total document mass into topics, without having previously defined what these topics are about.

• Presenting a rough description of what these topics are, with an indication of their volume.

Page 33

Open Sources Analysis Problems (2)

• Networks: what relations exist among the topics?
• Signals of change, such as alliances or forecasts, that may be "drowned" in the text.
• The names and actions of new entrants in the market.
• Distinguishing synonyms (different words with the same meaning) and homographs (same words with different meanings).

Page 34

Anti-terrorism tasks requiring analysis of a large quantity of text

• High-tech terrorism. Terrorist groups using chemical weapons, explosives or other non-conventional weapons need experts with specific, sophisticated knowledge. These individuals would have had to study, perform research, attend courses and seminars, and publish papers and books, leaving behind them electronic traces scattered across the web.

Page 35

Anti-terrorism tasks (2)

• Names and relationships detection. New groups emerge each week, and new terrorists are recruited every day. Text mining allows the detection of their names, and their connections with other groups or people.

• Vindications analysis. Often the only traces available after a terrorist attack are the emails or communications claiming the act. The analyst must analyse the style, and the concepts and feelings expressed, to establish connections and patterns between documents, and also look for characteristics of the author such as gender, and for similarities between these anonymous texts and the style of other, well-known groups or people.

• Identifying lobbying. Analysing similarities in public declarations or statements may reveal links between individuals who do not officially have any connection.

Page 36

Other criminal activities

• Spam filtering.

• Insider trading. Compare anomalous peaks and troughs in the value of a stock with company news. A big peak should coincide with a spate of news about the company. If there is no news to spur a trading peak, that is suspected insider trading (see the sketch below).
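A toy sketch of that check, with hypothetical data and an assumed anomaly threshold:

```python
# Illustrative sketch (hypothetical data): flag days with a large
# price move but no accompanying news as possible insider trading.
days = [
    # (date, % price change, number of news items about the company)
    ("2024-03-01", +0.4, 3),
    ("2024-03-02", +6.2, 9),   # big move, lots of news: expected
    ("2024-03-03", -5.8, 0),   # big move, no news: suspicious
]

THRESHOLD = 5.0  # % move considered anomalous (assumed parameter)

for date, change, news in days:
    if abs(change) >= THRESHOLD and news == 0:
        print(f"{date}: {change:+.1f}% on no news -> flag for review")
```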

Page 37

Web Crawlers

• Crawlers are a big help in retrieving the internet pages, and the pages from online data banks, that we need.

• A crawler is a program that retrieves Web pages, commonly for use by a search engine or Web cache, but in our case to be analysed by text mining engines.

• It starts with the URL for a given page. It retrieves this page, and then may go on to retrieve pages linked to this one.

• It must carefully decide which URLs to scan and in what order. It must also decide how frequently to revisit pages it has already seen, in order to keep its client informed of changes on the web.
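A minimal breadth-first crawler sketch under stated assumptions: requests and BeautifulSoup as the libraries, a hypothetical seed URL, and no politeness delays or robots.txt handling, which a real crawler needs.

```python
# Minimal breadth-first crawler sketch (assumed libraries; install
# with: pip install requests beautifulsoup4).
from collections import deque
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def crawl(seed, max_pages=10):
    frontier, seen, pages = deque([seed]), {seed}, {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        pages[url] = html  # hand the text to the mining engine here
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages

pages = crawl("https://example.com")  # hypothetical seed URL
print(len(pages), "pages fetched")
```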

Page 38

Computational Stylometry

5 Approaches to Authorship Attribution

Page 39

1. Historical Research

• Knowledge intensive, beyond current abilities of the computer. E.g. documents mentioning the Spanish Armada must have been written after 1588, and thus could not have been written by anyone who died before 1588.

• See web page by Tom Reedy and David Kathman “How we know that Shakespeare wrote Shakespeare”.

• the name “William Shakespeare” appears on the plays and poems

• WS was an actor in the company that performed the works of WS

• In 1615 Edmund Howes published a list of our "moderne, and present excellent poets", which included WS.

• Some time before 1623, a monument was erected to WS in Stratford, depicting him as a writer.

Page 40

2. Cipher-based author identification

• Cipher-based author identification, see shakespeareauthorship.com/bacpenl.html for details of Penn Leary’s Baconian ciphers.

• A=E B=F C=G D=H E=I F=K G=L H=M I=N K=O L=P M=Q N=R O=S P=T Q=V R=Y S=A T=B V=C Y=D
• (J=I, V=U, VV=W, 0 to 9 = A to K)
• e.g. subject's merits doth → acfnigb'a qiynba hsbm
• that same ignorance → bmeb aeqi nlrsyergi
• boatswain → fsebacenr (first word in "The Tempest")
• This technique supposedly proves Bacon wrote all of Shakespeare's works. Critics point out it also proves he wrote Caesar's Gallic Wars and Moby Dick.
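The substitution table above is easy to implement and check against the slide's own examples. A minimal sketch; following the boatswain example, W is folded onto V (despite the slide's VV=W note):

```python
# Penn Leary's substitution table from the slide, implemented directly.
# J, U and W are folded onto I and V, matching the worked examples.
TABLE = dict(zip("ABCDEFGHIKLMNOPQRSTVY",
                 "EFGHIKLMNOPQRSTVYABCD"))
FOLD = {"J": "I", "U": "V", "W": "V"}

def encipher(text):
    out = []
    for ch in text.upper():
        ch = FOLD.get(ch, ch)
        out.append(TABLE.get(ch, ch))  # pass punctuation through
    return "".join(out).lower()

print(encipher("boatswain"))              # -> fsebacenr, as on the slide
print(encipher("subject's merits doth"))  # -> acfnigb'a qiynba hsbm
```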

Page 41

3. Physical Evidence

• e.g. carbon dating, analysis of handwriting.

• Evidence against Konrad Kujau’s Hitler Diaries:

• the handwriting did not resemble Hitler’s

• the paper, ink and glue of the diaries were manufactured post-WW2.

Page 42

4. Forensic Linguistics

• Forensic linguistics need not be quantitative. See "Forensic Linguistics", a review by L. J. Hurst of Malcolm Coulthard's work.

• Confessions of the Birmingham Six describe one suspect carrying "one white plastic carrier bag", and another carrying "two white plastic carrier bags". These words were those of the police (who were sure that the explosives were in white carrier bags): no one in ordinary speech would use and repeat those abnormally long phrases. No one spoke that way at the trial, nor could any examples be found in the University of Birmingham COBUILD database.

• The Derek Bentley confession repeatedly contains the word “then”, as in “Chris Craig and I then…”, “the policeman then…”. But outside police reports no one puts “then” after the subject of a sentence - it is the police whose reports are obsessed with sequence.

Page 43

Other examples

• Disputed wills

• Ransom notes

• Suicide notes

• Anonymous claims for terrorist acts

Page 44

5. Computational Stylometry

• Computational stylometry is an attempt to capture the essence of the style of a particular author by reference to a variety of quantitative criteria, usually lexical, called discriminators.

• Stylometry is the area in which computationally tractable approaches to author identification have really been developed. However, general author identification software packages have not been produced. Many studies have analysed texts by hand, or have used simple frequency-counting programs to extract the relevant data, which was then processed using statistical software packages or manual statistical analysis.

Page 45

Statistical Techniques (quantitative)

• Statistical techniques require the study of frequently occurring features.
• Traits must be expressible numerically (e.g. Semitisms in the New Testament are difficult to classify).
• Frequently used features are:
• Word and sentence length, e.g. word-length frequency spectra, with mode = 2 for J. S. Mill and 3 for Oliver Twist (Mendenhall).
• Positions of words within sentences, e.g. "gar" as second or third word, correlation -0.47 with time (Morton's chronology of Isocrates).
• Vocabulary studies: the choice and frequency of words, and measures of vocabulary richness, e.g. Yule's K measure (see the sketch after this list).
• Syntactic analysis, e.g. phrase-structure rewrite-rule frequencies used as richness measures (crime fiction: Cave of Shadows).
• The ideal situation for authorship studies is when there are:
• large amounts of undisputed text,
• few contenders for the authorship of the disputed text(s).
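Yule's K, mentioned above as a vocabulary-richness measure, has a standard computational form, K = 10^4 (M2 - M1) / M1^2, where M1 is the number of tokens and M2 = sum over m of m^2 V(m), with V(m) the number of word types occurring m times. A minimal sketch:

```python
# Yule's K characteristic, a vocabulary-richness measure that is
# roughly independent of text length.
from collections import Counter

def yules_k(tokens):
    freqs = Counter(tokens)               # word -> frequency
    spectrum = Counter(freqs.values())    # frequency m -> V(m) types
    m1 = sum(freqs.values())              # total tokens
    m2 = sum(m * m * vm for m, vm in spectrum.items())
    return 10_000 * (m2 - m1) / (m1 * m1)

text = "the cat sat on the mat and the dog sat too".split()
print(yules_k(text))  # higher K = more repetitive vocabulary
```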

Page 46

Experimental Methodology

• The experimental methodology is to build corpora:
• A - works definitely by author A
• B - works definitely by author B
• C - works of disputed authorship, but probably written by A or B.
• Then select discriminants and associated measures.
• When the technique has been shown to discriminate effectively between A and B, try it on C.

Page 47

A Widow and Her Soldier (1)

• Stylometry and the American Civil War (Holmes, Gordon and Wilson).

• Fifty years after the American Civil War, General George Pickett's widow, LaSalle Corbell Pickett, published letters purportedly written by her husband, many of them from the field of battle itself. Historians are divided as to their authenticity, and this paper describes a stylometric investigation into the Pickett letters.

• The Burrows technique essentially picks the N most common words in the corpus under investigation and computes the occurrence rate of these N words in each text, thus converting each text into an N dimensional array of numbers. Multivariate statistical techniques are then applied to the data to look for patterns. The two techniques most commonly employed are principal components analysis (PCA) and cluster analysis.
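A minimal sketch of the Burrows-style pipeline under stated assumptions: toy texts, N = 5 marker words, and scikit-learn's PCA as the multivariate step (the paper's own tooling is not specified here).

```python
# Burrows-style analysis sketch: represent each text by the rates of
# the N most common words, then project with PCA. Toy data; a real
# study would use full texts and a larger N.
from collections import Counter
import numpy as np
from sklearn.decomposition import PCA

texts = {  # hypothetical mini-corpus
    "lasalle_1": "the and of to in the and a the of",
    "george_1":  "to the of and the in a and to of",
    "disputed":  "the of and the to in and a of the",
}
tokenised = {name: t.split() for name, t in texts.items()}

# N most common words across the whole corpus under investigation.
N = 5
all_tokens = [w for toks in tokenised.values() for w in toks]
markers = [w for w, _ in Counter(all_tokens).most_common(N)]

# Occurrence rate of each marker word in each text.
rows = []
for name, toks in tokenised.items():
    counts = Counter(toks)
    rows.append([counts[w] / len(toks) for w in markers])

# Plot-ready coordinates in the space of the first two components.
scores = PCA(n_components=2).fit_transform(np.array(rows))
for name, (pc1, pc2) in zip(tokenised, scores):
    print(f"{name}: PC1={pc1:+.3f} PC2={pc2:+.3f}")
```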

Page 48

A Widow and Her Soldier (2)

• PCA notes (from Holmes and Forsyth): Principal Components Analysis (PCA) aims to transform the observed variables into a new set of variables which are uncorrelated and arranged in decreasing order of importance. These new variables (or components) are combinations of the original variables, and it is hoped that the first few components will account for most of the variation in the original data. If we can account for most of the variation using just two components, we can plot each text on a two-dimensional graph.

• Typically the data are plotted in the space of just the first two components.

Page 49

A Widow and Her Soldier (3)

• Samples were taken from:
• + LaSalle Pickett's autobiography
• O LaSalle's personal letters
• + George's personal pre-war and post-war letters
• * George's war reports
• ♦ Inman papers, genuine handwritten letters by George
• Δ Walter Harrison's book "Pickett's Men" (a possible source of material in the letters).
• The investigation strongly suggests that LaSalle Pickett composed the published letters herself.

Page 50

[Figure: PCA scatter plot of the text samples in the space of the first two components; x-axis "REGR factor score 1", y-axis "REGR factor score 2".]