Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating...

45
Mike Scott, School of English University of Liverpool Keyness 1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield 15 February 2008

Transcript of Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating...

Page 1: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

Mike Scott, School of EnglishUniversity of Liverpool

Keyness 1

…challenges in

investigating Keyness

NLP Group Computer Science

University of Sheffield

15 February 2008

Page 2: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

Purpose

1. To explore the notion of keyness2. and its implications in corpus-based

study3. and to consider concgrams4. and key concgrams5. all with reference to WordSmith

Keyness 2

Page 3: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

Overview

Keyness, as a new territory, looks promising and has attracted colonists and prospectors. It generally appears to give robust indications of the text’s aboutness together with indicators of style.

Keyness 3

Page 4: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

the text’s aboutness

Keyness 4

Page 5: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

Issuesthe issue of text section v. text v. corpus

v. sub-corpusstatistical questions: what exactly can

be claimed?how to choose a reference corpushandling related forms such as

antonyms

Keyness 5

Page 6: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

Machine and Human KWSRigotti and Rocci (2002) warn that

machine identification of key words omits all interpretation of the writer’s intentions, cannot get at cultural implications and does not spot the congruity of the meanings of each section with the next.

Keyness 6

Page 7: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

metaphors“In our view, a natural language text, slippery

and vague as it may be, is not a stone soup where words float free, tied only to their multiple associations within a Foucoultian discourse” (Rigotti and Rocci, 2002)

Keyness 7

Page 8: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

Of course it doesn’t actually understand…

Keyness 8

Page 9: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

… or know what is “correct”

Keyness 9

Page 10: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

… only look at what is found in text

Keyness 10

… or context

… whether marked up or not …

Page 11: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

Context?

Keyness 11

Page 12: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

Keyness 12

Levels of Context

P hysica l env ironm ent

Page 13: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

If so

what is the status of the “key words” one may identify and what is to be done with them?

Keyness 13

Page 14: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

Issues1. the issue of text section v. text v. corpus

v. sub-corpus

2. statistical questions: what exactly can be claimed?

3. how to choose a reference corpus

4. handling related forms such as antonyms

5. what is the status of the “key words” one may identify and what is to be done with them?

Keyness 14

Page 15: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

text section v. text v. corpus v. sub-corpus

text section: levels 1-5text: level 6corpus: levels 7 & 8

Keyness 15

Page 16: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

But these are often not clearly differentiated

“text”, level 6: with or without mark-up, images, sounds?

what do we mean by section, chapter (4) and other non linguistically defined categories?

is text itself mutating?

Keyness 16

Page 17: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

Internet text

Keyness 17

Page 18: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

Wikipedia homepage (part)

Keyness 18

Page 19: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

Wikipedia homepage (part)

Keyness 19

Page 20: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

Wikipedia article(3 parts of same article)

Keyness 20

Page 21: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

Wikipedia discussionfrom History of the stall articlelatest contributor, “Talk” section

Keyness 21

Page 22: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

statistical issuesp value is a well-established standard, relying

on the notion of chance, random effectsbut

if you run lots of comparisons some will spuriously (by chance) appear significant

if we’re operating at the level of word or cluster, text itself doesn’t consist of randomly ordered words

Keyness 22

Page 23: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

Implicationthere is no statistical defence of the whole set

of KWsbut only of each onecomparing KW p values is not advisable

Keyness 23

Page 24: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

the South Utsere farmer3 crops

bananasbarleychick-peas

Keyness 24

3 problemsstormwind

drought

Page 25: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

choosing a reference corpususing a mixed bag RC, the larger the RC the better but

a moderate sized RC may suffice. the keyword procedure is fairly robust. KWs identified even by an obviously absurd RC can be

plausible indicators of aboutness, which reinforces the conclusion that keyword analysis is robust.

genre-specific RCs identify rather different KWsthe aboutness of a text may not be one thing but

numerous different ones.

Scott (forthcoming)

Keyness 25

Page 26: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

related forms such as antonyms

Keyness 26

Page 27: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

status of the “key words”

Keyness 27

Page 28: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

Concgram section

Keyness 28

Page 29: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

What is a “concgram”?For years it has been easy to search for or

identify consecutive clusters (n-grams) such asAT THE END OFMERRY CHRISTMASTERM TIME.

It has also been possible to find non-consecutive linkages such as STRONG within the horizons of TEA by adapting searches to find context words.

In such cases we might get...STRONG blah blah blah TEA......TEA blah blah blah blah STRONG...etc.

29

Page 30: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

The concgram procedure takes a whole corpus of text and finds all sorts of combinations like

...STRONG blah blah blah TEA... ...TEA blah blah blah blah STRONG... ...STRONG TEA... ...TEA STRONG...

whether consecutive or not...

a sequence (n-gram)within a concordance span. (“skip-gram” is used (Wilks 2005) to describe non-

contiguous word associations but doesn’t include TEA […] STRONG)

30

Page 31: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

Cheng, Greaves & Warren (2006)For our purposes, a ‘concgram’ is all of the

permutations of constituency variation and positional variation generated by the association of two or more words. This means that the associated words comprising a particular concgram may be the source of a number of ‘collocational patterns’ (Sinclair 2004:xxvii). In fact, the hunt for what we term ‘concgrams’ has a fairly long history dating back to the 1980s (Sinclair 2005, personal communication) when the Cobuild team at the University of Birmingham led by Professor John Sinclair attempted, with limited success, to devise the means to automatically search for non-contiguous sequences of associated words.Cheng, Greaves & Warren (2006:414)

31

Page 32: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

ConcGram (©) aims to be"a search-engine, which on top of the

capability to handle constituency variation (i.e. AB, ACB), also handles positional variation (i.e. AB, BA), conducts fully automated searches, and searches for word associations of any size." (2006:413)

WSConcGram is developed in homage to this idea.

32

Page 33: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

The goal

33

Cheng, Greaves & Warren (2006:426)

Page 34: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

A ProblemGreaves (2007) reported that ConcGram

requires months using numerous linked PCs to generate the 5-word concgrams based on a corpus of some 5 million words.

There are a lot of combinations to take into account…

34

Page 35: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

Implementation in WordSmithThe plan: to process a corpus of adequate size (say

10 or more million words)find all instances of all frequently co-occurring

wordsco-occurring within a given spanidentify them as potential

pairs (sup ... with)triplets (light ... an ... dark)quadruplets (with ... the ... of ... war)quintuplets (eyes ... are ... full ... of ... tears) etc. (a ... light ... condition ... in ... beauty ... dark)

determine whether they are significantly associated35

Page 36: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

Stagesdesign the procedureplan human interface aspects

36

Page 37: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

Procedures and Routines

WS3-5’s WordList index function already knew how to process a corpus and identify all instances of each word in it...

37

Page 38: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

Index“A kingdom for a stage, princes to act

And monarchs to behold the swelling scene.” (Henry V)

38

Huge file of records containing token data:word_type_number,next_known_token,file_number,file_byte_position

R1, R2, R3 ... RNR1’s word_type_number=1; (a)R2’s word_type_number=2; (kingdom)R3’s word_type_number=3; (for)R4’s word_type_number=1; (a)R5’s word_type_number=4; (stage)R6’s word_type_number=5; (princes)R7’s word_type_number=6; (to)R8’s word_type_number=7; (act)R9’s word_type_number=8; (and)R10’s word_type_number=9; (monarchs)R11’s word_type_number=6; (to)

Huge file of records containing token data:word_type_number,next_known_token,file_number,file_byte_position

R1, R2, R3 ... RNR1’s word_type_number=1; (a)R2’s word_type_number=2; (kingdom)R3’s word_type_number=3; (for)R4’s word_type_number=1; (a)R5’s word_type_number=4; (stage)R6’s word_type_number=5; (princes)R7’s word_type_number=6; (to)R8’s word_type_number=7; (act)R9’s word_type_number=8; (and)R10’s word_type_number=9; (monarchs)R11’s word_type_number=6; (to)

Page 39: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

WSConcGram procedures (1)process the index, looking at each instance of

words above a certain threshold frequency (e.g. 5 instances)

considering all its neighbours within a given span (e.g. 5)

finding all pairs repeated more than a threshold number of times

saving in a file data on where in the corpus each pair is to be found

39

Page 40: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

WSConcGram procedures (2)

sort the file of pairsprocess it, finding overlaps,

e.g. where HOW and MATTER and NOW are all found within the default span

sorting the resulting triplets, quadruplets, etc and

storing them in another file

40

Page 41: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

WSConcGram Filesshakespeare.typesshakespeare.tokensshakespeare.base_pairsshakespeare.base_indexshakespeare.base_index_cg

41

Page 42: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

Human Interface Aspectssorting concgrams by frequency and alphabeticallydisplaying the root word typeschoosing concgram formsclustering them in treesfiltering according to

statistical propertiesrequired word(s)

other needscopy, save as .txt, print, print previewconcordance selected concgrams

42

Page 43: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

Problem areascomputing the association statistics correctlyclustering each concgramshowing & hiding parts of a concgramordering them in a tree structure

43

Page 44: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

References Berber Sardinha, Tony, 1999. Using Key Words in Text Analysis: practical

aspects. DIRECT Papers 42, LAEL, Catholic University of São Paulo. Berber Sardinha, Tony, 2004. Lingüística de Corpus. Barueri: Manole. Cheng, Winnie, Chris Greaves & Martin Warren, 2006. “From n-gram to

skipgram to concgram”, International Journal of Corpus Linguistics, Vol. 11, No. 4, pp. 411-433.

Greaves, Chris, 2007. Demonstration of ConcGram. Keyness in Text conference, Certosa di Pontignano, Tuscany, Italy, 26-30 June 2007.

Culpeper, J. ,2002. 'Computers, language and characterisation: An Analysis of six characters in Romeo and Juliet'. In: U. Melander-Marttala, C. Östman and M. Kytö (eds.), Conversation in Life and in Literature: Papers from the ASLA Symposium, Association Suedoise de Linguistique Appliquée (ASLA), 15. Universitetstryckeriet: Uppsala, pp.11-30.

Kemppanen, Hannu 2004. Keywords and Ideology in Translated History Texts: A Corpus-based Analysis. Across Languages and Cultures 5 (1), 89-106

Rigotti, Eddo and Andrea Rocci, 2002. From Argument Analysis to Cultural Keywords (and back again). http://www.ils.com.unisi.ch/articoli-rigotti-rocci-keywords-published.pdf (accessed May 2007). In F. H. van Eemeren et al, Proceedings of the 5th Conference of the International Society for the Study of Argumentation. Amsterdam: SicSat. pp. 903-908.

Scott, M., 1996 with new versions in 1997, 1999, 2004, Wordsmith Tools, Oxford: Oxford University Press.

Scott, M., 1997a. "PC Analysis of Key Words -- and Key Key Words", System, Vol. 25, No. 1, pp. 1-13.

Scott, M., 1997b. "The Right Word in the Right Place: Key Word Associates in Two Languages", AAA - Arbeiten aus Anglistik und Amerikanistik, Vol. 22, No. 2, pp. 239-252. 44

Page 45: Mike Scott, School of English University of Liverpool Keyness1 …challenges in investigating Keyness NLP Group Computer Science University of Sheffield.

References Scott, M., 2000a. ‘Focusing on the Text and Its Key Words’, in L. Burnard & T. McEnery

(eds.), Rethinking Language Pedagogy from a Corpus Perspective, Volume 2. Frankfurt: Peter Lang., pp. 103-122.

Scott, M. 2000b. Reverberations of an Echo, in B. Lewandowska-Tomaszczyk & P.J. Melia (eds.) PALC’99: Practical Applications in Language Corpora. Lodz Studies in Language, Volume 1. Frankfurt: Peter Lang., pp. 49-68.

Scott, M., 2001. ‘Mapping Key Words to Problem and Solution’ in M. Scott & G. Thompson (eds.) Patterns of Text: in honour of Michael Hoey, Amsterdam: Benjamins, pp. 109-127.

Scott, M., 2002. ‘Picturing the key words of a very large corpus and their lexical upshots – or getting at the Guardian’s view of the world’ in B. Kettemann & G. Marko (eds.) Teaching and Learning by Doing Corpus Analysis, Amsterdam: Rodopi, pp. 43-50 and cd-rom within the cover of the book.

Scott, M. 2006. "The Importance of Key Words for LSP" in Arnó Macià, E., A. Soler Cervera & C. Rueda Ramos (eds.), Information Technology in Languages for Specific Purposes: issues and prospects. New York: Springer, pp. 231-243.

Scott. M. (forthcoming) In Search of a Bad Reference Corpus. AHRC Methods Network. Scott, M. & Tribble, C., 2006. Textual Patterns: keyword and corpus analysis in

language education, Amsterdam: Benjamins. Seale C, Charteris-Black J, Ziebland S. 2006. Gender, cancer experience and internet

use: a comparative keyword analysis of interviews and online cancer support groups. Social Science and Medicine. 62, 10: 2577-2590

Tribble, Chris, 1999, "Genres, keywords, teaching: towards a pedagogic account of the language of project proposals" in L. Burnard & A. McEnery (eds.) Rethinking Language Pedagogy from a Corpus Perspective: Papers from the Third International Conference on Teaching and Language Corpora, (Lodz Studies in Language). Hamburg: Peter Lang.

Wilks, Yorick, 2005. REVEAL: the notion of anomalous texts in a very large corpus. Tuscan Word Centre International Workshop. Certosa di Pontignano, Tuscany, Italy, 31 June–3 July 2005 (cited in Cheng et al.)

Keyness 45