Corpus Linguistic Processing Problems
Mike Scott
School of English
University of Liverpool
Charles University, Prague, 22.5.06
This presentation is at www.lexically.net/downloads/corpus_linguistics
Abstract
This lecture considers the problems of handling and analysing sizeable corpora using standard PC technology. Issues to be addressed include the problem of dealing with text in a variety of formats, what constitutes a text boundary, memory versus disk storage, and retrieval from a hard disk of relevant texts which can be said to be “about” a given topic.
H.G. Wells World Brain (1938)
This World Encyclopaedia would be the mental background of every intelligent man in the world. It would be alive and growing and changing continually under revision, extension and replacement from the original thinkers in the world everywhere. Every university and research institute would be feeding it … its contents would be the standard source of material… (in Witten et al 1999:435)
Issues and Questions
Retrieval – Queries
Text formats
Text boundaries
Storage
Finding relevant text data on a hard disk
Part 1: Retrieval
What we do with a Corpus: search it
Text Focus
1. Find texts meeting certain criteria
2. Discover characteristics of text-types
Language Focus
1. Find words / phrases / structures meeting certain criteria
2. Discover characteristics of words / phrases / structures
Find text-types with these characteristics
“Query” Operations
List all instances of X
Addition Operations
merge documents
insert into list
Removal Operations
split documents
delete from list
View Operations
re-order
see wider context
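The query and view operations above can be sketched as a small concordancer. This is only an illustration under stated assumptions: the function name kwic, the sample sentence and the window widths are invented for the example and do not describe any particular tool.

```python
import re

def kwic(text, word, width=30):
    """List all instances of `word` as (left context, match, right context)."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(0), right))
    return lines

# Hypothetical sample text, invented for this sketch.
sample = "The corpus is large. A corpus can be searched. Search the corpus index."

# "List all instances of X"
hits = kwic(sample, "corpus", width=12)

# "re-order": sort the lines by what follows the search word
by_right = sorted(hits, key=lambda h: h[2].lower())

# "see wider context": the same query with a larger window
wider = kwic(sample, "corpus", width=40)

for left, node, right in by_right:
    print(f"{left:>12} [{node}]{right}")
```

Sorting on the right-hand context is one possible re-ordering; sorting on the left context or on text attributes would work the same way.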
Text Attributes
Date
Authorship
Readership / audience
Location
Participants
Length
Format (encoding)
Language
Style
Mode
Domain
Availability
Meaning (aboutness)
etc.
Simple Query Types
identical to topic/wording X
similar to topic X
touches on topic X
quotes text X
quoted by text Y
refers/alludes to text X
referred to in text Y
Complex Queries
More than 1 simple query type, and/or more than 1 text attribute …
…in Boolean combinations (and, or, not)
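A complex query of this kind can be sketched by treating each simple query as a yes/no test on a text and combining the tests with and, or, not. The helper names (attr, mentions, and_, or_, not_), the attribute names and the three sample records are all hypothetical, chosen only for illustration.

```python
from typing import Callable, Dict

Text = Dict[str, str]
Query = Callable[[Text], bool]

def attr(name: str, value: str) -> Query:
    """Simple query on one text attribute (e.g. language, date)."""
    return lambda t: t.get(name) == value

def mentions(phrase: str) -> Query:
    """Crude stand-in for 'touches on topic X': the phrase occurs in the body."""
    return lambda t: phrase.lower() in t.get("body", "").lower()

def and_(*queries: Query) -> Query:
    return lambda t: all(q(t) for q in queries)

def or_(*queries: Query) -> Query:
    return lambda t: any(q(t) for q in queries)

def not_(query: Query) -> Query:
    return lambda t: not query(t)

# Hypothetical records, invented for this sketch.
texts = [
    {"id": "t1", "language": "en", "date": "2001", "body": "Corpus retrieval on a hard disk"},
    {"id": "t2", "language": "cs", "date": "2002", "body": "Korpus"},
    {"id": "t3", "language": "en", "date": "2003", "body": "Text boundaries are fuzzy"},
]

# English AND (mentions "corpus" OR dated 2003) AND NOT dated 2001
query = and_(
    attr("language", "en"),
    or_(mentions("corpus"), attr("date", "2003")),
    not_(attr("date", "2001")),
)
result = [t["id"] for t in texts if query(t)]
print(result)
```

Substring matching is a weak proxy for aboutness; it is used here only to keep the example short.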
Part 2: Text Formats
The chaos of text formats
Character formats
Text formats
Characters
“Legacy” formats from the 1980s (e.g. DOS and its forerunners)
Unicode (now at version 5 beta):
“Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.
These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.”
http://www.unicode.org/standard/WhatIsUnicode.html
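The conflict described in the quotation can be seen with a single byte. The sketch below uses Python's standard codecs; the choice of the byte 0xE9 and of the three encodings (the DOS code page cp437, Latin-1 and UTF-8) is just one convenient illustration.

```python
# One number, two different characters:
raw = bytes([0xE9])
as_latin1 = raw.decode("latin-1")   # e-acute in Latin-1
as_cp437 = raw.decode("cp437")      # Greek capital theta in the DOS code page

# One character, different numbers:
e_acute = "\u00e9"
numbers = {enc: e_acute.encode(enc) for enc in ("cp437", "latin-1", "utf-8")}

# Corruption when data passes between encodings:
# text saved as UTF-8 but read back as if it were Latin-1.
garbled = "caf\u00e9".encode("utf-8").decode("latin-1")

print(as_latin1, as_cp437)
print(numbers)
print(garbled)
```

A corpus builder who does not know which encoding a legacy file was saved in faces exactly this ambiguity for every byte above 127.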
Text Processing
Unix, Windows, Mac – can each handle some aspects of texts differently, e.g. how they process ends of lines
Word .doc v. RTF v. HTML v. XML: extra information built into the text
Prague.doc = 26,064 bytes
Prague.xml = 8,534 bytes
Prague.rtf = 7,893 bytes
Prague.htm = 7,272 bytes
Prague.txt = 8 bytes
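Both points on this slide can be put into a few lines of code: the end-of-line difference between platforms, and the amount of extra information each format adds. The function names and the mixed sample string are assumptions made for the sketch; the byte counts are the ones listed above.

```python
def count_line_endings(data: bytes) -> dict:
    """Count Windows (CRLF), Unix (LF) and old Mac (CR) line endings."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "LF": data.count(b"\n") - crlf,
        "CR": data.count(b"\r") - crlf,
    }

def normalise_newlines(data: bytes) -> bytes:
    """Convert every line ending to a single LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

# Hypothetical sample with all three conventions mixed together.
mixed = b"one\r\ntwo\nthree\rfour"
counts = count_line_endings(mixed)
clean = normalise_newlines(mixed)

# File sizes in bytes, as listed on the slide.
sizes = {
    "Prague.doc": 26064,
    "Prague.xml": 8534,
    "Prague.rtf": 7893,
    "Prague.htm": 7272,
    "Prague.txt": 8,
}
plain = sizes["Prague.txt"]
overhead = {name: size / plain for name, size in sizes.items()}

print(counts)
for name, ratio in overhead.items():
    print(f"{name}: {ratio:.1f} times the size of the plain text")
```

The ratios are extreme because the plain-text version is so short; for longer texts the fixed part of the format overhead matters less.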
Part 3: Text Boundaries
The Colony
“… a colony is a discourse whose component parts do not derive their meaning from the sequence in which they are placed.” (Hoey 1986: 4)
Examples of Colony Texts
“shopping lists, letter pages, dictionaries, hymn books, exam papers, concordances, small ads, class lists, bibliographies (to papers), abstracts (in volume form), constitutions, address books, newspapers, encyclopaedias, cookery books, seminar programmes, journals, certain kinds of reference books (e.g. Films on TV), footnotes to literary works, telephone directories, the Book of Proverbs, the Radio Times (and other TV magazines), gardening columns (sometimes), horoscopes (in newspapers), conference proceedings, menus…” (Hoey 1986:5)
Features of the Colony
1. Meaning not derived from sequence;
2. Adjacent units do not form continuous prose;
3. There is a framing context;
4. No single author and/or anon;
5. One component may be used without referring to the others;
6. Components can be reprinted or reused in subsequent works;
7. Components may be added, removed or altered;
8. Many of the components serve the same function;
9. Alphabetic, numeric or temporal sequencing.
(Hoey 1986:20)
“Mainstream” Texts
share some of these features: they
refer to
quote
allude to
share meaning with
other texts.
Part 4: Storage
Corpus Storage
Usually done with folders and sub-folders, taking some text attribute (often date) as a general key
Sometimes (BNC) the opportunity to make the filename informative has been wasted
But a tree is not the best way to access corpus contents…
Corpus
2001 2002 2003
because
of what we saw in Part 1: there are a number of different text attributes
any of which at different times may guide a given research query
given the unpredictability of research goals
so
a better strategy would be to let the component texts remain wherever they happen to be:
in emails
in .doc, .html, .xml files
in previous corpora (.txt usually)
and access them by an index structure
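One way to picture such an index structure is a catalogue that records where each text lives and looks texts up by any attribute, rather than by position in a folder tree. This is a minimal sketch, not a description of an existing system: the class CorpusIndex, its methods, the attribute names and all file locations are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Entry:
    location: str                      # where the text happens to be
    attributes: dict = field(default_factory=dict)

class CorpusIndex:
    def __init__(self):
        self.entries = []
        # (attribute name, value) -> positions of matching entries
        self.by_attribute = defaultdict(set)

    def add(self, location, **attributes):
        position = len(self.entries)
        self.entries.append(Entry(location, attributes))
        for name, value in attributes.items():
            self.by_attribute[(name, value)].add(position)
        return position

    def find(self, **criteria):
        """Return locations of texts matching every given attribute."""
        sets = [self.by_attribute.get(item, set()) for item in criteria.items()]
        hits = set.intersection(*sets) if sets else set(range(len(self.entries)))
        return [self.entries[i].location for i in sorted(hits)]

# Hypothetical locations: the texts stay where they are, only the index is built.
index = CorpusIndex()
index.add("mail/inbox.mbox", format="email", year="2003", language="en")
index.add("docs/Prague.doc", format="doc", year="2006", language="en")
index.add("web/page.html", format="html", year="2003", language="cs")
index.add("corpora/old/a01.txt", format="txt", year="2001", language="en")

print(index.find(year="2003"))
print(index.find(language="en", year="2003"))
```

Unlike a folder tree keyed on date, no single attribute is privileged: the same index answers a query by year, by language or by format.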
Part 5: Finding
Accessing relevant corpus texts
via the index
with a mechanism for determining & then labelling each text’s
format
start and end
aboutness
language, authorship etc.
A database solution.
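As a rough sketch of what a database solution might look like, the example below uses Python's built-in sqlite3 module. The table layout, the column names and the four rows are assumptions invented for illustration; in particular the "about" labels would have to come from some separate mechanism for determining aboutness.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE texts (
        id           INTEGER PRIMARY KEY,
        path         TEXT NOT NULL,   -- file the text lives in
        text_format  TEXT,
        start_offset INTEGER,         -- where the text starts in the file
        end_offset   INTEGER,         -- where it ends
        language     TEXT,
        author       TEXT,
        about        TEXT
    )
""")

# Hypothetical rows: one newspaper file holds two separately labelled texts.
rows = [
    ("news/2003-05-01.txt", "txt", 0, 1200, "en", None, "gardening"),
    ("news/2003-05-01.txt", "txt", 1200, 2050, "en", None, "horoscopes"),
    ("docs/report.doc", "doc", 0, 26064, "en", "A. Author", "gardening"),
    ("web/zahrada.html", "html", 0, 7272, "cs", None, "gardening"),
]
conn.executemany(
    "INSERT INTO texts (path, text_format, start_offset, end_offset,"
    " language, author, about) VALUES (?, ?, ?, ?, ?, ?, ?)",
    rows,
)

# English texts about gardening that are not Word documents.
found = conn.execute(
    "SELECT path, start_offset, end_offset FROM texts"
    " WHERE language = ? AND about = ? AND text_format <> ?"
    " ORDER BY path, start_offset",
    ("en", "gardening", "doc"),
).fetchall()
print(found)
```

Storing start and end offsets lets one file contain several texts, which suits colony-like sources such as newspapers.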
Conclusions
1. Only a sub-set of retrieval methods is catered for at present
2. Text formats represent a significant problem for corpus builders
3. Text boundaries are often (always?) quite fuzzy if one is interested in meaning
4. Storage has traditionally been organised in discrete corpora
5. But it would be better to organise a discrete index instead.
…which is not very different from…
References:
Aston, Guy & Lou Burnard (1998) The BNC Handbook. Edinburgh: Edinburgh University Press.
Hoey, M. (1986) “The Discourse Colony: a preliminary study of a neglected discourse type”, in M. Coulthard (ed.) Talking About Text. Birmingham: English Language Research Discourse Analysis Monographs no. 13, pp. 1-26.
Scott, Mike & Chris Tribble (2006) Textual Patterns: key words and corpus analysis in language education. Amsterdam: Benjamins.
Wells, H.G. (1938) World Brain. New York: Doubleday.
Witten, I.H., A. Moffat & T.C. Bell (1999) Managing Gigabytes. 2nd edition. San Francisco: Morgan Kaufmann.