Corpora in language variation studies
Corpus Linguistics
Richard Xiao
Aims of this session
• Lecture
  – Biber’s (1988) MF/MD approach
  – Xiao’s (2009) enhanced MDA model
  – Case study of world Englishes
• Lab session
  – Using Xaira to explore the distribution of passives across genres in FLOB
Corpora vs. register and genre analysis
• “Register” and “genre” are two terms that are often used interchangeably
• The corpus-based approach is well suited to the study of register variation and genre analysis
  – A corpus is created using external criteria, which define different registers and genres
  – Corpora, especially balanced sample corpora, typically cover a wide range of registers or genres
• Biber’s (1988) MF/MD analytical framework is the most powerful tool for approaching register variation and genre analysis
Biber’s MF/MD approach
• Established in Biber (1988): Variation across Speech and Writing (CUP)
  – Factor analysis of 67 functionally related linguistic features
  – 481 text samples, amounting to 960,000 running words
    • LOB
    • London-Lund corpus
    • Brown corpus
    • A collection of professional and personal letters
Factor analysis
• The key to the multidimensional analysis approach
• A common data reduction method available in many standard statistics packages
  – e.g. SPSS: “Analyze – Data reduction – Factor analysis”
• Reducing a large number of variables to a manageable set of underlying “factors” (“dimensions”)
  – e.g. questions + 1st/2nd person pronouns vs. passives + nominalization
• Extensively used in social sciences to identify clusters of inter-related variables
Methodological overview
1. Collect texts with register information
2. Collect a set of potential (functionally related) linguistic features to analyze (usually based on literature review)
3. Automatically tag texts with linguistic features, post-editing where necessary
4. Compute frequency of co-occurrence patterns of linguistic features using factor analysis
  • Functional interpretation of co-occurrence patterns (i.e. dimensions of variation) through analysis of co-occurring features
5. Sum the factor scores of features on each dimension
  • Mean dimension scores for each register are used to analyze similarities and differences
• Two ways of doing MDA in genre analysis
  – Following Biber’s model and factor scores
  – Establishing your own MDA model
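The last step of the overview (summing feature scores per dimension, then averaging per register) can be sketched as follows. This is a minimal illustration, not Biber's or Xiao's actual implementation. It assumes each feature's normalised frequency is standardised to a z-score before summing, and that features with negative loadings are subtracted. The feature names, registers and frequencies are hypothetical toy data.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardise frequencies to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def dimension_scores(freqs, positive, negative):
    """Per-text score: sum of z-scores of positive features minus negative ones.

    freqs maps a feature name to a list of normalised frequencies, one per text.
    """
    z = {name: z_scores(vals) for name, vals in freqs.items()}
    n_texts = len(next(iter(freqs.values())))
    return [
        sum(z[f][i] for f in positive) - sum(z[f][i] for f in negative)
        for i in range(n_texts)
    ]

def register_means(scores, registers):
    """Average the per-text dimension scores within each register."""
    by_register = {}
    for score, register in zip(scores, registers):
        by_register.setdefault(register, []).append(score)
    return {register: mean(vals) for register, vals in by_register.items()}

# Hypothetical toy data: four texts, two features (frequencies per 1,000 words)
freqs = {
    "second_person_pronouns": [30.0, 28.0, 2.0, 4.0],
    "nominalisations": [5.0, 7.0, 40.0, 36.0],
}
registers = ["conversation", "conversation", "academic", "academic"]

scores = dimension_scores(
    freqs,
    positive=["second_person_pronouns"],  # assumed positive loading
    negative=["nominalisations"],         # assumed negative loading
)
means = register_means(scores, registers)
print(means)
```

With this toy data the conversation texts end up with a positive mean score and the academic texts with a negative one, which is the kind of contrast the register means are meant to expose.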
How does factor analysis work?
• Build a correlation matrix of all variables (i.e. linguistic features)
• From this, determine the loading (or weight) of each linguistic feature on each factor
  – Loading tells us to what degree we can generalize from this factor to the linguistic feature
  – Positive loading = positive correlation (likewise for negative)
  – A higher absolute loading = the feature is more representative of a factor/dimension or register/genre
• Biber discarded features with an absolute loading under the cut-off point of 0.35
  – Features are only kept on the factor they had the highest loading for (even if they occur on 2+ with loadings above 0.35): one feature, one factor/dimension
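Two of the steps above can be sketched in code: building the correlation matrix, and applying the 0.35 cut-off with the "one feature, one factor" rule. The sketch leaves out factor extraction itself, which would normally be done in a statistics package. The feature names, frequencies and loadings below are hypothetical values chosen for illustration only.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long lists of frequencies."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(features):
    """Correlate every feature with every other feature."""
    names = list(features)
    return {a: {b: pearson(features[a], features[b]) for b in names} for a in names}

def assign_features(loadings, cutoff=0.35):
    """Keep each feature only on the factor where its absolute loading is highest,
    and drop it entirely if that loading is below the cut-off."""
    assigned = {}
    for feature, row in loadings.items():
        best = max(range(len(row)), key=lambda i: abs(row[i]))
        if abs(row[best]) >= cutoff:
            assigned[feature] = (best + 1, row[best])  # factors numbered from 1
    return assigned

# Hypothetical frequencies (per 1,000 words) in four texts
features = {
    "questions": [8.0, 6.0, 1.0, 0.0],
    "passives": [2.0, 3.0, 12.0, 14.0],
}
corr = correlation_matrix(features)

# Hypothetical loadings on two factors
loadings = {
    "questions": [0.80, 0.40],      # above the cut-off on both, kept on factor 1 only
    "passives": [-0.70, 0.10],      # negative loading on factor 1
    "time_adverbs": [0.20, 0.30],   # below the cut-off everywhere, discarded
}
assigned = assign_features(loadings)
print(assigned)
```

Here "questions" loads above the cut-off on both factors but is kept only on factor 1, where its loading is highest.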
Biber’s MF/MD approach
• Biber’s seven factors / dimensions
  – 1) Informational vs. involved production
  – 2) Narrative vs. non-narrative concerns
  – 3) Explicit vs. situation-dependent reference
  – 4) Overt expression of persuasion
  – 5) Abstract vs. non-abstract information
  – 6) Online informational elaboration
  – 7) Academic hedging
Biber’s MF/MD approach
• Factors 1, 3 and 5 are associated with “oral” and “literate” differences in English
• The spoken vs. written distinction is too broad
  – Spoken and written registers can be similar in some dimensions but differ in others
• “Each dimension is associated with a different set of underlying communicative functions, and each defines a different set of similarities and differences among genres. Consideration of all dimensions is required for an adequate description of the relations among spoken and written texts.” (Biber 1988: 169)
Biber’s MF/MD approach
• The primary motivations for the MDA approach are two assumptions (Biber 1995)
  – Generalizations about register variation in a language must be based on analysis of the full range of spoken and written registers
  – No single linguistic parameter is adequate in itself to capture the range of similarities and differences among spoken and written registers
Biber’s MF/MD approach
• Biber’s MF/MD approach has been well received as it establishes a link between form and function
• Influential and widely used
  – Synchronic analysis of specific registers / genres and author styles
  – Diachronic studies describing the evolution of registers
  – Register studies of non-Western languages and contrastive analyses
  – Research on university English and materials development
  – Move analysis and study of discourse structure
• Biber’s initial MDA model is largely confined to lexical and grammatical categories
The enhanced MDA model
• Xiao (2009) seeks to enhance Biber’s MDA by incorporating semantic components with grammatical categories
  – Wmatrix = CLAWS + USAS
  – A total of 141 linguistic features investigated
    • 109 features retained in the final model
  – Five million words in 2,500 text samples, with one million words in 500 samples for each of the 5 varieties of English
    • ICE – GB, HK, India, Singapore, the Philippines
    • 300 spoken + 200 written samples
    • 12 registers ranging from private conversation to academic writing
[Xiao, R. (2009) Multidimensional analysis and the study of world Englishes. World Englishes 28(4): 421-450.]
ICE registers and proportions
S1A (20%) Spoken – Private
S1B (16%) Spoken – Public
S2A (14%) Spoken – Monologue – Unscripted
S2B (10%) Spoken – Monologue – Scripted
W1A (4%) Written – Non-printed – Non-professional writing
W1B (6%) Written – Non-printed – Correspondence
W2A (8%) Written – Printed – Academic writing
W2B (8%) Written – Printed – Non-academic writing
W2C (4%) Written – Printed – Reportage
W2D (4%) Written – Printed – Instructional writing
W2E (2%) Written – Printed – Persuasive writing
W2F (4%) Written – Printed – Creative writing
141 linguistic features covered
• A) Nouns: 21 categories, e.g.
  – nominalisation, other nouns; 19 semantic classes of nouns (e.g. evaluations, speech acts)
• B) Verbs: 28 categories, e.g.
  – do as pro-verb, be as main verb, tense and aspect markers, modals, passives, 16 semantic categories of verbs
• C) Pronouns: 10 categories, e.g.
  – person, case, demonstrative
• D) Adjectives: 11 categories, e.g.
  – attributive vs. predicative use, 9 semantic categories
141 linguistic features covered
• E) Adverbs: 7 categories
• F) Prepositions (2 categories)
• G) Subordination (3 categories)
• H) Coordination (2 categories)
• I) WH-questions / clauses (2 categories)
• J) Nominal post-modifying clauses (5 categories)
• K) THAT-complement clauses (3 categories)
• L) Infinitive clauses (3 categories)
• M) Participle clauses (2 categories)
• N) Reduced forms and dispreferred structures (4 categories)
• O) Lexical and structural complexity (3 categories)
141 linguistic features covered
• P) Quantifiers (4 categories)
• Q) Time expressions (11 categories)
• R) Degree expressions (8 categories)
• S) Negation (2 categories)
• T) Power relationship (4 categories)
• U) Definiteness (2 categories)
• V) Helping/hindrance (2 categories)
• X) Linear order (1 category)
• Y) Seem / Appear (1 category)
• Z) Discourse bin (1 category)
Procedure of data analysis
• 1) Data clean-up
• 2) Grammatical and semantic tagging with Wmatrix
• 3) Extracting the frequencies of 141 linguistic features from 2,500 corpus files
• 4) Building a profile of normalised frequencies (per 1,000 words) for each linguistic feature
• 5) Factor analysis
  – Factor extraction (Principal Factor Analysis)
  – Factor rotation (Promax)
  – Optimum structure: 9 factors
• 6) Interpreting extracted factors in functional terms
• 7) Computing factor scores of various dimensions/factors
• 8) Using the enhanced MDA model in exploration of variation across registers and language varieties
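Step 4 (normalised frequencies per 1,000 words) is simple arithmetic, sketched below. The raw counts and the 2,000-word text length are hypothetical, and the function names are illustrative rather than part of Wmatrix or any other tool.

```python
def per_thousand(count, text_length):
    """Normalise a raw feature count to a frequency per 1,000 words."""
    return count * 1000 / text_length

def build_profile(counts, text_length):
    """Normalised frequency profile of one text: feature name -> rate per 1,000 words."""
    return {feature: per_thousand(c, text_length) for feature, c in counts.items()}

# Hypothetical raw counts in a single 2,000-word sample
profile = build_profile({"passives": 26, "modals": 31}, text_length=2000)
print(profile)  # {'passives': 13.0, 'modals': 15.5}
```

Normalising in this way makes texts of different lengths comparable before the profiles are fed into factor analysis.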
The enhanced MDA model
• Nine factors established in the new model
  – 1) Interactive casual discourse vs. informative elaborate discourse
  – 2) Elaborative online evaluation
  – 3) Narrative concern
  – 4) Human vs. object description
  – 5) Future projection
  – 6) Subjective impression and judgement
  – 7) Lack of temporal / locative focus
  – 8) Concern with degree and quantity
  – 9) Concern with reported speech
• Robustness of the model in register analysis
1) Interactive casual discourse vs. informative elaborate discourse
• Private conversation is most interactive and casual
• Academic writing is most informative and elaborate
• Spoken registers are generally more interactive and less elaborate than written registers
[Figure: Factor 1 scores by register, axis from -60 to 60. Registers in chart order: S-Private, S-Public, W-Printed-Creative writing, S-Mono-Unscripted, W-Nonprinted-Correspondence, W-Printed-Non-academic writing, W-Nonprinted-Non-prof writing, S-Mono-Scripted, W-Printed-Persuasive writing, W-Printed-Instructional writing, W-Printed-Reportage, W-Printed-Academic writing]
ANOVA: F = 775.86, p < 0.0001, R² = 77.4%
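The F and R² values reported for each factor come from an ANOVA of factor scores across registers. A minimal sketch of how such figures are computed is given below. It assumes a one-way ANOVA with register as the grouping variable, and that R² is the usual proportion of total variance accounted for by register (between-group sum of squares over total sum of squares). The scores are hypothetical toy values, not the study's data.

```python
def one_way_anova(groups):
    """Return (F, R squared) for a dict mapping register -> list of factor scores."""
    all_scores = [s for scores in groups.values() for s in scores]
    n, k = len(all_scores), len(groups)
    grand_mean = sum(all_scores) / n

    # Variation of register means around the grand mean
    ss_between = sum(
        len(scores) * (sum(scores) / len(scores) - grand_mean) ** 2
        for scores in groups.values()
    )
    # Variation of individual texts around their own register mean
    ss_within = sum(
        (s - sum(scores) / len(scores)) ** 2
        for scores in groups.values()
        for s in scores
    )

    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    r_squared = ss_between / (ss_between + ss_within)
    return f_stat, r_squared

# Hypothetical factor scores for three registers, three texts each
toy_scores = {
    "S-Private": [40.0, 44.0, 42.0],
    "W-Academic": [-30.0, -34.0, -32.0],
    "W-Reportage": [-20.0, -22.0, -18.0],
}
f_stat, r_squared = one_way_anova(toy_scores)
print(f"F = {f_stat:.2f}, R2 = {r_squared:.1%}")
```

A high R² means that register membership accounts for most of the variation in scores on that factor, while a low R² means the registers overlap heavily on it.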
2) Elaborative online evaluation
• Public dialogue (e.g. broadcast discussion / interview, parliamentary debate) has the most prominent focus on elaborative online evaluation
• Unscripted monologue also involves a high level of elaborative online evaluation
• Persuasive writing (e.g. press editorials) may relate to elaborative evaluation but is not restricted by real-time production
• Private conversation is least elaborative even if the evaluation is made online
• Evaluation is not a concern in creative writing
[Figure: Factor 2 scores by register, axis from -6 to 6. Registers in chart order: S-Public, S-Mono-Unscripted, W-Printed-Persuasive writing, W-Nonprinted-Non-prof writing, S-Mono-Scripted, W-Printed-Academic writing, W-Printed-Non-academic writing, W-Printed-Reportage, W-Printed-Instructional writing, W-Nonprinted-Correspondence, S-Private, W-Printed-Creative writing]
ANOVA: F = 102.20, p < 0.0001, R² = 31.1%
3) Narrative concern
• Unscripted monologue (e.g. demonstrations, presentations, sports commentaries) has a narrative concern
• Unsurprisingly, creative writing is also narrative
• Narrative is not a concern in academic writing, non-professional writing (student essays and exam scripts), and instructional writing (argumentation, instruction)
[Figure: Factor 3 scores by register, axis from -8 to 6. Registers in chart order: S-Mono-Unscripted, W-Printed-Creative writing, S-Private, S-Public, S-Mono-Scripted, W-Nonprinted-Correspondence, W-Printed-Reportage, W-Printed-Persuasive writing, W-Printed-Non-academic writing, W-Printed-Instructional writing, W-Nonprinted-Non-prof writing, W-Printed-Academic writing]
ANOVA: F = 134.50, p < 0.0001, R² = 37.3%
4) Human vs. object description
• Private conversation is most likely to have a focus on people
• Correspondence (social letters and business letters) also involves human description
• Instructional writing tends to give concrete descriptions of objects
• Academic and non-academic writings can also be concrete when an object or substance is described
[Figure: Factor 4 scores by register, axis from -4 to 3. Registers in chart order: S-Private, W-Nonprinted-Correspondence, S-Mono-Scripted, S-Public, W-Printed-Persuasive writing, W-Printed-Reportage, W-Nonprinted-Non-prof writing, S-Mono-Unscripted, W-Printed-Creative writing, W-Printed-Non-academic writing, W-Printed-Academic writing, W-Printed-Instructional writing]
ANOVA: F = 44.03, p < 0.0001, R² = 16.3%
5) Future projection
• Persuasive writing (e.g. press editorials, trying to influence people’s future attitudes and actions) has the most prominent focus on future projection
• Correspondence and public dialogue also involve future projection to varying extents
• Academic writing is least concerned with future projection (timeless truth?)
[Figure: Factor 5 scores by register, axis from -6 to 6. Registers in chart order: W-Printed-Persuasive writing, W-Nonprinted-Correspondence, S-Public, S-Mono-Scripted, W-Printed-Instructional writing, S-Mono-Unscripted, S-Private, W-Printed-Reportage, W-Printed-Creative writing, W-Printed-Non-academic writing, W-Nonprinted-Non-prof writing, W-Printed-Academic writing]
ANOVA: F = 28.10, p < 0.0001, R² = 11.1%
6) Subjective impression / judgement
• The factor score of creative writing is far greater than that of any other register
  – Frequent use of possessive and reflexive pronouns, as well as adjectives of judgement / appearance
• Scripted and unscripted monologue, public dialogue and news reportage also tend to avoid expressions of subjective impression and judgement (trying to appear/sound objective and impartial as far as possible)
• Instructional writing, private conversation, and student essays display low scores in this dimension
  – They do not have a focus on personal impression and judgement
[Figure: Factor 6 scores by register, axis from -4 to 10. Registers in chart order: W-Printed-Creative writing, W-Printed-Non-academic writing, W-Printed-Persuasive writing, W-Nonprinted-Correspondence, W-Nonprinted-Non-prof writing, S-Private, W-Printed-Instructional writing, W-Printed-Academic writing, S-Mono-Unscripted, W-Printed-Reportage, S-Public, S-Mono-Scripted]
ANOVA: F = 126.22, p < 0.0001, R² = 35.8%
7) Lack of temporal / locative focus
• Student essays and persuasive writing (argumentation and persuasion) do not have a temporal / locative focus (not concerned with concepts such as when, how long, and where)
• Such specific information is of vital importance in correspondence (social and business letters)
[Figure: Factor 7 scores by register, axis from -8 to 4. Registers in chart order: W-Nonprinted-Non-prof writing, W-Printed-Persuasive writing, W-Printed-Academic writing, W-Printed-Creative writing, S-Public, S-Private, W-Printed-Non-academic writing, S-Mono-Unscripted, S-Mono-Scripted, W-Printed-Reportage, W-Printed-Instructional writing, W-Nonprinted-Correspondence]
ANOVA: F = 89.55, p < 0.0001, R² = 28.4%
8) Concern with degree / quantity
• Non-academic popular writing (e.g. popular science writing) has the greatest concern with degree and quantity
• Persuasive writing also displays a high propensity for expressions of degree and quantity
• In contrast, such expressions tend to be avoided in instructional writing (e.g. administrative documents) and correspondence
[Figure: Factor 8 scores by register, axis from -2 to 3. Registers in chart order: W-Printed-Non-academic writing, W-Printed-Persuasive writing, S-Mono-Scripted, S-Mono-Unscripted, W-Printed-Academic writing, W-Nonprinted-Non-prof writing, S-Public, W-Printed-Reportage, S-Private, W-Printed-Creative writing, W-Nonprinted-Correspondence, W-Printed-Instructional writing]
ANOVA: F = 19.33, p < 0.0001, R² = 7.9%
9) Concern with reported speech
• News reportage has the greatest concern with reported speech (both direct and indirect speech)
• Reported speech is also very common in creative writing (fictional dialogue)
• Instructional writing and academic prose do not appear to have a concern with reported speech
[Figure: Factor 9 scores by register, axis from -4 to 5. Registers in chart order: W-Printed-Reportage, W-Printed-Creative writing, S-Mono-Scripted, S-Public, S-Private, W-Nonprinted-Correspondence, S-Mono-Unscripted, W-Printed-Non-academic writing, W-Printed-Persuasive writing, W-Nonprinted-Non-prof writing, W-Printed-Academic writing, W-Printed-Instructional writing]
ANOVA: F = 80.02, p < 0.0001, R² = 26.1%
12 registers along 9 factors
• Factor 1 is the dimension along which the 12 registers demonstrate the sharpest contrasts
  – Interactive casual discourse vs. informative elaborate discourse: a fundamental aspect of variation across registers
• Robustness of the model
[Figure: factor scores (y-axis, from -50 to 50) of Factors 1 to 9 for each register (x-axis: S1A, S1B, S2A, S2B, W1A, W1B, W2A, W2B, W2C, W2D, W2E, W2F)]
Case study summary
• Summary
  – Seeking to enhance Biber’s MDA model with semantic components
  – Introducing the new model in research of World Englishes
  – Cao, Y. & Xiao, R. (2013) “A multidimensional contrastive study of English abstracts by native and nonnative writers”. Corpora, 8 (1-2)
• Lab session: Exploring distribution of passives in the FLOB corpus
  – Andrew H. and Xiao R. (2005) Introduction to Xaira. UCREL Corpus Research Group, Lancaster, November 2005.
    Part 1. All about Xaira: www.lancs.ac.uk/staff/xiaoz/papers/crg_xaira_part1.ppt
    Part 2. Using Xaira to explore corpora: www.lancs.ac.uk/staff/xiaoz/papers/crg_xaira_part2.ppt
Open FLOB in Xaira
Define subcorpora
Open subcorpora
Query builder
Define scope node
Define 1st search node
Select all tags starting with VB
Define 2nd search node
Select all tags starting with VVN
Define link type
[For demonstration purposes, only passives with the verb BE followed immediately by a past participle will be included]
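The query built above (a tag starting with VB immediately followed by a tag starting with VVN) can be mimicked outside Xaira on any list of (word, tag) pairs. The sketch below is not Xaira's API. It is a plain Python illustration, and the example sentence and its CLAWS-style tags are hypothetical.

```python
def find_be_passives(tagged_tokens):
    """Return (be_form, participle) pairs where a VB* tag is immediately
    followed by a VVN* tag, mirroring the two linked search nodes."""
    hits = []
    for (word1, tag1), (word2, tag2) in zip(tagged_tokens, tagged_tokens[1:]):
        if tag1.startswith("VB") and tag2.startswith("VVN"):
            hits.append((word1, word2))
    return hits

# Hypothetical tagged sentence (CLAWS-style tags assumed for illustration)
sentence = [
    ("The", "AT"), ("report", "NN1"), ("was", "VBDZ"), ("written", "VVN"),
    ("quickly", "RR"), ("and", "CC"), ("has", "VHZ"), ("been", "VBN"),
    ("carefully", "RR"), ("checked", "VVN"),
]
passives = find_be_passives(sentence)
print(passives)  # [('was', 'written')]
```

The second passive, "been carefully checked", is missed because an adverb intervenes between BE and the participle, which is exactly the simplification the demonstration query accepts.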
Random sampling
KWIC versus page mode
Sorted by %