Corpora in Linguistic Research

Nanjing University (南京大学)

Li Changsheng (李长生)

Tel: 025-8443-6787

Email: csli@jlonline.com

Order of Presentation

I. Corpus Research versus Linguistic Research
II. Influential Corpora
III. Corpus Analysis
IV. More on Statistical Analysis
V. Q and maybe A (anytime during presentation)

I. Corpus Research versus Linguistic Research

Corpus Research=Linguistic Research

Language (features)
Learner language (features)

I. Corpus Research versus Linguistic Research

Corpus Research≠Linguistic Research

(Large,) representative authentic data

II. Influential Corpora

Native-speaker corpora
Learner corpora

Native-speaker Corpora

Collins Corpus/Bank of English

A 2.5-billion-word analytical database of English. Contains written material from websites, newspapers, magazines and books published around the world, and spoken material from radio, TV and everyday conversations.

New data is fed into the corpus every month, to help the Collins dictionary editors identify new words and meanings from the moment they are first used.

Bank of English: part of the Collins Corpus. Contains 650 million words from a carefully chosen selection of sources, to give a balanced and accurate reflection of English as it is used every day.

British National Corpus

Contains approximately 100 million words of written texts (90%) and transcripts of speech (10%) in modern British English.

Can be accessed online remotely using the BNC Online service.

American National Corpus

Contains 11.5 million words of written and spoken American English data (8.3 million words for writing and 3.2 million words for speech)

Longman/Lancaster Corpus

Contains about 30 million words of published English.

British data takes up 50% and American data 40% while the other 10% represents other varieties such as Australian, African and Irish English.

Learner Corpora

International Corpus of Learner English

Contains argumentative essays written by advanced learners of English, i.e. university students of English as a foreign language (EFL) in their 3rd or 4th year of study.

Contains over 2.5 million words in 3,640 texts, each 500 to 1,000 words long, written by EFL learners from 11 mother tongue backgrounds: Bulgarian, Czech, Dutch, Finnish, French, German, Italian, Polish, Russian, Spanish, and Swedish.

CLEC

Contains one million words from writing produced by Chinese learners of English from five proficiency levels: middle school students, junior and senior non-English majors, and junior and senior English majors.

Annotated with learner errors using an annotation scheme which consists of 61 error types clustered in 11 categories.

SWECCL

Contains about 2 million words in total of spoken and written English produced by Chinese university English majors.

LSECCL

Year 1

Recording 1
Task 1 - Reading aloud
Task 2 - Monologue - The Most Unforgettable Birthday
Task 3 - Dialogue - Holiday plan

Recording 2
Task 1 - Retelling
Task 2 - Monologue - Whether it is appropriate for college students to rent apartments outside the campus and live there
Task 3 - Dialogue - Whether exams should be abolished

LSECCL

Year 2

Recording 1
Task 1 - Reading aloud
Task 2 - Monologue - Describe one of the persons you admire most
Task 3 - Dialogue - What gift to buy for a friend, Lily

Recording 2
Task 1 - Retelling
Task 2 - Monologue - Make critical comments on the use of electronic dictionaries among college students
Task 3 - Dialogue - Whether it is a good practice or not to keep one's own computer in the dorm

LSECCL

Year 3

Recording 1
Task 1 - Reading aloud
Task 2 - Monologue - Describe one of your experiences when you had a great ambition to do something
Task 3 - Dialogue - Talk about ways of relaxation after a month-long preparation for an exam

Recording 2
Task 1 - Retelling
Task 2 - Monologue - Do you think it is appropriate for college students to get married
Task 3 - Dialogue - Talk about the necessity of having certificates

LSECCL

Year 4

Recording 1
Task 1 - Reading aloud
Task 2 - Monologue - The Most Unforgettable Birthday
Task 3 - Dialogue - Holiday plan

Recording 2
Task 1 - Retelling
Task 2 - Monologue - Whether it is appropriate for college students to rent apartments outside the campus and live there
Task 3 - Dialogue - Whether exams should be abolished

III. Corpus Analysis

(Tagging corpus data)
Calculating frequencies and frequency differences
  Frequencies of occurrence
  Frequencies of co-occurrence
  Frequency differences across registers/corpora/periods of time
(Transferring frequencies)
Statistical analysis
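A minimal Python sketch of the three kinds of counts listed above. The toy sentences, the `tokenize` helper, and the per-million scaling are assumptions made for illustration; they are not taken from the slides or from any particular concordancer.

```python
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [t for t in tokens if t]

def per_million(count, corpus_size):
    """Normalize a raw count to occurrences per million words."""
    return count / corpus_size * 1_000_000

def cooccurrences(tokens, node, span=1):
    """Count words appearing within `span` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Toy texts standing in for two corpora (invented for illustration).
corpus_a = tokenize("A big deal is a big step. It is a great deal of work.")
corpus_b = tokenize("The large system works. The system is large.")
freq_a, freq_b = Counter(corpus_a), Counter(corpus_b)

# Frequency of occurrence, raw and normalized.
big_raw = freq_a["big"]
big_norm = per_million(big_raw, len(corpus_a))

# Frequency of co-occurrence: neighbours of "deal" within one word.
deal_neighbours = cooccurrences(corpus_a, "deal", span=1)

# Frequency difference across corpora, on the normalized scale.
system_diff = (per_million(freq_b["system"], len(corpus_b))
               - per_million(freq_a["system"], len(corpus_a)))
```

Normalizing before subtracting matters because corpora such as the BNC and SWECCL differ greatly in size, so raw counts are not directly comparable.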

Lexis

Reference word list of the College English Curriculum Requirements (《大学英语课程教学要求》, 2007)

Lexis

headwords

Lexis

meanings: deal (Biber et al., 1998)

Lexis

synonyms: utterly, perfectly

Lexis

synonyms: big, large, great (Biber et al., 1998)

Lexis

collocations: system

Lexis

chunks (Qi, 2006)

Step 1: Run WordList
Step 2: Select the corpus
Step 3: Make an index
Step 4: Click Compute, then Clusters
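The same idea can be approximated in a few lines of Python by counting recurring word sequences. This is only a sketch of n-gram counting; the `clusters` function, its `n` and `min_freq` parameters, and the sample sentence are assumptions for illustration and do not reproduce how WordList computes clusters internally.

```python
from collections import Counter

def clusters(tokens, n=3, min_freq=2):
    """Return n-word sequences that occur at least `min_freq` times."""
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(" ".join(g) for g in grams)
    return [(g, c) for g, c in counts.most_common() if c >= min_freq]

# Invented learner-like text, already lowercased and free of punctuation.
text = "i think that it is good and i think that it is fair"
result = clusters(text.split(), n=3, min_freq=2)
```

Here `result` holds the three-word chunks that recur, such as "i think that"; raising `min_freq` filters out chance repetitions in larger corpora.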

Grammar

that-clause, to-clause (Biber et al., 1998)

<V* that <CST>

to <TO> * <V?I>/to <TO> * <R* * <V?I>/to <TO> * <R* R <* * <V?I>

Grammar

syntactic co-occurrences of try (McEnery and Wilson, 2001)

Learner Language

Frequency differences across corpora
Frequency differences across periods of time

Across Corpora

SWECCL

ICLE

BNC

L1 (NNS-NNS)

L1 (NNS-NS)

Corpus Analysis

Tagging Corpus Data

CLAWS: book → book_NN1

Batch text replacement tool (超级批量文本替换): book_NN1 → book <NN1>
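The second step is a plain search-and-replace, so it can also be scripted. The sketch below assumes tags consist only of uppercase letters and digits (as in NN1); the function name and sample sentence are invented for illustration.

```python
import re

def underscore_to_angle(tagged_text):
    """Rewrite `word_TAG` tokens as `word <TAG>`."""
    return re.sub(r"(\S+)_([A-Z][A-Z0-9]*)\b", r"\1 <\2>", tagged_text)

converted = underscore_to_angle("the_AT0 book_NN1 was_VBD written_VVN")
# converted == "the <AT0> book <NN1> was <VBD> written <VVN>"
```

Punctuation tags and hyphenated ambiguity tags fall outside this pattern and would need their own rule.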

Calculating Frequencies and Frequency Differences

passive voice (be done) (Li, 2007a)

* <VB* * <V?N>
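Read as a search pattern, this appears to mean: any word whose tag starts with VB, directly followed by any word with a three-character tag beginning with V and ending in N. A hedged Python equivalent over text in `word <TAG>` format is sketched below; the regular expression is one interpretation of the wildcard syntax, and the sample sentence is invented.

```python
import re

# Any word tagged VB* immediately followed by any word tagged V?N.
PASSIVE = re.compile(r"(\S+) <VB\w*> (\S+) <V\wN>")

tagged = ("the <AT0> book <NN1> was <VBD> written <VVN> "
          "and <CJC> it <PNP> is <VBZ> read <VVN>")

hits = [" ".join(m.groups()) for m in PASSIVE.finditer(tagged)]

# Assumes every word is followed by exactly one tag.
word_count = len(tagged.split()) // 2
per_thousand = len(hits) / word_count * 1000
```

Like the original pattern, this only finds a form of be immediately followed by a participle, so passives with an intervening adverb ("was quickly written") are missed.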

Statistical Analysis

Differences (two or three corpora)

1. chi-square

Under Analyze, choose Descriptive Statistics, then Crosstabs. Move one variable into the Row(s) box and the other into the Column(s) box. Click Statistics, and check off Chi-square. Click Cells, and check off Expected.

2. one-way chi-square

Under Analyze, choose Nonparametric Tests, then Chi-Square. Move the variable into the Test Variable List box. Click OK.
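For readers who want to check the Crosstabs output by hand, here is a minimal sketch of the Pearson chi-square computation in plain Python. The counts are invented, the function name is hypothetical, and no continuity correction is applied, so the value corresponds to the uncorrected Pearson statistic.

```python
import math

def chi_square(table):
    """Pearson chi-square for an r x c table of observed counts.

    Returns the statistic, the degrees of freedom, and the expected counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df, expected

# Invented counts: rows = two corpora, columns = (passives, all other words).
observed = [[30, 970],
            [60, 940]]
stat, df, expected = chi_square(observed)

# With df = 1 the p-value has a closed form via the complementary error function.
p_value = math.erfc(math.sqrt(stat / 2))
```

The expected counts returned here are what the Expected option under Cells displays; with one degree of freedom, a statistic above 3.84 is significant at the .05 level.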

Another Example

AWL (Li, 2007a)

+matchlist

Across Periods of Time

LSECCL

Grades (Year 1-Year 2-Year 3-Year 4)

Title

1) Key terms
3) Noun phrase
4) Word limit (<20)
5) Capitalization

Li (2007b)

Abstract

Summary

Acknowledgments

Specific

Introduction

Motivation for the study, theoretical and practical significance of the study, overall structure

Literature Review

Key terms
Theoretical issues
Empirical studies
Unresolved issues

Literature Review

Bibliographies/Indices/Databases (ERIC, NJU, Google Scholar, corpus4u)
Papers (Chen, 2004)
Journals (Applied Linguistics, Language Learning)
Books (FLTRP)

Research Questions

LSECCL

Grades (Year 1-Year 2-Year 3-Year 4)

Corpus Analysis

Tagging Corpus Data

Microsoft Word: I think → I think <sv> <ip> <cm> <0>

Calculating Frequencies and Frequency Differences

<sv>/<ap>/<dn> <cm>

Transferring Frequencies

Microsoft Excel

=COUNTIF(N1:N5000,"D:\YEAR1\1-2-B02B.TXT")
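The COUNTIF formula counts how many concordance hits came from one file. A rough Python equivalent that tallies every file at once is sketched below; the list stands in for the spreadsheet column, and the second file name is hypothetical.

```python
from collections import Counter

# One file path per concordance hit, as in the spreadsheet column.
# The B03B file name is invented for illustration.
hit_files = [
    r"D:\YEAR1\1-2-B02B.TXT",
    r"D:\YEAR1\1-2-B03B.TXT",
    r"D:\YEAR1\1-2-B02B.TXT",
]

hits_per_file = Counter(hit_files)
count_b02b = hits_per_file[r"D:\YEAR1\1-2-B02B.TXT"]
```

One difference to keep in mind: COUNTIF ignores case when matching text, whereas the dictionary lookup here is case-sensitive.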

Statistical Analysis

Changes in frequency differences (data from three or more points in time)

Wilcoxon

Under Analyze, choose Nonparametric Tests, then 2 Related Samples. Move each pair of variables into the Test Pair(s) List box.
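The Wilcoxon signed-rank test compares two related samples at a time. The sketch below computes the test statistic for one such pair in plain Python; the per-speaker frequencies and the helper names are invented, and it is an assumption here that data from three or more points in time would be handled by repeating the comparison for each pair.

```python
def mean_rank(sorted_vals, v):
    """1-based rank of v in sorted_vals; tied values share the mean rank."""
    first = sorted_vals.index(v) + 1
    return first + (sorted_vals.count(v) - 1) / 2

def wilcoxon_w(first, second):
    """Return the smaller signed-rank sum and the number of non-zero differences."""
    diffs = [b - a for a, b in zip(first, second) if b != a]  # zero differences dropped
    abs_sorted = sorted(abs(d) for d in diffs)
    positive = sum(mean_rank(abs_sorted, abs(d)) for d in diffs if d > 0)
    negative = sum(mean_rank(abs_sorted, abs(d)) for d in diffs if d < 0)
    return min(positive, negative), len(diffs)

# Invented frequencies of one tagged feature for the same six speakers.
year1 = [5, 7, 6, 9, 4, 8]
year2 = [8, 9, 5, 13, 9, 8]
w, n = wilcoxon_w(year1, year2)
```

A small statistic relative to the number of non-zero differences indicates that most changes go in the same direction.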

Results and Discussion

Answers to the research questions, and reasons for the answers

Conclusion

Summary of the findings, theoretical and practical implications of the findings, and limitations of the study

References

Works cited

Appendices

Sample tagged text, etc

IV. More on Statistical Analysis

Research Questions in Linguistic Research

1. Differences
2. Changes
3. Correlation
4. Effects

Differences (2 groups of subjects, 1 test)

1) independent t-test

Entering the data

Analyzing the data: Under Analyze, choose Compare Means, then Independent-Samples T Test. Move the dependent variable into the Test Variable box, and the independent variable into the Grouping Variable box. Click Define Groups and type in the values of the two groups.

Tabulating the results

Describing the results
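As a cross-check outside SPSS, the t statistic can be computed directly. The sketch below uses only the Python standard library and assumes equal variances in the two groups (the pooled-variance form); the scores and the function name are invented for illustration.

```python
import math
from statistics import mean, variance

def independent_t(a, b):
    """Student's t and degrees of freedom for two independent groups."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2

# Invented test scores for two groups of five learners.
group1 = [12, 15, 11, 14, 13]
group2 = [9, 10, 8, 12, 11]
t, df = independent_t(group1, group2)
```

With 8 degrees of freedom the two-tailed critical value at the .05 level is 2.306, so the t of 3.0 obtained here would count as a significant difference.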

2) Mann-Whitney U

Entering the data

Analyzing the data: Under Analyze, choose Nonparametric Tests, then 2 Independent Samples. Move the dependent variable into the Test Variable List box, and the independent variable into the Grouping Variable box. Click Define Groups. Check off Mann-Whitney U.

Tabulating the results

Describing the results
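The U statistic is built from ranks rather than raw scores. A minimal standard-library sketch follows; the data are invented, the helper names are hypothetical, and returning the smaller of the two U values is a convention chosen here.

```python
def mean_rank(sorted_vals, v):
    """1-based rank of v in sorted_vals; tied values share the mean rank."""
    first = sorted_vals.index(v) + 1
    return first + (sorted_vals.count(v) - 1) / 2

def mann_whitney_u(a, b):
    """Return the smaller of the two U statistics for groups a and b."""
    pooled = sorted(a + b)
    rank_sum_a = sum(mean_rank(pooled, v) for v in a)
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    u_b = len(a) * len(b) - u_a
    return min(u_a, u_b)

# Invented scores for two independent groups.
u = mann_whitney_u([12, 15, 11, 14, 13], [9, 10, 8, 12, 11])
```

Because only the ordering of scores matters, the test suits skewed frequency data where the t-test's normality assumption is doubtful.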

Differences (3 groups of subjects, 1 test)

1) one-way ANOVA

Entering the data

Analyzing the data: Under Analyze, choose Compare Means, then One-Way ANOVA. Move the dependent variable into the Dependent List box, and the independent variable into the Factor box. Click Post Hoc, and choose Tukey (equal number of cases in each group) or Bonferroni (unequal number of cases).

Tabulating the results

Describing the results
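The F ratio reported by One-Way ANOVA is the between-group mean square divided by the within-group mean square. The sketch below shows that arithmetic with invented scores; the function name is hypothetical, and post hoc comparisons are not covered.

```python
from statistics import mean

def one_way_anova(*groups):
    """Return F and its two degrees of freedom for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Invented scores for three groups of three learners.
f, df1, df2 = one_way_anova([4, 5, 6], [6, 7, 8], [8, 9, 10])
```

For these data F(2, 6) = 12.0, which exceeds the .05 critical value of 5.14, so at least one group mean differs from the others.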

2) Kruskal-Wallis H

Entering the data

Analyzing the data: Under Analyze, choose Nonparametric Tests, then K Independent Samples. Move the dependent variable into the Test Variable List box, and the independent variable into the Grouping Variable box. Click Define Range. Check off Kruskal-Wallis H.

Tabulating the results

Describing the results

Differences (3 groups of subjects, 2 tests)

MANOVA

Entering the data

Analyzing the data: Under Analyze, choose General Linear Model, then Multivariate. Move the dependent variables into the Dependent Variables box, and the independent variable into the Fixed Factor(s) box.

Tabulating the results

Describing the results

Differences (2 or 3 groups of subjects)

1) chi-square

Entering the data

Analyzing the data: Under Analyze, choose Descriptive Statistics, then Crosstabs. Move one variable into the Row(s) box and the other into the Column(s) box. Click Statistics, and check off Chi-square. Click Cells, and check off Expected.

Tabulating the results

Describing the results

2) one-way chi-square

Entering the data

Analyzing the data: Under Analyze, choose Nonparametric Tests, then Chi-Square. Move the variable into the Test Variable List box. Click OK.

Tabulating the results

Describing the results

Changes (1 group of subjects, 2 tests)

1) paired t-test

Entering the data

Analyzing the data: Under Analyze, choose Compare Means, then Paired-Samples T Test. Click on a pair of variables, and move them into the Paired Variables box.

Tabulating the results

Describing the results
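The paired t-test works on the difference between each subject's two scores. A short standard-library sketch with invented pre-test and post-test scores follows; the function name is hypothetical.

```python
import math
from statistics import mean, stdev

def paired_t(first, second):
    """t and degrees of freedom for two measurements of the same subjects."""
    diffs = [b - a for a, b in zip(first, second)]
    t = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Invented scores for four learners tested twice.
pre_test = [10, 12, 9, 11]
post_test = [12, 15, 10, 13]
t, df = paired_t(pre_test, post_test)
```

Here t is about 4.90 with 3 degrees of freedom, above the two-tailed .05 critical value of 3.182.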

2) Wilcoxon

Entering the data

Analyzing the data: Under Analyze, choose Nonparametric Tests, then 2 Related Samples. Move the variables into the Test Pair(s) List box.

Tabulating the results

Describing the results

Changes (1 group of subjects, 3 tests)

1) repeated-measures ANOVA

Entering the data

Analyzing the data: Under Analyze, choose General Linear Model, then Repeated Measures.

Tabulating the results

Describing the results

2) Wilcoxon

Entering the data

Analyzing the data: Under Analyze, choose Nonparametric Tests, then 2 Related Samples. Move each pair of variables into the Test Pair(s) List box.

Tabulating the results

Describing the results

Correlation (2 or 3 variables)

1) Pearson

Entering the data

Analyzing the data: Under Analyze, choose Correlate, then Bivariate. Move the variables into the Variables box. Check off Pearson.

Tabulating the results

Describing the results
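Pearson's r is the covariation of two variables divided by the product of their spreads. The sketch below spells that out in plain Python; the variable names and the five data points are invented for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Invented values: chunk frequency and essay score for five learners.
chunk_freq = [1, 2, 3, 4, 5]
essay_score = [2, 4, 5, 4, 5]
r = pearson_r(chunk_freq, essay_score)
```

The result, about 0.77, indicates a fairly strong positive linear relationship in this toy sample.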

2) Spearman

Entering the data

Analyzing the data: Under Analyze, choose Correlate, then Bivariate. Move the variables into the Variables box. Check off Spearman.

Tabulating the results

Describing the results

Effects (2 or 3 variables)

1) linear regression

Entering the data

Analyzing the data: Under Analyze, choose Regression, then Linear. Enter the dependent and independent variables. Choose an appropriate method (Stepwise or Enter), and click OK.

Tabulating the results

Describing the results
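For a single predictor, the regression line can be fitted by hand with least squares. The sketch below covers only that simple case, not the Stepwise method or multiple predictors; the function name and data are invented for illustration.

```python
def fit_line(x, y):
    """Least-squares slope, intercept, and R squared for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Invented values: predictor and outcome for five learners.
predictor = [1, 2, 3, 4, 5]
outcome = [2, 4, 5, 4, 5]
slope, intercept, r_squared = fit_line(predictor, outcome)
```

The slope estimates how much the outcome changes per unit of the predictor, and R squared gives the share of variance in the outcome that the line accounts for (0.6 in this toy sample).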

2) categorical regression

Entering the data

Analyzing the data: Under Analyze, choose Regression, then Optimal Scaling. Enter the dependent and independent variables. Choose an appropriate method (Stepwise or Enter), and click OK.

Tabulating the results

Describing the results

V. Q and A