Corpora in Linguistic Research. Nanjing University, 李长生. Tel: 025-8443-6787 Email :...
Order of Presentation
I. Corpus Research versus Linguistic Research
II. Influential Corpora
III. Corpus Analysis
IV. More on Statistical Analysis
V. Q and maybe A (anytime during presentation)
I. Corpus Research versus Linguistic Research
Corpus Research=Linguistic Research
Language (features) Learner language (features)
I. Corpus Research versus Linguistic Research
Corpus Research≠Linguistic Research
(Large,) representative authentic data
II. Influential Corpora
Native-speaker corpora Learner corpora
Native-speaker Corpora
Collins Corpus/Bank of English
A 2.5-billion-word analytical database of English. Contains written material from websites, newspapers, magazines and books published around the world, and spoken material from radio, TV and everyday conversations.
New data is fed into the corpus every month, to help the Collins dictionary editors identify new words and meanings from the moment they are first used.
Bank of English: part of the Collins Corpus. Contains 650 million words from a carefully chosen selection of sources, to give a balanced and accurate reflection of English as it is used every day.
British National Corpus
Contains approximately 100 million words of written texts (90%) and transcripts of speech (10%) in modern British English.
Can be accessed online remotely using the BNC Online service.
American National Corpus
Contains 11.5 million words of written and spoken American English data (8.3 million words for writing and 3.2 million words for speech)
Longman/Lancaster Corpus
Contains about 30 million words of published English.
British data takes up 50% and American data 40% while the other 10% represents other varieties such as Australian, African and Irish English.
Learner Corpora
International Corpus of Learner English
Contains argumentative essays written by advanced learners of English, i.e. university students of English as a foreign language (EFL) in their 3rd or 4th year of study.
Contains over 2.5 million words in 3,640 texts, each 500 to 1,000 words long, written by EFL learners from 11 mother tongue backgrounds: Bulgarian, Czech, Dutch, Finnish, French, German, Italian, Polish, Russian, Spanish, and Swedish.
CLEC
Contains one million words from writing produced by Chinese learners of English from five proficiency levels: middle school students, junior and senior non-English majors, and junior and senior English majors.
Annotated with learner errors using an annotation scheme which consists of 61 error types clustered in 11 categories.
SWECCL
Contains about 2 million words in total of spoken and written English produced by Chinese university English majors.
LSECCL
Year 1
Recording 1
Task 1 - Reading aloud
Task 2 - Monologue - The Most Unforgettable Birthday
Task 3 - Dialogue - Holiday plan
Recording 2
Task 1 - Retelling
Task 2 - Monologue - Whether it is appropriate for college students to rent apartments outside the campus and live there
Task 3 - Dialogue - Whether exams should be abolished
LSECCL
Year 2
Recording 1
Task 1 - Reading aloud
Task 2 - Monologue - Describe one of your persons you admire most
Task 3 - Dialogue - What gift to buy for a friend - Lily
Recording 2
Task 1 - Retelling
Task 2 - Monologue - Make critical comments on the use of electronic dictionaries among college students
Task 3 - Dialogue - Whether it is a good practice or not to keep one’s own computer in dorm
LSECCL
Year 3
Recording 1
Task 1 - Reading aloud
Task 2 - Monologue - Describe one of your experiences when you had a great ambition to do something
Task 3 - Dialogue - Talk about ways of relaxation after a month-long preparation for an exam
Recording 2
Task 1 - Retelling
Task 2 - Monologue - Do you think it is appropriate for college students to get married
Task 3 - Dialogue - Talk about the necessity of having certificates
LSECCL
Year 4
Recording 1
Task 1 - Reading aloud
Task 2 - Monologue - The Most Unforgettable Birthday
Task 3 - Dialogue - Holiday plan
Recording 2
Task 1 - Retelling
Task 2 - Monologue - Whether it is appropriate for college students to rent apartments outside the campus and live there
Task 3 - Dialogue - Whether exams should be abolished
III. Corpus Analysis
(Tagging corpus data)
Calculating frequencies and frequency differences
Frequencies of occurrence
Frequencies of co-occurrence
Frequency differences across registers/corpora/periods of time
(Transferring frequencies)
Statistical analysis
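As a rough illustration of the three kinds of count listed above, the following minimal Python sketch computes frequencies of occurrence, frequencies of adjacent co-occurrence, and a per-million normalization for comparing corpora of different sizes. The sample sentence, the function names, and the simple tokenizer are assumptions made for this example; they are not part of the original slides.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def frequencies(tokens):
    """Frequency of occurrence: how often each word appears."""
    return Counter(tokens)

def cooccurrences(tokens):
    """Frequency of co-occurrence: how often each pair of adjacent words appears.
    Sentence boundaries are ignored in this sketch."""
    return Counter(zip(tokens, tokens[1:]))

def per_million(count, corpus_size):
    """Normalize a raw count so that corpora of different sizes can be compared."""
    return count * 1_000_000 / corpus_size

# Hypothetical two-sentence "corpus", for illustration only.
tokens = tokenize("I think it is fine. I think so.")
freq = frequencies(tokens)
pairs = cooccurrences(tokens)

print(freq["think"])                          # raw frequency of "think"
print(pairs[("i", "think")])                  # frequency of the pair "i think"
print(per_million(freq["think"], len(tokens)))
```

Normalized figures of this kind are what make a frequency difference across registers, corpora, or periods of time meaningful when the samples differ in size.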
Lexis
College English Curriculum Requirements (2007): reference word list
Lexis
headwords
Lexis
meanings: deal (Biber et al., 1998)
Lexis
synonyms: utterly, perfectly
Lexis
synonyms: big, large, great (Biber et al., 1998)
Lexis
collocations: system
Lexis
chunks (Qi, 2006)
Step 1: Run WordList
Step 2: Select the corpus
Step 3: Build the index
Step 4: Click Compute, then Clusters
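The idea behind computing clusters (recurrent word sequences, or chunks) can be sketched in a few lines of Python. This is only an approximation of the concept, not the WordList tool's own procedure; the function name, the cluster size, the frequency threshold, and the sample text are all assumptions chosen for illustration.

```python
from collections import Counter

def clusters(tokens, size=3, min_freq=2):
    """Count every run of `size` consecutive words and keep those
    that occur at least `min_freq` times."""
    grams = Counter(
        tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)
    )
    return {" ".join(gram): n for gram, n in grams.items() if n >= min_freq}

# Hypothetical token sequence, for illustration only.
tokens = "as a matter of fact it is as a matter of course".split()
result = clusters(tokens, size=3, min_freq=2)
print(result)
```

Raising `min_freq` or `size` narrows the output to longer or more firmly established chunks.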
Grammar
that-clause, to-clause (Biber et al., 1998)
<V* that <CST>to <TO> * <V?I>/to <TO> * <R* * <V?I>/to <TO> * <R* R <* * <V?I>
Grammar
syntactic co-occurrences of try (McEnery and Wilson, 2001)
Learner Language
Frequency differences across corpora
Frequency differences across periods of time
Across Corpora
SWECCL
ICLE
BNC
L1 (NNS-NNS)
L1 (NNS-NS)
Corpus Analysis
Tagging Corpus Data
CLAWS: book → book_NN1
Batch text replacement tool (超级批量文本替换): book_NN1 → book <NN1>
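The conversion from the `word_TAG` form to the `word <TAG>` form can also be scripted. The sketch below is one possible way to do it in Python, assuming whitespace-separated tokens in which the last underscore separates the word from its tag; the function name and the sample sentence are hypothetical.

```python
def claws_to_angle(text):
    """Rewrite each `word_TAG` token as `word <TAG>`.
    Tokens without an underscore are left unchanged."""
    converted = []
    for token in text.split():
        word, sep, tag = token.rpartition("_")
        if sep and word and tag:
            converted.append(f"{word} <{tag}>")
        else:
            converted.append(token)
    return " ".join(converted)

print(claws_to_angle("book_NN1"))
print(claws_to_angle("The_AT book_NN1 was_VBDZ written_VVN"))
```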
Calculating Frequencies and Frequency Differences
passive voice (be done) (Li, 2007a)
* <VB* * <V?N>
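One way to read the search pattern above is "any word tagged VB..., immediately followed by any word whose tag is V, one character, N". Under that assumption (`*` as any run of characters, `?` as a single character), a minimal Python sketch of the same search over text in the `word <TAG>` format might look as follows. The regular expression, the function name, and the sample sentence are illustrative, not taken from the slides.

```python
import re

# Assumed reading of `* <VB* * <V?N>`:
# a form of "be" (tag starting with VB) directly followed by a past participle (tag V?N).
PASSIVE = re.compile(r"(\S+) <VB\w*> (\S+) <V\wN>")

def find_passives(tagged_text):
    """Return each matched 'be + participle' pair as a plain string."""
    return [f"{be} {participle}" for be, participle in PASSIVE.findall(tagged_text)]

# Hypothetical tagged sentence, for illustration only.
sample = "The <AT> book <NN1> was <VBDZ> written <VVN> in <II> 1998 <MC> . <.>"
hits = find_passives(sample)
print(hits)
print(len(hits))  # raw frequency of the pattern in the sample
```

Like the original pattern, this sketch misses passives with intervening words (for example an adverb between "be" and the participle), so the counts are a lower bound.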
Statistical Analysis
Differences (two or three corpora)
1. chi-square
Under Analyze, choose Descriptive Statistics, then Crosstabs. Move one variable into the Row(s) box and the other into the Column(s) box. Click Statistics, and check off Chi-square. Click Cells, and check off Expected.
2. one-way chi-square
Under Analyze, choose Nonparametric Tests, then Chi-Square. Move the variable into the Test Variable List box. Click OK.
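To make clear what the two chi-square procedures compute, here is a small pure-Python sketch of the Pearson chi-square for a contingency table (with its expected counts) and of the one-way chi-square against equal expected frequencies. The counts are invented for illustration, the function names are hypothetical, and no continuity correction or p-value is calculated; the statistic would still need to be checked against a chi-square table or a statistics package.

```python
def chi_square(table):
    """Pearson chi-square for a contingency table given as a list of rows.
    Returns the statistic, the degrees of freedom, and the expected counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    stat = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected

def one_way_chi_square(observed):
    """One-way chi-square, assuming equal expected frequencies in every category."""
    expected = sum(observed) / len(observed)
    stat = sum((obs - expected) ** 2 / expected for obs in observed)
    return stat, len(observed) - 1

# Hypothetical counts: rows are two corpora, columns are
# "target feature" versus "everything else".
stat, df, expected = chi_square([[30, 70], [10, 90]])
print(stat, df, expected)

# Hypothetical counts of one feature in two corpora of equal size.
print(one_way_chi_square([30, 10]))
```

With one degree of freedom, a statistic above 3.841 is significant at the .05 level.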
Another Example
AWL (Li, 2007a)
+matchlist
Across Periods of Time
LSECCL
Grades (Year 1-Year 2-Year 3-Year 4)
Title
1) Key terms
3) Noun phrase
4) Word limit (<20)
5) Capitalization
Li (2007b)
Abstract
Summary
Acknowledgments
Specific
Introduction
Motivation for the study, theoretical and practical significance of the study, overall structure
Literature Review
Key terms
Theoretical issues
Empirical studies
Unresolved issues
Literature Review
Bibliographies/Indices/Databases (ERIC, NJU, Google Scholar, corpus4u)
Papers (Chen, 2004)
Journals (Applied Linguistics, Language Learning)
Books (FLTRP)
Research Questions
LSECCL
Grades (Year 1-Year 2-Year 3-Year 4)
Corpus Analysis
Tagging Corpus Data
Microsoft Word: I think → I think <sv> <ip> <cm> <0>
Calculating Frequencies and Frequency Differences
<sv>/<ap>/<dn> <cm>
Transferring Frequencies
Microsoft Excel
=COUNTIF(N1:N5000,"D:\YEAR1\1-2-B02B.TXT")
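The COUNTIF formula above counts how many cells in a column hold a given file name, that is, how many hits came from that file. A minimal Python equivalent is sketched below; the column contents are invented, and the second file name is hypothetical.

```python
from collections import Counter

# Hypothetical column of source-file names, one entry per tagged hit.
column = [
    r"D:\YEAR1\1-2-B02B.TXT",
    r"D:\YEAR1\1-2-B02B.TXT",
    r"D:\YEAR1\1-2-B03B.TXT",  # hypothetical second file
]

# One pass gives the count for every file at once,
# instead of one COUNTIF formula per file.
hits_per_file = Counter(column)
print(hits_per_file[r"D:\YEAR1\1-2-B02B.TXT"])
```

Unlike COUNTIF, which ignores case when matching text, this comparison is case-sensitive.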
Statistical Analysis
Changes in frequency differences (data from three or more occasions)
Wilcoxon
Under Analyze, choose Nonparametric Tests, then 2 Related Samples. Move the variables into the Test Pair(s) List box.
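For readers who want to see what the Wilcoxon procedure does with one pair of occasions, the sketch below computes the sums of positive and negative ranks in pure Python. The per-student frequencies are invented, the function names are hypothetical, and the sketch stops at the rank sums: it does not compute the Z value or significance level that a statistics package reports.

```python
def average_ranks(values):
    """Rank values from 1 upward; tied values share the mean of the ranks they occupy."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def wilcoxon_signed_rank(first, second):
    """Return the sums of positive and negative ranks of the paired differences.
    Pairs with no difference are dropped, as is conventional."""
    diffs = [b - a for a, b in zip(first, second) if b != a]
    ranks = average_ranks([abs(d) for d in diffs])
    positive = sum(r for d, r in zip(diffs, ranks) if d > 0)
    negative = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return positive, negative

# Hypothetical frequencies of one feature for the same five students
# on two occasions.
year1 = [4, 6, 5, 8, 7]
year2 = [6, 9, 5, 7, 11]
positive, negative = wilcoxon_signed_rank(year1, year2)
print(positive, negative)
```

With three or more occasions, the same comparison is repeated for each pair of occasions (Year 1 vs Year 2, Year 2 vs Year 3, and so on).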
Results and Discussion
Answers to the research questions, and reasons for the answers
Conclusion
Summary of the findings, theoretical and practical implications of the findings, and limitations of the study
References
Works cited
Appendices
Sample tagged text, etc.
IV. More on Statistical Analysis
Research Questions in Linguistic Research
1. Differences
2. Changes
3. Correlation
4. Effects
Differences (2 groups of subjects, 1 test)
1) independent t-test
Entering the data
Analyzing the data
Under Analyze, choose Compare Means, then Independent-Samples T Test. Move the dependent variable into the Test Variable box, and the independent variable into the Grouping Variable box. Click Define Groups and type in the values of the two groups.
Tabulating the results
Describing the results
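As a cross-check on the menu-driven analysis, the t statistic for two independent groups can be computed by hand. The sketch below assumes equal variances (the pooled-variance form of the test); the scores and the function name are invented for illustration, and the p-value still has to be looked up separately.

```python
from math import sqrt
from statistics import mean, variance

def independent_t(group_a, group_b):
    """Student's t for two independent groups, assuming equal variances.
    Returns the t statistic and its degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    pooled = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled * (1 / n_a + 1 / n_b))
    return t, n_a + n_b - 2

# Hypothetical test scores for two groups of subjects.
t, df = independent_t([2, 4, 6], [5, 7, 9])
print(round(t, 4), df)
```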
2) Mann-Whitney U
Entering the data
Analyzing the data
Under Analyze, choose Nonparametric Tests, then 2 Independent Samples. Move the dependent variable into the Test Variable List box, and the independent variable into the Grouping Variable box. Click Define Groups. Check off Mann-Whitney U.
Tabulating the results
Describing the results
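The Mann-Whitney U statistic has a simple interpretation: it counts, over all cross-group pairs, how often a score from one group exceeds a score from the other, with ties counted as one half. A minimal sketch under that definition follows; the scores and the function name are assumptions, and no significance level is computed.

```python
def mann_whitney_u(group_a, group_b):
    """Return (U for group_a, U for group_b), counting ties as one half."""
    u_a = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in group_a
        for b in group_b
    )
    u_b = len(group_a) * len(group_b) - u_a
    return u_a, u_b

# Hypothetical scores for two independent groups.
u_a, u_b = mann_whitney_u([3, 5, 8], [4, 9, 10, 12])
print(u_a, u_b, min(u_a, u_b))  # the smaller U is the one usually reported
```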
Differences (3 groups of subjects, 1 test)
1) one-way ANOVA
Entering the data
Analyzing the data
Under Analyze, choose Compare Means, then One-Way ANOVA. Move the dependent variable into the Dependent List box, and the independent variable into the Factor box. Click Post Hoc, and choose Tukey (equal number of cases in each group) or Bonferroni (unequal number of cases).
Tabulating the results
Describing the results
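The F ratio reported by a one-way ANOVA is the between-group mean square divided by the within-group mean square. The following sketch computes it directly; the three groups of scores and the function name are invented for illustration, and post hoc comparisons are not covered.

```python
from statistics import mean

def one_way_anova(*groups):
    """Return the F ratio with its between-group and within-group degrees of freedom."""
    scores = [x for group in groups for x in group]
    grand_mean = mean(scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical scores for three groups of subjects.
f, df_between, df_within = one_way_anova([1, 2, 3], [2, 3, 4], [5, 6, 7])
print(round(f, 4), df_between, df_within)
```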
2) Kruskal-Wallis H
Entering the data
Analyzing the data
Under Analyze, choose Nonparametric Tests, then K Independent Samples. Move the dependent variable into the Test Variable List box, and the independent variable into the Grouping Variable box. Click Define Range. Check off Kruskal-Wallis H.
Tabulating the results
Describing the results
Differences (3 groups of subjects, 2 tests)
MANOVA
Entering the data
Analyzing the data
Under Analyze, choose General Linear Model, then Multivariate. Move the dependent variables into the Dependent Variables box, and the independent variable into the Fixed Factor(s) box.
Tabulating the results
Describing the results
Differences (2 or 3 groups of subjects)
1) chi-square
Entering the data
Analyzing the data
Under Analyze, choose Descriptive Statistics, then Crosstabs. Move one variable into the Row(s) box and the other into the Column(s) box. Click Statistics, and check off Chi-square. Click Cells, and check off Expected.
Tabulating the results
Describing the results
2) one-way chi-square
Entering the data
Analyzing the data
Under Analyze, choose Nonparametric Tests, then Chi-Square. Move the variable into the Test Variable List box. Click OK.
Tabulating the results
Describing the results
Changes (1 group of subjects, 2 tests)
1) paired t-test
Entering the data
Analyzing the data
Under Analyze, choose Compare Means, then Paired-Samples T Test. Click on a pair of variables, and move them into the Paired Variables box.
Tabulating the results
Describing the results
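The paired t-test works on the differences between each subject's two scores: the mean difference is divided by its standard error. A minimal sketch, with invented scores and a hypothetical function name, is given below; the p-value is not computed.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(first, second):
    """Return the paired-samples t statistic and its degrees of freedom."""
    diffs = [b - a for a, b in zip(first, second)]
    t = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
    return t, len(diffs) - 1

# Hypothetical scores of the same four subjects on two tests.
t, df = paired_t([10, 12, 9, 11], [12, 15, 10, 14])
print(round(t, 3), df)
```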
2) Wilcoxon
Entering the data
Analyzing the data
Under Analyze, choose Nonparametric Tests, then 2 Related Samples. Move the variables into the Test Pair(s) List box.
Tabulating the results
Describing the results
Changes (1 group of subjects, 3 tests)
1) repeated-measures ANOVA
Entering the data
Analyzing the data
Under Analyze, choose General Linear Model, then Repeated Measures.
Tabulating the results
Describing the results
2) Wilcoxon
Entering the data
Analyzing the data
Under Analyze, choose Nonparametric Tests, then 2 Related Samples. Move the variables into the Test Pair(s) List box.
Tabulating the results
Describing the results
Correlation (2 or 3 variables)
1) Pearson
Entering the data
Analyzing the data
Under Analyze, choose Correlate, then Bivariate. Move the variables into the Variables box. Check off Pearson.
Tabulating the results
Describing the results
2) Spearman
Entering the data
Analyzing the data
Under Analyze, choose Correlate, then Bivariate. Move the variables into the Variables box. Check off Spearman.
Tabulating the results
Describing the results
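The difference between the two coefficients is easy to see in code: Spearman's coefficient is simply Pearson's coefficient computed on ranks instead of raw scores. The sketch below shows both in pure Python; the data and function names are invented for illustration, and significance levels are not computed.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson product-moment correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def average_ranks(values):
    """Rank values from 1 upward; tied values share the mean of the ranks they occupy."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def spearman(x, y):
    """Spearman rank correlation: Pearson's r applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: y rises with x, but not in a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(round(pearson(x, y), 4))   # below 1, because the relation is curved
print(round(spearman(x, y), 4))  # rank order is identical in x and y
```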
Effects (2 or 3 variables)
1) linear regression
Entering the data
Analyzing the data
Under Analyze, choose Regression, then Linear. Enter the dependent and independent variables. Choose an appropriate method (Stepwise or Enter), and click OK.
Tabulating the results
Describing the results
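For the simplest case, one independent variable entered in a single step, the regression line can be computed by ordinary least squares as sketched below. The data and the function name are invented for illustration; stepwise selection with several predictors is beyond this sketch.

```python
from statistics import mean

def simple_linear_regression(x, y):
    """Least-squares fit of y = intercept + slope * x.
    Returns slope, intercept, and R squared."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical values of an independent and a dependent variable.
slope, intercept, r_squared = simple_linear_regression([1, 2, 3, 4], [2, 4, 5, 7])
print(round(slope, 3), round(intercept, 3), round(r_squared, 4))
```

R squared indicates the proportion of variance in the dependent variable that the independent variable accounts for.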
2) categorical regression
Entering the data
Analyzing the data
Under Analyze, choose Regression, then Optimal Scaling. Enter the dependent and independent variables. Choose an appropriate method (Stepwise or Enter), and click OK.
Tabulating the results
Describing the results
V. Q and A