Compiling and Analyzing Your Own Learner Corpus
Xiaofei Lu
CALPER 2012 Summer Workshop
July 16, 2012
Workshop outline
- Opening discussion and corpora overview
- Graphic Online Language Diagnostic (GOLD) overview
- Sample GOLD (and related) projects
- GOLD (or related tool) project lab
- GOLD (or related tool) project discussions
- Concluding discussion
Opening discussion
- Brief introduction of your professional/language background and teaching/research interests
- Prior experience with corpus linguistics
- Primary challenges you are dealing with
- Primary purposes and goals for taking this workshop and for learning about corpus linguistics in general
- Any other relevant information
Corpora overview
- What is a corpus
- Types of corpora
- Corpus design and compilation
- Corpus annotation
- Corpus querying and analysis
- Learner corpora and L2 development
- Resources
What is a corpus?
- Leech (1992): an unexciting phenomenon, a helluva lot of text, stored on a computer
- Sinclair (1991): a collection of naturally-occurring language text, chosen to characterize a state or a variety of language
- Sinclair (2004): a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research
Types of corpora
- General-purpose vs. specialized corpora
  - British National Corpus & Russian National Corpus
  - Michigan Corpus of Academic Spoken English
- Native vs. learner corpora
  - International Corpus of Learner English
  - Spanish Learner Language Oral Corpora
- Monolingual vs. parallel & comparable corpora
  - The JRC-Acquis Multilingual Parallel Corpus
  - The English-Chinese Parallel Concordancer
Types of corpora (cont.)
- Corpora representing one or diverse varieties
  - International Corpus of English
- Synchronic vs. diachronic corpora
  - The Corpus of Historical American English
- Spoken vs. written corpora
  - Michigan Corpus of Upper-Level Student Papers
Corpus design
- Purpose and type of corpus
  - Spoken/written; cross-sectional/longitudinal
- External criteria for content selection
  - Communicative function of a text
  - Mode, medium, interaction, domain, topic
- Representativeness, balance, size, sampling
  - Design of the BNC
Corpus design (cont.)
- Encoding meaningful metadata information
  - Learner: L1, gender, program level, discipline …
  - Sample: date, mode, task, genre, rating …
  - Facilitates contrastive and longitudinal studies
  - MICASE speaker and transcript attributes
- Corpus markup: The ICE example
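The learner and sample metadata listed above can be kept with each text so that subsets are easy to select. The following is a minimal Python sketch; the `Sample` record, its field names, and the three sample entries are invented for illustration and do not reflect the MICASE or ICE formats.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    """One learner text plus its metadata (hypothetical field names)."""
    learner_id: str
    l1: str
    program_level: str
    date: str   # ISO format, so string order matches chronological order
    genre: str
    text: str

def select(samples, **conditions):
    """Return the samples whose metadata match every given condition."""
    return [s for s in samples
            if all(getattr(s, key) == value for key, value in conditions.items())]

def timeline(samples, learner_id):
    """Return one learner's samples in chronological order (longitudinal view)."""
    return sorted(select(samples, learner_id=learner_id), key=lambda s: s.date)

# Invented toy data.
samples = [
    Sample("s01", "Chinese", "intermediate", "2012-05-01", "essay", "Second essay text"),
    Sample("s01", "Chinese", "intermediate", "2012-02-01", "essay", "First essay text"),
    Sample("s02", "Spanish", "advanced", "2012-02-01", "email", "An email text"),
]

print(len(select(samples, l1="Chinese")))          # contrastive subset by L1
print([s.date for s in timeline(samples, "s01")])  # same learner over time
```

Filtering by `l1` supports contrastive comparisons, while `timeline` supports longitudinal ones.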
Corpus annotation
- Why annotate
- Levels of corpus annotation
- Difficulties for corpus annotation
- Standards and encoding
Why annotate
- Raw text vs. annotated text: How do you…
  - Count the number of words in a Chinese text?
  - Calculate the lexical density of an English text?
  - Count the frequency of can as a modal verb?
  - Know how many T-units in a text are complex?
  - Extract all imperative sentences from a text?
  - Know whether a syntactic structure is used in a text?
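One of the questions above, counting can as a modal verb, can be illustrated with a short Python sketch. It assumes the text is already POS-tagged in a simple word_TAG format with Penn Treebank tags; the example sentence is hand-tagged for illustration, not tagger output, and `count_tagged` is a hypothetical helper name.

```python
def count_tagged(tagged_text, word, tag):
    """Count tokens of the form word_TAG (case-insensitive on the word)."""
    count = 0
    for token in tagged_text.split():
        form, _, pos = token.rpartition("_")
        if form.lower() == word and pos == tag:
            count += 1
    return count

# Hand-tagged toy example (MD = modal, NN = noun).
tagged = "I_PRP can_MD open_VB the_DT can_NN ._. Can_MD you_PRP ?_."

# Without the tags, every occurrence of "can" looks the same.
raw_count = sum(1 for t in tagged.split() if t.rpartition("_")[0].lower() == "can")
modal_count = count_tagged(tagged, "can", "MD")

print(raw_count, modal_count)  # 3 2
```

The raw count mixes the noun and the modal; only the annotated text separates them.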
Levels of corpus annotation
- Sentence and word segmentation
- Part-of-speech (POS) tagging and lemmatization
- Syntactic parsing
- Semantic, pragmatic, and discourse annotation
- Learner corpora: error annotation
- Project-specific annotation
Sentence and word segmentation
- Why is this non-trivial?
  - I went to the shops in Jones St. Saturday afternoon with Mr. Smith.
  - I can’t remember whether it’s a second- or third-grade book.
  - 克林顿在讲话中指出
    Clinton pointed out in his speech (that…)
  - 克林顿 在 讲话 中 指出
    Clinton at speech middle point-out
  - 克林顿 在 讲话 中指 出
    Clinton at speech middle-finger out
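Both difficulties above can be reproduced in a few lines of Python. This is a rough sketch under stated assumptions: the abbreviation list and the Chinese lexicon are tiny hand-made examples, and forward maximum matching is just one simple segmentation strategy, not the method used by any tool named in this workshop.

```python
import re

ABBREVIATIONS = {"St.", "Mr.", "Mrs.", "Dr."}  # tiny illustrative list, not exhaustive

def naive_split(text):
    """Split after every ., ! or ? that is followed by whitespace."""
    return [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]

def split_sentences(text):
    """Like naive_split, but do not break after a known abbreviation."""
    sentences, buffer = [], ""
    for piece in naive_split(text):
        buffer = f"{buffer} {piece}".strip()
        if buffer.split()[-1] not in ABBREVIATIONS:
            sentences.append(buffer)
            buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

def forward_max_match(text, lexicon, max_len=3):
    """Greedy left-to-right segmentation: always take the longest lexicon word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:  # single characters always pass
                words.append(candidate)
                i += size
                break
    return words

english = ("I went to the shops in Jones St. Saturday afternoon with Mr. Smith. "
           "I can't remember whether it's a second- or third-grade book.")
LEXICON = {"克林顿", "在", "讲话", "中", "指出", "中指", "出"}  # toy lexicon

print(len(naive_split(english)), len(split_sentences(english)))  # 4 2
print(forward_max_match("克林顿在讲话中指出", LEXICON))
```

The naive splitter breaks after St. and Mr., and the greedy segmenter picks 中指 ("middle finger") over 中 + 指出, which is the wrong reading shown above. The abbreviation check has its own limits, since an abbreviation such as St. can also end a sentence.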
POS tagging
- The what and why
- What are the difficulties?
  - Ambiguity: 48% of tokens in the Brown Corpus
  - Unknown words: neologisms
  - Tagsets: overspecification vs. underspecification
  - Penn Treebank Tagset vs. CLAWS7 Tagset
Lemmatization
- Counting linguistic items
  - Types – number of different words
  - Tokens – number of words
- What constitutes a different word type?
  - go, went, gone, goes, going?
  - differ, difference, different, differently?
  - can as a noun, verb, and modal verb?
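The effect of lemmatization on type counts can be shown with a small Python sketch. The lemma table below is a hand-built toy covering only forms of go; a real lemmatizer would be needed for actual data, and `count_types_tokens` is a hypothetical helper name.

```python
LEMMAS = {"went": "go", "gone": "go", "goes": "go", "going": "go"}  # toy table

def count_types_tokens(words, lemmas=None):
    """Return (types, tokens); map each word to its lemma first if a table is given."""
    lemmas = lemmas or {}
    normalized = [lemmas.get(w.lower(), w.lower()) for w in words]
    return len(set(normalized)), len(normalized)

words = "She goes where he went and they go where we are going".split()

print(count_types_tokens(words))          # (11, 12): word forms counted as types
print(count_types_tokens(words, LEMMAS))  # (8, 12): go, goes, went, going collapse
```

The token count stays the same, but the type count depends on whether inflected forms are treated as different words.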
Demos and tools: Part 1
- Xerox morphological analyzer (demo only)
- ICTCLAS for Chinese segmentation and POS tagging
- Querying POS-tagged corpora and Stanford POS tagger for English
- Tree Tagger for multiple languages
Chunking and parsing
- Partial/full structural analysis of each sentence
  - My dog likes eating sausage.

    (ROOT
      (S
        (NP (PRP$ My) (NN dog))
        (VP (VBZ likes)
          (S
            (VP (VBG eating)
              (NP (NN sausage)))))
        (. .)))
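A bracketed parse like the one above is easy to read into a nested structure for further processing. The sketch below is a minimal Python reader for this bracket notation; `parse_brackets` and `leaves` are hypothetical helper names, and the code only reads an existing parse, it does not parse sentences.

```python
import re

def parse_brackets(text):
    """Turn a bracketed parse string into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    stack, root = [], None
    for token in tokens:
        if token == "(":
            stack.append([])
        elif token == ")":
            node = stack.pop()
            if stack:
                stack[-1].append(node)
            else:
                root = node
        else:
            stack[-1].append(token)
    return root

def leaves(node):
    """Yield (POS tag, word) pairs from the nodes directly above the words."""
    if len(node) == 2 and isinstance(node[1], str):
        yield node[0], node[1]
    else:
        for child in node[1:]:
            yield from leaves(child)

parse = ("(ROOT (S (NP (PRP$ My) (NN dog)) (VP (VBZ likes) "
         "(S (VP (VBG eating) (NP (NN sausage))))) (. .)))")
tree = parse_brackets(parse)

print(tree[0])             # ROOT
print(list(leaves(tree)))  # tagged words in sentence order
```

Once the tree is a nested list, counting phrase labels or pulling out all NPs is a matter of walking the structure.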
Chunking and parsing (cont’d)
- What is it useful for?
  - Retrieving examples of grammatical patterns
  - Grammar checking, syntactic complexity analysis
  - NLP applications that require syntactic analysis
- Difficulties
  - Ungrammatical sentences
  - Ambiguities, e.g., PP attachment
  - Errors from preprocessing steps
Semantic and discourse analysis
- Semantic and discourse features
- Word sense disambiguation
- Propositional idea density
- Coherence and cohesion
Annotation standards and encoding
- Useful standards
  - Separable, linguistically consensual
  - Documentation, compatibility with existing standards
- Encoding
  - Simple encoding: present_JJ
  - XML-style: <w type="JJ">present</w>
  - Format varies, depending on level of annotation
- Manual, computer-aided, and automatic annotation
  - Efficiency, scale, reliability
  - UAM CorpusTool
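The two encodings above carry the same information, so one can be converted into the other mechanically. A minimal Python sketch, assuming whitespace-separated word_TAG tokens as input; `to_xml` is a hypothetical helper name, and the `w` element with a `type` attribute simply mirrors the example on this slide.

```python
from xml.sax.saxutils import escape, quoteattr

def to_xml(tagged_text):
    """Convert word_TAG tokens into <w type="TAG">word</w> elements."""
    elements = []
    for token in tagged_text.split():
        word, _, tag = token.rpartition("_")
        # escape() and quoteattr() protect characters such as & and <.
        elements.append(f"<w type={quoteattr(tag)}>{escape(word)}</w>")
    return " ".join(elements)

print(to_xml("the_DT present_JJ situation_NN"))
```

Splitting on the last underscore keeps words that contain an underscore themselves intact.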
Demos and tools: Part 2
- Stanford parser for Arabic, Chinese and English
- Word sense disambiguation demo
- Computerized Propositional Idea Density Rater
- Coh-Metrix for text coherence analysis
- CHILDES and CLAN
- Computerized Profiling
- WMatrix
Corpus querying and analysis
- Manual analysis?
- Corpus-specific online interfaces
  - Raw: MICASE and MICUSP
  - POS-tagged: Corpora @ BYU
  - Grammatically and semantically tagged: RNC
- General-purpose online interfaces: GOLD
- Windows-based querying/concordancing tools
  - WordSmith Tools & AntConc
Corpus querying and analysis
- Natural language processing tools
  - Good for processing annotated corpora
  - Extracting occurrences of grammatical patterns
  - Examples: Stanford parser and Tregex
Resources
- Books and journals
  - Hunston (2002): Corpora in Applied Linguistics
  - McEnery (2006): Corpus-Based Language Studies
  - International Journal of Corpus Linguistics
  - Corpus Linguistics and Linguistic Theory
  - Corpora
- Websites and mailing lists
  - Bookmarks for corpus-based linguists
  - Linguistic data consortium
  - The corpora list; corpus in delicious
  - Stanford Natural Language Processing Group
Discussion
- What kind of corpus do you intend to compile and/or use? For what purpose?
- What are the design issues?
- How do you intend to format, organize and store your files?
- Do you intend to annotate your corpus in some way? How?
- How do you intend to search/query your corpus?
Learner corpora and L2 development
- Samples from the same students at different times
  - Did (targeted) language development take place?
  - Was a particular pedagogical intervention effective?
- Samples from different students
  - In what areas do students show different levels of development?
  - What factors affect students’ language development?
Graphic Online Language Diagnostic
- A free online tool for teachers to assess their students’ language development
  - Developed at CALPER, Penn State, funded by DOE
  - Project co-directors: Xiaofei Lu and Michael McCarthy
- Teachers can use GOLD to
  - Compile, upload, and manage their own corpora
  - Share corpora with each other
  - Search and analyze corpora
- Demonstration
Corpus compilation
- A user can compile a corpus by
  - Directly compiling and uploading an XML file
  - Using the easy-to-use guided XML creation interface
- An uploaded corpus can be easily managed
  - Documents can be added or deleted
  - The whole corpus can be deleted
  - Content and metadata of individual documents can be easily accessed
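For the first route, an XML corpus file can be generated with a short script instead of by hand. The Python sketch below is illustrative only: the element names (`corpus`, `document`, `metadata`, `text`) and the sample documents are assumptions, and the actual schema GOLD expects is not described in these slides and should be checked before uploading.

```python
import xml.etree.ElementTree as ET

def build_corpus(name, documents):
    """Build a corpus element with one child per document (hypothetical schema)."""
    corpus = ET.Element("corpus", name=name)
    for doc in documents:
        node = ET.SubElement(corpus, "document", id=doc["id"])
        meta = ET.SubElement(node, "metadata")
        for key, value in doc["metadata"].items():
            ET.SubElement(meta, key).text = value
        ET.SubElement(node, "text").text = doc["text"]
    return corpus

# Invented toy documents.
documents = [
    {"id": "d1", "metadata": {"L1": "Chinese", "genre": "essay"},
     "text": "My first essay."},
    {"id": "d2", "metadata": {"L1": "Spanish", "genre": "email"},
     "text": "Dear professor, I have a question."},
]

xml_string = ET.tostring(build_corpus("demo", documents), encoding="unicode")
print(xml_string)
```

Generating the file programmatically keeps metadata fields consistent across documents, which matters later when searching by metadata conditions.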
Corpus sharing
- GOLD facilitates easy data sharing
- A corpus may be set to be
  - Private, shared, or public
- Corpus owner may give other users right to
  - View, add, edit, or delete corpora
- Demonstration
Basic corpus information
- Word count
  - Alphabetic or numeric order
  - Can be downloaded as a text file
- Corpus and document statistics
  - Mean sentence length
  - Mean word length
  - Type-token ratio
- Demonstration
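The statistics named above are simple ratios and can be computed by hand for a small text. A rough Python sketch, assuming non-empty English text and a deliberately crude regex tokenizer; it shows the definitions only and is not GOLD's implementation.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep runs of letters and apostrophes (English-only shortcut)."""
    return re.findall(r"[a-z']+", text.lower())

def word_list(text, order="alphabetic"):
    """Word frequencies sorted alphabetically, or by descending count ("numeric")."""
    counts = Counter(tokenize(text))
    if order == "alphabetic":
        return sorted(counts.items())
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

def corpus_stats(text):
    """Mean sentence length (in words), mean word length (in letters), type-token ratio."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = tokenize(text)
    return {
        "mean_sentence_length": len(words) / len(sentences),
        "mean_word_length": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
    }

text = "The cat sat. The cat ran away."
stats = corpus_stats(text)

print(word_list(text, order="numeric"))
print(stats["mean_sentence_length"])  # 3.5
```

Type-token ratio falls as texts get longer, so it is best compared across samples of similar length.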
Corpus search
- Select one or more corpora to search
- Specify key words or phrases
  - May use the wildcard character, e.g. book*
- Specify contexts
  - Size of context window
  - Context words and their positions
- Specify metadata conditions
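A wildcard search with a context window can be sketched in Python as follows. This is an illustration of the idea, not GOLD's search engine: it uses the standard `fnmatch` module for the wildcard (which also interprets `?` and `[...]`), handles single-word patterns only, and the `kwic` function name and example sentence are invented.

```python
from fnmatch import fnmatchcase

def kwic(tokens, pattern, window=3):
    """Return (left context, keyword, right context) for each token matching pattern."""
    lines = []
    for i, token in enumerate(tokens):
        if fnmatchcase(token.lower(), pattern):
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines

tokens = "I bought two books and a bookmark at the bookshop near my school".split()

for left, keyword, right in kwic(tokens, "book*", window=2):
    print(f"{left:>12} | {keyword} | {right}")
```

Keeping the left context, keyword, and right context as separate fields is what makes a KWIC display sortable by any of the three.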
Corpus search results
- Display of search results
  - Sortable KWIC display of search results
  - Sortable graphic display of search results
- Demonstration
Lexical bundle/collocation search
- Procedure
  - Select one or more corpora to search
  - Specify search word
  - Specify contexts
  - Specify metadata conditions
- Search results
  - Sortable list of n-grams found in selected corpora
- Demonstration
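The n-gram list described above can be approximated with a few lines of Python. A minimal sketch, assuming a pre-tokenized text and a fixed n-gram length; `ngrams_with` is a hypothetical helper and the sentence is invented.

```python
from collections import Counter

def ngrams_with(tokens, word, n=3):
    """Count every n-gram that contains the search word, most frequent first."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if word in gram:
            counts[gram] += 1
    return counts.most_common()

tokens = "on the other hand we agree but on the other hand they do not".split()
bundles = ngrams_with(tokens, "other", n=3)

for gram, freq in bundles:
    print(freq, " ".join(gram))
```

Recurrent sequences such as "the other hand" rise to the top of the list, while one-off combinations stay at frequency 1.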
Summary of features
- Difference from other online tools
  - Can create, share, and search multiple corpora
  - Can easily search subsets of data
  - Can work with any language
- Summary of corpus analysis functions
  - Word list
  - Corpus and document statistics: mean sentence length, mean word length, type-token ratio
  - Corpus search and collocation search
Sample questions to ask
- With data from an individual student, one can either describe or track development in
  - Patterns of usage of words and phrases – frequency, underuse, overuse, etc.
  - Lexical and syntactic complexity
  - Appropriate usage of words and phrases in context
  - Patterns of usage of lexical bundles
Sample questions to ask (cont.)
- With data from different (groups of) students, one can compare similarities or differences among different (groups of) students in terms of
  - Patterns of usage of words and phrases – frequency, underuse, overuse, etc.
  - Lexical and syntactic complexity
  - Appropriate usage of words and phrases in context
  - Patterns of usage of lexical bundles
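Overuse and underuse are usually judged by comparing normalized frequencies, since corpora differ in size. The Python sketch below shows one simple way to do this, a rate per 1,000 words and a ratio between two groups; the token lists are invented toy data, and no statistical significance test is included.

```python
def per_thousand(tokens, word):
    """Frequency of word per 1,000 tokens (assumes a non-empty token list)."""
    return 1000 * sum(1 for t in tokens if t.lower() == word) / len(tokens)

def usage_ratio(group_tokens, reference_tokens, word):
    """Ratio above 1 suggests overuse relative to the reference, below 1 underuse."""
    reference_rate = per_thousand(reference_tokens, word)
    if reference_rate == 0:
        return None  # the word never occurs in the reference data
    return per_thousand(group_tokens, word) / reference_rate

# Invented toy data: 100 tokens per group.
group_a = ["very"] * 4 + ["good"] * 96
group_b = ["very"] * 2 + ["good"] * 98

print(per_thousand(group_a, "very"))          # 40.0
print(usage_ratio(group_a, group_b, "very"))  # 2.0
```

The same comparison works for one student at two points in time, which connects it to tracking development.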
Future enhancements
- Corpora for benchmarking
- Multilingual natural language processing
- Suggestions on desirable functions welcome