Post on 29-Dec-2015

PowerConc: An R-gram Based Corpus Analysis Tool

Jiajin Xu & Yunlong Jia
Beijing Foreign Studies University

PowerConc

• National Research Centre for Foreign Language Education, Beijing Foreign Studies University
• A general-purpose tool for corpus analysis
• Developed in Delphi
• Can deal with any ANSI-encoded texts
– E.g. on a Simplified Chinese OS
– Works well with Simplified/Traditional Chinese texts, (un)tokenised or raw/POS-tagged, as well as raw/POS-tagged English texts

PowerConc

• Size: 1.5MB; the compressed package is less than 1MB.

• Installation: none required.

• OS: currently works only on Windows.

Design principles for PowerConc

Ideally

• Most powerful: can do anything that a concordancer can do, and things it cannot do
• Involves the least effort in learning to use it
• Doing MORE with less
• Reductionism in software design

Fewer buttons and/or tabs

(Interface screenshot; labels: Frequency count, Search list)

Freq. Count

• Concordance
• N-gram list
• Collocation & Colligation
• Key n-gram list
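The slide above appears to place frequency counting underneath the other functions. As a minimal sketch of that idea (not PowerConc's actual Delphi code, which the slides do not show), an n-gram list can be reduced to counting joined token windows. The function name `ngram_freq` and the sample sentence are hypothetical.

```python
from collections import Counter

def ngram_freq(tokens, n):
    """Count every contiguous n-token sequence in a token list."""
    # zip over n staggered copies of the list yields each n-token window once
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(w) for w in windows)

# Hypothetical sample text, whitespace-tokenised
tokens = "a lot of people and a lot of time".split()

unigrams = ngram_freq(tokens, 1)   # plain word frequency list
bigrams = ngram_freq(tokens, 2)    # 2-gram list
trigrams = ngram_freq(tokens, 3)   # 3-gram list

for gram, freq in trigrams.most_common(3):
    print(freq, gram)
```

With `n = 1` the same function gives an ordinary word frequency list, which is one way to read "doing more with less".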

More possibilities in tool development

• Corpus-informed/related ‘grammars’
– Pattern grammar (local grammar)
– Collostruction
– Lexical grammar (natural grammar, real grammar)
– Lexical priming (textual colligation)
– Longman grammar: Biber et al. grammar (register variation)
• Tool development lags behind

From phraseology to R-gram

• Many of these ‘grammars’ can be seen as some sort of phraseology.

• We coined a technical term, ‘R-gram’.
– An operational parallel to phraseology
– The unit of language can be words, lemmata, phrases, POS, POS sequences, or any combination of these.
– Can be linguistic structures with uncertain words or categories (e.g. be passive/get passive).

• a * of: collocational framework
• It be ADJ that: evaluative construction
• Noun noun compounds
• Bi-nominal constructions
• Passive constructions: be/get ADV. V-EN
• All these could be matched with Regular Expressions.
• But Regex is too difficult for lay users.

Easy search with enhanced hits

• Smart Input
• Three meta-characters in Smart Input syntax, the simplest grammar ever.

• @be returns all inflectional forms of ‘be’

• #n returns all nouns

• * refers to any single word
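To illustrate how three meta-characters could stand in for a full regular expression, here is a rough sketch that rewrites a Smart Input query as a regex over `word_TAG` text. It is not PowerConc's implementation. The lemma table, the tag-prefix table, the function name `smart_to_regex`, and the CLAWS-like tags in the sample are all assumptions, since the slides do not specify them.

```python
import re

# Hypothetical expansion tables; the real lists used by PowerConc are not given.
LEMMA_FORMS = {"be": ["be", "am", "is", "are", "was", "were", "been", "being"]}
POS_PREFIX = {"n": "N", "adj": "J", "v": "V", "adv": "R"}  # assumed tag prefixes

def smart_to_regex(query):
    """Translate a Smart Input query into a regex over word_TAG tokens."""
    parts = []
    for item in query.split():
        if item == "*":                      # any single word, any tag
            parts.append(r"\S+_\S+")
        elif item.startswith("@"):           # all listed forms of a lemma
            lemma = item[1:].lower()
            forms = LEMMA_FORMS.get(lemma, [lemma])
            alternatives = "|".join(re.escape(f) for f in forms)
            parts.append("(?i:" + alternatives + r")_\S+")
        elif item.startswith("#"):           # any word whose tag starts with the prefix
            tag = item[1:].lower()
            prefix = POS_PREFIX.get(tag, tag.upper())
            parts.append(r"\S+_" + re.escape(prefix) + r"\S*")
        else:                                # literal word, case-insensitive
            parts.append("(?i:" + re.escape(item) + r")_\S+")
    # Lookarounds keep the match aligned to whole tokens
    return re.compile(r"(?<!\S)" + r"\s".join(parts) + r"(?!\S)")

# Hypothetical tagged sample
tagged = ("It_PPH1 is_VBZ clear_JJ that_CST a_AT1 lot_NN1 of_IO "
          "corpus_NN1 linguists_NN2 agree_VV0")

for q in ["a * of", "It @be #adj that", "#n #n"]:
    m = smart_to_regex(q).search(tagged)
    print(q, "=>", m.group(0) if m else None)
```

The user types plain words plus `@`, `#` and `*`; the regex is generated behind the scenes, which is the kind of effort saving the slide describes.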

• a * of => a * of
• It be ADJ that => It @be #adj that
• Noun noun compound => #n #n
• Bi-nominal => #n and #n
• Passive => \S+_VB\S+\s(\S+_[RXPJDN]\S+\s)*\S+_V\S*N
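The passive pattern above can be tried directly in any regex engine. The sketch below runs it with Python's `re` module on a hand-tagged sentence. The `word_TAG` format follows the pattern itself; the specific CLAWS-like tags in the sample are an assumption, because the slides do not name the tagset.

```python
import re

# The passive pattern from the slide, unchanged:
# a VB* form, optional intervening R/X/P/J/D/N-tagged words, then a V...N participle
PASSIVE = re.compile(r"\S+_VB\S+\s(\S+_[RXPJDN]\S+\s)*\S+_V\S*N")

# Hypothetical tagged sentence (tags are an assumption, CLAWS-like)
sentence = ("The_AT book_NN1 was_VBDZ not_XX carefully_RR written_VVN "
            "by_II her_PPHO1")

# finditer is used so that group(0) gives the full matched span
hits = [m.group(0) for m in PASSIVE.finditer(sentence)]
print(hits)
```

On this sample the match spans from the auxiliary to the participle, including the intervening negator and adverb.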

Limitation

• Speed
• A concordancer that does not apply indexing
• Cannot process texts larger than a few million words anyway

Download PowerConc

• www.fleric.org.cn/powerconc/
• http://www.bfsu-corpus.org/channels/tools

Thank you!