Fex Feature Extractor - v2. Topics Vocabulary Syntax of scripting language –Feature functions...

Post on 28-Dec-2015


Fex Feature Extractor - v2

Topics

• Vocabulary
• Syntax of scripting language
– Feature functions
– Operators
• Examples
– POS tagging
• Input Formats

Vocabulary

• example – A list of active records for which Fex produces a single SNoW example. Usually a sentence.

• record – A single position in an example (sentence). Contains a list of fields, each of which holds a different piece of information, e.g. NLP: word, tag; vision: color, etc.

• Raw input to Fex – A list of valid examples (raw sentences, tagged corpora, etc.).

• Fex’s output – Lexical features are written to the lexicon file. Their corresponding numeric IDs are written to the example file.

• feature function – A relation among one or more records.

Example: Feature Functions

Script Syntax

• A Fex script file contains a list of definitions, each of which will rewrite the given observation into a set of active features.

• Definition format, terms in ()’s optional:

target (inc) (loc): FeatureFunc ([left, right])

• target - Target index or word. To treat each record in the observation as a target, use -1. This is a macro for “all words”.

• inc - Include target word instead of placeholder (*) in some features.

• loc - Generate features with location relative to target.

• FeatureFunc - A feature function defined in terms of certain unary and n-ary relations, and operators.

• left - Left offset of scope for generating features. Negative values are left of the target, positive to the right.

• right - Right offset of scope.
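
The definition format above can be illustrated with a small parser. This is a minimal sketch, not Fex's own parser: the `Definition` class, the `parse_definition` function, and the regular expression are hypothetical names introduced here, and the sketch assumes that `inc` and `loc` appear as plain keywords between the target and the colon, as in the script lines shown in these slides.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Definition:
    """One parsed script line (hypothetical structure, for illustration only)."""
    target: str            # target index or word, e.g. "-1" or "dog"
    inc: bool              # include the target word instead of the placeholder
    loc: bool              # generate features with location relative to target
    func: str              # feature function expression, e.g. "w&t"
    left: Optional[int]    # left offset of the scope, if given
    right: Optional[int]   # right offset of the scope, if given


# target (inc) (loc): FeatureFunc ([left, right])
_PATTERN = re.compile(
    r"^\s*(?P<target>\S+?)"
    r"(?P<flags>(?:\s+(?:inc|loc))*)"
    r"\s*:\s*(?P<func>.+?)"
    r"(?:\s*\[\s*(?P<left>-?\d+)\s*,\s*(?P<right>-?\d+)\s*\])?\s*$"
)


def parse_definition(line: str) -> Definition:
    match = _PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a definition: {line!r}")
    flags = match.group("flags").split()
    left, right = match.group("left"), match.group("right")
    return Definition(
        target=match.group("target"),
        inc="inc" in flags,
        loc="loc" in flags,
        func=match.group("func"),
        left=int(left) if left is not None else None,
        right=int(right) if right is not None else None,
    )


print(parse_definition("-1 loc: t [-2,2]"))
print(parse_definition("dog: w&t [-1,-1]"))
```

The scope is kept optional because some definitions in the slides, such as `-1: lab(t)`, carry no `[left, right]` range.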

Basic Feature Functions

Type        Fex notation   Output to lexicon     Interpretation
Label       lab            lab[target word]      Produces a label feature
            lab(t)         lab[target tag]
Word        w              w[current word]       Active if the word(s) in the current record are within scope
Tag (POS)   t              t[current tag]        Active if the tag(s) in the current record are within scope
Vowel       v              v[initial vowel]      Active if the word(s) in the current record begin with a vowel
Prefix      pre            pre[active prefix]    Active if the word(s) in the current record begin with a prefix in a given list
Suffix      suf            suf[active suffix]    Active if the word(s) in the current record end with a suffix in a given list
Baseline    base           base[baseline tag]    Active if a baseline tag from a prepared list exists for the word(s) in the current record
Lemma       lem            lem[active lemma]     Active if a lemma from the WordNet database exists for the word(s) in the current record

Example

• Sentence = “(DET The) (NN dog) (V is) (JJ mad)”

Method 1

Script Def        Output to lexicon
dog: w [-1,1]     10001 w[The]
                  10002 w[is]
dog: t [1,2]      10003 t[V]
                  10004 t[JJ]

Output to example file: 10001, 10002, 10003, 10004:

Method 2

Script Def        Output to lexicon
-1: lab
-1: w [-1,1]      10001 w[The]
                  10002 w[is]
-1: t [1,2]       10003 t[V]
                  10004 t[JJ]

Output to example file: 1, 10001, 10002, 10003, 10004:
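
The Method 1 output can be reproduced with a short simulation. This is a sketch under stated assumptions, not Fex's implementation: the helper names (`parse_tagged`, `window_features`, `assign_ids`) are hypothetical, the target position itself is assumed to be skipped (consistent with the output above, where `w[dog]` does not appear), and lexicon IDs are assumed to start at 10001 and grow in order of first appearance.

```python
import re


def parse_tagged(sentence):
    """Parse '(TAG word) (TAG word) ...' into a list of (tag, word) records."""
    return re.findall(r"\((\S+) (\S+)\)", sentence)


def window_features(records, target_index, field, left, right):
    """Emit field[value] for every record in the scope, skipping the target."""
    features = []
    for offset in range(left, right + 1):
        pos = target_index + offset
        if offset == 0 or not 0 <= pos < len(records):
            continue
        tag, word = records[pos]
        value = word if field == "w" else tag
        features.append(f"{field}[{value}]")
    return features


def assign_ids(features, lexicon, first_id=10001):
    """Map each lexical feature to a numeric ID, allocating new IDs on demand."""
    ids = []
    for feature in features:
        if feature not in lexicon:
            lexicon[feature] = first_id + len(lexicon)
        ids.append(lexicon[feature])
    return ids


records = parse_tagged("(DET The) (NN dog) (V is) (JJ mad)")
target = [word for _, word in records].index("dog")

# dog: w [-1,1]  and  dog: t [1,2]
features = (window_features(records, target, "w", -1, 1)
            + window_features(records, target, "t", 1, 2))

lexicon = {}
example_ids = assign_ids(features, lexicon)

for feature, feature_id in lexicon.items():
    print(feature_id, feature)                       # lexicon file lines
print(", ".join(map(str, example_ids)) + ":")       # example file line
```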

Operators & Complex Functions

• (X) operator - Indicates that a feature is active without any specific instantiation.

Script Def          Output to Lexicon
dog: v(X) [-1,1]    10001 v[]

• (x=y) operator – Creates an active feature iff the active instantiation matches the given argument.

Script Def          Output to Lexicon
dog: w(x=is)        10001 w[is]

Sentence = “(DET The) (NN dog) (V is) (JJ mad)”

• & operator - Conjoins two features, producing a new feature which is active iff the record fulfills both constituent features.

Script Def          Output to Lexicon
dog: w&t [-1,-1]    10001 w[The]&t[DET]

• | operator - Disjunction of two features, outputting a feature for each term of the disjunction that is active in the current record.

Script Def          Output to Lexicon
dog: w|t [-1,-1]    10001 w[The]
                    10002 t[DET]

Sentence = “(DET The) (NN dog) (V is) (JJ mad)”
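
The difference between the two operators can be sketched as follows. This is only an illustration of the semantics described above: the functions `w`, `t`, `conjunction`, and `disjunction` are hypothetical stand-ins, not Fex internals, and a record is assumed to be a `(tag, word)` pair.

```python
def w(record):
    """Word feature: active when the record has a word."""
    tag, word = record
    return f"w[{word}]" if word else None


def t(record):
    """Tag feature: active when the record has a tag."""
    tag, word = record
    return f"t[{tag}]" if tag else None


def conjunction(f, g):
    """f&g: one combined feature, active only if both constituents are active."""
    def feature(record):
        a, b = f(record), g(record)
        return [f"{a}&{b}"] if a and b else []
    return feature


def disjunction(f, g):
    """f|g: one output feature per constituent that is active in the record."""
    def feature(record):
        return [x for x in (f(record), g(record)) if x]
    return feature


# The record at offset -1 from "dog" in the sentence above.
record = ("DET", "The")
conj = conjunction(w, t)(record)
disj = disjunction(w, t)(record)
print(conj)  # ['w[The]&t[DET]']
print(disj)  # ['w[The]', 't[DET]']
```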

Operators & Complex Functions

• coloc function - Consecutive feature function: takes two or more features as arguments to produce a consecutive collocation over two or more records. The order of the arguments is preserved in the active feature.

Script Def                  Output to Lexicon
mad: coloc(w, t) [-3,-1]    10001 w[The]-t[NN]
                            10002 w[dog]-t[V]

• scoloc function – Sparse Consecutive feature function: operates similarly to coloc, except that active collocations need not be consecutive. However, the order of the arguments is still preserved in determining whether a feature is active.

Script Def                  Output to Lexicon
mad: scoloc(w,t) [-3,-1]    10001 w[The]-t[NN]
                            10002 w[dog]-t[V]
                            10003 w[The]-t[V]
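
One way to read the two functions: coloc slides a fixed-size window over consecutive records, while scoloc takes every order-preserving selection of records. The sketch below follows that reading; the function names and the `(tag, word)` record layout are assumptions made for illustration, and the order in which scoloc features are emitted may differ from the order Fex writes them to the lexicon.

```python
from itertools import combinations


def field(name, record):
    """Render a single-record feature such as w[The] or t[NN]."""
    tag, word = record
    return f"{name}[{word if name == 'w' else tag}]"


def coloc(records, names):
    """Consecutive collocation: apply names[i] to the i-th record of each run."""
    n = len(names)
    return ["-".join(field(nm, rec) for nm, rec in zip(names, records[i:i + n]))
            for i in range(len(records) - n + 1)]


def scoloc(records, names):
    """Sparse collocation: records need not be adjacent, but order is kept."""
    return ["-".join(field(nm, rec) for nm, rec in zip(names, combo))
            for combo in combinations(records, len(names))]


# Records in scope [-3,-1] relative to the target "mad".
window = [("DET", "The"), ("NN", "dog"), ("V", "is")]

consecutive = coloc(window, ["w", "t"])
sparse = scoloc(window, ["w", "t"])
print(consecutive)
print(sparse)
```

`itertools.combinations` yields selections in input order, which is what keeps the argument order intact for the sparse case.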


Example: POS tagging

• Useful features for POS tagging:
– The preceding word is tagged c.
– The following word is tagged c.
– The word two before is tagged c.
– The word two after is tagged c.
– The preceding word is tagged c and the following word is tagged t.
– The preceding word is tagged c and the word two before is tagged t.
– The following word is tagged c and the word two after is tagged t.
– The current word is w.
– The most probable part of speech for the current word is c.

• Given the sentence:
– (t1 The) (t2 dog) (t3 ran) (t4 very) (t5 quickly)

• The following Fex script will produce the features from the previous slide:

-1: lab(t)
-1 loc: t [-2,2]
-1: coloc(t,t,t) [-2,2]
-1 inc: w [0,0]
-1: base [0,0]

• To do POS tagging, an example needs to be generated for each word in the observation.

• For the third word, “ran”, the script produces the following output:

Script                     Lexicon Output
-1: lab(t)                 1 lab[t3]
-1 loc: t [-2,2]           10001 t[t1_*]
                           10002 t[t2*]
                           10003 t[*t4]
                           10004 t[*_t5]
-1: coloc(t,t,t) [-2,2]    10005 t[t1]-t[t2]-*
                           10006 t[t2]-*-t[t4]
                           10007 *-t[t4]-t[t5]
-1 inc: w [0,0]            10008 w[ran]
-1: base [0,0]             10009 base[V]

• And an example in the example file:
– 1, 10001, 10002, 10003, 10004, 10005, 10006, 10007, 10008, 10009:
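
The three coloc(t,t,t) features for “ran” can be reproduced with a few lines. This is a sketch, assuming that without `inc` the target position is rendered as the placeholder `*`, which matches the lexicon output shown above; the function name `coloc_tags` is hypothetical.

```python
def coloc_tags(tags, target, left, right, arity=3):
    """Consecutive tag collocations in [left, right], target shown as '*'."""
    window = []
    for pos in range(target + left, target + right + 1):
        if not 0 <= pos < len(tags):
            continue  # scope runs past the sentence boundary
        window.append("*" if pos == target else f"t[{tags[pos]}]")
    return ["-".join(window[i:i + arity])
            for i in range(len(window) - arity + 1)]


tags = ["t1", "t2", "t3", "t4", "t5"]   # The dog ran very quickly
features = coloc_tags(tags, target=2, left=-2, right=2)
for feature in features:
    print(feature)
```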

Input Formats

• Fex can presently accept data in the following formats:

– w1 w2 w3 w4 …

– (t1 w1) (t2 w2) (t3 w3) (t4 w4) …

– w1 (t2 w2) (t3 t3a; w3) (t4; w4 w4a) …
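
A reader for these formats might look like the sketch below. The interpretation is an assumption, not something the slides state: a bare token is taken as a word with no tags, `(t w)` as one tag followed by its word(s), and a semicolon inside the parentheses as the separator between a list of tags and a list of words. The function name `parse_records` is hypothetical.

```python
import re


def parse_records(line):
    """Split one input line into records with 'tags' and 'words' lists."""
    records = []
    # Either a parenthesised group or a bare whitespace-delimited token.
    for group, bare in re.findall(r"\(([^)]*)\)|(\S+)", line):
        if bare:
            records.append({"tags": [], "words": [bare]})
            continue
        if ";" in group:
            tag_part, word_part = group.split(";", 1)
            tags, words = tag_part.split(), word_part.split()
        else:
            tokens = group.split()
            tags, words = tokens[:1], tokens[1:]
        records.append({"tags": tags, "words": words})
    return records


for record in parse_records("w1 (t2 w2) (t3 t3a; w3) (t4; w4 w4a)"):
    print(record)
```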

Using Fex (command line)

fex [options] script-file lexicon-file corpus-file example-file

Options:

• -t: target file
– The target file must not contain any empty lines.
– Each target goes on a separate line.

• -r: test mode
– Does not create new features.

• -h, -I
– Creates a histogram of active features.

Using Fex (command line)

• Target file = targ:

dog
cat

• Script file = script:

-1 : lab
-1 : w [-1,-1]
-1 : t [-1,-1]

• Corpus file = corpus:

(DET The) (NN dog) (V is) (JJ mad)

• Lexicon file = lexicon

• Example file = example

fex -t targ script lexicon corpus example
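
The setup above can be prepared from a script. The sketch below only writes the three input files and assembles the argument list in the order given on the slide; it does not run fex, and the temporary directory and variable names are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())

targets = ["dog", "cat"]
script_lines = ["-1 : lab", "-1 : w [-1,-1]", "-1 : t [-1,-1]"]
corpus_lines = ["(DET The) (NN dog) (V is) (JJ mad)"]

# The target file must hold one target per line with no empty lines.
assert all(target.strip() for target in targets)

(workdir / "targ").write_text("\n".join(targets) + "\n")
(workdir / "script").write_text("\n".join(script_lines) + "\n")
(workdir / "corpus").write_text("\n".join(corpus_lines) + "\n")

# fex [options] script-file lexicon-file corpus-file example-file
command = ["fex", "-t", "targ", "script", "lexicon", "corpus", "example"]
print(" ".join(command))

# With fex installed, the command could then be run from workdir, e.g.:
# subprocess.run(command, cwd=workdir, check=True)
```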

SNoW

Word representation

[Figure: network diagram with weighted links; node labels include join, say, as, will, the, NOUN_, VERB-to_modal]

Restrictions on the learning approach

• Multi-class

• Variable number of features
– per class
– per example

• Efficient learning

• Efficient evaluation

SNoW

• Network of threshold gates

• Target nodes represent class labels

• Input nodes (features) and links are allocated in a data-driven way (on the order of 10^5 input features for many target nodes)

• Each sub-network (target node) is learned autonomously as a function of the features

• A presented example is positive for one network and negative for the others (depends on the algorithm)

• Allocation of nodes (features) and links is data-driven (a link between feature fi and target tj is created only when fi was active with target tj)
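
The data-driven allocation can be sketched with a small sparse network. This is an illustration, not SNoW's code: the class name `SparseWinnowNetwork` is hypothetical, the update is a textbook Winnow-style multiplicative promotion and demotion, and the parameter values (`alpha`, `beta`, `threshold`, `init_weight`) are chosen for the toy data rather than taken from SNoW.

```python
from collections import defaultdict


class SparseWinnowNetwork:
    """One threshold gate per target; links exist only where they were seen."""

    def __init__(self, alpha=1.5, beta=0.5, threshold=1.0, init_weight=1.0):
        self.alpha, self.beta = alpha, beta
        self.threshold, self.init_weight = threshold, init_weight
        self.weights = defaultdict(dict)   # target -> {feature: weight}

    def activation(self, target, features):
        links = self.weights[target]
        return sum(links[f] for f in features if f in links)

    def train(self, label, features):
        # Allocate a link only between the active features and their label.
        links = self.weights[label]
        for f in features:
            links.setdefault(f, self.init_weight)
        # Each target node is updated independently of the others.
        for target, links in self.weights.items():
            active = [f for f in features if f in links]
            predicted = sum(links[f] for f in active) >= self.threshold
            if target == label and not predicted:
                for f in active:
                    links[f] *= self.alpha      # promotion
            elif target != label and predicted:
                for f in active:
                    links[f] *= self.beta       # demotion

    def predict(self, features):
        # Choose the single target with the highest activation.
        return max(self.weights, key=lambda tgt: self.activation(tgt, features))


net = SparseWinnowNetwork()
examples = [(6, [10034, 10141]), (177, [10034, 10035])]
for _ in range(3):
    for label, features in examples:
        net.train(label, features)

print(net.predict([10034, 10141]), net.predict([10034, 10035]))
```

Feature 10035 never occurs with target 6, so no link between them is ever created; the shared feature 10034 is demoted on both targets until it no longer fires either gate on its own.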


Word prediction using SNoW

• Target nodes: each word in the set of candidate words is a target node

• Input nodes: an input node for feature fi is allocated only if that feature fi was active with any target

• Decision task: we need to choose one target among all possible candidates


SNoW (Command line)

snow -train -I inputfile -F networkfile [-ABcdePrsTvW]

snow -test -I inputfile -F networkfile [-bEloRvw]

Architecture

Winnow: -W [, , , init weight] :targets

Perceptron: -P [, , init weight] :targets

NB: -B :targets

SNoW parameters (training)

-d <none | abs:<k> | rel > : discarding method

-e <i> : eligibility threshold

-r <i> : number of cycles

output modes

-c <i> : interval for network snapshot

-v < off | min | med | max > : details for the output to the screen

SNoW parameters (testing)

-b <k> : smoothing for NB

-w <k> : smoothing for W, P

output modes

-E : error file

-o < accuracy | winners | allpred | allact | allboth > : details for the output

-R : results file (stdout)

File Format (Example file)

6, 10034, 10141, 10151, 10158, 10179:

177, 10034, 10035, 10047:

With weights:

6, 10034(1), 10141(1.5), 10151(0.4), 10158(2), 10179(0.1):

177, 10034(2), 10035(4), 10047(0.6):

Only active features appear in an example!
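
Both variants of the example-file line can be read with one small function. This is a sketch based only on the lines shown above: it assumes the first ID is the label, that a feature without a parenthesised weight has strength 1.0, and that each example ends with a colon. The function name `parse_example` is hypothetical.

```python
import re


def parse_example(line):
    """Return (label, {feature_id: weight}) for one example-file line."""
    body = line.strip().rstrip(":")
    tokens = [tok.strip() for tok in body.split(",") if tok.strip()]
    label = int(tokens[0])
    features = {}
    for tok in tokens[1:]:
        match = re.fullmatch(r"(\d+)(?:\(([^)]+)\))?", tok)
        if match is None:
            raise ValueError(f"unexpected token: {tok!r}")
        weight = float(match.group(2)) if match.group(2) else 1.0
        features[int(match.group(1))] = weight
    return label, features


print(parse_example("177, 10034, 10035, 10047:"))
print(parse_example("177, 10034(2), 10035(4), 10047(0.6):"))
```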

File Format (Network file)

NB

target 111 0 1 135 1 naivebayes 0 0.1 0.5

111 : 0 : 10020 : 4 0 -3.518980417

111 : 0 : 10021 : 1 0 -4.905274778

Winnow

target 111 1 1 135 1562 winnow 0 1.1 0.9 15 1

111 : 0 : 10020 : 4 1 1.1

111 : 0 : 10021 : 1 0 1

Perceptron

target 111 2 1 270 1 perceptron 0 0.1 4 0.2

111 : 0 : 10020 : 4 1 0.3

111 : 0 : 10021 : 1 0 0.2

File Format (Error file)

Algorithms:
Perceptron: (1, 30, 0.05)
Targets: 3, 53, 73

Ex: 8 Prediction: 3 Label: 53
3: 0.5866
53: 0.2592*
73: 0.1192

Ex: 15 Prediction: 3 Label: 73
3: 0.5987
73: 0.001229*
53: 0.0002248