Fast Methods for Kernel-based Text Analysis

1

Fast Methods for Kernel-based Text Analysis

Taku Kudo 工藤　拓
Yuji Matsumoto 松本　裕治
NAIST (Nara Institute of Science and Technology)

41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan

2

Background

• Kernel methods (e.g., SVMs) have become popular
• Prior knowledge can be incorporated independently of the machine learning algorithm by supplying a task-dependent kernel (a generalized dot product)
• High accuracy

3

Problem

• Kernel-based text analyzers are too slow for real NL applications (e.g., QA or text mining) because classification (testing) is inefficient
• Some kernel-based parsers run at only 2-3 seconds per sentence

4

Goals

• Build fast but still accurate kernel-based text analyzers
• Make them usable in a wider range of NL applications

5

Outline

• Polynomial Kernel of degree d
• Fast Methods for Polynomial Kernels
  • PKI
  • PKE
• Experiments
• Conclusions and Future Work

6

Outline

• Polynomial Kernel of degree d
• Fast Methods for Polynomial Kernels
  • PKI
  • PKE


7

Kernel Methods

• No need to represent an example as an explicit feature vector
• Complexity of testing is O(L · |X|)

Training data: $T = \{X_1, X_2, \ldots, X_L\}$

$f(X) = \sum_{i=1}^{L} \alpha_i \, \phi(X_i) \cdot \phi(X) = \sum_{i=1}^{L} \alpha_i K(X_i, X)$

8

Kernels for Sets (1/3)

• Focus on the special case where examples are represented as sets
• Instances in NLP are usually represented as sets (e.g., bag-of-words)

Feature set: $F = \{i_1, i_2, \ldots, i_N\}$
Training data: $T = \{X_1, X_2, \ldots, X_L\}$, $X_j \subseteq F$

9

Kernels for Sets (2/3)

$X_1 = \{a, b, c, d\}$, $X_2 = \{a, b, d, e\}$

Simple definition: $K(X_1, X_2) = |X_1 \cap X_2| = |\{a, b, d\}| = 3$

Combinations (subsets) of features:
• 2nd order: $\{\{a, b\}, \{a, d\}, \{b, d\}\}$
• 3rd order: $\{\{a, b, d\}\}$

10

Kernels for Sets (3/3)

Sentence: I ate a cake
POS tags: PRP VBD DT NN
Head: ate, modifier: cake. Dependent (+1) or independent (-1)?

Basic features:
X = {Head-word: ate, Head-POS: VBD, Modifier-word: cake, Modifier-POS: NN}

After heuristic selection of combinations:
X = {Head-word: ate, Head-POS: VBD, Modifier-word: cake, Modifier-POS: NN, Head-POS/Modifier-POS: VBD/NN, Head-word/Modifier-POS: ate/NN, …}

• Subsets (combinations) of basic features are critical for improving overall accuracy in many NL tasks
• Previous approaches select combinations heuristically

11

Polynomial Kernel of degree d

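Implicit form:

$K_d(X_1, X_2) = (|X_1 \cap X_2| + 1)^d, \quad d \in \{0, 1, 2, \ldots\}$

Explicit form:

$K_d(X_1, X_2) = \sum_{r=0}^{d} c_d(r) \cdot |P_r(X_1 \cap X_2)|$

• $P_r(X)$ is the set of all subsets of $X$ with exactly $r$ elements
• $c_d(r)$ is the prior weight given to subsets of size $r$ (subset weight):

$c_d(r) = \sum_{l=r}^{d} \binom{d}{l} \left( \sum_{m=0}^{r} (-1)^{r-m} \, m^{l} \binom{r}{m} \right)$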

12

Example (Cubic Kernel d=3 )

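$X_1 = \{a, b, c, d\}$, $X_2 = \{a, b, d, e\}$

Implicit form:

$K_3(X_1, X_2) = (|X_1 \cap X_2| + 1)^3 = (3 + 1)^3 = 64$

Explicit form:

$c_3(0) = 1, \quad P_0(X_1 \cap X_2) = \{\emptyset\}$
$c_3(1) = 7, \quad P_1(X_1 \cap X_2) = \{\{a\}, \{b\}, \{d\}\}$
$c_3(2) = 12, \quad P_2(X_1 \cap X_2) = \{\{a, b\}, \{a, d\}, \{b, d\}\}$
$c_3(3) = 6, \quad P_3(X_1 \cap X_2) = \{\{a, b, d\}\}$

$K_3(X_1, X_2) = 1 \cdot 1 + 7 \cdot 3 + 12 \cdot 3 + 6 \cdot 1 = 64$

Subsets of up to 3 features are used as new features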
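The equivalence of the implicit and explicit forms can be checked numerically. The following is a minimal Python sketch, not code from the talk; the function names `subset_weight`, `kernel_implicit` and `kernel_explicit` are illustrative, and the computation of c_d(0) relies on the convention 0^0 = 1, which Python's integer power follows.

```python
from itertools import combinations
from math import comb


def subset_weight(d, r):
    """c_d(r): weight the degree-d polynomial kernel gives to subsets of size r."""
    return sum(
        comb(d, l) * sum((-1) ** (r - m) * m ** l * comb(r, m) for m in range(r + 1))
        for l in range(r, d + 1)
    )


def kernel_implicit(x1, x2, d):
    """(|X1 ∩ X2| + 1)^d"""
    return (len(x1 & x2) + 1) ** d


def kernel_explicit(x1, x2, d):
    """Sum over r of c_d(r) * |P_r(X1 ∩ X2)|, enumerating the subsets explicitly."""
    common = sorted(x1 & x2)
    return sum(
        subset_weight(d, r) * len(list(combinations(common, r)))
        for r in range(d + 1)
    )


X1 = {"a", "b", "c", "d"}
X2 = {"a", "b", "d", "e"}

weights = [subset_weight(3, r) for r in range(4)]
print(weights)                      # [1, 7, 12, 6]
print(kernel_implicit(X1, X2, 3))   # 64
print(kernel_explicit(X1, X2, 3))   # 64
```

Enumerating the subsets is of course slower than the closed form; it is done here only to make the correspondence with the slide visible.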

13

Outline



14

Toy Example

Feature set: F = {a, b, c, d, e}

Examples (#SVs L = 3):

j   X_j          α_j
1   {a, b, c}    1
2   {a, b, d}    0.5
3   {b, c, d}    -2

Kernel: $K_3(X_1, X_2) = (|X_1 \cap X_2| + 1)^3$

Test example: X = {a, c, e}

15

PKB (Baseline)

$K(X, X') = (|X \cap X'| + 1)^3$

j   X_j          α_j    K(X_j, X)
1   {a, b, c}    1      (2+1)^3
2   {a, b, d}    0.5    (1+1)^3
3   {b, c, d}    -2     (1+1)^3

Test example: X = {a, c, e}

f(X) = 1 · (2+1)^3 + 0.5 · (1+1)^3 - 2 · (1+1)^3 = 15

Complexity is always O(L · |X|)
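The baseline can be written in a few lines. This is a sketch on the toy data only; the name `classify_pkb` and the list-of-pairs layout for the support vectors are assumptions made for illustration, not the authors' implementation.

```python
# Toy support vectors: (feature set X_j, weight alpha_j)
SUPPORT_VECTORS = [
    ({"a", "b", "c"}, 1.0),
    ({"a", "b", "d"}, 0.5),
    ({"b", "c", "d"}, -2.0),
]


def classify_pkb(x, support_vectors, d=3):
    """Baseline: evaluate the kernel against every support vector."""
    return sum(alpha * (len(x & xj) + 1) ** d for xj, alpha in support_vectors)


score = classify_pkb({"a", "c", "e"}, SUPPORT_VECTORS)
print(score)  # 15.0
```

Every support vector is visited for every test example, which is where the O(L · |X|) cost comes from.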

16

PKI (Inverted Representation)

$K(X, X') = (|X \cap X'| + 1)^3$

j   X_j          α_j
1   {a, b, c}    1
2   {a, b, d}    0.5
3   {b, c, d}    -2

Inverted index:

feature   SVs containing it
a         {1, 2}
b         {1, 2, 3}
c         {1, 3}
d         {2, 3}

Test example: X = {a, c, e}

f(X) = 1 · (2+1)^3 + 0.5 · (1+1)^3 - 2 · (1+1)^3 = 15

• Average complexity is O(B · |X| + L), where B is the average size of an inverted-index entry
• Efficient if the feature space is sparse
• Suitable for many NL tasks
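The inverted representation can be sketched as follows. The helper names `build_inverted_index` and `classify_pki` are hypothetical, and the code uses 0-based support-vector indices where the slide uses 1-based ones.

```python
from collections import defaultdict

SUPPORT_VECTORS = [
    ({"a", "b", "c"}, 1.0),
    ({"a", "b", "d"}, 0.5),
    ({"b", "c", "d"}, -2.0),
]


def build_inverted_index(support_vectors):
    """Map each feature to the indices of the support vectors that contain it."""
    index = defaultdict(list)
    for j, (xj, _alpha) in enumerate(support_vectors):
        for feature in xj:
            index[feature].append(j)
    return index


def classify_pki(x, support_vectors, index, d=3):
    """Accumulate |X ∩ X_j| through the index, then apply the kernel once per SV."""
    overlap = [0] * len(support_vectors)
    for feature in x:
        for j in index.get(feature, ()):  # unseen features (e.g. 'e') cost nothing
            overlap[j] += 1
    return sum(
        alpha * (overlap[j] + 1) ** d
        for j, (_xj, alpha) in enumerate(support_vectors)
    )


index = build_inverted_index(SUPPORT_VECTORS)
score = classify_pki({"a", "c", "e"}, SUPPORT_VECTORS, index)
print(sorted(index["b"]))  # [0, 1, 2]
print(score)               # 15.0
```

The intersection sizes are obtained by touching only the index entries of the features present in X (about B · |X| steps), and the final sum over the L support vectors gives the extra + L term.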

17

PKE (Expanded Representation)

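$f(X) = \sum_{i=1}^{L} \alpha_i K(X_i, X) = \sum_{i=1}^{L} \alpha_i \, \phi(X_i) \cdot \phi(X) = w \cdot \phi(X), \quad w = \sum_{i=1}^{L} \alpha_i \, \phi(X_i)$

• Convert the classifier into linear form by calculating the vector $w$ in advance
• $\phi(X)$ projects $X$ into its subset space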

18


$K(X, X') = (|X \cap X'| + 1)^3$

c_3(0) = 1, c_3(1) = 7, c_3(2) = 12, c_3(3) = 6

j   X_j          α_j
1   {a, b, c}    1
2   {a, b, d}    0.5
3   {b, c, d}    -2

W (Expansion Table):

s            c     w
∅            1     -0.5
{a}          7     10.5
{b}          7     -3.5
{c}          7     -7
{d}          7     -10.5
{a, b}       12    18
{a, c}       12    12
{a, d}       12    6
{b, c}       12    -12
{b, d}       12    -18
{c, d}       12    -24
{a, b, c}    6     6
{a, b, d}    6     3
{a, c, d}    6     0
{b, c, d}    6     -12

Example entry: w({b, d}) = 12 · (0.5 - 2) = -18

Test example: X = {a, c, e}
Subsets of X: {∅, {a}, {c}, {e}, {a, c}, {a, e}, {c, e}, {a, c, e}}

f(X) = -0.5 + 10.5 - 7 + 12 = 15

• Complexity is O(|X|^d), independent of the number of SVs (L)
• Efficient if the number of SVs is large
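The expansion table for the toy example can be reproduced with a short script. This is an illustrative sketch that builds the table exhaustively, which is only feasible for tiny examples; `build_expansion_table`, `classify_pke` and the tuple-keyed dictionary are assumptions of the sketch rather than the authors' data structures.

```python
from itertools import combinations
from math import comb

SUPPORT_VECTORS = [
    ({"a", "b", "c"}, 1.0),
    ({"a", "b", "d"}, 0.5),
    ({"b", "c", "d"}, -2.0),
]


def subset_weight(d, r):
    """c_d(r), the subset weight of the degree-d polynomial kernel."""
    return sum(
        comb(d, l) * sum((-1) ** (r - m) * m ** l * comb(r, m) for m in range(r + 1))
        for l in range(r, d + 1)
    )


def subsets_up_to(features, d):
    """All subsets of `features` with at most d elements, as sorted tuples."""
    items = sorted(features)
    for r in range(d + 1):
        yield from combinations(items, r)


def build_expansion_table(support_vectors, d=3):
    """w(s) = c_d(|s|) * (sum of alpha_j over the support vectors containing s)."""
    table = {}
    for xj, alpha in support_vectors:
        for s in subsets_up_to(xj, d):
            table[s] = table.get(s, 0.0) + alpha * subset_weight(d, len(s))
    return table


def classify_pke(x, table, d=3):
    """Sum the stored weights of every subset of x; no support vector is visited."""
    return sum(table.get(s, 0.0) for s in subsets_up_to(x, d))


table = build_expansion_table(SUPPORT_VECTORS)
score = classify_pke({"a", "c", "e"}, table)
print(table[("b", "d")])  # -18.0
print(score)              # 15.0
```

Subsets that occur in no support vector, such as {a, c, d}, are simply absent from the dictionary and contribute 0 at classification time.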

19

PKE in Practice

• Hard to calculate the Expansion Table exactly
• Use an approximated Expansion Table
• Subsets with smaller |w| can be removed, since |w| represents the contribution to the final classification
• Use a subset mining (a.k.a. basket mining) algorithm for efficient calculation

20

Subset Mining Problem

Transaction database:

id   set
1    {a c d}
2    {a b c}
3    {a b d}
4    {b c e}

Results (σ = 2): {a}:3 {b}:3 {c}:3 {d}:2 {a b}:2 {b c}:2 {a c}:2 {a d}:2

• Extract all subsets that occur in no fewer than σ sets of the transaction database
• σ = 1 and no size constraints → NP-hard
• Efficient algorithms have been proposed (e.g., Apriori, PrefixSpan)
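The mining problem on the four-transaction database can be illustrated with a small level-wise search in the spirit of Apriori. It is a simplified sketch: it omits the candidate-pruning optimizations of the published algorithms, and `mine_frequent_subsets` and `min_support` (standing for σ) are names chosen here.

```python
def mine_frequent_subsets(transactions, min_support):
    """Level-wise search for subsets that occur in >= min_support transactions."""

    def support(candidate):
        return sum(1 for t in transactions if candidate <= t)

    items = sorted(set().union(*transactions))
    level = {frozenset([item]) for item in items}  # candidates of size 1
    frequent = {}
    while level:
        kept = {c: n for c in level if (n := support(c)) >= min_support}
        frequent.update(kept)
        # Join step: combine frequent k-subsets that differ in exactly one item.
        # A superset of an infrequent subset cannot be frequent, so only the
        # subsets kept at this level are extended.
        level = {a | b for a in kept for b in kept if len(a | b) == len(a) + 1}
    return frequent


TRANSACTIONS = [
    {"a", "c", "d"},
    {"a", "b", "c"},
    {"a", "b", "d"},
    {"b", "c", "e"},
]

result = mine_frequent_subsets(TRANSACTIONS, min_support=2)
for subset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(subset), count)
```

With `min_support=2` the search stops after size 2, because every size-3 candidate occurs in only one transaction.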

21

Feature Selection as Mining

• Can efficiently build the approximated table
• σ controls the rate of approximation

j   X_j          α_j
1   {a, b, c}    1
2   {a, b, d}    0.5
3   {b, c, d}    -2

Exhaustive generation and testing → Impractical!

W (full table):
∅: -0.5, {a}: 10.5, {b}: -3.5, {c}: -7, {d}: -10.5, {a,b}: 18, {a,c}: 12, {a,d}: 6, {b,c}: -12, {b,d}: -18, {c,d}: -24, {a,b,c}: 6, {a,b,d}: 3, {a,c,d}: 0, {b,c,d}: -12

Direct generation with subset mining (σ = 10):

s            w
{a}          10.5
{d}          -10.5
{a, b}       18
{a, c}       12
{b, c}       -12
{b, d}       -18
{c, d}       -24
{b, c, d}    -12
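The effect of the threshold σ can be seen by filtering the exact table of the toy example. Note that this sketch does exactly what the slide calls impractical at scale (generate everything, then test); it only shows which entries survive and how the score changes. The names `prune` and `classify` are illustrative, and w({a, b}) is taken as 12 · (1 + 0.5) = 18, following the expansion formula.

```python
from itertools import combinations

# Exact expansion table of the toy example (cubic kernel, three support vectors).
EXACT_TABLE = {
    (): -0.5,
    ("a",): 10.5, ("b",): -3.5, ("c",): -7.0, ("d",): -10.5,
    ("a", "b"): 18.0, ("a", "c"): 12.0, ("a", "d"): 6.0,
    ("b", "c"): -12.0, ("b", "d"): -18.0, ("c", "d"): -24.0,
    ("a", "b", "c"): 6.0, ("a", "b", "d"): 3.0,
    ("a", "c", "d"): 0.0, ("b", "c", "d"): -12.0,
}


def prune(table, sigma):
    """Keep only the subsets whose |w| is at least sigma."""
    return {s: w for s, w in table.items() if abs(w) >= sigma}


def classify(x, table, d=3):
    """Sum the weights of all subsets of x (up to size d) found in the table."""
    items = sorted(x)
    return sum(
        table.get(s, 0.0)
        for r in range(d + 1)
        for s in combinations(items, r)
    )


approx = prune(EXACT_TABLE, sigma=10)
exact_score = classify({"a", "c", "e"}, EXACT_TABLE)
approx_score = classify({"a", "c", "e"}, approx)
print(len(approx))   # 8
print(exact_score)   # 15.0
print(approx_score)  # 22.5
```

On this toy input the approximated score differs from the exact one but keeps the same sign, so the predicted class is unchanged.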

22

Outline



23

Experimental Settings

• Three NL tasks
  • English Base-NP Chunking (EBC)
  • Japanese Word Segmentation (JWS)
  • Japanese Dependency Parsing (JDP)
• Kernel settings
  • Quadratic kernel is applied to EBC
  • Cubic kernel is applied to JWS and JDP

24

Results (English Base-NP Chunking)

Method           Time (sec./sent.)   Speedup ratio   F-score
PKB              .164                1.0             93.84
PKI              .020                8.3             93.84
PKE (σ=.01)      .0016               105.2           93.79
PKE (σ=.005)     .0016               101.3           93.85
PKE (σ=.001)     .0017               97.7            93.84
PKE (σ=.0005)    .0017               96.8            93.84

25

Results (Japanese Word Segmentation)

Method           Time (sec./sent.)   Speedup ratio   Accuracy (%)
PKB              .85                 1.0             97.94
PKI              .49                 1.7             97.94
PKE (σ=.01)      .0024               358.2           97.93
PKE (σ=.005)     .0028               300.1           97.95
PKE (σ=.001)     .0034               242.6           97.94
PKE (σ=.0005)    .0035               238.8           97.94

26

Results (Japanese Dependency Parsing)

Method           Time (sec./sent.)   Speedup ratio   Accuracy (%)
PKB              .285                1.0             89.29
PKI              .0226               12.6            89.29
PKE (σ=.01)      .0042               66.8            88.91
PKE (σ=.005)     .0060               47.8            89.05
PKE (σ=.001)     .0086               33.3            89.26
PKE (σ=.0005)    .0090               31.8            89.29

27

Results

• 2-12 fold speedup with PKI
• 30-300 fold speedup with PKE
• Accuracy is preserved when an appropriate σ is set

28

Comparison with related work

• XQK [Isozaki et al. 02]
  • Same concept as PKE
  • Designed only for the quadratic kernel
  • Exhaustively creates the expansion table
• PKE
  • Designed for general polynomial kernels
  • Uses subset mining algorithms to create the expansion table

29

Conclusions

• Proposed two fast methods for the polynomial kernel of degree d
  • PKI (Inverted)
  • PKE (Expanded)
• 2-12 fold speedup with PKI, 30-300 fold speedup with PKE
• Accuracy is preserved

30

Future Work

• Examine the effectiveness on general machine learning datasets
• Apply PKE to other convolution kernels
  • Tree Kernel [Collins 00]
    • Dot product between trees
    • Feature space is all sub-trees
    • Apply a sub-tree mining algorithm [Zaki 02]

31

English Base-NP Chunking

• Extract non-overlapping noun phrases from text

[NP He ] reckons [NP the current account deficit ] will narrow to [NP only # 1.8 billion ] in [NP September ] .

• BIO representation (treating chunking as a tagging task)
  • B: beginning of a chunk
  • I: non-initial word of a chunk
  • O: outside any chunk
• Pair-wise method for the 3-class problem
• Training: wsj15-18, test: wsj20 (standard set)
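The BIO encoding of the bracketed example can be sketched as below. The parser assumes the whitespace-separated `[NP ... ]` bracket format shown on the slide and the tag definitions given there (B on the first word of every chunk); `chunks_to_bio` is a hypothetical helper name.

```python
def chunks_to_bio(bracketed):
    """Convert '[NP w1 w2 ] w3 ...' into a list of (token, tag) pairs."""
    tagged = []
    inside = False  # currently inside an [NP ... ] chunk
    first = False   # the next token is the first word of the chunk
    for token in bracketed.split():
        if token == "[NP":
            inside, first = True, True
        elif token == "]":
            inside = False
        elif not inside:
            tagged.append((token, "O"))
        elif first:
            tagged.append((token, "B"))
            first = False
        else:
            tagged.append((token, "I"))
    return tagged


SENTENCE = (
    "[NP He ] reckons [NP the current account deficit ] will narrow to "
    "[NP only # 1.8 billion ] in [NP September ] ."
)

tagged = chunks_to_bio(SENTENCE)
print(" ".join(f"{token}/{tag}" for token, tag in tagged))
```

Each token then becomes one three-way classification decision (B, I or O), which is the 3-class problem handled by the pair-wise method.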

32

Japanese Word Segmentation

Sentence: 太郎は花子に本を読ませた (Taro made Hanako read a book)
Boundaries: 太郎 | は | 花子 | に | 本 | を | 読ま | せ | た

$X_i = \{c_{i-2}, c_{i-1}, c_i, c_{i+1}, c_{i+2}, c_{i+3}\}$

If there is a boundary between $c_i$ and $c_{i+1}$, $Y_i = +1$; otherwise $Y_i = -1$

• Distinguish the relative positions of the characters
• Also use the character types of Japanese
• Training: KUC 01-08, Test: KUC 09

33

Japanese Dependency Parsing

私は　ケーキを　食べる
I-top  cake-acc.  eat
(I eat a cake)

• Identify the correct dependency relations between two bunsetsu (roughly, base phrases in English)
• Linguistic features related to the modifier and head (word, POS, POS-subcategory, inflections, punctuation, etc.)
• Binary classification (+1 dependent, -1 independent)
• Cascaded Chunking Model [Kudo et al. 02]
• Training: KUC 01-08, Test: KUC 09

34

Kernel Methods (1/2)

Suppose a learning task $g: X \to \{+1, -1\}$ with training examples $T = \{X_1, \ldots, X_L\}$

$g(X) = \mathrm{sgn}(f(X))$

$f(X) = \sum_{i=1}^{L} \alpha_i \, \phi(X_i) \cdot \phi(X)$

• $X$: example to be classified
• $X_i$: training examples
• $\alpha_i$: weight for examples
• $\phi$: a function that maps examples to another vector space

35


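$f(X) = \sum_{i=1}^{L} \alpha_i \sum_{r=0}^{d} c_d(r) \cdot |P_r(X_i \cap X)|$

If we calculate in advance, for all subsets $s \in \Gamma_d(F) = \bigcup_{r=0}^{d} P_r(F)$,

$w(s) = \sum_{i=1}^{L} \alpha_i \, c_d(|s|) \, I(s \in P_{|s|}(X_i))$

($I$ is the indicator function), then

$f(X) = \sum_{s \in \Gamma_d(X)} w(s), \quad \Gamma_d(X) = \bigcup_{r=0}^{d} P_r(X)$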

36

TRIE representation

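s            w
{a}          10.5
{d}          -10.5
{a, b}       18
{a, c}       12
{b, c}       -12
{b, d}       -18
{c, d}       -24
{b, c, d}    -12

TRIE (each node stores the weight w of the subset spelled by the path from the root):

root
  a (10.5)
    b (18)
    c (12)
  b
    c (-12)
      d (-12)
    d (-18)
  c
    d (-24)
  d (-10.5)

• Compresses redundant structures
• Classification can be done by simply traversing the TRIE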
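A trie over the approximated table and the traversal used for classification might look like the sketch below. The nested-dictionary node layout and the names `build_trie` and `classify_with_trie` are assumptions made for illustration, and w({a, b}) is taken as 12 · (1 + 0.5) = 18, following the expansion formula.

```python
# Approximated expansion table of the toy example (sigma = 10).
APPROX_TABLE = {
    ("a",): 10.5, ("d",): -10.5,
    ("a", "b"): 18.0, ("a", "c"): 12.0,
    ("b", "c"): -12.0, ("b", "d"): -18.0, ("c", "d"): -24.0,
    ("b", "c", "d"): -12.0,
}


def build_trie(table):
    """Insert every subset as a path of sorted features; the last node holds w."""
    root = {"weight": 0.0, "children": {}}
    for subset, weight in table.items():
        node = root
        for feature in sorted(subset):
            node = node["children"].setdefault(
                feature, {"weight": 0.0, "children": {}}
            )
        node["weight"] = weight
    return root


def classify_with_trie(x, root):
    """Walk every path whose features all occur in x, adding up the node weights."""
    items = sorted(x)
    total = 0.0
    stack = [(root, 0)]  # (node, position in `items` to continue from)
    while stack:
        node, start = stack.pop()
        total += node["weight"]
        for k in range(start, len(items)):
            child = node["children"].get(items[k])
            if child is not None:
                stack.append((child, k + 1))
    return total


trie = build_trie(APPROX_TABLE)
print(classify_with_trie({"a", "c", "e"}, trie))  # 22.5
print(classify_with_trie({"b", "c", "d"}, trie))  # -76.5
```

Only paths that exist in the trie are followed, so subsets of X that were pruned from the table (or never occurred) are skipped without being enumerated. Interior nodes that are not themselves table entries carry weight 0.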

