[Page 1]
Text Classification Using String Kernels
Presented by Dibyendu Nath & Divya Sambasivan
CS 290D : Spring 2014
Huma Lodhi, Craig Saunders, et al., Department of Computer Science, Royal Holloway, University of London
[Page 2]
Intro: Text Classification
• Task of assigning a document to one or more categories.
• Done manually (library science) or algorithmically (information science, data mining, machine learning).
• Learning systems (e.g. neural networks or decision trees) work on feature vectors transformed from the input space.
• Text documents cannot readily be described by explicit feature vectors.
[Page 3]
Problem Definition
• Input: a corpus of documents.
• Output: a kernel representing the documents. This kernel can then be used to classify, cluster, etc. with existing algorithms that work on kernels, e.g. SVM, perceptron.
• Methodology: find a mapping and a kernel function so that any of the standard kernel methods for classification, clustering, etc. can be applied to the corpus of documents.
[Page 4]
Overview
• Motivation
• Kernel Methods
• Algorithms - with increasingly better efficiency
• Approximation
• Evaluation
• Follow Up
• Conclusion
[Page 5]
Overview
• Motivation
• Kernel Methods
• Algorithms - with increasingly better efficiency
• Approximation
• Evaluation
• Follow Up
• Conclusion
[Page 6]
Motivation
• Text documents cannot readily be described by explicit feature vectors.
• Feature Extraction - requires extensive domain knowledge; possible loss of important information.
• Kernel Methods – an alternative to explicit feature extraction
[Page 7]
Overview
• Motivation
• Kernel Methods
• Algorithms - with increasingly better efficiency
• Approximation
• Evaluation
• Follow Up
• Conclusion
[Page 8]
The Kernel Trick
• Map data into feature space via a mapping φ.
• The mapping is accessed implicitly via a kernel function.
• Construct a linear function in the feature space.
slide from Huma Lodhi
[Page 9]
Kernel Function
slide from Huma Lodhi
Kernel function: a measure of similarity that returns the inner product between mapped data points:
K(xᵢ, xⱼ) = ⟨Φ(xᵢ), Φ(xⱼ)⟩
[Page 10]
Kernels for Sequences
• Word Kernel [WK]: bag of words; a word is a sequence of characters followed by punctuation or a space.
• n-Grams Kernel [NGK]: all contiguous substrings of n characters. Example, the 3-grams of "quick brown": qui, uic, ick, ck_, k_b, _br, bro, row, own
• String Subsequence Kernel [SSK]: all (non-contiguous) subsequences of n symbols.
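The contiguous n-gram features above can be enumerated with a sliding window; a minimal sketch (the function name and the `_`-for-space convention are just for illustration):

```python
def ngrams(text, n):
    """All contiguous character n-grams of a string (spaces shown as '_')."""
    s = text.replace(" ", "_")
    return [s[i:i + n] for i in range(len(s) - n + 1)]

# 3-grams of "quick brown" -> qui, uic, ick, ck_, k_b, _br, bro, row, own
```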
[Page 11]
Word Kernels
• Documents are mapped to a very high-dimensional space, where the dimensionality of the feature space equals the number of unique words in the corpus.
• Each entry of the vector represents the occurrence or non-occurrence of a word.
• Kernel: the inner product between mapped sequences gives a sum over all common (weighted) words.

|       | fish | tank | sea |
|-------|------|------|-----|
| Doc 1 | 2    | 0    | 1   |
| Doc 2 | 1    | 1    | 0   |
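The word-kernel inner product can be sketched as follows, a minimal bag-of-words version that uses raw term counts as weights (`word_kernel` is a hypothetical name):

```python
from collections import Counter

def word_kernel(doc1, doc2):
    """Bag-of-words kernel: inner product of the term-count vectors."""
    c1, c2 = Counter(doc1.split()), Counter(doc2.split())
    return sum(c1[w] * c2[w] for w in c1 if w in c2)
```

For the table above, K(Doc 1, Doc 2) = 2·1 + 0·1 + 1·0 = 2.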
[Page 12]
String Subsequence Kernels: Basic Idea
Non-contiguous substrings, e.g. the subsequence "c-a-r":
• in "card": length of the occurrence = 3
• in "custard": length of the occurrence = 6
The more subsequences (of length n) two strings have in common, the more similar they are considered.
Decay factor: substrings are weighted according to their degree of contiguity in a string by a decay factor λ ∈ (0,1).
[Page 13]
Example (n = 2)
Documents we want to compare: "car" and "cat"

|     | c-a | c-t | a-t | c-r | a-r |
|-----|-----|-----|-----|-----|-----|
| car | λ²  | 0   | 0   | λ³  | λ²  |
| cat | λ²  | λ³  | λ²  | 0   | 0   |

K(car, car) = 2λ⁴ + λ⁶
K(cat, cat) = 2λ⁴ + λ⁶
K(car, cat) = λ⁴
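The table above can be reproduced by brute-force enumeration of all length-n subsequences, weighting each occurrence by λ raised to its span. This is exponential in |s| and for illustration only; `ssk_naive` and the `lam` parameter are hypothetical names:

```python
from itertools import combinations

def ssk_naive(s, t, n, lam):
    """String subsequence kernel by brute force: for every length-n index
    tuple, weight the occurrence by lam**(i_n - i_1 + 1), then take the
    inner product of the two feature maps."""
    def phi(x):
        feats = {}
        for idx in combinations(range(len(x)), n):
            u = "".join(x[i] for i in idx)
            span = idx[-1] - idx[0] + 1  # degree of (non-)contiguity
            feats[u] = feats.get(u, 0.0) + lam ** span
        return feats
    fs, ft = phi(s), phi(t)
    return sum(w * ft[u] for u, w in fs.items() if u in ft)
```

With λ = 0.5 this reproduces K(car, cat) = λ⁴ and K(car, car) = 2λ⁴ + λ⁶.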
[Page 14]
Overview
• Motivation
• Kernel Methods
• Algorithms - with increasingly better efficiency
• Approximation
• Evaluation
• Follow Up
• Conclusion
[Page 15]
Algorithm Definitions
• Alphabet: let Σ be the finite alphabet.
• String: a string s is a finite sequence of characters from the alphabet, with length |s|.
• Subsequence: a vector of indices i = [i₁, …, iₙ], sorted in ascending order, in a string s such that they pick out the letters of a sequence.
Eg: the subsequence "car" in "lancasters" has indices [4,5,9]; length of the subsequence l(i) = iₙ − i₁ + 1 = 9 − 4 + 1 = 6.
[Page 16]
Algorithm Definitions
• Feature space: indexed by all strings u ∈ Σⁿ.
• Feature mapping: the feature mapping φ for a string s is given by defining the u coordinate φᵤ(s) for each u ∈ Σⁿ:
φᵤ(s) = Σ_{i: u = s[i]} λ^{l(i)}
These features measure the number of occurrences of subsequences in the string s, weighting them according to their lengths.
[Page 17]
String Kernel
• The inner product between two mapped strings is a sum over all the common weighted subsequences:
Kₙ(s, t) = Σ_{u∈Σⁿ} φᵤ(s)·φᵤ(t) = Σ_{u∈Σⁿ} Σ_{i: u=s[i]} Σ_{j: u=t[j]} λ^{l(i)+l(j)}
For "car" and "cat" (n = 2; features c-a, c-t, a-t, c-r, a-r):
car: (λ², 0, 0, λ³, λ²); cat: (λ², λ³, λ², 0, 0)
K(car, cat) = λ⁴
[Page 18]
Intermediate Kernel K′
Instead of counting from the first to the last index of the subsequence, count the length from the beginning of the subsequence through to the end of the strings s and t.

|     | c-a | c-t | a-t | c-r | a-r |
|-----|-----|-----|-----|-----|-----|
| car | λ³  | 0   | 0   | λ³  | λ²  |
| cat | λ³  | λ³  | λ²  | 0   | 0   |
[Page 19]
Recursive Computation (Lodhi et al.)
• K′₀(s, t) = 1, for all s, t (null substring)
• Kᵢ(s, t) = K′ᵢ(s, t) = 0, if min(|s|, |t|) < i (target string is shorter than search substring)
• K′ᵢ(sx, t) = λ·K′ᵢ(s, t) + Σ_{j: t[j]=x} K′ᵢ₋₁(s, t[1..j−1])·λ^{|t|−j+2}, for i = 1, …, n−1
• Kₙ(sx, t) = Kₙ(s, t) + Σ_{j: t[j]=x} K′ₙ₋₁(s, t[1..j−1])·λ²
[Page 20]
K′ (n = 2) for s = car, t = cat:

|     | c-a | c-t | a-t | c-r | a-r |
|-----|-----|-----|-----|-----|-----|
| car | λ³  | 0   | 0   | λ³  | λ²  |
| cat | λ³  | λ³  | λ²  | 0   | 0   |

K′(car, cat) = λ⁶

Appending x = t to s (sx = cart), against t = cat:

|      | c-a | c-t | a-t | c-r | a-r |
|------|-----|-----|-----|-----|-----|
| cart | λ⁴  | λ⁴  | λ³  | λ⁴  | λ³  |
| cat  | λ³  | λ³  | λ²  | 0   | 0   |

K′(cart, cat) = λ⁷ + λ⁷ + λ⁵
[Page 21]
K (n = 2) for s = car, t = cat:

|     | c-a | c-t | a-t | c-r | a-r |
|-----|-----|-----|-----|-----|-----|
| car | λ²  | 0   | 0   | λ³  | λ²  |
| cat | λ²  | λ³  | λ²  | 0   | 0   |

K(car, cat) = λ⁴

For sx = cart, t = cat:

|      | c-a | c-t | a-t | c-r | a-r |
|------|-----|-----|-----|-----|-----|
| cart | λ²  | λ⁴  | λ³  | λ³  | λ²  |
| cat  | λ²  | λ³  | λ²  | 0   | 0   |

K(cart, cat) = λ⁴ + λ⁷ + λ⁵
[Page 22]
Recursive Computation (recap)
• Base cases: K′₀(s, t) = 1 (null substring); K′ᵢ(s, t) = Kᵢ(s, t) = 0 if the target string is shorter than the search substring.
• Naive recursion: O(n·|s|·|t|²). Dynamic programming: O(n·|s|·|t|).
[Page 23]
Efficiency
• Direct computation over all subsequences of length n: O(|Σ|ⁿ)
• Recursive computation: O(n·|s|·|t|²)
• Dynamic programming: O(n·|s|·|t|)
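The O(n·|s|·|t|) dynamic program from the paper can be sketched as follows; it builds a table of the intermediate kernel K′ᵢ over string prefixes, using a running accumulator for the inner sum (`ssk_dp` is a hypothetical name):

```python
def ssk_dp(s, t, n, lam):
    """SSK via dynamic programming, O(n * |s| * |t|).
    Kp[i][a][b] holds K'_i(s[:a], t[:b])."""
    S, T = len(s), len(t)
    Kp = [[[0.0] * (T + 1) for _ in range(S + 1)] for _ in range(n)]
    for a in range(S + 1):
        for b in range(T + 1):
            Kp[0][a][b] = 1.0  # K'_0 = 1 for all prefixes
    for i in range(1, n):
        for a in range(1, S + 1):
            acc = 0.0  # accumulates the inner sum over matching positions
            for b in range(1, T + 1):
                acc = lam * acc
                if s[a - 1] == t[b - 1]:
                    acc += lam ** 2 * Kp[i - 1][a - 1][b - 1]
                Kp[i][a][b] = lam * Kp[i][a - 1][b] + acc
    # Final kernel: sum lam^2 * K'_{n-1} over all matching character pairs
    k = 0.0
    for a in range(1, S + 1):
        for b in range(1, T + 1):
            if s[a - 1] == t[b - 1]:
                k += lam ** 2 * Kp[n - 1][a - 1][b - 1]
    return k
```

This reproduces the worked examples: K(car, cat) = λ⁴ and K(cart, cat) = λ⁴ + λ⁷ + λ⁵.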
[Page 24]
Kernel Normalization
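The kernel is normalized to remove the bias introduced by document length, K̂(s,t) = K(s,t) / √(K(s,s)·K(t,t)). A minimal sketch, with `normalize` as a hypothetical helper over any kernel function:

```python
import math

def normalize(kernel, s, t):
    """Length-normalized kernel: K_hat(s,t) = K(s,t) / sqrt(K(s,s) * K(t,t)).
    `kernel` is any symmetric kernel function of two arguments."""
    return kernel(s, t) / math.sqrt(kernel(s, s) * kernel(t, t))
```

The normalized value lies in [-1, 1] and equals 1 when the two feature vectors point in the same direction, regardless of their lengths.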
[Page 25]
Setting Algorithm Parameters
[Page 26]
Overview
• Motivation
• Kernel Methods
• Algorithms - with increasingly better efficiency
• Approximation
• Evaluation
• Follow Up
• Conclusion
[Page 27]
Kernel Approximation
Suppose we have some training points (xᵢ, yᵢ) ∈ X × Y, and some kernel function K(x, z) corresponding to a feature-space mapping φ : X → F such that K(x, z) = ⟨φ(x), φ(z)⟩.
Consider a set S of vectors S = {sᵢ ∈ X}.
If the cardinality of S is equal to the dimensionality of the space F and the vectors φ(sᵢ) are orthogonal (i.e. K(sᵢ, sⱼ) = C·δᵢⱼ), then the following is true:
K(x, z) = (1/C) Σᵢ K(x, sᵢ)·K(sᵢ, z)
[Page 28]
Kernel Approximation
If, instead of forming a complete orthogonal basis, the cardinality of a subset S_Q ⊆ S is less than the dimensionality of F, or the vectors sᵢ are not fully orthogonal, then we can construct an approximation to the kernel K:
K̂(x, z) = (1/C) Σ_{sᵢ∈S_Q} K(x, sᵢ)·K(sᵢ, z)
If the set S_Q is carefully constructed, a Gram matrix closely aligned with the true Gram matrix can be produced at a fraction of the computational cost.
Problem: choose the set S_Q so that the vectors φ(sᵢ) are orthogonal.
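The approximation above can be sketched directly: project every document onto the basis set, then take inner products of the projections (names hypothetical; with an exactly orthogonal basis and the right C this reproduces the true Gram matrix):

```python
def approx_gram(K, xs, basis, C):
    """Approximate Gram matrix via a basis set S_Q:
    K_hat(x, z) = (1/C) * sum_i K(x, s_i) * K(s_i, z)."""
    P = [[K(x, s) for s in basis] for x in xs]  # projections onto the basis
    n, q = len(xs), len(basis)
    return [[sum(P[a][i] * P[b][i] for i in range(q)) / C
             for b in range(n)] for a in range(n)]
```

Only len(xs) · |S_Q| kernel evaluations are needed, instead of one per pair of documents.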
[Page 29]
Selecting a Feature Subset
The heuristic for obtaining the set S_Q is as follows:
1. Choose a substring size n.
2. Enumerate all possible contiguous strings of length n.
3. Choose the x strings of length n which occur most frequently in the dataset; these form the set S_Q.
By definition, all such strings of length n are orthogonal (i.e. K(sᵢ, sⱼ) = C·δᵢⱼ for some constant C) when used in conjunction with the string kernel of degree n.
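The three-step heuristic above can be sketched as a frequency count over contiguous length-n windows (`top_ngrams` and the parameter names are hypothetical):

```python
from collections import Counter

def top_ngrams(docs, n, x):
    """Pick the x most frequent contiguous length-n strings in the corpus
    to serve as the basis set S_Q."""
    counts = Counter()
    for d in docs:
        counts.update(d[i:i + n] for i in range(len(d) - n + 1))
    return [g for g, _ in counts.most_common(x)]
```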
[Page 30]
Kernel Approximation Results
[Page 31]
Overview
• Motivation
• Kernel Methods
• Algorithms - with increasingly better efficiency
• Approximation
• Evaluation
• Follow Up
• Conclusion
[Page 32]
Evaluation
Dataset: Reuters-21578, ModApte split.
Precision = relevant documents categorized as relevant / total documents categorized as relevant
Recall = relevant documents categorized as relevant / total relevant documents
F1 = 2 · Precision · Recall / (Precision + Recall)
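Given counts of true positives, false positives and false negatives, the three measures above can be computed directly (`prf1` is a hypothetical helper):

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 from true/false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```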
[Page 33]
Evaluation
[Page 34]
Evaluation
[Page 35]
Evaluation: Effectiveness of Sequence Length
(Plots of performance vs. subsequence length per category; the best-performing length is mostly k = 5, with k = 6 or k = 7 for some categories.)
[Page 36]
Evaluation: Effectiveness of Decay Factor
(Plots of performance vs. decay factor per category; the best-performing values are λ = 0.3, 0.03, 0.05 and 0.03.)
[Page 37]
Overview
• Motivation
• Kernel Methods
• Algorithms - with increasingly better efficiency
• Approximation
• Evaluation
• Follow Up
• Conclusion
[Page 38]
Follow Up
• String kernels using sequences of words rather than characters: less computationally demanding, no fixed decay factor, combinations of string kernels.
Cancedda, Nicola, et al. "Word sequence kernels." The Journal of Machine Learning Research 3 (2003): 1059-1082.
• Extracting semantic relations between entities in natural-language text, based on a generalization of subsequence kernels.
Bunescu, Razvan, and Raymond J. Mooney. "Subsequence kernels for relation extraction." NIPS. 2005.
[Page 39]
Follow Up
• Homology: a computational-biology method for identifying the ancestry of proteins. The model should be able to tolerate up to m mismatches. The kernels used in this method measure sequence similarity based on shared occurrences of k-length subsequences, counted with up to m mismatches.
Leslie, Christina, et al. "Mismatch string kernels for discriminative protein classification." Bioinformatics 20.4 (2004): 467-476.
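The mismatch-counting idea can be sketched naively as follows. Assumption worth flagging: this sketch restricts the feature index to k-mers actually observed in the two strings, rather than all |Σ|ᵏ strings as in the full mismatch kernel, so it can under-count shared neighbourhoods; names are hypothetical:

```python
def mismatch_kernel(s, t, k, m):
    """Naive (k,m)-mismatch kernel sketch: phi_u(x) counts the k-mers of x
    within Hamming distance m of u; the kernel is the inner product of the
    two feature vectors. Index u is restricted to observed k-mers (a
    simplification relative to indexing all of Sigma**k)."""
    def kmers(x):
        return [x[i:i + k] for i in range(len(x) - k + 1)]

    def hamming(a, b):
        return sum(c != d for c, d in zip(a, b))

    universe = set(kmers(s)) | set(kmers(t))

    def phi(x):
        return {u: sum(hamming(u, w) <= m for w in kmers(x)) for u in universe}

    fs, ft = phi(s), phi(t)
    return sum(fs[u] * ft[u] for u in universe)
```

With m = 0 this reduces to counting exactly shared k-mer occurrences.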
[Page 40]
Overview
• Motivation
• Kernel Methods
• Algorithms - with increasingly better efficiency
• Approximation
• Evaluation
• Follow Up
• Conclusion
[Page 41]
Conclusion
Key idea: use non-contiguous string subsequences to compute similarity between documents, with a decay factor that discounts similarity according to the degree of contiguity.
• Highly computationally intensive method: the authors reduced the time complexity from O(|Σ|ⁿ) to O(n·|s|·|t|) with a dynamic-programming approach.
• An even less intensive method: kernel approximation by feature-subset selection.
• k and λ are estimated empirically from experimental results.
• Showed promising results only for small datasets.
• Seems to mimic stemming for small datasets.
[Page 42]
Any Questions? Thank You :)