Processing of Large Document Collections 1


Helena Ahonen-Myka
University of Helsinki

Organization of the course

Classes: 17.9., 22.10., 23.10., 26.11.
  lectures (Helena Ahonen-Myka): 10-12, 13-15
  exercise sessions (Lili Aunimo): 15-17
  required presence: 75%
Exercises are given (and returned) each week
  required: 75%
Exam: 4.12. at 16-20, Auditorio
Points: Exam 30 pts, exercises 30 pts

Schedule

17.9. Character sets, preprocessing of text, text categorization
22.10. Text summarization
23.10. Text compression
26.11. … to be announced…
self-study: basic transformations for text data, using linguistic tools, etc.

In this part...

Character sets
preprocessing of text
text categorization

1. Character sets

Abstract character vs. its graphical representation
abstract characters are grouped into alphabets
  each alphabet forms the basis of the written form of a certain language or a set of languages

Character sets

For instance
  for English:
    uppercase letters A-Z
    lowercase letters a-z
    punctuation marks
    digits 0-9
    common symbols: +, =
  ideographic symbols of Chinese and Japanese
  phonetic letters of Western languages

Character sets

To represent text digitally, we need a mapping between (abstract) characters and values stored digitally (integers)
this mapping is a character set
the domain of the character set is called a character repertoire (= the alphabet for which the mapping is defined)

Character sets

For each character in the character repertoire, the character set defines a code value in the set of code points
in English: 26 letters in both lower- and uppercase, ten digits + some punctuation marks
in Russian: Cyrillic letters
both could use the same set of code points (if not a bilingual document)
in Japanese: could be over 6000 characters

Character sets

The mere existence of a character set supports operations like editing and searching of text
usually character sets have some structure
  e.g. integers within a small range
  all lower-case (resp. upper-case) letters have code values that are consecutive integers (simplifies sorting etc.)
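
The structure mentioned above can be checked directly. A minimal sketch, assuming Python's built-in `ord`/`chr` (which expose Unicode code values, identical to ASCII for these letters); the helper name `to_upper_ascii` is made up for illustration:

```python
# Code values of the lower-case letters are consecutive integers.
lower = [ord(ch) for ch in "abcdefghijklmnopqrstuvwxyz"]
is_consecutive = all(b - a == 1 for a, b in zip(lower, lower[1:]))

# Because both cases are consecutive runs, case conversion is a fixed offset.
offset = ord("a") - ord("A")

def to_upper_ascii(ch: str) -> str:
    """Upper-case one ASCII lower-case letter; leave anything else unchanged."""
    if "a" <= ch <= "z":
        return chr(ord(ch) - offset)
    return ch

# Sorting by code value gives alphabetical order within one case.
sorted_words = sorted(["pear", "apple", "fig"])
```

Note that this simple ordering only holds within one case and one alphabet; mixed-case or accented text needs more than code-value comparison.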

Character sets: standards

Character sets can be arbitrary, but in practice standardization is needed for interoperability (between computers, programs, ...)
early standards were designed for English only, or for a small group of languages at a time

Character sets: standards

ASCII
ISO-8859 (e.g. ISO Latin1)
Unicode
UTF-8, UTF-16

ASCII

American Standard Code for Information Interchange
A seven-bit code -> 128 code points
actually 95 printable characters only
  code points 0-31 and 127 are assigned to control characters (mostly outdated)
ISO 646 (1972) version of ASCII incorporated several national variants (accented letters and currency symbols)

ASCII

With 7 bits, the set of code points is too small for anything other than American English
solution: 8 bits brings more code points (256)
  ASCII character repertoire is mapped to the values 0-127
  additional symbols are mapped to other values

Extended ASCII

Problem:
  different manufacturers each developed their own 8-bit extensions to ASCII
    different character repertoires -> translation between them is not always possible
  also 256 code values are not enough to represent all the alphabets -> different variants for different languages

ISO 8859

Standardization of 8-bit character sets
In the 80's: multipart standard ISO 8859 was produced
  defines a collection of 8-bit character sets, each designed for a group of languages
  the first part: ISO 8859-1 (ISO Latin1)
    covers most Western European languages
    0-127: identical to ASCII, 128-159 (mostly) unused, 96 code values for accented letters and symbols

Unicode

256 is not enough code points
  for ideographically represented languages (Chinese, Japanese…)
  for simultaneous use of several languages
solution: more than one byte for each code value
a 16-bit character set has 65,536 code points

Unicode

16-bit character set, i.e. 65,536 code points
not sufficient for all the characters required for Chinese, Japanese, and Korean scripts in distinct positions
  CJK-consolidation: characters of these scripts are given the same value if they look the same

Unicode

Code values for all the characters used to write contemporary 'major' languages
  also the classical forms of some languages
  Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Georgian, Tibetan
  Chinese, Japanese, and Korean ideograms, and the Japanese and Korean phonetic and syllabic scripts

Unicode

punctuation marks
technical and mathematical symbols
arrows
dingbats (pointing hands, stars, …)
both accented letters and separate diacritical marks (accents, tildes…) are included, with a mechanism for building composite characters
  can also create problems: two characters that look the same may have different code values
  -> normalization may be necessary
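
To make the normalization problem concrete, here is a small sketch using Python's standard `unicodedata` module. It compares a precomposed accented letter with the same letter built from a base character plus a combining accent; the helper name `same_text` is an illustrative assumption, and NFC is just one of the standard normalization forms that could be chosen.

```python
import unicodedata

precomposed = "\u00e9"   # one code point: LATIN SMALL LETTER E WITH ACUTE
combined = "e\u0301"     # two code points: 'e' + COMBINING ACUTE ACCENT

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to the composed form NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

raw_equal = precomposed == combined                 # compares code values only
normalized_equal = same_text(precomposed, combined)
```

Both strings render identically, yet a naive comparison (and therefore a naive search or index lookup) treats them as different.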

Unicode

Code values for nearly 39,000 symbols are provided
some part is reserved for an expansion method (see later)
6,400 code points are reserved for private use
  they will never be assigned to any character by the standard, so they will not conflict with the standard

Unicode: encodings

Encoding is a mapping that transforms a code value into a sequence of bytes for storage and transmission
identity mapping for an 8-bit code?
  it may be necessary to encode 8-bit characters as sequences of 7-bit (ASCII) characters
  e.g. Quoted-Printable (QP)
    code values 128-255 as a sequence of 3 bytes
    1: ASCII code for '=', 2 & 3: hexadecimal digits of the value
    233 -> E9 -> =E9
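
The Quoted-Printable rule above can be sketched in a few lines. This is a simplified illustration, not a full QP encoder: the function `qp_encode_byte` is hypothetical and handles only the rule for values 128-255 (real QP also escapes '=' itself and has line-length rules). Python's standard `quopri` module is used only to decode the result as a cross-check.

```python
import quopri

def qp_encode_byte(value: int) -> bytes:
    """Encode one 8-bit code value following the rule for values 128-255."""
    if value >= 128:
        # '=' followed by the two hexadecimal digits of the value
        return b"=" + format(value, "02X").encode("ascii")
    return bytes([value])  # simplification: 7-bit values pass through unchanged

encoded = qp_encode_byte(233)            # 233 is hexadecimal E9
decoded = quopri.decodestring(encoded)   # standard-library decoder
```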

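Unicode: encodings

UTF-8
  ASCII code values are likely to be more common in most text than any other values
  in UTF-8 encoding ASCII characters are sent as themselves (high-order bit 0)
  other characters (two bytes as 16-bit code values) are encoded as multi-byte sequences (up to six bytes in the original UTF-8 definition, at most four in the current standard); every byte of such a sequence has its high-order bit set to 1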
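
A quick way to see this behaviour is to encode a few characters with Python's built-in UTF-8 codec. The helper `utf8_lengths` is an illustrative name, not part of any library.

```python
def utf8_lengths(text: str) -> list[int]:
    """Number of bytes each character of `text` occupies in UTF-8."""
    return [len(ch.encode("utf-8")) for ch in text]

ascii_bytes = "A".encode("utf-8")          # ASCII character: sent as itself
accented_bytes = "\u00e9".encode("utf-8")  # e with acute accent: two bytes

# High-order bit is 0 for ASCII and 1 for every byte of a multi-byte sequence.
ascii_high_bits = [b >> 7 for b in ascii_bytes]
accented_high_bits = [b >> 7 for b in accented_bytes]
```

Text that is mostly ASCII therefore costs one byte per character, which is the motivation given on the slide.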

Unicode: encodings

UTF-16: expansion method
  two 16-bit values are combined to a 32-bit value -> a million characters available
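
The expansion method is what the Unicode standard calls surrogate pairs. The sketch below shows the arithmetic under that assumption; the function name `to_surrogate_pair` is made up for illustration, and the result is cross-checked against Python's built-in UTF-16 codec.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above 0xFFFF into two 16-bit values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond the 16-bit range need a pair")
    offset = code_point - 0x10000          # a 20-bit number
    high = 0xD800 + (offset >> 10)         # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # lower 10 bits
    return high, low

# Each half carries 10 bits, so 1024 * 1024 extra code points become available.
extra_code_points = 1024 * 1024

pair = to_surrogate_pair(0x1F600)
codec_bytes = chr(0x1F600).encode("utf-16-be")
```

The "million characters" on the slide is this product: 1,048,576 additional code points.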

2. Preprocessing of text

Text cannot be directly interpreted by many document processing applications
an indexing procedure is needed
  mapping of a text into a compact representation of its content
which are the meaningful units of text?
how should these units be combined?
  usually not "important"

Vector model

A document is usually represented as a vector of term weights

the vector has as many dimensions as there are terms (or features) in the whole collection of documents

the weight represents how much the term contributes to the semantics of the document
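
As a minimal sketch of this representation, the example below builds the vocabulary of a toy two-document collection and turns each document into a vector with one dimension per term. Raw term counts are used as weights purely for illustration; the documents and the helper name `to_vector` are assumptions.

```python
docs = ["wheat farm export", "bank money bank"]

# One dimension per distinct term in the whole collection.
vocabulary = sorted({term for doc in docs for term in doc.split()})

def to_vector(doc: str, vocabulary: list[str]) -> list[int]:
    """Represent a document as term counts over the collection vocabulary."""
    tokens = doc.split()
    return [tokens.count(term) for term in vocabulary]

vectors = [to_vector(doc, vocabulary) for doc in docs]
```

Most entries are zero even in this tiny example, which hints at why the dimensionality of real collections becomes a problem.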

Vector model

Different approaches:
  different ways to understand what a term is
  different ways to compute term weights

Terms

Words
  typical choice
  set of words, bag of words
phrases
  syntactical phrases
  statistical phrases
  usefulness not yet known?

Terms

Part of the text is not considered as terms
  very common words (function words): articles, prepositions, conjunctions
  numerals
these words are pruned
  stopword list
other preprocessing possible
  stemming, base words
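
Pruning with a stopword list can be sketched as follows. The stopword set here is a tiny made-up sample, and tokenization is plain whitespace splitting; a real system would use a fuller list and a proper tokenizer, and stemming is not shown.

```python
# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "or", "to"}

def prune(text: str) -> list[str]:
    """Drop function words and numerals, keep the remaining tokens as terms."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOPWORDS and not t.isdigit()]

terms = prune("The price of wheat rose 12 percent in the spring")
```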

Weights of terms

Weights usually range between 0 and 1
binary weights may be used
  1 denotes presence, 0 absence of the term in the document
often the tfidf function is used
  higher weight, if the term occurs often in the document
  lower weight, if the term occurs in many documents
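
The slides do not fix a particular tfidf formula, so the sketch below assumes one common variant: term count times log(N / document frequency), followed by length normalization so that weights fall between 0 and 1. The function name `tfidf` and the toy collection are illustrative.

```python
import math

def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Length-normalized tf-idf weights, one dict per tokenized document."""
    n = len(docs)
    df: dict[str, int] = {}
    for tokens in docs:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1   # number of documents containing term
    weighted = []
    for tokens in docs:
        raw = {t: tokens.count(t) * math.log(n / df[t]) for t in set(tokens)}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return weighted

weights = tfidf([["wheat", "wheat", "farm"], ["wheat", "bank"], ["bank", "money"]])
```

In the first document "farm" outweighs "wheat" even though "wheat" occurs twice, because "wheat" also appears in another document.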

Structure

Either the full text of the document or selected parts of it are indexed
e.g. in a patent categorization application
  title, abstract, the first 20 lines of the summary, and the section containing the claims of novelty of the described invention
some parts may be considered more important
  e.g. higher weight for the terms in the title

Dimensionality reduction

Many algorithms cannot handle high dimensionality of the term space (= large number of terms)
usually dimensionality reduction is applied
dimensionality reduction also reduces overfitting
  a classifier that overfits the training data is good at re-classifying the training data but worse at classifying previously unseen data

Dimensionality reduction

Local dimensionality reduction
  for each category, a reduced set of terms is chosen for classification under that category
  hence, different subsets are used when working with different categories
global dimensionality reduction
  a reduced set of terms is chosen for the classification under all categories

Dimensionality reduction

Dimensionality reduction by term selection
  the terms of the reduced term set are a subset of the original term set
Dimensionality reduction by term extraction
  the terms are not of the same type as the terms in the original term set, but are obtained by combinations and transformations of the original ones

Dimensionality reduction by term selection

Goal: select terms that, when used for document indexing, yield the highest effectiveness in the given application
wrapper approach
  the reduced set of terms is found iteratively and tested with the application
filtering approach
  keep the terms that receive the highest score according to a function that measures the "importance" of the term for the task

Dimensionality reduction by term selection

Many functions available
  document frequency: keep the high frequency terms
    stopwords have been already removed
    50% of the words occur only once in the document collection
    e.g. remove all terms occurring in at most 3 documents
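
Document-frequency selection is simple enough to sketch directly. The function name and the `min_df` parameter are assumptions; the slide's example ("remove all terms occurring in at most 3 documents") would correspond to `min_df=4`, while the toy data below uses a smaller cutoff.

```python
from collections import Counter

def select_by_document_frequency(docs: list[list[str]], min_df: int) -> set[str]:
    """Keep the terms that occur in at least `min_df` documents."""
    # Count each term once per document, regardless of repetitions.
    df = Counter(term for tokens in docs for term in set(tokens))
    return {term for term, count in df.items() if count >= min_df}

docs = [["wheat", "farm"], ["wheat", "export"], ["wheat", "farm", "bank"]]
kept = select_by_document_frequency(docs, min_df=2)
```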

Dimensionality reduction by term selection

Information-theoretic term selection functions, e.g.
  chi-square
  information gain
  mutual information
  odds ratio
  relevancy score
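
As one example of such a function, chi-square can be computed from a 2x2 table of document counts for a term and a category. The slides give no formula, so the sketch assumes the standard chi-square statistic for a 2x2 contingency table; the parameter names are illustrative.

```python
def chi_square(n11: int, n10: int, n01: int, n00: int) -> float:
    """Chi-square score of a term for a category.

    n11: documents in the category that contain the term
    n10: documents outside the category that contain the term
    n01: documents in the category without the term
    n00: documents outside the category without the term
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return numerator / denominator if denominator else 0.0

independent = chi_square(10, 10, 10, 10)   # term tells nothing about the category
associated = chi_square(10, 0, 0, 10)      # term occurs exactly in the category
```

In a filtering approach, the terms with the highest scores would be kept.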

Dimensionality reduction by term extraction

Term extraction attempts to generate, from the original term set, a set of "synthetic" terms that maximize effectiveness
due to polysemy, homonymy, and synonymy, the original terms may not be optimal dimensions for document content representation

Dimensionality reduction by term extraction

Term clustering
  tries to group words with a high degree of pairwise semantic relatedness
  groups (or their centroids) may be used as dimensions
latent semantic indexing
  compresses document vectors into vectors of a lower-dimensional space whose dimensions are obtained as combinations of the original dimensions by looking at their patterns of co-occurrence

3. Text categorization

Text classification, topic classification/spotting/detection
problem setting:
  assume: a predefined set of categories, a set of documents
  label each document with one (or more) categories

Text categorization

Two major approaches:
  knowledge engineering -> end of 80's
    manually defined set of rules encoding expert knowledge on how to classify documents under the given categories
  machine learning, 90's ->
    an automatic text classifier is built by learning, from a set of preclassified documents, the characteristics of the categories

Text categorization

Let
  D: a domain of documents
  C = {c1, …, c|C|}: a set of predefined categories
  T = true, F = false
The task is to approximate the unknown target function Φ’: D x C -> {T,F} by means of a function Φ: D x C -> {T,F}, such that the functions "coincide as much as possible"
  function Φ’: how documents should be classified
  function Φ: classifier (hypothesis, model…)

We assume...

Categories are just symbolic labels
  no additional knowledge of their meaning is available
No knowledge outside of the documents is available
  all decisions have to be made on the basis of the knowledge extracted from the documents
  metadata, e.g., publication date, document type, source etc. is not used

-> general methods

Methods do not depend on any application-dependent knowledge
  in operational applications all kinds of knowledge can be used
content-based decisions are necessarily subjective
  it is often difficult to measure the effectiveness of the classifiers
  even human classifiers do not always agree

Single-label vs. multi-label

Single-label text categorization
  exactly 1 category must be assigned to each dj ∈ D
Multi-label text categorization
  any number of categories may be assigned to the same dj ∈ D
Special case of single-label: binary
  each dj must be assigned either to category ci or to its complement ¬ci

Single-label, multi-label

The binary case (and, hence, the single-label case) is more general than the multi-label
  an algorithm for binary classification can also be used for multi-label classification
  the converse is not true
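
One way to read this claim: a multi-label problem over C can be handled as |C| independent binary problems, one per category. The sketch below assumes this decomposition; the two keyword-based binary classifiers are toy stand-ins, not learned models.

```python
from typing import Callable

BinaryClassifier = Callable[[str], bool]

def multi_label(doc: str, classifiers: dict[str, BinaryClassifier]) -> set[str]:
    """Assign every category whose binary classifier accepts the document."""
    return {category for category, decide in classifiers.items() if decide(doc)}

# Hypothetical binary classifiers: one independent yes/no decision per category.
classifiers: dict[str, BinaryClassifier] = {
    "wheat": lambda doc: "wheat" in doc.split(),
    "finance": lambda doc: "bank" in doc.split(),
}

labels = multi_label("wheat export financed by the bank", classifiers)
```

Because each decision is independent, a document may end up with zero, one, or several categories.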

Category-pivoted vs. document-pivoted

Two different ways of using a text classifier
  given a document, we want to find all the categories under which it should be filed -> document-pivoted categorization (DPC)
  given a category, we want to find all the documents that should be filed under it -> category-pivoted categorization (CPC)

Category-pivoted vs. document-pivoted

The distinction is important, since the sets C and D might not be available in their entirety right from the start
  DPC: suitable when documents become available at different moments in time, e.g. filtering e-mail
  CPC: suitable when new categories are added after some documents have already been classified (and have to be reclassified)

Category-pivoted vs. document-pivoted

Some algorithms may apply to one style and not the other, but most techniques are capable of working in either mode

Hard categorization vs. ranking categorization

Hard categorization
  the classifier answers T or F
Ranking categorization
  given a document, the classifier might rank the categories according to their estimated appropriateness to the document
  respectively, given a category, the classifier might rank the documents

Applications of text categorization

Automatic indexing for Boolean information retrieval systems
document organization
text filtering
word sense disambiguation
hierarchical categorization of Web pages

Automatic indexing for Boolean IR systems

In an information retrieval system, each document is assigned one or more keywords or keyphrases describing its content
  keywords belong to a finite set called controlled dictionary
TC problem: the entries in a controlled dictionary are viewed as categories
  k1 ≤ x ≤ k2 keywords are assigned to each document
  document-pivoted TC

Document organization

Indexing with a controlled vocabulary is an instance of the general problem of document base organization
e.g. a newspaper office has to classify the incoming "classified" ads under categories such as Personals, Cars for Sale, Real Estate etc.
organization of patents, filing of newspaper articles...

Text filtering

Classifying a stream of incoming documents dispatched in an asynchronous way by an information producer to an information consumer
e.g. newsfeed
  producer: news agency; consumer: newspaper
  the filtering system should block the delivery of documents the consumer is likely not interested in

Word sense disambiguation

Given the occurrence in a text of an ambiguous word, find the sense of this particular word occurrence
E.g.
  Bank of England
  the bank of river Thames
  "Last week I borrowed some money from the bank."

Word sense disambiguation

Indexing by word senses rather than by words
text categorization
  documents: word occurrence contexts
  categories: word senses
also resolving other natural language ambiguities
  context-sensitive spelling correction, part of speech tagging, prepositional phrase attachment, word choice selection in machine translation

Hierarchical categorization of Web pages

E.g. Yahoo-like hierarchical web catalogues

typically, each category should be populated by "a few" documents

new categories are added, obsolete ones removed

usage of link structure in classification
usage of the hierarchical structure

Knowledge engineering approach

In the 80's: knowledge engineering techniques
  manually building expert systems capable of taking text categorization decisions
  expert system: consists of a set of rules
    wheat & farm -> wheat
    wheat & commodity -> wheat
    bushels & export -> wheat
    wheat & winter & ~soft -> wheat
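
The four rules above can be encoded directly as data. The sketch below is one possible encoding, assuming each rule is a set of required words plus a set of forbidden words (for the ~ negation) and that a document matches if any rule fires; the names `RULES` and `is_wheat` are illustrative.

```python
# Each rule: (words that must all occur, words that must not occur).
RULES = [
    ({"wheat", "farm"}, set()),
    ({"wheat", "commodity"}, set()),
    ({"bushels", "export"}, set()),
    ({"wheat", "winter"}, {"soft"}),   # wheat & winter & ~soft
]

def is_wheat(text: str) -> bool:
    """File the document under 'wheat' if at least one rule fires."""
    words = set(text.lower().split())
    return any(
        required <= words and not (forbidden & words)
        for required, forbidden in RULES
    )
```

Every new category or vocabulary shift means editing `RULES` by hand, which is the maintenance cost the approach is criticized for.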

Knowledge engineering approach

Drawback: rules must be manually defined by a knowledge engineer with the aid of a domain expert
  any update again necessitates human intervention
  totally domain dependent
  -> expensive and slow process

Machine learning approach

A general inductive process (learner) automatically builds a classifier for a category ci by observing the characteristics of a set of documents manually classified under ci or ¬ci by a domain expert
from these characteristics the learner gleans the characteristics that a new unseen document should have in order to be classified under ci
supervised learning (= supervised by the knowledge of the training documents)

Machine learning approach

The learner is domain independent
  usually available 'off-the-shelf'
the inductive process is easily repeated, if the set of categories changes
manually classified documents often already available
  manual process may exist
  if not, it is still easier to manually classify a set of documents than to build and tune a set of rules

Training set, test set, validation set

Initial corpus of manually classified documents
  let dj belong to the initial corpus
  for each pair <dj, ci> it is known if dj should be filed under ci
positive examples, negative examples of a category

Training set, test set, validation set

The initial corpus is divided into two sets
  a training (and validation) set
  a test set
the training set is used to build the classifier
the test set is used for testing the effectiveness of the classifiers
  each document is fed to the classifier and the decision is compared to the manual category

Training set, test set, validation set

The documents in the test set are not used in the construction of the classifier
alternative: k-fold cross-validation
  k different classifiers are built by partitioning the initial corpus into k disjoint sets and then iteratively applying the train-and-test approach on pairs, where k-1 sets construct a training set and 1 set is used as a test set
  individual results are then averaged
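
The k-fold procedure can be sketched without any learning library. In the code below, `train_and_score` stands for whatever learner and effectiveness measure are in use; it is a placeholder supplied by the caller, and both function names are assumptions for illustration.

```python
from typing import Callable, Iterator

def k_fold_splits(items: list, k: int) -> Iterator[tuple[list, list]]:
    """Partition items into k disjoint sets; yield (training set, test set) pairs."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

def cross_validate(items: list, k: int,
                   train_and_score: Callable[[list, list], float]) -> float:
    """Build and test k classifiers, then average the individual results."""
    scores = [train_and_score(train, test) for train, test in k_fold_splits(items, k)]
    return sum(scores) / len(scores)
```

Every document is used for testing exactly once and never appears in the training set of the classifier that is tested on it.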

Training set, test set, validation set

Training set can be split into two parts
one part is used for optimising parameters
  test which values of parameters yield the best effectiveness
test set and validation set must be kept separate

Inductive construction of classifiers

A ranking classifier for a category ci
  definition of a function that, given a document, returns a categorization status value for it, i.e. a number between 0 and 1
  documents are ranked according to their categorization status value

Inductive construction of classifiers

A hard classifier for a category
  definition of a function that returns true or false, or
  definition of a function that returns a value between 0 and 1, followed by a definition of a threshold
    if the value is higher than the threshold -> true
    otherwise -> false
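
The relation between the two kinds of classifier can be sketched as follows: a function returning a categorization status value supports ranking directly, and adding a threshold turns it into a hard classifier. The status function `wheat_status` is a deliberately naive, hypothetical example; the threshold 0.3 is arbitrary.

```python
from typing import Callable

def make_hard_classifier(status_value: Callable[[str], float],
                         threshold: float) -> Callable[[str], bool]:
    """Turn a status-value function into a true/false classifier."""
    def classify(doc: str) -> bool:
        return status_value(doc) > threshold
    return classify

def rank(docs: list[str], status_value: Callable[[str], float]) -> list[str]:
    """Order documents by decreasing categorization status value."""
    return sorted(docs, key=status_value, reverse=True)

def wheat_status(doc: str) -> float:
    """Hypothetical status value in [0, 1]: fraction of tokens equal to 'wheat'."""
    tokens = doc.split()
    return tokens.count("wheat") / len(tokens) if tokens else 0.0

is_wheat_doc = make_hard_classifier(wheat_status, threshold=0.3)
ranking = rank(["bank money", "wheat farm", "wheat wheat"], wheat_status)
```

In practice the threshold is one of the parameters that would be tuned on the validation set.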

Learners

Probabilistic classifiers (Naïve Bayes)
decision tree classifiers
decision rule classifiers
regression methods
on-line methods
neural networks
example-based classifiers (k-NN)
support vector machines

Rocchio method

Linear classifier method
for each category, an explicit profile (or prototypical document) is constructed
  benefit: profile is understandable even for humans

Rocchio method

A classifier is a vector of the same dimension as the documents
weights:
classifying: cosine similarity of the category vector and the document vector

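Rocchio weights for category ci:

$w_{ki} = \beta \cdot \sum_{d_j \in POS_i} \frac{w_{kj}}{|POS_i|} - \gamma \cdot \sum_{d_j \in NEG_i} \frac{w_{kj}}{|NEG_i|}$

where POS_i and NEG_i are the sets of positive and negative training examples of ci, and w_kj is the weight of term k in document dj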
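
A minimal sketch of the Rocchio method, assuming the usual formulation in which the profile weight of each term is the average weight over positive examples minus the average over negative examples, scaled by two control parameters. The parameter names `beta` and `gamma`, their default values, and the toy vectors are assumptions made for illustration.

```python
import math

def rocchio_profile(pos: list[list[float]], neg: list[list[float]],
                    beta: float = 16.0, gamma: float = 4.0) -> list[float]:
    """Category profile: beta * mean(positive) - gamma * mean(negative), per term."""
    dims = len((pos or neg)[0])
    profile = []
    for k in range(dims):
        pos_mean = sum(d[k] for d in pos) / len(pos) if pos else 0.0
        neg_mean = sum(d[k] for d in neg) / len(neg) if neg else 0.0
        profile.append(beta * pos_mean - gamma * neg_mean)
    return profile

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors of the same dimension."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Two terms; positive examples use term 0, the negative example uses term 1.
profile = rocchio_profile(pos=[[1.0, 0.0], [1.0, 0.0]], neg=[[0.0, 1.0]],
                          beta=1.0, gamma=1.0)
```

A new document is then scored by `cosine(profile, document_vector)`; because the profile is just a weighted term list, it can be inspected by a human.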
