Text Mining


Lecture Outline

Introduction
Data Representations for Text
Preprocessing
Dimensionality Reduction
Text-data Analysis: IE, IR & IF, Clustering, Categorization


Introduction

The importance of text is directly associated with the fact that it currently expresses a vast, rich range of information in an unstructured or semi-structured format. However, information in text is difficult to retrieve automatically.

Text mining is defined as the process of automatically extracting useful knowledge from enormous collections of natural text documents (referred to as document collections), which are very dynamic and contain documents from various sources.


One of the main reasons for the rapid growth in the size of those collections is the vast and continuous spread of information over the Internet.

Text mining emerged as an independent research area from the combination of older research areas like machine learning, natural language processing, and information retrieval. It is sometimes viewed as an adapted form of a very similar research field, namely data mining.


Data mining deals with structured data represented in relational tables or multidimensional cubes.

Both fields are based on machine learning and share a great deal of ideas and algorithms; however, each deals with a different type of data and thus has to adapt its ideas and algorithms to its own perspective.


Data Representations for Text

Text mining deals with text data, which are unstructured (or semi-structured, in the case of XML) in nature.

In order to alleviate this problem, indexing is utilized.

Indexing is the process of mapping a document into a structured format that represents its content.

It can be applied to the whole document or to some parts of it, though the former option is the rule.


In indexing, usually the terms occurring in the given collection of documents are used to represent the documents.

Documents contain a lot of terms that are frequently repeated or that have no significant relation to the context in which they exist.

Using all the terms would certainly result in high inefficiency that could be eliminated with some preprocessing steps, as we shall see later.


The Vector Space Model

One very widely used indexing model is the vector space model, which is based on the bag-of-words (or set-of-words) approach.

This model has the advantages of being relatively computationally efficient and conceptually simple. However, it suffers from the fact that it loses important information about the original text, such as information about the order of the terms in the text or about the boundaries between sentences or paragraphs.


Each document is represented as a vector, the dimensions of which are the terms in the initial document collection.

The set of terms used as dimensions is referred to as the term space.

Each vector coordinate corresponds to a term and holds a numeric value representing that term’s relevance to the document. Usually, higher values imply higher relevance.

The process of giving numeric values to vector coordinates is referred to as weighting.

From an indexing point of view, weighting is the process of giving more emphasis to more important terms.


Three popular weighting schemes have been discussed in the literature: binary, TF, and TF*IDF.

For a term t in document d, the binary scheme records binary coordinate values, where a 1-value is given to t if it occurs at least once in d, and a 0-value is given otherwise.


The term frequency (TF) scheme records the number of occurrences of t in d. Usually, TF measurements are normalized to help overcome the problems associated with document sizes.

Normalization may be achieved by dividing all coordinate measurements for every document by the highest coordinate measurement for that document.


The term frequency by inverse document frequency (TF*IDF) simply weights TF measurements with a global weight, the IDF measurement.

The IDF measurement for a term t is defined as log2(N/Nt), where N is the total number of documents in the collection, and Nt is the total number of documents containing at least one occurrence of t.

Note that the IDF weight increases as Nt decreases, i.e., as the uniqueness of the term among the documents increases, thus giving the term a higher weight.

As in TF, normalization is usually done here.
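To make the three schemes concrete, here is a minimal sketch in plain Python (the two toy documents are invented for the example) that computes binary, normalized-TF, and TF*IDF weights as defined above.

```python
import math
from collections import Counter

docs = ["text mining extracts knowledge from text",
        "data mining deals with structured data"]          # toy collection

N = len(docs)
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})      # the term space

# Document frequency: number of documents containing each term.
df = {t: sum(t in doc for doc in tokenized) for t in vocab}

def weights(doc):
    tf = Counter(doc)
    max_tf = max(tf.values())                              # for normalization
    binary = {t: 1 if t in tf else 0 for t in vocab}
    norm_tf = {t: tf[t] / max_tf for t in vocab}           # TF divided by the
                                                           # highest count in the doc
    tfidf = {t: norm_tf[t] * math.log2(N / df[t]) for t in vocab}
    return binary, norm_tf, tfidf

b, tf_, ti = weights(tokenized[0])
# "text" occurs only in doc 0, so it gets weight 1.0; "mining" occurs in
# every document, so its IDF (and thus TF*IDF) is 0.
print(round(ti["text"], 2), round(ti["mining"], 2))
```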


More Sophisticated Representations

A number of experiments were conducted in an attempt to find representations that are more sophisticated than the term-based scheme. Some tried using phrases for indexing instead of terms, while others used string kernels. One noteworthy model, which has been applied mainly in document categorization, is the Darmstadt Indexing Approach (DIA).


DIA considers properties of terms, documents, categories, or any pair-wise combination of any of those as dimensions.

For example: a property of a term is its IDF measurement; a property of a document is its length; and a property of the relationship between a term t and a document d is the TF measurement of t in d, or the location of t in d.


For every considered pair-wise combination of dimensions, every possible value is collected in a relevance description vector, rd(di, dj), where (di, dj) is a selected pair-wise combination of dimensions.

An example is the (document, category) pair where we collect the (document, category) values of all possible combinations and store them in the corresponding relevance description vector.


Sophisticated representations may appear to have superior qualities; however, most of the conducted experiments did not yield any significant improvement over the traditional term-based schemes.

There are suggestions that using a combination of terms and phrases together might improve results.

Technically, using single terms as the dimensions of the vectors used to describe documents is referred to as post-coordination of terms, while using compound words for the same purpose is referred to as pre-coordination of terms.


Preprocessing

Before indexing is performed, a sequence of preprocessing steps is usually applied in an attempt to optimize the indexing process. The main target is to reduce the number of terms used, thus leading to more efficient text-based applications later.

Case folding is the process of converting all the characters in a document into the same case, either all upper case or all lower case. For example, the words “Did,” “dId,” “DiD,” and “dID” are all converted to “did” or “DID,” depending on the chosen case. This step has the advantage of speeding up comparisons in the indexing process.


Stemming is the process of removing prefixes and suffixes from words so that all words are reduced to their stems or original forms. For example, the words “Computing,” “Computer,” and “Computational” all map to “Compute.” This step has the advantage of eliminating suffixes and prefixes that indicate part-of-speech and verbal or plural inflections.

Stemming algorithms employ a great deal of linguistics and are language dependent.


Stop words are words having no significant semantic relation to the context in which they exist.

Stop words can be terms that occur frequently in most of the documents in a document collection, i.e., have low uniqueness and thus low IDF measurements.

A list containing all the stop words in a system or application is referred to as a stop list. Stop words are not included as indexing terms.

For example, the words “the,” “on,” and “with” are usually stop words. Also the word “blood” would probably be a stop word in a collection of articles addressing blood infections, but not in a collection of articles describing the events of the 2002 FIFA World Cup.
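A minimal sketch of these preprocessing steps in Python (the stop list is a tiny illustrative sample, and the suffix stripping is a crude stand-in for a real stemming algorithm such as Porter's):

```python
# Toy pipeline: case folding, stop-word removal, and a crude
# suffix-stripping "stemmer" (a real system would use e.g. Porter).
STOP_LIST = {"the", "on", "with", "a", "of"}      # illustrative sample only
SUFFIXES = ("ational", "ing", "ers", "er", "s")   # checked longest-first

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def preprocess(text):
    tokens = text.lower().split()                 # case folding + tokenization
    tokens = [t for t in tokens if t not in STOP_LIST]
    return [stem(t) for t in tokens]

print(preprocess("Computing with the computers"))
# ['comput', 'comput'] -- both words reduce to the same stem
```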


An N-gram representation uses n-character slices of longer strings, referred to as n-grams, as dimensions. The string “MINE” can be represented by the tri-grams _MI, MIN, INE, and NE_, or by the quad-grams _MIN, MINE, and INE_, where the underscore character denotes a space.

This representation offers an alternative to stemming and stop words.

The advantage of this representation is that it is less sensitive to grammatical and typographical errors and, unlike stemming, requires no linguistic knowledge. It is therefore more language independent; in practice, however, it is not very effective at dimensionality reduction.
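A short sketch generating these character n-grams (padding with one space, shown as an underscore in the slide's notation):

```python
def char_ngrams(s, n):
    padded = " " + s + " "                      # one space of padding per side
    return [padded[i : i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("MINE", 3))   # [' MI', 'MIN', 'INE', 'NE ']
print(char_ngrams("MINE", 4))   # [' MIN', 'MINE', 'INE ']
```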


Dimensionality Reduction

Perhaps one of the most evident problems in dealing with text is the high dimensionality of its representations.

As a matter of fact, some claim that text mining came to light as a major research area only in the last decade when computer hardware became very affordable.


Dimensionality reduction is the process through which the number of vector coordinates (or dimensions) in the vector space model is reduced from T to T’, where T’<<T.

The new set of dimensions is referred to as the reduced term set. The issue of dimensionality reduction might be application dependent, i.e., text-clustering applications might use different schemes for dimensionality reduction than text-categorization applications.


The rationale behind this application dependency is that after dimensionality reduction is performed, some information is lost, regardless of the significance of this information.

To reduce the effect of this loss, each application uses dimensionality reduction schemes that ensure that the lost information has minimal impact on the application at hand.

However, many generic schemes have been developed that apply to most text-mining applications.


The general reduction schemes can be classified into two categories: reduction by term selection and reduction by term extraction.

Reduction by term selection (term space reduction, or TSR) reduces the number of terms from T to T1 (T1 << T) such that, when T1 is used for indexing instead of T, it yields more effective results than any other subset T2 of T.

One simple approach to performing TSR is to randomly select a set of terms from T, apply the algorithm of the application, and compare the results to those obtained using all the terms in T. This process is then repeated a fixed number of times, each time with another random subset of T. Finally, the subset that yields the best results is selected.


Another approach to determining T1 is to select the terms from T that have the highest score according to some function. One such function is the IDF measurement, i.e., select the terms having the highest IDF weights.
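A sketch of this IDF-based term selection (the toy documents and the cut-off k are invented for illustration):

```python
import math

def top_idf_terms(tokenized_docs, k):
    """Keep the k terms with the highest IDF, i.e., the most unique terms."""
    N = len(tokenized_docs)
    vocab = {t for doc in tokenized_docs for t in doc}
    idf = {t: math.log2(N / sum(t in doc for doc in tokenized_docs))
           for t in vocab}
    return sorted(vocab, key=lambda t: idf[t], reverse=True)[:k]

docs = [["text", "mining"], ["data", "mining"], ["text", "data", "cube"]]
print(top_idf_terms(docs, 2))   # 'cube' appears in only one doc, so it ranks first
```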

Reduction by term extraction attempts to generate from T another set T1, where the terms in T1 do not necessarily exist in T.

A term in T1 can be a combination of terms from T or a transformation of some term (or group of terms) in T.


One approach is to use term clustering where all the terms in T are clustered such that terms in the same cluster have a high degree of semantic relatedness.

The centers of the clusters are then used as the terms in T1.

This approach helps solve the problem of synonymy: terms having similar meanings but different spellings.


Text-data Analysis

Text mining is a very broad process which is usually refined into a number of tasks.

Today, the number of tasks available in text mining exceeds that available in data mining, which is based entirely on three pillars: association rule mining, classification, and clustering.


In text mining, we have:
Information Extraction
Information Retrieval and Information Filtering
Document Clustering
Document Categorization


Information Extraction

Information extraction (IE) is considered to be the most important text-mining task. Some even go to the extreme of using the two terms interchangeably, though this is not the case. IE has emerged as a joint research area between text mining and natural language processing (NLP).


It is the process of extracting predefined information about objects and relationships among those objects from streams of documents and usually storing this information in pre-designed templates.

It is very important to associate information extraction with streams of documents rather than static collections.


As an example, some might extract information like promotions and sales from a stream of documents.

The information extracted might be the event, companies involved, or dates.

These kinds of systems are usually referred to as news-skimming systems.


IE proceeds in two steps: it first divides a document into relevant and irrelevant parts, and then fills the predefined templates with the information extracted from the relevant parts.

Simple IE tasks, such as extracting proper names or companies from text, can be performed with high precision; however, high precision is still not the case in more complex tasks, like determining sequences of events from a document.

In complex tasks, IE systems are usually defined over and applied on very restricted domains, and transforming the system from one domain to another needs a lot of work and requires the support of domain experts.


In short, IE systems scan streams of documents in order to transform the associated documents into much smaller pieces of extracted relevant information, which are easier to comprehend.


IE applications: Template filling fills templates with information extracted from a document. The templates are then stored in structured environments, such as databases, for fast information retrieval later.

Question-answering is a variant of template filling.


It can answer questions like, “Where is Lebanon?” but cannot answer complicated questions such as, “Which country had the lowest inflation in 1999?”

Summarization maps documents into extracts, which are machine-made summaries, as opposed to abstracts, which are man-made summaries. Applications usually extract a group of highly relevant sentences and present them as a summary.


Term extraction extracts the most important terms from a set of documents. Those extracted terms can then be used for efficient indexing of the documents, as opposed to using all the terms in the documents.


Information Retrieval and Information Filtering

Information retrieval (IR) and information filtering (IF) are two separate processes having equivalent underlying goals: both deal with the problem of information seeking. IR is viewed as the ancestor of IF; the reason for this view is that IR is older, and IF bases a lot of its foundations on IR. Most of the research done in IR has been used and/or adapted to fit IF. Nevertheless, many differences exist between the two that render them two separate application areas.


Given some information need represented by the user in a suitable manner, they are concerned with giving back a set of documents that satisfy that need.

The information need is represented via queries in IR systems and via profiles in IF systems.


Differences

Goal: For IR, the primary goal is to collect and organize documents that match a given query according to some ranking function. The primary goal of IF is to distribute newly received documents to all users with matching profiles.

Users: IR systems are usually used once by a single user (a one-time query user), while IF systems are repeatedly used by the same user with some profile.

Usage: This fact makes IR systems more suitable for serving users with short-term needs, as opposed to IF systems, which serve users having needs that are relatively static over long periods of time.


Information needs: IR systems are developed to tolerate some inadequacies in the query representation of the information need. On the other hand, profiles are assumed to be highly accurate in IF systems.

Data types: In terms of data, IR systems usually operate over static collections of documents, while IF systems deal with dynamic streams of documents. In IF, the timeliness of a document is of great significance, unlike in IR, where this can be negotiated.


Evaluation: The precision and recall measures are used. Precision is the percentage of retrieved documents that are relevant to the query or profile. It is calculated as the number of relevant documents retrieved divided by the total number of retrieved documents.


Recall measures the percentage of relevant documents retrieved out of the total number of relevant documents in the database.

Roughly speaking, those two measures have an inverse relationship.

As the number of results returned increases, the probability of achieving higher recall increases; but at the same time, the probability of returning wrong answers also increases, lowering precision.

Ideally, applications aim to maximize both.
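A tiny sketch of the two measures (the document IDs are arbitrary examples):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall over sets of retrieved and truly relevant doc IDs."""
    hits = len(retrieved & relevant)           # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {1, 2, 3, 4}                       # toy document IDs
relevant = {2, 4, 7}
print(precision_recall(retrieved, relevant))   # precision 2/4, recall 2/3
```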


IR

IR refers to the retrieval of a set of documents, from a collection of documents, that match a certain query posed by a user. The retrieved documents are then rank ordered and presented to the user.

The query is simply the representation of the user’s information need in a language understood by the system. This representation is considered an approximation due to the difficulty associated with representing information needs accurately. The query is then matched against the documents, which are organized into text surrogates.

The collection of text surrogates can be viewed as a summarized structured representation of unstructured text data, such as lists of keywords, titles, or abstracts.


Text surrogates provide an alternative to the original documents, as they take far less time to examine and, at the same time, encode enough semantic cues to be used in matching instead of the original documents.

As a result of matching, a set of documents is selected and presented to the user. The user either uses those documents or gives some feedback to the system, resulting in modifications to the query and the original information need or, in rare cases, to the text surrogates.

This interactive process goes on until the user is satisfied or until the user leaves the system.


IF

IF systems deal with large streams of incoming documents, usually broadcast via remote sources. IF is sometimes referred to as document routing.

The system maintains profiles created by users to describe their long-term interests. Profiles may describe what the user likes or dislikes. New incoming documents are removed from the stream routed to some user if those documents do not match the user’s profile. As a result, the user only views what is left in the stream after the mismatching documents have been removed; an email filter, for example, removes all “junk” email.


The first step in using an IF system is to create a profile.

A profile represents a user’s or a group of users’ information need, which is assumed to be stable over a long period of time.

Whenever a new document is received through the data stream, the system represents it as text surrogates and compares it against every profile stored in the system.

If the document matches a profile, it will be routed to the corresponding user. The user can then use the received documents and/or provide feedback.

The feedback provided may lead to modifications in the profile and the information need.


Constructing IR and IF Systems

A number of models have been utilized to construct IR and IF systems. Most of those models were initially developed for the purpose of IR. With the advent of IF, IR methods were adapted to fit IF needs. IF also stimulated more research, which resulted in the development of new models that are currently used in both areas.


In the literature, IR and IF models are categorized into two groups: traditional and modern.

The string-matching traditional model: The user specifies his/her information need by a string of words. A document matches the information need of a user if the user-specified string exists in the document.

This method is one of the earliest and simplest approaches.


It suffers from three problems:
Homonymy: the meaning of a word depends on the context in which it appears.
Synonymy: different words having the same meaning.
Bad response time.

However, it requires no space overhead.


The Boolean traditional model is a modification of the earlier method where the user can combine words using primitive Boolean operators such as AND, OR, and NOT.

This method gives the user a tool to better express his/her information need, but at the same time requires some skill on behalf of the user. The user has to be very familiar with the Boolean primitives, especially in cases of complex queries.

It also retains the problems of the string-matching model, along with its simplicity and space efficiency. A lot of web-based search engines, including Alta Vista, Excite, and Lycos, are based on this method.


The vector-space traditional model uses term statistics to match information needs and documents. Each document is represented as a vector of n dimensions, where every dimension is a term pulled from the set of terms used to identify the content of all documents and profiles or queries.

Profiles and queries are represented in the same way.

Each term is then given a weight depending on its importance in revealing the content of its associated document, profile, or query.

This method has the advantages of requiring minimal user intervention and being robust; however, it still suffers from homonymy and synonymy.


The thesaurus traditional model employs thesauri to solve vocabulary problems like synonymy. A thesaurus is a set of terms (words or phrases) with relations among each other. The purpose of the thesaurus is to apply either:
simple word-to-word substitutions, where all the terms related in the thesaurus (i.e., having the same meaning) are replaced by the same term, or
concept-hierarchy substitutions, such as generalization substitutions, where all terms are generalized to suitable higher-level terms according to the concept hierarchies described in the thesaurus.

Though the use of thesauri helps solve the problem of synonymy to a large extent, this method still suffers from homonymy.


The signature file model is a traditional model in which each document is represented by a fixed-length bit string, referred to as the signature of the document. To get this signature, hashing and superimposed coding techniques are applied to the words of the document.

The signatures of all documents are stored sequentially in a file, the signature file, which is much smaller than the original set of documents. Comparison between queries or profiles and documents is done using the signature file. This method has the advantages of being simple and of being able to tolerate typos and other spelling errors.
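A toy sketch of the idea, with hashing and superimposed coding reduced to their simplest form (the signature width and hash count are arbitrary choices for the example):

```python
# Each word sets a few bits via hashing, and all word masks are OR-ed
# (superimposed) into one fixed-length bit string per document.
SIG_BITS = 64
NUM_HASHES = 2

def word_mask(word):
    mask = 0
    for i in range(NUM_HASHES):
        mask |= 1 << (hash((word, i)) % SIG_BITS)
    return mask

def signature(words):
    sig = 0
    for w in words:
        sig |= word_mask(w)
    return sig

doc_sig = signature("text mining extracts knowledge".split())
query_mask = word_mask("mining")
# A document *may* match if all the query bits are set in its signature
# (false positives are possible; misses are not).
print(query_mask & doc_sig == query_mask)   # True
```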


The inversion traditional model represents each document by a set of keywords that describe the document’s content.

Those keywords are stored (usually alphabetically) in an index file.

Another file, the postings file, is created to hold the set of pointers from each keyword to the documents that it represents.


The list of keywords in the index file and the set of pointers in the postings file are used for matching against queries or profiles.

This method suffers from storage overhead (up to 300% of the original data size) and from the cost of keeping the index updated in a dynamic environment, which makes it more suitable for IR systems than for IF systems. On the other hand, it is easy to implement, fast, and supports synonyms.
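A compact sketch of the inversion idea; here the index file and postings file are folded into one in-memory dict (the documents are invented):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to the sorted list of IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.split():
            postings[word].add(doc_id)
    # Keywords kept in alphabetical order, as in an index file.
    return {w: sorted(ids) for w, ids in sorted(postings.items())}

docs = ["text mining", "data mining", "text data cube"]
index = build_inverted_index(docs)
print(index["mining"])   # [0, 1]
print(index["text"])     # [0, 2]
```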


The traditional models described thus far use only a limited portion of the information associated with a document. This limited information is not enough.

Modern methods attempt to be more sophisticated in that regard.


Latent semantic indexing (LSI) uses the singular value decomposition (SVD) method to decrease the number of dimensions. LSI attempts to organize documents into semantic structures. Instead of using terms to reveal content, terms are associated with semantic structures, which are in turn used for content description.

Documents, queries, and profiles are represented as vectors of those semantic structures, which results in a more efficient representation capable of handling semantics quite well. The document-profile or document-query matching is done by calculating the cosine of the angle between the two corresponding vectors.
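A minimal numpy sketch of the LSI pipeline (the term-document matrix and the choice k = 2 are invented; the query projection is the standard SVD-based fold-in):

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
A = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 1.],
              [0., 1., 2.]])

k = 2                                   # number of semantic dimensions to keep
U, s, Vt = np.linalg.svd(A, full_matrices=False)
docs_k = (np.diag(s[:k]) @ Vt[:k]).T    # documents in the k-dim semantic space

def fold_in(query_terms):
    """Project a term-space query vector into the semantic space."""
    return query_terms @ U[:, :k] / s[:k]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

q = fold_in(np.array([2., 1., 0., 0.]))           # query copying document 0
print([round(cosine(q, d), 2) for d in docs_k])   # expect document 0 to rank first
```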


The connectionist model uses a neural network where each node represents a keyword from the set of documents.

To query a set of documents, the activity levels of the input nodes are set to a specific level. This activity is then propagated through the other layers, eventually reaching the output layer, which points out the matching documents.


In cases where the network is adequately trained, this method has shown good results, revealing its ability to handle semantics properly. One disadvantage of using neural networks is their inability to explain to the user the reasoning behind the results produced.


Document Clustering

Document clustering is the grouping of representations of similar documents into partitions; documents within the same partition exhibit a higher degree of similarity to each other than to any document in any other partition.

Clusters are usually mutually exclusive and collectively exhaustive.

Some examples of applications that use clustering include browsing a collection of documents, organizing the documents returned by a search engine for some query, and automatically generating hierarchical clusters of documents.


The two most popular techniques for document clustering are hierarchical clustering and k-means clustering.

Hierarchical clustering produces a hierarchy of partitions, with one single partition including all documents at one end and singleton clusters, each made of an individual document, at the other end. The tree depicting the hierarchy of clusters is referred to as a dendrogram. Each cluster along the partition hierarchy is viewed as a combination of two clusters from the next lower or higher level, depending on the type of hierarchical clustering used, i.e., divisive or agglomerative, respectively.


Agglomerative clustering starts with the set of all singleton clusters, each including one document, as the root of the tree, and combines the most similar pair of clusters at every tree level until it forms one cluster containing all documents at the leaf level.

Divisive clustering starts with a single cluster containing all documents at the root level and splits one cluster into two clusters at every level of the tree until it forms a set of clusters at the leaf level, each containing one and only one document.
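A brief agglomerative-clustering sketch, assuming SciPy is available (linkage repeatedly merges the most similar pair of clusters; the document vectors are invented):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy TF-IDF-style document vectors (rows = documents).
docs = np.array([[1.0, 0.9, 0.0],
                 [0.9, 1.0, 0.1],
                 [0.0, 0.1, 1.0],
                 [0.1, 0.0, 0.9]])

# Agglomerative clustering: merge the most similar pair at each step.
Z = linkage(docs, method="average", metric="cosine")
print(fcluster(Z, t=2, criterion="maxclust"))   # e.g. [1 1 2 2]
```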


K-means clustering creates a set of k clusters and distributes the given documents among those clusters using the similarity between the document vectors and the cluster means.

A mean is the average vector of all document vectors in the respective cluster.

Every time we add a document to a cluster, we recalculate the mean of that cluster.


Similarity between a document and a mean can be calculated using the cosine similarity between the corresponding vectors.

It is almost always the case that a mean does not correspond to an actual document.

A variant of k-means, called k-medoids, requires the mean to be an actual document vector: the document vector in the cluster that is closest to the mean is selected as the cluster’s representative.
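A compact k-means sketch using cosine similarity, recomputing each mean after reassignment (the initialization and toy vectors are arbitrary choices for the example):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def kmeans(docs, k, iters=10):
    means = docs[:k].copy()                       # naive initialization
    for _ in range(iters):
        # Assign each document to the most similar mean.
        labels = [max(range(k), key=lambda c: cosine(d, means[c])) for d in docs]
        # Recompute each mean as the average of its cluster's documents.
        for c in range(k):
            members = docs[[i for i, l in enumerate(labels) if l == c]]
            if len(members):
                means[c] = members.mean(axis=0)
    return labels, means

docs = np.array([[1., 0.9, 0.], [0.9, 1., 0.1], [0., 0.1, 1.], [0.1, 0., 0.9]])
labels, _ = kmeans(docs, k=2)
print(labels)                                     # [0, 0, 1, 1]
```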


In general, it is believed that hierarchical clustering produces better cluster quality than k-means clustering while suffering from quadratic time complexity.

On the other hand, k-means clustering and its variants have complexities linear in the number of documents, but produce clusters of lower quality.

The entropy and F-measure schemes are two very popular schemes used in evaluating the quality of the clusters produced by a clustering technique.


Document Categorization

Given a set of predefined categories, text categorization is the process of labeling unlabeled text documents with their corresponding category/categories based entirely on their content. It is sometimes referred to as topic spotting or text classification.

The process considers endogenous knowledge (knowledge extracted from the document) and not exogenous knowledge (knowledge extracted from external resources).

Categories are chosen to correspond to the topics or themes of the documents.


The ultimate purpose of text categorization is automatic organization.

Some categorization systems (or classifiers) return one category for every document, while others return multiple categories.

A classifier might also return no category or some categories with very low confidence. In these cases, the document is usually assigned to a category labeled unknown for manual categorization later.


The early days of text categorization were mainly driven by knowledge engineering: given a set of predefined categories, a set of rules for each category is manually defined by experts.

Those rules specify the conditions that a document must satisfy in order to belong to the corresponding category.

In the 1990s, machine learning started to gain popularity and take over the categorization process.

Machine-learning categorization proved to be as accurate as expert-driven categorization, while being faster and requiring no experts.


Text categorization applications can be classified in different ways.

One classification scheme is related to the output produced by the application:

Single-label categorization assigns each document to one and only one category (non-overlapping categories).

Multi-label categorization returns a set of k categories, where k is predefined, to which a document can belong (overlapping categories).


Another classification scheme for text categorization applications:

Category-pivoted categorization (CPC) finds all documents that can be filed under a specific category. It is used when new categories can be added dynamically because a set of documents does not seem to be correctly categorized under any of the given categories.

Document-pivoted categorization (DPC) finds all the categories under which a certain document can be filed. It is used in applications where the documents come as a stream, i.e., are not all available at the same time. An example is email filtering, where newly received emails are directly assigned to specific predefined categories based on user-defined criteria. DPC is much more popular.


Applications

Document organization: categorization used for organizing documents into categories for personal or corporate use. For example, newspapers receive a number of ads every day and would benefit from an automatic system that can assign those ads to their corresponding categories, such as Real Estate or Autos. An important consequence of document organization is ease of search.


Text filtering: heavily used in the IF field, where documents in streams are either forwarded to specific users or filtered out based on the user profiles.

The source of the documents is referred to as the producer and the user as the consumer.

For every consumer, the system blocks the delivery of any document in the stream in which the consumer is not interested.

The system uses the user profile to categorize the document as being interesting or not to the user.

Single-label binary text categorization is used for this purpose with the two possible categories being relevant and irrelevant.


The system might also be advanced in the sense that it is able to categorize all forwarded documents into a set of predefined categories in order to achieve document organization.

The filter may either sit on the producer’s side where it routes documents to all non-blocking consumers, or it can sit on the consumer’s side where it blocks unwanted documents.

An example might be an email filter that blocks unwanted messages and organizes the received ones into user-defined categories.


Word sense disambiguation (WSD): the process through which a system is able to find the sense of an ambiguous word (i.e., one having more than one meaning) based on its occurrence in the text. For example, the word “class” can mean a place where students study or a category. Given a text containing an ambiguous word, a WSD application returns which of the meanings associated with the word is intended by the given text.

WSD is usually used in natural language processing (NLP) and in indexing documents by using word senses.


Hierarchical categorization of web pages: categorizes web pages or web sites into a predefined, hierarchical set of categories hosted by portals. Instead of using search engines to find documents across the Internet, one can follow an easier path through the hierarchy of document categories to search for his/her target in the specified category only.

CPC is used to enable dynamic adding and/or deleting of new and unwanted subjects, respectively.


Dimensionality reduction (DR) in text categorization can be achieved by using the same text-mining dimensionality reduction techniques presented earlier. However, in categorization applications, dimensionality reduction can be applied in two ways:

Local dimensionality reduction: for every category, a set of terms T’ is chosen (T’ << T, where T is the set of terms for the category under consideration).

Global dimensionality reduction: a set of terms T’ (T’ << T, where T is the set of terms for all categories) is chosen for categorization under all categories.

Global DR is more popular.


Types of Classifiers

A text classifier can be either eager or lazy.

Eager categorization: a set of pre-categorized documents is used to build the classifier. This data set is divided into a training set, TR, and a testing set, TS. The two partitions need not be equal in size. TR is used to build or train the classifier, and then TS is used to test the accuracy of the classifier. This process is repeated until an acceptable classifier is attained.


Lazy categorization (example-based), on the other hand, does not build a classifier ahead of time. Whenever a new unlabeled sample n is received, the classifier finds the k most similar labeled samples, where k is predefined, to decide on the label of n.

Many algorithms that fit the above two categories have been devised. A general review of most of those algorithms follows next.


Eager Algorithms for Text Categorization

Probabilistic classifiers represent the probability that a document dx belongs to category cy by P(cy|dx) and compute it by P(cy|dx) = P(cy)P(dx|cy)/P(dx), where:
P(cy) is the probability that a randomly selected document belongs to category cy,
P(dx) is the probability that a randomly selected document has dx as a vector representation, and
P(dx|cy) is the probability that category cy contains document dx.

The calculation of P(dx|cy) is simplified by using the independence assumption, or Naïve Bayes assumption, which states that any two coordinates of a vector are statistically independent.


The category with the highest probability of containing the unlabeled document d being categorized is chosen as the label for d.
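A minimal Naive Bayes sketch over term sets (the add-one smoothing and the toy training data are assumptions added for the example; absent terms are ignored for brevity):

```python
import math
from collections import defaultdict

def train_nb(samples):
    """samples: list of (set_of_terms, category). Returns counts for the model."""
    cat_count = defaultdict(int)
    term_count = defaultdict(lambda: defaultdict(int))
    for terms, cat in samples:
        cat_count[cat] += 1
        for t in terms:
            term_count[cat][t] += 1
    return cat_count, term_count, len(samples)

def classify(terms, cat_count, term_count, n):
    best, best_lp = None, -math.inf
    for cat, cc in cat_count.items():
        lp = math.log(cc / n)                        # log P(cy)
        for t in terms:                              # independence assumption:
            p = (term_count[cat][t] + 1) / (cc + 2)  # P(t|cy), add-one smoothed
            lp += math.log(p)
        if lp > best_lp:
            best, best_lp = cat, lp
    return best                                      # P(dx) is a common factor

train = [({"goal", "match"}, "sports"), ({"vote", "law"}, "politics"),
         ({"goal", "team"}, "sports")]
model = train_nb(train)
print(classify({"goal"}, *model))                    # sports
```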

Decision tree classifiers (DTC) overcome the difficulty of interpreting probabilistic classifiers, which are quantitative, by using symbolic representations: internal nodes represent terms, edges represent tests on term weights, and leaf nodes represent categories. A DTC categorizes a document by traversing a path to the appropriate leaf node, i.e., category.


Decision rule classifiers create a disjunctive normal form (DNF) rule for every category ci. The premise of a rule is composed of literals signifying either the presence or the absence of a keyword in a document dj, while the conclusion denotes the decision of categorizing dj under ci.


Regression classifiers: an example is the linear least squares fit (LLSF). Each document dj is associated with an input vector I(dj) of T weighted terms and an output vector O(dj) of C weighted categories. Categorization is achieved by producing the output vector of a sample given its input vector. A C x T matrix M is constructed to accomplish this task such that M I(dj) = O(dj).
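A numpy sketch of LLSF (the toy vectors are invented; in practice M is fit by minimizing the squared error over the training set, which is what lstsq does):

```python
import numpy as np

# Training data: rows of I are T-dim term vectors, rows of O are C-dim
# category vectors (here T = 3 terms, C = 2 categories, 4 documents).
I = np.array([[1., 1., 0.], [1., 0., 0.], [0., 1., 1.], [0., 0., 1.]])
O = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])

# Solve min ||I @ X - O||^2; M is the transpose of the solution (C x T).
X, *_ = np.linalg.lstsq(I, O, rcond=None)
M = X.T

new_doc = np.array([1., 1., 0.])        # unseen document's input vector
print(M @ new_doc)                      # scores per category; argmax = label
```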


Rocchio classifiers: each category is represented by a profile, a prototypical document. Profiles are created using the Rocchio method, which gives appropriate weights to the terms in the term space such that the resulting profile vector is most useful for discriminating the corresponding category. This profile is then used for categorizing unlabeled documents.
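A sketch of the classic Rocchio profile construction (the beta and gamma weights are conventional defaults from the relevance-feedback literature, assumed here rather than taken from the slides):

```python
import numpy as np

def rocchio_profile(pos_docs, neg_docs, beta=16.0, gamma=4.0):
    """Profile = beta * centroid(positive docs) - gamma * centroid(negative docs)."""
    return beta * np.mean(pos_docs, axis=0) - gamma * np.mean(neg_docs, axis=0)

pos = np.array([[1., 1., 0.], [1., 0.8, 0.]])   # documents in the category
neg = np.array([[0., 0.2, 1.], [0.1, 0., 1.]])  # documents outside it
profile = rocchio_profile(pos, neg)
score = profile @ np.array([1., 1., 0.])        # dot-product score for a new doc
print(profile, score)
```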


Neural network (NN) classifiers have a set of nodes divided into layers: the first layer is the input layer, followed by zero or more middle layers, followed by an output layer.

Each node receives an input weight and produces an output.

The input nodes are used to represent all the terms, and the output nodes are used to represent the set of available categories.


To categorize a document dj, the term weights representing dj are fed into the input nodes.

This process will propagate through the layers of the network until finally the result is received by the nodes in the output layer.


Support vector machine (SVM) classifiers were first applied in text categorization in the late 1990s. SVMs divide the term space with hyperplanes, or surfaces, separating the positive and negative training samples; the surfaces are sometimes referred to as decision surfaces.

The surfaces represent the set of predefined categories, and the surface that provides the widest separation, i.e., the widest possible margin between positive and negative samples, is selected as the corresponding label.
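A brief sketch assuming scikit-learn is available (LinearSVC fits a maximum-margin separating hyperplane; the toy samples are invented):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy term vectors: positive samples for a category vs. negative ones.
X = np.array([[1., 1., 0.], [0.9, 1., 0.1],     # positive samples
              [0., 0.1, 1.], [0.1, 0., 0.9]])   # negative samples
y = np.array([1, 1, 0, 0])

clf = LinearSVC()                     # learns the widest-margin hyperplane
clf.fit(X, y)
print(clf.predict([[1., 0.8, 0.]]))   # [1] -- resembles the positive samples
```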


Lazy Algorithms for Text Categorization

Example-based classifiers are lazy classifiers. No classifier is built ahead of time; instead, the set of all labeled samples is used to categorize new unlabeled samples.

One such classifier is the k-nearest neighbors (kNN) classifier: given k and a new sample dj, it finds the k documents most similar to dj among the given samples, according to some similarity function, such as Euclidean distance. Then a process, such as plurality voting, is performed by the selected neighbors to give dj the most appropriate label.
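A compact kNN sketch with Euclidean distance and plurality voting (the labeled samples are invented):

```python
from collections import Counter

def knn_label(samples, new_vec, k):
    """samples: list of (vector, label). Returns the plurality label of the
    k samples nearest to new_vec under Euclidean distance."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    neighbors = sorted(samples, key=lambda s: dist(s[0], new_vec))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

samples = [((1.0, 1.0), "sports"), ((0.9, 1.1), "sports"),
           ((0.0, 0.1), "politics"), ((0.1, 0.0), "politics")]
print(knn_label(samples, (0.8, 0.9), k=3))   # sports
```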


Classifier committees

A classifier committee is a classifier that embeds a number of other classifiers: it uses all of them to do the categorization, compares their outcomes, and selects the most appropriate result, for instance by plurality voting.


A Final Note on Performance

It is worth noting that almost all current text classifiers suffer from performance limitations related to accuracy and speed. Perhaps this fact is the main reason for the huge amount of research currently being invested in this promising area.