Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence
Mike Thelwall
Professor of Information Science
University of Wolverhampton
Contents
Introduction to Scientific Web Intelligence
Introduction to the Vector Space Model
Vocabulary Spectral Analysis
Low frequency words
Part 1
Scientific Web Intelligence
Scientific Web Intelligence
Applying web mining and web intelligence techniques to collections of academic/scientific web sites
Uses links and text
Objective: to identify patterns and visualize relationships between web sites and subsites
Objective: to report to users causal information about relationships and patterns
Academic Web Mining
Step 1: Cluster domains by subject content, using text and links
Step 2: Identify patterns and create visualizations for relationships
Step 3: Incorporate user feedback and reason reporting into visualization
This presentation deals with Step 1, deriving subject-based clusters of academic webs from text analysis
Part 2
Introduction to the Vector Space Model
Overview
The Vector Space Model (VSM) is a way of representing documents through the words that they contain
It is a standard technique in Information Retrieval
The VSM allows decisions to be made about which documents are similar to each other and to keyword queries
How it works: Overview
Each document is broken down into a word frequency table
The tables are called vectors and can be stored as arrays
A vocabulary is built from all the words in all documents in the system
Each document is represented as a vector defined against the vocabulary
Example
Document A – "A dog and a cat."
  a   dog  and  cat
  2    1    1    1
Document B – "A frog."
  a   frog
  1    1
Example, continued
The vocabulary contains all words used: a, dog, and, cat, frog
The vocabulary needs to be sorted: a, and, cat, dog, frog
Example, continued
Document A: "A dog and a cat."
– Vector: (2, 1, 1, 1, 0)
  a   and  cat  dog  frog
  2    1    1    1    0
Document B: "A frog."
– Vector: (1, 0, 0, 0, 1)
  a   and  cat  dog  frog
  1    0    0    0    1
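The worked example can be reproduced in a few lines of Python (a minimal sketch, not from the slides; the naive tokenisation and the function name are my own):

```python
from collections import Counter

def build_vectors(documents):
    """Build a sorted vocabulary and a raw word-frequency vector per document."""
    # tokenise on whitespace, lower-case, and strip trailing punctuation
    token_lists = [[w.strip(".,").lower() for w in doc.split()]
                   for doc in documents]
    vocabulary = sorted(set(w for tokens in token_lists for w in tokens))
    # each document becomes a vector of counts against the shared vocabulary
    vectors = [[Counter(tokens)[w] for w in vocabulary] for tokens in token_lists]
    return vocabulary, vectors

vocabulary, (vec_a, vec_b) = build_vectors(["A dog and a cat.", "A frog."])
# vocabulary: ['a', 'and', 'cat', 'dog', 'frog']
# vec_a: [2, 1, 1, 1, 0]   vec_b: [1, 0, 0, 0, 1]
```

Every vector has one slot per vocabulary word, so documents that never use a word simply hold a zero there.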
Measuring inter-document similarity
For two vectors d and d′ the cosine similarity between d and d′ is given by:

  cos(d, d′) = (d · d′) / (|d| |d′|)

Here d · d′ is the dot product of d and d′, calculated by multiplying corresponding frequencies together and summing; |d| and |d′| are the lengths of the vectors
The cosine measure calculates the angle between the vectors in a high-dimensional virtual space
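The cosine measure can be written directly from the formula above (a minimal illustrative snippet; the function name is my own):

```python
import math

def cosine_similarity(d1, d2):
    """Cosine of the angle between two frequency vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))          # d . d'
    norm1 = math.sqrt(sum(a * a for a in d1))         # |d|
    norm2 = math.sqrt(sum(b * b for b in d2))         # |d'|
    return dot / (norm1 * norm2)

# Documents A and B from the example share only the word "a"
sim = cosine_similarity([2, 1, 1, 1, 0], [1, 0, 0, 0, 1])  # 2/sqrt(14), approx 0.5345
```

Identical vectors give a similarity of 1; vectors with no words in common give 0.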
Stopword lists
Commonly occurring words are unlikely to give useful information and may be removed from the vocabulary to speed processing
– E.g. "in", "a", "the"
Normalised term frequency (tf)
A normalised measure of the importance of a word to a document is its frequency, divided by the maximum frequency of any term in the document
This is known as the tf factor.
Document A: raw frequency vector (2, 1, 1, 1, 0); tf vector (1, 0.5, 0.5, 0.5, 0)
Inverse document frequency (idf)
A calculation designed to make rare words more important than common words
The idf of word i is given by:

  idf_i = log(N / n_i)

where N is the total number of documents and n_i is the number that contain word i
tf-idf
The tf-idf weighting scheme is to multiply the tf and idf factors for each word
Words are important for a document if they are frequent relative to other words in the document and rare in other documents
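Putting the tf and idf definitions together (a sketch following the slides' definitions; the helper names are my own):

```python
import math

def tf_vector(raw):
    """Normalise raw frequencies by the maximum frequency in the document."""
    peak = max(raw)
    return [f / peak for f in raw]

def idf_values(raw_vectors):
    """idf_i = log(N / n_i), where n_i counts documents containing word i."""
    n_docs = len(raw_vectors)
    return [math.log(n_docs / sum(1 for v in raw_vectors if v[i] > 0))
            for i in range(len(raw_vectors[0]))]

def tf_idf(raw_vectors):
    """Multiply each document's tf factors by the collection-wide idf factors."""
    idf = idf_values(raw_vectors)
    return [[tf * w for tf, w in zip(tf_vector(raw), idf)] for raw in raw_vectors]

weights = tf_idf([[2, 1, 1, 1, 0], [1, 0, 0, 0, 1]])
# "a" appears in both documents, so its idf (and hence tf-idf weight) is log(2/2) = 0
```

Note how the stopword-like "a" gets weight zero automatically: frequent everywhere means rare nowhere.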
Part 3
Vocabulary Spectral Analysis
Subject-clustering academic webs through text similarity (1)
1. Create a collection of virtual documents consisting of all web pages sharing a common domain name in a university.
– Doc. 1 = cs.auckland.ac.uk (14,521 pages)
– Doc. 2 = www.auckland.ac.nz (3,463 pages)
– …
– Doc. 760 = www.vuw.ac.nz (4,125 pages)
Subject-clustering academic webs through text similarity (2)
2. Convert each virtual document into a tf-idf word vector
3. Identify clusters using k-means and VSM cosine measures
4. Rank words for importance in each 'natural' cluster (Cluster Membership Indicator)
5. Manually filter out high-ranking words in undesired clusters
This destroys the natural clustering of the data to uncover weaker subject clustering
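Steps 2–3 can be sketched as a cosine-based ("spherical") k-means over the tf-idf vectors. This is an illustrative implementation, not the code behind the slides; the initial centroids are supplied by hand rather than chosen randomly, and all names are my own:

```python
import math

def normalize(v):
    """Scale a vector to unit length (leave all-zero vectors unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else v

def spherical_kmeans(docs, centroids, iters=10):
    """k-means using cosine similarity: work with unit-length vectors and
    re-normalise each centroid after every mean update."""
    docs = [normalize(d) for d in docs]
    centroids = [normalize(c) for c in centroids]
    labels = [0] * len(docs)
    for _ in range(iters):
        # assign each document to its most similar centroid
        labels = [max(range(len(centroids)),
                      key=lambda k: sum(a * b for a, b in zip(d, centroids[k])))
                  for d in docs]
        # recompute each centroid as the normalised mean of its members
        for k in range(len(centroids)):
            members = [d for d, lab in zip(docs, labels) if lab == k]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[k] = normalize(mean)
    return labels

labels = spherical_kmeans([[1, 1, 0], [2, 1, 0], [0, 1, 2], [0, 1, 3]],
                          [[1, 0, 0], [0, 0, 1]])
# the first two vectors end up in one cluster, the last two in the other
```

Normalising to unit length makes the dot product equal the cosine similarity, so the assignment step matches the VSM cosine measure above.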
Cluster Membership Indicator
For a cluster C of documents and tf-idf weights w_ij, the cluster membership indicator of word i is:

  cmi(i, C) = (Σ_{j∈C} w_ij) / n_C − (Σ_{j∉C} w_ij) / n_C̄

where n_C is the number of documents in C and n_C̄ is the number outside C
The next slide shows the top CMI weights for an undesired non-subject cluster
Word         Frequency   Domains   CMI
massey           32991       364   0.30587
palmerston        9023       305   0.09137
and            1883534       674   0.0794
the            3605107       689   0.0746
of             2263812       683   0.06782
in             1317941       655   0.06556
north            21348       414   0.06431
students        127178       550   0.05753
research        186161       546   0.05687
a              1254004       659   0.05616
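A difference-of-averages cluster membership indicator can be sketched as follows (an illustration only; the slides' exact weighting may differ, and the data and names here are hypothetical):

```python
def cmi(word_index, cluster, weights):
    """Average tf-idf weight of a word across documents inside the cluster
    minus its average weight across documents outside it."""
    inside = [w[word_index] for j, w in enumerate(weights) if j in cluster]
    outside = [w[word_index] for j, w in enumerate(weights) if j not in cluster]
    return sum(inside) / len(inside) - sum(outside) / len(outside)

# Hypothetical tf-idf weights for one word across four documents;
# documents 0 and 1 form the cluster of interest
weights = [[0.9], [0.8], [0.1], [0.2]]
score = cmi(0, {0, 1}, weights)  # 0.85 - 0.15, approx 0.7
```

A high score means the word is characteristic of the cluster, which is how words like "massey" and "palmerston" surface at the top of the table above.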
Eliminating low frequency words
Can test whether removing low frequency words increases or decreases the subject clustering tendency
– E.g. are they spelling mistakes?
Need partially correct subject clusters
Compare the similarity of documents within a cluster to their similarity with documents outside the cluster
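The intra- versus inter-cluster comparison above can be sketched as a single score (an illustrative measure; the function names and toy data are my own):

```python
import math

def cosine(d1, d2):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    return dot / (math.sqrt(sum(a * a for a in d1)) *
                  math.sqrt(sum(b * b for b in d2)))

def clustering_tendency(vectors, labels):
    """Average intra-cluster cosine similarity minus average inter-cluster
    similarity over all document pairs; positive values suggest the current
    vocabulary separates the subject clusters."""
    intra, inter = [], []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            sim = cosine(vectors[i], vectors[j])
            (intra if labels[i] == labels[j] else inter).append(sim)
    return sum(intra) / len(intra) - sum(inter) / len(inter)

score = clustering_tendency([[1, 0], [1, 0.1], [0, 1], [0.1, 1]], [0, 0, 1, 1])
# well-separated clusters give a clearly positive score
```

Recomputing this score after dropping words below each frequency threshold gives curves like those in the chart on the next slide.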
Eliminating low frequency words
[Figure: line chart of intra-subject average correlation minus inter-subject average correlation (y-axis, −0.3 to 0.5) against the minimum number of domains containing a word (x-axis, 2 to 640), with one line per subject: Law, Psychology, Architecture, Sport, Maths, Planning, Social studies, Engineering, Languages, Physics, Chemistry, Business, Education, Medicine, Env. Sci., Food, Computing, Biology, General, Arts]
Summary
For text-based academic subject web site clustering:
– need to select vocabularies to break natural clustering and allow subject clustering
– consider ignoring low frequency words because they do not have high clustering power
– need to automate the manual element as far as possible
The results can then form the basis of a visualization that can give feedback to the user on inter-subject connections