Readability:

An application in Information Retrieval

Lorna Kane
Intelligent Information Retrieval Group, School of Computer Science and Informatics, University College Dublin

Traditional IR

The core aim in Information Retrieval is to match a user’s information need to the most relevant documents

Traditionally, IR systems have concentrated on topical relevance

For example, the query “HIV AIDS” should return documents that treat the topic of HIV and the AIDS virus

IR Systems are now doing a good job at finding topically relevant items

Traditional IR system

[Figure: traditional IR system. Labels: document collection; information need; query formulation; user relevance assessment; query processing; document representation (files / indexes); system functions; topical matching (e.g. term frequency); document indexing.]

So…why is it hard?

Semantics
• Bank of Ireland, bank note, bank (flight manoeuvre), bank (of a river)
• Semantics of an image? Semantics of video? Semantics of music?

Natural Language
• Paris is the capital of France
• Bordeaux is the wine-making capital of France

Opinion
• George Bush is honest. (Honest?)
• Geri Halliwell is talented. (Talented?)
• Big Brother is interesting. (Interesting?)
• Van Gogh’s self portrait represents happiness

Context
• The user’s context will influence how they formulate their query

Information Gap
• User may be unable to unambiguously articulate information need
• User may under-specify query

Document and Query Processing

1. Begin with Natural Language

Twinkle, twinkle, little bat. How I wonder what you’re at?
Up above the world you fly, like a tea-tray in the sky.

2. Term Normalisation
• Remove case
• Remove punctuation
• Alphabetise

a above at bat fly how like little sky tea the tray twinkle twinkle up what wonder world you youre
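The normalisation steps above can be sketched in a few lines of Python. This is only an illustration: the function name `normalise` and the tokenisation choices (dropping apostrophes, splitting on hyphens) are assumptions, and the number of terms produced depends on those choices, so it need not match the count shown on the slides.

```python
import re

TEXT = (
    "Twinkle, twinkle, little bat. How I wonder what you're at? "
    "Up above the world you fly, like a tea-tray in the sky."
)

def normalise(text):
    """Lower-case the text, strip punctuation, and return the terms alphabetised."""
    lowered = text.lower()
    # Drop apostrophes so that "you're" becomes "youre" (an assumed convention).
    lowered = lowered.replace("'", "").replace("\u2019", "")
    # Every remaining run of letters or digits is a term; hyphens split words.
    tokens = re.findall(r"[a-z0-9]+", lowered)
    return sorted(tokens)

terms = normalise(TEXT)
print(" ".join(terms))
print(len(terms), "terms")
```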

Document and Query Processing (contd.)

3. Stopword removal
• Remove words that carry little meaning, such as connectives, articles and prepositions
• Words in English follow a Zipf distribution, i.e. a few words appear very frequently, a medium number of words appear with medium frequency, and many words appear infrequently
• High-frequency words are of little use because they describe too many objects
• Very low frequency words may be too rare to be of value

Before (25 WORDS): a above at bat fly how like little sky tea the tray twinkle twinkle up what wonder world you youre

After (11 WORDS): twinkle twinkle little bat wonder world fly like tea tray sky
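Stopword removal is a simple filter. The sketch below applies a small hand-picked stopword list (an assumption made for illustration; real systems use much longer lists) to the normalised terms of the example rhyme.

```python
# Hypothetical stopword list, chosen only for this example.
STOPWORDS = {"a", "above", "at", "how", "i", "in", "the", "up", "what", "you", "youre"}

terms = (
    "a above at bat fly how like little sky tea the tray "
    "twinkle twinkle up what wonder world you youre"
).split()

def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]

content_terms = remove_stopwords(terms, STOPWORDS)
print(" ".join(content_terms))
print(len(content_terms), "content terms")
```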


4. Stemming
• Reduction of morphological variants of a word to a common stem
• Will generate some errors, but reduces the size of index files and provides a way to find variants of search terms
• Over-stemming: organisation → organ
• Automatic conflation, e.g. Porter Stemming Algorithm: Create, Creative, Creation, Creating → Creat
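To make the idea of conflation concrete, here is a deliberately crude suffix stripper. It is not the Porter algorithm (which applies several ordered rule phases with conditions on the stem); the suffix list and the `toy_stem` name are assumptions chosen so that the examples from the slide come out as shown, including the over-stemming case.

```python
# Toy suffix list, longest first; NOT the Porter rule set.
SUFFIXES = ("isation", "ing", "ive", "ion", "e")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, provided the remaining stem is long enough."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["create", "creative", "creation", "creating", "organisation"]:
    print(w, "->", toy_stem(w))
```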


5. Term Weighting
• TF (term frequency), within a single document
  gives high values for frequent terms
  e.g. our document is mostly about “twinkl” because that term occurs most frequently
  $tf_{d,i} = num_i$, where $num_i$ is the number of times term $i$ occurs in document $d$

• IDF (inverse document frequency), throughout the document collection
  gives high values for infrequent terms
  e.g. in a collection of medical articles the word “pathology” will occur in most documents, so it gets a low value and does not distinguish documents
  $idf_i = \log(N / df_i)$
  $N$ = number of documents in the collection
  $df_i$ = number of documents that contain term $i$


6. We can combine tf and idf to get a term weight for each document:

$weight_{d,i} = idf_i \cdot tf_{d,i}$
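The weighting scheme can be sketched as follows, using raw counts for tf and the natural logarithm for idf. The three tiny documents are invented for illustration, and the slides do not say which logarithm base or tf normalisation was used, so treat those as assumptions.

```python
import math
from collections import Counter

# Hypothetical stemmed documents, for illustration only.
docs = {
    "doc1": ["twinkl", "twinkl", "littl", "bat", "tea"],
    "doc2": ["tea", "tray", "sky"],
    "doc3": ["littl", "bat", "bat", "fly"],
}

def inverse_document_frequencies(collection):
    """idf_i = log(N / df_i), where df_i counts documents containing term i."""
    n_docs = len(collection)
    df = Counter()
    for tokens in collection.values():
        df.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / count) for term, count in df.items()}

def tf_idf(collection):
    """weight_{d,i} = idf_i * tf_{d,i}, with tf as the raw term count."""
    idf = inverse_document_frequencies(collection)
    return {
        name: {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for name, tokens in collection.items()
    }

weights = tf_idf(docs)
for name, vector in weights.items():
    print(name, {term: round(w, 3) for term, w in vector.items()})
```

In this toy collection "twinkl" gets the highest weight in doc1 because it is both frequent in that document and absent from the others.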

7. Document Matrix
• Each document / query is a vector of term weights

         Twinkl   Littl   Bat     Tea
Doc 1    0.068    0.02    0.02    0.01
Doc 2    0        0       0       0.001
Doc 3    0        0.022   0.11    0.002
Doc 4    0.042    0       0.04    0

Vector Space Model

Documents and queries are represented as vectors in n-dimensional space (n = number of unique words in the collection)

[Figure: Doc 1, Doc 2, Doc 3 and the Query drawn as vectors against axes term 1 to term 5.]

Finally, Retrieval
• The closer a query is to a document, the better the query matches that document
• Ranked list of topically relevant documents computed using distance metrics and returned to user

[Figure: the Query vector shown among Doc 1, Doc 2 and Doc 3 against axes term 1 to term 5.]
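As an illustration of ranking by closeness, the sketch below scores the four documents from the document matrix against a query vector. Cosine similarity is used here as one common choice; the slides only say "distance metrics", so the metric, the query vector and the function names are assumptions.

```python
import math

TERMS = ["twinkl", "littl", "bat", "tea"]
DOCS = {
    "Doc 1": [0.068, 0.02, 0.02, 0.01],
    "Doc 2": [0.0, 0.0, 0.0, 0.001],
    "Doc 3": [0.0, 0.022, 0.11, 0.002],
    "Doc 4": [0.042, 0.0, 0.04, 0.0],
}
QUERY = [0.0, 0.0, 1.0, 0.0]  # hypothetical one-term query: "bat"

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return document names ordered from most to least similar to the query."""
    return sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)

ranking = rank(QUERY, DOCS)
print(ranking)
```

Doc 3 comes first because almost all of its weight lies on the "bat" dimension, and Doc 2 comes last because it has no weight there at all.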

The Relevance Melting Pot

However! Relevance has been shown to be a multi-faceted concept.

The relevance of a document to a given query is influenced by the user’s context.

The user judges relevance by a number of criteria aside from topic

Readability as relevance criteria

Relevance criteria listed in various user studies…

Cool et al. (1993)
• Topic: deep / superficial
• Content: explanation, level of detail
• Presentation: understandability, simplicity / complexity, technicality

Barry (1998)
• User’s judgement that he/she will be able to understand or follow the information presented
• The extent to which information is presented in a clear or readable manner
• The extent to which information presented is novel to the user

Schamber (1998)
• Information is specific to user’s need; has sufficient detail or depth
• Information is presented clearly, little effort to read or understand

Relevance in Context

So… we can conclude that users want documents that they can understand and that have the right amount of detail, as well as being topically relevant.

For example, someone who knows very little about AIDS may not be able to understand the following excerpt, i.e. it is irrelevant in their context:

“The development of OIs during HIV disease not only indicates the degree of immunosuppression, but may also influence disease progression itself. When stratified by CD4 counts, patients with prior histories of OIs have higher mortality rates than those without prior histories of OIs”

Zones of Learnability

This follows Walter Kintsch’s 1994 “zones of learnability” hypothesis,

“If a student’s knowledge overlaps too much with an instructional text, there is simply not enough for the student to learn from the text. If there is no overlap, or almost no overlap, there can be no learning either: the necessary hooks in the students’ knowledge, onto which new information is hung, are missing. ”

As such, an IR system should try to match a user with a given level of domain knowledge to the documents that they can learn the most from: documents that have the optimum balance of redundant and new information

How can we achieve such a match?

The ideas thus presented relate closely to the concept of Readability

Readability is a characteristic of text documents:

“the sum total of all those elements within a given piece of printed material that affect the success a group of readers have with it. The success is the extent to which they understand it, read it at an optimal speed, and find it interesting.” (Dale & Chall, 1949)

“ease of understanding or comprehension due to the style of writing” (Klare, 1963)

How can we measure Readability?

A number of traditional readability formulas use simplistic measures:
• Sentence length
• Word frequency lists
• Number of syllables

Common formulas: Flesch-Kincaid, Dale-Chall, Gunning-Fog

In order to…
• Categorise educational texts for grade levels
• Help authors write for a target audience

These formulae have been criticised because they only measure surface level statistics

E.g. the word “quark” has only one syllable but is a difficult concept to comprehend
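To show how little such formulas look at, here is a sketch of the Flesch Reading Ease score, 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words). The formula constants are the standard ones; the syllable counter is a rough vowel-group heuristic and the sentence splitter is naive, both assumptions made to keep the example self-contained.

```python
import re

def count_syllables(word):
    """Rough heuristic: one syllable per vowel group, minus a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent, except in '-le' endings such as 'little'.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def flesch_reading_ease(text):
    """Higher scores mean easier text; only surface statistics are used."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )

easy = "The cat sat on the mat. The dog ran."
hard = "Immunosuppression influences opportunistic infection progression considerably."
print(round(flesch_reading_ease(easy), 1), round(flesch_reading_ease(hard), 1))
print(count_syllables("quark"), count_syllables("cat"))
```

The last line illustrates the criticism: by this measure "quark" is exactly as easy as "cat".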

How can we measure Readability?

Readability encompasses a number of document characteristics…

• Legibility of the text
  how physically readable the text is, i.e. font size, paper color, bullet points, graphical representations (treatment of legibility out of scope of thesis – for now anyway!)

• Syntactic complexity of the text
  grammatical arrangement of words within a sentence (e.g. active / passive sentences have been shown to affect readability)
  simple / compound / complex sentences

• Organization of text
  rhetorical structure
    function of statements in text: evidence, antithesis
    for example, the word “but” or the phrase “on the other hand” can signal an antithesis to a previous statement
  textual cohesion
    logical linkage between textual units, as indicated by overt formal markers of the relations between texts
    “Trees are green and have leaves. When many grow in the same place they make up a forest.”
    The words “many” and “they” refer to “trees” in the first sentence, thus making the two sentences cohesive.

• Semantic complexity of the text
  the difficulty of the concepts/ideas represented in the text
  abstractness / concreteness of concepts represented in the text

Does readability exist in a vacuum?

Document Characteristics: Legibility, Syntactic Complexity, Semantic Complexity, Organisation, Rhetorical Structure, Coherence

User Characteristics: Domain Knowledge, Reading Ability, Learning Style, Motivation, Task

INTERACTION!

Using Readability to improve relevance

[Figure: processing pipeline. Boxes, in the order they appear: QUERY; IR SYSTEM; TOPICALLY RELEVANT SET; FEATURE EXTRACTION (Syntactic Complexity, Semantic Complexity, Organisation, Rhetorical Structure, Coherence); READABILITY CLASSIFIER; RERANK; INFERENCES ABOUT USER’S READABILITY PREFERENCE; CONTEXTUALLY & TOPICALLY RELEVANT SET.]

Feature Extraction: Syntactic Complexity

Syntactic complexity is operationalised by measuring part-of-speech (POS) statistics and natural language parse trees

This tells us the function of words in a sentence and the complexity of the sentence

[Figure: parse tree for “Susan hit Michael”]

SENTENCE
  NOUN PHRASE
    PROPER NOUN: SUSAN
  VERB PHRASE
    VERB: HIT
    NOUN PHRASE
      PROPER NOUN: MICHAEL
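Given a parse tree like the one above, two simple syntactic features are the tree depth and the proportion of each POS tag. The sketch below hand-codes the tree as nested tuples; in practice a tagger and parser would produce it, and the representation and function names here are assumptions.

```python
from collections import Counter

# Hand-built parse of "Susan hit Michael": (label, child, child, ...).
TREE = (
    "SENTENCE",
    ("NOUN PHRASE", ("PROPER NOUN", "Susan")),
    ("VERB PHRASE",
        ("VERB", "hit"),
        ("NOUN PHRASE", ("PROPER NOUN", "Michael"))),
)

def tree_depth(node):
    """Number of labelled levels from this node down to the deepest word."""
    if isinstance(node, str):  # a word (leaf) adds no level
        return 0
    return 1 + max(tree_depth(child) for child in node[1:])

def pos_tags(node):
    """Collect the POS tag sitting directly above each word, left to right."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [label]
    tags = []
    for child in children:
        tags.extend(pos_tags(child))
    return tags

def pos_proportions(node):
    """Share of each POS tag among all words in the sentence."""
    tags = pos_tags(node)
    return {tag: n / len(tags) for tag, n in Counter(tags).items()}

print(tree_depth(TREE))
print(pos_proportions(TREE))
```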

Feature Extraction: Semantic Complexity

Operationalised using various external information sources, e.g. Roget’s Thesaurus

The higher up in the thesaurus structure a term appears, the more abstract the word is

The lower in the thesaurus structure a term appears, the more specific the concept is

The WordNet lexical resource gives a “familiarity” score to nouns and verbs, similar to word frequency lists
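The depth idea can be illustrated with a toy hierarchy. The `PARENT` table below is invented for this example and is not taken from Roget’s Thesaurus or WordNet; the sketch only shows how depth in a hierarchy could serve as a specificity feature.

```python
# Hypothetical hierarchy: each term maps to its parent; the root maps to None.
PARENT = {
    "entity": None,
    "idea": "entity",
    "object": "entity",
    "vehicle": "object",
    "aircraft": "vehicle",
    "plane": "aircraft",
}

def depth(term, parent=PARENT):
    """Number of steps from the term up to the root (root has depth 0)."""
    steps = 0
    while parent[term] is not None:
        term = parent[term]
        steps += 1
    return steps

def mean_depth(terms, parent=PARENT):
    """Average depth of the terms found in the hierarchy, or None if none are."""
    known = [depth(t, parent) for t in terms if t in parent]
    return sum(known) / len(known) if known else None

print(depth("idea"), depth("plane"))
print(mean_depth(["idea", "plane", "unknownword"]))
```

A shallow term such as "idea" counts as more abstract than a deep term such as "plane", so a document whose terms sit higher on average would be scored as more abstract.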

Feature Extraction: Rhetorical Structure

This is where NLP gets very difficult…
• Rhetorical structure analysis is still mostly done manually
• Presence of cue words in text signals a relation
• At most 50% of relations are signalled
• Shallow rhetorical structure analysis will be performed – deep analysis not easily automated (yet)

Examples of relations: Evidence, Background, Antithesis
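A shallow analysis of this kind can be approximated by looking for cue phrases. In the sketch below, the antithesis cues "but" and "on the other hand" come from the slides; the evidence cue "because" and the function name are hypothetical additions for illustration. Relations that are not signalled by a cue are, by construction, missed.

```python
import re

CUE_PHRASES = {
    "antithesis": ["but", "on the other hand"],  # cues mentioned in the slides
    "evidence": ["because"],                     # hypothetical cue, for illustration
}

def signalled_relations(text):
    """Return {relation: [cue phrases found]} using whole-word matching."""
    lowered = text.lower()
    found = {}
    for relation, cues in CUE_PHRASES.items():
        hits = [
            cue for cue in cues
            if re.search(r"\b" + re.escape(cue) + r"\b", lowered)
        ]
        if hits:
            found[relation] = hits
    return found

print(signalled_relations("The test was cheap, but it was slow."))
print(signalled_relations("Butter is yellow."))
```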

Feature Extraction: Lexical Cohesion

How well a text fits together, a measure of the coherence of the text

Operationalised by computing the number and density of lexical chains – repetitions, synonyms, anaphora etc.

A lexical chain is a sequence of related words in the text, spanning short (adjacent words or sentences) or long distances (entire text).

For example:

“The plane could not continue to fly. There was a problem with its wing. The pilot made an emergency landing.”

Lexical chain: plane, fly, its, wing, pilot
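A minimal sketch of chain extraction for this example follows. The `CONCEPT_OF` lookup is hand-written and stands in for the thesaurus and anaphora resolution a real system would need (for instance, it simply asserts that "its" refers to the plane); the density definition, chain members divided by total tokens, is also an assumption.

```python
import re

TEXT = (
    "The plane could not continue to fly. There was a problem with its wing. "
    "The pilot made an emergency landing."
)

# Hypothetical lookup from word to concept, built by hand for this text.
CONCEPT_OF = {
    "plane": "aviation", "fly": "aviation", "its": "aviation",
    "wing": "aviation", "pilot": "aviation",
}

def lexical_chains(text, concept_of):
    """Group words that share a concept, in text order; keep chains of 2+ words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    chains = {}
    for token in tokens:
        concept = concept_of.get(token)
        if concept is not None:
            chains.setdefault(concept, []).append(token)
    chains = {c: words for c, words in chains.items() if len(words) > 1}
    return chains, len(tokens)

chains, n_tokens = lexical_chains(TEXT, CONCEPT_OF)
density = sum(len(words) for words in chains.values()) / n_tokens
print(chains)
print(len(chains), "chain(s), density", density)
```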

Readability Classifier

• Novel machine learning approach
• Classify the topically relevant set of documents returned using traditional IR model
• Re-rank the topically relevant set, boosting documents with the appropriate level of readability
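One possible way to boost documents is to add a fixed bonus to the topical score of any document whose predicted readability matches the user’s preferred level. The slides do not specify the boosting scheme, so the additive bonus, the field names and the `rerank` function are all assumptions.

```python
def rerank(results, preferred_level, boost=0.2):
    """Sort by topical score plus a bonus for matching the preferred readability."""
    def adjusted(result):
        bonus = boost if result["readability"] == preferred_level else 0.0
        return result["topical_score"] + bonus
    return sorted(results, key=adjusted, reverse=True)

# Hypothetical topically relevant set with classifier output attached.
results = [
    {"id": "d1", "topical_score": 0.90, "readability": "difficult"},
    {"id": "d2", "topical_score": 0.85, "readability": "easy"},
    {"id": "d3", "topical_score": 0.60, "readability": "easy"},
]

reranked = rerank(results, preferred_level="easy")
print([r["id"] for r in reranked])
```

A small bonus lets a readable document overtake a slightly more topical one (d2 over d1) without promoting a weakly topical document (d3), which is one way to avoid compromising the topically relevant set.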

C5.0: Decision Tree Learner

A set of documents, pre-classified by readability, is given to C5.0

C5.0 is given the feature set for these documents

E.g. Doc001 contains 5% prepositions, 20% adjectives…, contains 3 lexical chains, contains 15 terms that represent complex ideas and 14 statements of evidence.

C5.0: Decision Tree Learner

The classifier examines the values given for each feature and returns a set of rules that tell us how to classify the document

E.g.:
IF proportion of adjectives > 15% AND number of lexical chains >= 4
THEN document is easily readable
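Written as code, the example rule is a simple predicate over a document’s feature values. The sketch applies it to the Doc001 features quoted earlier; the dictionary keys are made-up names, and because the slide gives no ELSE branch the function returns `None` when the rule does not fire. This is not C5.0 itself, only an illustration of the kind of rule it outputs.

```python
def example_rule(features):
    """Apply the single example rule; return None if it does not fire."""
    if features["adjective_proportion"] > 0.15 and features["lexical_chains"] >= 4:
        return "easily readable"
    return None  # the slide specifies no ELSE branch

# Feature values quoted for Doc001; key names are hypothetical.
doc001 = {
    "preposition_proportion": 0.05,
    "adjective_proportion": 0.20,
    "lexical_chains": 3,
    "complex_terms": 15,
    "evidence_statements": 14,
}

print(example_rule(doc001))
print(example_rule({**doc001, "lexical_chains": 4}))
```

Doc001 has enough adjectives but only 3 lexical chains, so this particular rule does not classify it.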

Work in Progress

No significant existing corpus annotated with readability data

Large scale modern study to find best feature set to classify for readability that is not domain specific

How best to infer a user’s level of domain knowledge: implicit / explicit?

How best to incorporate readability into an IR environment without compromising topically relevant set

Initial Experiments

Machine learning on 2394 “easy” and “difficult” documents using POS statistics (to obtain a measure of syntactic complexity) and the traditional Flesch Formula.

Fold    POS      Flesch   Combined
0       12.0     9.8      6.9
1       12.4     9.7      6.4
2       14.0     9.6      6.5
3       12.5     9.5      6.8
4       12.5     9.6      6.9
5       10.8     9.5      7.0
6       11.4     9.8      6.2
7       11.8     9.8      6.2
8       11.8     10.0     6.9
9       12.5     9.7      7.2
Mean    12.2%    9.7%     6.7%
SE      0.3%     0.05%    0.1%
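The summary rows can be recomputed from the per-fold values. The sketch assumes that "SE" means the standard error of the mean over the ten folds, computed as the sample standard deviation divided by the square root of the number of folds; the slides do not state how it was obtained. With the fold values as listed, the Flesch column averages about 9.7.

```python
import math
import statistics

# Per-fold values as listed in the table (in percent).
FOLDS = {
    "POS": [12.0, 12.4, 14.0, 12.5, 12.5, 10.8, 11.4, 11.8, 11.8, 12.5],
    "Flesch": [9.8, 9.7, 9.6, 9.5, 9.6, 9.5, 9.8, 9.8, 10.0, 9.7],
    "Combined": [6.9, 6.4, 6.5, 6.8, 6.9, 7.0, 6.2, 6.2, 6.9, 7.2],
}

def mean_and_se(values):
    """Mean and standard error (sample stdev / sqrt(n)) of a list of values."""
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean, se

summary = {name: mean_and_se(values) for name, values in FOLDS.items()}
for name, (mean, se) in summary.items():
    print(f"{name}: mean={mean:.1f}% SE={se:.2f}%")
```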

Thanks

Questions?
