Readability: An application in Information Retrieval
Lorna Kane
Intelligent Information Retrieval Group, School of Computer Science and Informatics, University College Dublin
Traditional IR
The core aim in Information Retrieval is to match a user’s information need to the most relevant documents
Traditionally, IR systems have concentrated on topical relevance
For example, the query “HIV AIDS” should return documents that treat the topic of HIV and AIDS
IR Systems are now doing a good job at finding topically relevant items
Traditional IR system
[Diagram of a traditional IR system. Components: document collection, document indexing, document representation (files / indexes), information need, query formulation, query processing, system functions, topical matching (e.g. term frequency), user relevance assessment]
So… why is it hard?
Semantics
  Bank of Ireland, bank note, bank (flight manoeuvre), bank (of a river)
  Semantics of an image? Semantics of video? Semantics of music?
Natural Language
  Paris is the capital of France
  Bordeaux is the wine making capital of France.
Opinion
  George Bush is honest. (Honest?)
  Geri Halliwell is talented. (Talented?)
  Big Brother is interesting. (Interesting?)
  Van Gogh’s self portrait represents happiness
Context
  The user’s context will influence how they formulate their query
Information Gap
  User may be unable to unambiguously articulate information need
  User may under-specify query
Document and Query Processing
1. Begin with Natural Language
  Twinkle, twinkle, little bat. How I wonder what you’re at?
  Up above the world you fly, like a tea-tray in the sky.
2. Term Normalisation
  Remove case
  Remove punctuation
  Alphabetise
  a above at bat fly how like little sky tea the tray twinkle twinkle up what wonder world you youre
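The normalisation step can be sketched in a few lines of Python. This is a minimal illustration, not the system's actual code: the function name `normalise` is an assumption, and unlike the word list above it keeps every token, including repeats.

```python
import string

def normalise(text):
    """Lowercase, strip punctuation and return the tokens in alphabetical order."""
    text = text.lower().replace("-", " ")      # split "tea-tray" into two tokens
    drop = string.punctuation + "’‘“”"         # ASCII punctuation plus curly quotes
    text = text.translate(str.maketrans("", "", drop))
    return sorted(text.split())

rhyme = ("Twinkle, twinkle, little bat. How I wonder what you’re at? "
         "Up above the world you fly, like a tea-tray in the sky.")
tokens = normalise(rhyme)
print(" ".join(tokens))
```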
Document and Query Processing (contd.)
3. Stopword removal
  Remove words that carry little meaning, such as connectives, articles and prepositions
  Words in English follow a Zipf distribution, i.e. a few words appear very frequently, a medium number of words appear with medium frequency, and many words appear infrequently
  High-frequency words are poor discriminators because they describe too many objects
  Very low-frequency words may be too rare to be of value
  Before: a above at bat fly how like little sky tea the tray twinkle twinkle up what wonder world you youre
  25 WORDS
  After: twinkle twinkle little bat wonder world high like tea tray sky
  11 WORDS
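Stopword removal is a simple filter against a fixed word list. The sketch below assumes a tiny hand-picked list (`STOPWORDS` is hypothetical and only covers this example); a real system would use a much larger list.

```python
# Hypothetical stopword list, chosen only to cover the nursery-rhyme example
STOPWORDS = {"a", "above", "at", "how", "i", "in", "the", "up", "what", "you", "youre"}

def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

tokens = ("a above at bat fly how like little sky tea the tray "
          "twinkle twinkle up what wonder world you youre").split()
content = remove_stopwords(tokens)
print(len(tokens), "->", len(content), "words:", " ".join(content))
```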
Document and Query Processing (contd.)
4. Stemming
  Reduction of morphological variants of a word to a common stem
  Will generate some errors, but reduces the size of index files and provides a way to find variants of search terms
  Over-stemming: organisation → organ
  Automatic conflation, e.g. Porter Stemming Algorithm
  Create, Creative, Creation, Creating → Creat
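To make conflation concrete, here is a toy suffix stripper. It is deliberately not the Porter algorithm (which applies several ordered rule phases); the suffix list and the `min_stem` threshold are assumptions picked to reproduce the "creat" example.

```python
# Toy suffix list for illustration only; Porter stemming uses far richer rules
SUFFIXES = ("ing", "ive", "ion", "e")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, provided at least min_stem letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["Create", "Creative", "Creation", "Creating", "twinkle", "little"]:
    print(w, "->", toy_stem(w))
```

All four variants of "create" collapse to one index term, which is the size reduction and variant matching described above.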
Document and Query Processing (contd.)
5. Term Weighting
• TF (term frequency)
  Within a single document, gives high values for frequent terms
  e.g. our document is mostly about “twinkl” because that term occurs most frequently
  tf_{d,i} = num_i (the number of times term i occurs in document d)
• IDF (inverse document frequency)
  Throughout the document collection, gives high values for infrequent terms
  e.g. in a collection of medical articles the word “pathology” will occur in most documents, so it gets a low IDF and does not distinguish documents
  idf_i = log(N / df_i)
  N = number of documents in collection
  df_i = number of documents that contain term i
Document and Query Processing (contd.)
6. We can combine tf and idf to get a term weight for each document:
weight_{d,i} = idf_i * tf_{d,i}
7. Document Matrix
• Each document / query is a vector of term weights

        Twinkl  Littl  Bat    Tea
Doc 1   0.068   0.02   0.02   0.01
Doc 2   0       0      0      0.001
Doc 3   0       0.022  0.11   0.002
Doc 4   0.042   0      0.04   0
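The weighting in steps 5 and 6 and the matrix in step 7 can be sketched as follows. The three-document collection is hypothetical, tf is the raw count, and the natural logarithm is an assumption, since the slides do not state the base of the log.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs maps a document name to its list of stemmed terms.
    Returns {name: {term: tf * idf}} with idf = log(N / df)."""
    n_docs = len(docs)
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))            # count each term once per document
    weights = {}
    for name, terms in docs.items():
        tf = Counter(terms)              # raw term frequency within the document
        weights[name] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights

# Hypothetical toy collection, for illustration only
docs = {
    "doc1": ["twinkl", "twinkl", "littl", "bat"],
    "doc2": ["tea", "tray"],
    "doc3": ["littl", "bat", "bat", "tea"],
}
weights = tf_idf(docs)
for name, row in weights.items():
    print(name, {t: round(w, 3) for t, w in row.items()})
```

"twinkl" gets the largest weight in doc1 because it is frequent there and absent elsewhere; a term that occurred in every document would get weight 0.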
Vector Space Model
Documents and queries are represented as vectors in n-dimensional space (n = number of unique words in the collection)
[Figure: Doc 1, Doc 2, Doc 3 and the Query drawn as vectors against axes term 1 to term 5]
Finally, Retrieval
  The closer a query is to a document, the better the query matches that document
  Ranked list of topically relevant documents computed using distance metrics and returned to user
[Figure: Doc 1, Doc 2, Doc 3 and the Query drawn as vectors against axes term 1 to term 5]
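A sketch of the ranking step, using the four document vectors from the document matrix. Cosine similarity is assumed here as the closeness measure (the slides only say "distance metrics"), and the query vector is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Return (name, score) pairs, best match first."""
    scored = [(name, cosine(query, vec)) for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Columns: twinkl, littl, bat, tea (weights from the document matrix)
docs = {
    "Doc 1": [0.068, 0.02, 0.02, 0.01],
    "Doc 2": [0, 0, 0, 0.001],
    "Doc 3": [0, 0.022, 0.11, 0.002],
    "Doc 4": [0.042, 0, 0.04, 0],
}
query = [0, 0, 1, 0]   # hypothetical query containing only the term "bat"
ranking = rank(query, docs)
for name, score in ranking:
    print(name, round(score, 3))
```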
The Relevance Melting Pot
However! Relevance has been shown to be a multi-faceted concept.
The relevance of a document to a given query is influenced by the user’s context.
The user judges relevance by a number of criteria aside from topic
Readability as a relevance criterion
Relevance criteria listed in various user studies…
Cool et al. (1993)
  Topic: deep / superficial
  Content: explanation, level of detail
  Presentation: understandability, simplicity / complexity, technicality
Barry (1998)
  User’s judgement that he/she will be able to understand or follow the information presented
  The extent to which information is presented in a clear or readable manner
  The extent to which information presented is novel to the user
Schamber (1998)
  Information is specific to user’s need; has sufficient detail or depth
  Information is presented clearly, little effort to read or understand
Relevance in Context
So… we can conclude that users want documents that they can understand and that have the right amount of detail, as well as being topically relevant.
For example, someone who knows very little about AIDS may not be able to understand the following excerpt, i.e. it is irrelevant in their context:
“The development of OIs during HIV disease not only indicates the degree of immunosuppression, but may also influence disease progression itself. When stratified by CD4 counts, patients with prior histories of OIs have higher mortality rates than those without prior histories of OIs”
Zones of Learnability
This follows Walter Kintsch’s 1994 “zones of learnability” hypothesis,
“If a student’s knowledge overlaps too much with an instructional text, there is simply not enough for the student to learn from the text. If there is no overlap, or almost no overlap, there can be no learning either: the necessary hooks in the students’ knowledge, onto which new information is hung, are missing. ”
As such, an IR system should try to match a user with a given level of domain knowledge to the documents that they can learn the most from: documents that have the optimum balance of redundant and new information
How can we achieve such a match?
The ideas thus presented relate closely to the concept of Readability
Readability is a characteristic of text documents:
“the sum total of all those elements within a given piece of printed material that affect the success a group of readers have with it. The success is the extent to which they understand it, read it at an optimal speed, and find it interesting.” (Dale & Chall, 1949)
“ease of understanding or comprehension due to the style of writing” (Klare, 1963)
How can we measure Readability?
A number of traditional readability formulas use simplistic measures:
  Sentence length
  Word frequency lists
  Number of syllables
Common formulas: Flesch-Kincaid, Dale-Chall, Gunning-Fog
In order to…
  Categorise educational texts for grade levels
  Help authors write for a target audience
These formulae have been criticised because they only measure surface-level statistics
E.g. the word “quark” has only one syllable but is a difficult concept to comprehend
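As an illustration of how shallow these measures are, here is a sketch of the Flesch Reading Ease score, which uses only average sentence length and average syllables per word (higher means easier). The syllable counter is a rough vowel-group heuristic and an assumption of this sketch; the slides do not say which Flesch variant or syllable counter was used.

```python
import re

def count_syllables(word):
    """Rough heuristic: one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

easy = "The cat sat. The dog ran."
hard = "Immunosuppression influences mortality."
print(round(flesch_reading_ease(easy), 2), round(flesch_reading_ease(hard), 2))
print(count_syllables("quark"))   # one syllable, so the formula treats it as an easy word
```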
How can we measure Readability? (contd.)
Readability encompasses a number of document characteristics…
Legibility of the text
  How physically readable the text is, i.e. font size, paper color, bullet points, graphical representations (treatment of legibility out of scope of thesis – for now anyway!)
Syntactic complexity of the text
  Grammatical arrangement of words within a sentence (e.g. active / passive sentences have been shown to affect readability)
  Simple / compound / complex sentences
Organization of text
  Rhetorical structure
    Function of statements in text: evidence, antithesis
    For example, the word “but” or the phrase “on the other hand” can signal an antithesis to a previous statement.
  Textual cohesion
    Logical linkage between textual units, as indicated by overt formal markers of the relations between texts
    “Trees are green and have leaves. When many grow in the same place they make up a forest.”
    The words “many” and “they” refer to “trees” in the first sentence, thus making the two sentences cohesive.
Semantic complexity of the text
  The difficulty of the concepts/ideas represented in the text
  Abstractness / concreteness of concepts represented in the text
Does readability exist in a vacuum?
Document Characteristics
  Legibility
  Syntactic Complexity
  Semantic Complexity
  Organisation
  Rhetorical Structure
  Coherence
User Characteristics
  Domain Knowledge
  Reading Ability
  Learning Style
  Motivation
  Task
INTERACTION!
Using Readability to improve relevance
[Diagram: QUERY → IR SYSTEM → TOPICALLY RELEVANT SET → FEATURE EXTRACTION (Syntactic Complexity, Semantic Complexity, Organisation, Rhetorical Structure, Coherence) → READABILITY CLASSIFIER → RERANK, using INFERENCES ABOUT USER’S READABILITY PREFERENCE → CONTEXTUALLY & TOPICALLY RELEVANT SET]
Feature Extraction: Syntactic Complexity
Syntactic complexity is operationalised by measuring part-of-speech (POS) statistics and natural language parse trees
This tells us the function of words in a sentence and the complexity of the sentence
SENTENCE
  NOUN PHRASE
    PROPER NOUN: SUSAN
  VERB PHRASE
    VERB: HIT
    NOUN PHRASE
      PROPER NOUN: MICHAEL
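POS statistics of the kind fed to the classifier are just tag proportions. The sketch below assumes the sentence has already been tagged (here by hand); the tag names and the function `pos_proportions` are illustrative, not the tagger's actual output format.

```python
from collections import Counter

def pos_proportions(tagged):
    """tagged: list of (word, tag) pairs. Returns the share of each tag."""
    counts = Counter(tag for _, tag in tagged)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

# Hand-tagged version of the parse-tree example
tagged = [("Susan", "PROPER_NOUN"), ("hit", "VERB"), ("Michael", "PROPER_NOUN")]
props = pos_proportions(tagged)
for tag, share in props.items():
    print(tag, f"{share:.0%}")
```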
Feature Extraction: Semantic Complexity
Operationalised using various external information sources, e.g. Roget’s Thesaurus
The higher up in the thesaurus structure a term appears, the more abstract the word is
The lower in the thesaurus structure a term appears, the more specific the concept is
WordNet lexical resource gives a “familiarity” score to nouns and verbs, similar to word frequency lists
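The depth idea can be sketched with a toy hierarchy. The `PARENT` table below is entirely hypothetical (it is not Roget's or WordNet data); it only shows how depth in a hierarchy can serve as a specificity score.

```python
# Hypothetical child -> parent hierarchy, for illustration only
PARENT = {
    "matter": None,
    "organism": "matter",
    "animal": "organism",
    "bat": "animal",
}

def depth(term):
    """Number of steps from the term up to the root: 0 = most abstract."""
    steps = 0
    while PARENT[term] is not None:
        term = PARENT[term]
        steps += 1
    return steps

for term in PARENT:
    print(term, depth(term))
```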
Feature Extraction: Rhetorical Structure
This is where NLP gets very difficult…
  Rhetorical structure analysis is still mostly done manually
  Presence of cue words in text signals a relation
  At most 50% of relations are signalled
  Shallow rhetorical structure analysis will be performed – deep analysis not easily automated (yet)
Examples of relations: Evidence, Background, Antithesis
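A shallow analysis of this kind can be approximated by cue-phrase matching. In the sketch below the `CUES` table is an assumption: "but" and "on the other hand" as antithesis markers come from the earlier slide, while "because" as an evidence marker is added for illustration.

```python
import re

# Cue phrase -> relation; partly assumed, see the note above
CUES = {
    "but": "antithesis",
    "on the other hand": "antithesis",
    "because": "evidence",
}

def find_relations(text):
    """Return (cue, relation) pairs for every cue phrase found as a whole word or phrase."""
    lowered = text.lower()
    found = []
    for cue, relation in CUES.items():
        if re.search(r"\b" + re.escape(cue) + r"\b", lowered):
            found.append((cue, relation))
    return found

print(find_relations("It is cheap, but it is slow."))
print(find_relations("Press the button."))   # "button" must not match "but"
```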
Feature Extraction: Lexical Cohesion
How well a text fits together, a measure of the coherence of the text
Operationalised by computing the number and density of lexical chains – repetitions, synonyms, anaphora etc.
A lexical chain is a sequence of related words in the text, spanning short (adjacent words or sentences) or long distances (entire text).
For example: “The plane could not continue to fly. There was a problem with its wing. The pilot made an emergency landing.”
Lexical chain: plane, fly, its, wing, pilot
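A greedy chain builder gives the flavour of the computation. The relatedness table `RELATED` is hand-written for this one example and stands in for a real thesaurus or anaphora resolver; "density" is taken here as the share of words that fall in some chain, which is one possible definition.

```python
# Hypothetical relatedness pairs, standing in for a thesaurus / anaphora resolver
RELATED = {("plane", "fly"), ("plane", "its"), ("plane", "wing"), ("plane", "pilot")}

def related(a, b):
    return a == b or (a, b) in RELATED or (b, a) in RELATED

def lexical_chains(tokens):
    """Greedily add each token to the first chain holding a related word."""
    chains = []
    for tok in tokens:
        for chain in chains:
            if any(related(tok, member) for member in chain):
                chain.append(tok)
                break
        else:
            chains.append([tok])
    return [c for c in chains if len(c) > 1]   # single words are not chains

tokens = ["plane", "continue", "fly", "problem", "its", "wing",
          "pilot", "emergency", "landing"]
chains = lexical_chains(tokens)
density = sum(len(c) for c in chains) / len(tokens)
print(chains, round(density, 2))
```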
Readability Classifier
Novel machine learning approach
  Classify the topically relevant set of documents returned using traditional IR model
  Re-rank the topically relevant set, boosting documents with the appropriate level of readability
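One way the re-ranking could work is to multiply the topical score of each document whose readability class matches the user's preference. This is only a sketch: the multiplicative `boost`, its value and the document scores are all assumptions, not the method used in this work.

```python
def rerank(results, preferred, boost=1.5):
    """results: list of (doc_id, topical_score, readability_class).
    Documents in the preferred class have their score multiplied by boost."""
    def adjusted(item):
        _, score, readability = item
        return score * boost if readability == preferred else score
    return sorted(results, key=adjusted, reverse=True)

# Hypothetical topically relevant set, best topical match first
results = [("d1", 0.9, "difficult"), ("d2", 0.8, "easy"), ("d3", 0.5, "easy")]
reranked = rerank(results, preferred="easy")
print([doc_id for doc_id, _, _ in reranked])
```

A boost rather than a filter keeps every topically relevant document in the list, so the topical set itself is not compromised.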
C5.0: Decision Tree Learner
A set of documents, pre-classified by readability, is given to C5.0
C5.0 is given the feature set for these documents
  E.g. Doc001 contains 5% prepositions, 20% adjectives…, contains 3 lexical chains, contains 15 terms that represent complex ideas and 14 statements of evidence.
C5.0: Decision Tree Learner (contd.)
The classifier examines the values given for each feature and returns a set of rules that tell us how to classify the document
E.g.: IF proportion of adjectives > 15% AND number of lexical chains >= 4 THEN document is easily readable
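Once learned, such a rule is trivial to apply. The sketch below hard-codes the example rule rather than calling C5.0; the feature names and the "unclassified" fallback are assumptions.

```python
def classify(features):
    """Apply the example rule; anything it does not cover is left unclassified."""
    if features["adjective_pct"] > 15 and features["lexical_chains"] >= 4:
        return "easily readable"
    return "unclassified"

# Feature values for Doc001 as listed earlier (only the ones the rule needs)
doc001 = {"preposition_pct": 5, "adjective_pct": 20, "lexical_chains": 3}
print(classify(doc001))   # only 3 lexical chains, so the rule does not fire
```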
Work in Progress
No significant existing corpus annotated with readability data
Large scale modern study to find best feature set to classify for readability that is not domain specific
How best to infer a user’s level of domain knowledge: implicit / explicit?
How best to incorporate readability into an IR environment without compromising the topically relevant set
Initial Experiments
Machine learning on 2394 “easy” and “difficult” documents using POS statistics (to obtain a measure of syntactic complexity) and the traditional Flesch formula.
Fold   POS     Flesch  Combined
0      12.0    9.8     6.9
1      12.4    9.7     6.4
2      14.0    9.6     6.5
3      12.5    9.5     6.8
4      12.5    9.6     6.9
5      10.8    9.5     7.0
6      11.4    9.8     6.2
7      11.8    9.8     6.2
8      11.8    10.0    6.9
9      12.5    9.7     7.2
Mean   12.2%   9.7%    6.7%
SE     0.3%    <0.1%   0.1%
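The summary rows can be recomputed from the ten fold values. The sketch assumes SE means the standard error of the mean, i.e. the sample standard deviation divided by the square root of the number of folds.

```python
import math
import statistics

# Per-fold values copied from the table
folds = {
    "POS":      [12.0, 12.4, 14.0, 12.5, 12.5, 10.8, 11.4, 11.8, 11.8, 12.5],
    "Flesch":   [9.8, 9.7, 9.6, 9.5, 9.6, 9.5, 9.8, 9.8, 10.0, 9.7],
    "Combined": [6.9, 6.4, 6.5, 6.8, 6.9, 7.0, 6.2, 6.2, 6.9, 7.2],
}

def mean_and_se(values):
    """Mean and standard error of the mean (sample stdev / sqrt(n))."""
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean, se

for name, values in folds.items():
    mean, se = mean_and_se(values)
    print(f"{name}: mean {mean:.1f}%, SE {se:.2f}%")
```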
Thanks
Questions?