Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

31
Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard
  • date post

    19-Dec-2015
  • Category

    Documents

  • view

    215
  • download

    0

Transcript of Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Page 1: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Evidence from Metadata

LBSC 796/CMSC 828o

Session 6 – March 1, 2004

Douglas W. Oard

Page 2: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Agenda• Questions

• Controlled vocabulary retrieval

• Generating metadata

• Metadata standards

• Putting the pieces together

Page 3: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Supporting the Search Process

SourceSelection

Search

Query

Selection

Ranked List

Examination

Document

Delivery

Document

QueryFormulation

IR System

Indexing Index

Acquisition Collection

Page 4: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Problems with “Free Text” Search

• Homonymy– Terms may have many unrelated meanings– Polysemy (related meanings) is less of a problem

• Synonymy– Many ways of saying (nearly) the same thing

• Anaphora– Alternate ways of referring to the same thing

Page 5: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Behavior Helps, But not Enough

• Privacy limits access to observations

• Queries based on behavior are hard to craft– Explicit queries are rarely used– Query by example requires behavior history

• “Cold start” problem limits applicability

Page 6: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

A “Solution:” Concept Retrieval

• Develop a concept inventory– Uniquely identify concepts using “descriptors”– Concept labels form a “controlled vocabulary”– Organize concepts using a “thesaurus”

• Assign concept descriptors to documents– Known as “indexing”

• Craft queries using the controlled vocabulary

Page 7: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Two Ways of Searching

Write the documentusing terms to

convey meaning

Author

Content-BasedQuery-Document

Matching Document Terms

Query Terms

Construct query fromterms that may

appear in documents

Free-TextSearcher

Retrieval Status Value

Construct query fromavailable concept

descriptors

ControlledVocabulary

Searcher

Choose appropriate concept descriptors

Indexer

Metadata-BasedQuery-Document

Matching Query Descriptors

Document Descriptors

Page 8: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Boolean Search Example

• Canine AND Fox– Doc 1

• Canine AND Political action– Empty

• Canine OR Political action– Doc 1, Doc 2

The quick brown fox jumped over the lazy dog’s back.

Document 1

Document 2

Now is the time for all good men to come to the aid of their party.

VolunteerismPolitical action

FoxCanine 0

011

1100

Descriptor Doc

1D

oc 2

[Canine][Fox]

[Political action][Volunteerism]

Page 9: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Applications

• When implied concepts must be captured– Political action, volunteerism, …

• When terminology selection is impractical– Searching foreign language materials

• When no words are present– Photos w/o captions, videos w/o transcripts, …

• When user needs are easily anticipated– Weather reports, yellow pages, …

Page 10: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Yahoo

Page 11: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Text Categorization

• Goal: fully automatic descriptor assignment

• Machine learning approach– Assign descriptors manually for a “training set”– Design a learning algorithm find and use patterns

• Bayesian classifier, neural network, genetic algorithm, …

– Present new documents• System assigns descriptors like those in training set

Page 12: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Supervised Learningf1 f2 f3 f4 … fN

v1 v2 v3 v4 … vN Cv

w1 w2 w3 w4 … wN Cw

Learner

Classifier

New example

x1 x2 x3 x4 … xNCx

Labelled training examples

Cw

Page 13: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Example: kNN Classifier

Page 14: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Machine Assisted Indexing

• Goal: Automatically suggest descriptors– Better consistency with lower cost

• Approach rule-based expert system– Design thesaurus by hand in the usual way– Design an expert system to process text

• String matching, proximity operators, …

– Write rules for each thesaurus/collection/language– Try it out and fine tune the rules by hand

Page 15: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Machine Assisted Indexing Example

//TEXT: scienceIF (all caps) USE research policy USE community programENDIFIF (near “Technology” AND with “Development”) USE community development USE development aidENDIF

near: within 250 wordswith: in the same sentence

Access Innovations system:

Page 16: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Thesaurus Design

• Thesaurus must match the document collection– Literary warrant

• Thesaurus must match the information needs– User-centered indexing

• Thesaurus can help to guide the searcher– Broader term (“is-a”), narrower term, used for, …

Page 17: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Challenges• Changing concept inventories

– Literary warrant and user needs are hard to predict

• Accurate concept indexing is expensive– Machines are inaccurate, humans are inconsistent

• Users and indexers may think differently– Diverse user populations add to the complexity

• Using thesauri effectively requires training– Meta-knowledge and thesaurus-specific expertise

Page 18: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Named Entity Tagging

• Machine learning techniques can find:– Location– Extent– Type

• Two types of features are useful– Orthography

• e.g., Paired or non-initial capitalization

– Trigger words• e.g., Mr., Professor, said, …

Page 19: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.
Page 20: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Normalization

• Variant forms of names (“name authority”)– Pseudonyms, partial names, citation styles

• Acronyms and abbreviations– Organizations, political entities, projects, …

• Co-reference resolution– References to roles or objects rather than names– Anaphoric pronouns for an antecedent name

Page 21: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Example: Bibliographic References

Page 22: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Types of Metadata Standards• What can we describe?

– Dublin Core

• How can we convey it?– Resource Description Framework (RDF)

• What can we say?– LCSH, MeSH, …

• What does it mean?– Semantic Web

Page 23: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Dublin Core• Goals:

– Easily understood, implemented and used

– Broadly applicable to many applications

• Approach:– Intersect several standards (e.g., MARC)

– Suggest only “best practices” for element content

• Implementation:– 16 optional and repeatable “elements”

• Refined using a growing set of “qualifiers”

– “Best practice” suggestions for content standards

Page 24: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Dublin Core Elements

Content• Title• Subject• Description• Type• Audience• Coverage• Related resource• Rights

Instantiation• Date• Format• Language• Identifier

Responsibility• Creator• Contributor• Source• Publisher

Page 25: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Resource Description Framework

• XML schema for describing resources

• Can integrate multiple metadata standards – Dublin Core, P3P, PICS, vCARD, …

• Dublin Core provides a XML “namespace”– DC Elements are XML “properties

• DC Refinements are RDF “subproperties”

– Values are XML “content”

Page 26: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

A Rose By Any Other Name …

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/">

<rdf:Description rdf:about="http://media.example.com/audio/guide.ra"> <dc:creator>Rose Bush</dc:creator> <dc:title>A Guide to Growing Roses</dc:title> <dc:description>Describes process for planting and nurturing different kinds of rose bushes.</dc:description> <dc:date>2001-01-20</dc:date> </rdf:Description> </rdf:RDF>

Page 27: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Semantic Web

• RDF provides the schema for interchange

• Ontologies support automated inference– Similar to thesauri supporting human reasoning

• Ontology mapping permits distributed creation– This is where the magic happens

Page 28: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Adversarial IR

• Search is user-controlled suppression– Everything is known to the search system– Goal: avoid showing things the user doesn’t want

• Other stakeholders have different goals– Authors risk little by wasting your time– Marketers hope for serendipitous interest

• Metadata from trusted sources is more reliable

Page 29: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Index Spam

• Goal: Manipulate rankings of an IR system

• Multiple strategies:– Create bogus user-assigned metadata– Add invisible text (font in background color, …)– Alter your text to include desired query terms– “Link exchanges” create links to your page

Page 30: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Putting It All Together

Free Text Behavior Metadata

Topicality

Quality

Reliability

Cost

Flexibility

Page 31: Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard.

Before You Go!

On a sheet of paper, please briefly answer the following question (no names):

What was the muddiest point in today’s lecture?