INFO624 -- Week 8
Subject Indexing & Knowledge Representation

Dr. Xia Lin
Assistant Professor
College of Information Science and Technology
Drexel University

Page 1: INFO624 -- Week 8 Subject Indexing & Knowledge Representation Dr. Xia Lin Assistant Professor College of Information Science and Technology Drexel University.

Page 2:

Effective Information Retrieval
• Data Structures
• Knowledge Representation
  • From document representation to knowledge representation
• User Interface and User Interaction

Page 3:

Document Representation
• Vocabulary
• Semantics
• Implementation

Page 4:

Vocabulary
• Controlled vocabulary
  • A list of terms selected for indexing purposes.
  • The terms are processed to reduce inconsistency and ambiguity.
  • Established selection rules and indexing rules
• Uncontrolled vocabulary
• Subject keywords
• Metadata

Page 5:

Example: ACM record

Page 6:

Metadata
• Data about data
• Descriptive data
  • External to the meaning of the document
  • Dublin Core Metadata Element Set
  • Author, title, publisher, etc.
• Semantic metadata
  • Subject keywords
• Challenge: automatic generation of metadata for documents
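The split between descriptive and semantic metadata can be sketched as a small record. This is only an illustration: the field names follow Dublin Core element names (Dublin Core calls the author field "creator"), the sample values are made up for the example, and the grouping of elements into the two sets is an assumption made here, not something the slide specifies.

```python
# Assumed grouping of Dublin Core element names into the slide's two kinds
# of metadata. The grouping itself is an illustration, not part of the standard.
DESCRIPTIVE = {"title", "creator", "publisher", "date", "identifier"}
SEMANTIC = {"subject", "description"}

# Hypothetical record for this lecture; values are for illustration only.
record = {
    "title": "Subject Indexing & Knowledge Representation",
    "creator": "Xia Lin",
    "publisher": "Drexel University",
    "subject": ["Subject indexing", "Knowledge representation"],
}


def split_metadata(rec):
    """Return (descriptive, semantic) views of one metadata record."""
    descriptive = {k: v for k, v in rec.items() if k in DESCRIPTIVE}
    semantic = {k: v for k, v in rec.items() if k in SEMANTIC}
    return descriptive, semantic


descriptive, semantic = split_metadata(record)
print(sorted(descriptive))
print(semantic)
```

The descriptive fields can usually be copied from the document itself; the semantic fields are the ones the slide identifies as hard to generate automatically.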

Page 7:

Semantics
• Semantics is the study of meaning
• Relational semantics
  • Synonymy, hierarchical, etc.
• Referential semantics
  • Homonyms, techniques used to limit the meanings or referents of terms
• Category semantics
  • Facets or other partitions

Page 8:

Example: Mercury?
• Mercury (car)
• Mercury (planet)
• Mercury (metal)
• Mercury (Roman god)
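A parenthetical qualifier is one way to limit the referent of a homonym. The sketch below picks a qualified sense from the words around the term. The cue words and the overlap-counting rule are assumptions chosen for illustration; a real system would use a much richer context model.

```python
# Qualified senses of the homonym, with hypothetical cue words for each.
CONTEXT_CUES = {
    "Mercury (car)": {"ford", "sedan", "engine"},
    "Mercury (planet)": {"orbit", "sun", "solar"},
    "Mercury (metal)": {"thermometer", "toxic", "liquid"},
    "Mercury (god)": {"myth", "messenger", "temple"},
}
SENSES = {"mercury": list(CONTEXT_CUES)}


def disambiguate(term, context):
    """Return the qualified sense whose cue words overlap the context most.

    Returns None when the term is unknown or no cue word matches.
    """
    words = set(context.lower().split())
    candidates = SENSES.get(term.lower(), [])
    scored = [(len(CONTEXT_CUES[sense] & words), sense) for sense in candidates]
    best_score, best_sense = max(scored, default=(0, None))
    return best_sense if best_score > 0 else None


print(disambiguate("Mercury", "the orbit of mercury around the sun"))
```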

Page 9:

Implementation
• Standards
  • AACR2
  • ISO Standard for Indexing (ISO 5963)
  • ISO Standard for Thesaurus Construction (ISO 2788)
• Rules
  • Classification rules
  • Evaluation rules

Page 10:

Subject Indexing
• A human analytic process for identifying, selecting, and representing document concepts
• Create indexing languages
  • Using standardized, limited vocabularies for indexing purposes.
• Assign indexing terms to documents
  • Using only the terms in the indexing language selected.

Page 11:

Basic Processes of Subject Indexing
• Identifying concepts which represent the subject and purpose of a document.
• Deciding which of these concepts are important for retrieval of this document.
• Expressing concepts needed for retrieval in the indexing languages used.
• Using uncontrolled vocabulary for concepts not represented, or not represented specifically enough, in the indexing languages.
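The four steps can be sketched as a small function. Subject indexing is a human process, so the first two steps appear here only as inputs supplied by the indexer; the vocabulary table and the sample phrases are hypothetical.

```python
# Hypothetical controlled vocabulary: lead-in phrase -> preferred descriptor.
CONTROLLED = {
    "cancer": "Neoplasms",
    "malignant tumor": "Neoplasms",
    "vitamin c": "Ascorbic Acid",
}


def index_document(concepts, important):
    """Split an indexer's concepts into descriptors and free keywords.

    concepts  -- phrases the indexer identified in the document (step 1)
    important -- the subset judged useful for retrieval (step 2)
    """
    descriptors, free_keywords = [], []
    for phrase in concepts:
        if phrase not in important:
            continue  # step 2: drop concepts not needed for retrieval
        key = phrase.lower()
        if key in CONTROLLED:
            term = CONTROLLED[key]  # step 3: express in the indexing language
            if term not in descriptors:
                descriptors.append(term)
        else:
            free_keywords.append(phrase)  # step 4: uncontrolled fallback
    return descriptors, free_keywords


print(index_document(
    ["Cancer", "Vitamin C", "hospital parking", "telomerase"],
    {"Cancer", "Vitamin C", "telomerase"},
))
```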

Page 12:

Controlled Vocabulary
Goals:
• To permit easy location of documents by topic.
• To define topic areas, and hence relate one document to another.
• To provide multiple access points to documents.
• To enforce uniformity throughout an information retrieval system.

Page 13:

Controlled Vocabulary
Formats:
• Hierarchical classified list
  • hierarchical subject descriptors
  • associative cross references
  • classification notation (codes)
• Alphabetical list
  • includes both descriptors and other lead-in terms

Page 14:

Main Components in a Controlled Vocabulary
• Keyword/Descriptor
• Synonymous Term
• Broader Term
• Narrower Term
• Related Term

Page 15:

Example: Cancer

[Diagram: the descriptor "Cancer" surrounded by four groups of terms labeled Broader Terms, Related Terms, Narrower Terms, and Synonyms. Terms shown, in the order they appear on the slide:]
• Malignancy; Malignant tumor; Cancer morphology
• Diseases
• Neoplasms
• Malignant neoplasm of skin; Breast Cancer; Primary malignant neoplasm of liver
• Abdominal Neoplasms; Hyperplasia; Seminoma
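A thesaurus entry with these components can be held in a small record type. The sketch below is one plausible reading of the slide's diagram: the assignment of terms to the synonym, broader, narrower, and related groups is an assumption, since the transcript does not preserve which label belongs to which group.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    descriptor: str
    synonyms: list = field(default_factory=list)  # lead-in terms
    broader: list = field(default_factory=list)   # BT
    narrower: list = field(default_factory=list)  # NT
    related: list = field(default_factory=list)   # RT


# Assumed grouping of a few terms from the slide (illustration only).
cancer = ThesaurusEntry(
    descriptor="Cancer",
    synonyms=["Malignancy", "Malignant tumor"],
    broader=["Neoplasms", "Diseases"],
    narrower=["Breast Cancer"],
    related=["Hyperplasia"],
)


def search_terms(entry, include_narrower=True):
    """Terms a search on this descriptor could match.

    Synonyms are always included; narrower terms are optional because they
    widen the search to more specific topics.
    """
    terms = [entry.descriptor] + entry.synonyms
    if include_narrower:
        terms += entry.narrower
    return terms


print(search_terms(cancer))
```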

Page 16:

Example: MeSH – Medical Subject Headings
• 22,568 descriptors
• 139,000 headings (Supplementary Concept Records)
• thousands of cross-references
  • e.g., Vitamin C see Ascorbic Acid.
• Used for indexing MEDLINE
• MeSH Browser
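A "see" cross-reference maps a lead-in term to the preferred descriptor. A minimal sketch, assuming the cross-references are kept in a plain lookup table; the table holds only the slide's one example, and the function name is hypothetical.

```python
# "see" references: lead-in (non-preferred) term -> preferred descriptor.
SEE = {"vitamin c": "Ascorbic Acid"}
# Preferred descriptors, keyed by lowercase form for case-insensitive lookup.
DESCRIPTORS = {"ascorbic acid": "Ascorbic Acid"}


def resolve(term):
    """Return the preferred descriptor for a term, or None if it is unknown."""
    key = term.strip().lower()
    if key in SEE:
        return SEE[key]  # follow the cross-reference
    return DESCRIPTORS.get(key)  # already a descriptor, or not in the vocabulary


print(resolve("Vitamin C"))
```

Both the indexer and the searcher end up at the same descriptor, whichever form they start from.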

Page 17:

MeSH Tree Structures - 2004
1. Anatomy [A]
2. Organisms [B]
3. Diseases [C]
4. Chemicals and Drugs [D]
5. Analytical, Diagnostic and Therapeutic Techniques and Equipment [E]
6. Psychiatry and Psychology [F]
7. Biological Sciences [G]
8. Physical Sciences [H]
9. Anthropology, Education, Sociology and Social Phenomena [I]
10. Technology and Food and Beverages [J]
11. Humanities [K]
12. Information Science [L]
13. Persons [M]
14. Health Care [N]
15. Geographic Locations [Z]
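The bracketed letters act as classification notation. The sketch below assumes that a term's position in the tree is written as a dotted code starting with its category letter (for example a string such as "C04.588"), so that a broader term's code is a prefix of a narrower term's code. The sample codes are illustrative strings, not looked up from MeSH.

```python
# Top-level categories as listed on the slide (2004 tree structures).
MESH_CATEGORIES = {
    "A": "Anatomy",
    "B": "Organisms",
    "C": "Diseases",
    "D": "Chemicals and Drugs",
    "E": "Analytical, Diagnostic and Therapeutic Techniques and Equipment",
    "F": "Psychiatry and Psychology",
    "G": "Biological Sciences",
    "H": "Physical Sciences",
    "I": "Anthropology, Education, Sociology and Social Phenomena",
    "J": "Technology and Food and Beverages",
    "K": "Humanities",
    "L": "Information Science",
    "M": "Persons",
    "N": "Health Care",
    "Z": "Geographic Locations",
}


def top_category(tree_number):
    """Map a dotted tree code to its top-level category name, if any."""
    letter = tree_number.strip()[:1].upper()
    return MESH_CATEGORIES.get(letter)


def is_broader(broader_code, narrower_code):
    """True when broader_code is a proper ancestor of narrower_code."""
    return narrower_code.startswith(broader_code + ".")


print(top_category("C04.588"))
print(is_broader("C04", "C04.588"))
```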

Page 18:

ERIC Thesaurus
• More than 10,000 terms or subject headings used in indexing and searching ERIC records.
• A supplemental list of over 55,000 terms or subject headings, including
  • proper names (geographic, personal, institutional, project, equipment, test, etc.), or
  • concepts not yet represented by the controlled vocabulary of the ERIC Thesaurus.

Page 19:

Controlled Vocabulary
Examples:
• Case studies: Descriptor
  • SN: Detailed analyses, usually focusing on a particular problem of an individual, group, or organization (note: do not confuse with “medical case histories”)
  • NT:
    • Cross sectional studies
    • Longitudinal studies

Page 20:

Examples (Case Studies)
• BT
  • Evaluation methods
  • Research
• RT
  • Case records
  • Counseling
  • Qualitative research
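The "Case studies" record can be written down as a plain dictionary keyed by the thesaurus codes. Thesaurus relationships are conventionally reciprocal (an NT link implies a BT link in the other direction, and RT links are symmetric), so the sketch also lists the links the other entries would need. The dictionary layout and the function name are assumptions made for illustration.

```python
case_studies = {
    "descriptor": "Case studies",
    "SN": "Detailed analyses, usually focusing on a particular problem "
          "of an individual, group, or organization",
    "NT": ["Cross sectional studies", "Longitudinal studies"],
    "BT": ["Evaluation methods", "Research"],
    "RT": ["Case records", "Counseling", "Qualitative research"],
}

# The inverse of each relationship code.
INVERSE = {"NT": "BT", "BT": "NT", "RT": "RT"}


def reciprocal_links(record):
    """List (term, code, target) links that other entries should carry."""
    target = record["descriptor"]
    links = []
    for code, inverse in INVERSE.items():
        for term in record.get(code, []):
            links.append((term, inverse, target))
    return links


for link in reciprocal_links(case_studies):
    print(link)
```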

Page 21:

Advantages of Subject Indexing
• Facilitates concept search
  • search by topics/subjects, not just by words
  • link related documents by subject terms
  • make implicit information explicit
• Provides a standard terminology to index and search documents.
  • use a small indexing vocabulary
  • help the searcher find related terms

Page 22:

Disadvantages of Subject Indexing
• Expensive manual operations
  • To construct the controlled vocabulary
  • To assign terms to documents
• Difficult to keep up to date
  • Terminology changes very fast
  • New terms are added daily.
• Inconsistent process of human indexing
  • The same documents are assigned different indexing terms by different indexers.
  • The user may not use the same terms to find documents as the indexer would use to index the documents.
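Inconsistency between indexers can be quantified. The sketch below uses a set-overlap ratio (terms both indexers assigned, divided by terms either assigned); this measure is a choice made for illustration and is not named on the slide, and the sample terms are hypothetical.

```python
def indexing_consistency(terms_a, terms_b):
    """Share of assigned terms that two indexers have in common (0.0 to 1.0)."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0  # neither indexer assigned anything: trivially consistent
    return len(a & b) / len(a | b)


# Two indexers describing the same document with different descriptors.
indexer_1 = ["Neoplasms", "Ascorbic Acid"]
indexer_2 = ["Neoplasms", "Diet Therapy"]
print(indexing_consistency(indexer_1, indexer_2))
```

A value of 1.0 means both indexers chose exactly the same terms; values well below 1.0 are the inconsistency the slide describes.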

Page 23:

Document Representation
• Inverted indexing
  • Represent a document as a list of terms that occur in the document
  • computer-based indexing
  • statistical-based indexing
• Subject indexing
  • Represent a document as a list of subject terms that occur in a controlled vocabulary.
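The two representations can be compared side by side. The sketch builds an inverted index from the words of two made-up documents and sets it next to a hand-assigned subject index; the documents, the tokenization rule, and the assigned descriptors are all assumptions for illustration.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(doc_id)
    return index


# Hypothetical documents.
docs = {
    1: "Vitamin C and cancer",
    2: "Ascorbic acid in cancer therapy",
}
inverted = build_inverted_index(docs)

# Hypothetical subject index: descriptors assigned by a human indexer.
subject_index = {
    "Ascorbic Acid": {1, 2},
    "Neoplasms": {1, 2},
}

print(sorted(inverted["ascorbic"]))            # word match only
print(sorted(subject_index["Ascorbic Acid"]))  # concept match
```

The word "ascorbic" occurs only in document 2, while the descriptor "Ascorbic Acid" reaches both documents, which is the difference between searching by words and searching by subjects.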

Page 24:

Considerations of Document Representation
• Any format of document representation needs to maintain a balance of its
  • Discriminating power
  • Descriptiveness
  • Similarity identification
  • Conciseness

Page 25:

Considerations of DR
• Discriminating power
  • to identify a document uniquely
  • to reduce ambiguity
  • Examples:
    • ISBN numbers for books
    • bar codes for products

Page 26:

Considerations of DR
• Descriptiveness
  • describe all the information as completely as possible
    • full text
    • abstracts
    • extracts
    • reviews
  • Completeness and correctness

Page 27:

Considerations of DR
• Similarity identification
  • to group similar documents
    • keywords or subject indexing
    • book classification numbers
  • It is difficult for the computer to assign keywords, subject descriptors, or classification numbers to documents

Page 28:

Considerations of DR
• Conciseness
  • simple and clear
  • reduce processing time and storage space
  • Examples:
    • authors and titles

Page 29:

Relationships of four considerations
• Higher discriminating power may lower the capability of identifying similarities among documents.
• Good descriptiveness may defeat conciseness.
• What’s good for the computer may not always be good for the user.
• A good representation should seek a balance of the four, and take into consideration both the computer and the user.

Page 30:

What’s missing in DR?
• Intelligent reasoning!
• Knowledge base
  • Ontology
  • Semantic networks
• Uncertainty (imprecision) handling

Page 31:

Knowledge Representation
• Encoding human knowledge, in all its various forms, in such a way that the knowledge can be used.
• A successful representation of some knowledge must be in a form that is understandable by humans, and must cause the system using the knowledge to behave as if it knows it.

Page 32:

Knowledge Representation
• A knowledge representation (KR) is most fundamentally a surrogate, a substitute for the thing itself.
• It is a set of ontological commitments, i.e., an answer to the question: In what terms should I think about the world?
• It is a fragmentary theory of intelligent reasoning, expressed in terms of three components: (i) the representation's fundamental conception of intelligent reasoning; (ii) the set of inferences the representation sanctions; and (iii) the set of inferences it recommends.

Page 33:

Knowledge Representation
• It is a medium for pragmatically efficient computation, i.e., the computational environment in which thinking is accomplished. One contribution to this pragmatic efficiency is supplied by the guidance a representation provides for organizing information so as to facilitate making the recommended inferences.
• It is a medium of human expression, i.e., a language in which we say things about the world.
• From http://medg.lcs.mit.edu/ftp/psz/k-rep.html

Page 34:

Intelligent Information Retrieval
• Information retrieval supported by knowledge representation, rather than document representation.
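One small way knowledge can support retrieval is to follow narrower-term links before matching documents. The sketch below reuses the "Case studies" hierarchy from the ERIC example; the document ids and the subject index are hypothetical, and expansion along narrower terms is only one simple form of the reasoning the slide has in mind.

```python
# Narrower-term links, taken from the ERIC "Case studies" example.
NARROWER = {
    "Research": ["Case studies"],
    "Case studies": ["Cross sectional studies", "Longitudinal studies"],
}

# Hypothetical subject index: descriptor -> ids of documents indexed with it.
SUBJECT_INDEX = {
    "Research": {"doc3"},
    "Case studies": {"doc1"},
    "Longitudinal studies": {"doc2"},
}


def expand(term):
    """Return the term plus all of its narrower terms, at any depth."""
    seen, stack = [], [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles in the vocabulary
        seen.append(current)
        stack.extend(NARROWER.get(current, []))
    return seen


def retrieve(term):
    """Documents indexed with the term or with any narrower term."""
    hits = set()
    for t in expand(term):
        hits |= SUBJECT_INDEX.get(t, set())
    return hits


print(sorted(retrieve("Research")))
```

A plain lookup of "Research" would return only doc3; following the hierarchy also reaches documents indexed under its narrower terms.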

• Useful links
  • Stanford
  • Agent-based IR