
Transcript
Page 1: WoK: A Web of Knowledge

David W. Embley

Brigham Young University

Provo, Utah, USA

WoK: A Web of Knowledge

Page 2: WoK: A Web of Knowledge

A Web of Pages

A Web of Facts

Birthdate of my great grandpa Orson

Price and mileage of red Nissans, 1990 or newer

Location and size of chromosome 17

US states with property crime rates above 1%

Page 3: WoK: A Web of Knowledge

Toward a Web of Knowledge

• Fundamental questions
  – What is knowledge?
  – What are facts?
  – How does one know?
• Philosophy
  – Ontology
  – Epistemology
  – Logic and reasoning

Page 4: WoK: A Web of Knowledge

Ontology

• Existence asks “What exists?”
• Concepts, relationships, and constraints with formal foundation

Page 5: WoK: A Web of Knowledge

Epistemology

• The nature of knowledge asks: “What is knowledge?” and “How is knowledge acquired?”
• Populated conceptual model

Page 6: WoK: A Web of Knowledge

Logic and Reasoning

• Principles of valid inference – asks: “What is known?” and “What can be inferred?”
• For us, it answers: what can be inferred (in a formal sense) from conceptualized data.

Find price and mileage of red Nissans, 1990 or newer

Page 7: WoK: A Web of Knowledge

Making this Work: How?

• Distill knowledge from the wealth of digital web data
• Annotate web pages
• Need a computational alembic to algorithmically turn raw symbols contained in web pages into knowledge

[Figure labels: Fact, Fact, Fact, Annotation, Annotation]

Page 8: WoK: A Web of Knowledge

Turning Raw Symbols into Knowledge

• Symbols: $ 11,500 117K Nissan CD AC
• Data: price(11,500) mileage(117K) make(Nissan)
• Conceptualized data:
  – Car(C123) has Price($11,500)
  – Car(C123) has Mileage(117,000)
  – Car(C123) has Make(Nissan)
  – Car(C123) has Feature(AC)
• Knowledge
  – “Correct” facts
  – Provenance
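The progression from symbols to conceptualized data can be sketched in a few lines of Python. This is only an illustration, not the WoK implementation: the regular expressions, the toy lexicons `MAKES` and `FEATURES`, and the function name `conceptualize` are all assumptions made for the example.

```python
import re

# Hypothetical recognizers; the patterns and lexicons are illustrative assumptions.
PRICE = re.compile(r"\$\s*(\d{1,3}(?:,\d{3})*)")
MILEAGE = re.compile(r"(\d+)K\b")
MAKES = {"Nissan", "Toyota", "Honda"}   # toy lexicon
FEATURES = {"AC", "CD"}                 # toy lexicon


def conceptualize(car_id, text):
    """Turn raw symbols into (object, relationship, value) facts."""
    facts = []
    m = PRICE.search(text)
    if m:
        facts.append((car_id, "has Price", int(m.group(1).replace(",", ""))))
    m = MILEAGE.search(text)
    if m:
        # "117K" is read as 117,000 miles.
        facts.append((car_id, "has Mileage", int(m.group(1)) * 1000))
    for token in text.split():
        if token in MAKES:
            facts.append((car_id, "has Make", token))
        elif token in FEATURES:
            facts.append((car_id, "has Feature", token))
    return facts


facts = conceptualize("C123", "$ 11,500 117K Nissan CD AC")
for fact in facts:
    print(fact)
```

Knowledge in the slide's sense would additionally require checking that the facts are correct and recording where each came from (provenance), which this sketch does not attempt.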

Page 9: WoK: A Web of Knowledge

Actualization (with Extraction Ontologies)

Find me the price and mileage of all red Nissans – I want a 1990 or newer.

Page 10: WoK: A Web of Knowledge

Data Extraction Demo

Page 11: WoK: A Web of Knowledge

Semantic Annotation Demo

Page 12: WoK: A Web of Knowledge

Free-Form Query Demo

Page 13: WoK: A Web of Knowledge

Explanation: How it Works

• Extraction Ontologies
• Semantic Annotation
• Free-Form Query Interpretation

Page 14: WoK: A Web of Knowledge

Extraction Ontologies

Object sets

Relationship sets

Participation constraints

Lexical

Non-lexical

Primary object set

Aggregation

Generalization/Specialization

Page 15: WoK: A Web of Knowledge

Extraction Ontologies

External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})?

Key Word Phrase

Left Context: $

Data Frame:

Internal Representation: float

Values

Key Words: ([Pp]rice)|([Cc]ost)| …

Operators

Operator: >

Key Words: (more\s*than)|(more\s*costly)|…
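A data frame of this kind can be approximated as a small Python class. The sketch below is an assumption-laden illustration: the class name `DataFrame`, its field names, and its methods are hypothetical, and the value pattern is an assumed variant of the slide's expression that also accepts thousands separators.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class DataFrame:
    """Hypothetical container for the data-frame parts named on the slide."""
    name: str
    external_rep: str                    # regex for how values appear in text
    to_internal: Callable[[str], float]  # converts matched text to the internal type
    keywords: str                        # regex for context keywords
    operators: Dict[str, str] = field(default_factory=dict)  # symbol -> keyword regex

    def values(self, text: str) -> List[float]:
        """Find all values in the text and convert them to the internal type."""
        return [self.to_internal(m.group(1))
                for m in re.finditer(self.external_rep, text)]

    def has_keyword(self, text: str) -> bool:
        return re.search(self.keywords, text) is not None

    def operator_in(self, text: str) -> Optional[str]:
        """Return the first operator whose keyword phrase occurs in the text."""
        for symbol, pattern in self.operators.items():
            if re.search(pattern, text):
                return symbol
        return None


price = DataFrame(
    name="Price",
    # Assumed pattern, not the slide's exact expression.
    external_rep=r"\$\s*(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)",
    to_internal=lambda s: float(s.replace(",", "")),
    keywords=r"[Pp]rice|[Cc]ost",
    operators={">": r"more\s*than|more\s*costly"},
)

print(price.values("Asking price: $11,500"))
print(price.operator_in("costs more than $5,000"))
```

Keeping the recognizers declarative (patterns as data rather than page-specific code) is what lets the same data frame be reused across pages in the same domain.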

Page 16: WoK: A Web of Knowledge

Generality & Resiliency of Extraction Ontologies

• Generality: assumptions about web pages
  – Data rich
  – Narrow domain
  – Document types
    • Single-record documents (hard, but doable)
    • Multiple-record documents (harder)
    • Records with scattered components (even harder)
• Resiliency: declarative
  – Still works when web pages change
  – Works for new, unseen pages in the same domain
  – Scalable, but takes work to declare the extraction ontology

Page 17: WoK: A Web of Knowledge

Semantic Annotation

Page 18: WoK: A Web of Knowledge

Free-Form Query Interpretation

• Parse Free-Form Query (with respect to data extraction ontology)
• Select Ontology
• Formulate Query Expression
• Run Query Over Semantically Annotated Data

Page 19: WoK: A Web of Knowledge

Parse Free-Form Query

“Find me the price and mileage of all red Nissans – I want a 1996 or newer”

Recognized terms: price, mileage, red, Nissan, 1996, or newer (>= Operator)

Page 20: WoK: A Web of Knowledge

Select Ontology

“Find me the price and mileage of all red Nissans – I want a 1996 or newer”

Page 21: WoK: A Web of Knowledge

Formulate Query Expression

• Conjunctive queries and aggregate queries
• Mentioned object sets are all of interest.
• Values and operator keywords determine conditions.
  – Color = “red”
  – Make = “Nissan”
  – Year >= 1996 (>= Operator)
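As a rough illustration of how mentioned object sets, recognized values, and operator keywords could yield conditions, here is a minimal Python sketch. The recognizer tables `OBJECT_SETS`, `VALUE_SETS`, and `OPERATORS` and the function `formulate` are hypothetical stand-ins for what an extraction ontology would supply.

```python
import re

# Toy recognizers for a hypothetical car ontology; all patterns are assumptions.
OBJECT_SETS = {"Price": r"\bprice\b", "Mileage": r"\bmileage\b"}
VALUE_SETS = {
    "Color": r"\b(red|blue|green|white|black)\b",
    "Make": r"\b(Nissan|Toyota|Honda)s?\b",
    "Year": r"\b(19\d{2}|20\d{2})\b",
}
OPERATORS = {">=": r"or\s+newer", "<=": r"or\s+older"}


def formulate(query):
    """Return (object sets to report, conjunctive conditions)."""
    wanted = [name for name, pat in OBJECT_SETS.items()
              if re.search(pat, query, re.IGNORECASE)]
    conditions = []
    for name, pat in VALUE_SETS.items():
        m = re.search(pat, query)
        if not m:
            continue
        op = "="  # default when no operator keyword follows the value
        if name == "Year":
            tail = query[m.end():]
            for symbol, op_pat in OPERATORS.items():
                if re.match(r"\s*" + op_pat, tail):
                    op = symbol
        value = int(m.group(1)) if name == "Year" else m.group(1)
        conditions.append((name, op, value))
    return wanted, conditions


wanted, conditions = formulate(
    "Find me the price and mileage of all red Nissans - I want a 1996 or newer")
print(wanted)
print(conditions)
```

The conditions are implicitly conjoined, matching the slide's restriction to conjunctive queries.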

Page 22: WoK: A Web of Knowledge

Formulate Query Expression

For
Let
Where
Return

Page 23: WoK: A Web of Knowledge

Run Query Over Semantically Annotated Data
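Running a conjunctive query over annotated data amounts to filtering records by the conditions and returning the requested object sets. The Python sketch below assumes annotated data is available as simple records; the record layout, the `run_query` function, and the sample cars (apart from C123's price and mileage) are invented for illustration.

```python
import operator

OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le,
       ">": operator.gt, "<": operator.lt}

# Hypothetical annotated records; years and the second and third cars are made up.
cars = [
    {"id": "C123", "Make": "Nissan", "Color": "red", "Year": 1998,
     "Price": 11500, "Mileage": 117000},
    {"id": "C124", "Make": "Nissan", "Color": "red", "Year": 1994,
     "Price": 4900, "Mileage": 160000},
    {"id": "C125", "Make": "Honda", "Color": "red", "Year": 1999,
     "Price": 9800, "Mileage": 98000},
]


def run_query(records, wanted, conditions):
    """Keep records satisfying every condition; return only the wanted fields."""
    results = []
    for rec in records:
        if all(name in rec and OPS[op](rec[name], value)
               for name, op, value in conditions):
            results.append({name: rec.get(name) for name in wanted})
    return results


results = run_query(
    cars,
    wanted=["Price", "Mileage"],
    conditions=[("Color", "=", "red"), ("Make", "=", "Nissan"), ("Year", ">=", 1996)],
)
print(results)
```

The loop, the `all(...)` test, and the returned dictionary loosely mirror the For, Where, and Return parts of the query expression.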

Page 24: WoK: A Web of Knowledge

Great! But Problems Still Need Resolution

• How do we create extraction ontologies?
  – Manual creation requires several dozen person hours
  – Semi-automatic creation
    • TISP (Table Interpretation by Sibling Pages)
    • TANGO (Table ANalysis for Generating Ontologies)
    • Nested Schemas with Regular Expressions
    • Synergistic Bootstrapping
    • Form-based Information Harvesting
• How do we scale up?
  – Practicalities of technology transfer and usage
  – Millions of queries over zillions of facts for thousands of ontologies

Page 25: WoK: A Web of Knowledge

Manual Creation

Page 26: WoK: A Web of Knowledge

Manual Creation

Page 27: WoK: A Web of Knowledge

Manual Creation

– Library of instance recognizers
– Library of lexicons

Page 28: WoK: A Web of Knowledge

Automatic Annotation with TISP (Table Interpretation by Sibling Pages)

• Recognize tables (discard non-tables)
• Locate table labels
• Locate table values
• Find label/value associations

Page 29: WoK: A Web of Knowledge

Recognize Tables

Data Table

Layout Tables (discard)

Nested Data Tables

Page 30: WoK: A Web of Knowledge

Locate Table Labels

Examples:
Identification.Gene model(s).Protein
Identification.Gene model(s).2

Page 31: WoK: A Web of Knowledge

Locate Table Labels

Examples:
Identification.Gene model(s).Gene Model
Identification.Gene model(s).2


Page 32: WoK: A Web of Knowledge

Locate Table Values

Value

Page 33: WoK: A Web of Knowledge

Find Label/Value Associations

Example:
(Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918
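One way to picture label/value association is to pair each cell with the dotted paths of its column label and row label. The Python sketch below is a guess at the idea rather than TISP's actual algorithm: the function `associate` is hypothetical, and every cell except WP:CE28918 is a placeholder because the slide does not give the other values.

```python
def associate(section, table_label, col_labels, rows):
    """Map (column-label path, row-label path) pairs to cell values.

    rows is a list of (row_label, [cell values]) in column order.
    """
    prefix = f"{section}.{table_label}"
    pairs = {}
    for row_label, cells in rows:
        for col_label, value in zip(col_labels, cells):
            pairs[(f"{prefix}.{col_label}", f"{prefix}.{row_label}")] = value
    return pairs


pairs = associate(
    section="Identification",
    table_label="Gene model(s)",
    col_labels=["Gene Model", "Protein"],
    rows=[
        ("1", ["<placeholder>", "<placeholder>"]),  # values not given in the slides
        ("2", ["<placeholder>", "WP:CE28918"]),
    ],
)

key = ("Identification.Gene model(s).Protein", "Identification.Gene model(s).2")
print(pairs[key])
```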


Page 34: WoK: A Web of Knowledge

Interpretation Technique: Sibling Page Comparison

Page 35: WoK: A Web of Knowledge

Interpretation Technique: Sibling Page Comparison

Same

Page 36: WoK: A Web of Knowledge

Interpretation Technique: Sibling Page Comparison

Almost Same

Page 37: WoK: A Web of Knowledge

Interpretation Technique: Sibling Page Comparison

Different

Same

Page 38: WoK: A Web of Knowledge

Technique Details

• Unnest tables
• Match tables in sibling pages
  – “Perfect” match (layout table: discard)
  – “Reasonable” match (sibling table)
• Determine & use table-structure pattern
  – Discover pattern
  – Pattern usage
  – Dynamic pattern adjustment
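The intuition behind sibling page comparison is that cells identical across sibling pages are likely labels, while cells that vary are likely values. The Python sketch below illustrates that intuition only: the function names, the 0.3 threshold for a "reasonable" match, and the two sample tables are assumptions, not details from the slides.

```python
def classify_cells(table_a, table_b):
    """Compare same-shaped tables from two sibling pages cell by cell.

    Cells with identical text are treated as labels, differing cells as values.
    """
    labels, values = [], []
    for r, (row_a, row_b) in enumerate(zip(table_a, table_b)):
        for c, (a, b) in enumerate(zip(row_a, row_b)):
            (labels if a == b else values).append((r, c))
    return labels, values


def match_kind(table_a, table_b, threshold=0.3):
    """Label the match; the threshold is an arbitrary illustrative choice."""
    labels, values = classify_cells(table_a, table_b)
    total = len(labels) + len(values)
    if total == 0:
        return "different"
    same = len(labels) / total
    if same == 1.0:
        return "perfect"      # identical on both pages: likely layout, discard
    if same >= threshold:
        return "reasonable"   # shared labels, varying values: sibling table
    return "different"


# Hypothetical tables from two sibling pages.
page_1 = [["Name", "alpha"], ["Length", "120"]]
page_2 = [["Name", "beta"], ["Length", "98"]]

labels, values = classify_cells(page_1, page_2)
print(labels, values, match_kind(page_1, page_2))
```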

Page 39: WoK: A Web of Knowledge

Generated RDF

Page 40: WoK: A Web of Knowledge

WoK Demo (via TISP)

Page 41: WoK: A Web of Knowledge

Semi-Automatic Annotation with TANGO (Table Analysis for Generating Ontologies)

• Recognize and normalize table information
• Construct mini-ontologies from tables
• Discover inter-ontology mappings
• Merge mini-ontologies into a growing ontology

Page 42: WoK: A Web of Knowledge

Recognize Table Information

Country      Population        Religion
             (July 2001 est.)  Albanian Orthodox  Muslim  Roman Catholic  Shi’a Muslim  Sunni Muslim  other
Afghanistan  26,813,057                                                   15%           84%           1%
Albania      3,510,484         20%                70%     10%

Page 43: WoK: A Web of Knowledge

Construct Mini-Ontology

Country      Population        Religion
             (July 2001 est.)  Albanian Orthodox  Muslim  Roman Catholic  Shi’a Muslim  Sunni Muslim  other
Afghanistan  26,813,057                                                   15%           84%           1%
Albania      3,510,484         20%                70%     10%
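A mini-ontology built from such a table can be pictured as object sets plus relationship instances derived from the normalized rows. The Python sketch below is illustrative only: it assumes the percentages line up with the religion columns as shown in the `rows` data, and the relationship-set names and the `mini_ontology` function are hypothetical rather than TANGO's actual output.

```python
# Rows normalized from the slide's table; the column assignment of the
# percentages is an assumption about the original layout.
rows = [
    ("Afghanistan", 26813057,
     {"Shi'a Muslim": 15, "Sunni Muslim": 84, "other": 1}),
    ("Albania", 3510484,
     {"Albanian Orthodox": 20, "Muslim": 70, "Roman Catholic": 10}),
]


def mini_ontology(rows):
    """Derive object sets and relationship instances from normalized rows."""
    object_sets = {"Country", "Population", "Religion", "Percent"}
    relationships = []
    for country, population, religions in rows:
        relationships.append(("Country-Population", country, population))
        for religion, percent in religions.items():
            relationships.append(
                ("Country-Religion-Percent", country, religion, percent))
    return object_sets, relationships


object_sets, relationships = mini_ontology(rows)
for rel in relationships:
    print(rel)
```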

Page 44: WoK: A Web of Knowledge

Discover Mappings

Page 45: WoK: A Web of Knowledge

Merge

Page 46: WoK: A Web of Knowledge

Semi-Automatic Annotation via Synergistic Bootstrapping
(Based on Nested Schemas with Regular Expressions)

• Build a page-layout, pattern-based annotator
• Automate layout recognition based on examples
• Auto-generate examples with extraction ontologies
• Synergistically run pattern-based annotator & extraction-ontology annotator

Page 47: WoK: A Web of Knowledge

PatML Editor

Browser-Rendered Page

Page Source Text

Information Structure Tree

Page 48: WoK: A Web of Knowledge
Page 49: WoK: A Web of Knowledge

Synergistic Execution

Extraction Ontology

Document

Conceptual Annotator

(ontology-based annotation)

Partially Annotated Document

Structural Annotator

(layout-driven annotation)

Annotated Document

Layout Patterns

Pattern Generation

Page 50: WoK: A Web of Knowledge

Form-Based Information Harvesting

• Forms
  – General familiarity
  – Reasonable conceptual framework
  – Appropriate correspondence
    • Transformable to ontological descriptions
    • Capable of accepting source data
• Instance recognizers
  – Some pre-existing instance recognizers
  – Lexicons
• Automated extraction ontology creation?

Page 51: WoK: A Web of Knowledge

Form Creation

Basic form-construction facilities:
• single-entry field
• multiple-entry field
• nested form
• …

Page 52: WoK: A Web of Knowledge

Created Sample Form

Page 53: WoK: A Web of Knowledge

Generated Ontology View

Page 54: WoK: A Web of Knowledge

Source-to-Form Mapping

Page 55: WoK: A Web of Knowledge

Source-to-Form Mapping

Page 56: WoK: A Web of Knowledge

Source-to-Form Mapping

Page 57: WoK: A Web of Knowledge

Source-to-Form Mapping

Page 58: WoK: A Web of Knowledge

Almost Ready to Harvest

• Need reading path: DOM-tree structure
• Need to resolve mapping problems
  – Split/Merge
  – Union/Selection

Page 59: WoK: A Web of Knowledge

Almost Ready to Harvest …

• Need reading path: DOM-tree structure
• Need to resolve mapping problems
  – Split/Merge
  – Union/Selection

Name
Voltage-dependent anion-selective channel protein 3
VDAC-3
hVDAC3
Outer mitochondrial membrane Protein porin 3

Page 60: WoK: A Web of Knowledge

Almost Ready to Harvest …

• Need reading path: DOM-tree structure
• Need to resolve mapping problems
  – Split/Merge
  – Union/Selection

Name
Voltage-dependent anion-selective channel protein 3
VDAC-3
hVDAC3
Outer mitochondrial membrane Protein porin 3

Page 61: WoK: A Web of Knowledge

Almost Ready to Harvest …

• Need reading path: DOM-tree structure
• Need to resolve mapping problems
  – Split/Merge
  – Union/Selection

Name
T-complex protein 1 subunit theta
TCP-1-theta
CCT-theta
Renal carcinoma antigen NY-REN-15

Page 62: WoK: A Web of Knowledge

Almost Ready to Harvest …

• Need reading path: DOM-tree structure
• Need to resolve mapping problems
  – Split/Merge
  – Union/Selection

Name
T-complex protein 1 subunit theta
TCP-1-theta
CCT-theta
Renal carcinoma antigen NY-REN-15

Page 63: WoK: A Web of Knowledge

Can Now Harvest

Name

Page 64: WoK: A Web of Knowledge

Can Now Harvest

Name

14-3-3 protein epsilon
Mitochondrial import stimulation factor L subunit
Protein kinase C inhibitor protein-1
KCIP-1
14-3-3E

Page 65: WoK: A Web of Knowledge

Can Now Harvest

Name

Voltage-dependent anion-selective channel protein 3
VDAC-3
hVDAC3
Outer mitochondrial membrane Protein porin 3

Page 66: WoK: A Web of Knowledge

Can Now Harvest

Name

Tryptophanyl-tRNA synthetase, mitochondrial precursor
EC 6.1.1.2
Tryptophan--tRNA ligase
TrpRS
(Mt)TrpRS

Page 67: WoK: A Web of Knowledge

Harvesting Populates Ontology

Page 68: WoK: A Web of Knowledge

Harvesting Populates Ontology

Also helps adjust ontology constraints

Page 69: WoK: A Web of Knowledge

Can Harvest from Additional Sites

Name

T-complex protein 1 subunit theta
TCP-1-theta
CCT-theta
Renal carcinoma antigen NY-REN-15

Page 70: WoK: A Web of Knowledge

Automating Extraction Ontology Creation

Lexicons

Name
14-3-3 protein epsilon
Mitochondrial import stimulation factor L subunit
Protein kinase C inhibitor protein-1
KCIP-1
14-3-3E

Name
T-complex protein 1 subunit theta
TCP-1-theta
CCT-theta
Renal carcinoma antigen NY-REN-15

Name
Tryptophanyl-tRNA synthetase, mitochondrial precursor
EC 6.1.1.2
Tryptophan--tRNA ligase
TrpRS
(Mt)TrpRS

Resulting lexicon:
…
14-3-3 protein epsilon
Mitochondrial import stimulation factor L subunit
Protein kinase C inhibitor protein-1
KCIP-1
14-3-3E
…
T-complex protein 1 subunit theta
TCP-1-theta
CCT-theta
Renal carcinoma antigen NY-REN-15
…
Tryptophanyl-tRNA synthetase, mitochondrial precursor
EC 6.1.1.2
Tryptophan--tRNA ligase
TrpRS
(Mt)TrpRS
…
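Harvested names can seed a lexicon that then recognizes the same names on unseen pages. A minimal Python sketch of that idea follows; the function `build_recognizer` and the sample sentence are invented for illustration, and a real lexicon-based recognizer would likely need normalization and fuzzier matching.

```python
import re

# A few names harvested from the forms shown above.
harvested_names = [
    "14-3-3 protein epsilon",
    "KCIP-1",
    "T-complex protein 1 subunit theta",
    "TCP-1-theta",
    "CCT-theta",
    "VDAC-3",
]


def build_recognizer(names):
    """Compile a lexicon into one pattern, longest names first so that
    longer names are preferred over shorter alternatives."""
    alternatives = sorted((re.escape(n) for n in set(names)),
                          key=len, reverse=True)
    return re.compile("|".join(alternatives))


recognizer = build_recognizer(harvested_names)

# Hypothetical text from a new source page.
found = recognizer.findall("Binding of KCIP-1 and CCT-theta was observed.")
print(found)
```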

Page 71: WoK: A Web of Knowledge

Automating Extraction Ontology Creation

Instance Recognizers

Number Patterns

Context Keywords and Phrases

Page 72: WoK: A Web of Knowledge

Automatic Source-to-Form Mapping

Page 73: WoK: A Web of Knowledge

Automatic Semantic Annotation

Recognize and annotate with respect to an ontology

Page 74: WoK: A Web of Knowledge

Practicalities: WoK Query Interfaces (Future Work)

• Advanced free-form queries with disjunction and negation
• Form-based query language
• Table-based query languages
• Graphical query languages

Page 75: WoK: A Web of Knowledge

Practicalities: Bootstrapping the WoK (Future Work)

• Won’t just happen without sufficient content
• Niche applications
  – Historical Data (e.g. Genealogy)
  – Topical Blogs
• Local WoKs
  – Intra-organizational effort
  – Individual interests

Page 76: WoK: A Web of Knowledge

Practicalities: Scalability (Future Work)

• Potential rapid growth
  – Thousands of ontologies
  – Millions of simultaneous queries
  – Billions of annotated pages
  – Trillions of facts
• Search-engine-like caching & query processing

Page 77: WoK: A Web of Knowledge

Key to Success: Simplicity via Automation

• Automatic (or near automatic) creation of extraction ontologies
• Automatic (or near automatic) annotation of web pages
• Simple but accurate query specification without specialized training

www.deg.byu.edu