Semantic Understanding: An Approach Based on Information-Extraction Ontologies
David W. Embley, Brigham Young University


Presentation Outline

• Grand Challenge
• Meaning, Knowledge, Information, Data
• Fun and Games with Data
• Information Extraction Ontologies
• Applications
• Limitations and Pragmatics
• Summary and Challenges

Grand Challenge

Semantic Understanding

Can we quantify & specify the nature of this grand challenge?

“If ever there were a technology that could generate trillions of dollars in savings worldwide …, it would be the technology that makes business information systems interoperable.”

(Jeffrey T. Pollock, VP of Technology Strategy, Modulant Solutions)

“The Semantic Web: … content that is meaningful to computers [and that] will unleash a revolution of new possibilities … Properly designed, the Semantic Web can assist the evolution of human knowledge …”

(Tim Berners-Lee, …, Weaving the Web)

“20th Century: Data Processing”
“21st Century: Data Exchange”
“The issue now is mutual understanding.”

(Stefano Spaccapietra, Editor in Chief, Journal on Data Semantics)

“The Grand Challenge [of semantic understanding] has become mission critical. Current solutions … won’t scale. Businesses need economic growth dependent on the web working and scaling (cost: $1 trillion/year).”

(Michael Brodie, Chief Scientist, Verizon Communications)

What is Semantic Understanding?

Understanding: “To grasp or comprehend [what’s] intended or expressed.”

Semantics: “The meaning or the interpretation of a word, sentence, or other language form.”

- Dictionary.com

Can We Achieve Semantic Understanding?

“A computer doesn’t truly ‘understand’ anything.”

But computers can manipulate terms “in ways that are useful and meaningful to the human user.”

- Tim Berners-Lee

Key Point: it only has to be good enough. And that’s our challenge and our opportunity!

Presentation Outline Grand Challenge Meaning, Knowledge, Information, Data Fun and Games with Data Information Extraction Ontologies Applications Limitations and Pragmatics Summary and Challenges

Information Value Chain

Data → Information → Knowledge → Meaning

Translating data into meaning

Foundational Definitions

• Meaning: knowledge that is relevant or activates
• Knowledge: information with a degree of certainty or community agreement (ontology)
• Information: data in a conceptual framework
• Data: attribute-value pairs

- Adapted from [Meadow92]


Data

• Attribute-value pairs
  – Fundamental for information
  – Thus, fundamental for knowledge & meaning

Data Frame

• Extensive knowledge about a data item
  – Everyday data: currency, dates, time, weights & measures
  – Textual appearance, units, context, operators, I/O conversion
• An abstract data type with an extended framework
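The data-frame idea can be sketched in code. The following Python is purely illustrative: the class, its patterns, and its keyword list are assumptions for this sketch, not the actual data-frame machinery. A data frame packages value-appearance patterns and context keywords together with the data type.

```python
import re

# Illustrative sketch of a data frame for everyday "Mileage" data.
# The patterns and keywords below are assumptions, not the real recognizers.
class MileageFrame:
    value_patterns = [
        r"\b[1-9]\d{0,2}[kK]\b",          # e.g. "7k", "120K"
        r"\b[1-9]\d{0,2}(,\d{3})*\b",     # e.g. "7,000"
    ]
    context_keywords = [r"\bmiles\b", r"\bmileage\b"]

    @classmethod
    def recognize(cls, text):
        """Return (matched string, start, end) for candidate mileage values."""
        hits = []
        for pat in cls.value_patterns:
            for m in re.finditer(pat, text):
                hits.append((m.group(), m.start(), m.end()))
        return hits

print(MileageFrame.recognize("only 7,000 miles"))  # [('7,000', 5, 10)]
```

The point of the abstraction: appearance, units, and context knowledge travel with the data item rather than living in application code.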

Fun and Games with Data

?

Olympus C-750 Ultra Zoom
Sensor Resolution: 4.2 megapixels
Optical Zoom: 10 x
Digital Zoom: 4 x
Installed Memory: 16 MB
Lens Aperture: F/8-2.8/3.7
Focal Length min: 6.3 mm
Focal Length max: 63.0 mm

Digital Camera

?

Year: 2002
Make: Ford
Model: Thunderbird
Mileage: 5,500 miles
Features: Red, ABS, 6 CD changer, keyless entry
Price: $33,000
Phone: (916) 972-9117

Car Advertisement

?

Flight #   Class  From  Time      Date      To   Time      Date      Stops
Delta 16   Coach  JFK   6:05 pm   02-01-04  CDG  7:35 am   03-01-04  0
Delta 119  Coach  CDG   10:20 am  09-01-04  JFK  1:00 pm   09-01-04  0

Airline Itinerary

?

Monday, October 13th

Group A       W  L  T  GF  GA  Pts.
USA           3  0  0  11   1    9
Sweden        2  1  0   5   3    6
North Korea   1  2  0   3   4    3
Nigeria       0  3  0   0  11    0

Group B       W  L  T  GF  GA  Pts.
Brazil        2  0  1   8   2    7
…

World Cup Soccer

?

Calories: 250 cal
Distance: 2.50 miles
Time: 23.35 minutes
Incline: 1.5 degrees
Speed: 5.2 mph
Heart Rate: 125 bpm

Treadmill Workout

?

Place: Bonnie Lake
County: Duchesne
State: Utah
Type: Lake
Elevation: 10,000 feet
USGS Quad: Mirror Lake
Latitude: 40.711ºN
Longitude: 110.876ºW

Maps


Information Extraction Ontologies

• Information Extraction (from a source)
• Information Exchange (to a target)

What is an Extraction Ontology?

An Augmented Conceptual-Model Instance
• Object & relationship sets
• Constraints
• Data frame value recognizers

A Robust Wrapper (Ontology-Based Wrapper)
• Extracts information
• Works even when a site changes or when new sites come on-line

[Diagram: CarAds extraction ontology. Object sets Year, Make, Model, Trim, Mileage, Price, PhoneNr, and Feature (with Color, BodyType, Engine, Transmission, Accessory, OtherFeature), linked by "has" relationships with participation constraints such as 0:1, 1:*, 0:0.7:1, 0:0.9:1, and 0:0.78:1.]


CarAds Extraction Ontology

<ObjectSet x="329" y="51" lexical="true" name="Mileage" id="osmx50">
  <DataFrame>
    <InternalRepresentation>
      <DataType typeName="String"/>
    </InternalRepresentation>
    <ValuePhraseList>
      <ValuePhrase hint="Mileage Pattern 1">
        <ValueExpression color="ffffff">
          <ExpressionText>[1-9]\d{0,2}[kK]</ExpressionText>
        </ValueExpression>
        <LeftContextExpression color="ffffff">
        …
    <KeywordPhraseList>
      <KeywordPhrase hint="New phrase 1">
        <KeywordExpression color="ffffff">
          <ExpressionText>\bmiles\b</ExpressionText>
          …

Extraction Ontologies: An Example of Semantic Understanding

• “Intelligent” symbol manipulation
• Gives the “illusion of understanding”
• Obtains meaningful and useful results


A Variety of Applications

• Information Extraction
• Semantic Web Page Annotation
• Free-Form Semantic Web Queries
• Task Ontologies for Free-Form Service Requests
• High-Precision Classification
• Schema Mapping for Ontology Alignment
• Accessing the Hidden Web
• Ontology Generation
• Challenging Applications (e.g., BioInformatics)

Application #1

Information Extraction

'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800 or 566-3888


Constant/Keyword Recognition

Descriptor|String|Position(start|end)

Year|97|2|3
Make|CHEV|5|8
Make|CHEVY|5|9
Model|Cavalier|11|18
Feature|Red|21|23
Feature|5 spd|26|30
Mileage|7,000|38|42
KEYWORD(Mileage)|miles|44|48
Price|11,995|100|105
Mileage|11,995|100|105
PhoneNr|566-3800|136|143
PhoneNr|566-3888|148|155
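A sketch of how value recognizers could produce such Descriptor|String|Position tuples. The regular expressions here are simplified stand-ins for the ontology's actual value expressions, and positions are 0-indexed.

```python
import re

# Illustrative recognizers -- simplified stand-ins, not the ontology's
# actual value expressions.  Positions are 0-indexed, inclusive.
recognizers = {
    "Year":    r"\b[1-9]\d\b",
    "Make":    r"\bCHEVY?\b",
    "Model":   r"\bCavalier\b",
    "Mileage": r"\b[1-9]\d{0,2}(?:,\d{3})*\b",
    "PhoneNr": r"\b\d{3}-\d{4}\b",
}
keywords = {"Mileage": r"\bmiles\b"}

def recognize(ad):
    """Emit (Descriptor, String, Start, End) tuples."""
    hits = []
    for name, pat in recognizers.items():
        for m in re.finditer(pat, ad):
            hits.append((name, m.group(), m.start(), m.end() - 1))
    for name, pat in keywords.items():
        for m in re.finditer(pat, ad):
            hits.append((f"KEYWORD({name})", m.group(), m.start(), m.end() - 1))
    return hits

hits = recognize("'97 CHEVY Cavalier, only 7,000 miles. 566-3800")
for h in hits:
    print("|".join(str(x) for x in h))
```

Note the deliberate overlaps: the Mileage pattern also matches the year and part of the phone number, which is precisely the ambiguity the heuristics below must resolve.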

Heuristics

• Keyword proximity
• Subsumed and overlapping constants
• Functional relationships
• Nonfunctional relationships
• First occurrence without constraint violation

Keyword Proximity

'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800 or 566-3888


Subsumed/Overlapping Constants

'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800 or 566-3888


Functional Relationships

'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800 or 566-3888


Nonfunctional Relationships

'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800 or 566-3888


First Occurrence without Constraint Violation

'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800 or 566-3888


Database-Instance Generator

insert into Car values(1001, "97", "CHEVY", "Cavalier", "7,000", "11,995", "566-3800")
insert into CarFeature values(1001, "Red")
insert into CarFeature values(1001, "5 spd")
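A minimal sketch of such a generator in Python. The column order, the OID argument, and the helper name are assumptions for illustration, not the system's actual design.

```python
# Hypothetical sketch of the database-instance generator: column order,
# the OID argument, and the helper name are assumptions for illustration.
def generate_inserts(oid, record, features):
    cols = ["Year", "Make", "Model", "Mileage", "Price", "PhoneNr"]
    values = ", ".join(f'"{record.get(c, "")}"' for c in cols)
    stmts = [f"insert into Car values({oid}, {values})"]
    for f in features:
        stmts.append(f'insert into CarFeature values({oid}, "{f}")')
    return stmts

record = {"Year": "97", "Make": "CHEVY", "Model": "Cavalier",
          "Mileage": "7,000", "Price": "11,995", "PhoneNr": "566-3800"}
for s in generate_inserts(1001, record, ["Red", "5 spd"]):
    print(s)
```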

Application #2

Semantic Web Page Annotation

Annotated Web Page

OWL

<owl:Class rdf:ID="CarAds">
  <rdfs:label xml:lang="en">CarAds</rdfs:label>
  ……
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#hasMileage" />
      <owl:minCardinality rdf:datatype="&xsd;nonNegativeInteger">0</owl:minCardinality>
    </owl:Restriction>
  </rdfs:subClassOf>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#hasMileage" />
      <owl:maxCardinality rdf:datatype="&xsd;nonNegativeInteger">1</owl:maxCardinality>
    </owl:Restriction>
  </rdfs:subClassOf>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#hasMileage" />
      <owl:allValuesFrom rdf:resource="#Mileage" />
    </owl:Restriction>
  </rdfs:subClassOf>
  ……
</owl:Class>
……
<owl:Class rdf:ID="Mileage">
  <rdfs:label xml:lang="en">Mileage</rdfs:label>
  ……
</owl:Class>
……
<CarAds rdf:ID="CarAdsIns2">
  <CarAdsValue rdf:datatype="&xsd;string">2</CarAdsValue>
</CarAds>
……
<Mileage rdf:ID="MileageIns2">
  <StartingCharPosition rdf:datatype="&xsd;nonNegativeInteger">237</StartingCharPosition>
  <EndingCharPosition rdf:datatype="&xsd;nonNegativeInteger">241</EndingCharPosition>
</Mileage>
……
<owl:Thing rdf:about="#CarAdsIns2">
  <hasMake rdf:resource="#MakeIns2" />
  <hasModel rdf:resource="#ModelIns2" />
  <hasYear rdf:resource="#YearIns2" />
  <hasMileage rdf:resource="#MileageIns2" />
  <hasPhoneNr rdf:resource="#PhoneNrIns2" />
  <hasPrice rdf:resource="#PriceIns2" />
</owl:Thing>
……

Application #3

Free-Form Semantic Web Queries

Step 1. Parse Query

“Find me the price and mileage of all red Nissans – I want a 1998 or newer”

Recognized value phrases: price, mileage, red, Nissan, 1998, “or newer” (>= operator)

Step 2. Find Corresponding Ontology

“Find me the price and mileage of all red Nissans – I want a 1998 or newer”

Candidate ontologies score similarity values of 6 and 2; the higher-scoring ontology is selected.

Step 3. Formulate XQuery Expression

Conjunctive queries run over selected ontology’s extracted values

<Car rdf:ID="CarIns7">
  <CarValue rdf:datatype="&xsd;string">7</CarValue>
</Car>
<Make rdf:ID="MakeIns7">
  <MakeValue rdf:datatype="&xsd;string">Nissan</MakeValue>
  <ontos:URI rdf:datatype="&xsd;string">MakeIns7</ontos:URI>
  <offset rdf:datatype="&xsd;nonNegativeInteger">41893</offset>
</Make>
<Year rdf:ID="YearIns7">
  <YearValue rdf:datatype="&xsd;string">1999</YearValue>
  <ontos:URI rdf:datatype="&xsd;string">YearIns7</ontos:URI>
  <offset rdf:datatype="&xsd;nonNegativeInteger">41641</offset>
</Year>
<Color rdf:ID="ColorIns7">
  <ColorValue rdf:datatype="&xsd;string">red</ColorValue>
  <ontos:URI rdf:datatype="&xsd;string">ColorIns7</ontos:URI>
  <offset rdf:datatype="&xsd;nonNegativeInteger">42186</offset>
</Color>

<owl:Thing rdf:about="#CarIns7">
  <hasMake rdf:resource="#MakeIns7" />
  <hasYear rdf:resource="#YearIns7" />
  <hasColor rdf:resource="#ColorIns7" />
  <hasMileage rdf:resource="#MileageIns7" />
  <hasPrice rdf:resource="#PriceIns7" />
</owl:Thing>

Value-phrase-matching words determine conditions:

• Color = “red”
• Make = “Nissan”
• Year >= 1998 (>= operator)

Step 3. Formulate XQuery Expression

for $doc in document("file:///c:/ontos/owlLib/Car.OWL")/rdf:RDF
for $Record in $doc/owl:Thing    (: for each owl:Thing :)

(: get the instance ID and extracted values :)
let $id := substring-after(xs:string($Record/@rdf:about), "CarIns")
let $Color := $doc/car:Color[@rdf:ID=concat("ColorIns", $id)]/car:ColorValue/text()
let $Make := $doc/car:Make[@rdf:ID=concat("MakeIns", $id)]/car:MakeValue/text()
let $Year := $doc/car:Year[@rdf:ID=concat("YearIns", $id)]/car:YearValue/text()
let $Price := $doc/car:Price[@rdf:ID=concat("PriceIns", $id)]/car:PriceValue/text()
let $Mileage := $doc/car:Mileage[@rdf:ID=concat("MileageIns", $id)]/car:MileageValue/text()

(: check conditions :)
where ($Color="red" or empty($Color)) and
      ($Make="Nissan" or empty($Make)) and
      ($Year>="1998" or empty($Year))

(: return values :)
return <Record ID="{$id}">
         <Price>{$Price}</Price>
         <Mileage>{$Mileage}</Mileage>
         <Color>{$Color}</Color>
         <Make>{$Make}</Make>
         <Year>{$Year}</Year>
       </Record>

Step 4. Run XQuery Expression Over Ontology’s Extracted Data

• Uses Qexo 1.7, GNU’s XQuery engine for Java
• Uses XSLT to transform results to an HTML table

Application #4

Task Ontologies for Free-Form Service Requests

Challenges for Web Services

• Help users find and use services
• Reduce requirements for service specification and resolution

I want to see a dermatologist between the 12th and 15th, at 1:00 PM or after. The dermatologist should be within 15 miles from my home and must accept my IHC insurance.

An Ontological Solution

Domain Ontology
• Has a single object set of interest (e.g., Appointment)
• Establishes requirements for insertion of a single object into the object set of interest (e.g., requirements for making an appointment)
• Has extensional recognizers (i.e., can match a request to requirements)

Process Ontology
• Recognizes constraints
• Obtains information (from DB and from user)
• Satisfies constraints
• Resolves issues (if necessary)

Domain Ontology

[Diagram: Appointment is the object set of interest, connected to Date (is on), Time (is at), Duration, Cost, and Person (with Name and Address), and is with a Service Provider. Service Provider specializes into Medical Service Provider (Doctor: Pediatrician, Dermatologist), Auto Service Provider (Auto Mechanic), and Insurance Salesperson (sells Insurance); a provider provides a Service Description, accepts Insurance, and is at an Address. Legend: optional, mandatory, and functional (->) participation.]


I want to see a dermatologist between the 12th and 15th, at 1:00 PM or after. The dermatologist should be within 15 miles from my home and must accept my IHC insurance.

Appointment
  context keywords/phrases: appointment|want to see a|…

Time
  textual representation: ([2-9]|1[012]?)\s*:\s*([0-5]\d)\s*[AaPp]\s*\.?\s*[Mm]\s*\.?
  TimeAtOrAfter(t1: Time, t2: Time) returns (Boolean)
    contextual keywords/phrases: (at\s+)?Time\s+or\s+after|…

Date
  DateBetween(x1: Date, x2: Date, x3: Date) returns (Boolean)
    contextual keywords/phrases: between\s+the\s+Date\s+and\s+Date
  …

Distance
  textual representation: ((\d+(\.\d+)?)|(\.\d+))
  context keywords/phrases: miles?|kilometers?|…
  DistanceLessThanOrEqual(d1: Distance, d2: Distance) returns (Boolean)
    contextual keywords/phrases: (within|…)\s+Distance|…

Example: Appointment Request

Example: Car Purchase Request

Example: Apartment Request

Application #5

High-Precision Classification

An Extraction Ontology Solution

Document 1: Car Ads

Document 2: Items for Sale or Rent

Density Heuristic

Document 1: Car Ads

Year: 3
Make: 2
Model: 3
Mileage: 1
Price: 1
Feature: 15
PhoneNr: 3

Expected Values Heuristic

Document 2: Items for Sale or Rent

Year: 1
Make: 0
Model: 0
Mileage: 1
Price: 0
Feature: 0
PhoneNr: 4

Vector Space of Expected Values

          OV     D1    D2
Year      0.98   16     6
Make      0.93   10     0
Model     0.91   12     0
Mileage   0.45    6     2
Price     0.80   11     8
Feature   2.10   29     0
PhoneNr   1.15   15    11

Similarity: D1: 0.996, D2: 0.567
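These similarity values are consistent with a cosine measure between the expected-values vector OV and each document vector. A sketch in Python; treating the measure as plain cosine similarity is an assumption of this sketch.

```python
import math

# Sketch: expected-values vector (OV) vs. document hit vectors.
# Treating the similarity as a cosine measure is an assumption.
ov = {"Year": 0.98, "Make": 0.93, "Model": 0.91, "Mileage": 0.45,
      "Price": 0.80, "Feature": 2.10, "PhoneNr": 1.15}
d1 = {"Year": 16, "Make": 10, "Model": 12, "Mileage": 6,
      "Price": 11, "Feature": 29, "PhoneNr": 15}
d2 = {"Year": 6, "Make": 0, "Model": 0, "Mileage": 2,
      "Price": 8, "Feature": 0, "PhoneNr": 11}

def cosine(u, v):
    def norm(w):
        return math.sqrt(sum(x * x for x in w.values()))
    return sum(u[k] * v[k] for k in u) / (norm(u) * norm(v))

print(round(cosine(ov, d1), 3))  # 0.996
print(round(cosine(ov, d2), 3))  # 0.567
```

The car-ads document (D1) points in nearly the same direction as the expected-values vector; the for-sale-or-rent document (D2) does not.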

Grouping Heuristic

Document 1: Car Ads (groups of four consecutive 1-max object-set hits, with distinct count):
Year Year Make Model → 3
Price Year Model Year → 3
Make Model Mileage Year → 4
Model Mileage Price Year → 4
…
Grouping: 0.875

Document 2: Items for Sale or Rent:
Year Year Year Mileage → 2
Mileage Year Price Price → 3
Year Price Price Year → 2
Price Price Price Price → 1
…
Grouping: 0.500

Expected Number in a Group = floor(∑ Ave) = 4 (for our example)

Grouping = (Sum of distinct 1-max object sets in each group) / (Number of Groups × Expected Number in a Group)

Car Ads: (3+3+4+4) / (4×4) = 0.875
Sale Items: (2+3+2+1) / (4×4) = 0.500
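The grouping computation can be sketched as follows (Python; counting distinct object-set names per fixed-size group is a simplification of the 1-max accounting):

```python
# Sketch of the grouping measure: distinct object-set names per fixed-size
# group, normalized by (number of groups x expected group size).  Counting
# set size is a simplification of the "distinct 1-max object sets" accounting.
def grouping_measure(hit_sequence, group_size):
    groups = [hit_sequence[i:i + group_size]
              for i in range(0, len(hit_sequence), group_size)]
    groups = [g for g in groups if len(g) == group_size]
    distinct = sum(len(set(g)) for g in groups)
    return distinct / (len(groups) * group_size)

car_ads = ["Year", "Year", "Make", "Model", "Price", "Year", "Model", "Year",
           "Make", "Model", "Mileage", "Year", "Model", "Mileage", "Price", "Year"]
sale    = ["Year", "Year", "Year", "Mileage", "Mileage", "Year", "Price", "Price",
           "Year", "Price", "Price", "Year", "Price", "Price", "Price", "Price"]

print(grouping_measure(car_ads, 4))  # 0.875
print(grouping_measure(sale, 4))     # 0.5
```

A high grouping score means hits cluster the way records do in the target domain, which is what distinguishes a page of car ads from a mixed for-sale page.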

Application #6

Schema Mapping for Ontology Alignment

Problem: Different Schemas

Target Database Schema
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

Different Source Table Schemas
• {Run #, Yr, Make, Model, Tran, Color, Dr}
• {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
• {Vehicle, Distance, Price, Mileage}
• {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}

Solution: Remove Internal Factoring

Discover Nesting: Make, (Model, (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*)*

Unnest: μ(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table

Solution: Replace Boolean Values

[Table screenshot: β operators replace each Boolean "Yes" with its attribute name: β_Auto β_Air Cond. β_AM/FM β_CD Table]
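The unnest (μ) and Boolean-replacement (β) steps can be sketched in Python. The nested-dict representation and the key names "models" and "years" are assumptions of this sketch, not the system's actual data structures.

```python
def unnest(rows, nested_key):
    """mu: flatten one level of internal factoring."""
    flat = []
    for row in rows:
        for sub in row[nested_key]:
            merged = {k: v for k, v in row.items() if k != nested_key}
            merged.update(sub)
            flat.append(merged)
    return flat

def beta(rows, attr):
    """beta: replace a Boolean 'Yes' with the attribute's own name."""
    return [{**r, attr: attr if r.get(attr) == "Yes" else ""} for r in rows]

# One factored source row: Make -> Model -> (Year, Colour, Price, ...)
table = [{"Make": "Honda", "models": [
            {"Model": "Civic EX", "years": [
                {"Year": "1995", "Colour": "White", "Price": "$6300",
                 "Auto": "Yes", "Air Cond.": "Yes", "AM/FM": "Yes", "CD": ""}]}]}]

flat = unnest(unnest(table, "models"), "years")
for a in ["Auto", "Air Cond.", "AM/FM", "CD"]:
    flat = beta(flat, a)
print(flat[0])
```

After the two μ steps and the β replacements, each row is a flat set of attribute-value pairs ready for mapping to the target schema.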

Solution: Form Attribute-Value Pairs


<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>, <CD, >

Solution: Adjust Attribute-Value Pairs


<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto>, <Air Cond>, <AM/FM>

Solution: Do Extraction


Solution: Infer Mappings


{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

Each row is a car.

π_Model μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
π_Make μ(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
π_Year Table

Note: Mappings produce sets for attributes. Joining to form records is trivial because we have OIDs for table rows (e.g., for each Car).


Solution: Do Extraction

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

π_Price Table

Solution: Do Extraction

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

ρ_Colour←Feature π_Colour Table
∪ ρ_Auto←Feature π_Auto β_Auto Table
∪ ρ_Air Cond.←Feature π_Air Cond. β_Air Cond. Table
∪ ρ_AM/FM←Feature π_AM/FM β_AM/FM Table
∪ ρ_CD←Feature π_CD β_CD Table

Application #7

Accessing the Hidden Web

Obtaining Data Behind Forms

• Web information is stored in databases
• Databases are accessed through forms
• Forms are designed in various ways

Hidden Web Extraction System

[Diagram: a User Query and a Site Form feed an Input Analyzer, which produces Retrieved Page(s); an Output Analyzer then yields Extracted Information. An Application Extraction Ontology informs both analyzers.]

“Find green cars costing no more than $9000.”

Application #8

Ontology Generation

TANGO: Table Analysis for Generating Ontologies

• Recognize and normalize table information
• Construct mini-ontologies from tables
• Discover inter-ontology mappings
• Merge mini-ontologies into a growing ontology

Recognize Table Information

                                        Religion
Country      Population         Albanian             Roman     Shi’a   Sunni
             (July 2001 est.)   Orthodox   Muslim    Catholic  Muslim  Muslim  other
Afganistan   26,813,057                                        15%     84%     1%
Albania      3,510,484          20%        70%       10%

Construct Mini-Ontology (from the same table)

Discover Mappings

Merge

Application #9

Challenging Applications (e.g., BioInformatics)

Large Extraction Ontologies

Complex Semi-Structured Pages

Additional Analysis Opportunities

• Sibling Page Comparison
• Semi-automatic Lexicon Update
• Seed Ontology Recognition

Sibling Page Comparison


Semi-automatic Lexicon Update

Additional Protein Names

Additional Source Species or Organisms

[Screenshots: extracted fields for hypothetical protein FLJ14299 (Homo sapiens, chromosome 8p11.2): cellular location (nucleus), functions (zinc ion binding, nucleic acid binding), topology (linear), accession NP_079345, taxonomy (Eukaryota; Metazoa; … Hominidae; Homo), DNA and amino-acid sequences, and position patterns such as “37,?612,?680”]

Seed Ontology Recognition



Limitations and Pragmatics

• Data-rich, narrow domain
• Ambiguities ~ context assumptions
• Incompleteness ~ implicit information
• Common-sense requirements
• Knowledge prerequisites
• …

Busiest Airport?

Chicago - 928,735 Landings (Nat. Air Traffic Controllers Assoc.); 931,000 Landings (Federal Aviation Admin.)
Atlanta - 58,875,694 Passengers (Sep., latest numbers available)
Memphis - 2,494,190 Metric Tons (Airports Council Int’l.)

Ambiguous: Whom do we trust? (How do they count?)
Important qualification: the answer depends on what is being counted.

Graphics, Icons, …

Dow Jones Industrial Average

           High      Low       Last      Chg
30 Indus   10527.03  10321.35  10409.85  +85.18
20 Transp  3038.15   2998.60   3008.16   +9.83
15 Utils   268.78    264.72    266.45    +1.72
66 Stocks  3022.31   2972.94   2993.12   +19.65

[Charts labeled 44.07 and 10,409.85, reported on the same date: one weekly, one daily]

Implicit information: “weekly” stated in upper corner of page; “daily” not stated.


Some Key Ideas

• Data, Information, and Knowledge
• Data Frames
  – Knowledge about everyday data items
  – Recognizers for data in context
• Ontologies
  – Resilient extraction ontologies
  – Shared conceptualizations
• Limitations and Pragmatics

Some Research Issues

• Building a library of open-source data recognizers
• Precisely finding and gathering relevant information
  – Subparts of larger data
  – Scattered data (linked, factored, implied)
  – Data behind forms in the hidden web
• Improving concept matching
  – Heuristic orchestration
  – Application of NLP techniques
  – Calculations, unit conversions, data normalization, …
• Achieving the potential of the presented applications

www.deg.byu.edu