Rapidly Constructing Integrated Applications from Online Sources
Transcript of Rapidly Constructing Integrated Applications from Online Sources
Craig A. Knoblock
Information Sciences Institute
University of Southern California
Motivating Example
BiddingForTravel.com
Priceline
Map
Orbitz
?
Outline
- Extracting data from unstructured and ungrammatical sources
- Automatically discovering models of sources
- Dynamically building integration plans
- Efficiently executing the integration plans
Ungrammatical & Unstructured Text
For simplicity, we call these “posts”
Goal:
<price>$25</price><hotelName>holiday inn sel.</hotelName>
<hotelArea>univ. ctr.</hotelArea>
Wrapper-based IE does not apply (e.g., Stalker, RoadRunner)
NLP-based IE does not apply (e.g., Rapier)
Reference Sets: IE infused with outside knowledge
“Reference Sets”: collections of known entities and their associated attributes
- Online (offline) set of docs: CIA World Fact Book
- Online (offline) database: Comics Price Guide, Edmunds, etc.
Algorithm Overview – Use of Ref Sets
$25 winning bid at holiday inn sel. univ. ctr.
Post:
Holiday Inn Select University Center
Hyatt Regency Downtown
Reference Set:
Record Linkage
$25 winning bid at holiday inn sel. univ. ctr.
Holiday Inn Select University Center
“$25”, “winning”, “bid”, …
Extraction
$25 winning bid … →
<price>$25</price>
<hotelName>holiday inn sel.</hotelName>
<hotelArea>univ. ctr.</hotelArea>
<Ref_hotelName>Holiday Inn Select</Ref_hotelName>
<Ref_hotelArea>University Center</Ref_hotelArea>
Post:
“$25 winning bid at holiday inn sel. univ. ctr.”

Reference Set:
Ref_hotelName (hotel name)    Ref_hotelArea (hotel area)
Holiday Inn Greentree
Holiday Inn Select            University Center
Hyatt Regency                 Downtown
Our Record Linkage Problem: posts are not yet decomposed into attributes, and contain extra tokens that match nothing in the reference set.
Our Record Linkage Solution
Record Level Similarity + Field Level Similarities
VRL = < RL_scores(P, “Hyatt Regency Downtown”), RL_scores(P, “Hyatt Regency”), RL_scores(P, “Downtown”)>
Best matching member of the reference set for the post
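A minimal sketch of the record-level plus field-level scoring idea. Token-level Jaccard similarity here is an assumed stand-in for the full set of RL_scores, and summing the vector stands in for the trained SVM with binary rescoring; the data is from the slide, the function names are illustrative.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def score_vector(post, record):
    """V_RL-style vector: one record-level score over the concatenated
    attributes, plus one field-level score per attribute."""
    return [jaccard(post, " ".join(record))] + \
           [jaccard(post, field) for field in record]

def best_match(post, reference_set):
    """Pick the reference record with the highest summed score vector
    (the real system classifies these vectors instead of summing)."""
    return max(reference_set, key=lambda r: sum(score_vector(post, r)))

reference_set = [
    ("Holiday Inn Select", "University Center"),
    ("Hyatt Regency", "Downtown"),
]
post = "$25 winning bid at holiday inn sel. univ. ctr."
match = best_match(post, reference_set)
```

Even with the extra tokens ($25, winning, bid, at), the shared tokens pull the post toward the correct reference record.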
Binary Rescoring
P = “$25 winning bid at holiday inn sel. univ. ctr.”
$25 winning bid at holiday inn sel. univ. ctr.
Post:
Generate VIE
Multiclass SVM
$25 winning bid at holiday inn sel. univ. ctr.
$25 holiday inn sel. univ. ctr.
price hotel name hotel area
Clean Whole Attribute
Extraction Algorithm
VIE = <common_scores(token), IE_scores(token, attr1), IE_scores(token, attr2), … >
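The per-token feature vector above can be sketched as follows. The specific orthographic features and the prefix-overlap stand-in for IE_scores are assumptions for illustration, not the system's actual feature set; in the real pipeline these vectors feed a multiclass SVM.

```python
import re

def common_scores(token):
    """Token-level features shared across all attributes."""
    return [
        float(token.startswith("$")),                         # currency marker
        float(bool(re.fullmatch(r"\d+", token.lstrip("$")))), # numeric token
        float(token[:1].isupper()),                           # capitalized
        float(len(token)),                                    # token length
    ]

def overlap(token, attr_value):
    """Crude stand-in for IE_scores: 1.0 if the token (sans punctuation)
    is a prefix of some word in the matched reference attribute."""
    t = token.lower().strip(".,")
    return 1.0 if any(w.startswith(t) for w in attr_value.lower().split()) else 0.0

def vie(token, matched_record):
    """V_IE = common scores followed by one score block per attribute."""
    return common_scores(token) + [overlap(token, a) for a in matched_record]

matched = ("Holiday Inn Select", "University Center")  # from record linkage
features = vie("univ.", matched)
```

The abbreviated token “univ.” scores high against the area attribute of the matched reference record and zero against the hotel name, which is exactly the signal the extraction classifier needs.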
Experimental Data Sets: Hotel Posts
- 1125 posts from www.biddingfortravel.com (Pittsburgh, Sacramento, San Diego)
- Attributes: star rating, hotel area, hotel name, price, date booked
Reference Set
- 132 records, from special posts on the BFT site
- Per area – lists any hotels ever bid on in that area
- Attributes: star rating, hotel area, hotel name
Comparison to Existing Systems
- Record Linkage: WHIRL
  (RL allows non-decomposed attributes)
- Information Extraction: Simple Tagger (CRF)
- State-of-the-art IE: Amilcare (NLP-based IE)
Record linkage results (10 trials – 30% train, 70% test)

Hotel     Prec.   Recall   F-Measure
Phoebus   93.60   91.79    92.68
WHIRL     83.52   83.61    83.13
Token-level extraction results: Hotel domain
(Not Significant)

Attribute  System          Prec.   Recall   F-Measure   Freq
Area       Phoebus         89.25   87.50    88.28       809.7
           Simple Tagger   92.28   81.24    86.39
           Amilcare        74.20   78.16    76.04
Date       Phoebus         87.45   90.62    88.99       751.9
           Simple Tagger   70.23   81.58    75.47
           Amilcare        93.27   81.74    86.94
Name       Phoebus         94.23   91.85    93.02       1873.9
           Simple Tagger   93.28   93.82    93.54
           Amilcare        83.61   90.49    86.90
Price      Phoebus         98.68   92.58    95.53       850.1
           Simple Tagger   75.93   85.93    80.61
           Amilcare        89.66   82.68    85.86
Star       Phoebus         97.94   96.61    97.84       766.4
           Simple Tagger   97.16   97.52    97.34
           Amilcare        96.50   92.26    94.27
Outline
- Extracting data from unstructured and ungrammatical sources
- Automatically discovering models of sources
- Dynamically building integration plans
- Efficiently executing the integration plans
Discovering Models of Sources Required for Integration
- Provide uniform access to heterogeneous sources
- Source definitions are used to reformulate queries
- New service, no source model, no integration!
- Can we discover models automatically?
Source Definitions:
- United
- Lufthansa
- Qantas
Mediator
?
Web Services
United
Lufthansa
Qantas
new service
Alitalia
Query
SELECT MIN(price) FROM flight
WHERE depart = “MXP” AND arrive = “PIT”
Reformulated Queries:
lowestFare(“MXP”, “PIT”)
calcPrice(“MXP”, “PIT”, ”economy”)
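A toy sketch of what the mediator's reformulation accomplishes: the aggregate query is answered by querying each relevant source operation and combining the results. The fare functions and the fare values are made up for illustration.

```python
# Hypothetical stand-ins for the two source operations.
def lowest_fare(depart, arrive):
    fares = {("MXP", "PIT"): 612.0}               # made-up fare table
    return fares[(depart, arrive)]

def calc_price(depart, arrive, fare_class):
    fares = {("MXP", "PIT", "economy"): 645.0}    # made-up fare table
    return fares[(depart, arrive, fare_class)]

def min_price(depart, arrive):
    """Answer SELECT MIN(price) by taking the minimum over the
    reformulated source calls."""
    return min(lowest_fare(depart, arrive),
               calc_price(depart, arrive, "economy"))
```

The mediator, not the user, decides which sources to call; adding a new source (e.g., Alitalia) only requires a new source definition, not a new query.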
Inducing Source Definitions: A Simple Example
Step 1: use metadata to classify input types
Step 2: invoke service and classify output types
Mediator
newsource
RateFinder($fromCountry,$toCountry,val):- ?
knownsource
known source
LatestRates($country1, $country2, rate) :- exchange(country1, country2, rate)
Semantic Types:
- currency {USD, EUR, AUD}
- rate {1936.2, 1.3058, 0.53177}
Predicates: exchange(currency, currency, rate)
currency
{<EUR,USD,1.30799>,<USD,EUR,0.764526>,…}
rate
def_1($from, $to, val) :- LatestRates(from, to, val)
def_2($from, $to, val) :- LatestRates(to, from, val)
def_1($from, $to, val) :- exchange(from, to, val)
def_2($from, $to, val) :- exchange(to, from, val)
Mediator
Predicates: exchange(currency, currency, rate)
Inducing Source Definitions: A Simple Example (Cont’d)
Step 3: generate plausible source definitions
Step 4: reformulate in terms of other sources
Step 5: invoke service and compare output
newsource
RateFinder($fromCountry,$toCountry,val):- ?
Input        RateFinder   Def_1      Def_2
<EUR,USD>    1.30799      1.30772    0.764692
<USD,EUR>    0.764526     0.764692   1.30772
<EUR,AUD>    1.68665      1.68979    0.591789
→ Def_1 matches
The Framework
Intuition: services often have similar semantics, so we should be able to use what we know to induce what we don’t.
Two-phase algorithm. For each operation provided by the new service:
1. Classify its input/output data types
   - Classify inputs based on metadata similarity
   - Invoke operation & classify outputs based on data
2. Induce a source definition
   - Generate candidates via Inductive Logic Programming
   - Test individual candidates by reformulating them
Use Case: Zip Code Data
- Single real zip-code service with multiple operations
- Same service, so no need to classify inputs/outputs or match constants!

The first operation is defined as:
getDistanceBetweenZipCodes($zip1, $zip2, distance) :-
    centroid(zip1, lat1, long1),
    centroid(zip2, lat2, long2),
    distanceInMiles(lat1, long1, lat2, long2, distance).

Goal is to induce a definition for a second operation:
getZipCodesWithin($zip1, $distance1, zip2, distance2) :-
    centroid(zip1, lat1, long1),
    centroid(zip2, lat2, long2),
    distanceInMiles(lat1, long1, lat2, long2, distance2),
    (distance2 ≤ distance1), (distance1 ≤ 300).
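The conjunctive-query reading of the target definition can be sketched directly. The centroid table and its coordinates are hypothetical sample data, and the haversine formula is an assumed implementation of distanceInMiles; the 300-mile clause is read as an input restriction, so larger radii return nothing.

```python
from math import radians, sin, cos, asin, sqrt

# Hypothetical centroid table: zip -> (lat, long). Coordinates are
# approximate and for illustration only.
centroid = {"90089": (34.0224, -118.2851),
            "90292": (33.9803, -118.4517),
            "92093": (32.8801, -117.2340)}

def distance_in_miles(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance, Earth radius in miles."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + \
        cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 3958.8 * asin(sqrt(a))

def get_zip_codes_within(zip1, distance1):
    """All (zip2, distance2) pairs with distance2 <= distance1."""
    if distance1 > 300:          # (distance1 <= 300): source rejects the query
        return []
    lat1, lon1 = centroid[zip1]
    return [(z2, d) for z2, (lat2, lon2) in centroid.items()
            if (d := distance_in_miles(lat1, lon1, lat2, lon2)) <= distance1]
```

With the sample table, a 50-mile radius around 90089 returns 90089 itself and nearby 90292 but not 92093, roughly 100 miles away.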
Generating Definitions: ILP
Want to induce a source definition for:
getZipCodesWithin($zip1, $distance1, zip2, distance2)
Predicates available for generating definitions: {centroid, distanceInMiles, ≤, =}
The new type signature contains that of the known source, so use the known definition as a starting point for local search:
getDistanceBetweenZipCodes($zip1, $zip2, distance) :-
    centroid(zip1, lat1, long1),
    centroid(zip2, lat2, long2),
    distanceInMiles(lat1, long1, lat2, long2, distance).
Plausible Source Definitions
1  cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d1), (d2 = d1)
2  cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d1), (d2 ≤ d1)   ← INVALID: d2 unbound!
3  cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d2), (d2 ≤ d1)
4  cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d2), (d1 ≤ d2)
5  cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d2), (d1 ≤ #d)   ← #d is a constant
6  cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d2), (lt1 ≤ d1)  ← UNCHECKABLE: lt1 inaccessible!
…
n  cen(z1,lt1,lg1), cen(z2,lt2,lg2), dIM(lt1,lg1,lt2,lg2,d2), (d2 ≤ d1), (d1 ≤ #d)
contained in defs 2 & 4
Preliminary Results
Settings:
- Number of zip code constants initially available: 6
- Number of samples performed per trial: 20
- Number of candidate definitions in search space: 5
Results:
- Converged on an “almost correct” definition!
- Number of iterations to convergence: 12
getZipCodesWithin($zip1, $distance1, zip2, distance2) :-
    centroid(zip1, lat1, long1),
    centroid(zip2, lat2, long2),
    distanceInMiles(lat1, long1, lat2, long2, distance2),
    (distance2 ≤ distance1), (distance1 ≤ 243).
Related Work
- Classifying Web Services (Hess & Kushmerick 2003), (Johnston & Kushmerick 2004): classify inputs/outputs/services using metadata/data. We learn semantic relationships between inputs & outputs.
- Category Translation (Perkowitz & Etzioni 1995): learn functions describing operations available on the internet. We concentrate on a relational modeling of services.
- CLIO (Yan et al. 2001): helps users define complex mappings between schemas. They do not automate the process of discovering mappings.
- iMAP (Dhamankar et al. 2004): automates discovery of certain complex mappings. Our approach is more general (ILP) & tailored to web sources; we must deal with the problem of generating valid input tuples.
Outline
- Extracting data from unstructured and ungrammatical sources
- Automatically discovering models of sources
- Dynamically building integration plans
- Efficiently executing the integration plans
Dynamically Building Integration Plans
Mediator
Traditional Data Integration Techniques
Find information about all proteins that participate in the transcription process
(1) SwissProt: P36246
(2) GenBank: AAS60665.1
…
Dynamically Building Integration Plans (Cont’d)
Mediator
Problem Solved Here
Create a web service that accepts the name of a biological process, <bname>, and returns information about proteins that participate in it.
New web service
Problem Statement (Cont’d)
Assumption: information-producing web service operations
Applicability:
- Biological data web services
- Geospatial services (WMS, WFS)
- Other applications that do not focus on transactions
Query-based Web Service Composition
Query-based approach:
- View web service operations as source relations with binding restrictions (can be inferred from WSDL)
- Create domain ontology
- Describe source relations in terms of domain relations (combined Global-as-View / Local-as-View approach)
- Use data integration system to answer user queries
Template-based Web Service Composition
- Our goal is to compose new web services, so we need to answer template queries, not specific queries
Template-based Query Approach
- Generate plans that take into account general parameter values, i.e., a Universal Plan [Schoppers]
- Easy to generate a universal plan: plans that answer the template query as opposed to a specific query
- But such plans can be very inefficient
- Need to generate optimized “universal integration plans”
Example Scenario
Sources:
Protein:
- HSProtein($id, name, location, function, seq, pubmedid)
- MMProtein($id, name, location, function, seq, pubmedid)
- TranducerProtein($id, name, location, taxonid, seq, pubmedid)
- MembraneProtein($id, name, location, taxonid, seq, pubmedid)
- DipProtein($id, name, location, taxonid, function)
Protein-Protein Interactions:
- HSProteinInteractions($fromid, toid, source, verified)
- MMProteinInteractions($fromid, toid, source, verified)
Example Rules and Query
ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
    HSProteinInteractions(fromid, toid, source, verified), (taxonid = 9606)
ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
    MMProteinInteractions(fromid, toid, source, verified), (taxonid = 10090)
ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
    ProteinProteinInteractions(fromid, itoid, taxonid, source, verified),
    ProteinProteinInteractions(itoid, toid, taxonid, source, verified)
Q(fromid, toid, taxonid, source, verified) :-
    (fromid = !fromid), (taxonid = !taxonid),
    ProteinProteinInteractions(fromid, toid, taxonid, source, verified)
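The recursive rule above computes the transitive closure of the interaction relation. A minimal fixpoint sketch, with taxonid and the other attributes dropped for brevity and made-up protein IDs:

```python
def transitive_interactions(direct):
    """Naive fixpoint for the recursive rule: keep joining known
    interaction pairs until no new (from, to) pair appears."""
    known = set(direct)
    while True:
        new = {(a, d) for (a, b) in known for (c, d) in known if b == c}
        if new <= known:
            return known
        known |= new

# Toy direct-interaction facts (hypothetical protein IDs).
direct = {("P1", "P2"), ("P2", "P3"), ("P3", "P4")}
closure = transitive_interactions(direct)
```

A real execution engine evaluates this incrementally and over remote sources, which is why loop detection and termination checks (discussed later) matter.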
Unoptimized Plan
Optimized Plan: exploit constraints in the source descriptions to filter queries to sources
Example Scenario
Q1(fromid, fromname, fromseq, frompubid, toid, toname, toseq, topubid) :-
    (fromid = !fromproteinid),
    Protein(fromid, fromname, loc1, f1, fromseq, frompubid, taxonid1),
    ProteinProteinInteractions(fromid, toid, taxonid, source, verified),
    Protein(toid, toname, loc2, f2, toseq, topubid, taxonid2)
Join
Input / Output
- Fromproteinid
- Fromproteinid, Toproteinid
- Fromproteinid, fromseq
- Fromproteinid, Toproteinid, toseq
- Fromproteinid, fromseq, Toproteinid, toseq
Protein-ProteinInteractions
Protein
Protein
ComposedPlan
Example Integration Plan
Adding Sensing Operations for Tuple-level Filtering
1. Compute the original plan for a template query
2. For each constraint on the sources:
   - Introduce the constraint into the query
   - Rerun the inverse rules algorithm
   - Compare the cost of the new plan to the original plan
   - Save the plan with the lowest cost
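The steps above amount to a loop over source constraints that keeps whichever plan is cheapest. In this sketch, reformulate and estimate_cost are placeholders for the inverse-rules algorithm and the plan cost model; the toy stand-ins below exist only to exercise the loop.

```python
def optimize_plan(query, constraints, reformulate, estimate_cost):
    """Try pushing each source constraint into the query as a
    tuple-level filter; keep the cheapest resulting plan."""
    best_plan = reformulate(query)
    best_cost = estimate_cost(best_plan)
    for c in constraints:
        plan = reformulate(query + [c])   # rerun inverse rules with the filter
        cost = estimate_cost(plan)
        if cost < best_cost:
            best_plan, best_cost = plan, cost
    return best_plan

# Toy stand-ins: "reformulation" records the conjuncts, and the cost
# model pretends each pushed filter makes the plan cheaper.
reformulate = lambda conjuncts: tuple(conjuncts)
estimate_cost = lambda plan: 10 - len(plan)

best = optimize_plan(["base_query"], ["taxonid = 9606"], reformulate, estimate_cost)
```

Because each constraint is evaluated against the cost model independently, a filter is only added when it actually pays for itself.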
Optimized Universal Integration Plan
Outline
- Extracting data from unstructured and ungrammatical sources
- Automatically discovering models of sources
- Dynamically building integration plans
- Efficiently executing the integration plans
Dataflow-style, Streaming Execution
- Map datalog plans into a streaming, dataflow execution system (e.g., a network query engine)
- We use the Theseus execution system since it supports recursion
Key challenges:
- Mapping non-recursive plans
- Mapping recursive plans: data processing, loop detection, query results update, termination check, recursive callback
Example Translation
ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
    HSProteinInteractions(fromid, toid, source, verified), (taxonid = 9606)
ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
    MMProteinInteractions(fromid, toid, source, verified), (taxonid = 10090)
ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
    ProteinProteinInteractions(fromid, itoid, taxonid, source, verified),
    ProteinProteinInteractions(itoid, toid, taxonid, source, verified)
Q(fromid, toid, taxonid, source, verified) :-
    ProteinProteinInteractions(fromid, toid, taxonid, source, verified),
    (fromid = !fromproteinid), (taxonid = !taxonid)
Example Theseus Plan
Bio-informatics Domain Results
- Experiments in the bio-informatics domain, where we have 60 real web services provided by NCI
- We varied the number of domain relations in a query from 1 to 30 and report composition time along with execution time
[Figure: composition time and execution time (in milliseconds) vs. number of relations in the query]
Tuple-level Filtering
- Tuple-level filtering can improve the execution time of the generated integration plan by up to 53.8%
Improvement due to Theseus
- Theseus can improve the execution time of the generated web service with complex plans by up to 33.6%
Discussion
- Huge number of sources available
- Need tools and systems that support the dynamic integration of these sources
In this talk, I described techniques for:
- Extracting data from unstructured and ungrammatical sources
- Discovering models of online sources required for integration
- Dynamic and efficient integration of web sources
- Efficient execution of integration plans
Much work still left to be done…
More information: http://www.isi.edu/~knoblock
- Matthew Michelson and Craig A. Knoblock. Semantic Annotation of Unstructured and Ungrammatical Text. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI), Edinburgh, Scotland, 2005.
- Mark James Carman and Craig A. Knoblock. Inducing Source Descriptions for Automated Web Service Composition. In Proceedings of the AAAI 2005 Workshop on Exploring Planning and Scheduling for Web Services, Grid, and Autonomic Computing, 2005.
- Snehal Thakkar, Jose Luis Ambite, and Craig A. Knoblock. Composing, Optimizing, and Executing Plans for Bioinformatics Web Services. VLDB Journal, Special Issue on Data Management, Analysis and Mining for Life Sciences, 14(3):330–353, September 2005.