Pedro DeRose University of Wisconsin-Madison
description
Transcript of Pedro DeRose University of Wisconsin-Madison
Pedro DeRoseUniversity of Wisconsin-Madison
Cimple 1.0: A Community Information ManagementWorkbench
Preliminary Examination
2
The CIM Problem
Numerous online communities database researchers, movie fans, legal professionals, bioinformatics,
enterprise intranets, tech support groups
Each community = many data sources + many members
Database community home pages, project pages, DBworld, DBLP, conference pages...
Movie fan community review sites, movie home pages, theatre listings...
Legal profession community law firm home pages
3
The CIM Problem
Members often want to discover, query, and monitor information in the community
Database community what is new in the past week in the database community? any interesting connection between researchers X and Y? find all citations of this paper in the past one week on the Web what are current hot topics? who has moved where?
Legal profession community which lawyers have moved where? which law firms have taken on which cases?
Solving this problem is becoming increasingly crucial initial efforts at UW, Y!R, MSR, Washington, IBM Almaden
4
SIGMOD-07
Planned Contributions of Thesis1. Decompose the overall CIM problem [IEEE 06, CIDR 07, VLDB 07]
Web Pages
* **
*
HV Jagadish
SIGMOD-07***
*** served in
HV Jagadish
Incremental Expansion
Community Portal
* * *
***
****
Day 1
Day n
.
.
.
.
.
.
.
.
.
.
.
.
User services- keyword search- query- browse- mine …
Data Sources
Leveraging the Community
5
Planned Contributions of Thesis2. Provide concrete solutions to key sub-problem
• creating ER graphs: key novelty is composing plans from operators [VLDB 07]• facilitates developing, maintaining, and optimizing
• leveraging communities: wiki solution with key novelties [ICDE 08]
• combines contributions from both humans and machines
• combines both structured and text contributions
ExtractMbyName
MatchMbyName
Union
{s1 … sn} DBLP\
ExtractMbyName
DBLP
CreateE
MatchMStrict
conferenceentities
personentities
main pages
ExtractLabel
c(person, label)
CreateR
6
Planned Contributions of Thesis3. Capture solutions in the Cimple 1.0 workbench [VLDB 07]
• empty portal shell, including basic services and admin tools• set of general operators, and means to compose them
• simple implementation of operators
• end-to-end development methodology
• extraction/integration plan optimizers
• developers can employ Cimple tools to quickly build and maintain portals
• will release publicly to drive research and evaluation in CIM
7
Planned Contributions of Thesis4. Evaluate solutions and workbench on several domains
• use Cimple to build portals for multiple communities
• evaluate ease of development, extensibility, and accuracy of portal
• have built DBLife, a portal for the database community [CIDR 07, VLDB 07]
• will build a second portal for a non-research domain (e.g., movies, NFL)
8
Planned Thesis Chapters
Selecting initial data sources
Creating the daily ER graphMerging daily graphs into the global ER graph
Incrementally expanding sources and data
Leveraging community members
Developing the Cimple 1.0 workbenchEvaluating the solutions and workbench
9
Selecting Initial Sources
Current solutions often use a "shotgun" approach select as many potentially relevant sources as possible lots of noisy sources, which can lower accuracy
Communitites often show an 80-20 phenomenon small set of sources already covers 80% of interesting activity
Select these 20% of sources e.g., for database community, sites of prominent researchers,
conferences, departments, etc.
Can incrementally expand later semi-automatically or mass collaboration
Crawl sources periodically e.g., DBLife crawls ~10,000 pages (+160 MB) daily
10
Creating the ER Graph – Current Solutions
Manual e.g., DBLP require a lot of human effort
Semi-automatic, but domain-specific e.g., Yahoo! Finance, Citeseer difficult to adapt to new domains
Semi-automatic and general many solutions from the database, WWW, and Semantic Web
communities, e.g., Rexa, Libra, Flink, Polyphonet, Cora, Deadliner often use monolithic solutions, e.g., learning methods such as CRFs require little human effort can be difficult to tailor to individual communities
served in
SIGMOD-07
HV Jagadish
11
Proposed Solution: A Compositional Approach
Web Pages
* **
*SIGMOD-07SIGMOD-07
***
*** served in
ExtractMbyName
MatchMbyName
Union
{s1 … sn} DBLP\
ExtractMbyName
DBLP
CreateE
MatchMStrict
conferenceentities
personentities
main pagesExtractLabel
c(person, label)
CreateR
Discover Entities Discover Relationships
HV Jagadish HV Jagadish
12
Benefits of the Proposed Solution
Easier to develop, maintain, and extend e.g., 2 students less than 1 week to create plans for initial DBLife
Provides opportunities for optimization e.g., extraction and integration plans allow for plan rewriting
Can achieve high accuracy with relatively simple operators by exploiting community properties
e.g., finding people names on seminar pages yields talks with 88% F1
13
Creating Plans to Discover Entities
Raghu Ramakrishnan
Union
ExtractM
MatchM
CreateE
s1 sn…
14
These operators address well-known problems mention recognition, entity disambiguation... many sophisticated solutions
In DBLife, we find simple implementations can already work surprisingly well
often easy to collect entity names from community sources (e.g., DBLP)
ExtractMByName: finds variations of names
entity names within a community are often uniqueMatchMByName: matches mentions by name
these simple methods work with 98% F1 in DBLife
Creating Plans to Discover Entities (cont.)
Union
ExtractM
MatchM
CreateE
s1 sn…
15
Extending Plans to Handle Difficult Spots
Can decide which operators to apply where e.g., stricter operators over more ambiguous data
Provides optimization opportunities similarly to relational query plans see ICDE-07 for a way to optimize such plans
CreateE
MatchMStrict
ExtractMbyName
MatchMbyName
Union
{s1 … sn} DBLP\
ExtractMbyName
DBLP
DBLP: Chen Li
· · ·41. Chen Li, Bin Wang, Xiaochun Yang.VGRAM. VLDB 2007.· · ·38. Ping-Qi Pan, Jian-Feng Hu, Chen Li.Feasible region contraction.Applied Mathematics and Computation.· · ·
16
Creating Plans to Discover Relationships
Categorize relations into general classes e.g., co-occur, label, neighborhood…
Then provide operators for each class e.g., ComputeCoStrength, ExtractLabels, neighborhood selection…
And compose them into a plan for each relation type makes plans easier to develop plans are relatively simple to understand can easily add new plans for new relation types
17
Illustrating Example: Co-occur
To find affiliated(person, org) relationship... e.g., affiliated(Raghu, UW Madison), affiliated(Raghu, Yahoo! Research) categorized as a co-occur relationship
...compose a simple co-occur plan
This simple plan already finds affiliations with 80% F1
ComputeCoStrength
CreateR
personentities
orgentities
Union
s1 sn…
×
Select (strength > θ)
18
Illustrating Example: Label
ICDE'07 Istanbul Turkey
General Chair• Ling Liu• Adnan Yazici
Program Committee Chairs• Asuman Dogac• Tamer Ozsu• Timos Sellis
Program Committee Members• Ashraf Aboulnaga• Sibel Adali…
conferenceentities
personentities
main pages
ExtractLabel
c(person, label)
CreateR
Plan forserved-in(person, conf)
19
Illustrating Example: Neighborhood
UCLA Computer Science Seminars
Title: Clustering and ClassificationSpeaker: Yi Ma, UIUCContact: Rachelle Reamkitkarn
Title: Mobility-Assisted RoutingSpeaker: Konstantinos Psounis, USCContact: Rachelle Reamkitkarn
…
CreateR
orgentities
personentities
seminarpages
c(person, neighborhood)
Plan forgave-talk(person, venue)
20
Creating Daily ER Graphs
SIGMOD-07
Web Pages
* **
*
HV Jagadish
SIGMOD-07***
*** served in
HV Jagadish
* * *
***
****
Day 1
Day n
.
.
.
.
.
.
.
.
.
.
.
.
Data Sources
Global ER Graph
Daily ER Graph
21
Merging Daily ER Graphs
Match
globalER Graph
dailyER Graph
Enrich
Day n Day n+1
AnHai
gave talk
UIUCUIUC
AnHai gave talk
Stanford
Global ER Graph
AnHai
gave talk
Stanford
gave talk
. . .
UIUC
22
Cimple Workflow
Blackboard
Crawler
datasources.xmlhttp://…http://…
index.xmlcrawledPages/…
Discover Relationships
Merge ER Graphs
Services
Discover Entities
ExtractM
MatchM
CreateE
Person
ExtractM
MatchM
CreateE
Publication
…
entities.xml*** ***
relationships.xml
globalGraph.xml
superhomepages, browsing, search
23
Example Plan Specification
# Look for names in pages and mark them as mentionsEXTRACT_MENTIONS = tasks/getMentions/extractMentions$EXTRACT_MENTIONS PerlSearch $CRAWLER->index.xml $NAME_VARIATIONS
# Match mentionsMATCH_MENTIONS = tasks/getEntities/matchMentions$MATCH_ENTITIES RootNames $EXTRACT_MENTIONS->mentions.xml
# Create entitiesCREATE_ENTITIES = tasks/getEntities/createEntities$CREATE_ENTITIES RootNames $MATCH_ENTITIES->mentionGroups.xml
discoverPeopleEnities.cfg#!/usr/bin/perl#################################################################### Arguments: <moduleDir> <fileIndex> <variationsFile># moduleDir: the relative path to the module# fileIndex: the file index of crawled files# variationsFile: a file containing mention name variations## Finds mentions in crawled files by searching for name variations###################################################################use dblife::utils::CrawledDataAccess;use dblife::utils::OutputAccess;
# First get argumentsmy ($moduleDir, $fileIndex, $variationsFile) = @ARGV;
# Parse the crawled file index for info by URLmy %urlToInfoopen(FILEINDEX, "< $fileIndex");while(<FILEINDEX>) { if(/^<file>/) { ... } elsif(/^<\/file>/) { ... } ...}close(FILEINDEX);
# Output as we goopen(OUT, "> $moduleDir/output/output.xml");...
# Search through crawled file for variationsforeach my $url (keys %urlToInfo) { ... }...
close(OUT);
ExtractMentions.pl
24
Example Operator APIs
<mention id="mention-537390"> <mentionType>people</mentionType> <rootName>Jeffrey F. Naughton</rootName> <nameVariation>(Jeffrey|Jefferson|Jeff)\s+Naughton</nameVariation> <filepath>www.cse.uconn.edu/icde04/icde04cfp.txt</filepath> <url>http://www.cse.uconn.edu/icde04/icde04cfp.txt</url> <mentionLeftContextLength>100</mentionLeftContextLength> <mentionRightContextLength>100</mentionRightContextLength> <mentionStartPos>3786</mentionStartPos> <mentionEndPos>3798</mentionEndPos> <mentionMatch>Jeff Naughton</mentionMatch> <mentionWindow> PROGRAM CHAIR 6. Mary Fernandez, AT&T Gail Mitchell, BBN Technologies 7. Ling Liu, Georgia Tech 8. Jeff Naughton, U. Wisc. SEMINAR CHAIR 9. Guy Lohman, IBM Almaden Mitch Cherniack, Brandeis U. 10. Panos Chrysanth</mentionWindow></mention>
<mention id="mention-1143"> <mentionType>people</mentionType> <rootName>Guy M. Lohman</rootName> <nameVariation>Guy\s+Lohman</nameVariation> <filepath>www.cse.uconn.edu/icde04/icde04cfp.txt</filepath> <url>http://www.cse.uconn.edu/icde04/icde04cfp.txt</url> <mentionLeftContextLength>100</mentionLeftContextLength> <mentionRightContextLength>100</mentionRightContextLength> <mentionStartPos>3850</mentionStartPos> <mentionEndPos>3859</mentionEndPos> <mentionMatch>Guy Lohman</mentionMatch> <mentionWindow>il Mitchell, BBN Technologies 7. Ling Liu, Georgia Tech 8. Jeff Naughton, U. Wisc. SEMINAR CHAIR 9. Guy Lohman, IBM Almaden Mitch Cherniack, Brandeis U. 10. Panos Chrysanthis, U. Pittsburgh 11. Aidong Zhang, SU</mentionWindow></mention>
extractMentions/output.xml
<mentionGroup> <mention>mention-545376</mention> <mention>mention-2510</mention> <mention>mention-2511</mention> <mention>mention-2512</mention></mentionGroup>
<mentionGroup> <mention>mention-537390</mention> <mention>mention-537678</mention> <mention>mention-538967</mention> <mention>mention-539718</mention> <mention>mention-539973</mention> <mention>mention-539974</mention> <mention>mention-540184</mention> <mention>mention-540265</mention> <mention>mention-554093</mention> <mention>mention-540732</mention> <mention>mention-540733</mention> <mention>mention-541072</mention> <mention>mention-541073</mention> <mention>mention-541964</mention> <mention>mention-542572</mention> <mention>mention-198996</mention> <mention>mention-543725</mention> <mention>mention-544495</mention> <mention>mention-545057</mention> ...</mentionGroup>
matchMentions/output.xml<entity id="entity-1979"> <entityName>Daniel Urieli</entityName> <entityType>people</entityType> <mention>mention-545376</mention> <mention>mention-2510</mention> <mention>mention-2511</mention> <mention>mention-2512</mention></entity>
<entity id="entity-1980"> <entityName>Jeffrey F. Naughton</entityName> <entityType>people</entityType> <mention>mention-537390</mention> <mention>mention-537678</mention> <mention>mention-538967</mention> <mention>mention-539718</mention> <mention>mention-539973</mention> <mention>mention-539974</mention> <mention>mention-540184</mention> <mention>mention-540265</mention> <mention>mention-554093</mention> <mention>mention-540732</mention> <mention>mention-540733</mention> <mention>mention-541072</mention> <mention>mention-541073</mention> <mention>mention-541964</mention> <mention>mention-542572</mention> <mention>mention-198996</mention> <mention>mention-543725</mention> <mention>mention-544495</mention> <mention>mention-545057</mention> ...</entity>
createEntities/output.xml
25
Experimental Evaluation: Building DBLife
2 days, 2 personsCore Entities (489): researchers (365), department/organizations (94), conferences (30)
Relation Plans (8): authored, co-author, affiliated with, gave talk, gave tutorial, in panel, served in, related topic
Operators: DBLife-specific implementation of MatchMStrict
Data Sources (846): researcher homepages (365), department/organization homepages (94), conference homepages (30), faculty hubs (63), group pages (48), project pages (187), colloquia pages (50), event pages (8), DBWorld (1), DBLP (1)
Initial DBLife (May 31, 2005)
2 days, 2 persons
1 day, 1 person
2 days, 2 persons
Time
Data Source Maintenance: adding new sources, updating relocated pages, updating source metadata
Maintenance and Expansion1 hour/month,
1 person
Time
Relation Instances (63,923): authored (18,776), co-author (24,709), affiliated with (1,359), served in (5,922), gave talk (1,178), gave tutorial (119), in panel (135), related topic (11,725)
Entities (16,674): researchers (5,767), departments/organizations (162), conferences (232), publications (9,837), topics (676)
Mentions (324,188): researchers (125,013), departments/organizations (30,742), conferences (723), publication: (55,242), topics (112,468)
Data Sources (1,075): researcher homepages (463), department/organization homepages (103), conference homepages (54), faculty hubs (99), group pages (56), project pages (203), colloquia pages (85), event pages (11), DBWorld (1), DBLP (1)
Current DBLife (Mar 21, 2007)
26
Relatively Easy to Deploy, Extend, and Debug
DBLife has been deployed and extended by several developers CS at IL, CS at WI, Biochemistry at WI, Yahoo! Research development started after only a few hours Q&A
Developers quickly grasped our compositional approach easily zoomed in on target components could quickly tune, debug, or replace individual components e.g., new student extended ComputeCoStrength operator and added the
"affiliated" plan in just a couple days
27
Accuracy in DBLife
Mean accuracy over 20 randomly chosen researchers
0.980.990.97Discovering entities with source-aware plan
0.890.920.95Finding "on panel" relations (labels)
0.921.000.90Finding "gave tutorial" relations (labels)
0.881.000.87Finding "gave talk" relations (neighborhood)
0.770.810.84Finding "served in" relations (labels)
0.800.830.85Finding "affiliated" relations (co-occurrence)
0.840.980.76Finding "authored" relations (DBLP plan)
0.980.961.00Discovering entities with default plan
0.980.980.99Extracting mentions with ExtractMByName
MeanF1
MeanPrecision
MeanRecall
Experiment
1.001.001.00Matching entities across daily graphs
28
Proposed Future Work: Data Model
Requirements relatively easy to understand and manipulate for non-technical users temporal dimension to represent changing data
Candidate: Temporal ER Graph
"HV Jagadish" "Jagadish, HV"
mentionOfmentionOf
extractedFrom extractedFrom
http://cse.umich.edu/...
"HV Jagadish"
SIGMOD-07
HV Jagadish
served in
mentionOf
extractedFrom
http://cse.umich.edu/...
Day n Day n+1
SIGMOD-07
HV Jagadish
served in
29
Data Model Implementation
Current implementation is ad hoc combination of XML, RDBMS, and unstructured files no clean separation between data and operators
Will explore a more principled implementation efficient storage and processing abstraction through an API
30
Proposed Future Work: Operator Model
Identify core set of efficient operators that encompass many common CIM extraction/integration tasks
Data acquisition query search engines crawl sources (e.g., Crawler)
Data extraction extract mentions (e.g., ExtractM) discover relations (e.g., ComputeCoStrength, ExtractLabels)
Data integration match mentions (e.g., MatchM) match entities over time (e.g., Match operator in merge plan) match entities across portals
Data manipulation select subgraphs join graphs
31
Operator Model Extensibility
Input parameters that can be tuned for each domain
e.g., MatchMByName "HV Jagadish" = "Jagadish, HV" input parameters: name permutation rules
e.g., MatchMByNeighborhood "Cong Yu, HV Jagadish, Yun Chen" = "Yun Chen, HV Jagadish, Amab Nandi" input parameters: window size, overlapping tokens required
32
Proposed Future Work: Plan Model
Need language for developers to compose operators
Workflow language specifies operator order, how they are composed used by current Cimple implementation
Declarative language use a Datalog-like language to compose operators existing work [Shen, VLDB 07] seems promising provides opportunities for data-specific optimizations
33
Planned Thesis Chapters
Selecting initial data sources
Creating the daily ER graphMerging daily graphs into the global ER graph
Incrementally expanding sources and data
Leveraging community members
Developing the Cimple 1.0 workbenchEvaluating the solutions and workbench
34
Incrementally Expanding
Most prior work periodically re-runs source discovery step
Cimple leverages a common community property important new sources and entities are often mentioned in certain
community sources (e.g., DBWorld)
monitor these sources with simple extraction plans
Message type: conf. ann.Subject: Call for Participation: VLDB Workshop on Management of Uncertain Data
Call for Participation Workshop on "Management of Uncertain Data" in conjunction with VLDB 2007
http://mud.cs.utwente.nl ...
35
Leveraging Community Members
Machine-only contributions little human effort, reasonable initial portal, automatic updates inaccuracies from imperfect methods, limited coverage
Human-only contributions accurate, can provide data not in sources significant human effort, difficult to solicit, typically unstructured
Combine contributions from both humans and machines benefits of both encourage human contributions with machine's initial portal
Combine both structured and text contributions allows structured services (e.g., querying)
36
Illustrating Example
<# person(id=1){name}=David J. DeWitt #>
<# person(id=1){title}=Professor #>
<strong>Interests:</strong><# person(id=1).interests(id=3).topic(id=4){name}=Parallel DB #>
David J. DeWitt
Professor
Interests: Parallel DB
<# person(id=1){name}=David J. DeWitt #>
<# person(id=1){title}=John P. Morgridge
Professor #>
<# person(id=1) {organization}=UW #> since 1976
<strong>Interests:</strong><# person(id=1).interests(id=3).topic(id=4){name}=Parallel DB #>
<# person(id=1){name}=David J. DeWitt #>
<# person(id=1){title}= John P. Morgridge Professor #><# person(id=1){organization}=UW-Madison#>since 1976
<strong>Interests:</strong><# person(id=1).interests(id=3).topic(id=4){name}=Parallel DB #><# person(id=1).interests(id=5).topic(id=6){name}=Privacy #>
David J. DeWitt
John P. Morgridge ProfessorUW-Madison since 1976
Interests: Parallel DBPrivacy
Madwiki solution (Machine assisted development of wikipedias)
37
The Madwiki Architecture
Community wikipedia, backed by a structured database
Challenges include How to model underlying databases G and T How to export a view Vi as a wiki page Wi
How to translate edits to a wiki page Wi into edits to G How to propagate changes to G to affected wiki pages
DataSources G
T
V1
V2
V3
W1
W2
W3
u1
V3’ W3’
T3’
M
38
Modeling the Underlying Database G
Use an entity-relationship graph commonly employed by current community portals familiar to ordinary, database-illiterate users
id = 6name = Privacy
id = 1name = David J. DeWittorganization = UW-Madison
id = 4name = Parallel DB
interestsid = 5
id = 12name = SIGMOD 02
interestsid = 3
servicesid = 11as =general chair
id = 7name = Statisticsinterests
id = 8
39
Storing ER Graph G
Use an RDBMS for efficient querying and concurrency
Extend with temporal functionality for undo [Snodgrass 99]
xid
value start stopwho
1 UW 2007-04-01 9999-12-31 M2 MITRE 2007-05-20 9999-12-31 M
xid
value start stopwho
2 Purdue 2007-05-02 9999-12-31 U1
1 UW-Madison 2007-05-27 9999-12-31 U2
Organization_m
Organization_u
…
…
…
…
…
…
…
…
id etype1 person4 topic12 conf
id rtypeeid1
eid2
3 interests 1 411 services 1 12
… …
… … … …
Entity_ID
Relationship_ID
40
Reconciling Human and Machine Edits
When edits conflict, must reconcile to choose attribute values
Reconciliation policy encoded as view over attribute tables e.g., "latest": current value is latest value, whether human or machine e.g., "humans-first": current value is latest human value, or latest machine
value if there are no human values
xid
value start stopwho
1 UW 2007-04-01 2007-05-27 M2 Purdue 2007-05-02 9999-12-31 U1
1 UW-Madison 2007-05-27 9999-12-31 U2
Organization_p
……
……
… …
41
Exporting Views over G as Wiki Pages
Views over ER graph G select sub-graphs
Use s-slots to represent sub-graph's data in wiki pages
<# person(id=1){name}=David J. DeWitt #>
<# person(id=1){organization}=UW-Madison#>
<strong>Interests:</strong><# person(id=1).interests(id=3).topic(id=4){name}=Parallel DB #>
<# person(id=1).interests(id=5).topic(id=6){name}=Privacy #>
David J. DeWitt
UW-Madison
Interests: Parallel DBPrivacy
id = 6name = Privacy
id = 1name = David J. DeWittorganization = UW-Madison
id = 4name = Parallel DB
interestsid = 5
interestsid = 3
42
Translating Wiki Edits to Database Edits
Use wiki edits to infer ER edits to underlying view
Map ER edits to RDBMS edits over G See ICDE 08 for details of mapping
<# person(id=1){name}=David J. DeWitt #><# person(id=1){organization}=UW #>
<strong>Interests:</strong><# person(id=1).interests(id=3).topic(id=4){name}=Parallel DB #>
<# person(id=1){name}=David J. DeWitt #><# person(id=1){organization}=UW- Madison #>
<strong>Interests:</strong><# person(id=1).interests(id=3).topic(id=4){name}=Parallel DB #>
<# person(id=1).interests(id=5).topic(id=6){name}=Privacy #>
id = 1name = David J. DeWittorganization = UW
id = 4name = Parallel DB
interestsid = 3
id = 6name = Privacy
id = 1name = David J. DeWittorganization = UW-Madison
id = 4name = Parallel DB
interestsid = 5
interestsid = 3
43
Resolving Ambiguous Wiki Edits
Users can edit data, views, and schemas change the attribute age of person X to 42 display the attribute homepage of conference entities on wiki pages add the new attribute pages to the publication entity schema
Wiki edits can be ambiguous change the attribute title of person X to NULL do not display the attribute title of people entities on wiki pages delete the attribute title from the people entity schema
Recognize ambiguous edits, then ask users to clarify intention
44
Propagating Database Edits to Wiki Pages
Update wiki pages when G changes similar to view update
Eager propagation pre-materialize wiki text, and immediately propagate all changes raises complex concurrency issues
Simpler alternative: lazy propagation materialize wiki text on-the-fly when requested by users underlying RDBMS manages edit concurrency preliminary evaluation indicates lazy propagation is tractable
45
S-slot expressiveness prototype shows s-slots can express all structured data in DBLife
superhomepages except aggregation and top-k
Materializing and serving wiki pages materializing and serving time increases linearly with page size
vast majority of current pages are small, hence served in < 1 second
Madwiki Evaluation
46
Proposed Future Work
Structured data in wiki pages s-slot tags can be confusing and cumbersome to users propose to explore a structured data representation closer to natural text key challenge: how to avoid requiring an intractable NLP solution
Reconciling conflicting edits preliminary work identifies problem, does not provide satisfactory solution propose to explore reconciling edits by learning editor trustworthiness lots of existing work, but not in CIM settings key challenge: edits are sparse, and not missing at random
47
Developing the Cimple 1.0 Workbench
Set of tools for compositional portal building
empty portal shell, including basic services and admin tools browsing, keyword search…
set of general operators, and means to compose them MatchM, ExtractM…
simple implementation of operators MatchMbyName, ExtractMbyName…
end-to-end development methodology 1. select sources, 2. discover entities…
extraction/integration plan optimizers see VLDB 07 for an optimization solution
48
Proposed Evaluation: A Second Domain
Use Cimple to build portal for non-research domain
Evaluate portal's ease of development, extensibility, and accuracy
Movie Domain movie, actor, director, appeared in, directed… sources include critic homepages, celebrity news sites, fan sites existent portals (e.g., IMDB), arguably an unfair advantage
NFL Domain player, coach, team, played for, coached… sources include homepages for players, teams, tournaments, stadiums
Digital Camera Domain manufactureres, cameras, stores, reviewers, makes, sells, reviewed… sources include manufacturer homepages, online stores, reviewer sites
49
Conclusions
CIM is an interesting and increasingly crucial problem many online communities initial efforts at UW, Y!R, MSR, Washington, IBM Almaden
Preliminary work on CIM seems promising decomposed CIM and developed initial solutions began work on the Cimple 1.0 workbench developed the DBLife portal
Much work remains to realize the potential of this approach formal data, operator, and plan model more natural editing and robust reconciliation for Madwiki Cimple 1.0 workbench a second portal
50
Proposed Schedule
February formal data, operator, and plan model for Cimple
March solution for reconciling human and machine edits based on trust
June more natural structured data representation for Madwiki wiki text
July completion of Cimple 1.0 beta workbench
August deployment of second portal built using the Cimple 1.0 workbench public release of Cimple 1.0 workbench
51
Leveraging Communities – Prior Work
Automatic methods are imperfect extraction/integration errors, and incomplete coverage
Some solutions builds portals with human-only contributions Wikipedia, Intellipedia, umasswiki.com, ecolicommunity.org... given an active, trusted community, can achieve excellent accuracy can require lots of effort to maintain, and typically unstructured
Recent work explores allowing structured user contributions Semantic Wikipedia, WikiLens, MetaWeb... focus on extending wiki language to express structured data do not explore synergistic leveraging of human and automatic contributions