Dissertation Proposal Presentation
MODELING AND MAPPING FORMS OVER DATABASES:
EMPOWERING USERS TO DESIGN DATABASES IN INDUSTRIAL DOMAINS
Dissertation Proposal
October 07 2010
Ritu Khare
2
MOTIVATION
Database Design by Non-technical Users
Why have existing methods not reached the industrial domains?
3
Database Design By Non-technical Users
Our inspiration: applications (Google Forms, FormAssembly, Zoho Creator) that allow users to design databases.
How? Forward engineering of user needs into databases.
Great innovation in DB Usability! Database closely reflects user needs.
Very Popular for online data collection – surveys, event organization, etc.
Not used in industrial domains! – healthcare, automobile, etc.
[Figure: a clinician designs a form and collects data; forward engineering (F/W engg) produces a user-designed database with tables Patient (ID, Name, Phone, DOB) and VitalSigns (ID, PatientID, Date, Height, Weight).]
4
Why are existing methods unfit for industrial domains?
Existing Applications
  No provision to modify or extend an existing database.
  Translation (forward engineering) method is not reported.
  Not tested on non-technical users.
Features of Industrial Domains
  Databases are required to evolve w.r.t. new user needs.
  Data and database quality is important; quality leads to productivity (Batini and Scannapieco, 2006).
  Users have no background in data modeling and databases.
5
THE PROPOSAL
Proposed System and Research Goals
Opportunity: Forms
Example: Form to Database Mapping
Challenges in Mapping
6
Proposed System and Research Goals
Proposed System: An application to model and map user needs into an existing database
Goals:
1. Modeling: a “usable” medium for users to model needs (efficiency, effectiveness, adoption).
2. Mapping: the resultant database should be high-quality, i.e. it should satisfy normalization, completeness, compactness, and correctness (Silberschatz et al. 2001; Batini and Scannapieco, 2006; Batini et al. 1992).
7
Opportunity: Forms
MODELING: Data-entry forms provide a good communication medium for users to specify their data collection needs (Choobineh et al. 1988; Embley, 1989).
MAPPING: Important information on databases can be retrieved by analyzing forms (Choobineh and Mannino, 1988).
  Search forms provide a useful way of determining the underlying database (Benslimane, 2007) (covered in Candidacy Exam).
  Data-entry forms provide key guidelines for designing a prospective database (Mannino and Choobineh, 1984).
8
The proposed application: An Example
[Figure: a clinician designs a new form (Form Modeling) that expresses new needs; Form to Database Mapping merges it with the existing database to produce an evolved database. Existing database: Patient (ID, Name, Phone, DOB) and VitalSigns (ID, PatientID, Date, Height, Weight). New elements shown on the form: BP, Smoking Stat.]
NEW PROBLEM!
9
Uniqueness of “Form to Database” Mapping
Schema Mapping (Rahm and Bernstein 2001)
  The two structures are similar.
  Mapping involves only schema elements (no values).
  Does not consider schema/database evolution when there are unmapped elements.
  Semiautomatic.
Form to Database Mapping
  Mapping Discovery
    How to reconcile the differences in structures and semantics?
    How to detect the form (or need) components (including values) that already exist in the database?
  Database Evolution
    How to extend the database based on new elements in the form?
    How to automatically determine functional dependencies and cardinalities from a form?
10 Proposed Application
11
1. Form Design Interface
[Figure: the form design interface, showing a Simple Form and an Advanced Form. Labeled elements: Title, Category, Subcategory, Field, Subfield, Format, Supporting Text, Unit, Extended Checkbox option, Condition.]
SIMPLE!
1. Terminology (intuitive)
2. Features (form patterns)
12
1. Form Design Interface
Input: User actions (based on data collection needs)
Output: Form
1. Enter the title “Patient Encounter Form”
2. Enter the category “Patient”
3. Enter the field “Name”
4. Pick a format “textbox”
5. Enter the field “Age”
6. …
13
Defining High-Quality: Guiding Principles (with respect to a given form)
Completeness: Every form element has a place in the database.
Correctness: For each correspondence, the form element and the database element refer to the same real-world element (they have matching labels and contexts).
Compactness: Every database element occurs just once.
Normalization: The database is in 3NF.
14
A Simple Approach
1. Loses grouping information.
2. Loses form values.
3. Places heterogeneous attributes in the same relation.
The generated database is incomplete and not in 3NF (low-quality)!
So we propose a tree representation of the form.
15
2. Tree Generation
Definition: Form Tree
Input: Form
Output: Form Tree
Previous works have proposed a similar tree representation for search forms (Dragut et al. 09; Wu et al. 09).
This work differs in: 1) data-entry forms; 2) format nodes to improve DB quality; 3) a different representation for checkboxes and radiobuttons.
16
Form to Database Mapping
Form Tree + Existing Database: Map and Merge???
Main challenges:
1. Discovering a mapping between two heterogeneous structures (MAP).
2. Merging new elements into the existing database (MERGE).
[Figure: Form Tree -> 3. Birthing -> New Database Graph -> 4. Classification (against the Existing Database Graph) -> 5. Extension -> Existing Database.]
17
Definition: Database Graph
18
Definition: Mapping Correspondences
Direct correspondence
Indirect correspondence
(The value collected on the form element is stored in the database element.)
19
3. Birthing (term adopted from Jagadish et al. 2007)
Input: Form Tree
Output: New Database Graph
20
3. Birthing – Pattern 1 (Textbox)
Induced functional dependencies:
  Address.id -> line1
  Address.id -> line2
  Patient.id -> Name
  Patient.id -> Age
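The textbox pattern can be illustrated with a small sketch. It assumes a form tree in which Address is a subcategory of Patient (the slide lists only the resulting dependencies, not the tree), and the names Category and induced_fds are hypothetical, not taken from the proposed system.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """A form category (or subcategory) holding textbox fields."""
    label: str
    textboxes: list[str] = field(default_factory=list)
    subcategories: list["Category"] = field(default_factory=list)


def induced_fds(category: Category) -> list[str]:
    """Return 'Table.id -> column' for every textbox, visiting the tree depth-first."""
    fds = [f"{category.label}.id -> {name}" for name in category.textboxes]
    for sub in category.subcategories:
        fds.extend(induced_fds(sub))
    return fds


# Assumed form tree: Patient has textboxes Name and Age and a subcategory Address.
patient = Category(
    label="Patient",
    textboxes=["Name", "Age"],
    subcategories=[Category(label="Address", textboxes=["line1", "line2"])],
)

for fd in induced_fds(patient):
    print(fd)
```

Each category becomes a table keyed by id, and each textbox becomes a column that depends on that key.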
21
3. Birthing – Pattern 2: Radiobutton & Pattern 3: Checkbox
Radiobutton values are mapped to database values. They represent an M:1 relationship between Patient and Insurance.
Checkbox values are mapped to database columns (yes/no). They represent a 1:1 relationship between Patient and Symptoms.
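A rough sketch of how the two patterns might differ in the generated structure. The option labels (Private, Medicare, Fever, Cough), the dictionary layout, the placement of the foreign keys, and both function names are assumptions for illustration only; the slide states just the value/column distinction and the cardinalities.

```python
def birth_radiobutton(parent: str, group: str, options: list[str]) -> dict:
    """Radiobutton: options become stored values (rows); parent references one row (M:1)."""
    return {
        "table": group,
        "columns": ["id", "value"],
        "rows": [(i, option) for i, option in enumerate(options, start=1)],
        # Assumed placement: many parent rows may point at the same option row.
        "foreign_key": f"{parent}.{group.lower()}_id -> {group}.id",
        "cardinality": "M:1",
    }


def birth_checkbox(parent: str, group: str, options: list[str]) -> dict:
    """Checkbox: each option becomes a yes/no column; one row per parent row (1:1)."""
    parent_key = f"{parent.lower()}_id"
    return {
        "table": group,
        "columns": ["id", parent_key] + options,
        "column_type": "yes/no",
        # Assumed placement: the checkbox table references its parent row.
        "foreign_key": f"{group}.{parent_key} -> {parent}.id",
        "cardinality": "1:1",
    }


insurance = birth_radiobutton("Patient", "Insurance", ["Private", "Medicare"])
symptoms = birth_checkbox("Patient", "Symptoms", ["Fever", "Cough"])
print(insurance["rows"])
print(symptoms["columns"])
```

Only one radiobutton option can be chosen, so a single reference suffices; several checkboxes can be ticked at once, so each option needs its own yes/no column.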
22
3. Birthing – Pattern 4: Category/Subcategory & Pattern 5: Sibling Categories
[Figure labels: M:M, M:M]
23
3. Birthing Patterns Summarized
24
4. Database Graph Classification
Classify each node to see whether it pre-exists in the existing database, i.e. to find whether it “maps” or not.
[Figure: the New Database Graph is compared with the Existing DB Graph, which is derived from the Existing DB.]
25
4. Database Graph Classification: Algorithm
Problem: Finding matching nodes between the new (DGn) and existing (DGe) database graphs.
Algorithm:
For each table node tn in DGn
  Let te be the label-matching table node in DGe
  If the two table nodes tn and te “match” (TableMatch algorithm)
    Tag tn, i.e. mark this node as a matching/mapped node
    Tag all matching column and value nodes (ColumnMatch algorithm)
  Else
    Rename the table
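The classification loop could be sketched roughly as follows. This is a simplification under stated assumptions: a graph is reduced to a mapping from table label to column labels, value nodes are omitted, ColumnMatch is replaced by a plain label comparison, the "_2" rename suffix is invented, and a table with no label-matching counterpart is simply left untagged. The table-matching rule is passed in as a function.

```python
from typing import Callable

Graph = dict[str, set[str]]  # table label -> column labels (value nodes omitted)


def classify(new_graph: Graph, existing_graph: Graph,
             table_match: Callable[[set[str], set[str]], bool]):
    """Tag the nodes of new_graph that pre-exist in existing_graph.

    Returns (tagged_tables, tagged_columns, renamed); renamed maps the original
    label of a non-matching table to its replacement label.
    """
    tagged_tables: set[str] = set()
    tagged_columns: set[tuple[str, str]] = set()
    renamed: dict[str, str] = {}
    for label, columns in new_graph.items():
        if label not in existing_graph:
            continue  # assumption: no label-matching table, so the node stays untagged
        existing_columns = existing_graph[label]
        if table_match(columns, existing_columns):
            tagged_tables.add(label)
            # Stand-in for ColumnMatch: compare column labels only.
            tagged_columns |= {(label, c) for c in columns & existing_columns}
        else:
            renamed[label] = f"{label}_2"  # hypothetical naming scheme
    return tagged_tables, tagged_columns, renamed


def share_at_least_half(a: set[str], b: set[str]) -> bool:
    """Hypothetical stand-in rule: at least half of all columns are shared."""
    return 2 * len(a & b) >= len(a | b)


existing = {"Patient": {"id", "name", "phone", "dob"},
            "Visit": {"id", "date", "doctor"}}
new = {"Patient": {"id", "name", "phone"},
       "Visit": {"id", "reason", "cost", "notes"},
       "VitalSigns": {"id", "height"}}
print(classify(new, existing, share_at_least_half))
```

In this example Patient is tagged as mapped, Visit is renamed because only its label matches, and VitalSigns stays untagged as a new table.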
26
4. Database Graph Classification: TableMatch Algorithm
Two table nodes “match” if
  their labels match, and
  the null-value column ratio (NCR) < tolerance-threshold (efficiency consideration: minimize null value possibilities during data collection).
NCR = (number of unmatched columns, as per ColumnMatch, in either table, whichever is higher) / (size of the union set of columns in both tables)
27
Example: Null-Value Column Ratio (NCR) Calculation
NCR = 2/5 = 0.4
If tolerance-threshold = 0.5 (high): 0.4 < 0.5, so the tables map.
If tolerance-threshold = 0.3 (low): 0.4 is not below 0.3, so the tables do not map.
When using Form 1, 2 columns will have null values.
When using Form 2, 1 column will have null values.
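The NCR rule and the two thresholds can be checked with a short sketch. The column names are hypothetical, chosen only so that the counts agree with the slide (one table has 1 unmatched column, the other has 2, and the union has 5). Columns are compared by label only, which is a simplification of ColumnMatch.

```python
def null_column_ratio(cols_a: set[str], cols_b: set[str]) -> float:
    """NCR = larger count of unmatched columns / size of the union of columns."""
    union = cols_a | cols_b
    if not union:
        return 0.0  # assumption: two empty tables have no null-value columns
    unmatched = max(len(cols_a - cols_b), len(cols_b - cols_a))
    return unmatched / len(union)


def table_match(label_a: str, cols_a: set[str],
                label_b: str, cols_b: set[str], tolerance: float) -> bool:
    """Two tables match if their labels match and NCR is below the tolerance."""
    return label_a == label_b and null_column_ratio(cols_a, cols_b) < tolerance


# Hypothetical columns: 2 shared, 1 only in the first table, 2 only in the second.
table_1 = {"id", "name", "phone"}
table_2 = {"id", "name", "dob", "email"}

print(null_column_ratio(table_1, table_2))                      # 2 / 5
print(table_match("Patient", table_1, "Patient", table_2, 0.5))
print(table_match("Patient", table_1, "Patient", table_2, 0.3))
```

A higher tolerance merges more tables at the cost of more null values during data collection; a lower one keeps tables separate.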
28
4. Database Graph Classification: ColumnMatch Algorithm
Two non-key column nodes “match” if
  their labels/names are the same,
  their data types are the same, and
  their not-null constraints are the same.
Two foreign key column nodes “match” if
  they both point to the same table nodes, as determined by the TableMatch algorithm.
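A minimal sketch of these two rules, assuming a hypothetical Column record and a precomputed set of table pairs that TableMatch has already accepted. Treating a foreign key and a non-key column as never matching is an added assumption; the slide does not cover that case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Column:
    """Hypothetical column node; references is set only for foreign keys."""
    name: str
    data_type: str
    not_null: bool
    references: Optional[str] = None  # label of the referenced table


def column_match(a: Column, b: Column,
                 matched_tables: set[tuple[str, str]]) -> bool:
    """Apply the non-key rule or the foreign-key rule, depending on the columns."""
    a_is_fk = a.references is not None
    b_is_fk = b.references is not None
    if a_is_fk != b_is_fk:
        return False  # assumption: a foreign key never matches a non-key column
    if a_is_fk:
        # Foreign keys match when their target tables were matched by TableMatch.
        return (a.references, b.references) in matched_tables
    # Non-key columns match on name, data type, and not-null constraint.
    return (a.name, a.data_type, a.not_null) == (b.name, b.data_type, b.not_null)


matched = {("Patient", "Patient")}  # assumed output of TableMatch
print(column_match(Column("Name", "VARCHAR", True),
                   Column("Name", "VARCHAR", True), matched))
print(column_match(Column("PatientID", "INT", True, "Patient"),
                   Column("pid", "INT", True, "Patient"), matched))
```

Note that foreign keys are compared by what they reference, so differently named foreign key columns can still match.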
29
5. Extension of the Existing Database
Add unmapped tables, columns, and values.
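The extension step might look like the following sketch, which assumes classification has already renamed any clashing tables, reduces the database to a mapping from table to column labels, and leaves out value nodes. The function name and the placement of BP in VitalSigns are assumptions for illustration.

```python
def extend_database(existing: dict[str, set[str]],
                    new_graph: dict[str, set[str]]):
    """Return the evolved database and the list of elements that were added."""
    evolved = {table: set(columns) for table, columns in existing.items()}
    added: list[tuple[str, str]] = []
    for table, columns in new_graph.items():
        if table not in evolved:
            evolved[table] = set()
            added.append(("table", table))
        # Only unmapped columns are added; mapped ones already exist.
        for column in sorted(columns - evolved[table]):
            evolved[table].add(column)
            added.append(("column", f"{table}.{column}"))
    return evolved, added


existing_db = {
    "Patient": {"ID", "Name", "Phone", "DOB"},
    "VitalSigns": {"ID", "PatientID", "Date", "Height", "Weight"},
}
new_need = {"VitalSigns": {"ID", "PatientID", "Date", "BP"}}

evolved_db, added = extend_database(existing_db, new_need)
print(added)
```

The existing database is copied rather than modified, so the original structure stays available for comparison.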
30
Preliminary Evaluation
Usability Experiments
Mapping Experiments
Contributions
Implementation: MySQL, Java, JSP, JavaScript, HTML, CSS, Lucene Indexing Package, yFiles Package
31
Usability Evaluation – User Study
Participants and Tasks
  5 nurse professionals
    No knowledge of databases
    Moderate computer users
    Familiar with paper-based forms
  2 tasks
    Build task: replicate a paper-based form on the system.
    Model and build task: model and build a given need (in natural language) into a form using the system interface.
Study Settings
  2 rounds (form scale = no. of steps to design a form)
  Round 1: small-scale needs
    Avg. form scale = 17
    Generated avg. 4.2 relations, 5.8 non-key attributes, 1.8 values, and 3.2 foreign key references
  Round 2: large-scale needs
    Avg. form scale = 47.4
    Generated avg. 6.2 relations, 13.8 attributes, 10.4 values, and 4.6 foreign key references
32
MEASUREMENTS
Duration Ratio = Time (in min) / Form Scale (# of steps to build form)
Assistance Ratio = # of assistances sought / Form Scale (# of steps to build form)
Outliers:
  P3: considered design alternatives (high duration ratio)
  P5: had difficulty with form terminology (needed more assistance)
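Both measurements normalize by form scale, so sessions of different sizes can be compared. A minimal sketch follows; the session numbers are hypothetical, not data from the study.

```python
def duration_ratio(minutes: float, form_scale: int) -> float:
    """Time in minutes per step needed to build the form."""
    if form_scale <= 0:
        raise ValueError("form scale must be a positive number of steps")
    return minutes / form_scale


def assistance_ratio(assistances: int, form_scale: int) -> float:
    """Number of assistances sought per step needed to build the form."""
    if form_scale <= 0:
        raise ValueError("form scale must be a positive number of steps")
    return assistances / form_scale


# Hypothetical session: 8.5 minutes and 2 assistances on a 17-step form.
print(duration_ratio(8.5, 17))
print(assistance_ratio(2, 17))
```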
33
Findings
Effectiveness: In 19/20 cases, participants finished the tasks with 100% effectiveness. The unsuccessful case: a building error committed by a participant who skipped a component while building forms.
Efficiency: Duration ranged from 1 to 9 minutes for simple small-scale needs, and 7 to 19 minutes for advanced large-scale needs. Exception: a participant who considered several design alternatives.
System Adoption
  Efficiency: consistently improved from round 1 to round 2.
  Confidence: Very confident in specifying small-scale needs for both tasks. Improved from round 1 to round 2 for the build task. Did not improve from round 1 to round 2 for the model-and-build task.
  Understanding: improved greatly in round 2. Participants started synthesizing their knowledge of form concepts and domain knowledge to consider different design alternatives.
Comparison with a Related Work
  Appforge (Yang et al. 2008): users are required to create forms and expressive views and are exposed to the existing schema. In our work, users only create forms and the mapping is handled by the system.
34
Mapping Experiment Set 1
Experiments on 5 industrial domains.
For each domain, designed certain forms and used the mapping algorithms to evolve a database.
Compared with a gold standard (found on the Web) developed by experts.

       Tab.  Attr  Val.  FK
D1     +8    0     0     +16, -8
D2     +6    0     NA    +12, -6
D3     +6    0     0     +12, -6
D4     +6    0     NA    +12, -6
D5     +5    0     0     +10, -5

S.No.  Domain      Form  Table  Attr  Val.  FK
D1     DVD Store   8     22     27    6     27
D2     Charity     6     14     17    0     14
D3     Library     7     18     19    2     17
D4     Automobile  7     16     17    0     17
D5     Insurance   4     14     22    8     13

+ indicates an extra element; - indicates a missing element; no sign indicates a perfect match.
35
Analyzing Inaccuracies and System Enhancement
Added another layer of interaction to disambiguate the cardinality between 2 entities (the slide's figure labels two relationships M:M).
Result: All the databases were identical to respective gold standard databases.
Inference: The mapping algorithms have the ability to generate databases in industrial domains.
[Chart: % reduction in tables and % reduction in joins for domains D1 to D5; vertical axis from 0 to 50.]
36
Mapping Experiment Set 2
For each domain, performed mapping experiments with at least 5 different sequences of forms (representing different merging situations).
Result: All the databases generated from different sequences are identical to each other and to the gold standard databases.
Inference: The mapping algorithms have the ability to evolve databases in industrial domains in a variety of merging situations.

Form Sequence     Resultant Database
F1, F2, F3, F4    D1
F2, F4, F1, F3    D2
F1, F4, F3, F2    D3
…                 …
37
Current and Predicted Contributions
Introducing the Form to Database Mapping algorithms driven by data-quality principles.
Mapping experiments on 5 domains: the system has the potential
  to generate high-quality databases in industrial settings solely based on user-designed forms and user-provided domain knowledge;
  to evolve existing databases in a variety of merging situations.
Usability study: the system has the potential to be adopted by non-technical users while providing them efficiency and effectiveness in form modeling.
38
What Next?
Possible Research Experiments
Other Research Areas/System Refinement
Plan for Thesis Completion
39
Possible Research Experiments (in healthcare domain)
Experiment Scenario 1
  Have multiple clinicians evolve a new database using different forms representing different kinds of information.
  Alter form and database complexity.
  Guided vs. unguided.
Experiment Scenario 2
  Have a clinician evolve an existing database based on new needs represented in multiple forms.
  Alter form and database complexity.
  Guided vs. unguided.
Guided: the user is provided with specific needs.
Unguided: the user is only given a context and comes up with her own needs.
40
Scope for Other Research Areas and System Refinement
Form Design Interface
  Design recommendation
  Different form patterns used in industrial domains
  Modify existing form
Form Filling Component
  Data recommendation
Tree Generation
  Handle elsewhere-designed forms by combining existing form information extraction techniques (SIGMOD Record Survey, 2010)
Birthing Algorithm
  Derive weak entities, generalization
Merging Algorithm – Label Matching
  Match synonyms, hyponyms, etc.
41
Plan for Dissertation Completion
Thank you!