Addressing Time/Quality Trade-off in View Maintenance

Soheila Dehghanzadeh

Insight Centre for Data Analytics

Outline

•Introduction

•Terminology

•Problem definition

•Proposed solution

•Experimental results

•Conclusion

Introduction: Query Processing On Linked Data

•Report changes to the local store (maintenance)
  • Sources proactively report changes or their existence (pushing).
  • The query processor discovers new sources and changes by crawling (pulling).

•Maintenance trade-off
  • Fast maintenance leads to high quality but slow response, and vice versa.

•Problem: Maintenance according to a user-defined trade-off.

•Why is it important? It eliminates unnecessary maintenance and leads to faster response and better scalability.

[Figure: updates and new sources feed a local store (replication in databases, caching on the web, or off-line materialization); the query processor answers queries from the local store and returns the query response. Benefits listed: scalability, availability, performance.]

View Maintenance Categorization

Trade-off Management vs. Change Reporting Mechanism

Time/quality trade-off:

                    query level                            replica level
                    quality constraint   time constraint   quality constraint   time constraint
update stream       A                    B                 C                    D
no update stream    E                    F                 G                    H

Problem Definition

•Problem E
  • Optimizing maintenance to satisfy quality constraints within the lowest response time for each query.

•Problem F
  • Optimizing maintenance to satisfy time constraints with the highest response quality for each query.

Terminology

•Quality requirements:
  • Freshness: B/(A+B)
  • Completeness: B/(B+C)

•Maintenance plan
  • Each set of views chosen for maintenance is called a maintenance plan.
  • With n views, the number of maintenance plans is 2^n.
  • Each maintenance plan leads to a different response quality.

V1 V2 V3 V4

20% 90% 10% 80%
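
To illustrate how quickly the number of plans grows, the sketch below enumerates every subset of the four views V1 to V4 listed above. It assumes that the empty set (no maintenance) counts as a plan; the function name `maintenance_plans` is hypothetical and not part of the talk.

```python
from itertools import combinations

def maintenance_plans(views):
    """Return every subset of `views`; each subset is one maintenance plan."""
    plans = []
    for size in range(len(views) + 1):
        # combinations() yields each subset of the given size exactly once
        plans.extend(combinations(views, size))
    return plans

views = ["V1", "V2", "V3", "V4"]
plans = maintenance_plans(views)
print(len(plans))  # number of subsets of four views
```

Because each additional view doubles the number of subsets, evaluating every plan by actually executing it is unlikely to scale, which motivates estimating the quality of a plan instead.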

Freshness Example

No maintenance:

View 1 (60%)    View 2 (40%)    Join (50%)
a1 b1 T         a1 c1 F         a1 b1 c1 F
a2 b2 T         a1 c2 F         a1 b1 c2 F
a3 b3 F         a1 c3 T         a1 b1 c3 T
a4 b4 T         a2 c4 T         a2 b2 c4 T
a5 b5 F         a6 c5 F

View 1 maintained:

View 1 (100%)   View 2 (40%)    Join (50%)
a1 b1 T         a1 c1 F         a1 b1 c1 F
a2 b2 T         a1 c2 F         a1 b1 c2 F
a3 b3 T         a1 c3 T         a1 b1 c3 T
a4 b4 T         a2 c4 T         a2 b2 c4 T
a5 b5 T         a6 c5 F

View 2 maintained:

View 1 (60%)    View 2 (100%)   Join (100%)
a1 b1 T         a1 c1 T         a1 b1 c1 T
a2 b2 T         a1 c2 T         a1 b1 c2 T
a3 b3 F         a1 c3 T         a1 b1 c3 T
a4 b4 T         a2 c4 T         a2 b2 c4 T
a5 b5 F         a6 c5 T
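
The freshness example can be reproduced with a few lines of Python. This is a minimal sketch, assuming that a join result is fresh only when both contributing tuples are fresh (which matches the labels in the example) and that maintaining a view makes all of its tuples fresh. The helper names `freshness`, `join`, and `maintain` are illustrative only.

```python
def freshness(rows):
    """Fraction of rows whose last field (the freshness label) is True."""
    return sum(1 for r in rows if r[-1]) / len(rows)

def join(view1, view2):
    """Join on the first column; a result is fresh only if both inputs are."""
    return [(a, b, c, f1 and f2)
            for (a, b, f1) in view1
            for (a2, c, f2) in view2 if a == a2]

def maintain(view):
    """Model maintenance by marking every tuple of the view as fresh."""
    return [row[:-1] + (True,) for row in view]

view1 = [("a1", "b1", True), ("a2", "b2", True), ("a3", "b3", False),
         ("a4", "b4", True), ("a5", "b5", False)]
view2 = [("a1", "c1", False), ("a1", "c2", False), ("a1", "c3", True),
         ("a2", "c4", True), ("a6", "c5", False)]

no_maintenance = freshness(join(view1, view2))
only_view1 = freshness(join(maintain(view1), view2))
only_view2 = freshness(join(view1, maintain(view2)))
print(no_maintenance, only_view1, only_view2)
```

Maintaining View 1 does not help here because the stale tuples of View 1 (a3, a5) never take part in the join, while maintaining View 2 repairs every stale join result.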

Research questions

•What is the least costly maintenance plan that fulfills the response quality requirements?
  • What is the quality of the response without maintenance?
  • What is the quality of the response of each “maintenance plan”?

Experiment

•We use the BSBM benchmark to create a dataset and a query set.

•We label triples with true/false to specify their freshness status.

•We summarize the cache to estimate the quality of a query response without actually executing the query on the cache.

•To summarize the cache, we extended cardinality estimation techniques to the freshness estimation problem.

Subject   Predicate   Object      Fresh
Alice     Lives       Dublin      True
Bob       Lives       Berlin      False
Alice     Job         Teacher     True
Bob       Job         Developer   False

Cardinality Estimation

•Capture the data distribution by splitting data into buckets and only keep the bucket cardinality in the summary.

Data:

Subject   Predicate   Object        Freshness
Alice     Job         Teacher       True
Alice     Lives       Dublin        True
Alice     Job         PhD student   False
Alice     Lives       Athlon        False
Bob       Job         Manager       True
Bob       Lives       Berlin        True
Bob       Lives       Chicago       True
Bob       Lives       Munich        False
Bob       Lives       Belfast       False
Bob       Lives       Limerick      False
Bob       Job         CEO           False
Bob       Job         Consultant    False

Summaries:

Bucket            Cardinality   Fresh
* Job *           5             2
* Lives *         7             3
Alice Job *       2             1
Bob Job *         3             1
Alice Lives *     2             1
Bob Lives *       5             2

Queries:

Q1: ?a Job ?b
Q2: (?a Job ?b)^(?a Lives ?c)

Cardinality:

Summary                        Q1 estimated   Q1 actual   Q2 estimated   Q2 actual
Per predicate                  5              5           35             19
Per subject and predicate      5              5           19             19

Freshness:

Summary                        Q1 estimated   Q1 actual   Q2 estimated   Q2 actual
Per predicate                  2/5            2/5         6/35           3/19
Per subject and predicate      2/5            2/5         3/19           3/19
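
The numbers on this slide can be reproduced with a small sketch. It assumes that each bucket stores a cardinality and a fresh count, and that a join estimate multiplies the counts of buckets whose subject parts agree. The names `summarize` and `estimate_join` are hypothetical, not the author's implementation.

```python
from fractions import Fraction

# (subject, predicate, object, fresh) triples from the slide
triples = [
    ("Alice", "Job", "Teacher", True), ("Alice", "Lives", "Dublin", True),
    ("Alice", "Job", "PhD student", False), ("Alice", "Lives", "Athlon", False),
    ("Bob", "Job", "Manager", True), ("Bob", "Lives", "Berlin", True),
    ("Bob", "Lives", "Chicago", True), ("Bob", "Lives", "Munich", False),
    ("Bob", "Lives", "Belfast", False), ("Bob", "Lives", "Limerick", False),
    ("Bob", "Job", "CEO", False), ("Bob", "Job", "Consultant", False),
]

def summarize(triples, key):
    """Map each bucket key to [cardinality, fresh count]."""
    summary = {}
    for s, p, _o, fresh in triples:
        bucket = summary.setdefault(key(s, p), [0, 0])
        bucket[0] += 1
        bucket[1] += int(fresh)
    return summary

def estimate_join(summary, p1, p2):
    """Estimate (cardinality, fresh count) of (?a p1 ?b)^(?a p2 ?c).

    Buckets are paired when their subject parts agree; within a pair the
    counts are multiplied, which is exact only if the bucket is per subject.
    """
    card = fresh = 0
    for (s1, q1), (c1, f1) in summary.items():
        for (s2, q2), (c2, f2) in summary.items():
            if q1 == p1 and q2 == p2 and s1 == s2:
                card += c1 * c2
                fresh += f1 * f2
    return card, fresh

coarse = summarize(triples, lambda s, p: ("*", p))  # one bucket per predicate
fine = summarize(triples, lambda s, p: (s, p))      # one bucket per subject and predicate

for name, summary in (("per predicate", coarse), ("per subject and predicate", fine)):
    card, fresh = estimate_join(summary, "Job", "Lives")
    print(name, card, Fraction(fresh, card))
```

The finer summary is exact for Q2 in this example because the join variable is the subject, so multiplying per-subject counts gives the true number of join results.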

Cardinality Estimation Approaches

•System R assumptions for cardinality estimation:
  • Data is uniformly distributed per attribute.
  • Predicates are independent (either in the same table or among different tables).
  • Predicate multiplication approaches make both assumptions.

•A histogram captures the dependencies among predicates for more accurate estimation.

Measure accuracy of the estimation approach

Measure the difference between the actual and estimated freshness of the n queries in a query set.
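
The exact error formula is not reproduced here, so the following is only one plausible way to aggregate the per-query differences: a mean absolute error over the n queries. The choice of mean absolute error and the name `mean_absolute_error` are assumptions, not the measure stated by the author.

```python
def mean_absolute_error(actual, estimated):
    """Average of |actual - estimated| over all queries (assumed metric)."""
    if len(actual) != len(estimated) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    n = len(actual)
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / n

# Freshness of Q1 and Q2 from the cardinality estimation example,
# using the per-predicate summary for the estimates.
actual = [2 / 5, 3 / 19]
estimated = [2 / 5, 6 / 35]
error = mean_absolute_error(actual, estimated)
print(error)
```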

Estimation Results

Estimation Error 1

Query: <?s,Job,?o1> join <?s, Lives in,?o2>

Data:

?s, Job, ?o (50% fresh)
a   Job        teacher     T
a   Job        professor   F
a   Job        PhD         F
b   Job        developer   T

?s, Lives in, ?o (75% fresh)
a   Lives in   Dublin      T
b   Lives in   Galway      F
b   Lives in   Cork        T
b   Lives in   Limerick    T

Join result (50% fresh)
a   teacher     Dublin     T
a   Professor   Dublin     F
a   PhD         Dublin     F
b   Developer   Galway     F
b   Developer   Cork       T
b   Developer   Limerick   T

Estimate from the summary: 50% × 75% = 37.5%. Actual freshness: 50%.

Reason: Dependencies

Solution:
  • A more granular index on the join dimension (subject) and the bounded dimension (predicate).
  • Histogram and table-level synopses can capture these dependencies and reduce the error accordingly.

Experiment: We did not observe this error in our experiment because we didn’t have such dependencies in the dataset.
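
The gap between the summary-based estimate and the actual freshness can be checked with a short sketch. It assumes that predicate multiplication simply multiplies the two per-predicate freshness ratios, and it models the "more granular index" as per-subject counts; the helper names are illustrative only.

```python
job = [("a", "teacher", True), ("a", "professor", False),
       ("a", "PhD", False), ("b", "developer", True)]
lives = [("a", "Dublin", True), ("b", "Galway", False),
         ("b", "Cork", True), ("b", "Limerick", True)]

def fresh_ratio(rows):
    """Fraction of rows labeled fresh (last field is True)."""
    return sum(r[-1] for r in rows) / len(rows)

# Predicate multiplication: treats the two patterns as independent.
estimated = fresh_ratio(job) * fresh_ratio(lives)

# Actual join on the subject; a result is fresh only if both inputs are.
joined = [(s, o1, o2, f1 and f2)
          for (s, o1, f1) in job
          for (s2, o2, f2) in lives if s == s2]
actual = fresh_ratio(joined)

def per_subject_estimate(left, right):
    """Estimate join freshness from per-subject (count, fresh count) pairs."""
    total = fresh = 0
    for s in {r[0] for r in left} & {r[0] for r in right}:
        l_rows = [r for r in left if r[0] == s]
        r_rows = [r for r in right if r[0] == s]
        total += len(l_rows) * len(r_rows)
        fresh += sum(r[-1] for r in l_rows) * sum(r[-1] for r in r_rows)
    return fresh / total

per_subject = per_subject_estimate(job, lives)
print(estimated, actual, per_subject)
```

In this data the subject with many stale jobs (a) has few addresses, and the subject with a fresh job (b) has many, so the independence assumption underestimates the join freshness.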

Estimation Error 2

20 October 2014

Query: <?s,Job,Developer> join <?s, Lives in,?o2>

Summary:

?s, Job, ?o1        50%
?s, Lives in, ?o2   75%

Data:

a   Job        teacher     T
a   Job        professor   F
a   Job        PhD         F
b   Job        developer   T

a   Lives in   Dublin      T
b   Lives in   Galway      F
b   Lives in   Cork        T
b   Lives in   Limerick    T

Join result (66% fresh)
b   Developer   Galway     F
b   Developer   Cork       T
b   Developer   Limerick   T

Estimate from the summary: 37.5%. Actual freshness: 66%.

Reason: bounded object

Solution:
  • A more granular index on the join dimension (subject) and the bounded dimensions (predicate and object). This requires indexing the whole dataset, which is not efficient.

Experiment: We did not observe any improvement on this error by using a histogram.
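
A brief sketch of the bounded-object case, under the same assumptions as before (a join result is fresh only if both inputs are fresh, and the summary keeps one freshness ratio per predicate). The variable names are illustrative only.

```python
job = [("a", "teacher", True), ("a", "professor", False),
       ("a", "PhD", False), ("b", "developer", True)]
lives = [("a", "Dublin", True), ("b", "Galway", False),
         ("b", "Cork", True), ("b", "Limerick", True)]

def fresh_ratio(rows):
    """Fraction of rows labeled fresh (last field is True)."""
    return sum(r[-1] for r in rows) / len(rows)

# The per-predicate summary cannot see the object binding, so the
# estimate is the same as for the unbound query.
summary_estimate = fresh_ratio(job) * fresh_ratio(lives)

# Actual answer: bind the object to "developer" first, then join on subject.
bound = [r for r in job if r[1] == "developer"]
joined = [(s, o1, o2, f1 and f2)
          for (s, o1, f1) in bound
          for (s2, o2, f2) in lives if s == s2]
actual = fresh_ratio(joined)

# An index keyed by subject, predicate and object has one entry per triple,
# i.e. it is as large as the dataset itself.
full_index = {(s, "Job", o): f for (s, o, f) in job}
print(summary_estimate, actual, len(full_index) == len(job))
```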

Concern 1 on problem definition

Job:

Triple                    State 1   State 2   State 3   State 4
Bob Job Teacher           True      True      True      False
Bob Job PhD               True      False     False     False
Alice Job Professor       True      True      False     False
Freshness                 100%      66%       33%       0%

Lives in:

Triple                    State 1   State 2   State 3   State 4
Bob Lives in Limerick     True      True      True      False
Bob Lives in Galway       True      True      False     False
Alice Lives in Dublin     True      True      True      True
Alice Lives in Cork       True      False     False     False
Freshness                 100%      75%       50%       25%

Join:

Result                    State 1   State 2   State 3   State 4
Bob Teacher Limerick      True      True      True      False
Bob Teacher Galway        True      True      False     False
Bob PhD Limerick          True      False     False     False
Bob PhD Galway            True      False     False     False
Alice Professor Dublin    True      True      False     False
Alice Professor Cork      True      False     False     False
Freshness                 100%      50%       16%       0%

True

False

True

True

False

66%

Concern 2 on the suggested solution

•We need to build one summary for each maintenance plan, because the summary of one maintenance plan cannot be used to estimate the freshness of a query executed on another maintenance plan.

•This is very inefficient given the space requirements and the cost of maintaining these summaries.

Conclusion

•We defined quality constraints based on freshness and completeness.

•We summarized a snapshot of a dataset, using indexing and histograms, to estimate the freshness of various queries.

•We need to build individual summaries for each maintenance plan, since a summary for one maintenance plan cannot be used to estimate the quality of a query executed on other maintenance plans.

•Our experiment was not affected by the estimation error caused by dependencies, because the dataset contains no such dependencies. The next step is to design a more realistic dataset and compare the results of histogram and predicate multiplication again.

•Summarization techniques are designed for a very static environment, and any change to the underlying data requires rebuilding the summary from scratch. So does it really make sense to extend cardinality estimation to freshness estimation?

Problem Definition

•Problem E
  • Optimizing maintenance to satisfy quality constraints within the lowest response time for each query.

•Problem F
  • Optimizing maintenance to satisfy time constraints with the highest response quality for each query.

Problem description without join

Replica

The user queries the replica with time constraints.

The replica should maintain only the subset of the result that is more likely to have expired.

Scenario

[Figure labels: Stream Data (Window); Background Data (Replica)]

Use Case

[Figure labels: Twitter stream data (number of mentions in the last Twitter window); background data (user follower count, replica)]

Rising stars query: find users who have been mentioned more than 100 times in the last 10 minutes and have more than 1000 followers, with a constraint on the execution time.
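
As a rough illustration of the query itself, the sketch below evaluates it over an in-memory list of mentions and a dictionary standing in for the follower-count replica. The function name `rising_stars`, the data layout, and the sample values are hypothetical, and the execution-time constraint is not modeled.

```python
from collections import Counter

def rising_stars(mentions, follower_replica, now, window_seconds=600,
                 min_mentions=100, min_followers=1000):
    """Users mentioned more than `min_mentions` times in the last window
    who have more than `min_followers` followers according to the replica.

    mentions: iterable of (timestamp_in_seconds, user)
    follower_replica: dict user -> follower count (may be stale)
    """
    counts = Counter(user for ts, user in mentions
                     if now - window_seconds < ts <= now)
    return sorted(user for user, n in counts.items()
                  if n > min_mentions
                  and follower_replica.get(user, 0) > min_followers)

# Made-up sample data: one mention per second for each user.
mentions = ([(t, "alice") for t in range(101)]
            + [(t, "bob") for t in range(50)]
            + [(t, "carol") for t in range(150)])
follower_replica = {"alice": 5000, "bob": 9000, "carol": 500}
result = rising_stars(mentions, follower_replica, now=200)
print(result)
```

The answer depends on the follower counts stored in the replica, so a stale entry can wrongly include or exclude a user; that is the quality side of the trade-off.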

Continuous join operator with one replica

•We implemented a set of continuous join operators:
  • DWJoin: uses the static replica and never changes it (the quality of the response degrades).
  • Baseline join: chooses the least recently updated (LRU) entries among the matches to update (the least recently updated entry is not necessarily the one that requires updating).
  • Oracle join: fetches data directly from the source.
  • Smart join: computes statistics of the change rate and fetches the entries that are likely to have expired.
  • Mixed baseline-smart (possible extension).
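
The difference between the baseline and smart policies can be sketched as two selection functions over the matching replica entries. This is an illustration under stated assumptions, not the implemented operators: the update budget, the per-entry estimated change interval, and the staleness score (age divided by change interval) are all hypothetical.

```python
def choose_lru(matches, last_updated, budget):
    """Baseline policy: refresh the entries that were updated longest ago."""
    return sorted(matches, key=lambda k: last_updated[k])[:budget]

def choose_smart(matches, last_updated, change_interval, now, budget):
    """Smart policy (sketch): refresh entries whose age most exceeds
    their estimated interval between changes."""
    def staleness(k):
        return (now - last_updated[k]) / change_interval[k]
    likely_expired = [k for k in matches if staleness(k) >= 1.0]
    return sorted(likely_expired, key=staleness, reverse=True)[:budget]

matches = ["u1", "u2", "u3"]
last_updated = {"u1": 0, "u2": 50, "u3": 90}        # time of last refresh
change_interval = {"u1": 1000, "u2": 20, "u3": 5}   # estimated time between changes

lru_choice = choose_lru(matches, last_updated, budget=1)
smart_choice = choose_smart(matches, last_updated, change_interval,
                            now=100, budget=1)
print(lru_choice, smart_choice)
```

Here the baseline spends its budget on u1, which was refreshed longest ago but rarely changes, while the smart policy picks u2, whose age is several times its estimated change interval.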

Performance of join operators

Possible extensions

The problem becomes complicated when the query is a join between replicas.

Which combination of entries, when updated, yields the highest increase in the correctness of the join result?

[Figure labels: Replica, Replica]

Future work

•Use a better model for learning the change rate in the smart policy.

•We believe that the smart policy will perform better if the change rate is more predictable.

•Investigate the problem where there are joins on the background knowledge side, to find which combination of stale entries contributes most to the result correctness once updated.

Thanks a lot for your attention!