Snowflake Gau

  • date post

    11-Apr-2015
  • Category

    Documents

Chapter 3
Data Warehouse and OLAP Technology: An Overview

3.7 Exercises

1. State why, for the integration of multiple heterogeneous information sources, many companies in industry prefer the update-driven approach (which constructs and uses data warehouses), rather than the query-driven approach (which applies wrappers and integrators). Describe situations where the query-driven approach is preferable over the update-driven approach.

Answer:

For decision-making queries and frequently asked queries, the update-driven approach is preferable. This is because expensive data integration and aggregate computation are done before query processing time. In order for the data collected in multiple heterogeneous databases to be used in decision-making processes, data must be integrated and summarized, with the semantic heterogeneity problems among multiple databases analyzed and solved. If the query-driven approach is employed, these queries will be translated into multiple (often complex) queries for each individual database. The translated queries will compete for resources with the activities at the local sites, thus degrading their performance. In addition, these queries will generate a complex answer set, which will require further filtering and integration. Thus, the query-driven approach is, in general, inefficient and expensive. The update-driven approach employed in data warehousing is faster and more efficient since most of the queries needed could be done off-line.

For queries that are used rarely, reference the most current data, and/or do not require aggregations, the query-driven approach would be preferable over the update-driven approach. In this case, it may not be justifiable for an organization to pay heavy expenses for building and maintaining a data warehouse if only a small number of databases and/or relatively small databases are used, or if the queries rely on the current data, since data warehouses do not contain the most current information.

2. Briefly compare the following concepts.
You may use an example to explain your point(s).

(a) Snowflake schema, fact constellation, starnet query model
(b) Data cleaning, data transformation, refresh
(c) Discovery-driven cube, multifeature cube, virtual warehouse

Answer:

(a) Snowflake schema, fact constellation, starnet query model

The snowflake schema and fact constellation are both variants of the star schema model, which consists of a fact table and a set of dimension tables; the snowflake schema contains some normalized dimension tables, whereas the fact constellation contains a set of fact tables that share some common dimension tables. A starnet query model is a query model (not a schema model), which consists of a set of radial lines emanating from a central point, where each radial line represents one dimension and each point (called a footprint) along the line represents a level of the dimension, and each step going out from the center represents the stepping down of a concept hierarchy of the dimension. The starnet query model, as suggested by its name, is used for querying and provides users with a global view of OLAP operations.

(b) Data cleaning, data transformation, refresh

Data cleaning is the process of detecting errors in the data and rectifying them when possible. Data transformation is the process of converting the data from heterogeneous sources to a unified data warehouse format or semantics. Refresh is the function propagating the updates from the data sources to the warehouse.

(c) Discovery-driven cube, multi-feature cube, virtual warehouse

A discovery-driven cube uses precomputed measures and visual cues to indicate data exceptions at all levels of aggregation, guiding the user in the data analysis process.
A multi-feature cube computes complex queries involving multiple dependent aggregates at multiple granularities (e.g., to find the total sales for every item having a maximum price, we need to apply the aggregate function SUM to the tuple set output by the aggregate function MAX). A virtual warehouse is a set of views (containing data warehouse schema, dimension, and aggregate measure definitions) over operational databases.

3. Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two measures count and charge, where charge is the fee that a doctor charges a patient for a visit.

(a) Enumerate three classes of schemas that are popularly used for modeling data warehouses.
(b) Draw a schema diagram for the above data warehouse using one of the schema classes listed in (a).
(c) Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2004?
(d) To obtain the same list, write an SQL query assuming the data is stored in a relational database with the schema fee(day, month, year, doctor, hospital, patient, count, charge).

Answer:

(a) Enumerate three classes of schemas that are popularly used for modeling data warehouses.

Three classes of schemas popularly used for modeling data warehouses are the star schema, the snowflake schema, and the fact constellation schema.

(b) Draw a schema diagram for the above data warehouse using one of the schema classes listed in (a).

A star schema is shown in Figure 3.1.

(c) Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2004?

The operations to be performed are:
Roll-up on time from day to year.
Slice for time = 2004.
Roll-up on patient from individual patient to all.

(d) To obtain the same list, write an SQL query assuming the data is stored in a relational database with the schema fee(day, month, year, doctor, hospital, patient, count, charge).

    select doctor, SUM(charge)
    from fee
    where year = 2004
    group by doctor

Figure 3.1: A star schema for data warehouse of Exercise 2.3.
    fact table: time_key, doctor_id, patient_id, charge, count
    time dimension table: time_key, day, day_of_week, month, quarter, year
    doctor dimension table: doctor_id, doctor_name, phone_#, address, sex
    patient dimension table: patient_id, patient_name, phone_#, address, description, sex

4. Suppose that a data warehouse for Big-University consists of the following four dimensions: student, course, semester, and instructor, and two measures count and avg_grade. When at the lowest conceptual level (e.g., for a given student, course, semester, and instructor combination), the avg_grade measure stores the actual course grade of the student. At higher conceptual levels, avg_grade stores the average grade for the given combination.

(a) Draw a snowflake schema diagram for the data warehouse.
(b) Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (e.g., roll-up from semester to year) should one perform in order to list the average grade of CS courses for each Big-University student?
(c) If each dimension has five levels (including all), such as student < major < status < university < all, how many cuboids will this cube contain (including the base and apex cuboids)?

Answer:

(a) Draw a snowflake schema diagram for the data warehouse.

A snowflake schema is shown in Figure 3.2.

(b) Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (e.g., roll-up from semester to year) should one perform in order to list the average grade of CS courses for each Big-University student?

The specific OLAP operations to be performed are:
Roll-up on course from course_id to department.
Roll-up on student from student_id to university.
Dice on course, student with department = CS and university = Big-University.
Drill-down on student from university to student_name.

Figure 3.2: A snowflake schema for data warehouse of Exercise 2.4.
    fact table (univ): student_id, course_id, semester_id, instructor_id, count, avg_grade
    student dimension table: student_id, student_name, area_id, major, status, university
    area dimension table: area_id, city, province, country
    course dimension table: course_id, course_name, department
    semester dimension table: semester_id, semester, year
    instructor dimension table: instructor_id, dept, rank

(c) If each dimension has five levels (including all), such as student < major < status < university < all, how many cuboids will this cube contain (including the base and apex cuboids)?

This cube will contain 5^4 = 625 cuboids.

5. Suppose that a data warehouse consists of the four dimensions, date, spectator, location, and game, and the two measures, count and charge, where charge is the fare that a spectator pays when watching a game on a given date. Spectators may be students, adults, or seniors, with each category having its own charge rate.

(a) Draw a star schema diagram for the data warehouse.
(b) Starting with the base cuboid [date, spectator, location, game], what specific OLAP operations should one perform in order to list
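
A note on the count in Exercise 4(c): each cuboid corresponds to picking exactly one level per dimension, so the total is the product of the per-dimension level counts, here 5 x 5 x 5 x 5. The Python sketch below checks this by brute-force enumeration. Only the student hierarchy comes from the exercise; the level names for course, semester, and instructor are placeholders invented for illustration.

```python
from itertools import product
from math import prod

# Level names per dimension, finest first, "all" last.
# Only the student hierarchy is given in the exercise; the other
# three hierarchies use hypothetical placeholder names.
levels = {
    "student": ["student", "major", "status", "university", "all"],
    "course": ["course", "c_level2", "c_level3", "c_level4", "all"],
    "semester": ["semester", "s_level2", "s_level3", "s_level4", "all"],
    "instructor": ["instructor", "i_level2", "i_level3", "i_level4", "all"],
}


def count_cuboids(levels_by_dim):
    """Number of cuboids = product of the level counts of all dimensions."""
    return prod(len(lv) for lv in levels_by_dim.values())


def enumerate_cuboids(levels_by_dim):
    """Every cuboid is one choice of level for each dimension."""
    return list(product(*levels_by_dim.values()))


total = count_cuboids(levels)
cuboids = enumerate_cuboids(levels)

base_cuboid = tuple(lv[0] for lv in levels.values())   # finest level everywhere
apex_cuboid = tuple(lv[-1] for lv in levels.values())  # "all" everywhere

print(total, len(cuboids))
print(base_cuboid in cuboids, apex_cuboid in cuboids)
```

The enumeration and the closed-form product agree, and both the base cuboid and the apex cuboid appear among the enumerated combinations.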
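
The SQL answer to Exercise 3(d) can also be tried on a small table. The sketch below uses Python's built-in sqlite3 module with an in-memory database; SQLite is only a stand-in for whatever relational system the exercise has in mind, and the rows are made-up sample data, not figures from the text. The WHERE clause plays the role of the slice on year = 2004, and GROUP BY doctor with SUM(charge) plays the role of rolling patient up to all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema from the exercise: fee(day, month, year, doctor, hospital, patient, count, charge).
conn.execute(
    'CREATE TABLE fee ('
    'day INTEGER, month INTEGER, year INTEGER, '
    'doctor TEXT, hospital TEXT, patient TEXT, '
    '"count" INTEGER, charge REAL)'
)

# Hypothetical sample rows for illustration only.
rows = [
    (5, 1, 2004, "Dr. A", "H1", "p1", 1, 100.0),
    (6, 1, 2004, "Dr. A", "H1", "p2", 1, 50.0),
    (7, 3, 2004, "Dr. B", "H2", "p1", 1, 80.0),
    (9, 3, 2003, "Dr. B", "H2", "p3", 1, 999.0),  # outside 2004, must be excluded
]
conn.executemany("INSERT INTO fee VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

# The query from the answer to part (d).
query = """
    SELECT doctor, SUM(charge)
    FROM fee
    WHERE year = 2004
    GROUP BY doctor
"""
totals = dict(conn.execute(query).fetchall())
print(totals)
conn.close()
```

With these sample rows, the 2003 visit is filtered out and each doctor's 2004 charges are summed across all patients and days.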