C20.0046: Database Management Systems Lecture #25

Page 1: C20.0046: Database Management Systems Lecture #25

M.P. Johnson, DBMS, Stern/NYU, Spring 2005

M.P. Johnson

Stern School of Business, NYU

Spring, 2005

Page 2: C20.0046: Database Management Systems Lecture #25

Agenda
  Querying XML
  Data Warehousing
  Next week:
    Data Mining
    Websearch
    Etc.

Page 3: C20.0046: Database Management Systems Lecture #25

Goals after today:
  1. Be aware of some of the important XML standards
  2. Know how to write some DW queries in Oracle

Page 4: C20.0046: Database Management Systems Lecture #25

XML: Semi-structured data
  Not too random
    Data organized into entities
    Similar/related entities grouped to form other entities
  Not too structured
    Some attributes may be missing
    Size of attributes may vary
    Support of lists/sets
  Juuust right
    Data is self-describing

Page 5: C20.0046: Database Management Systems Lecture #25

<movieinfo>
  <movie id="o111">
    <title>Lost in Translation</title>
    <year>2003</year>
    <stars idref="o333 o444"/>
  </movie>
  <movie id="o222">
    <title>Hamlet</title>
    <year>1999</year>
    <stars idref="o333"/>
  </movie>
  <person id="o333">
    <name>Bill Murray</name>
    <movies idref="o111 o222"/>
  </person>
</movieinfo>

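The id/idref links in the movie document can be followed programmatically. The following is a minimal sketch using Python's standard xml.etree.ElementTree, not part of the original lecture. It assumes the person element carries id "o333" (the value the movies' stars attributes point to) and treats idref as a space-separated list; the names by_id and stars_of are illustrative.

```python
import xml.etree.ElementTree as ET

# Assumption: the person carries id "o333", the value the movies' stars point to.
DOC = """
<movieinfo>
  <movie id="o111">
    <title>Lost in Translation</title>
    <year>2003</year>
    <stars idref="o333 o444"/>
  </movie>
  <movie id="o222">
    <title>Hamlet</title>
    <year>1999</year>
    <stars idref="o333"/>
  </movie>
  <person id="o333">
    <name>Bill Murray</name>
    <movies idref="o111 o222"/>
  </person>
</movieinfo>
"""

root = ET.fromstring(DOC)

# Index every element that has an id attribute.
by_id = {el.get("id"): el for el in root.iter() if el.get("id")}

def stars_of(movie):
    """Names of the stars a movie refers to; dangling refs (o444 here) are skipped."""
    refs = movie.find("stars").get("idref").split()
    return [by_id[r].findtext("name") for r in refs if r in by_id]

cast = {m.findtext("title"): stars_of(m) for m in root.findall("movie")}
print(cast)
```

Note that o444 is referenced but never defined in the fragment; semi-structured data tolerates that, so the sketch skips it rather than failing.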

Page 6: C20.0046: Database Management Systems Lecture #25

New topic: Querying XML
  XPath
    Simple protocol for accessing nodes
    Will use in XQuery and in conversion from relations
  XQuery
    SQL : relations :: XQuery : XML
  XSLT
    Sophisticated transformations
    Sometimes for presentation

Page 7: C20.0046: Database Management Systems Lecture #25

XQuery
  Queries are FLWR expressions
    FOR/LET...WHERE...RETURN...
  Based on Quilt and XML-QL

  FOR $b IN document("bib.xml")//book
  WHERE $b/publisher = "Morgan Kaufmann"
    AND $b/year = "1998"
  RETURN $b/title

Page 8: C20.0046: Database Management Systems Lecture #25

XQuery
  Find all book titles published after 1995:

  FOR $x IN document("bib.xml")/bib/book
  WHERE $x/year > 1995
  RETURN { $x/title }

  Result:
    <title>abc</title>
    <title>def</title>
    <title>ghi</title>
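The FOR/WHERE/RETURN pattern corresponds to a loop, a filter, and a projection. The following is a rough Python sketch of the same query using xml.etree.ElementTree; it is not XQuery. The contents of bib.xml are a made-up stand-in: the titles echo the slide's result, and the years are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for bib.xml; titles echo the slide, years are made up.
BIB = """
<bib>
  <book><title>abc</title><year>1996</year></book>
  <book><title>def</title><year>2001</year></book>
  <book><title>old</title><year>1990</year></book>
  <book><title>ghi</title><year>1999</year></book>
</bib>
"""

bib = ET.fromstring(BIB)

titles = [
    ET.tostring(book.find("title"), encoding="unicode").strip()  # RETURN { $x/title }
    for book in bib.findall("book")                               # FOR $x IN .../bib/book
    if int(book.findtext("year")) > 1995                          # WHERE $x/year > 1995
]
print("\n".join(titles))
```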

Page 9: C20.0046: Database Management Systems Lecture #25

SQL v. XQuery
  Product(pid, name, maker)
  Company(cid, name, city)

  Find all products made in NYC

  SQL:
  SELECT x.name
  FROM Product x, Company y
  WHERE x.maker = y.cid AND y.city = 'NYC'

  XQuery:
  FOR $r IN document("db.xml")/db,
      $x IN $r/Product/row,
      $y IN $r/Company/row
  WHERE $x/maker/text() = $y/cid/text()
    AND $y/city/text() = "NYC"
  RETURN { $x/name }

Page 10: C20.0046: Database Management Systems Lecture #25

SQL v. XQuery
  For each company with revenues < 1M, count the products over $100

  SQL:
  SELECT y.name, count(*)
  FROM Product x, Company y
  WHERE x.price > 100 AND x.maker = y.cid AND y.revenue < 1000000
  GROUP BY y.cid, y.name

  XQuery:
  FOR $r IN document("db.xml")/db,
      $y IN $r/Company/row[revenue/text() < 1000000]
  RETURN
    <Company>
      <companyName>{ $y/name/text() }</companyName>
      <numberOfExpensiveProducts>
        { count($r/Product/row[maker/text() = $y/cid/text()][price/text() > 100]) }
      </numberOfExpensiveProducts>
    </Company>
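One subtlety when comparing the two queries: the SQL join produces no row for a company that has no product over $100, whereas the XQuery iterates over companies and so reports such a company with a count of 0. The following sketch illustrates this with Python's sqlite3 module standing in for Oracle. The sample rows and the price and revenue column types are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Company(cid INTEGER PRIMARY KEY, name TEXT, revenue INTEGER);
CREATE TABLE Product(pid INTEGER PRIMARY KEY, name TEXT, price REAL, maker INTEGER);
-- Invented sample rows
INSERT INTO Company VALUES (1, 'Acme', 500000), (2, 'Tiny', 90000), (3, 'Giant', 5000000);
INSERT INTO Product VALUES (10, 'gizmo', 150, 1), (11, 'gadget', 250, 1),
                           (12, 'pin', 2, 2), (13, 'crane', 9000, 3);
""")

# The slide's SQL: companies without an expensive product produce no row.
inner = con.execute("""
    SELECT y.name, count(*)
    FROM Product x, Company y
    WHERE x.price > 100 AND x.maker = y.cid AND y.revenue < 1000000
    GROUP BY y.cid, y.name
    ORDER BY y.name
""").fetchall()

# Outer-join variant: like the XQuery, it keeps such companies with a count of 0.
outer = con.execute("""
    SELECT y.name, count(x.pid)
    FROM Company y LEFT JOIN Product x
      ON x.maker = y.cid AND x.price > 100
    WHERE y.revenue < 1000000
    GROUP BY y.cid, y.name
    ORDER BY y.name
""").fetchall()

print(inner)
print(outer)
```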

Page 11: C20.0046: Database Management Systems Lecture #25

XSLT: XSL Transformations
  Converts XML docs to other XML docs
    Or to HTML, PDF, etc.
  E.g.: Have data in XML, want to display to all users
    Users view web with IE, Firefox, Treo…
    Have XSLT convert to HTML that looks good on each
    XSLT processor takes XML doc and XSL template for view

Page 12: C20.0046: Database Management Systems Lecture #25

XSLT v. XQuery
  FLWR expressions: often much simpler than XSLT
  XSLT v. XQuery: http://www.xmlportfolio.com/xquery.html

  XQuery:
  FOR $b IN document("bib.xml")//book
  WHERE $b/publisher = "Morgan Kaufmann"
    AND $b/year = "1998"
  RETURN $b/title

  XSLT:
  <xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/">
      <xsl:for-each select="document('bib.xml')//book">
        <xsl:if test="publisher='Morgan Kaufmann' and year='1998'">
          <xsl:copy-of select="title"/>
        </xsl:if>
      </xsl:for-each>
    </xsl:template>
  </xsl:transform>

Page 13: C20.0046: Database Management Systems Lecture #25

Displaying XML with XSL/XSLT
  XSL: style sheet language for XML
    XSL : XML :: CSS : HTML

Menu in XML: http://www.w3schools.com/xml/simple.xml

XSL file for displaying it: http://www.w3schools.com/xml/simple.xsl

XSL applied to the XML: http://www.w3schools.com/xml/simplexsl.xml

More info on Java with XSLT and XPath: http://java.sun.com/webservices/docs/ea2/tutorial/doc/JAXPXSLT2.html

Page 14: C20.0046: Database Management Systems Lecture #25

From XML to relations (Oracle)
  To move single values from XML to tables, can simply use extractvalue in UPDATE statements:

  SQL> UPDATE purchase_order
       SET order_nbr = 7101,
           customer_po_nbr = extractvalue(purchase_order_doc, '/purchase_order/po_number'),
           customer_inception_date =
             to_date(extractvalue(purchase_order_doc, '/purchase_order/po_date'), 'yyyy-mm-dd');
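What extractvalue does for each row can be imitated outside the database: pick one node by its path, take its text, and convert it. The following is only an analogy in Python, not Oracle's implementation. The purchase-order document is invented (only the element names come from the paths in the UPDATE statement), and extract_value is a hypothetical helper that returns the first match.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Invented purchase-order document; element names follow the slide's paths.
PO_DOC = """
<purchase_order>
  <po_number>PO-1234</po_number>
  <po_date>2005-04-19</po_date>
</purchase_order>
"""

def extract_value(doc, path):
    """Return the text of the first node matching an absolute path such as
    '/purchase_order/po_number', or None if nothing matches."""
    root = ET.fromstring(doc)
    steps = path.strip("/").split("/")
    if steps[0] != root.tag:
        return None
    rest = "/".join(steps[1:])
    return root.findtext(rest) if rest else root.text

customer_po_nbr = extract_value(PO_DOC, "/purchase_order/po_number")
# Counterpart of to_date(..., 'yyyy-mm-dd')
customer_inception_date = datetime.strptime(
    extract_value(PO_DOC, "/purchase_order/po_date"), "%Y-%m-%d"
).date()
print(customer_po_nbr, customer_inception_date)
```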

Page 15: C20.0046: Database Management Systems Lecture #25

From relations to XML (Oracle)
  Saw how to put XML in a table
  Conversely, can convert ordinary relational data to XML
    XMLElement() generates an XML node
  Now can call the XMLElement function to wrap values in tags:

  SELECT XMLElement("supplier_id", s.supplier_id) ||
         XMLElement("name", s.name) xml_fragment
  FROM supplier s;

  And can build it up recursively:

  SELECT XMLElement("supplier",
           XMLElement("supplier_id", s.supplier_id),
           XMLElement("name", s.name))
  FROM supplier s;
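The recursive build-up can be mimicked in a few lines of Python. This is a sketch of the idea only: xml_element is a hypothetical helper loosely modelled on XMLElement, and the supplier rows are invented.

```python
import xml.etree.ElementTree as ET

def xml_element(tag, *children):
    """Rough analogue of XMLElement(tag, ...): child elements are nested,
    any other value becomes the element's text content."""
    el = ET.Element(tag)
    for child in children:
        if isinstance(child, ET.Element):
            el.append(child)
        else:
            el.text = (el.text or "") + str(child)
    return el

# Invented rows standing in for the supplier table.
suppliers = [(1, "Acme Industries"), (2, "Tilton Enterprises")]

docs = [
    ET.tostring(
        xml_element("supplier",
                    xml_element("supplier_id", sid),
                    xml_element("name", name)),
        encoding="unicode",
    )
    for sid, name in suppliers
]
print("\n".join(docs))
```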

Page 16: C20.0046: Database Management Systems Lecture #25

Why XML matters
  Hugely popular
    To the past few years what Java was to the mid-90s
    Buzzword-compliant
  XML databases won’t likely replace RDBMSs (remember OODBMSs?), but:
    Allows for communication between DBMSs with disparate architectures, tools, languages, etc.
    Basis for Web Services
  DBMS vendors are adding XML support
    MS, Oracle, et al.

Page 17: C20.0046: Database Management Systems Lecture #25

For more info
  APIs: SAX, JAXP
  Editors: XML Spy, MS XML Notepad: http://www.webattack.com/get/xmlnotepad.shtml
  Parsers: Saxon, Xalan, MS XML Parser

  Lectures drew on resources from:
    Nine-week course on XML: http://www.cs.rpi.edu/~puninj/XMLJ/classes.html
    W3Schools XML Tutorial: http://www.w3schools.com/xml/default.asp
    http://www.cs.cornell.edu/courses/cs433/2001fa/Slides/Xml,%20XPath,%20&%20Xslt.ppt

Page 18: C20.0046: Database Management Systems Lecture #25

Recent XML news/etc.
  Group at Sun planning “binary XML”
    http://developers.slashdot.org/article.pl?sid=05/01/14/1650206&tid=156
  XML is “simple and sloppy”
    http://www.adambosworth.net/archives/000031.html
  RDF: Resource Description Framework
    Metadata for the web
    “Semantic web”
    Content, authors, relations to other content
    http://www.w3.org/DesignIssues/RDFnot.html
  Web + XML = the “global mind”
    http://novaspivack.typepad.com/nova_spivacks_weblog/2004/06/minding_the_pla.html

Page 19: C20.0046: Database Management Systems Lecture #25

New topic: Data Warehousing
  Physical warehouse: stores different kinds of items
    combined from different sources in supply chain
    access items as a combined package
    “Synergy”
  DW is the system containing the data from many DBs
  OLAP is the system for easily querying the DW
    Online analytical processing
    front-end to DW & stats

Page 20: C20.0046: Database Management Systems Lecture #25

Integrating Data
  Ad hoc combination of DBs from different sources can be problematic
  Data may be spread across many systems
    geographically
    by division
    different systems from before mergers…

Page 21: C20.0046: Database Management Systems Lecture #25

Conversion/scrubbing/merging
  Lots of issues…
  Different types of data
    Varchar(255) v. char(30)
  Different values for data
    ‘GREEN’/‘GR’/‘2’
  Semantic differences
    Cars v. Automobiles
  Missing values
    Handle with nulls or XML

Page 22: C20.0046: Database Management Systems Lecture #25

Federated DBs
  Situation: n different DBs must work together
  One idea: write programs for each DB to talk to each other one
    How many programs required?
    Like ambassadors for each country

Page 23: C20.0046: Database Management Systems Lecture #25

Federated DBs
  Better idea: introduce another DB
    write programs for it to talk to each other DB
  Now how many programs?
    English in business, French in diplomacy
  Warehousing
    Refreshed nightly
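The two "how many programs?" questions come down to simple counting. The sketch below works them out under one possible convention, which is an assumption rather than something the slides state: a separate program per direction by default, or one program per link when bidirectional=True. The function names are illustrative.

```python
def pairwise_programs(n, bidirectional=False):
    """Programs needed when every DB talks directly to every other DB."""
    ordered_pairs = n * (n - 1)
    return ordered_pairs // 2 if bidirectional else ordered_pairs

def hub_programs(n, bidirectional=False):
    """Programs needed when each DB only talks to one shared hub DB."""
    return n if bidirectional else 2 * n

for n in (3, 10, 50):
    print(n, pairwise_programs(n), hub_programs(n))
```

Either way, the pairwise approach grows quadratically in n while the hub approach grows linearly, which is the argument for a common intermediary such as a warehouse.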

Page 24: C20.0046: Database Management Systems Lecture #25

OLTP v. OLAP
  DWs usually not updated in real-time
    data is usually not live
    but care about higher-level, longer-term patterns
    For “knowledge workers”/decision-makers
  Live data is in system used by OLTP
    online transaction processing
    E.g., airline reservations
    OLTP data loaded into DW periodically, say nightly

Page 25: C20.0046: Database Management Systems Lecture #25

Utilizing Data
  Situation: each time manager has hunch
    requests custom reports
    directs programmers to write/modify SQL app to produce these results
    on higher or lower levels, for different specifics
  Problem: too difficult/expensive/slow
    too great a time lag

Page 26: C20.0046: Database Management Systems Lecture #25

EISs
  Could just write queries at command-prompt
    But decision makers aren’t (all) SQL programmers
  Soln: create an executive information system
    provides friendly front-end to common, important queries
    basically a simple DB front-end
    your project part 5
  GROUP BY queries are particularly applicable…

Page 27: C20.0046: Database Management Systems Lecture #25

EISs v. OLAP
  Okay for fixed set of queries
  But what if queries are open-ended?
  Q: What’s driving sales in the Northeast?
    What’s the source cause?
    Result from one query influences next query tried
  OLAP systems are interactive:
    run query
    analyze results
    think of new query
    repeat

Page 28: C20.0046: Database Management Systems Lecture #25

Star Schemas
  Popular schema for DW data
  One central table surrounded by specific tables
    Center: fact table
    Extremities: dimension tables
  Fields in fact table are foreign keys to dimension tables
  Normalization leads to a Snowflake Schema
    May not be worthwhile…

Page 29: C20.0046: Database Management Systems Lecture #25

Dates and star schemas
  OLAP behaves as though you had a Days table, with every possible row
    Dates(day, week, month, year, DID)
    (5, 27, 7, 2000)
  Can join on Days like any other table
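A table with "every possible row" is easy to generate. The sketch below produces Dates-style tuples for one year. Two details are assumptions, since the slide does not define them: week is taken to be the ISO week number, and DID is encoded as a yyyymmdd integer. Under the ISO assumption, 5 July 2000 falls in week 27, which matches the sample row.

```python
from datetime import date, timedelta

def dates_rows(start, end):
    """Yield (day, week, month, year, DID) for every date from start to end.
    Assumptions: week is the ISO week number, DID is a yyyymmdd integer."""
    d = start
    while d <= end:
        did = d.year * 10000 + d.month * 100 + d.day
        yield (d.day, d.isocalendar()[1], d.month, d.year, did)
        d += timedelta(days=1)

rows = list(dates_rows(date(2000, 1, 1), date(2000, 12, 31)))
july5 = next(r for r in rows if r[4] == 20000705)
print(len(rows), july5)
```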

Page 30: C20.0046: Database Management Systems Lecture #25

Dates and star schemas
  E.g.: products x salesperson x region x date
    Products sold by salespeople in regions on dates
  Regular dim tables:
    Product(PID, name, color)
    Emp(name, SSN, sal)
    Region(name, RID)
  Fact table:
    Sales(PID, DID, SSN, RID)
  Interpret as a cube (cross product of all dimensions)
    Can have both data and stats
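The schema above can be tried out directly. The following is a minimal sketch using sqlite3 as a stand-in DBMS. The column types, the sample rows, and the yyyymmdd encoding of DID are assumptions made for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Product(PID INTEGER PRIMARY KEY, name TEXT, color TEXT);
CREATE TABLE Emp(SSN INTEGER PRIMARY KEY, name TEXT, sal INTEGER);
CREATE TABLE Region(RID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Dates(DID INTEGER PRIMARY KEY, day INTEGER, week INTEGER,
                   month INTEGER, year INTEGER);
-- Fact table: every column is a foreign key into one dimension.
CREATE TABLE Sales(
    PID INTEGER REFERENCES Product(PID),
    DID INTEGER REFERENCES Dates(DID),
    SSN INTEGER REFERENCES Emp(SSN),
    RID INTEGER REFERENCES Region(RID));

-- Invented sample data
INSERT INTO Product VALUES (1, 'widget', 'red'), (2, 'gadget', 'blue');
INSERT INTO Emp VALUES (111, 'Ann', 50000), (222, 'Bob', 45000);
INSERT INTO Region VALUES (1, 'Northeast'), (2, 'Southeast');
INSERT INTO Dates VALUES (20000705, 5, 27, 7, 2000), (20000812, 12, 32, 8, 2000);
INSERT INTO Sales VALUES (1, 20000705, 111, 1), (2, 20000705, 111, 1),
                         (1, 20000812, 222, 1), (2, 20000812, 222, 2);
""")

# Number of sales per region and month: join the fact table to two dimensions.
per_region_month = con.execute("""
    SELECT r.name, d.month, count(*)
    FROM Sales s
    JOIN Region r ON r.RID = s.RID
    JOIN Dates d ON d.DID = s.DID
    GROUP BY r.name, d.month
    ORDER BY r.name, d.month
""").fetchall()
print(per_region_month)
```

Each row of the result is one cell of a two-dimensional slice (region x month) of the sales cube.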

Page 31: C20.0046: Database Management Systems Lecture #25

Drill-down & roll-up
  Imagine: notice some region’s sales way up
    Why? Good salesperson? Some popular product there?
  Maybe need to search by month, or month and product, abstract back up to just product…
  “slicing & dicing”

Page 32: C20.0046: Database Management Systems Lecture #25

OLAP and data warehousing
  Could write GROUP BY queries for each
  OLAP systems provide simpler, non-SQL interface for this sort of thing
    Vendors: MicroStrategy, SAP, etc.
  OTOH: DW-style operators have been added to SQL and some DBMSs…

Page 33: C20.0046: Database Management Systems Lecture #25

DW extensions in SQL: ROLLUP (Oracle)
  Suppose have orders table (from two years), with region and date info:

  SQL> column month format a10
  SQL> @mosql2_data
  SQL> describe all_orders;

  Can select total sales:

  SELECT sum(o.tot_sales)
  FROM all_orders o JOIN region r
    ON r.region_id = o.region_id;

  Examples derived from Mastering Oracle SQL, 2e (O’Reilly)
  Get data here: http://examples.oreilly.com/mastorasql2/mosql2_data.sql

Page 34: C20.0046: Database Management Systems Lecture #25

DW extensions in SQL: ROLLUP (Oracle)
  Can write GROUP BY queries for year or region or both:

  SELECT r.name region, o.year, sum(o.tot_sales)
  FROM all_orders o JOIN region r
    ON r.region_id = o.region_id
  GROUP BY (r.name, o.year);

Page 35: C20.0046: Database Management Systems Lecture #25

DW extensions in SQL: ROLLUP (Oracle)
  ROLLUP operator
    Extension of GROUP BY
    Does GROUP BY on several levels, simultaneously
    Order matters
  Get sales totals for each region/year pair, each region, and the grand total:

  SELECT r.name region, o.year, sum(o.tot_sales)
  FROM all_orders o JOIN region r
    ON r.region_id = o.region_id
  GROUP BY ROLLUP (r.name, o.year);
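To see what ROLLUP (r.name, o.year) computes, it helps to spell it out as a union of ordinary GROUP BY queries, one per level. The sketch below does that in sqlite3, which is used only as a stand-in and has no ROLLUP operator of its own. The rows are invented; the real data comes from mosql2_data.sql. A NULL (None in Python) marks a column that has been rolled up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE region(region_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE all_orders(region_id INTEGER, year INTEGER, month INTEGER, tot_sales INTEGER);
-- Invented rows; the real data comes from mosql2_data.sql
INSERT INTO region VALUES (5, 'New England'), (6, 'Mid-Atlantic');
INSERT INTO all_orders VALUES (5, 2000, 1, 100), (5, 2001, 1, 150),
                              (6, 2000, 1, 200), (6, 2001, 2, 250);
""")

# ROLLUP (r.name, o.year) = GROUP BY (name, year) + GROUP BY (name) + grand total.
rollup = con.execute("""
    SELECT r.name AS region, o.year AS year, sum(o.tot_sales) AS total
    FROM all_orders o JOIN region r ON r.region_id = o.region_id
    GROUP BY r.name, o.year
  UNION ALL
    SELECT r.name, NULL, sum(o.tot_sales)
    FROM all_orders o JOIN region r ON r.region_id = o.region_id
    GROUP BY r.name
  UNION ALL
    SELECT NULL, NULL, sum(o.tot_sales)
    FROM all_orders o
""").fetchall()

for row in rollup:
    print(row)
```

Reversing the field order would swap the middle level from per-region totals to per-year totals, which is why order matters.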

Page 36: C20.0046: Database Management Systems Lecture #25

DW extensions in SQL: ROLLUP (Oracle)
  Change the order of the group fields to get a different sequence of groups
  To get totals for each year/region pair, each year, and the grand total, just reverse the group-by order:

  SELECT o.year, r.name region, sum(o.tot_sales)
  FROM all_orders o JOIN region r
    ON r.region_id = o.region_id
  GROUP BY ROLLUP (o.year, r.name);

Page 37: C20.0046: Database Management Systems Lecture #25

DW extensions in SQL: ROLLUP (Oracle)
  Adding more dimensions, like month, is easy (apart from formatting):

  SELECT o.year, to_char(to_date(o.month, 'MM'), 'Month') month,
         r.name region, sum(o.tot_sales)
  FROM all_orders o JOIN region r
    ON r.region_id = o.region_id
  GROUP BY ROLLUP (o.year, o.month, r.name);

  NB: summing happens on each level

Page 38: C20.0046: Database Management Systems Lecture #25

DW extensions in SQL: ROLLUP (Oracle)
  If desired, can combine fields for the sake of grouping:

  SELECT o.year, to_char(to_date(o.month, 'MM'), 'Month') month,
         r.name region, sum(o.tot_sales)
  FROM all_orders o JOIN region r
    ON r.region_id = o.region_id
  GROUP BY ROLLUP ((o.year, o.month), r.name);

Page 39: C20.0046: Database Management Systems Lecture #25

DW extensions in SQL: CUBE (Oracle)
  Another GROUP BY extension: CUBE
    Subtotals all possible combinations of group-by fields (powerset)
    Syntax: replace “ROLLUP” with “CUBE”
    Order of fields doesn’t matter (apart from ordering)
  To get subtotals for each region/month pair, each region, each month, and the grand total:

  SELECT to_char(to_date(o.month, 'MM'), 'Month') month,
         r.name region, sum(o.tot_sales)
  FROM all_orders o JOIN region r
    ON r.region_id = o.region_id
  GROUP BY CUBE (o.month, r.name);
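The "powerset" description can be made concrete without a database. The following plain-Python sketch shows what CUBE (month, region) computes: a subtotal for every subset of the grouping fields. The rows are invented and the cube function is illustrative, not Oracle's algorithm.

```python
from collections import defaultdict
from itertools import combinations

# Invented (month, region, tot_sales) rows.
orders = [
    (1, "New England", 100),
    (1, "Mid-Atlantic", 200),
    (2, "New England", 150),
    (2, "Mid-Atlantic", 250),
]
FIELDS = ("month", "region")

def cube(rows, fields):
    """Subtotal rows for every subset of the grouping fields (the powerset).
    A field left out of a subset is reported as None, like NULL in Oracle's output."""
    totals = defaultdict(int)
    positions = range(len(fields))
    for k in range(len(fields), -1, -1):
        for kept in combinations(positions, k):
            for *dims, amount in rows:
                key = tuple(dims[i] if i in kept else None for i in positions)
                totals[key] += amount
    return dict(totals)

result = cube(orders, FIELDS)
for key, total in result.items():
    print(key, total)
```

With two fields there are four subsets: month and region, month only, region only, and neither (the grand total).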

Page 40: C20.0046: Database Management Systems Lecture #25

DW extensions in SQL: CUBE (Oracle)
  Again, can easily add more dimensions:

  SELECT o.year, to_char(to_date(o.month, 'MM'), 'Month') month,
         r.name region, sum(o.tot_sales)
  FROM all_orders o JOIN region r
    ON r.region_id = o.region_id
  GROUP BY CUBE (o.year, o.month, r.name);

Page 41: C20.0046: Database Management Systems Lecture #25

DW SQL exts: GROUPING SETS (Oracle)
  That’s a lot of rows
  Instead of a cube of all combinations, maybe we just want the totals for each individual field:

  SELECT o.year, to_char(to_date(o.month, 'MM'), 'Month') month,
         r.name region, sum(o.tot_sales)
  FROM all_orders o JOIN region r
    ON r.region_id = o.region_id
  GROUP BY GROUPING SETS (o.year, o.month, r.name);
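To put a number on "a lot of rows", the sketch below estimates how many result rows ROLLUP, CUBE, and GROUPING SETS each produce for the three fields. The distinct-value counts (2 years, 12 months, 3 regions) are assumptions, and the figures are upper bounds that hold only if every combination of values actually occurs in the data.

```python
from itertools import combinations
from math import prod

# Assumed distinct-value counts; the real numbers depend on the data set.
CARD = {"year": 2, "month": 12, "region": 3}

def rows_for(grouping_sets):
    """Upper bound on result rows: one row per value combination in each grouping set."""
    return sum(prod(CARD[f] for f in gs) for gs in grouping_sets)

fields = list(CARD)
# ROLLUP: every prefix of the field list, down to the empty set (grand total).
rollup_sets = [fields[:k] for k in range(len(fields), -1, -1)]
# CUBE: every subset of the fields.
cube_sets = [c for k in range(len(fields) + 1) for c in combinations(fields, k)]
# GROUPING SETS (year, month, region): each field on its own.
single_sets = [[f] for f in fields]

print(rows_for(rollup_sets), rows_for(cube_sets), rows_for(single_sets))
```

Under these assumptions the bounds are 99 rows for ROLLUP, 156 for CUBE, and 17 for the three single-field grouping sets.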

Page 42: C20.0046: Database Management Systems Lecture #25

Next time
  Overview of data mining
  Some other odds & ends…