C20.0046: Database Management Systems Lecture #25
description
Transcript of C20.0046: Database Management Systems Lecture #25
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
1
C20.0046: Database Management SystemsLecture #25
M.P. Johnson
Stern School of Business, NYU
Spring, 2005
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
2
Agenda Querying XML
Data Warehousing
Next week: Data Mining Websearch Etc.
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
3
Goals after today:1. Be aware of some of the important XML
standards
2. Know how to write some DW queries in Oracle
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
4
XML: Semi-structured data Not too random
Data organized into entities Similar/related grouped to form other entities
Not too structured Some attributes may be missing Size of attributes may vary
Support of lists/sets
Juuust Right Data is self-describing
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
5
<movieinfo>
<movie id="o111">
<title>Lost in Translation</title>
<year>2003</year>
<stars idref="o333 o444"/>
</movie>
<movie id="o222">
<title>Hamlet</title>
<year>1999</year>
<stars idref="o333"/>
</movie>
<person id="o111">
<name>Bill Murray</name>
<movies idref="o111 o222"/>
</person>
</movieinfo>
<movieinfo>
<movie id="o111">
<title>Lost in Translation</title>
<year>2003</year>
<stars idref="o333 o444"/>
</movie>
<movie id="o222">
<title>Hamlet</title>
<year>1999</year>
<stars idref="o333"/>
</movie>
<person id="o111">
<name>Bill Murray</name>
<movies idref="o111 o222"/>
</person>
</movieinfo>
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
6
New topic: Querying XML XPath
Simple protocol for accessing node Will use in XQuery and conversion from relations
XQuery SQL : relations :: XQuery : XML
XSLT sophisticated transformations Sometimes for presentation
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
7
XQuery Queries are FLWR expressions
Based on Quilt and XML-QL
FOR $b IN document("bib.xml")//book WHERE $b/publisher = "Morgan Kaufmann" AND $b/year = "1998" RETURN $b/title
FOR $b IN document("bib.xml")//book WHERE $b/publisher = "Morgan Kaufmann" AND $b/year = "1998" RETURN $b/title
FOR/LET...WHERE...RETURN...
FOR/LET...WHERE...RETURN...
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
8
XQuery Find all book titles published after 1995:
FOR $x IN document("bib.xml")/bib/bookWHERE $x/year > 1995RETURN { $x/title }
FOR $x IN document("bib.xml")/bib/bookWHERE $x/year > 1995RETURN { $x/title }
Result: <title>abc</title> <title>def</title> <title>ghi</title>
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
9
SQL v. XQuery
Product(pid, name, maker)Company(cid, name, city)
Find all products made in NYC
SELECT x.nameFROM Product x, Company yWHERE x.maker=y.cid and y.city="NYC"
SELECT x.nameFROM Product x, Company yWHERE x.maker=y.cid and y.city="NYC"
FOR $r in document("db.xml")/db, $x in $r/Product/row, $y in $r/Company/rowWHERE $x/maker/text()=$y/cid/text() and $y/city/text() = "NYC"RETURN { $x/name }
FOR $r in document("db.xml")/db, $x in $r/Product/row, $y in $r/Company/rowWHERE $x/maker/text()=$y/cid/text() and $y/city/text() = "NYC"RETURN { $x/name }
SQL
XQuery
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
10
SQL v. XQueryFor each company with revenues < 1M count the products over $100
SELECT y.name, count(*)FROM Product x, Company yWHERE x.price > 100 and x.maker=y.cid and y.revenue < 1000000GROUP BY y.cid, y.name
SELECT y.name, count(*)FROM Product x, Company yWHERE x.price > 100 and x.maker=y.cid and y.revenue < 1000000GROUP BY y.cid, y.name
FOR $r in document("db.xml")/db, $y in $r/Company/row[revenue/text()<1000000]RETURN <Company> <companyName>{ $y/name/text() }</companyName> <numberOfExpensiveProducts> { count($r/Product/row[maker/text()=$y/cid/text()][price/text()>100]) } </numberOfExpensiveProducts> </Company>
FOR $r in document("db.xml")/db, $y in $r/Company/row[revenue/text()<1000000]RETURN <Company> <companyName>{ $y/name/text() }</companyName> <numberOfExpensiveProducts> { count($r/Product/row[maker/text()=$y/cid/text()][price/text()>100]) } </numberOfExpensiveProducts> </Company>
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
11
XSLT: XSL Transformations Converts XML docs to other XML docs
Or to HTML, PDF, etc.
E.g.: Have data in XML, want to display to all users Users view web with IE, Firefox, Treo… Have XSLT convert to HTML that looks good on each XSLT processor takes XML doc and XSL template for view
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
12
XSLT v. XQuery FLWR expressions:
Often much simpler than XSLT
XSLT v. XQuery: http://www.xmlportfolio.com/xquery.html
FOR $b IN document("bib.xml")//bookWHERE $b/publisher = "Morgan Kaufmann" AND $b/year = "1998“RETURN $b/title
FOR $b IN document("bib.xml")//bookWHERE $b/publisher = "Morgan Kaufmann" AND $b/year = "1998“RETURN $b/title
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:template match="/"> <xsl:for-each select="document('bib.xml')//book"> <xsl:if test="publisher='Morgan Kaufmann' and year='1998'"> <xsl:copy-of select="title"/> </xsl:if> </xsl:for-each> </xsl:template></xsl:transform>
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:template match="/"> <xsl:for-each select="document('bib.xml')//book"> <xsl:if test="publisher='Morgan Kaufmann' and year='1998'"> <xsl:copy-of select="title"/> </xsl:if> </xsl:for-each> </xsl:template></xsl:transform>
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
13
Displaying XML with XSL/XSLT XSL: style sheet language for XML
XSL : XML :: CSS : HTML
Menu in XML: http://www.w3schools.com/xml/simple.xml
XSL file for displaying it: http://www.w3schools.com/xml/simple.xsl
XSL applied to the XML: http://www.w3schools.com/xml/simplexsl.xml
More info on Java with XSLT and XPath: http://java.sun.com/webservices/docs/ea2/tutorial/doc/JAXPXSLT2.html
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
14
From XML to relations (Oracle) To move single values from XML to tables, can
simply use extractvalue in UPDATE statements:
SQL> UPDATE purchase_orderSET order_nbr = 7101,customer_po_nbr = extractvalue(purchase_order_doc, '/purchase_order/po_number'),customer_inception_date =
to_date(extractvalue(purchase_order_doc,'/purchase_order/po_date'), 'yyyy-mm-dd');
SQL> UPDATE purchase_orderSET order_nbr = 7101,customer_po_nbr = extractvalue(purchase_order_doc, '/purchase_order/po_number'),customer_inception_date =
to_date(extractvalue(purchase_order_doc,'/purchase_order/po_date'), 'yyyy-mm-dd');
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
15
From relations to XML (Oracle) Saw how to put XML in a table Conversely, can convert ordinary rel data to XML
XMLElement() generates an XML node Now can call XMLElement ftn to wrap vals in tags:
And can build it up recursively:
SELECT XMLElement("supplier_id", s.supplier_id) ||XMLElement("name", s.name) xml_fragment
FROM supplier s;
SELECT XMLElement("supplier_id", s.supplier_id) ||XMLElement("name", s.name) xml_fragment
FROM supplier s;
SELECT XMLElement("supplier",XMLElement("supplier_id", s.supplier_id), XMLElement("name", s.name))
FROM supplier s;
SELECT XMLElement("supplier",XMLElement("supplier_id", s.supplier_id), XMLElement("name", s.name))
FROM supplier s;
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
16
Why XML matters Hugely popular
To past few years what Java was to mid-90s Buzzword-compliant
XML databases won’t likely replace RDBMSs (remember OODBMSs?), but:
Allows for comm. between DBMSs disparate architectures, tools, languages, etc. Basis for Web Services
DBMS vendors are adding XML support MS, Oracle, et al.
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
17
For more info APIs: SAX, JAXP Editors: XML Spy, MS XML Notepad:
http://www.webattack.com/get/xmlnotepad.shtml Parsers: Saxon, Xalan, MS XML Parser
Lectures drew on resources from: Nine-week course on XML:
http://www.cs.rpi.edu/~puninj/XMLJ/classes.html W3C XML Tutorial:
http://www.w3schools.com/xml/default.asp http://www.cs.cornell.edu/courses/cs433/2001fa/Slides/Xml,%20XPath,%20&%20Xslt.ppt
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
18
Recent XML news/etc. Group at Sun planning “binary XML”
http://developers.slashdot.org/article.pl?sid=05/01/14/1650206&tid=156
XML is “simple and sloppy” http://www.adambosworth.net/archives/000031.html
RDF: Resource Definition Framework Metadata for the web “Semantic web” Content, authors, relations to other content http://www.w3.org/DesignIssues/RDFnot.html
Web + XML = the “global mind” http://novaspivack.typepad.com/nova_spivacks_weblog/2004/06/minding_the_pla.html
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
19
New topic: Data Warehousing Physical warehouse: stores different kinds of items
combined from different sources in supply chain access items as a combined package “Synergy”
DW is the sys containing the data from many DBs OLAP is the system for easily querying the DW
Online analytical processing front-end to DW & stats
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
20
Integrating Data Ad hoc combination of DBs from different sources
can be problematic
Data may be spread across many systems geographically by division different systems from before mergers…
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
21
Conversion/scrubbing/merging Lots of issues…
different types of data Varchar(255) v. char(30)
Different values for data ‘GREEN’/’GR/’2
Semantic differences Cars v. Automobiles
Missing values Handle with nulls or XML
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
22
Federated DBs Situ: n different DBs must work together
One idea: write programs for each to talk to each other one How many programs required? Like ambassadors for each country
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
23
Federated DBs Better idea: introduce another DB
write programs for it to talk to each other DB
Now how many programs? English in business, French in diplomacy
Warehousing Refreshed nightly
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
24
OLTP v. OLAP DWs usually not updated in real-time
data is usually not live but care about higher-level, longer-term patterns For “knowledge workers”/decision-makers
Live data is in system used by OLTP online transaction processing E.g., airline reservations OLTP data loaded into DW periodically, say nightly
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
25
Utilizing Data Situ: each time manager has hunch
requests custom reports direct programmers to write/modify SQL app to produce
these results on higher or lower levels, for different specifics
Problem: too difficult/expensive/slow too great a time lag
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
26
EISs Could just write queries at command-prompt
But decision makes aren’t (all) SQL programmers
Soln: create an executive information system provides friendly front-end to common, important queries basically a simple DB front-end your project part 5
GROUP BY queries are particularly applicable…
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
27
EISs v. OLAP Okay for fixed set of queries But what if queries are open-ended?
Q: What’s driving sales in the Northeast? What’s the source cause? Result from one query influences next query tried
OLAP systems are interactive: run query analyze results think of new query repeat
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
28
Star Schemas Popular schema for DW data
One central DB surrounded by specific DBs
Center: fact table
Extremities: data tables
Fields in fact table are foreign keys to data tables
Normalization Snowflake Schema May not be worthwhile…
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
29
Dates and star schemas OLAP behaves as though you had a Days table,
with every possible row Dates(day, week, month, year, DID) (5, 27, 7, 2000)
Can join on Days like any other table
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
30
Dates and star schemas E.g.: products x salesperson x region x date
Products sold by salespeople in regions on dates
Regular dim tables: Product(PID, name, color) Emp(name, SSN, sal) Region(name, RID)
Fact table: Sales(PID, DID, SSN, RID) Interpret as a cube (cross product of all dimensions)
Can have both data and stats
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
31
Drill-down & roll-up Imagine: notice some region’s sales way up Why? Good salesperson? Some popular product
there?
Maybe need to search by month, or month and product, abstract back up to just product…
“slicing & dicing”
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
32
OLAP and data warehousing Could write GROUP BY queries for each
OLAP systems provide simpler, non-SQL interface for this sort of thing
Vendors: MicroStrategy, SAP, etc.
Otoh: DW-style operators have been added to SQL and some DBMSs…
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
33
DW extensions in SQL: ROLLUP (Oracle) Suppose have orders table (from two years), with
region and date info:
Can select total sales:
Examples derived/from Mastering Oracle SQL, 2e (O’Reilly) Get data here: http://examples.oreilly.com/mastorasql2/mosql2_data.sql
SELECT sum(o.tot_sales)FROM all_orders o join region rON r.region_id = o.region_id;
SELECT sum(o.tot_sales)FROM all_orders o join region rON r.region_id = o.region_id;
SQL> column month format a10SQL> @mosql2_dataSQL> describe all_orders;
SQL> column month format a10SQL> @mosql2_dataSQL> describe all_orders;
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
34
Can write GROUP BY queries for year or region or both:
SELECT r.name region, o.year, sum(o.tot_sales)FROM all_orders o join region rON r.region_id = o.region_idGROUP BY (r.name, o.year);
SELECT r.name region, o.year, sum(o.tot_sales)FROM all_orders o join region rON r.region_id = o.region_idGROUP BY (r.name, o.year);
DW extensions in SQL: ROLLUP (Oracle)
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
35
ROLLUP operator Extension of GROUP BY Does GROUP BY on several levels, simultaneously Order matters
Get sales totals for each region/year pair each region, and the grand total:
SELECT r.name region, o.year, sum(o.tot_sales)FROM all_orders o join region rON r.region_id = o.region_idGROUP BY ROLLUP (r.name, o.year);
SELECT r.name region, o.year, sum(o.tot_sales)FROM all_orders o join region rON r.region_id = o.region_idGROUP BY ROLLUP (r.name, o.year);
DW extensions in SQL: ROLLUP (Oracle)
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
36
Change the order of the group fields to get a different sequence of groups
To get totals for each year/region pair, each year, and the grand total, and just reverse group-by order:
SELECT o.year, r.name region, sum(o.tot_sales)FROM all_orders o join region rON r.region_id = o.region_idGROUP BY ROLLUP (o.year, r.name);
SELECT o.year, r.name region, sum(o.tot_sales)FROM all_orders o join region rON r.region_id = o.region_idGROUP BY ROLLUP (o.year, r.name);
DW extensions in SQL: ROLLUP (Oracle)
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
37
Adding more dimensions, like month, is easy (apart from formatting):
NB: summing happens on each level
SELECT o.year, to_char(to_date(o.month, 'MM'),'Month') month, r.name region, sum(o.tot_sales)FROM all_orders o join region rON r.region_id = o.region_idGROUP BY ROLLUP (o.year, o.month, r.name);
SELECT o.year, to_char(to_date(o.month, 'MM'),'Month') month, r.name region, sum(o.tot_sales)FROM all_orders o join region rON r.region_id = o.region_idGROUP BY ROLLUP (o.year, o.month, r.name);
DW extensions in SQL: ROLLUP (Oracle)
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
38
If desired, can combine fields for the sake of grouping:
DW extensions in SQL: ROLLUP (Oracle)
SELECT o.year, to_char(to_date(o.month, 'MM'),'Month') month, r.name region, sum(o.tot_sales)FROM all_orders o join region rON r.region_id = o.region_idGROUP BY ROLLUP ((o.year, o.month), r.name);
SELECT o.year, to_char(to_date(o.month, 'MM'),'Month') month, r.name region, sum(o.tot_sales)FROM all_orders o join region rON r.region_id = o.region_idGROUP BY ROLLUP ((o.year, o.month), r.name);
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
39
DW extensions in SQL: CUBE (Oracle) Another GROUP BY extension: CUBE
Subtotals all possible combins of group-by fields (powerset) Syntax: “ROLLUP” “CUBE” Order of fields doesn’t matter (apart from ordering)
To get subtotals for each region/month pair, each region, each month, and the grand total:
SELECT to_char(to_date(o.month, 'MM'),'Month') month, r.name region, sum(o.tot_sales)FROM all_orders o join region rON r.region_id = o.region_idGROUP BY CUBE (o.month, r.name);
SELECT to_char(to_date(o.month, 'MM'),'Month') month, r.name region, sum(o.tot_sales)FROM all_orders o join region rON r.region_id = o.region_idGROUP BY CUBE (o.month, r.name);
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
40
DW extensions in SQL: CUBE (Oracle) Again, can easily add more dimensions:
SELECT o.year, to_char(to_date(o.month, 'MM'),'Month') month, r.name region, sum(o.tot_sales)FROM all_orders o join region rON r.region_id = o.region_idGROUP BY CUBE (o.year, o.month, r.name);
SELECT o.year, to_char(to_date(o.month, 'MM'),'Month') month, r.name region, sum(o.tot_sales)FROM all_orders o join region rON r.region_id = o.region_idGROUP BY CUBE (o.year, o.month, r.name);
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
41
DW SQL exts: GROUPING SETS (Oracle) That’s a lot of rows Instead of a cube of all combinations, maybe we just
want the totals for each individual field:
SELECT o.year, to_char(to_date(o.month, 'MM'),'Month') month, r.name region, sum(o.tot_sales)FROM all_orders o join region rON r.region_id = o.region_idGROUP BY GROUPING SETS (o.year, o.month, r.name);
SELECT o.year, to_char(to_date(o.month, 'MM'),'Month') month, r.name region, sum(o.tot_sales)FROM all_orders o join region rON r.region_id = o.region_idGROUP BY GROUPING SETS (o.year, o.month, r.name);
M.P. Johnson, DBMS, Stern/NYU, Spring 2005
42
Next time Overview of data mining
Some other odds & ends…