New Generation Database Systems: XML Databases and Grid-based Digital Libraries

IS 257 – Fall 2006 2006.11.28- SLIDE 1 New Generation Database Systems: XML Databases and Grid-based Digital Libraries University of California, Berkeley School of Information IS 257: Database Management


Transcript of New Generation Database Systems: XML Databases and Grid-based Digital Libraries

Page 1: New Generation Database Systems: XML Databases and Grid-based Digital Libraries

IS 257 – Fall 2006 2006.11.28- SLIDE 1

New Generation Database Systems: XML Databases and Grid-based Digital Libraries
University of California, Berkeley
School of Information
IS 257: Database Management


IS 257 – Fall 2006 2006.11.28- SLIDE 2

Lecture Outline
• XML and DBMS
• The Grid and DBMS
  – The Grid
  – Data Grids
  – Grid-based DBMS



IS 257 – Fall 2006 2006.11.28- SLIDE 4

Standards: XML/SQL
• As part of SQL3, an extension that provides a mapping between XML and the DBMS is being created, called XML/SQL (published by ISO as SQL/XML, Part 14 of the SQL standard)
• The (draft) standard is very complex, but the ideas are actually pretty simple
• Suppose we have a table called EMPLOYEE that has columns EMPNO, FIRSTNAME, LASTNAME, BIRTHDATE, SALARY


IS 257 – Fall 2006 2006.11.28- SLIDE 5

Standards: XML/SQL
• That table can be mapped to:

<EMPLOYEE>
  <row>
    <EMPNO>000020</EMPNO>
    <FIRSTNAME>John</FIRSTNAME>
    <LASTNAME>Smith</LASTNAME>
    <BIRTHDATE>1955-08-21</BIRTHDATE>
    <SALARY>52300.00</SALARY>
  </row>
  <row> … etc. …
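The table-to-XML mapping above can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure defined by the standard: the function name table_to_xml is hypothetical, values are treated as plain strings, and NULL handling is ignored.

```python
import xml.etree.ElementTree as ET


def table_to_xml(table_name, columns, rows):
    """Build <TABLE><row><COLUMN>value</COLUMN>...</row>...</TABLE>."""
    root = ET.Element(table_name)
    for values in rows:
        row = ET.SubElement(root, "row")
        # One child element per column, named after the column.
        for column, value in zip(columns, values):
            ET.SubElement(row, column).text = str(value)
    return root


# Column names and the sample row are taken from the slides.
columns = ["EMPNO", "FIRSTNAME", "LASTNAME", "BIRTHDATE", "SALARY"]
rows = [("000020", "John", "Smith", "1955-08-21", "52300.00")]

employee_xml = ET.tostring(table_to_xml("EMPLOYEE", columns, rows), encoding="unicode")
print(employee_xml)
```

In practice the rows would come from a database cursor rather than a literal list; the nesting of table, row and column elements is the part that matters.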


IS 257 – Fall 2006 2006.11.28- SLIDE 6

Standards: XML/SQL
• In addition the standard says that XML Schemas must be generated for each table, and also allows relations to be managed by nesting records from tables in the XML.
• Variants of this are incorporated into the latest versions of ORACLE
• (Slides from Oracle Web Site on ORACLE XML)
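As a rough illustration of the per-table schema requirement, the sketch below generates a minimal XML Schema for the EMPLOYEE table. The SQL column types and the SQL-to-XSD type mapping are assumptions made for this example (the slides give only column names), and the schema the standard actually prescribes is considerably more detailed.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)

# Assumed SQL types; the slides name the columns but not their types.
EMPLOYEE_COLUMNS = [
    ("EMPNO", "CHAR"),
    ("FIRSTNAME", "VARCHAR"),
    ("LASTNAME", "VARCHAR"),
    ("BIRTHDATE", "DATE"),
    ("SALARY", "DECIMAL"),
]

# Simplified type mapping chosen for illustration only.
SQL_TO_XSD = {
    "CHAR": "xs:string",
    "VARCHAR": "xs:string",
    "DATE": "xs:date",
    "DECIMAL": "xs:decimal",
}


def table_schema(table_name, columns):
    """Describe <TABLE> as a sequence of <row> elements with one child per column."""
    schema = ET.Element(f"{{{XS}}}schema")
    table = ET.SubElement(schema, f"{{{XS}}}element", name=table_name)
    table_type = ET.SubElement(table, f"{{{XS}}}complexType")
    table_seq = ET.SubElement(table_type, f"{{{XS}}}sequence")
    row = ET.SubElement(table_seq, f"{{{XS}}}element", name="row",
                        minOccurs="0", maxOccurs="unbounded")
    row_type = ET.SubElement(row, f"{{{XS}}}complexType")
    row_seq = ET.SubElement(row_type, f"{{{XS}}}sequence")
    for column, sql_type in columns:
        ET.SubElement(row_seq, f"{{{XS}}}element", name=column,
                      type=SQL_TO_XSD[sql_type])
    return schema


schema_xml = ET.tostring(table_schema("EMPLOYEE", EMPLOYEE_COLUMNS), encoding="unicode")
print(schema_xml)
```

Nested records (for example, employees nested inside a department element) would follow the same pattern, with the inner table's row element declared inside the outer row's sequence.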


IS 257 – Fall 2006 2006.11.28- SLIDE 7

Lecture Outline
• XML and DBMS
• The Grid and DBMS
  – The Grid
  – Data Grids
  – Grid-based DBMS


IS 257 – Fall 2006 2006.11.28- SLIDE 8

Grid-based Digital Libraries
• So what’s this Grid thing anyhow?
• Data Grids and Distributed Storage
• Grid-Based IR
• Grid-Based Digital Libraries

This lecture borrows heavily from presentations by Ian Foster (Argonne National Laboratory & University of Chicago), Reagan Moore and others from San Diego Supercomputer Center


IS 257 – Fall 2006 2006.11.28- SLIDE 9

The Grid: On-Demand Access to Electricity

[Figure: chart with axes labeled “Time” and “Quality, economies of scale”]

Source: Ian Foster


IS 257 – Fall 2006 2006.11.28- SLIDE 10

By Analogy, A Computing Grid

• Decouples production and consumption
  – Enable on-demand access
  – Achieve economies of scale
  – Enhance consumer flexibility
  – Enable new devices
• On a variety of scales
  – Department
  – Campus
  – Enterprise
  – Internet

Source: Ian Foster


IS 257 – Fall 2006 2006.11.28- SLIDE 11

What is the Grid?
“The short answer is that, whereas the Web is a service for sharing information over the Internet, the Grid is a service for sharing computer power and data storage capacity over the Internet. The Grid goes well beyond simple communication between computers, and aims ultimately to turn the global network of computers into one vast computational resource.”

Source: The Global Grid Forum


IS 257 – Fall 2006 2006.11.28- SLIDE 12

Not Exactly a New Idea …
• “The time-sharing computer system can unite a group of investigators …. one can conceive of such a facility as an … intellectual public utility.” – Fernando Corbato and Robert Fano, 1966
• “We will perhaps see the spread of ‘computer utilities’, which, like present electric and telephone utilities, will service individual homes and offices across the country.” – Len Kleinrock, 1967

Source: Ian Foster


IS 257 – Fall 2006 2006.11.28- SLIDE 13

But, Things are Different Now

• Networks are far faster (and cheaper)
  – Faster than computer backplanes
• “Computing” is very different than pre-Net
  – Our “computers” have already disintegrated
  – E-commerce increases size of demand peaks
  – Entirely new applications & social structures
• We’ve learned a few things about software

Source: Ian Foster


IS 257 – Fall 2006 2006.11.28- SLIDE 14

Computing isn’t Really Like Electricity

• I import electricity but must export data
• “Computing” is not interchangeable but highly heterogeneous: data, sensors, services, …
• This complicates things; but also means that the sum can be greater than the parts
  – Real opportunity: Construct new capabilities dynamically from distributed services
• Raises three fundamental questions
  – Can I really achieve economies of scale?
  – Can I achieve QoS across distributed services?
  – Can I identify apps that exploit synergies?

Source: Ian Foster


IS 257 – Fall 2006 2006.11.28- SLIDE 15

Why the Grid? (1) Revolution in Science
• Pre-Internet
  – Theorize &/or experiment, alone or in small teams; publish paper
• Post-Internet
  – Construct and mine large databases of observational or simulation data
  – Develop simulations & analyses
  – Access specialized devices remotely
  – Exchange information within distributed multidisciplinary teams
Source: Ian Foster


IS 257 – Fall 2006 2006.11.28- SLIDE 16

Why the Grid? (2) Revolution in Business
• Pre-Internet
  – Central data processing facility
• Post-Internet
  – Enterprise computing is highly distributed, heterogeneous, inter-enterprise (B2B)
  – Business processes increasingly computing- & data-rich
  – Outsourcing becomes feasible => service providers of various sorts

Source: Ian Foster


IS 257 – Fall 2006 2006.11.28- SLIDE 17

The Information Grid
Imagine a web of data
• Machine Readable
  – Search, Aggregate, Transform, Report On, Mine Data – using more computers, and fewer humans
• Scalable
  – Machines are cheap – can buy 50 machines with 100 GB of memory and 100 TB disk for under $100K, and dropping
  – Network is now faster than disk
• Flexible
  – Move data around without breaking the apps
Source: S. Banerjee, O. Alonso, M. Drake - ORACLE


IS 257 – Fall 2006 2006.11.28- SLIDE 18

The Foundations are Being Laid
[Figure: network maps of grid sites. Legend: Tier0/1 facility, Tier2 facility, Tier3 facility; 10 Gbps link, 2.5 Gbps link, 622 Mbps link, other link. Site labels: Cambridge, Newcastle, Edinburgh, Oxford, Glasgow, Manchester, Cardiff, Soton, London, Belfast, DL, RAL, Hinxton]


IS 257 – Fall 2006 2006.11.28- SLIDE 19

Data Grid Problem
• “Enable a geographically distributed community [of thousands] to pool their resources in order to perform sophisticated, computationally intensive analyses on Petabytes of data”
• Note that this problem:
  – Is common to many areas of science
  – Overlaps strongly with other Grid problems


IS 257 – Fall 2006 2006.11.28- SLIDE 20

Data Grids for High Energy Physics
[Figure: tiered data grid for high energy physics, with tiers labeled Tier 0, Tier 1, Tier 2 and Tier 4]
Facilities shown: Online System; Offline Processor Farm (~20 TIPS); CERN Computer Centre; FermiLab (~4 TIPS); France, Italy and Germany Regional Centres; Caltech (~1 TIPS); Tier2 Centres (~1 TIPS each); Institutes (~0.25 TIPS); physics data cache; physicist workstations
Link rates shown: ~PBytes/sec; ~100 MBytes/sec; ~622 Mbits/sec or Air Freight (deprecated); ~622 Mbits/sec; ~1 MBytes/sec
• There is a “bunch crossing” every 25 nsecs
• There are 100 “triggers” per second
• Each triggered event is ~1 MByte in size
• Physicists work on analysis “channels”
• Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server
• 1 TIPS is approximately 25,000 SpecInt95 equivalents

Image courtesy Harvey Newman, Caltech
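The rates quoted on this slide can be checked with some quick arithmetic. The sketch below uses only the slide's figures (25 ns bunch crossings, 100 triggers per second, ~1 MByte per event); the assumption that data is recorded around the clock, and the use of decimal units (1 TByte = 10^6 MBytes), are added here for illustration.

```python
# Figures quoted on the slide.
BUNCH_CROSSING_INTERVAL_NS = 25
TRIGGERS_PER_SECOND = 100
EVENT_SIZE_MBYTES = 1  # "~1 MByte", treated as exactly 1 here

# Assumption (not from the slide): recording runs continuously, 24 hours a day.
SECONDS_PER_DAY = 24 * 60 * 60

# One crossing every 25 ns means 10^9 / 25 crossings per second.
crossings_per_second = 1_000_000_000 // BUNCH_CROSSING_INTERVAL_NS

# Only a small fraction of crossings is kept by the trigger.
crossings_per_trigger = crossings_per_second // TRIGGERS_PER_SECOND

# Recorded data rate: triggers per second times event size.
recorded_mbytes_per_second = TRIGGERS_PER_SECOND * EVENT_SIZE_MBYTES

# Daily volume in decimal TBytes, and days needed to accumulate one PByte.
recorded_tbytes_per_day = recorded_mbytes_per_second * SECONDS_PER_DAY / 1_000_000
days_per_pbyte = 1_000 / recorded_tbytes_per_day

print(f"{crossings_per_second:,} crossings/sec, 1 trigger per {crossings_per_trigger:,} crossings")
print(f"{recorded_mbytes_per_second} MBytes/sec recorded, {recorded_tbytes_per_day:.2f} TBytes/day")
print(f"about {days_per_pbyte:.0f} days per PByte")
```

The recorded rate of 100 MBytes/sec matches the ~100 MBytes/sec link rate in the figure, and at that rate a petabyte accumulates in a few months of continuous running, which is the scale the Data Grid problem statement refers to.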


IS 257 – Fall 2006 2006.11.28- SLIDE 21

Grids and Open Standards
[Figure: chart with axes labeled “Time” and “Increased functionality, standardization”. Labels: Custom solutions; X.509, LDAP, FTP, …; Globus Toolkit; De facto standards; GGF: GridFTP, GSI; Web services; Open Grid Services Arch; GGF: OGSI, … (+ OASIS, W3C); Multiple implementations, including Globus Toolkit; App-specific Services]


IS 257 – Fall 2006 2006.11.28- SLIDE 22

The Grid as Enabler of 21st Century Science
• Entirely new approaches to enquiry based on
  – Deep analysis of huge quantities of data
  – Interdisciplinary collaboration
  – Large-scale simulation
  – Smart instrumentation
• Enabled by an infrastructure that provides access to, and integration of, resources & services without regard for location


IS 257 – Fall 2006 2006.11.28- SLIDE 23

Not only Science…
• The Database world is moving to the Grid for large-scale applications
• Oracle 10g is specifically designed to exploit clustered/grid computing using RACs (Real Application Clusters)
• An example from the Information/Publishing world…
  – Presentation from Oracle about Thomson Legal’s use of Oracle 10g and RACs