New Generation Database Systems: XML Databases and Grid-based Digital Libraries
IS 257 – Fall 2006 2006.11.28- SLIDE 1
New Generation Database Systems: XML Databases and Grid-based Digital Libraries
University of California, Berkeley
School of Information
IS 257: Database Management
IS 257 – Fall 2006 2006.11.28- SLIDE 2
Lecture Outline
• XML and DBMS
• The Grid and DBMS
  – The Grid
  – Data Grids
  – Grid-based DBMS
IS 257 – Fall 2006 2006.11.28- SLIDE 3
Lecture Outline
• XML and DBMS
• The Grid and DBMS
  – The Grid
  – Data Grids
  – Grid-based DBMS
IS 257 – Fall 2006 2006.11.28- SLIDE 4
Standards: XML/SQL
• As part of SQL3, an extension that provides a mapping between XML and the DBMS is being created, called XML/SQL (standardized by ISO as SQL/XML)
• The (draft) standard is very complex, but the ideas are actually pretty simple
• Suppose we have a table called EMPLOYEE that has columns EMPNO, FIRSTNAME, LASTNAME, BIRTHDATE, SALARY
IS 257 – Fall 2006 2006.11.28- SLIDE 5
Standards: XML/SQL
• That table can be mapped to:
<EMPLOYEE>
  <row>
    <EMPNO>000020</EMPNO>
    <FIRSTNAME>John</FIRSTNAME>
    <LASTNAME>Smith</LASTNAME>
    <BIRTHDATE>1955-08-21</BIRTHDATE>
    <SALARY>52300.00</SALARY>
  </row>
  <row> … etc. …
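The table-to-XML mapping on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not the mapping algorithm defined by the standard: the helper name `table_to_xml` and the in-memory row list are assumptions, and only the element names and the sample row come from the slide.

```python
import xml.etree.ElementTree as ET

# Column names as listed on the slide for the EMPLOYEE table.
COLUMNS = ["EMPNO", "FIRSTNAME", "LASTNAME", "BIRTHDATE", "SALARY"]


def table_to_xml(table_name, columns, rows):
    """Hypothetical helper: wrap each row in <row> and each column in its own element."""
    root = ET.Element(table_name)
    for row in rows:
        row_el = ET.SubElement(root, "row")
        for column, value in zip(columns, row):
            ET.SubElement(row_el, column).text = str(value)
    return root


# The single sample row shown on the slide.
rows = [("000020", "John", "Smith", "1955-08-21", "52300.00")]
employee_xml = ET.tostring(table_to_xml("EMPLOYEE", COLUMNS, rows), encoding="unicode")
print(employee_xml)
```

In a real system the rows would come from a query result rather than a literal list, and the column values would need type-aware formatting (dates, decimals, NULLs), which this sketch ignores.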
IS 257 – Fall 2006 2006.11.28- SLIDE 6
Standards: XML/SQL
• In addition, the standard says that XML Schemas must be generated for each table, and it also allows relations to be managed by nesting records from tables in the XML.
• Variants of this are incorporated into the latest versions of ORACLE
• (Slides from Oracle Web Site on ORACLE XML)
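The idea of managing a relation by nesting records can be pictured with a small Python sketch. Everything beyond the EMPLOYEE column names is an assumption made for illustration: the DEPARTMENT table, the DEPTNO foreign key, and the helper `nest_employees` do not appear in the slides, and the element layout is not taken from the standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical parent table; DEPARTMENT, DEPTNO and DEPTNAME are not in the slides.
departments = [("D01", "Research")]
# EMPLOYEE rows reduced to three columns, plus an assumed DEPTNO foreign key.
employees = [("000020", "John", "Smith", "D01")]


def nest_employees(departments, employees):
    """Nest each department's employee rows inside that department's <row>."""
    root = ET.Element("DEPARTMENT")
    for deptno, deptname in departments:
        dept_row = ET.SubElement(root, "row")
        ET.SubElement(dept_row, "DEPTNO").text = deptno
        ET.SubElement(dept_row, "DEPTNAME").text = deptname
        nested = ET.SubElement(dept_row, "EMPLOYEE")
        for empno, first, last, emp_deptno in employees:
            if emp_deptno != deptno:
                continue  # this employee belongs to another department
            emp_row = ET.SubElement(nested, "row")
            ET.SubElement(emp_row, "EMPNO").text = empno
            ET.SubElement(emp_row, "FIRSTNAME").text = first
            ET.SubElement(emp_row, "LASTNAME").text = last
    return root


nested_xml = nest_employees(departments, employees)
print(ET.tostring(nested_xml, encoding="unicode"))
```

The join that a relational query would express through the foreign key is represented here by containment: an employee record sits inside the department record it refers to.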
IS 257 – Fall 2006 2006.11.28- SLIDE 7
Lecture Outline
• XML and DBMS
• The Grid and DBMS
  – The Grid
  – Data Grids
  – Grid-based DBMS
IS 257 – Fall 2006 2006.11.28- SLIDE 8
Grid-based Digital Libraries
• So what’s this Grid thing anyhow?
• Data Grids and Distributed Storage
• Grid-Based IR
• Grid-Based Digital Libraries
This lecture borrows heavily from presentations by Ian Foster (Argonne National Laboratory & University of Chicago), Reagan Moore, and others from the San Diego Supercomputer Center
IS 257 – Fall 2006 2006.11.28- SLIDE 9
The Grid: On-Demand Access to Electricity
[Figure: axes labeled “Time” and “Quality, economies of scale”]
Source: Ian Foster
IS 257 – Fall 2006 2006.11.28- SLIDE 10
By Analogy, A Computing Grid
• Decouples production and consumption
  – Enable on-demand access
  – Achieve economies of scale
  – Enhance consumer flexibility
  – Enable new devices
• On a variety of scales
  – Department
  – Campus
  – Enterprise
  – Internet
Source: Ian Foster
IS 257 – Fall 2006 2006.11.28- SLIDE 11
What is the Grid?
“The short answer is that, whereas the Web is a service for sharing information over the Internet, the Grid is a service for sharing computer power and data storage capacity over the Internet. The Grid goes well beyond simple communication between computers, and aims ultimately to turn the global network of computers into one vast computational resource.”
Source: The Global Grid Forum
IS 257 – Fall 2006 2006.11.28- SLIDE 12
Not Exactly a New Idea …
• “The time-sharing computer system can unite a group of investigators …. one can conceive of such a facility as an … intellectual public utility.” – Fernando Corbato and Robert Fano, 1966
• “We will perhaps see the spread of ‘computer utilities’, which, like present electric and telephone utilities, will service individual homes and offices across the country.” – Len Kleinrock, 1967
Source: Ian Foster
IS 257 – Fall 2006 2006.11.28- SLIDE 13
But, Things are Different Now
• Networks are far faster (and cheaper)
  – Faster than computer backplanes
• “Computing” is very different than pre-Net
  – Our “computers” have already disintegrated
  – E-commerce increases size of demand peaks
  – Entirely new applications & social structures
• We’ve learned a few things about software
Source: Ian Foster
IS 257 – Fall 2006 2006.11.28- SLIDE 14
Computing isn’t Really Like Electricity
• I import electricity but must export data
• “Computing” is not interchangeable but highly heterogeneous: data, sensors, services, …
• This complicates things, but also means that the sum can be greater than the parts
  – Real opportunity: Construct new capabilities dynamically from distributed services
• Raises three fundamental questions
  – Can I really achieve economies of scale?
  – Can I achieve QoS across distributed services?
  – Can I identify apps that exploit synergies?
Source: Ian Foster
IS 257 – Fall 2006 2006.11.28- SLIDE 15
Why the Grid? (1) Revolution in Science
• Pre-Internet
  – Theorize &/or experiment, alone or in small teams; publish paper
• Post-Internet
  – Construct and mine large databases of observational or simulation data
  – Develop simulations & analyses
  – Access specialized devices remotely
  – Exchange information within distributed multidisciplinary teams
Source: Ian Foster
IS 257 – Fall 2006 2006.11.28- SLIDE 16
Why the Grid? (2) Revolution in Business
• Pre-Internet
  – Central data processing facility
• Post-Internet
  – Enterprise computing is highly distributed, heterogeneous, inter-enterprise (B2B)
  – Business processes increasingly computing- & data-rich
  – Outsourcing becomes feasible => service providers of various sorts
Source: Ian Foster
IS 257 – Fall 2006 2006.11.28- SLIDE 17
The Information Grid
Imagine a web of data
• Machine Readable
  – Search, Aggregate, Transform, Report On, Mine Data – using more computers, and fewer humans
• Scalable
  – Machines are cheap – can buy 50 machines with 100 GB of memory and 100 TB disk for under $100K, and dropping
  – Network is now faster than disk
• Flexible
  – Move data around without breaking the apps
Source: S. Banerjee, O. Alonso, M. Drake - ORACLE
IS 257 – Fall 2006 2006.11.28- SLIDE 18
The Foundations are Being Laid
[Figure: network map of Grid sites]
Legend: Tier0/1 facility; Tier2 facility; Tier3 facility; 10 Gbps link; 2.5 Gbps link; 622 Mbps link; other link
Sites shown: Cambridge, Newcastle, Edinburgh, Oxford, Glasgow, Manchester, Cardiff, Soton, London, Belfast, DL, RAL, Hinxton
IS 257 – Fall 2006 2006.11.28- SLIDE 19
Data Grid Problem
• “Enable a geographically distributed community [of thousands] to pool their resources in order to perform sophisticated, computationally intensive analyses on Petabytes of data”
• Note that this problem:
  – Is common to many areas of science
  – Overlaps strongly with other Grid problems
IS 257 – Fall 2006 2006.11.28- SLIDE 20
Data Grids for High Energy Physics
[Figure: tiered data grid for high energy physics; tier labels shown: Tier 0, Tier 1, Tier 2, Tier 4]
Sites and capacities shown in the figure:
– Online System
– Offline Processor Farm ~20 TIPS
– CERN Computer Centre
– FermiLab ~4 TIPS; France, Italy, and Germany Regional Centres
– Caltech ~1 TIPS; four Tier2 Centres at ~1 TIPS each
– Institute (four shown) ~0.25 TIPS, with a physics data cache
– Physicist workstations
Link speeds labeled in the figure: ~PBytes/sec; ~100 MBytes/sec (twice); ~622 Mbits/sec (twice); ~622 Mbits/sec or Air Freight (deprecated); ~1 MBytes/sec
Notes on the figure:
• There is a “bunch crossing” every 25 nsecs.
• There are 100 “triggers” per second
• Each triggered event is ~1 MByte in size
• Physicists work on analysis “channels”.
• Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server
• 1 TIPS is approximately 25,000 SpecInt95 equivalents
Image courtesy Harvey Newman, Caltech
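The figures quoted on this slide fit together with some simple arithmetic, sketched below in Python. The constants are the slide's own numbers; the variable names are made up for the example, and the per-day volume assumes round-the-clock running with decimal units (1 TB = 1,000,000 MBytes), which the slide does not state.

```python
# Numbers taken from the slide.
BUNCH_CROSSING_INTERVAL_NS = 25   # one bunch crossing every 25 nsecs
TRIGGERS_PER_SECOND = 100         # events kept per second
EVENT_SIZE_MBYTES = 1             # each triggered event is ~1 MByte
SPECINT95_PER_TIPS = 25_000       # 1 TIPS is ~25,000 SpecInt95 equivalents
FARM_TIPS = 20                    # Offline Processor Farm capacity

NS_PER_SECOND = 1_000_000_000

# Bunch crossings per second (integer arithmetic avoids float rounding).
crossings_per_second = NS_PER_SECOND // BUNCH_CROSSING_INTERVAL_NS

# Only one crossing in this many ends up as a recorded trigger.
crossings_per_trigger = crossings_per_second // TRIGGERS_PER_SECOND

# Data rate of triggered events, matching the ~100 MBytes/sec link label.
trigger_rate_mbytes_per_s = TRIGGERS_PER_SECOND * EVENT_SIZE_MBYTES

# Assumption: continuous running for 24 hours, decimal units.
tbytes_per_day = trigger_rate_mbytes_per_s * 86_400 / 1_000_000

# Processor farm capacity expressed in SpecInt95 equivalents.
farm_specint95 = FARM_TIPS * SPECINT95_PER_TIPS

print(crossings_per_second, crossings_per_trigger)
print(trigger_rate_mbytes_per_s, tbytes_per_day, farm_specint95)
```

The 100 triggers per second at about 1 MByte each account for the ~100 MBytes/sec label, and the ratio of crossings to triggers shows how aggressively the data is filtered before it reaches the offline farm.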
IS 257 – Fall 2006 2006.11.28- SLIDE 21
Grids and Open Standards
[Figure: progression over “Time” toward “Increased functionality, standardization”]
Labels shown in the figure:
– Custom solutions
– X.509, LDAP, FTP, …
– Globus Toolkit
– De facto standards; GGF: GridFTP, GSI
– Web services
– Open Grid Services Arch
– GGF: OGSI, … (+ OASIS, W3C)
– Multiple implementations, including Globus Toolkit
– App-specific Services
IS 257 – Fall 2006 2006.11.28- SLIDE 22
The Grid as Enabler of 21st Century Science
• Entirely new approaches to enquiry based on
  – Deep analysis of huge quantities of data
  – Interdisciplinary collaboration
  – Large-scale simulation
  – Smart instrumentation
• Enabled by an infrastructure that enables access to, and integration of, resources & services without regard for location
IS 257 – Fall 2006 2006.11.28- SLIDE 23
Not only Science…
• The Database world is moving to the Grid for large-scale applications
• Oracle 10g is specifically designed to exploit clustered/grid computing using RACs (Real Application Clusters)
• An example from the Information/Publishing world…
  – Presentation from Oracle about Thomson Legal’s use of Oracle 10g and RACs