Distributed Tera-Mining
R. L. Grossman
Laboratory for Advanced Computing
University of Illinois &
Magnify, Inc.
Trend 1. Explosion of Data …
… All in the Wrong Format
With no one to analyze it.
The Data Gap
[Chart: total new disk (TB) since 1995 compared with new Ph.D.s, 1995 to 1999; axis from 0 to 4,000,000]
Most data comes a GB and a TB at a time.
Trend 2. SONET is dead. Lambda Rules.
Gigabytes can be moved in seconds.
Trend 3: Most Data is Distributed
Bush’s Law: The usefulness of a column of data varies as the square of the number of columns it is compared to.
Example 1: ENSO & Cholera
El Nino Data at NCAR
Cholera Data at WHO
Example 2: Voting
Table 1
County      BUCHANAN
ALACHUA     263
BAKER       73
BAY         248
BRADFORD    65
BREVARD     570
BROWARD     788

Table 2
County      Reform
Alachua     91
Baker       4
Bay         55
Bradford    3
Brevard     148
Broward     332
Correlation: Reform Voters vs Votes for Buchanan
[Scatter plot by county; axis ranges 0 to 4000 and 0 to 450; the Palm Beach point is labeled]
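The comparison in Example 2 amounts to joining the two tables on county and then correlating the two columns. The following is a minimal sketch, assuming only the six county rows shown in Table 1 and Table 2 (not the full data set behind the chart). The function names `join_on_county` and `pearson` are illustrative and do not come from the slides.

```python
from math import sqrt

# Rows copied from Table 1 (votes for Buchanan) and Table 2 (registered Reform voters).
buchanan = {"ALACHUA": 263, "BAKER": 73, "BAY": 248,
            "BRADFORD": 65, "BREVARD": 570, "BROWARD": 788}
reform = {"Alachua": 91, "Baker": 4, "Bay": 55,
          "Bradford": 3, "Brevard": 148, "Broward": 332}

def join_on_county(left, right):
    """Inner-join two county-keyed tables, ignoring case in the county name."""
    right_norm = {name.upper(): value for name, value in right.items()}
    return {name.upper(): (value, right_norm[name.upper()])
            for name, value in left.items() if name.upper() in right_norm}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

joined = join_on_county(buchanan, reform)
buchanan_votes = [b for b, _ in joined.values()]
reform_voters = [r for _, r in joined.values()]
r = pearson(reform_voters, buchanan_votes)
print(f"{len(joined)} counties joined, r = {r:.3f}")
```

The two tables spell county names in different cases, so the join normalizes the key first. That small mismatch is typical of the work needed before distributed columns can be compared.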
DataSpace – One Approach to Making Data Useful
Today’s Multi-media Web
16 terabytes of documents
4 billion documents
• html
• http
• search by keyword
• workstations, servers

Tomorrow’s Data Web
petabytes of data
tens of billions to trillions of records
• pmml & dtml
• dstp
• correlate & mine
• data & compute clusters
Complementary to the grid, which we view as a distributed computer.
[Diagram: DSTP Server 1 and DSTP Server 2 serving keyed columns (k[i], x[i]) and (k[i], y[j]); labels: attributes [aid], UCK [uckid]]
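The diagram suggests that columns held on different servers are combined through a shared key. The following is a rough sketch of that idea, assuming that k is such a shared key. The two lists are hypothetical in-memory stand-ins with made-up values; no real DSTP calls or message formats are shown, and `correlate_by_key` is an illustrative name.

```python
# Hypothetical in-memory stand-ins for two servers; the values are made up.
server_1 = [(101, 0.8), (102, 1.4), (104, 2.1)]   # (k, x) pairs
server_2 = [(102, 30), (103, 12), (104, 45)]      # (k, y) pairs

def correlate_by_key(left, right):
    """Return (k, x, y) for every key k present in both columns."""
    right_index = dict(right)  # index one side by key for constant-time lookup
    return [(k, x, right_index[k]) for k, x in left if k in right_index]

merged = correlate_by_key(server_1, server_2)
print(merged)
```

Only keys present on both sides survive the merge, so the resulting (x, y) pairs can be passed directly to a correlation or mining step.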
Terra Mining Testbed
Optical testbed for distributed tera mining of scientific data.
Goal also to be testbed for broadband based business services.
Lessons Learned
1. It’s the data, stupid. Cycles, cylinders & lambdas are all commodities.
2. The fundamental challenge: lower the cost to make data useful.
3. The emergence of internet infrastructure for data is inevitable.
Opens up possibilities for new
types of scientific discoveries.
For More Information
DataSpace – http://www.dataspaceweb.net
DataSpace – http://www.ncdm.uic.edu
DataSpace Standards – http://www.dmg.org
Selected articles – http://www.twocultures.net
Magnify – http://www.magnify.com
End of Slides
FTP Still Lives
Trend 2. Bandwidth is a Commodity
OC-3 OC-12 OC-48
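The bandwidth trend can be made concrete with line-rate arithmetic. The following is a rough sketch, assuming the nominal SONET line rates for OC-3, OC-12 and OC-48, decimal units (1 GB = 10^9 bytes), and no protocol or framing overhead. The rates and the helper `transfer_seconds` are not taken from the slides.

```python
# Nominal SONET line rates in Mbit/s (assumed here, not stated on the slide).
OC_RATES_MBPS = {"OC-3": 155.52, "OC-12": 622.08, "OC-48": 2488.32}

GB = 10**9   # bytes, decimal units
TB = 10**12  # bytes, decimal units

def transfer_seconds(num_bytes, rate_mbps):
    """Ideal transfer time at full line rate, ignoring all overhead."""
    return num_bytes * 8 / (rate_mbps * 1e6)

for name, rate in OC_RATES_MBPS.items():
    gb_seconds = transfer_seconds(GB, rate)
    tb_hours = transfer_seconds(TB, rate) / 3600
    print(f"{name}: 1 GB in {gb_seconds:.1f} s, 1 TB in {tb_hours:.1f} h")
```

Under these idealized assumptions a gigabyte takes a few seconds at OC-48 and under a minute at OC-3, while a terabyte is still measured in hours at OC-3.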
El Nino Anomalies
Indonesia Cholera Cases
Cholera Cases
Distributed Exabytes (New Disks)
[Chart: new disk capacity in petabytes, 1995 to 2003; vertical axis from 0 to 14,000 petabytes, with the 1 exabyte level marked]
Source: IDC (1999) "1999 Winchester Disk Drive Market Forecast and Review"
Trend 3: Most Data is Distributed
W’s Law: The usefulness of a column of data varies as the square of the number of columns it is compared to.
Example 2: Voting
Database 1: Total Votes for Buchanan by County
County      BUCHANAN
ALACHUA     263
BAKER       73
BAY         248
BRADFORD    65
BREVARD     570
BROWARD     788

Database 2: Total Registered Reform Voters by County
County      Reform
Alachua     91
Baker       4
Bay         55
Bradford    3
Brevard     148
Broward     332