A New Partnership for Cross-Scale, Cross-Domain eScience
![Page 1: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/1.jpg)
A New Partnership for eScience
Bill Howe, UW
Ed Lazowska, UW
Garth Gibson, CMU
Christos Faloutsos, CMU
Peter Lee, CMU (DARPA)
Chris Mentzel, Moore
![Page 2: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/2.jpg)
![Page 3: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/3.jpg)
http://escience.washington.edu
![Page 4: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/4.jpg)
3/12/09 Bill Howe, eScience Institute 4
![Page 5: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/5.jpg)
![Page 6: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/6.jpg)
The University of Washington eScience Institute
Rationale
• The exponential increase in sensors is transitioning all fields of science and engineering from data-poor to data-rich
• Techniques and technologies include sensors and sensor networks, databases, data mining, machine learning, visualization, cluster/cloud computing
• If these techniques and technologies are not widely available and widely practiced, UW will cease to be competitive
Mission
• Help position the University of Washington at the forefront of research both in modern eScience techniques and technologies, and in the fields that depend upon them
Strategy
• Bootstrap a cadre of Research Scientists
• Add faculty in key fields
• Build out a “consultancy” of students and non-research staff
![Page 7: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/7.jpg)
Staff and Funding
Funding
• $1M/year direct appropriation from WA State Legislature
• $1.5M from Gordon and Betty Moore Foundation (joint with CMU)
• Multiple proposals outstanding
Staffing
• Dave Beck, Research Scientist: Biosciences and software eng.
• Jeff Gardner, Research Scientist: Astrophysics and HPC
• Bill Howe, Research Scientist: Databases, visualization, DISC
• Ed Lazowska, Director
• Erik Lundberg (50%), Operations Director
• Mette Peters, Health Sciences Liaison
• Chance Reschke, Research Engineer: large scale computing platforms
• …plus a senior faculty search underway
• …plus a “consultancy” of students and professional staff
![Page 8: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/8.jpg)
All science is reducing to a database problem
Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
New model: “Download the world” (Data acquired en masse, in support of many hypotheses)
• Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
• Medicine: ubiquitous digital records, MRI, ultrasound
• Oceanography: high-resolution models, cheap sensors, satellites
• Biology: lab automation, high-throughput sequencing
“Increase data collection exponentially with FlowCam!”
![Page 9: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/9.jpg)
The long tail is getting fatter:
notebooks become spreadsheets (MB), spreadsheets become databases (GB), databases become clusters (TB), clusters become clouds (PB)
[Figure: The Long Tail, plotting data volume against rank. Labeled points: CERN (~15PB/year), LSST (~100PB), PanSTARRS (~40PB), SDSS (~100TB), CARMEN (~50TB), Ocean Modelers, Seismologists, Microbiologists, <Spreadsheet users>]
Researchers with growing data management challenges but limited resources for cyberinfrastructure
• No dedicated IT staff
• Over-reliance on inadequate but familiar tools
“The future is already here. It’s just not very evenly distributed.” -- William Gibson
![Page 10: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/10.jpg)
Case Study: Armbrust Lab
![Page 11: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/11.jpg)
Armbrust Lab Tech Roadmap
[Figure: tools plotted by scalability (workstation/server to cluster/cloud) and specialization (general tasks to specific tasks), marked as Past, Present, Soon, or Other tools. Tools shown: ClustalW, MAQ, Excel, NCBI BLAST, Phred/Phrap, CloudBurst, CLC Genomics Machine, Hadoop/Dryad, Parallel Databases, Azure, AWS, WebBlast*, RDBMS, R, PPlacer*, AnnoJ, BioPython, ?]
![Page 12: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/12.jpg)
What Does Scalable Mean?
Operationally:
• In the past: “Works even if data doesn’t fit in main memory”
• Now: “Can make use of 1000s of cheap computers”
Formally:
• In the past: polynomial time and space. If you have N data items, you must do no more than N^k operations
• Soon: log-linear time and linear space. If you have N data items, you must do no more than N log(N) operations
Soon, you’ll only get one pass at the data
So you better make that one pass count
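One way to picture the "one pass" constraint is a streaming summary that touches each item exactly once and keeps only constant state. The sketch below uses Welford's method for mean and variance; the class name, function name, and sample readings are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Mean and variance maintained in a single pass (Welford's method)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; undefined for an empty stream.
        return self.m2 / self.count if self.count else float("nan")


def summarize(stream):
    """Consume an iterable once, without storing its items."""
    stats = RunningStats()
    for x in stream:
        stats.update(x)
    return stats


# Hypothetical readings; a generator makes the single pass explicit.
readings = (t for t in [120000.0, 150000.0, 180000.0])
result = summarize(readings)
print(result.count, result.mean, result.variance)
```

Memory use stays constant regardless of N, which is the property that matters when the data cannot be revisited.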
![Page 13: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/13.jpg)
A Goal: Cross-Scale Solutions
Gracefully scale up
• from files to databases to cluster to cloud
• from MB to GB to TB to PB
“Gracefully” means:
• logical data independence
• no expensive ETL migration projects
“Gracefully” means: everyone can use it
• Hackers / Computational Scientists
• Lab/Field Scientists
• The Public
• K12
• Legislators
![Page 14: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/14.jpg)
| | Data Model | Operations | Services |
|---|---|---|---|
| GPL | * | * | None for free |
| Workflow | * | arbitrary boxes-and-arrows | typing, provenance, Pegasus-style resource mapping, task parallelism |
| SQL / Relational Algebra | Relations | Select, Project, Join, Aggregate, … | optimization, physical data independence, indexing, parallelism |
| MapReduce | [(key,value)] | Map, Reduce | massive data parallelism, fault tolerance, scheduling |
| Pig | Nested Relations | RA-like, with Nest/Flatten | optimization, monitoring, scheduling |
| DryadLINQ | IQueryable, IEnumerable | RA + Apply + Partitioning | typing, massive data parallelism, fault tolerance |
| MPI | Arrays/Matrices | 70+ ops | data parallelism, full control |
![Page 15: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/15.jpg)
MapReduce
Many tasks process big data, produce big data
Want to use hundreds or thousands of CPUs
... but this needs to be easy
Parallel databases exist, but require DBAs and $$$$
…and do not easily scale to thousands of computers
MapReduce is a lightweight framework, providing:
• Automatic parallelization and distribution
• Fault-tolerance
• I/O scheduling
• Status and monitoring
![Page 16: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/16.jpg)
// One parsed line of the web log.
public class LogEntry
{
    public string user, ip;
    public string page;

    public LogEntry(string line)
    {
        string[] fields = line.Split(' ');
        this.user = fields[8];
        this.ip = fields[9];
        this.page = fields[5];
    }
}

// Result row: how often a user requested a page.
public class UserPageCount
{
    public string user, page;
    public int count;

    public UserPageCount(string usr, string page, int cnt)
    {
        this.user = usr;
        this.page = page;
        this.count = cnt;
    }
}

// Read the partitioned log and skip comment lines.
PartitionedTable<string> logs = PartitionedTable.Get<string>(@"file:\\…\logfile.pt");
var logentries = from line in logs
                 where !line.StartsWith("#")
                 select new LogEntry(line);
// Keep only the accesses made by one user.
var user = from access in logentries
           where access.user.EndsWith(@"\ulfar")
           select access;
// Count accesses per page.
var accesses = from access in user
               group access by access.page into pages
               select new UserPageCount("ulfar", pages.Key, pages.Count());
// Keep .htm pages, most frequently accessed first.
var htmAccesses = from access in accesses
                  where access.page.EndsWith(".htm")
                  orderby access.count descending
                  select access;
htmAccesses.ToPartitionedTable(@"file:\\…\results.pt");
slide source: Christophe Poulain, MSR
A complete DryadLINQ program
![Page 17: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/17.jpg)
Relational Databases
Pre-relational DBMS brittleness: if your data changed, your application often broke.
Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.
[Figure: three layers, bottom to top: files and pointers, relations, views. Physical data independence separates relations from files and pointers; logical data independence separates views from relations.]
“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” -- E. F. Codd, 1970
Key Idea: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation
![Page 18: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/18.jpg)
Relational Databases
Databases are especially, but not exclusively, effective at “Needle in Haystack” problems:
• Extracting small results from big datasets
• Transparently provide “old style” scalability
• Your query will always* finish, regardless of dataset size.
• Indexes are easily built and automatically used when appropriate
CREATE INDEX seq_idx ON sequence(seq);
SELECT seq FROM sequence WHERE seq = 'GATTACGATATTA';
*almost
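The "automatically used" claim can be checked on a small scale. The following is a minimal sketch using Python's built-in sqlite3 module as a stand-in for whatever RDBMS the talk had in mind; the table contents and the `id` column are made up for illustration. It builds the index from the slide and asks the planner how it would run the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequence (id INTEGER PRIMARY KEY, seq TEXT)")
conn.executemany(
    "INSERT INTO sequence (seq) VALUES (?)",
    [("GATTACGATATTA",), ("ACGTACGTACGTA",), ("TTTTGGGGCCCCA",)],  # made-up rows
)
conn.execute("CREATE INDEX seq_idx ON sequence(seq)")

query = "SELECT seq FROM sequence WHERE seq = ?"
target = "GATTACGATATTA"

# The query text never mentions the index; the planner chooses it.
matches = [row[0] for row in conn.execute(query, (target,))]

# The last column of each EXPLAIN QUERY PLAN row is a human-readable step.
plan = " ".join(
    row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, (target,))
)
print(matches)
print(plan)
```

The plan text names `seq_idx` even though the application only wrote a plain SELECT, which is physical data independence in miniature.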
![Page 19: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/19.jpg)
Key Idea: Data Independence
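[Figure: three layers, top to bottom: views, relations, files and pointers. Logical data independence separates views from relations; physical data independence separates relations from files and pointers. Example code at each layer follows.]
views:
SELECT * FROM my_sequences;
relations:
SELECT seq FROM ncbi_sequences WHERE seq = 'GATTACGATATTA';
files and pointers:
f = fopen("table_file", "rb");
fseek(f, 10030440, SEEK_SET);
while (1) {
    fread(buf, 1, 8192, f);
    if (memcmp(buf, "GATTACGATATTA", 13) == 0) {
        . . .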
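Logical data independence can be illustrated with a view that survives a schema change. This is a sketch only, again using sqlite3 as a stand-in: `my_sequences` and `ncbi_sequences` come from the slide, while the `reads` table and its `bases` and `source` columns are hypothetical names introduced for the reorganization.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ncbi_sequences (id INTEGER PRIMARY KEY, seq TEXT)")
conn.execute("INSERT INTO ncbi_sequences (seq) VALUES ('GATTACGATATTA')")
conn.execute("CREATE VIEW my_sequences AS SELECT seq FROM ncbi_sequences")

# The application only ever issues this query.
app_query = "SELECT * FROM my_sequences"
before = conn.execute(app_query).fetchall()

# Reorganize storage: new table, new column names (all hypothetical).
conn.execute(
    "CREATE TABLE reads (read_id INTEGER PRIMARY KEY, bases TEXT, source TEXT)"
)
conn.execute(
    "INSERT INTO reads (bases, source) SELECT seq, 'ncbi' FROM ncbi_sequences"
)
conn.execute("DROP VIEW my_sequences")
conn.execute("DROP TABLE ncbi_sequences")

# Redefine the view over the new layout; the application query is untouched.
conn.execute("CREATE VIEW my_sequences AS SELECT bases AS seq FROM reads")
after = conn.execute(app_query).fetchall()
print(before, after)
```

Only the view definition changed; code written against `my_sequences` keeps working, in contrast with the file-and-offset style, where the byte offset is baked into the program.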
![Page 20: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/20.jpg)
Key Idea: An Algebra of Tables
[Figure: operator diagram showing select, project, and join]
Other operators: aggregate, union, difference, cross product
![Page 21: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/21.jpg)
Key Idea: Algebraic Optimization
N = ((z*2)+((z*3)+0))/1
Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2:
N = (2+3)*z
two operations instead of five, no division operator
Same idea works with the Relational Algebra!
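A relational analogue of these rewrites is pushing a selection below a join: both plans return the same rows, but the second feeds fewer rows into the join. The sketch below is illustrative only; the tables, column names, and the fixed date standing in for today() are assumptions, and real optimizers apply many more laws than this one.

```python
from datetime import date

TODAY = date(2009, 3, 12)  # arbitrary stand-in for today()

orders = [
    {"order_id": 1, "item": "probe", "date": TODAY},
    {"order_id": 2, "item": "filter", "date": date(2009, 3, 11)},
    {"order_id": 3, "item": "filter", "date": TODAY},
]
lines = [
    {"item": "probe", "price": 40},
    {"item": "filter", "price": 5},
    {"item": "pump", "price": 900},
]


def select(rows, predicate):
    """Relational selection: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def join(left, right, key):
    """Naive equi-join on a shared column name."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def is_today(row):
    return row["date"] == TODAY


# Plan A: join everything, then filter.
joined = join(orders, lines, "item")
plan_a = select(joined, is_today)

# Plan B: filter first, then join the smaller input.
todays_orders = select(orders, is_today)
plan_b = join(todays_orders, lines, "item")

print(plan_a == plan_b, len(joined), len(todays_orders))
```

Because the predicate mentions only columns of `orders`, the law "select after join equals join after select" applies, just as x+0 = x applied to the arithmetic expression.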
![Page 22: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/22.jpg)
Shared Nothing Parallel Databases
• Teradata
• Greenplum
• Netezza
• Aster Data Systems
• DATAllegro (Microsoft)
• Vertica
• MonetDB (recently commercialized as “Vectorwise”)
![Page 23: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/23.jpg)
Case Study: Astrophysics Simulation
![Page 24: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/24.jpg)
N-body Astrophysics Simulation
• 15 years in dev
• 10^9 particles
• Gravity
• Months to run
• 7.5 million CPU hours
• 500 timesteps
• Big Bang to now
Simulations from Tom Quinn’s Lab, work by Sarah Loebman, YongChul Kwon, Bill Howe, Jeff Gardner, Magda Balazinska
![Page 25: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/25.jpg)
Q1: Find Hot Gas
SELECT id
FROM gas
WHERE temp > 150000
![Page 26: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/26.jpg)
Single Node: Query 1
[Chart: single-node results for Query 1 at dataset sizes 169 MB, 1.4 GB, and 36 GB]
![Page 27: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/27.jpg)
Multiple Nodes: Query 1
Database Z
![Page 28: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/28.jpg)
Multiple Nodes: Query 2
Database Z
![Page 29: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/29.jpg)
Q4: Gas Deletion
SELECT gas1.id
FROM gas1
FULL OUTER JOIN gas2
ON gas1.id=gas2.id
WHERE gas2.id IS NULL
Particles removed between two timesteps
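A common pitfall with this outer-join pattern is the NULL test: in SQL, comparing a value to NULL with `=` never evaluates to true, so the filter has to be written with IS NULL. The sketch below demonstrates the difference with sqlite3 on made-up particle ids. It uses LEFT OUTER JOIN rather than FULL OUTER JOIN because older SQLite builds lack the latter; for this particular query the two return the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE gas1 (id INTEGER PRIMARY KEY);  -- earlier timestep
    CREATE TABLE gas2 (id INTEGER PRIMARY KEY);  -- later timestep
    INSERT INTO gas1 (id) VALUES (1), (2), (3), (4);
    INSERT INTO gas2 (id) VALUES (1), (3);
    """
)


def removed_ids(null_test):
    """Run the gas-deletion query with the given NULL test in the WHERE clause."""
    sql = f"""
        SELECT gas1.id
        FROM gas1
        LEFT OUTER JOIN gas2 ON gas1.id = gas2.id
        WHERE {null_test}
        ORDER BY gas1.id
    """
    return [row[0] for row in conn.execute(sql)]


with_is_null = removed_ids("gas2.id IS NULL")
with_equals_null = removed_ids("gas2.id = NULL")
print(with_is_null, with_equals_null)
```

Particles 2 and 4 are present in the first timestep and absent from the second, so only the IS NULL form reports them.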
![Page 30: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/30.jpg)
Single Node: Query 4
![Page 31: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/31.jpg)
Multiple Nodes: Query 4
![Page 32: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/32.jpg)
Ease of Use
star43 = FOREACH rawGas43 GENERATE $0 AS pid:long;
star60 = FOREACH rawGas60 GENERATE $0 AS pid:long;
groupedGas = COGROUP star43 BY pid, star60 BY pid;
selectedGas = FOREACH groupedGas GENERATE
    FLATTEN((IsEmpty(star43) ? null : star43)) AS s43,
    FLATTEN((IsEmpty(star60) ? null : star60)) AS s60;
destroyed = FILTER selectedGas BY s60 is null;
![Page 33: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/33.jpg)
Visualization and Mashups
Dancing with Data
![Page 34: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/34.jpg)
Data explosion, again
Data growth is outpacing Moore’s Law
Why?
• Cost of acquisition has dropped through the floor
• Every pairwise comparison of datasets generates a new dataset -- N^2 growth
So: Scalable analysis is necessary
But: Scalable analysis is hard
![Page 35: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/35.jpg)
It’s not just the size….
Corollary: # of apps scales as N^2
Every pairwise comparison motivates a new application
To keep up, we need to entrain new programmers, make existing programmers more productive, or both
![Page 36: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/36.jpg)
Satellite Images + Crime Incidence Reports
![Page 37: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/37.jpg)
Twitter Feed + Flickr Stream
![Page 38: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/38.jpg)
Zooplankton and Temperature
<Vis movie>
![Page 39: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/39.jpg)
Why Visualization?
• High bandwidth of the human visual cortex
• Query-writing presumes a precise goal
• Try this in SQL: “What does the salt wedge look like?”
![Page 40: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/40.jpg)
Data Product Ensembles
source: Antonio Baptista, Center for Coastal Margin Observation and Prediction
![Page 41: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/41.jpg)
Example: Find matching sequences
Given a set of sequences
Find all sequences equal to “GATTACGATATTA”
![Page 42: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/42.jpg)
Example System: Teradata
AMP = unit of parallelism
![Page 43: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/43.jpg)
Example System: Teradata
SELECT * FROM Orders o, Lines i WHERE o.item = i.item AND o.date = today()
[Query plan: scan Order o, then select date = today(); scan Item i; join the two inputs on o.item = i.item]
Find all orders from today, along with the items ordered
![Page 44: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/44.jpg)
Example System: Teradata
[Figure: AMP 1, AMP 2, and AMP 3 each run the same pipeline in parallel: scan Order o, select date=today(), hash h(item). The hashed rows are then redistributed across AMP 1, AMP 2, and AMP 3.]
![Page 45: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/45.jpg)
Example System: Teradata
[Figure: AMP 1, AMP 2, and AMP 3 each scan Item i and hash h(item). The hashed rows are then redistributed across AMP 1, AMP 2, and AMP 3.]
![Page 46: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/46.jpg)
Example System: Teradata
AMP 1: join o.item = i.item; contains all orders and all lines where hash(item) = 1
AMP 2: join o.item = i.item; contains all orders and all lines where hash(item) = 2
AMP 3: join o.item = i.item; contains all orders and all lines where hash(item) = 3
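The redistribution step can be sketched in a few lines: hash both inputs on the join key so that matching rows land in the same partition, then join each partition independently. This is a single-process illustration, not Teradata's implementation; the hash function, the row contents, and the 0-based partition numbering (the slides number AMPs from 1) are assumptions.

```python
from collections import defaultdict

NUM_AMPS = 3


def amp_for(item):
    """Deterministic toy hash mapping a join key to a partition in 0..NUM_AMPS-1."""
    return sum(item.encode("utf-8")) % NUM_AMPS


def redistribute(rows):
    """Send each row to the partition chosen by hashing its join key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[amp_for(row["item"])].append(row)
    return partitions


def local_join(orders, lines):
    """Hash join within one partition."""
    by_item = defaultdict(list)
    for line in lines:
        by_item[line["item"]].append(line)
    return [{**o, **l} for o in orders for l in by_item[o["item"]]]


def parallel_join(orders, lines):
    order_parts = redistribute(orders)
    line_parts = redistribute(lines)
    result = []
    for amp in range(NUM_AMPS):  # each iteration could run on its own node
        result.extend(local_join(order_parts[amp], line_parts[amp]))
    return result


orders = [
    {"order_id": 1, "item": "probe"},
    {"order_id": 2, "item": "filter"},
    {"order_id": 3, "item": "filter"},
]
lines = [
    {"item": "probe", "price": 40},
    {"item": "filter", "price": 5},
    {"item": "pump", "price": 900},
]
joined = sorted(parallel_join(orders, lines), key=lambda r: r["order_id"])
print(joined)
```

No partition ever needs rows held by another partition, because equal keys always hash to the same place; that is what lets the join run without coordination after the shuffle.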
![Page 47: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/47.jpg)
MapReduce Programming Model
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
map (in_key, in_value) -> list(out_key, intermediate_value)
• Processes input key/value pair
• Produces set of intermediate pairs
reduce (out_key, list(intermediate_value)) -> list(out_value)
• Combines all intermediate values for a particular key
• Produces a set of merged output values (usually just one)
Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell
slide source: Google, Inc.
![Page 48: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/48.jpg)
Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. 
the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood.
Example: Document Processing
![Page 49: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/49.jpg)
Example: Word length histogram
How many “big”, “medium”, and “small” words are used?
![Page 50: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/50.jpg)
Big = Yellow = 10+ letters
Medium = Red = 5..9 letters
Small = Blue = 2..4 letters
Tiny = Pink = 1 letter
Example: Word length histogram
![Page 51: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/51.jpg)
Example: Word length histogram
Split the document into chunks and process each chunk on a different computer
Chunk 1
Chunk 2
![Page 52: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/52.jpg)
(yellow, 20) (red, 71) (blue, 93) (pink, 6)
Map Task 1 (204 words)
Map Task 2 (190 words)
(key, value)
(yellow, 17) (red, 77) (blue, 107) (pink, 3)
Example: Word length histogram
![Page 53: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/53.jpg)
(yellow, 17) (red, 77) (blue, 107) (pink, 3)
(yellow, 20) (red, 71) (blue, 93) (pink, 6)
Reduce tasks
(yellow, 17) (yellow, 20)
(red, 77) (red, 71)
(blue, 93) (blue, 107)
(pink, 6) (pink, 3)
Example: Word length histogram
Map task 1
Map task 2
“Shuffle step”
(yellow, 37)
(red, 148)
(blue, 200)
(pink, 9)
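The map, shuffle, and reduce steps of the word length histogram can be sketched in a single Python process. This is a minimal sketch, not the implementation behind the slides: the length ranges assigned to each color in `length_bin` are assumptions (the slides do not define them), and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def length_bin(word):
    """Assign a word to a color bin. The thresholds are assumed, not from the slides."""
    n = len(word)
    if n >= 10:
        return "yellow"
    if n >= 5:
        return "red"
    if n >= 2:
        return "blue"
    return "pink"


def map_task(chunk):
    """Map: emit one (bin, count) pair per bin found in a single chunk."""
    # split() keeps punctuation attached to words; good enough for a sketch.
    counts = Counter(length_bin(w) for w in chunk.split())
    return list(counts.items())


def shuffle(map_outputs):
    """Shuffle: group the values from all map tasks by key."""
    grouped = defaultdict(list)
    for pairs in map_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Reduce: sum the partial counts for one key."""
    return key, sum(values)


def word_length_histogram(chunks):
    """Run map on every chunk, shuffle by key, then reduce each key."""
    grouped = shuffle(map_task(c) for c in chunks)
    return dict(reduce_task(k, v) for k, v in grouped.items())
```

Each call to `map_task` sees only its own chunk, which is why the chunks can be processed on different computers; only the small (key, value) pairs need to travel to the reduce tasks.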
![Page 54: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/54.jpg)
New Example: What does this do?
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, 1);

reduce(String output_key, Iterator intermediate_values):
  // output_key: word
  // output_values: ????
  int result = 0;
  for each v in intermediate_values:
    result += v;
  Emit(result);
slide source: Google, Inc.
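One way to answer the question is to run the pseudocode locally. The sketch below transcribes `map` and `reduce` into Python; `run_mapreduce` is a hypothetical single-process driver standing in for the framework, which in a real deployment would distribute the map calls and perform the grouping across machines.

```python
from collections import defaultdict


def map_fn(input_key, input_value):
    """input_key: document name; input_value: document contents."""
    for word in input_value.split():
        yield word, 1


def reduce_fn(output_key, intermediate_values):
    """output_key: a word; intermediate_values: the 1s emitted for that word."""
    return sum(intermediate_values)


def run_mapreduce(documents):
    """Hypothetical driver: group map output by key, then reduce each group."""
    grouped = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}
```

Running it on a couple of short documents shows that the program counts how many times each word occurs across all documents.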
![Page 55: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/55.jpg)
Relational Database Management Systems (RDBMS)
Before RDBMS: if your data changed, your application broke.
Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.
“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” -- E.F. Codd, 1970
Key Idea: Programs that manipulate tabular data exhibit an algebraic structure that allows reasoning and manipulation independent of the physical data representation.
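The independence Codd describes can be illustrated with Python's built-in `sqlite3` module. This is a small sketch under assumed names: the `reads` table, its columns, and the index name are hypothetical. The query text never changes, yet the physical representation does when an index is added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reads (id INTEGER, sample TEXT, length INTEGER)")
conn.executemany(
    "INSERT INTO reads VALUES (?, ?, ?)",
    [(1, "a", 36), (2, "b", 50), (3, "a", 42)],
)

# The application only knows this declarative query.
QUERY = "SELECT sample, SUM(length) FROM reads GROUP BY sample ORDER BY sample"

before = conn.execute(QUERY).fetchall()

# Change the internal representation: add an index the query never mentions.
conn.execute("CREATE INDEX reads_by_sample ON reads (sample)")

after = conn.execute(QUERY).fetchall()
```

The application code is untouched by the change; the database engine decides on its own whether the new index is worth using.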
![Page 56: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/56.jpg)
MapReduce is a Nascent Database Engine
Access Methods and Scheduling:
Query Language:
Query Optimizer:
Pig Latin
Graphics taken from: hadoop.apache.org and research.yahoo.com/node/90
![Page 57: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/57.jpg)
MapReduce and Hadoop
MR introduced by Google
  Published paper in OSDI 2004
MR: high-level programming model and implementation for large-scale parallel data processing
Hadoop
  Open source MR implementation
  Users include Yahoo!, Facebook, New York Times
![Page 58: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/58.jpg)
A Query Language for MR: Pig Latin
High-level, SQL-like dataflow language for MR
Goal: Sweet spot between SQL and MR
Applies SQL-like, high-level language constructs to accomplish low-level MR programming.
operators: • LOAD • STORE • FILTER • FOREACH … GENERATE • GROUP
binary operators: • JOIN • COGROUP • UNION
other support: • UDFs • COUNT • SUM • AVG • MIN/MAX
Additional operators: http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm
![Page 59: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/59.jpg)
New Task: k-mer Similarity
Given a set of database sequences and a set of query sequences
Return the top N similar pairs, where similarity is defined as the number of common k-mers
![Page 60: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/60.jpg)
Pig Latin program
D = LOAD 'db_sequences.fasta' USING FASTA() AS (did, dsequence);
Q = LOAD 'query_sequences.fasta' USING FASTA() AS (qid, qsequence);
-- kmers(k, seq): UDF assumed to return the bag of k-mers in seq
Kd = FOREACH D GENERATE did, FLATTEN(kmers(7, dsequence)) AS kmer;
Kq = FOREACH Q GENERATE qid, FLATTEN(kmers(7, qsequence)) AS kmer;
R = JOIN Kd BY kmer, Kq BY kmer;
G = GROUP R BY (qid, did);
-- score = number of joined k-mer rows for each (query, database) pair
C = FOREACH G GENERATE FLATTEN(group) AS (qid, did), COUNT(R) AS score;
T = FILTER C BY score > 4;
STORE T INTO 'seqs.txt';
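For readers who want to check the dataflow of the Pig Latin program above on a small input, the same steps can be sketched in plain Python. This is an illustration under assumptions, not a port of the author's UDFs: `kmers` is a hypothetical stand-in for the `kmers` UDF, and sequences are passed as dictionaries rather than loaded from FASTA files.

```python
from collections import Counter, defaultdict


def kmers(k, sequence):
    """All overlapping substrings of length k (stand-in for the kmers UDF)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_similarity(db, queries, k=7, min_score=4):
    """db and queries map a sequence id to a sequence string."""
    # Kd: index database k-mers so the join does not compare every pair.
    index = defaultdict(list)
    for did, dseq in db.items():
        for kmer in kmers(k, dseq):
            index[kmer].append(did)

    # R, G, C: join on k-mer, then count joined rows per (qid, did).
    scores = Counter()
    for qid, qseq in queries.items():
        for kmer in kmers(k, qseq):
            for did in index.get(kmer, ()):
                scores[(qid, did)] += 1

    # T: keep only pairs whose score exceeds the threshold.
    return {pair: s for pair, s in scores.items() if s > min_score}
```

Like the join in the Pig program, this counts every matching pair of k-mer occurrences, so a k-mer repeated within a sequence contributes more than once to the score.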
![Page 61: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/61.jpg)
New Task: Alignment
RMAP alignment implemented in Hadoop
Michael Schatz, CloudBurst: highly sensitive read mapping with MapReduce, Bioinformatics 25(11), April 2009
Goal: Align reads to a reference genome
Overview:
  Map: Split reads and reference into k-mers
  Reduce: for matching k-mers, find end-to-end alignments (seed and extend)
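The seed-and-extend idea in the overview can be sketched as follows. This is a deliberately simplified illustration and not CloudBurst's algorithm: it assumes a single reference string, allows mismatches only (no insertions or deletions), and all function names and parameters are hypothetical.

```python
from collections import defaultdict


def map_seeds(kind, seq_id, sequence, k):
    """Map: emit (k-mer, (kind, id, offset)) for every k-mer in a sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k], (kind, seq_id, i)


def reduce_extend(hits, reads, reference, max_mismatches):
    """Reduce: for one shared k-mer, extend each read/reference seed pair to an
    end-to-end alignment and keep it if it has few enough mismatches."""
    ref_hits = [h for h in hits if h[0] == "ref"]
    read_hits = [h for h in hits if h[0] == "read"]
    for _, _, ref_off in ref_hits:
        for _, read_id, read_off in read_hits:
            read = reads[read_id]
            start = ref_off - read_off  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                yield read_id, start, mismatches


def align(reads, reference, k=4, max_mismatches=1):
    """Single-process driver: group seeds by k-mer, then extend each group."""
    grouped = defaultdict(list)
    for key, value in map_seeds("ref", "ref", reference, k):
        grouped[key].append(value)
    for read_id, read in reads.items():
        for key, value in map_seeds("read", read_id, read, k):
            grouped[key].append(value)
    alignments = set()  # several seeds can report the same alignment
    for hits in grouped.values():
        alignments.update(reduce_extend(hits, reads, reference, max_mismatches))
    return sorted(alignments)
```

Grouping by k-mer is what the shuffle step provides for free: every read and reference position sharing a seed arrives at the same reduce task, which then does the more expensive extension.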
![Page 62: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/62.jpg)
MapReduce Overhead
![Page 63: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/63.jpg)
Elastic MapReduce
Custom Jar: Java
Streaming: Any language that can read/write stdin/stdout
Pig: Simple data flow language
Hive: SQL
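As an illustration of the Streaming option, a word count can be written as two small Python functions that only read and write lines of text. This sketch assumes the usual Hadoop Streaming conventions (tab-separated key and value on each line, reducer input sorted by key); the function names are hypothetical, and the job submission itself is not shown.

```python
from itertools import groupby


def mapper(lines, out):
    """Write one 'word<TAB>1' line per word read from the input lines."""
    for line in lines:
        for word in line.split():
            out.write(f"{word}\t1\n")


def reducer(lines, out):
    """Sum counts per word; assumes the input lines arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        out.write(f"{word}\t{total}\n")
```

In a streaming job each function would live in its own script and be called with `sys.stdin` and `sys.stdout`; locally, the same pipeline can be imitated by sorting the mapper's output before handing it to the reducer.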
![Page 64: A New Partnership for Cross-Scale, Cross-Domain eScience](https://reader036.fdocuments.net/reader036/viewer/2022062405/554e75f8b4c9054a698b4d8f/html5/thumbnails/64.jpg)
Taxonomy of Parallel Architectures
Easiest to program, but $$$$
Scales to 1000s of computers