Exploring Space in Cyberspace:
Cyber-Enabled Research and
Discovery in Astronomy
S. George Djorgovski
(Caltech)
CDI Workshop,
Seattle,
Nov. 2007
Overview
• Astronomy in the era of information abundance
– Exponential growth in data volume and complexity
• The Virtual Observatory concept and status
– A domain-specific, community-wide, distributed, open
framework for science with massive and complex data sets
– Technology-enabled, but science-driven
• Examples of VO science drivers
– Exploration of parameter spaces
– Exploration of the time domain: distributed analysis and
mining of massive data streams
• Some general comments and musings on Cyber-science
– What is really new here? What are the important trends?
– The enhanced science and technology synergies
Galactic Center Region (a tiny portion) 2MASS NIR Image
Digital Sky Surveys
• The dominant source of data in astronomy today,
typically several tens of TB each, ~ 10^8 - 10^9
sources detected, ~ 10^2 - 10^3 attributes per
source; all wavelengths, radio to γ-ray
– Examples: SDSS, Palomar surveys, 2MASS, …
• A single survey feeds a broad range of science,
from statistical studies of major constituents of
the universe, to discovery of rare types of objects
• Federated in the Virtual Observatory framework
• Current surveys are mainly single-snapshot; the
next generation will be synoptic (multi-pass),
opening up time-domain astronomy (cosmic
cinematography); Peta-scale data sets
– Examples: PanSTARRS, LSST; SKA; etc.
Panchromatic Views of the Universe
[Image panels: Crab; Star forming complex; Visible + X-ray; Radio + IR; Radio; Far-Infrared; Visible]
Understanding of complex phenomena requires complex data
Data Fusion → A More Complete, Less Biased Picture
Theoretical Simulations Are Also Becoming More Complex and Generate Many TB’s of Data
[Image panels: Structure formation in the Universe; Supernova explosions]
Numerical simulations are not just a weak substitute for the analytical
theory - they are an inevitable methodology to study theoretically
many complex phenomena, e.g., star or galaxy formation, etc.
An Example of Where We Are Heading
Data → Knowledge ?
The exponential growth of data volume (and also complexity, quality) driven
by the exponential growth in detector and computing technology
[Plot: time axis 1970 to 2000, logarithmic vertical axis 0.1 to 1000; curves labeled CCDs and Glass]
… but our understanding of the universe increases much more slowly!
• Large digital sky surveys are becoming the dominant
source of data in astronomy: ~ 10-100 TB/survey (soon PB),
~ 10^6 - 10^9 sources/survey, many wavelengths…
• Data sets many orders of magnitude larger, more complex, and more homogeneous than in the past
doubling t ≈ 1.5 yrs
(from A. Szalay)
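A minimal sketch of what a fixed doubling time implies, assuming pure exponential growth. The function name `growth_factor` and the sample time spans are illustrative choices, not values from the slide; only the 1.5-year doubling time is taken from it.

```python
def growth_factor(years, doubling_time=1.5):
    """Factor by which a quantity grows over `years`
    if it doubles every `doubling_time` years (assumed pure exponential)."""
    return 2.0 ** (years / doubling_time)

# Illustrative spans only: one doubling time, 15 years, 30 years.
for span in (1.5, 15, 30):
    print(f"{span:>4} yr -> x{growth_factor(span):,.0f}")
```

Under this assumption, ten doubling times (15 years) already give roughly three orders of magnitude, which is why storage and analysis methods that scale linearly with effort fall behind quickly.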
The Virtual Observatory Concept
• Astronomy community’s response to the scientific and
technological challenges posed by the exponential growth of data
sets and data complexity
– Technology-enabled, but science-driven: harness the IT
advances in service of astronomy
• A complete, dynamical, distributed, web-based, open research environment for astronomy with massive and complex data sets
– Provide content (data, metadata)
services, standards, and
analysis/compute services
– Federate the existing + forthcoming
digital sky surveys and archives,
facilitate data inclusion and
distribution
– Develop and provide data
exploration and discovery tools
From Traditional to Survey to VO-Based Science
[Diagram, Traditional: Telescope; Data Analysis; Results]
[Diagram, Survey-Based: Survey Telescope; Archive; Data Mining; Target Selection; Follow-Up Telescope; Results; Another Survey/Archive?]
Highly successful and increasingly prominent, but inherently
limited by the information content of individual surveys …
What comes next, beyond survey science, is the VO science
Nat’l Virtual Observatory: A Systemic View
[Diagram labels: Primary Data Providers; Surveys; Observatories; Missions; Secondary Data Providers; Survey and Mission Archives; Follow-Up Telescopes and Missions; NVO; Networking; Digital libraries; Numerical Sim’s; User Community; International VO’s]
Data Services: Data discovery, Warehousing, Federation, Standards…
Compute Services: Data Mining and Analysis, Statistics, Visualization…
Virtual Observatory Is Real!
http://ivoa.net
http://www.euro-vo.org
http://us-vo.org
VO as a New Research Environment
• The VO is not yet another data center, archive, mission, or a
traditional project. It does not fit into any of the usual
structures today
– It is inherently distributed, and web-centric
– It is based on a rapidly developing technology (IT/CS)
– It transcends the traditional boundaries between different wavelength regimes, agency domains
– It has an unusually broad range of constituents and interfaces
– It is inherently multidisciplinary
– It is inherently trans-national in its reach
• The VO represents a novel type of a scientific organization for
the era of information abundance
– Many other fields are building VOs of their own
– They are always discipline-based, not institution-based
Virtual Observatory Enabled Science
• Statistical astronomy done right
– Precision cosmology, Galactic structure, stellar astrophysics …
– Discovery of significant patterns and multivariate correlations
– Poissonian errors unimportant
• Systematic exploration of the observable parameter spaces
– Searches for rare or unknown types of objects and phenomena
– Low surface brightness universe, the time domain …
• Multi-wavelength data fusion to disentangle complex processes and superpositions
– e.g., interpretation of the precision CMBR measurements
• Confronting massive numerical simulations with massive
data sets
+ things we have not thought of yet …
Understanding the CMBR Foregrounds
[Image panels: Integrated SZ; Grav. Lensing; Integ. Sachs-Wolfe; Galaxies (SF); Radio Sources; Galactic Thermal; Gal. Nonthermal; CMB Signal]
Exploration of Parameter Spaces
• How many different types of objects are there?
– Which ones are identifiable with known, physically distinct types (e.g., stars, galaxies, quasars, etc.)?
• Are there rare and/or previously unknown classes, seen as outliers?
– Are there intermediate or transition types?
– Are there negative clusters?
– Anomalies possibly indicative of problems with the data?
• Are there new multivariate correlations?
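The outlier question above can be illustrated with a small sketch. It assumes a two-dimensional parameter space (for example two colors), synthetic data, and a Mahalanobis-distance cut; the function name, the threshold of 4, and the toy "locus" are all hypothetical choices for illustration, not the method used in any of the surveys mentioned.

```python
import math
import random

def mahalanobis_outliers(points, threshold=4.0):
    """Return indices of 2-D points whose Mahalanobis distance from the
    sample mean exceeds `threshold` (an assumed, tunable cut)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix elements
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    flagged = []
    for i, (x, y) in enumerate(points):
        dx, dy = x - mx, y - my
        # Squared distance using the closed-form inverse of a 2x2 covariance
        d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        if math.sqrt(d2) > threshold:
            flagged.append(i)
    return flagged

rng = random.Random(0)
# Synthetic "normal" population: two correlated parameters with small scatter
normal = []
for _ in range(1000):
    t = rng.gauss(0.8, 0.3)
    normal.append((t + rng.gauss(0, 0.05), 0.5 * t + rng.gauss(0, 0.05)))
# Two synthetic rare objects placed well off the main locus
rare = [(2.5, -0.5), (-0.5, 1.5)]
catalog = normal + rare

outlier_idx = mahalanobis_outliers(catalog)
print(outlier_idx)
```

A single Gaussian is a crude model of a real locus; with hundreds of attributes per source, one would more plausibly cluster first and then look for objects that belong to no cluster. Flagged objects may also be data problems rather than new classes, which is the "anomalies" bullet above.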
An Example: Discoveries of High-Redshift
Quasars and Type-2 Quasars in DPOSS
Known, astrophysically interesting but rare types of objects,
with a known or predictable parameter space signature
[Figure labels: High-z QSOs; Type-2 QSOs; Normal Stars. Panels: Color-color parameter space used for selection; Spectroscopic identification]
Rare Types of Objects Discovered as Outliers in a Color Parameter Space
Peculiar types of quasars:
Peculiar types of stars:
DQ White Dwarf
Highly unusual CV
(Fan et al., SDSS; Djorgovski et al., DPOSS)
Dwarf Planets, Flying Rocks, and Snowballs
Dwarf planets and KBOs: Sedna, Xena, Quaoar, … ? (M. Brown et al.)
Killer Asteroids: Tunguska; NEAT, Catalina, etc.
A Rich Variety of Time-Domain Phenomena
[Image panels: Flaring stars; Novae, Cataclysmic Variables; Supernovae; Gravitational Microlensing; Accretion to SMBHs; Gamma-Ray Bursts]
Blazars: Accelerators in the Sky
TeV γ-ray
Detections
They are quasars where we are looking straight down the relativistic
jet (Γ ~ 10 - 100). Instabilities and shocks produce strong variability
• Known sources of γ-rays (up to a few
TeV), and probable sources of ultra-high
energy cosmic rays (up to ~ 10^21
eV ~ 10^8 × LHC!)
– The future of particle (astro)physics?
• Probes of relativistic physics, AGN,
and cosmic star formation history
Donald Rumsfeld’s Epistemology
There are known knowns,
There are known unknowns, and
There are unknown unknowns
And Some Mysteries…
Megaflares in normal stars
An example from DPOSS: A normal, main-sequence
star which underwent an outburst by
a factor of > 300 (orders of magnitude more
than the Solar flares). The cause, duration, and
frequency of these bursts are currently unknown
Archival optical transients
Seen in many surveys
(DPOSS, DLS, PQ, SN
surveys, …). Their physical
nature is unknown
The Palomar-Quest Event Factory
[Image labels: R, I; tonight, baseline]
Detect ~ 1 - 2 × 10^6 sources per half-night scan
↓ Compare with the baseline sky
Find ~ 10^3 apparent transients (in the data)
↓ Remove instrum. artifacts
Identify ~ 2 - 4 × 10^2 real transients (on the sky)
↓ Remove asteroids
Identify ~ 1 - 10 possible astrophysical transients
↓ Classification and follow-up
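The successive filtering stages of the event factory can be sketched as a simple funnel. This is only an illustration under stated assumptions: the `Detection` record, its boolean flags, and `run_event_factory` are hypothetical names, and each flag stands in for a real matching step (baseline comparison, artifact rejection, asteroid identification) that the slide does not specify.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Detection:
    """Hypothetical per-source record; each flag stands in for a real test."""
    source_id: str
    in_baseline: bool   # matched to a source in the baseline sky
    is_artifact: bool   # flagged as an instrumental artifact
    is_asteroid: bool   # matched to a known moving object

def run_event_factory(detections):
    """Apply the three cuts in order and return the survivors of each stage."""
    apparent = [d for d in detections if not d.in_baseline]   # in the data
    real = [d for d in apparent if not d.is_artifact]         # on the sky
    candidates = [d for d in real if not d.is_asteroid]       # astrophysical
    return apparent, real, candidates

# Toy scan: two baseline sources, one artifact, one asteroid, one candidate
scan = [
    Detection("s1", True, False, False),
    Detection("s2", True, False, False),
    Detection("t1", False, True, False),
    Detection("t2", False, False, True),
    Detection("t3", False, False, False),
]
apparent, real, candidates = run_event_factory(scan)
print(len(scan), len(apparent), len(real), len(candidates))
```

Ordering the cuts from cheapest and most selective to most expensive keeps the later, costlier tests running on only a small fraction of the original detections.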
The VOEventNet Project
• A telescope sensor network with feedback
• Scientific measurements spawning other measurements and data
analysis in real time (time scales ~ minutes/hours/days)
• Immediate web-based dissemination and publishing
• Please see http://voeventnet.org
[Diagram labels: P48; PQ Event Factory; VOEN Engine; P60; Raptor; Paritel; Web Event Archive; External archives; Compute resources; Robotic telescope network; Follow-up obs.]
PI: R. Williams
Spons. NSF/DDDAS
Broader and Societal Benefits of a VO
• Professional Empowerment: Scientists and students anywhere with an internet connection would be able to do first-rate science → A broadening of the talent pool in astronomy, democratization of the field
• Interdisciplinary Exchanges:
– The challenges facing the VO are common to most sciences and other fields of the modern human endeavor
– Intellectual cross-fertilization, feedback to IT/CS
• Education and Public Outreach:
– Unprecedented opportunities in terms of the content, broad geographical and societal range, at all levels
– Astronomy as a magnet for the CS/IT education
“Weapons of Mass Instruction”
VO Education and Public Outreach
Google Sky: uses DSS,
SDSS, HST data, etc.,
for easy sky browsing
Soon also: Microsoft’s
World Wide Telescope
Transformation and Synergy
• We are entering the second phase of
the IT revolution: the rise of the
information/data driven computing
– The impact is like that of the industrial
revolution and the invention of the
printing press, combined
• All science in the 21st century is becoming cyber-science (aka
e-science) - and with this change comes the need for a new
scientific methodology
• The challenges we are tackling:
– Management of large, complex, distributed data sets
– Effective exploration of such data → new knowledge
– These challenges are universal
• There is a great emerging synergy of the computationally
enabled science, and the science-driven IT
Information Technology → New Science
• The information volume grows exponentially
– Most data will never be seen by humans!
– The need for data storage, network, database-related technologies, standards, etc.
• Information complexity is also increasing greatly
– Most data (and data constructs) cannot be comprehended by humans directly!
– The need for data mining, KDD, data understanding technologies, hyperdimensional visualization, AI/Machine-assisted discovery …
• We need to create a new scientific methodology on the basis of applied CS and IT
• VO is the framework to effect this for astronomy
A Modern Scientific Discovery Process
Data Gathering (e.g., from sensor networks, telescopes…)
↓
Data Farming:
– Storage/Archiving
– Indexing, Searchability
– Data Fusion, Interoperability
↓
Data Mining (or Knowledge Discovery in Databases):
– Pattern or correlation search
– Clustering analysis, automated classification
– Outlier / anomaly searches
– Hyperdimensional visualization
↓
Data Understanding
↓
New Knowledge
[Side labels: Database Technologies; Key Technical Challenges; Key Methodological Challenges; + feedback]
The key role of data analysis is to replace the
raw complexity seen in the data with a reduced
set of patterns, regularities, and correlations,
leading to their theoretical understanding
However, the complexity of data sets and
interesting, meaningful constructs in them is
starting to exceed the cognitive capacity of the
human brain
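The "clustering analysis" step named in the discovery process can be sketched with a bare-bones k-means. This is an illustrative stand-in, assuming low-dimensional numeric attributes and a known number of clusters; the function name, seed, and synthetic blobs are hypothetical, and real survey data would need scalable algorithms and far more dimensions.

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm on tuples of floats.
    Returns (centers, labels); initial centers are sampled from the data."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append(tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(dim)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        centers = new_centers
    labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
              for p in points]
    return centers, labels

rng = random.Random(1)
# Two synthetic, well-separated populations in a 2-D attribute space
blob_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
blob_b = [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(50)]
centers, labels = kmeans(blob_a + blob_b, k=2)
print(centers)
```

The reduced description (two centers instead of a hundred points) is a small instance of replacing raw complexity with a reduced set of patterns, as described above.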
Universal Challenges: Towards The New Scientific Methodology
• Data farming and harvesting
– Semantic webs, computational and data grids, universal or trans-
disciplinary standards and ontologies …
– Digital scholarly publishing and curation (libraries)
… data, metadata, virtual data, hierarchical data products; legacy vs. dynamical; open
vs. proprietary; data, knowledge, and codes; persistency; peer review; web samizdat
vs. officially blessed and supported; mandates; etc., etc.
• Data mining and understanding, knowledge extraction
– Scalable DM algorithms
– Hyperdimensional visualization
– Empirical validation of numerical models
– Computer science as the “new mathematics”
• The art and science of scientific software systems
– Architecture, design, implementation, validation …
Some Distinguishing Characteristics of
Data/Comp. Enabled (e-)Science
• Data-intensive: massive (TB-scale and beyond) data sets
– Poissonian errors not important, systematics dominate
• Data complexity: multi-wavelength and/or multi-scale and/or
multi-epoch data sets, 100’s or 1000’s parameters per source,
combining imaging, spectroscopy, etc.
– Heterogeneity and visualization are key issues
• Computationally intensive
• Traditional solutions do not scale to the scope of new problems
– Need new tools and scalable algorithms
• Data and computing resources (and expertise) are generally
geographically distributed
• Inherently cross-cutting in many ways (CS/Astro, multi-λ …)
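A short numerical sketch of why Poissonian errors stop mattering for massive data sets. It assumes the standard 1/sqrt(N) scaling of the relative statistical error and an arbitrary, hypothetical 1% systematic floor added in quadrature; neither the floor nor the function names come from the slides.

```python
import math

def relative_poisson_error(n_sources):
    """Relative statistical (Poisson) error for a sample of n_sources."""
    return 1.0 / math.sqrt(n_sources)

def total_relative_error(n_sources, systematic):
    """Statistical and systematic errors combined in quadrature."""
    return math.hypot(relative_poisson_error(n_sources), systematic)

ASSUMED_SYSTEMATIC = 0.01  # hypothetical 1% systematic floor, for illustration

for n in (10**2, 10**4, 10**8):
    stat = relative_poisson_error(n)
    total = total_relative_error(n, ASSUMED_SYSTEMATIC)
    print(f"N = {n:>9}: statistical {stat:.4f}, total {total:.4f}")
```

With survey-sized samples the statistical term falls far below any realistic systematic term, so the total error is set almost entirely by the systematics.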
Some Thoughts on CyberScience
• Enables a broad spectrum of users and contributors
– From large teams, to small teams, to individuals
– Data volume ~ team size, but scientific returns ≠ f (team size)
– Human talent is distributed very broadly geographically
Open, distributed, web-based nature of new science is a key feature
• Transition from data-poor to data-rich science
– Chaotic → Organized … regulation vs. creative freedom
– Can we learn to ask new kinds of questions?
• Information is cheap, but expertise is expensive
– Just like the hardware/software situation
• Computer science as the “new mathematics”
– It plays the role, in relation to other sciences, that mathematics
played in the ~ 17th - 20th centuries
Summary Comments
• Cyber-enabled (computationally and data-enabled) science is a
practical necessity
– Complex problems → simulations, complex and massive data sets
– Distributed resources (data, facilities, people…) → virtual scientific organizations (VO is an example)
• It is IT-enabled, and has a potential to drive transformative
scientific and practical advances
– The key challenge is to think differently (computationally)
– Remember the origins of WWW; now grid, semantic web, knowledge extraction tools, MP/HPC design and apps…
• There is a great deal of methodological commonality between
different fields
– … and this commonality can lubricate some genuine multi- or
interdisciplinary research, with a great discovery potential
– Let’s avoid wasteful replication of efforts, share the tools, methods