Addressing the Challenges of the Scientific Data Deluge
Transcript of Addressing the Challenges of the Scientific Data Deluge
1
Addressing the Challenges of the Scientific Data Deluge
Kenneth Chiu, SUNY Binghamton
2
Outline
• Overview of collaborative projects that I’m working on.
• Discussion of challenges and approaches.
• Technical overview of specific projects
3
Autoscaling Project
• Traditional research focus in sensor networks on energy, routing, etc.
• In “environmental observatories”, management is the problem.
• Adding a sensor takes a lot of manual reconfiguration.
  – Calibration, recalibration.
  – QA/QC is also a major issue.
• What corrections have been applied to the data, and what calibrations/maintenance have been applied to the sensor?
• With U. Wisconsin, SDSC, and Indiana University.
4
Motivation
• Adding a sensor requires a great deal of manual effort.
  – Reconfiguring datalogger
  – Reconfiguring data acquisition software
  – Reconfiguring QA/QC triggers
  – Reconfiguring database tables
• QA/QC is not very automated.
• Result: Sensor networks are not very scalable.
• Goal: Automate.
5
Metadata for each final table
Metadata:
• describes each final table
• is used to generate forms dynamically for data retrieval from the website
• entered manually
6
Approach
• Use an agent-based, bottom-up approach.
• Agents coordinate among themselves, as much as possible.
• Unify communications. All communications done via data streams.
• Data streams represented as content-based, publish-subscribe systems.
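The content-based publish-subscribe style above can be sketched as follows. This is a minimal illustration only; the `Broker` class, its method names, and the message fields are invented here, not taken from the project.

```python
# Minimal sketch of content-based publish-subscribe (illustrative only;
# Broker, subscribe, publish, and the message fields are hypothetical).
class Broker:
    def __init__(self):
        self.subscriptions = []  # list of (predicate, callback) pairs

    def subscribe(self, predicate, callback):
        """Register interest in messages whose *content* satisfies predicate."""
        self.subscriptions.append((predicate, callback))

    def publish(self, message):
        """Deliver the message to every subscriber whose predicate matches."""
        for predicate, callback in self.subscriptions:
            if predicate(message):
                callback(message)

broker = Broker()
received = []
# A QA agent subscribes to temperature readings above a threshold.
broker.subscribe(lambda m: m.get("type") == "temperature" and m["value"] > 30,
                 received.append)
broker.publish({"type": "temperature", "value": 35.0, "sensor": "buoy-1"})
broker.publish({"type": "temperature", "value": 12.0, "sensor": "buoy-1"})
# Only the first reading matched the subscription.
```

The point of the content-based style is that subscribers match on message content rather than on a fixed topic name, which is what lets agents coordinate without central configuration.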
7
Long-Term Ecological Research (LTER)
[Architecture diagram: sensors and a buoy at Trout Lake Station feed a datalogger and an ORB over ARTS connections; QA agents, an environment agent, and configuration agents exchange CIMA configuration and environment events; data reaches Oracle and a web server on the University of Wisconsin campus over JDBC/ODBC, and web browsers at other locations access the web server.]
8
Agents
• Characteristics
  – Autonomous
  – Bottom up
  – Distributed coordination
  – Independence/loosely-coupled
• Can be thought of as a “style” for implementing distributed systems.
9
Sensor Metadata
• Each sensor has intrinsic and extrinsic properties.
  – Intrinsic properties are type, model number, etc.
    • Static: cannot be changed.
    • Dynamic: e.g., SDI-12 address.
  – Extrinsic properties are location, sampling rate, etc.
• Use code generation techniques to generate the proper code based on the sensor data.
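The code-generation step might look roughly like the sketch below. The metadata fields (`model`, `address`, `rate_s`) and the one-line-per-sensor output format are invented for illustration; the real system would emit an actual datalogger program.

```python
# Hedged sketch: generating datalogger configuration from sensor metadata.
# The metadata schema and output format here are hypothetical.
def generate_config(sensors):
    lines = []
    for s in sensors:
        # Intrinsic properties (model) would select a driver; extrinsic
        # properties (address, sampling rate) parameterize it.
        lines.append(f"sensor {s['model']} addr={s['address']} rate={s['rate_s']}s")
    return "\n".join(lines)

sensors = [
    {"model": "CS547A", "address": 0, "rate_s": 60},   # example sensor entry
    {"model": "107",    "address": 1, "rate_s": 300},  # example sensor entry
]
print(generate_config(sensors))
```

The design point is that a new sensor only requires a new metadata entry; the per-sensor code is regenerated rather than hand-edited.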
10
Automatic Sensor Detection and Inventory
[Diagram: (1) a detection event is raised when a sensor is attached to the datalogger; (2-3) an instrument agent on the field station computer consults a sensor metadata repository via a web service; (4) a datalogger program is generated; (5) it is uploaded to the datalogger on the acquisition computer; (6-7) data then flows to the database at the data center.]
11
QA/QC
• Malfunctioning anemometer detected as an abnormal occurrence of zero wind speed values.
[Plot: frequency of zero hourly average wind speed values per month, Jan-95 through Jan-03, on a 0-250 scale.]
12
Another Example
• Buoy was pulled down in the water by the ice.
[Plots: water temperature (deg C) from mid-November through February, comparing the displaced sensors to a normal winter. Source: Hu and Benson.]
13
Crystal Grid Framework
• Seeks to develop standards and middleware for integrating instrument and sensor data into wide-area infrastructures, such as grid computing.
• With Indiana University.
14
Motivation
• Process of collecting and generating data is often critical.
  – Current mechanisms for monitoring and control either require physical presence, or use ad hoc protocols and formats.
• Instruments and sensors are already “wired”.
  – Usually via obscure, or perhaps proprietary, protocols.
• Using standard mechanisms and protocols can give these devices a grid presence.
  – Benefit from a single, unified paradigm and terminology.
  – Single set of standards; exploit existing grid standards.
  – Simplifies end-to-end provenance tracking.
  – Faster, seamless interactions between data acquisition and data processing.
  – Greater interoperability and compatibility.
Philosophy: Push grid standards as close to the instrument or sensor as possible. (But no further!) Deal with “impedance mismatches” close to the instrument, so as to localize complexity.
15
Goals
• Develop a set of standard grid services for accessing and controlling instruments.
  – Based on Web standards such as WSDL, SOAP, XML, etc.
• Develop an instrument ontology for describing instruments.
  – Applications use the description to interact.
• The goal is to develop middleware that abstracts and layers functionality.
  – Minor differences in instruments should only result in minor loss of functionality to the application.
• Move metadata and provenance as close to the instrument as possible.
16
Overview
[Diagram: a data pipeline of acquisition, analysis, and curation components (each with its own code plus instrument access) over a physical network transport; the instrument itself comprises sensors and a controller, and is presented to the scientist through instrument access, remote access, and a GUI; components are layered into a device-independent application module, a device-dependent virtualization module, and a shared implementation.]
17
Distributed X-Ray Crystallography
• Crystallographer, chemist, and technician may be separated.
  – Large resources such as synchrotrons
  – Convenience and productivity
  – Expanding usage to smaller institutions
• Data collection, analysis, and curation may be separated.
• Approximate data requirements: 1-10 TB/year.
  – Currently stored at IU.
• Real-time data collection and control.
• Collaboration with IU, Sydney, JCU, Southampton.
18
X-Ray Crystallography
• Scientists are very reluctant (understandably) to install your software on the acquisition machine.
  – Use a proxy box by which to access files via CIFS or NFS.
  – Scan for files which indicate activity.
• Unfortunately, scientists can manually create files, which can confuse the scanner. No ideal solution.
• For sensor data, request-response is not ideal.
  – Push data using one-way messages.
• In WSDL 2.0, consider “connecting” out-only services to in-only services.
19
X-Ray Crystallography
[Deployment diagram: portals, instrument managers, and data archives (persistent and non-persistent grid services) at Indiana University, the University of Sydney, Argonne National Labs, and the University of Southampton; at each acquisition site, non-grid instrument services on a proxy box reach the acquisition machine via CIFS, receiving data from the diffractometer.]
20
TASCS: Center for Technology for Advanced Scientific Component Software
• Multi-institution DOE project.
• Seeks to develop a common component architecture for scientific components.
• My focus within it is to develop a BabelRMI/Proteus implementation.
  – And develop C++ reflection techniques to improve dynamic connection abilities.
• With LLNL and many other institutions.
21
Babel
• Language interoperability toolkit developed at LLNL.
• Allows writing objects in a number of languages, including non-OOP ones such as Fortran.
• Began as a purely in-process tool, now includes an RMI interface.
22
Proteus
• Started off as a unification API for messaging over multiple standards and implementations, such as CORBA, JMS, SOAP.
• Moving towards focusing on multiprotocol web services.
• Though almost always bound to SOAP, WSDL actually fully supports almost any protocol.
23
Runtime
[Diagram: the Babel-Proteus runtime. A user stub and IOR call through a generated C++ skeleton and RMI stub into Proteus; on the remote side, Proteus calls through a C++ stub and IOR into the skeleton of the implementation. Serializable objects cross between the two sides through Babel-Proteus (B-P) adapters over WSIT; the pieces are divided into generated, library, and user code.]
24
Multiprotocol
[Diagram: two processes, each containing a Proteus client and providers A and B; the processes communicate over the network using protocol A or protocol B depending on which provider is selected.]
28
Lake Sunapee
• Most e-Science/cyberinfrastructure R&D is for institutional science.
  – Assume significant resources and expertise.
• Much less work on CI for citizen science, non-profit organizations, etc.
• This project explores how to engage them in the development of cyberinfrastructure and e-Science.
  – Also with a focus on how to use e-Science to engage and educate K-12.
  – Also with a focus on how to train CS students to better engage scientists.
• With U. Wisconsin, U. Michigan, LSPA, and IES.
29
• Hold a series of workshops to understand needs.
• Research and develop systems that give them accessible means to interpret the sensor data.
• Course component: seminar/project course where students will work with citizen scientists in small groups to define and implement e-Science projects with the lake association.
30
• Semantic publish-subscribe.
  – Content-based publish-subscribe needs a content model.
  – Semantic web/description logics provide an ideal content model.
31
Many Small Datasets
• Much ecological data is characterized not by a few large datasets, but by many small datasets.
  – e-Science has up to now chosen to focus on a few large datasets, mostly.
32
Flexible Electronics and Nanotechnology
• Work with Howard Wang in BU ME.
• “Ontologies” for materials science processes (internal).
• Undergraduate education project (NSF).
33
Material Processes
• Materials science research product is the characterization of a process (vibration, heating, chemical, electrical, etc.).
• Applying such research is finding a sequence of processes that will transform a material A (with certain properties such as particle size) to a material B (with certain other properties).
• Very difficult to search the research literature.
• Also, this is a type of path-finding problem.
34
[Diagram: two process nodes, each with a hasName edge to “annealing” and a tempSchedule edge to “a schedule” / “a different schedule”; each process is an anonymous node that only serves to “bind” the other nodes together, and can be thought of as representing the process as a whole.]
Conceptually, the schedule is just a function that gives the temperature as output given the time as input. One question is whether or not to attempt to represent it partially in the graph model, or to treat its representation as completely outside the model. For example, a function can be represented as a table, a Fourier series, wavelets, etc.
Information is sparse.
35
Undergraduate Education
• Groups of nanotechnology students develop senior design projects with CS students.
36
Programs: Australia, Canada, China, Finland, Florida, New Zealand, Israel, South Korea, Taiwan, United Kingdom, Wisconsin
First meeting: San Diego, March 7-9, 2005
Source: T. Kratz
37
Vision and Driving Rationale for GLEON
• A global network of hundreds of instrumented lakes, data, researchers, and students.
• Predict lake ecosystem response to natural and anthropogenic mediated events.
  – Through improved data inputs to simulation models.
  – To better plan and preserve freshwater resources on the planet.
• More or less a grass-roots organization.
• Led by Peter Arzberger at SDSC, and with U. Wisconsin.
38
Why develop such a network?
• Global e-science becoming increasingly possible
• Developments in sensors and sensor networks allow some key measurements to be automated
Porter, Arzberger, C. Lin, F. P. Lin, Kratz, et al. (2005)
July 2005 Issue
Source: T. Kratz
39
40
Outline
• Overview of collaborative projects that I’m working on.
• Discussion of challenges and approaches.
• Technical overview of specific projects
41
Research Challenges
• Biggest challenge is data.
• Much time and effort is spent managing data in time-consuming and human-intensive ways.
  – Often stored in Excel, text files, SAS.
  – Metadata in notebooks, gray matter.
• No incentives to make data reusable.
  – Providing data is not valued academically.
• Too much manual work involved in acquisition.
  – Means much is not captured automatically and semantically.
• Standardization of things such as ontologies is very slow, and tends to be top-down.
  – Can we first build a system that provides some benefit without forcing scientists to go through a painful standardization process?
42
Cyberinfrastructure and e-Science
• There have been huge improvements in hardware.
• There have been huge local improvements in software.
• Not so many improvements in large-scale integration and interoperability.
43
Data, Data, and More Data!
• Data is the driver of science.
• Recent advances in technology have given us the ability to acquire and generate prodigious amounts of data.
• Processing power, disk, memory have increased at exponential rates.
44
It’s Not a Few Huge Datasets
• Huge datasets get more attention.
  – More glamorous.
  – Traditional type of CS problem.
  – Easier to think about.
• But it’s the number of different datasets that is the real problem.
  – If you have one big one, you can concentrate efforts on the problem.
  – Not very amenable to traditional CS “thinking”, since there is a very significant “human-in-the-loop” component.
  – The best CS research is useless if the human ignores it.
45
We Are The Same! (More or Less)
Technology advances fast.
People advance slowly!
People compose our institutions, our organizations, our modes of practice.
Result: The old ways of doing things don’t cut it. But we haven’t yet figured out the new ways.
46
Technology Impacts Slowly
• Technologies often require many systemic changes to bring benefits.
  – Sometimes require other complementary technologies to be invented.
• Steam engine invented in 1712, did not become a huge economic success till the 1800s.
• Motor and generator invented in the early 1800s.
  – Real benefits did not occur till the 1900s.
47
Steam To Electric
• Steam-powered factories were built around a single large engine.
• Belts and other mechanical drives distributed power.
• If you brought a motor to a factory foreman:
  – His factory wasn’t built for it.
  – He might not be able to power it.
  – He doesn’t even know how to use it.
• Chicken-and-egg problem.
• It took decades.
• Similarly, I believe we are in the early stages when it comes to computer technology.
48
Socio-Technical Problem
• What will it take to figure out how to use all this data?
• Not a pure CS problem; people’s actions affect how easy it is to use all the data.
• Many problems these days are sociotechnical in nature.
  – Password security is a solved problem.
  – Interoperability is a solved problem.
• Figuring out how to use data is even harder than power, since power distribution is physical, easy to see.
  – Data/info flow is hard to see.
49
A Vision
• A scientist sits in his office.
• He wonders: “I wonder if children who live closer to cell towers have higher rates of autism?”
• How much time would it take a scientist to test this hypothesis?
  – Find the data.
  – Reformat the data, convert it, etc.
  – Run some analysis tools. Maybe find time on a large resource.
• But the data is out there!
  – There are many hypotheses that are never tested because it would take too much work.
50
• This vision also applies to business, military, medicine, industry, management, etc.
• There are a million sources of data out there.
  – Real-time data streams, archived data, scientific publications, etc.
• How can we build a flexible infrastructure that will allow analyses to be composed and answered on the fly?
• How do we go from data+computation to knowledge?
51
RDF-like Data Model
• We hypothesize that part of the problem is that RDBMSs are based on data models that do not fit scientific data well.
  – This “impedance mismatch” is a barrier.
• Thus, develop models that more closely resemble the mental model that scientists use when thinking about data.
  – The less a priori structure imposed on the data, the better.
52
Goals
• Allow some common subset of code and design to be used for many scientific data and applications.
• Suggest a data and information architecture for querying and storage.
• Provide some fundamental semantics. Each discipline would then refine these semantics.
• Don’t get bogged down in trying to figure out everything. Just try to find some LCD.
• This is a logical model of data. Also need a “physical” model to handle transport, archiving, etc. Then need to map from the physical model to the logical model. For example, an image file has more than just the raw intensities. But some metadata may not be in the file. We don’t want the logical model to be concerned about how the data is actually arranged.
• Promote bottom-up, grass-roots approaches to building standards.
53
One Person’s Metadata Is Another Person’s Data
• Distinction between data and metadata is artificial and problematic.
  – What is metadata in one context becomes data in another. For example, suppose you are taking the temperature at a set of locations (determined via GPS). So for each reading, the temperature is the data, and the metadata is the location. But now suppose that you need the error of the location. So now the error becomes the metametadata of the location metadata?
  – A made-up example based loosely on crystallography: The spatial correction is based on a calibration image obtained from a brass plate. So the calibration image is metadata for the set of frames. Now suppose that they need the temperature of the brass plate when the image was made. So now the temperature is metametadata.
54
• Use a graph-based model.
  – Based on RDF.
  – Actual data is stored as a graph.
• Contrast with models like E-R, where the graph “models” the data, rather than actually being the data.
  – A node in E-R might be “customer”, and represent the class of entities that are customers, rather than any specific customer.
• The model:
  – Each node is a datum.
  – Each edge denotes an association/attribute/property.
  – Nodes can be grouped into nodesets, which are also nodes.
    • A node may be in more than one nodeset.
  – A node-edge-node triple can also be a node.
  – Main difference from RDF is an attempt to build reification into the model.
• Somewhat similar to a hypergraph.
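The model described above can be sketched in a few lines. This is an illustration only; the `Graph` class and attribute names (`temperature`, `set_attr_1`, `triple_prop`) are invented to mirror the slide's example, not taken from an implementation.

```python
# Illustrative sketch of the graph model: every datum is a node, a
# node-edge-node triple is itself a node (built-in reification), and
# nodes group into nodesets, which are also nodes. Names are hypothetical.
class Graph:
    def __init__(self):
        self.triples = set()

    def add(self, subj, pred, obj):
        triple = (subj, pred, obj)
        self.triples.add(triple)
        return triple  # the triple is itself a node, so it can be described

g = Graph()
t = g.add(13, "temperature", 20)        # datum nodes 13 and 20
g.add(t, "triple_prop", "calibrated")   # reification: an edge on a triple
nodeset = frozenset({13, 20})           # a nodeset is also a node
g.add(nodeset, "set_attr_1", "run-7")   # an edge on a nodeset
```

Because `add` returns the triple as a first-class value, attributes of attributes need no special machinery, which is the reification-by-construction idea the slide describes.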
55
• The edge with the attribute name set_attr_1 is an attribute of a nodeset.
• The edge with the attribute name triple_prop is an attribute of the above edge.
[Diagram: nodes 13 and 20 connected by temperature and angle edges and grouped into nodesets; a set_attr_1 edge attached to a nodeset, and a triple_prop edge attached to the set_attr_1 edge.]
57
Complete Capture of Raw Data
• Complete digital capture of data and metadata.
  – Already digital.
• Must have full provenance and other metadata.
58
Put Everything In the Triplestore
• Unify semantic networks and data graphs.
• Metadata relationships can use reified triples.
• Don’t wait for standards; people take too long to decide.
  – Bottom-up standards tend to work better.
  – First must have the demand for the standard.
• All data is read-only.
59
But We Can Never Store That Much
• Maybe we can.
• But to drive a technology, first need to show a need.
• RDBMS have had several decades of research to improve performance.
60
Publications Are Data
• In some fields, such as materials science, papers are 80% boilerplate text.
• It’s better to directly publish this as structured, semantic data.
  – No NL.
• Use NL annotations where needed.
61
• A scientist runs experiments.
  – All data is captured.
• She reaches a point where she wishes to publish.
• She reviews her experimental data (all captured with provenance, and full metadata, sensor calibration, etc.), and drags and drops what is most relevant.
• She creates a narrative by creating some annotated links between experiments to explain the insights.
  – Typically probably at most one page of text, maybe less.
• She clicks a button to submit for publication.
62
Closer Ties Between Theoreticians and Practitioners
• In the real world, it is likely that semantic data treatments will need to deal with uncertainty, quantitativeness, ambiguity, fuzziness.
  – There is research in these areas, but not a lot of penetration into practice, which prevents good feedback to the theoreticians.
  – For example, many practitioners don’t even know about polyhierarchies. (Clay Shirky)
• Often attempts to create ontologies result in trying to figure out which class is the parent.
63
Outline
• Overview of collaborative projects that I’m working on.
• Discussion of challenges and approaches.
• Technical overview of specific projects
64
Distributed Triplestores
• Published in e-Science 2007.
• With IU student Tharaka Devadithya.
65
Motivation
• Data in some domains is dynamically structured.
• Predefining structures (e.g., schemas in RDBMS) creates a barrier for storing such data.
  – Certain minute details may get discarded.
• Scientists generally store experiment details in text or binary files (e.g., spreadsheets, word processing documents).
  – These files can be stored in databases as BLOBs.
  – However, it is not possible to efficiently query these data.
  – Sharing data among collaborators requires that everyone can read the format used by the author.
66
Storing Dynamically Structured Data
• An RDBMS can be used by modifying its schema each time the structure of data changes.
  – Not a feasible option if the schemas need to be modified very frequently.
• Data can be stored in a file system with a hierarchical directory structure to organize the data.
  – The author needs to remember the organization of data.
  – Difficult to share data among collaborators.
• There is a strong requirement for a store of dynamically structured data that does not hinder efficient querying.
67
Dynamic Structures with Databases
Timestamp            Value  Units
2006-10-12 14:23:33  25.2   Celsius
2006-10-12 16:44:25  25.5   Celsius

With a new Timezone column (what value for existing rows?):

Timestamp            Timezone        Value  Units
2006-10-12 14:23:33  EST (or NULL?)  25.2   Celsius
2006-10-12 16:44:25  EST (or NULL?)  25.5   Celsius

With Timestamp further split into Date and Time:

Date        Time      Timezone        Value  Units
2006-10-12  14:23:33  EST (or NULL?)  25.2   Celsius
2006-10-12  16:44:25  EST (or NULL?)  25.5   Celsius
68
Dynamic Structures with Databases…more issues
• Suppose the following information is stored about a sensor.
  – Manufacturer
  – Measurement type (e.g., temperature, humidity)
  – Measurement units
• What if there is one sensor whose manufacturer is not known?
  – Insert NULL into the Manufacturer field?
• Now, what if it is required to store the purchase date for only one sensor?
  – Add a new column? What value to store in this column for other sensors?
  – Add another table and join with the original table?
69
Semantic Web Solution
• Semantic web solutions have been successfully used in both scientific and commercial environments.
  – Do not impose any structure on the data.
  – Data modeled as a directed graph.
• Resource Description Framework (RDF) is the most commonly used standard for representing such graphs.
  – Can be used to describe any property of any resource.
70
RDF and Triplestores
• Triple
  – Subject: the resource being described
  – Predicate: the property being described
  – Object: the value of the property
• E.g., methyl-cyanide crystallographer John
  – The crystallographer for methyl-cyanide is John.
• A graph in RDF is represented as a set of triples.
  – Each triple connects a subject node to an object node in the graph.
• A persistent set of such triples is known as a triplestore.
[Diagram: a Subject node connected by a Predicate edge to an Object node.]
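The example triple above can be written down as data directly. This is a toy set-of-triples store for illustration; the `ns:` prefix and the `objects` query helper are invented here, not part of the RDF standard.

```python
# A tiny set-of-triples "store" (illustrative; names are hypothetical).
triples = {
    ("ns:methyl-cyanide", "ns:crystallographer", "John"),
    ("ns:methyl-cyanide", "ns:startTime", "2006-10-12T14:23:33"),
}

def objects(subject, predicate):
    """Return all objects of triples matching (subject, predicate, ?)."""
    return {o for s, p, o in triples if s == subject and p == predicate}

# "The crystallographer for methyl-cyanide is John."
print(objects("ns:methyl-cyanide", "ns:crystallographer"))
```

Because the store is just a set of triples, new properties can be added for any resource at any time, which is exactly the schema freedom the slides contrast with RDBMS tables.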
71
Example of RDF Graph
72
XML Databases
• Proposed as suitable for such dynamically structured data.
• Commercial databases are starting to provide native support for XML.
• XML is extensible and does not impose any structure on the data.
  – Therefore, it allows structures to be built dynamically.
• Suffers from update anomalies.
73
Update Anomalies with XML
• Assume an XML database is used for storing information about crystallography experiments, as follows.

<experiment>
  <crystallographer>
    <name>John Smith</name>
    <designation>Scientist</designation>
    <address>...</address>
  </crystallographer>
  <startTime>...</startTime>
  <location>IUMSC</location>
  ...
</experiment>

• Results in storing redundant information.
  – Address of John Smith will be the same for all experiments.
  – What happens if he changes his address? Update all previous XML fragments?
• Solution: Normalize certain details as in relational DBMS.
  – E.g., separate address information from the experiment details and provide a link (reference) to an address document.
74
• However, in order to normalize, the schema should be known in advance.
• This is not possible when data gets added arbitrarily without being compliant with any predefined schema.
• The user has to determine how to normalize the data.
• Solution: normalize everything.
  – Resulting only in attribute-value pairs. E.g.,

<experiment>
  <crystallographer ref="JohnSmith"/>
</experiment>

<JohnSmith>
  <name>John Smith</name>
</JohnSmith>

  – Very similar to the RDF model.
75
Need for a Distributed Triplestore
• Origination points
• Ownership
• Scalability
  – Large number of triples.
    • E.g., consider a table in an RDBMS having 15 columns. Migrating its data to a triplestore would result in 15 triples for each row in the table.
    • There will also be data from more than one table, and data that normally does not get stored in a database.
  – This leads to scalability issues.
    • E.g., querying would be slow; indices might often need to be fetched/stored from disk.
  – In order to go beyond the scalability limits of a single triplestore, triples need to be distributed across multiple triplestores.
77
Our Approach
• Clients access the triplestores via a mediator.
• The mediator maintains several indexes to facilitate efficient querying.
• When the mediator receives a query, it
  – breaks down the query into several sub-queries;
  – finds out which triplestores are capable of responding to each sub-query.
• Indexes are mainly used to
  – build a cost model for the querying;
  – eliminate the triplestores that are unable to give results for a given sub-query.
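The elimination step can be sketched as below. This is a simplification under stated assumptions: the predicate index is reduced to a plain dictionary, a "sub-query" is reduced to a single predicate, and the store names are invented.

```python
# Sketch of mediator-side routing: a predicate index maps each predicate
# to the triplestores that contain it, so sub-queries are only sent where
# they might match. Index layout and store names are hypothetical.
predicate_index = {
    "crystallographer": {"store-IU"},
    "temperature": {"store-IU", "store-Sydney"},
}

def route(query_predicates):
    """For each sub-query (one predicate here), pick candidate stores."""
    return {p: predicate_index.get(p, set()) for p in query_predicates}

plan = route(["temperature", "crystallographer", "humidity"])
# "humidity" appears in no store, so that sub-query is eliminated entirely.
```

The real mediator additionally uses the index entries for cost estimation; here only the elimination aspect is shown.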
78
Types of Indexes at the Mediator
• Predicate Index
  – Contains details about the predicates in each triplestore.
  – Certain fields are used for cost estimation for sub-queries.
• Node Index
  – Maintains a list of nodes in the triple graph along with the triplestores in which these nodes exist.
  – Contains only resources (e.g., ns:crystallographer); literals (e.g., “John Smith”) are not stored.
  – Used to eliminate certain triplestores when sub-querying.
• Edge Index
  – Two edge indexes are used, for outgoing and incoming edges, respectively.
  – Used to avoid querying triplestores that do not have the corresponding edges from or to them.
79
Future Work
• Minimize joins between triplestores.
  – Identify frequent joins.
  – Instruct the triplestores to re-distribute their triples such that most future joins will be performed locally.
• Avoid the extra network hop due to the mediator by using a mediator cache.
• Consider network communication when estimating costs for the query plan.
80
Parallel XML Parsing
• Published in Grid 2006, CCGrid 2007, e-Science 2007, IPDPS 2008, ICWS 2008 (streaming), HiPC 2008 (streaming).
• With BU students Yinfei Pan and Ying Zhang.
81
Motivation
• XML has gained wide prevalence as a data format for input and output.
• Multicore CPUs are becoming widespread.
  – Plans for 100 cores.
• If you have 100 cores, and you are only using one to read and write your output, that could be a significant waste.
82
Parallel XML Parsing
• How can XML parsing be parallelized?
  – Task parallelism.
  – Pipeline parallelism.
  – Data parallelism.
83
• Task parallelism.
  – Multiple independent processing steps.
  – The sauce for a dish with sauce can be made in parallel to the main part.
[Diagram: Step 1 runs on core 1; steps 2A and 2B then run in parallel on cores 1 and 2; step 3 runs on core 1.]
84
• Pipeline parallelism.
  – Multiple stages, all simultaneously performed in parallel.
  – If you are making two cakes (but only have one oven), you can start mixing the batter for the second cake while the first one is in the oven.
[Diagram: three pipeline stages on cores 1-3; at each time step, data items A through E advance one stage, so all three stages run in parallel on different data.]
85
• Data parallelism.
  – Divide the data up, process multiple pieces in parallel.
[Diagram: input chunks 1-3 are processed on cores 1-3, producing output chunks that are merged into the final output.]
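The chunk/process/merge pattern can be sketched as follows. The per-chunk task here (counting `<` characters) is a deliberately trivial stand-in for real parsing work, chosen because a single-character count cannot be broken by a chunk boundary.

```python
# Sketch of data parallelism: split the input into chunks, process each
# chunk independently, then merge the partial results. The counting task
# is a hypothetical stand-in for per-chunk parsing.
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    return chunk.count("<")  # stand-in for parsing work on this chunk

def parallel_count(text, nchunks=3):
    size = (len(text) + nchunks - 1) // nchunks
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    with ThreadPoolExecutor(max_workers=nchunks) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)  # the merge step

doc = "<a><b>hi</b></a>" * 100
```

Real XML parsing is harder than this sketch precisely because, as the next slide explains, the correct starting state of each chunk depends on everything before it.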
86
But XML is Inherently Sequential
• How can a chunk be parsed without knowing what came before?
• The parser doesn’t know what state to start in.
• Could do various scanning forwards and backwards, but it is ad hoc, and tricky.
  – Special characters like < can be in comments.

<element attr=“value”>content</element>
87
Previous work
• We used a fast, sequential preparse scan.
  – Builds an outline of the document (the skeleton).
  – The skeleton is used to guide the full parse by first decomposing the XML document into well-formed fragments at well-defined, unambiguous positions.
  – The XML fragments are parsed separately on each core with the Libxml2 APIs.
  – The results are merged into the final DOM with the Libxml2 APIs.
• The preparse is sequential, however, so Amdahl’s law kicks in. We scale well to 4 cores, or so.
• So how can we parallelize the preparse?
88
Example: The Preparsing DFA
• The preparsing DFA has two actions: START and END, which are used to build the skeleton during execution of the DFA.
[State diagram: the 8-state preparsing DFA (states 0-7), with transitions on <, >, /, !, quote characters, and name characters; reading an element name triggers the START action, and the > that ends a close tag triggers the END action.]
89
Example of running preparsing DFA
[Diagram: the DFA run on <foo>sample</foo>, showing the state after each character and the START and END actions emitted for the element.]
How can this be parallelized?
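A toy version of the preparse can illustrate the DFA-with-actions idea. Note this 4-state scanner is a deliberate simplification of the slide's 8-state DFA: it ignores attributes, quoting, and comments.

```python
# Simplified preparsing sketch (hypothetical 4-state DFA, not the real
# 8-state one): scan for tags and emit START/END actions into a skeleton.
def preparse(xml, state=0, skeleton=None):
    """States: 0 = in content, 1 = just saw '<',
    2 = in a start tag, 3 = in an end tag."""
    skeleton = [] if skeleton is None else skeleton
    for i, c in enumerate(xml):
        if state == 0 and c == "<":
            state = 1
        elif state == 1:
            if c == "/":
                state = 3
            else:
                skeleton.append(("START", i))  # element name begins here
                state = 2
        elif state in (2, 3) and c == ">":
            if state == 3:
                skeleton.append(("END", i))
            state = 0
    return state, skeleton

end_state, skel = preparse("<foo>sample</foo>")
# skel records one START (at the element name) and one END (at the close tag).
```

The `state` argument is what makes the parallelization question concrete: a chunk cut from the middle of the document cannot be preparsed without knowing which state to pass in.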
90
Meta-DFA
• Goal
  – Pursue simultaneously all possible states at the beginning of a chunk when a processor is about to parse the chunk.
• Achieved by:
  – Transforming the original DFA to a meta-DFA whose transition function runs multiple instances of the original DFA in parallel via sub-DFAs.
  – For each state q of the original DFA, the meta-DFA includes a complete copy of the DFA as a sub-DFA which begins execution in state q at the beginning of the chunk.
  – For the actual execution, the meta-DFA transitions from one set of states to another set of states.
91
Output Merging
• Since the meta-DFA pursues multiple possibilities simultaneously, there are also multiple outputs when a chunk is finished.
  – One corresponding to each possible initial state.
• We know definitively the state at the end of the first chunk.
  – This is used to select which output of the second chunk is the correct one.
  – The definitive state at the end of the second chunk is now known.
  – Etc.
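The meta-DFA plus output-merging scheme can be sketched end-to-end with a simplified tag-scanning automaton. This is a 4-state toy under stated assumptions (no attributes, quoting, or comments), not the real 8-state preparsing DFA, and the helper names are invented.

```python
# Sketch: run every chunk from all possible start states (the meta-DFA
# step), then chain chunks by selecting the output that matches the
# now-known end state of the previous chunk (the merging step).
def run(chunk, state):
    """Tiny tag-scanning DFA: 0 = content, 1 = after '<',
    2 = in a start tag, 3 = in an end tag. Returns (end_state, actions)."""
    actions = []
    for c in chunk:
        if state == 0 and c == "<":
            state = 1
        elif state == 1:
            if c == "/":
                state = 3
            else:
                actions.append("START")
                state = 2
        elif state in (2, 3) and c == ">":
            if state == 3:
                actions.append("END")
            state = 0
    return state, actions

def meta_parse(xml, nchunks):
    size = (len(xml) + nchunks - 1) // nchunks
    chunks = [xml[i:i + size] for i in range(0, len(xml), size)]
    # Meta-DFA step: run each chunk from all four possible start states
    # (these runs are independent, so they could execute in parallel).
    tables = [{q: run(chunk, q) for q in range(4)} for chunk in chunks]
    # Output merging: the document starts in state 0; each chunk's known
    # end state selects the correct precomputed result for the next chunk.
    state, output = 0, []
    for table in tables:
        state, actions = table[state]
        output.extend(actions)
    return output

whole = run("<foo>sample</foo><bar/>", 0)[1]
assert meta_parse("<foo>sample</foo><bar/>", 3) == whole
```

The per-chunk runs are embarrassingly parallel; only the cheap final selection pass is sequential, which is why this removes the Amdahl bottleneck of the sequential preparse.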
92
Performance Evaluation
• Machine:
  – Sun E6500 with 30 400 MHz US-II processors
  – Operating System: Solaris 10
  – Compiler: g++ 4.0 with the option -O3
  – XML Standard Library: Libxml2 2.6.16
• Tests:
  – We take the average of ten runs.
  – The test file is selected from a well-known project, the Protein Data Bank (PDB), sized to 34 MB.
  – All speedups are measured against parsing with stand-alone Libxml2.
93
• The full parsing process is:
  – First do a parallel preparse using a meta-DFA. This generates an outline of the document known as the skeleton.
  – Then use techniques based on parallel depth-first tree search to parallelize the full parse.
  – Subtrees of the document are parsed using unmodified libxml2.
94
Preparser Speedup
• Parallel preparser relative to the non-parallel preparser
95
Speedup on parallel full parsing
• After applying our meta-DFA technique to parallelize the preparsing stage, the parallel full parsing is now scalable.
96
Summary
• Data-parallel XML parsing is challenging because the parser does not know in which state to begin a chunk.
  – One solution is to simply begin the parser in all states simultaneously.
• This can be achieved by modeling the parser as a DFA with actions, then transforming the DFA into a meta-DFA (product machine).
• The meta-DFA runs multiple instances of the original DFA, one instance for each state of the original DFA.
• The number of states in the meta-DFA is finite, so it is also a DFA and can be executed by a single core.
  – The parallelism of the meta-DFA is logical parallelism.
97
Future Work
• Parallelizing XPath.
  – Significantly more challenging, but due to Amdahl’s law, first need to parallelize parsing.
• Offload preparsing to FPGA or perhaps GPU.
98
Acknowledgements
• Grateful for the support provided by the NSF and the DOE for this work.
  – NSF awards 0836667, 0753178, 0513687, and 0446298
  – DOE Award DE-FG02-07ER25803