Graph OLAP: Towards Online Analytical Processing on Graphs
Chen Chen, Xifeng Yan, Feida Zhu, Jiawei Han, Philip S. Yu
University of Illinois at Urbana-Champaign
IBM T. J. Watson Research Center
University of Illinois at Chicago
Outline
  Motivation
  Framework
  Efficient Computation
  Experiments
  Conclusion
Online Analytical Processing
  Jim Gray, 1997
  OLAP as a powerful analytical tool
The Usefulness of OLAP
  Multi-dimensional
    Different perspectives
  Multi-level
    Different granularities
  Can we offer roll-up/drill-down and slice/dice on graph data?
  Traditional OLAP cannot handle this, because it ignores links among data objects
The Prevalence of Graphs
  Chemical compounds, computer vision objects, circuits, XML
  Especially various information networks
    Biological networks
    Bibliographic networks
    Social networks
    World Wide Web (WWW)
Applications
  WWW
    >= 3 billion nodes, >= 50 billion arcs
  Facebook
    >= 100 million active users
  Combining topological structures and node/edge attributes
  Great challenge to view and analyze them
    We propose Graph OLAP to tackle this issue
Scenario #1
  A bibliographic network
  The collaboration patterns among researchers for SIGMOD 2004
Scenario #2
Outline
  Motivation
  Framework
    Data Model
    Two types of Graph OLAP
    Dimension, Measure and OLAP operations
  Efficient Computation
  Experiments
  Conclusion
Data Model
  We have a collection of network snapshots $\mathcal{G} = \{\mathcal{G}_1, \mathcal{G}_2, \ldots, \mathcal{G}_N\}$
  Each snapshot $\mathcal{G}_i = (I_{1,i}, I_{2,i}, \ldots, I_{k,i}; G_i)$
    $I_{1,i}, I_{2,i}, \ldots, I_{k,i}$ are $k$ informational attributes describing the snapshot as a whole
    $G_i = (V_i, E_i)$ is an attributed graph, with attributes attached to its nodes $V_i$ and edges $E_i$
  Since $G_1, G_2, \ldots, G_N$ only represent different observations of a network, $V_1, V_2, \ldots, V_N$ actually correspond to the same set of objects
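
A minimal Python sketch of this data model may make the notation concrete. The class name `Snapshot`, its field names, and the toy researchers and counts are all illustrative assumptions, not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One network snapshot: informational attributes plus an attributed graph."""
    info: dict        # informational attributes of the whole snapshot (Info-Dims)
    node_attrs: dict  # node -> attribute dict (candidate Topo-Dims)
    edges: dict       # (u, v) -> edge attribute, here a made-up collaboration count

    def nodes(self):
        return set(self.node_attrs)


# Hypothetical researchers; every snapshot observes the same set of objects.
researchers = {
    "r1": {"institute": "X"},
    "r2": {"institute": "X"},
    "r3": {"institute": "Y"},
}

snapshots = [
    Snapshot({"venue": "SIGMOD", "year": 2003}, researchers, {("r1", "r2"): 1}),
    Snapshot({"venue": "SIGMOD", "year": 2004}, researchers,
             {("r1", "r2"): 2, ("r2", "r3"): 1}),
]


def same_objects(snaps):
    """Check that all snapshots share one node set, as the data model requires."""
    return all(s.nodes() == snaps[0].nodes() for s in snaps)
```

Here `same_objects(snapshots)` is true because both snapshots reuse the same `researchers` mapping; only the edges and the informational attributes differ between observations.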
Two Types of OLAP
  Informational OLAP (abbr. I-OLAP)
  Topological OLAP (abbr. T-OLAP)
Informational OLAP
  Dimensions come from informational attributes attached at the whole snapshot level, so-called Info-Dims
  e.g., scenario #1
I-OLAP Characteristics
  Overlay multiple pieces of information
  Do not change the objects whose interactions are being looked at
    In the underlying snapshots, each node is a researcher
    In the summarized view, each node is still a researcher
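
As a rough illustration of overlaying, the sketch below sums edge weights across the snapshots selected by an Info-Dim condition. The function name `i_rollup`, the (info, edges) tuple layout, and the toy counts are assumptions made for this example; summation is only one possible way to overlay.

```python
from collections import Counter


def i_rollup(snapshots, keep):
    """Overlay the edges of every snapshot whose Info-Dims satisfy `keep`."""
    overlay = Counter()
    for info, edges in snapshots:
        if keep(info):
            overlay.update(edges)  # Counter.update adds weights edge by edge
    return dict(overlay)


# Hypothetical snapshots: (informational attributes, weighted edges).
snapshots = [
    ({"venue": "SIGMOD", "year": 2003}, {("r1", "r2"): 1, ("r2", "r3"): 2}),
    ({"venue": "SIGMOD", "year": 2004}, {("r1", "r2"): 2}),
    ({"venue": "VLDB", "year": 2004}, {("r2", "r3"): 1, ("r1", "r3"): 3}),
]

sigmod_all_years = i_rollup(snapshots, lambda info: info["venue"] == "SIGMOD")
```

The nodes of `sigmod_all_years` are still individual researchers; only the edge weights have been aggregated.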
Topological OLAP
  Dimensions come from the node/edge attributes inside individual networks, so-called Topo-Dims
  e.g., scenario #2
T-OLAP Characteristics
  Zoom in/Zoom out
  Network topology changed: “generalized” nodes and “generalized” edges
    In the underlying network, each node is a researcher
    In the summarized view, each node becomes an institute that comprises multiple researchers
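
The zoom-out step can be sketched as follows. The function name `t_rollup`, the researcher-to-institute mapping, and the choice to keep collaborations inside one institute as a self-loop are all assumptions of this example rather than details from the talk.

```python
from collections import Counter


def t_rollup(edges, group_of):
    """Merge nodes into generalized nodes and sum the weights of merged edges."""
    merged = Counter()
    for (u, v), weight in edges.items():
        # Undirected generalized edge; equal endpoints become a self-loop.
        key = tuple(sorted((group_of[u], group_of[v])))
        merged[key] += weight
    return dict(merged)


# Hypothetical researcher-level network and Topo-Dim (institute) per researcher.
edges = {("r1", "r2"): 2, ("r2", "r3"): 1, ("r1", "r3"): 4}
institute = {"r1": "X", "r2": "X", "r3": "Y"}

institute_view = t_rollup(edges, institute)
```

In `institute_view` the three researcher nodes have collapsed into two institute nodes, so the topology itself has changed, unlike in the I-OLAP case.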
Measures in Graph OLAP
  Measure is an aggregated graph
    I-aggregated graph
    T-aggregated graph
  Other measures like node count, average degree, etc. can be treated as derived
  Graph plays a dual role
    Data source
    Aggregate measure
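
To illustrate what "derived" could mean here, the sketch below computes node count and average degree directly from an aggregated graph. The edge-dictionary representation and the toy weights are assumptions; nodes without any edge are not represented in this layout and are therefore not counted.

```python
def node_count(edges):
    """Number of distinct endpoints that appear in the aggregated graph."""
    return len({node for edge in edges for node in edge})


def average_degree(edges):
    """Each undirected edge contributes 2 to the total degree."""
    n = node_count(edges)
    return 2 * len(edges) / n if n else 0.0


# Hypothetical aggregated graph: (u, v) -> aggregated weight.
aggregated = {("r1", "r2"): 3, ("r2", "r3"): 1, ("r3", "r4"): 2}

count = node_count(aggregated)
degree = average_degree(aggregated)
```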
Generality of the Framework
  Measures could be complex
    e.g., maximum flow, shortest path, centrality
  Combine I-OLAP and T-OLAP into a hybrid case
Graph OLAP Operations
  Roll-up
    Graph I-OLAP: Overlay multiple snapshots to form a higher-level summary via I-aggregated graph
    Graph T-OLAP: Shrink the topology and obtain a T-aggregated graph that represents a compressed view, whose topological elements (i.e., nodes and/or edges) have been merged and replaced by corresponding higher-level ones
  Drill-down
    Graph I-OLAP: Return to the set of lower-level snapshots from the higher-level overlaid (aggregated) graph
    Graph T-OLAP: A reverse operation of roll-up
  Slice/dice
    Graph I-OLAP: Select a subset of qualifying snapshots based on Info-Dims
    Graph T-OLAP: Select a subgraph of the network based on Topo-Dims
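
The two slice/dice variants can be sketched side by side. The function names `slice_snapshots` and `dice_subgraph`, the dictionary layout, and the toy data are assumptions made for illustration only.

```python
def slice_snapshots(snapshots, **conditions):
    """I-OLAP slice/dice: keep snapshots whose Info-Dims match every condition."""
    return [
        snap for snap in snapshots
        if all(snap["info"].get(dim) == value for dim, value in conditions.items())
    ]


def dice_subgraph(edges, node_attrs, predicate):
    """T-OLAP slice/dice: keep the subgraph induced by nodes passing `predicate`."""
    keep = {node for node, attrs in node_attrs.items() if predicate(attrs)}
    return {(u, v): w for (u, v), w in edges.items() if u in keep and v in keep}


# Hypothetical data.
snapshots = [
    {"info": {"venue": "SIGMOD", "year": 2003}, "edges": {("r1", "r2"): 1}},
    {"info": {"venue": "SIGMOD", "year": 2004}, "edges": {("r1", "r3"): 2}},
    {"info": {"venue": "VLDB", "year": 2004}, "edges": {("r2", "r3"): 1}},
]
node_attrs = {"r1": {"institute": "X"}, "r2": {"institute": "X"}, "r3": {"institute": "Y"}}

sigmod_only = slice_snapshots(snapshots, venue="SIGMOD")
inside_x = dice_subgraph({("r1", "r2"): 1, ("r1", "r3"): 2}, node_attrs,
                         lambda attrs: attrs["institute"] == "X")
```

The first function filters whole snapshots, while the second filters nodes and edges inside a single network, which mirrors the Info-Dim versus Topo-Dim distinction.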
Outline
  Motivation
  Framework
  Efficient Computation
    Measure classification
    Optimizations
    Constraint pushing
  Experiments
  Conclusion
Two Categories of Strategies
  Top-down
    Generalized cells later
    How to combine and leverage intermediate results?
  Bottom-up
    Generalized cells first
    How to early-stop?
Measure Classification
  How to combine and leverage intermediate results?
  Distributive
    The computation of high-level cells can be directly built on low-level cells
  Algebraic
    Not distributive, but can be easily derived from several distributive measures
  Holistic
    Neither distributive nor algebraic
Examples
  Distributive: collaboration frequency
    Use distributiveness to drive computation up the cuboid lattice
  Algebraic: maximum flow
    Will prove later
    Semi-distributive
  Holistic: centrality
    Need to go down to the raw data and start from scratch
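
The distributive case can be checked on a toy cube: the apex cell computed from intermediate per-venue cells should equal the apex computed from the raw snapshots. The helper name `overlay`, the cell keys, and the counts are assumptions of this sketch.

```python
from collections import Counter


def overlay(graphs):
    """Sum collaboration frequencies edge by edge."""
    total = Counter()
    for graph in graphs:
        total.update(graph)
    return dict(total)


# Hypothetical base cells keyed by (venue, year).
base = {
    ("SIGMOD", 2003): {("r1", "r2"): 1},
    ("SIGMOD", 2004): {("r1", "r2"): 2, ("r2", "r3"): 1},
    ("VLDB", 2003): {("r1", "r3"): 1},
    ("VLDB", 2004): {("r1", "r2"): 1},
}

# One level up the lattice: (venue, all-years).
venues = sorted({venue for venue, _ in base})
by_venue = {
    venue: overlay(g for (v, _), g in base.items() if v == venue)
    for venue in venues
}

# Apex cell (all-venues, all-years), built two different ways.
apex_from_cells = overlay(by_venue.values())
apex_from_raw = overlay(base.values())
```

Because both routes agree, the higher-level cell never needs to revisit the raw snapshots. A holistic measure such as centrality would not allow this shortcut.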
Optimizations
  Special measures may have special properties that can help optimize the calculations
  We discuss two of them here, with regard to I-OLAP
    Localization
    Attenuation
Localization
  During computation, only a neighborhood of the networks needs to be consulted
    e.g., the collaboration frequency of “R. Agrawal” and “R. Srikant” for [sigmod, all-years] only depends on their collaboration frequencies in each SIGMOD conference
    Perfect (i.e., 0-neighborhood) localization
  k-neighborhood is less ideal, but still useful
    e.g., # of common friends shared by “R. Agrawal” and “R. Srikant”
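
A small sketch of both cases follows. The function names, the snapshot layout, and the reading of "common friends" as common neighbors in the overlaid graph are assumptions of this example.

```python
def pair_frequency(snapshots, u, v):
    """0-neighborhood: only the (u, v) edge of each snapshot is consulted."""
    key = tuple(sorted((u, v)))
    return sum(edges.get(key, 0) for edges in snapshots)


def neighbors(edges, x):
    """Nodes adjacent to x in one snapshot."""
    return {b if a == x else a for (a, b) in edges if x in (a, b)}


def common_friends(snapshots, u, v):
    """1-neighborhood: only edges incident to u or v are consulted."""
    near_u = set().union(*(neighbors(edges, u) for edges in snapshots))
    near_v = set().union(*(neighbors(edges, v) for edges in snapshots))
    return len((near_u & near_v) - {u, v})


# Hypothetical snapshots: (u, v) -> collaboration frequency.
snapshots = [
    {("r1", "r2"): 1, ("r1", "r3"): 1},
    {("r1", "r2"): 2, ("r2", "r3"): 1},
    {("r2", "r4"): 1},
]

frequency = pair_frequency(snapshots, "r1", "r2")
shared = common_friends(snapshots, "r1", "r2")
```

Note that `r3` is a common friend only in the overlaid view, since its two edges come from different snapshots. The pair frequency needs a single edge per snapshot, whereas the common-friend count needs the neighborhoods of both nodes.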
Attenuation
  Consider the transporting capability (i.e., maximum flow) from source S to destination T
  Multiple transportation networks, each one operated by a separate company
  With regard to I-OLAP, each network is a “snapshot”, and overlaying more than one snapshot means sharing link capacities among companies
Attenuation
  Data graph C
    Node: cities
    Edge: capacity of a link
  Measure graph F
    Node: cities
    Edge: when maximum flow is transmitted, the quantity that passes through a link
Attenuation
  Maximum flow is algebraic
    $F$ can be derived from $C$
      Just run the maximum flow algorithm
    The capacity graph $C$ is obviously distributive
  Lemma
    Let $F$ be a flow in $C$ and let $C_F$ be its residual graph, where residual means that $C_F = C - F$. Then $F'$ is a maximum flow in $C_F$ if and only if $F + F'$ is a maximum flow in $C$
Attenuation
  Consider two snapshots that are overlaid
    Maximum flows $F_1$, $F_2$ already calculated from $C_1$, $C_2$
  Without attenuation
    Compute the overall maximum flow $F$ from $C_1 + C_2$
  With attenuation
    Take $F_1 + F_2$ as basis
    Compute the residual maximum flow $F'$ from $(C_1 - F_1) + (C_2 - F_2)$, and augment it onto $F_1 + F_2$
  Thus, our input attenuates from $C_1 + C_2$ to $(C_1 + C_2) - (F_1 + F_2)$, which substantially reduces the effort
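
The attenuation idea can be tried on two tiny networks. The sketch below uses a plain Edmonds-Karp routine as a stand-in for whatever maximum flow algorithm the authors ran; the function names, the networks, and the capacities are assumptions. The residual graph is taken in the usual sense, so it includes reverse arcs that allow earlier flow to be cancelled.

```python
from collections import defaultdict, deque


def max_flow(cap, s, t):
    """Edmonds-Karp. Returns (flow value, residual capacities C - F)."""
    res = defaultdict(int)
    adj = defaultdict(set)
    for (u, v), c in cap.items():
        res[(u, v)] += c
        adj[u].add(v)
        adj[v].add(u)  # reverse arcs let later paths cancel flow
    value = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and res[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[arc] for arc in path)
        for u, v in path:
            res[(u, v)] -= push
            res[(v, u)] += push
        value += push
    return value, {arc: c for arc, c in res.items() if c > 0}


def add_graphs(a, b):
    """Overlay two capacity graphs by summing capacities arc by arc."""
    total = defaultdict(int)
    for graph in (a, b):
        for arc, c in graph.items():
            total[arc] += c
    return dict(total)


# Hypothetical capacity graphs of two companies.
C1 = {("S", "A"): 3, ("A", "T"): 2, ("S", "B"): 1}
C2 = {("A", "T"): 2, ("B", "T"): 2, ("S", "B"): 1}

v1, R1 = max_flow(C1, "S", "T")  # R1 plays the role of C1 - F1
v2, R2 = max_flow(C2, "S", "T")  # R2 plays the role of C2 - F2

# With attenuation: only the residual flow is computed on the overlay.
extra, _ = max_flow(add_graphs(R1, R2), "S", "T")
attenuated = v1 + v2 + extra

# Without attenuation: start over on C1 + C2.
direct, _ = max_flow(add_graphs(C1, C2), "S", "T")
```

Both routes reach the same flow value, as the lemma predicts, but the attenuated run only has to find the flow that `F1 + F2` left unused.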
Constraint Pushing
  Iceberg graph cube
    Partial materialization
    Satisfying some interestingness requirement
  Push the constraints
    Anti-monotone
      e.g., maximum flow $|f| \ge \delta_{|f|}$
    Monotone
      e.g., diameter $d \ge \delta_d$
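
A hedged sketch of how an anti-monotone constraint might prune an iceberg cube: if a high-level cell fails the threshold, its lower-level cells are skipped. To stay short, the example swaps in total edge weight for maximum flow as the measure; like maximum flow under capacity sharing, it cannot shrink when non-negative snapshots are overlaid. The function names, cell names, and numbers are assumptions.

```python
from collections import Counter


def overlay(graphs):
    """Sum edge weights across snapshots."""
    total = Counter()
    for graph in graphs:
        total.update(graph)
    return dict(total)


def total_weight(edges):
    """Stand-in measure; assumed non-negative, so overlaying never lowers it."""
    return sum(edges.values())


def iceberg(groups, delta):
    """groups: high-level cell -> list of (low-level cell, edges).

    Returns the qualifying cells and how many low-level cells were evaluated.
    """
    qualifying, evaluated = [], 0
    for high_cell, members in groups.items():
        if total_weight(overlay(edges for _, edges in members)) < delta:
            continue  # anti-monotone: no low-level cell below can reach delta
        qualifying.append(high_cell)
        for low_cell, edges in members:
            evaluated += 1
            if total_weight(edges) >= delta:
                qualifying.append(low_cell)
    return qualifying, evaluated


# Hypothetical cells.
groups = {
    "SIGMOD/all-years": [("SIGMOD/2003", {("r1", "r2"): 1}),
                         ("SIGMOD/2004", {("r1", "r2"): 1})],
    "VLDB/all-years": [("VLDB/2003", {("r1", "r3"): 4}),
                       ("VLDB/2004", {("r2", "r3"): 1})],
}

cells, evaluated = iceberg(groups, delta=3)
```

With an expensive measure, the saving comes from never running it on the low-level cells under a failed high-level cell.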
Outline
  Motivation
  Framework
  Efficient Computation
  Experiments
  Conclusion
OLAP a Bibliographic Network
  We get the coauthorship data from DBLP
  Measure
    Information Centrality
  Two Info-Dims
    Area
      Database (DB): PODS/SIGMOD/VLDB/ICDE/EDBT
      Data Mining (DM): ICDM/SDM/KDD/PKDD
      Information Retrieval (IR): SIGIR/WWW/CIKM
    Time
OLAP a Bibliographic Network
Efficiency
  A test that computes maximum flow as the measure
  Synthetically generate flow networks
    Details in the paper, with each “snapshot” representing an individual player in the transportation industry
  Two algorithms that, like the Multi-Way method, calculate low-level cells before merging them into high-level ones
    One takes advantage of the attenuation heuristic
    The other does not
Efficiency
Outline
  Motivation
  Framework
  Efficient Computation
  Experiments
  Conclusion
Conclusion
  We propose a Graph OLAP framework to perform multi-dimensional, multi-level analysis on network data
  Measure is an aggregated graph
  Informational/Topological dimensions lead to I-OLAP, T-OLAP
Conclusion
  Mainly focusing on I-OLAP, we discuss how a graph cube can be efficiently computed and materialized
    Distributive, algebraic, holistic
    Optimizations: localization, attenuation
    Constraint pushing
Future Work
  Technical issues for T-OLAP
  Selective drilling and discovery-driven InfoNet-OLAP
Thank You!