Graph OLAP: Towards Online Analytical Processing on Graphs

Chen Chen, Xifeng Yan, Feida Zhu, Jiawei Han, Philip S. Yu

University of Illinois at Urbana-Champaign
IBM T. J. Watson Research Center
University of Illinois at Chicago

Outline

Motivation
Framework
Efficient Computation
Experiments
Conclusion

Online Analytical Processing

Jim Gray, 1997
OLAP as a powerful analytical tool

The Usefulness of OLAP

Multi-dimensional
  Different perspectives
Multi-level
  Different granularities
Can we offer roll-up/drill-down and slice/dice on graph data?
Traditional OLAP cannot handle this, because it ignores links among data objects

The Prevalence of Graphs

Chemical compounds, computer vision objects, circuits, XML
Especially various information networks
  Biological networks
  Bibliographic networks
  Social networks
  World Wide Web (WWW)

Applications

WWW
  >= 3 billion nodes, >= 50 billion arcs
Facebook
  >= 100 million active users
Combining topological structures and node/edge attributes
Great challenge to view and analyze them
  We propose Graph OLAP to tackle this issue

Scenario #1

A bibliographic network
The collaboration patterns among researchers for SIGMOD 2004

Scenario #2

Outline

Motivation
Framework
  Data Model
  Two types of Graph OLAP
  Dimension, Measure and OLAP operations
Efficient Computation
Experiments
Conclusion

Data Model

We have a collection of network snapshots $\mathcal{G} = \{\mathcal{G}_1, \mathcal{G}_2, \ldots, \mathcal{G}_N\}$
Each snapshot $\mathcal{G}_i = (I_{1,i}, I_{2,i}, \ldots, I_{k,i}; G_i)$
  $I_{1,i}, I_{2,i}, \ldots, I_{k,i}$ are k informational attributes describing the snapshot as a whole
  $G_i = (V_i, E_i)$ is an attributed graph, with attributes attached to its nodes $V_i$ and edges $E_i$
Since $\mathcal{G}_1, \mathcal{G}_2, \ldots, \mathcal{G}_N$ only represent different observations of a network, $V_1, V_2, \ldots, V_N$ actually correspond to the same set of objects
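One way to picture this data model is a small record per snapshot. The sketch below is only an illustration: the class name `Snapshot`, its field names, and the toy values are assumptions made here, not part of the original framework.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    info: dict   # informational attributes of the whole snapshot (Info-Dims)
    nodes: dict  # node id -> node attributes
    edges: dict  # (u, v) -> edge attributes

def same_objects(snapshots):
    """True if every snapshot is defined over the same node set."""
    first = set(snapshots[0].nodes)
    return all(set(s.nodes) == first for s in snapshots)

# Toy data, invented for illustration only.
g1 = Snapshot({"venue": "SIGMOD", "year": 2004},
              {"r1": {}, "r2": {}}, {("r1", "r2"): {"papers": 2}})
g2 = Snapshot({"venue": "VLDB", "year": 2004},
              {"r1": {}, "r2": {}}, {})
ok = same_objects([g1, g2])
```

The `same_objects` check mirrors the remark that all snapshots observe the same set of objects, so a node missing from one observation would simply appear there without edges.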

Two Types of OLAP

Informational OLAP (abbr. I-OLAP)
Topological OLAP (abbr. T-OLAP)

Informational OLAP

Dimensions come from informational attributes attached at the whole snapshot level, so-called Info-Dims
  e.g., scenario #1

I-OLAP Characteristics

Overlay multiple pieces of information
Do not change the objects whose interactions are being looked at
  In the underlying snapshots, each node is a researcher
  In the summarized view, each node is still a researcher

Topological OLAP

Dimensions come from the node/edge attributes inside individual networks, so-called Topo-Dims
  e.g., scenario #2

T-OLAP Characteristics

Zoom in/Zoom out
Network topology changed: “generalized” nodes and “generalized” edges
  In the underlying network, each node is a researcher
  In the summarized view, each node becomes an institute that comprises multiple researchers
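The zoom-out from researchers to institutes can be sketched as a node-merging step. This is a minimal illustration under assumptions made here: edge weights are summed, edges inside one institute are kept as self-loops, and the function name `t_roll_up` and the toy data are hypothetical.

```python
from collections import Counter

def t_roll_up(edges, group_of):
    """Merge nodes into groups and sum the weights of the merged edges."""
    out = Counter()
    for (u, v), w in edges.items():
        gu, gv = group_of[u], group_of[v]
        # Undirected edge between generalized nodes; sorted so (a, b) == (b, a).
        out[tuple(sorted((gu, gv)))] += w
    return out

# Toy coauthorship weights and researcher -> institute mapping (invented).
edges = {("r1", "r2"): 2, ("r2", "r3"): 1, ("r1", "r3"): 4}
group_of = {"r1": "I1", "r2": "I1", "r3": "I2"}
rolled = t_roll_up(edges, group_of)
```

Here `rolled` holds one generalized edge between I1 and I2 with weight 5 and a self-loop on I1 with weight 2. Whether intra-institute edges should be kept or dropped is a design choice the slides do not settle.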

Measures in Graph OLAP

Measure is an aggregated graph
  I-aggregated graph
  T-aggregated graph
Other measures like node count, average degree, etc. can be treated as derived
Graph plays a dual role
  Data source
  Aggregate measure

Generality of the Framework

Measures could be complex
  e.g., maximum flow, shortest path, centrality
Combine I-OLAP and T-OLAP into a hybrid case

Graph OLAP Operations

Roll-up
  Graph I-OLAP: Overlay multiple snapshots to form a higher-level summary via I-aggregated graph
  Graph T-OLAP: Shrink the topology and obtain a T-aggregated graph that represents a compressed view, whose topological elements (i.e., nodes and/or edges) have been merged and replaced by corresponding higher-level ones

Drill-down
  Graph I-OLAP: Return to the set of lower-level snapshots from the higher-level overlaid (aggregated) graph
  Graph T-OLAP: A reverse operation of roll-up

Slice/dice
  Graph I-OLAP: Select a subset of qualifying snapshots based on Info-Dims
  Graph T-OLAP: Select a subgraph of the network based on Topo-Dims
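For the I-OLAP column, roll-up and slice/dice can be sketched over a list of (info, graph) pairs. The representation, the function names `slice_snapshots` and `roll_up`, and the toy data are assumptions for illustration; the overlay simply sums edge weights, which fits a measure such as collaboration frequency.

```python
from collections import Counter, defaultdict

def slice_snapshots(snapshots, **conds):
    """Slice/dice: keep the snapshots whose Info-Dims match all conditions."""
    return [(info, g) for info, g in snapshots
            if all(info[k] == v for k, v in conds.items())]

def roll_up(snapshots, keep):
    """Roll-up: overlay snapshots that agree on the Info-Dims listed in keep."""
    cells = defaultdict(Counter)
    for info, g in snapshots:
        key = tuple(info[d] for d in keep)
        cells[key].update(g)  # Counter.update adds edge weights
    return dict(cells)

# Toy snapshots: one coauthorship graph per (venue, year), invented values.
snapshots = [
    ({"venue": "sigmod", "year": 2003}, {("a", "b"): 1}),
    ({"venue": "sigmod", "year": 2004}, {("a", "b"): 2}),
    ({"venue": "vldb", "year": 2004}, {("a", "c"): 1}),
]
by_venue = roll_up(snapshots, keep=("venue",))  # the [venue, all-years] cells
only_2004 = slice_snapshots(snapshots, year=2004)
```

Drill-down needs no extra code in this sketch: the lower-level snapshots are still available in `snapshots`.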

Outline

Motivation
Framework
Efficient Computation
  Measure classification
  Optimizations
  Constraint pushing
Experiments
Conclusion

Two Categories of Strategies

Top-down
  Generalized cells later
  How to combine and leverage intermediate results?
Bottom-up
  Generalized cells first
  How to early-stop?

Measure Classification

How to combine and leverage intermediate results?
Distributive
  The computation of high-level cells can be directly built on low-level cells
Algebraic
  Not distributive, but can be easily derived from several distributive measures
Holistic
  Neither distributive nor algebraic

Examples

Distributive: collaboration frequency
  Use distributiveness to drive computation up the cuboid lattice
Algebraic: maximum flow
  Will prove later
  Semi-distributive
Holistic: centrality
  Need to go down to the raw data and start from scratch
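The distributive case can be checked on a toy example: a top-level cell built from already aggregated per-venue cells equals the one built from the raw snapshots. The helper `overlay`, the cell keys, and the numbers below are assumptions for illustration.

```python
from collections import Counter

def overlay(graphs):
    """Sum edge weights (collaboration frequencies) over several graphs."""
    total = Counter()
    for g in graphs:
        total.update(g)
    return total

# Base cells keyed by (venue, year); weights are invented.
base = {
    ("sigmod", 2003): Counter({("a", "b"): 1}),
    ("sigmod", 2004): Counter({("a", "b"): 2, ("b", "c"): 1}),
    ("vldb", 2004): Counter({("a", "c"): 1}),
}

# Intermediate cuboid: one cell per venue, all years overlaid.
by_venue = {
    venue: overlay(g for (v, _), g in base.items() if v == venue)
    for venue in ("sigmod", "vldb")
}

# Top cell computed two ways: from the venue cells and from the raw snapshots.
top_from_cells = overlay(by_venue.values())
top_from_raw = overlay(base.values())
```

Both routes give the same aggregated graph, which is what allows computation to be driven up the cuboid lattice without returning to the raw data. A holistic measure such as centrality offers no comparable shortcut.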

Optimizations

Special measures may have special properties that can help optimize the calculations
We discuss two of them here, with regard to I-OLAP
  Localization
  Attenuation

Localization

During computation, only a neighborhood of the networks needs to be consulted
  e.g., the collaboration frequency of “R. Agrawal” and “R. Srikant” for [sigmod, all-years] only depends on their collaboration frequencies in each SIGMOD conference
Perfect (i.e., 0-neighborhood) localization
k-neighborhood is less ideal, but still useful
  e.g., # of common friends shared by “R. Agrawal” and “R. Srikant”
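The difference between 0-neighborhood and k-neighborhood localization can be sketched as follows. Snapshots are stored as adjacency dictionaries, and the function names, node labels and weights are assumptions for illustration. The pair frequency reads only the edge between the two nodes, while the common-friend count reads the 1-neighborhoods of both nodes.

```python
def pair_frequency(snapshots, u, v):
    """0-neighborhood: only the (u, v) edge of each snapshot is consulted."""
    return sum(g.get(u, {}).get(v, 0) for g in snapshots)

def common_friends(snapshots, u, v):
    """1-neighborhood: only the neighbors of u and v are consulted."""
    nu, nv = set(), set()
    for g in snapshots:
        nu |= set(g.get(u, {}))
        nv |= set(g.get(v, {}))
    return (nu & nv) - {u, v}

# Toy snapshots as node -> {neighbor: weight}; values are invented.
s1 = {"A": {"B": 2, "X": 1}, "B": {"A": 2}, "X": {"A": 1}}
s2 = {"A": {"B": 1}, "B": {"A": 1, "X": 1}, "X": {"B": 1}}

freq = pair_frequency([s1, s2], "A", "B")
friends = common_friends([s1, s2], "A", "B")
```

In this toy data neither snapshot alone gives A and B a common friend, yet the overlay does (X). The aggregate therefore cannot be obtained by adding up per-snapshot answers, although it still needs only a small neighborhood of each network.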

Attenuation

Consider the transporting capability (i.e., maximum flow) from source S to destination T
Multiple transportation networks, each one operated by a separate company
With regard to I-OLAP, each network is a “snapshot”, and overlaying more than one snapshot means sharing link capacities among companies

Attenuation

Data graph C
  Node: cities
  Edge: capacity of a link
Measure graph F
  Node: cities
  Edge: when maximum flow is transmitted, the quantity that passes through a link

Attenuation

Maximum flow is algebraic
  F can be derived from C
  Just run the maximum flow algorithm
  The capacity graph C is obviously distributive
Lemma
  Let $F$ be a flow in $C$ and let $C_F$ be its residual graph, where residual means that $C_F = C - F$; then $F'$ is a maximum flow in $C_F$ if and only if $F + F'$ is a maximum flow in $C$

Attenuation

Consider two snapshots that are overlaid
  Maximum flows $F_1$, $F_2$ already calculated from $C_1$, $C_2$
Without attenuation
  Compute the overall maximum flow $F$ from $C_1 + C_2$
With attenuation
  Take $F_1 + F_2$ as basis
  Compute the residual maximum flow $F'$ from $(C_1 - F_1) + (C_2 - F_2)$, and augment it onto $F_1 + F_2$
Thus, our input attenuates from $C_1 + C_2$ to $(C_1 + C_2) - (F_1 + F_2)$, which substantially decreases the effort
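The attenuation idea can be tried on a toy example. The sketch below uses a plain Edmonds-Karp routine as the maximum flow algorithm, which is a choice made here for brevity and not necessarily the one used in the experiments. Flows are stored as antisymmetric net flows so that C - F automatically contains the reverse residual edges. All function names and capacities are assumptions.

```python
from collections import defaultdict, deque

def max_flow(cap, s, t):
    """Edmonds-Karp on {(u, v): capacity}; returns (value, net flow dict)."""
    res, adj, flow = defaultdict(int), defaultdict(set), defaultdict(int)
    for (u, v), c in cap.items():
        res[(u, v)] += c
        adj[u].add(v)
        adj[v].add(u)
    value = 0
    while True:
        parent, queue = {s: None}, deque([s])
        while queue and t not in parent:  # BFS for a shortest augmenting path
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and res[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return value, dict(flow)
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        b = min(res[e] for e in path)  # bottleneck capacity on the path
        for u, v in path:
            res[(u, v)] -= b
            res[(v, u)] += b
            flow[(u, v)] += b
            flow[(v, u)] -= b
        value += b

def add(a, b):
    """Edge-wise sum of two graphs."""
    out = defaultdict(int)
    for g in (a, b):
        for e, x in g.items():
            out[e] += x
    return dict(out)

def residual(cap, flow):
    """C - F; with antisymmetric F the reverse edges gain capacity."""
    out = defaultdict(int, cap)
    for e, f in flow.items():
        out[e] -= f
    return {e: c for e, c in out.items() if c > 0}

# Two toy snapshots (companies) sharing the cities S, A, T; invented capacities.
c1 = {("S", "A"): 3, ("A", "T"): 2}
c2 = {("S", "A"): 1, ("A", "T"): 4}
v1, f1 = max_flow(c1, "S", "T")
v2, f2 = max_flow(c2, "S", "T")

# With attenuation: only the leftover capacity is searched.
extra, _ = max_flow(add(residual(c1, f1), residual(c2, f2)), "S", "T")
attenuated = v1 + v2 + extra

# Without attenuation: recompute on the full overlaid capacities.
direct, _ = max_flow(add(c1, c2), "S", "T")
```

Both routes reach the same flow value, as the lemma predicts. The attenuated run starts from a network in which the capacity consumed by F1 and F2 has already been removed, so it has less augmenting work left to do.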

Constraint Pushing

Iceberg graph cube
  Partial materialization
  Satisfying some interestingness requirement
Push the constraints
  Anti-monotone
    e.g., maximum flow $|f| \ge \delta_{|f|}$
  Monotone
    e.g., diameter $d \ge \delta_d$
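Pushing an anti-monotone constraint can be sketched as top-down pruning. Anti-monotone is read here in the usual iceberg-cube sense: once a cell fails the threshold, no cell built from a subset of its snapshots can pass. The sketch makes two simplifying assumptions: every non-empty subset of snapshots counts as a cell, and the measure is a sum of non-negative per-snapshot values. The sum stands in for any measure that can only grow as more snapshots are overlaid, as maximum flow does when capacities are shared. Names and values are hypothetical.

```python
def cell_value(weights, cell):
    # Stand-in measure: the sum of non-negative per-snapshot values.
    return sum(weights[s] for s in cell)

def iceberg(weights, delta):
    """Top-down search over cells; returns (qualifying cells, #cells evaluated)."""
    top = frozenset(weights)
    seen, stack = {top}, [top]
    kept, evaluated = [], 0
    while stack:
        cell = stack.pop()
        evaluated += 1
        if cell_value(weights, cell) < delta:
            continue  # anti-monotone pruning: no sub-cell can qualify
        kept.append(cell)
        for s in cell:
            child = cell - {s}
            if child and child not in seen:
                seen.add(child)
                stack.append(child)
    return kept, evaluated

weights = {"a": 5, "b": 3, "c": 1}  # invented per-snapshot values
kept, evaluated = iceberg(weights, delta=9)
```

With this threshold only the full overlay qualifies, and the search evaluates 4 of the 7 possible cells because the single-snapshot cells are never generated.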

OLAP a Bibliographic Network

We get the coauthorship data from DBLP
Measure
  Information Centrality
Two Info-Dims
  Area
    Database (DB): PODS/SIGMOD/VLDB/ICDE/EDBT
    Data Mining (DM): ICDM/SDM/KDD/PKDD
    Information Retrieval (IR): SIGIR/WWW/CIKM
  Time

OLAP a Bibliographic Network

Efficiency

A test that computes maximum flow as the measure
Synthetically generate flow networks
  Details in the paper, with each “snapshot” representing an individual player in the transportation industry
Like the Multi-Way method, calculate low-level cells before merging them into high-level ones
  One takes advantage of the attenuation heuristic
  The other does not

Efficiency

Conclusion

We propose a Graph OLAP framework to perform multi-dimensional, multi-level analysis on network data
  Measure is an aggregated graph
  Informational/Topological dimensions lead to I-OLAP, T-OLAP

Conclusion

Mainly focusing on I-OLAP, we discuss how a graph cube can be efficiently computed and materialized
  Distributive, algebraic, holistic
  Optimizations: localization, attenuation
  Constraint pushing

Future Works

Technical issues for T-OLAP
Selective drilling and discovery-driven InfoNet-OLAP

Thank You!
