An Evaluation of Alternative Physical Graph Data …dblab.usc.edu/Users/papers/Neo4jTR.pdfAn...

16
An Evaluation of Alternative Physical Graph Data Designs for Processing Interactive Social Networking Actions Shahram Ghandeharizadeh, Reihane Boghrati and Sumita Barahmand Database Laboratory Technical Report 2014-10 Computer Science Department, USC Los Angeles, California 90089-0781 shahram, boghrati, barahman @usc.edu September 7, 2014 Abstract This study quantifies the tradeoff associated with alternative physical representations of a social graph for process- ing interactive social networking actions. We conduct this evaluation using a graph data store named Neo4j deployed in a client-server (REST) architecture using the BG benchmark. In addition to the average response time of a design, we quantify its SoAR defined as the highest observed throughput given the following service level agreement: 95% of actions to observe a response time of 100 milliseconds or faster. For an action such as computing the shortest distance between two members, we observe a tradeoff between speed and accuracy of the computed result. With this action, a relational data design provides a significantly faster response time than a graph design. The graph designs provide a higher SoAR than a relational one when the social graph includes large member profile images stored in the data store. A Introduction A graph database provides an intuitive representation of a social graph. It supports vertices that may represent members and edges that may represent a relationship such as friendship between two members. Queries may filter vertices of interest and navigate edges to retrieve relevant data. Updates may insert and delete a vertex, add and remove edges between vertices, and change the property value of edges and vertices. Facebook’sTAO [1] is an example graph data store that serves a social graph to hundreds of millions of users on a daily basis. In Proceedings of the Sixth TPC Technology Conference on Performance Evaluation and Benchmarking (TPCTC), Hangzhou, China, Septem- ber 2014. 1

Transcript of An Evaluation of Alternative Physical Graph Data …dblab.usc.edu/Users/papers/Neo4jTR.pdfAn...

An Evaluationof AlternativePhysicalGraphDataDesignsfor

ProcessingInteractiveSocialNetworkingActions�

ShahramGhandeharizadeh,ReihaneBoghrati andSumitaBarahmand

DatabaseLaboratoryTechnicalReport2014-10

ComputerScienceDepartment,USC

Los Angeles,California90089-0781

�shahram,boghrati,barahman� @usc.edu

September7, 2014

Abstract

Thisstudyquantifiesthetradeoff associatedwith alternativephysicalrepresentationsof asocialgraphfor process-

ing interactivesocialnetworkingactions.Weconductthisevaluationusinga graphdatastorenamedNeo4jdeployed

in a client-server (REST)architectureusingtheBG benchmark.In additionto theaverageresponsetimeof a design,

we quantify its SoARdefinedasthehighestobservedthroughputgiven thefollowing servicelevel agreement:95%

of actionsto observe a responsetime of 100 millisecondsor faster. For an actionsuchascomputingthe shortest

distancebetweentwo members,we observe a tradeoff betweenspeedandaccuracy of thecomputedresult.With this

action,a relationaldatadesignprovidesa significantlyfasterresponsetime thana graphdesign.Thegraphdesigns

provide a higherSoAR thana relationalonewhenthesocialgraphincludeslargememberprofile imagesstoredin

thedatastore.

A Intr oduction

A graphdatabaseprovidesanintuitiverepresentationof asocialgraph.It supportsverticesthatmayrepresentmembers

andedgesthatmayrepresenta relationshipsuchasfriendshipbetweentwo members.Queriesmayfilter verticesof

interestandnavigateedgesto retrieve relevantdata. Updatesmay insertanddeletea vertex, addandremove edges

betweenvertices,andchangethepropertyvalueof edgesandvertices.Facebook’sTAO [1] is anexamplegraphdata

storethatservesa socialgraphto hundredsof millions of userson adaily basis.�In Proceedingsof theSixthTPCTechnologyConferenceonPerformanceEvaluationandBenchmarking(TPCTC),Hangzhou,China,Septem-

ber2014.

1

Onemayrepresentasocialgraphusingdifferentphysicalgraphrepresentations.To illustrate,considerthefriend-

ship relationshipbetweentwo membersA andB. It maystartwith onemember, sayMemberA, extendinga friend

invitation to MemberB. And, MemberB acceptingthis invitation. Two physicalrepresentations,termedLabeledand

Distinct, areasfollows. With Labeled,thefriendshipedgebetweenMemberA andB is assignedavalueto identify it

asa friendshipinvitation. OnceMemberB acceptsA’s invitation,thevalueof thisedgechangesto denoteaconfirmed

friendship.With Distinct,therearetwo typesof edges,onefor apendingfriendinvitationandasecondfor aconfirmed

friendship.WhenMemberB acceptsA’s invitation, thesystemdeletestheedgecorrespondingto thefriend invitation

andcreatesaconfirmedfriendshipedgebetweenthem.Thisdesigncreatesanddeletesedgesmorefrequentlythanthe

Labeleddesign.

A researchtopic is whatarethe tradeoff associatedwith thesealternative designsfor differentworkloads?And,

how do they comparewith datastoresthatimplementa differentdatamodelsuchasrelationaldatabasemanagement

systems(RDBMSs)? To investigatetheseresearchtopics,we hada choiceof benchmarksincluding BG [7, 6, 14],

LinkBench[4], LDBC [11, 2], or amicro-benchmarksuchas[3] and[16]. After acarefulanalysis,wedecidedto use

BG for two reasons.First, BG is a statefulbenchmarkthatquantifiesboth the averageresponsetime of a datastore

andits throughputgiven a pre-specifiedservicelevel agreement(SLA). The latter is termedSocialAction Rating,

SoAR[7], andis similar to thetpsrating1 definedby theTPC-Cbenchmark[12, 15]. As reportedin SectionC.3,an

RDBMS may provide an averageresponsetime that is fasterthanNeo4j for someactionswhile Neo4joutperforms

the RDBMS when consideringSoAR with certaindatabasesettings. Second,BG quantifiesthe amountof stale,

inconsistent,or invalid data(collectively, termedunpredictabledata [7, 8]) producedby a datastore. This is useful

becausecertainsocialnetworking actionssuchascomputingtheshortestdistancebetweentwo membersmayutilize

heuristicsearchtechniquesthatdonot producecorrectresults,seediscussionsof Figure4 in SectionC.

The primary contribution of this study are two folds. First, it identifiesfour physicalgraphdatadesignsfor

processinginteractivesocialnetworkingactions,seeFigure3. Second,it evaluatesthesedesignsusingtheNeo4j[22]

datastoreandthe BG benchmark.This includesextensionsof BG with the following threegraphorientedactions:

GetShortestDistance,List CommonFriends,andList Friends-of-Friends.Themainfindingsof our evaluationareas

follows. TheDistinctphysicalgraphdesignprovidesasuperiorperformancewhencomparedwith theLabeleddesign.

With the threenew graphorientedactions,an industrialstrengthrelationaldatabasemanagementsystem(SQL-X)

providesfasterresponsetimesthanNeo4jconfiguredwith a variantof theDistinct designnamedStoredDistinct(see

descriptionof Figure3 for details).Onereasonfor this is thenormalizationguidelineof therelationaldatamodelthat

representsamany-to-many friendshiprelationshipasatable.Thisenablesthegraphorientedactionsto fetchasmaller

amountof datafrom a singletableto provide fasterresponsetimes. With a workloadconsistingof a mix of actions,

SQL-X providesa higherSoARthanNeo4jwhenthesocialgraphconsistsof no images.Whenlargeprofile images

arestoredin SQL-X, Neo4jprovidesa higherSoARthanSQL-X.1SoARis differentthantps in that theSLA canbechangeddependingon therequirementsof anapplicationwhile TPC-C’s specifiedSLA is

fixed.

2

Therestof this paperis organizedasfollows. We survey the relatedwork in SectionB. SectionC describesan

implementationof theBG benchmarkusingNeo4j,detailingfour physicalgraphdatadesignsandtheir performance

characteristicsfor differentmix of actions.SectionC.3quantifiesthetradeoffs associatedwith agraphandarelational

datadesign.Our futureresearchdirectionsarecontainedin SectionD.

B RelatedWork

Evaluationof graphdatastoreshasbeena subjectof active researchduringthepastfew years.Theaverageresponse

time of differentactionsof a microbenchmarkis presentedin [3] to comparetwo graphdatabases(Neo4j andDex)

with a RDF store(RDF-3X) andtwo relationaldatabasemanagementsystems(PostgreSQLandVirtuoso).Similarly,

in [16], the responsetime of several social networking actionsis usedto comparethe performanceof alternative

graphquerylanguagesusingNeo4jwith JavaPersistentAPI (JPA) usingtheMySQL relationaldatabasemanagement

system.BothstudiesconsiderNeo4jdeployedin eitherembeddedor aclient-server (REST)mode.

Thisstudyis differentthan[3, 16] alongtwo dimensions.First,wefocusonNeo4jCypherRESTto investigatethe

alternativephysicaldesignsof a socialgraph,seethetaxonomyof Figure2 andits discussionin SectionC.1. Second,

weusetheBG benchmarkto analyzeboththeaverageresponsetimeandSoARof thedifferentdesigns.Thisanalysis

includesboth readandwrite actions.(Both [3, 16] focuson readactionsonly.) A key findining is thata designthat

providesa high performancewith infrequentwrite actionsmaynot performwell whenthefrequency of write actions

is higher, seeTable4 andits discussionin SectionC.2. A novel featureof BG is its ability to quantifytheamountof

erroneousdataproducedby a datastore. We usethis capabilityof BG to show that onemay tradeperformancefor

accurracy of resultswith an actionsuchasGet ShortestDistance.To the bestof our knowledge,thesefindingsare

novel andhavenot beenpresentedelsewhere.

C BG benchmark and its implementation using Neo4j

Figure1 showstheconceptualdesignof BG’ssocialgraphusedfor thisevaluation.(See[6, 7, 9] for acomprehensive

descriptionof BG.) The Membersentity set containsthoseuserswith a registeredprofile. It consistsof a unique

identifier anda fixed numberof string attributes2. Onemay configureBG to createa socialgraphwith or without

images.In this paper, weconsiderbothpossibilities.With images,all experimentalresultsareobtainedusingasocial

graphconfiguredwith a 2 KB thumbnailimageanda 12 KB profile image. Thumbnailimagesaredisplayedwhen

listing friendsof a memberandthe higherresolutionprofile imageis displayedwhena membervisits a profile. A

membermayextendafriend invitationto anothermemberor befriendswith amember, representedusing“Invite” and

“Friend” relationshipsets,respectively. A resourcemaypertainto animage,apostedquestion,atechnicalmanuscript,

etc.Theseentitiesarecapturedin onesetnamed“Resources”.In orderfor a resourceto exist, a membermust“Own”2Thesizeof theseattributesis configurable[6].

3

Figure1: BG benchmark’sconceptualschema.

thatresource.A membermaypostaresource,sayanimage,ontheprofileof anothermember, representedasa“Posted

on” relationshipbetweentwo membersanda resource.A membermaycommenton a resource.This is implemented

usingthe“Manipulation” relationshipset.

BG usesa closedemulationmodelto generatea workloadof actionsfor a datastore. With this model,a thread

emulatesa MemberA who performsan actionon anothermemberor resource.This memberwho is performing

the actionis termeda socialite. A threaddoesnot emulateanothersocialiteuntil the pendingactionof the current

socialiteis processed.BG controlstheloadimposedon adatastoreby varyingthenumberof threadsusedto emulate

concurrentsocialitesperformingactions,see[7, 6] for details.

Figure2 shows four differentgraphrepresentationsof this conceptualdatamodel.We describethesealternatives

whenpresentingthedifferentactionsthatconstitutethecoreof BG’s workload.This discussionpresentstheaverage

responsetime ( ��� ) andSocial Action Rating (SoAR) of the alternative graphmodelsusing a single nodeNeo4j

deployment. ��� is quantifiedwith BG emulatinga singlesocialiteissuinga mix of actionsby issuingoneactionat

a time. It is theaverageamountof time elapsedfrom whena socialiteissuesa requestto the time Neo4jcompletes

servicingthe request.SoAR is the highestthroughputobservedwith a servicelevel agreement(SLA) that requires

95%of actionsto observea responsetimeof 100millisecondsor fasterwith no staledata.

The targethardwareplatformconsistsof two PCsconnectedusinga Gigabit switch. EachPC consistsof an i7-

4770processor, 16 GB of memory, oneTB of disk storage,anda Gigabit networking card. The operatingsystem

of eachPC is a 64 bit Windows 2012Server. The versionof Neo4j server is 2.0.1andwe usedNeo4j’s Cypher3

querylanguageto implementthe Client that performsthe interactive socialnetworking actions(termedBGClient).

All experimentsassumea social graphconsistingof 100,000memberswith 100 friendsper member( � ) and100

resourcespermember( ).We classifyBG’s actionsinto readandwrite. Below, wepresentthemin turn.

3Cypheris adeclarative languagesimilar to SQL.

4

Figure2: Fourphysicalgraphrepresentationsof BG’s database.

5

Figure3: Four physicalgraphdesigns.

C.1 BG’s ReadActions

BG’s actionsandtheir graphimplementationareasfollows. First, theView Profile (VP) actionemulatesa Socialite

with memberid A visiting theprofileof amemberwith id �� . BG generatesA and �� asinput to VP. A mayequal�� ,emulatinga socialitereferencingherown profile. Theoutputof VP is theprofile informationof �� , including �� ’sattributesandthefollowing two simpleanalytics: � ’snumberof friendsandnumberof postedresourcesonherwall.

If thesocialiteis referencingherown profile (A equals � ) thenVP retrievesa third simpleanalytic, � ’s numberof

pendingfriend invitations.

Theobservedsystemperformancewith theVP actiondependsonthephysicalrepresentationof thegraphdatabase.

Figure 3 shows four differentphysicalrepresentationsusinga two dimensionalquad,seealso Figure 2. The two

dimensionscorrespondto thealternative representationsof thesimpleanalyticsandfriendship.Onemayimplement

thesimpleanalyticsusingaCypherquerythatcomputestherequiredvalueeverytime,seethefirst columnof Table2.

Alternatively, onemay storethe valueof thesesimpleanalyticsandupdatethemin the presenceof write actions,

enablingtheVP actionto simply look up thestoredvalue,seethelastcolumnof Table2. Thesetwo alternativesare

termed4 ComputeandStored, respectively.

With the friendshiprelationship,onemay representpendingfriend invitationsandthe confirmedfriendshipsas

uniqueedges(relationships)independentof oneanother. This designis termedDistinct friendship.Alternatively, one

may representbothasoneedgeandlabel the edgeto identify eithera pendinginvitation or a confirmedfriendship.

This designis termedLabeledfriendship. Thesetwo alternativesconstitutethe rows of Figure3, resultingin four

physicalgraphdesignsshown in thequad.

Thefirst row of Table1 shows theaverageresponsetime, ��� , observedwith the alternative designsfor the VP

action. TheStoredDistinctis clearly thefastestof thealternatives. Its SoARwith VP is morethantwice higherthan4They aretermedBasicandManualin [10] with a relationalandJSONrepresentationof BG socialgraph.

6

ComputeLabeled ComputeDistinct StoredLabeled StoredDistinctView Profile(VP) 308 93 12 8List Friend(LF) 435 293 520 313

Table1: ��� , in milliseconds,for thealternativephysicalgraphrepresentationsusingNeo4jwith a100Ksocialgraph,� =100friendspermember, =100resourcespermember.

Compute,seethefirst row of Table4.

TheList Friend (LF) actionof BG emulatesa socialiteA viewing member � ’s list of friends.Similar to thedis-

cussionof VP, A mayequalto � emulatingthesocialiteviewing herown list of friends.LF retrievestheprofile infor-

mationof eachfriend includingtheir thumbnailimageandexcludingtheir profile image.We implementLF usingthe

following Cypherquery: MATCH (u1:Members)-[f:Friend]- (u2:Members) WHERE u1.userid

= � AND f.status=Confirmed RETURN u2.userid, u2.username, u2.fname, u2.lname,...,

u2.thumbnail.

Table1 showsrepresentationof afriendshipasadistinctedgeis fasterthanusinglabelededges.With thelatter, the

querymustincur theadditionaloverheadof examiningthevalueof eachlabel(pendingversusconfirmedfriendship)

to processtheLF action.However, thealternativedesignsprovidecomparableSoAR,seethesecondrow of Table4.

TheGet ShortestDistance(GSD)actionof BG computesthedistancebetweentwo membersin thesocialgraph.

If thesetwo membersare the sameuser then their shortestpath is zero. If they are friends then their shortest

path is one. If they belongto two disjoint social graphsthen their shortestpath is MAX-INT . The Cypherquery

to implementGSD is: MATCH p=shortestPath((u:Members)-[:Friend*.. depthToTraverse]

-(u2:Members)) WHERE u.userid= �� and u2.userid= �� RETURN length(p) as total. The

parameterdepthToTraverse definesthenumberof levels(termeddepth)of friendshiprelationshiptraversedby

the shortestPath function of Neo4j, striking a balancebetweenthe observedresponsetimesandthe accuracy of the

computedvalue. Increasingdepthmay enhancethe accuracy of GSD andslow down its processing,resultingin a

higherresponsetime.

Figure4.ashow theaverageresponsetimeof GSDasafunctionof thedepthtraversedwith 10and100friendsper

member. BG quantifiesthepercentageof GSDactionsthatobserveincorrectresults,termedunpredictabledata[7], � .Figure4.bshows thepercentageof GSDrequeststhatobserveaccurateresults,termedAccuracy (100-� ), asfunction

of the depthwith differentnumberof friends per member. As we increasethe traverseddepthon the x-axis, the

computeddistancebecomesmoreaccurate(i.e., � decreases[7]) andthesystembecomesslower astheshortestPath

function visits many morevertices. A sufficiently high depthvaluecausesthe shortestPath to visit all verticesand

terminate,producing100%accurateresults.Theresponsetime level off beyondthis depth.

More formally, theresponsetime levelsoff whenthedepthtraversedmultiplied by thenumberof friendsequals

the total numberof members,resultingin 100%accurateresults. For example,in Figure4.a,with the 100K social

graphand100 friendsper member, the responsetime levelsoff at a deptof 1,000. It levelsof at a depthof 10,000

7

DataModel Query

ComputeLabeled

a.MATCH (u:‘Members’)-[f:‘Friend’]-(uu:‘Members’)WHERE u.userid=profileOwnerID AND f.status=ConfirmedRETURN COUNT (uu) AS total

b. MATCH (u:‘Members’)<-[f:‘Friend’]-(uu:‘Members’)WHERE u.userid=profileOwnerID AND f.status=PendingRETURN COUNT (uu) AS total

c. MATCH (u:‘Members’)<-[c:‘Postedon’]- (r:‘Resources’)WHERE u.userid=profileOwnerID RETURN COUNT(r) AS total

d. MATCH (u:‘Members’) WHERE u.userid = profileOwnerIDRETURN u.userid, u.username, u.lname, u.fname,u.gender, u.dob, u.jdate, u.ldate, u.address,u.email, u.tel, u.pic

ComputeDistinct

a.MATCH (u:‘Members’)-[f:‘Friend’]-(uu:‘Members’)WHERE u.userid=profileOwnerIDRETURN COUNT (uu) AS total

b. MATCH (u:‘Members’)<-[f:‘Invite’]-(uu:‘Members’)WHERE u.userid= profileOwnerIDRETURN COUNT (uu) AS total

c. MATCH (u:‘Members’)<-[c:‘Postedon’]-(r:‘Resources’)WHERE u.userid= profileOwnerIDRETURN COUNT(r) AS total

d. MATCH (u:‘Members’) WHERE u.userid = profileOwnerIDRETURN u.userid, u.username, u.lname, u.address, u.gender,u.dob, u.jdate, u.ldate, u.fname, u.email, u.tel, u.picMATCH (u:‘Members’) WHERE u.userid = profileOwnerID

StoredLabeled/ RETURN u.userid, u.username, u.lname, u.fname, u.gender,StoredDistinct u.dob, u.jdate, u.ldate, u.address, u.email, u.tel,

u.friendsCount, u.pendingfCount, u.resourcesCount, u.pic

Table2: CypherqueriesthatimplementtheView Profileactionwith four differentdatamodels.

8

Figure4a: AverageResponseTime( ��� ).

Figure4b: Accuracy.

Figure4: Averageresponsetimeandaccuracy of GSDasafunctionof thetraverseddepthwith a100Kmembersocialgraphandtwo differentsettingsfor thenumberof friendspermember( � = 10 and100).

with 10 friendspermember. Thefirst row of Table3 shows theobservedresponsetime with a depthof 20,000with

differentnumberof friendspermember, � . With thisdepth,GSDprovides100%accurateresulsandits responsetime

levelsoff with all three � values.

With a fixeddepthfor theshortestPathfunction,theresponsetime is fasterwith fewer friendspermemberasthis

functionvisits fewer vertices.Hence,its accuracy is alsolower. To illustrate,considera depthof 100on thex-axis

of Figure4. Theobservedresponsetime with 10 friendspermemberis six time faster, 100versus600milliseconds.

Moreover, theaccuracy is significantlylower, 7% versus25%,asits traversalof eachdepthvisits fewer vertices(10

timeslower) andits likelihoodof visiting thevertex of interestis lower. Thefirst row of Table3 shows theresponse

time increasesasa functionof � asGSDmustprocessmany moreedges.

TheView Friend Request(VFR) actionof BG retrievesSocialiteA’spendingfriendrequest,retrieving theprofile

informationof eachmemberwho hasgenerateda friend requestfor memberA. Thebehavior of VFR with Neo4j is

similar to thediscussionof LF.

A socialiteusesthe View Commentson Resource (VCR) actionto display the attributesof commentsposted

ona resourcewith auniqueRID. Its Cypherqueryis asfollows: MATCH (u:Members)-[m:Manipulation]-

>(r:Resources) WHERE r.rid=RID RETURN u.userid, r.rid, m.mid, m.type, m.content,

m.timestamp. Thesocialitemaypostanddeletecommentson a resource(PCRandDCR) thatcreatesanddeletes

edgesbetweena memberanda resourcevertex, respectively.

The View Top-K Resources(VTR) enablesa socialite(MemberA) to retrieve anddisplayher top � resources

postedon herwall. Both thevalueof � andthedefinitionof “top” areconfigurable.Our Cypherimplementationuses

9

�=10

�=100

�=1,000

GetShortestDistance(GSD) 402 2,733 41,027List CommonFriends(LCF) 2,120 4,368 34,630List Friends-of-Friends(LFF) 12 212 7,939

Table3: ��� , in milliseconds,of theStoredDistinctphysicalgraphdesignasa functionof thenumberof friends( � )with a 100Ksocialgraphand =100resourcespermember.

the uniqueid assignedto a resource(rid) asthe definition of top: MATCH (u:Members)<-[cf:PostedOn]-

(r:Resources) WHERE u.userid=A ORDER BY r.rid LIMIT k.

The List Common Friends (LCF) actioncomputesthe commonfriendsof two members.If thesetwo mem-

bersarethe samememberthen their commonfriends is an emptyset. If they arefriendsthenLCF retrievestheir

commonfriendsexcluding themselves. Otherwise,if their distanceis threeor higher, then the result is an empty

set. The setis definedasthe memberswho area distanceof onefrom both members.Match (u1:Members),

(u2:Members),(mf:Members) WHERE u1.userid= �� AND u2.userid= �� AND (u1)-[:Friend]

-(mf)-[:Friend]-(u2) RETURN mf.userid. Theresponsetimeof LCF increasesasafunctionof thenum-

berof friendspermember, � . (Seethe secondrow of Table3.) At times,the resultof the LCF actionmight be the

emptysetasits input membersmayhaveno commonfriends.Thelikelihoodof this is lowerwith highervaluesof � ,

explainingthehigheraverageresponsetime.

The List Friends-of-Friends (LFF) actioncomputesthosememberswho area distanceof two from the spec-

ified member, including their commonfriends. The Cypherquery to implementthis action is as follows: MATCH

(u1:Members)- [:Friend *2..2]-(u2:Members) WHERE u1.userid= � and NOT (u1)-[:Friend]-

(u2) RETURN distinct u2.userid. The third row of Table3 shows the responsetime of the LFF action

increasessuperlinearlyasa functionof � . With LFF, a tenfold increasein thevalueof � resultsin a tenfold increase

in the numberof retrieveduserids.More precisely, givenM members,BG constructsthe socialgraphby assigning

members(i+j)%M asfriendsof Member� wherethevalueof j variesfrom 1 to5 � � . Hence,LFF retrieves ��� userids.

For example,with � =10 and100, LFF retrieves20 and200 members,respectively. While this explainsthe higher

responsetime asa function of � , thereappearsto be additionaloverheadthat causesthe responsetime of Neo4j to

increasesuperlinearly.

C.2 BG’s Write Actions

BG supportsfour write actionsthat impact the friendshiprelationship(edges)betweenmembers(vertices). These

are Invite Friend (IF), Accept Friend Request(AFR), Reject Friend Request(RFR), and Thaw Friendship(TF).

All involve SocialiteA invoking the actionon Member � . Theseactionsmodify either the presenceof edgesor

the attribute valueof an edgebetweenvertices. For example,the Cyphercreate edge commandfor the IF ac-5Thetoruscharacteristicsof themodfunctionguarantees� friendspermember.

10

Workload ComputeLabeled ComputeDistinct StoredLabeled StoredDistinctView Profile(VP) 971 714 2,205 2,251List Friend(LF) 93 119 112 118

0.1%Write Actions 117 459 819 8351%Write Actions 46 369 435 49910%Write Actions 32 162 0 100

Table4: SoARof thefour physicalgraphmodelswith workloadsconsistingof VP only, LF only, anda mix of readandwrite actions.

tion with thelabeleddesignis asfollows: MATCH (u1:Members), (u2:Members) WHERE u1.userid=A

AND u2.userid= � CREATE (u1)-(:Friend � status:pending � )->(u2).

With theStoredrepresentations,thesewrite actionsmustmaintainthesimpleanalyticsattributevaluesof avertex

(member)upto date.For example,theAFR actionmustincrementthenumberof friendsof theverticescorresponding

to MembersA and �� . Moreover, it mustdecrement6 thenumberof pendingfriend invitationsfor MemberA.

Table4 shows the SoAR of the alternative physicalgraphdesignsfor a mix of readandwrite actions.The first

columnincreasesthe frequency of the write actionssuchasInvite FriendandThaw Friendship,seeTable5. This

reducesthe SoAR of all designsshown in Figure 2. With a mix consistingof 10% write actions,computingthe

analyticsof the View Profile actionprovidesa higherperformancethanthe storeddesignsdueto their overheadto

maintainthevalueof simpleanalyticsup to date.

Representingpendingandconfirmedfriendshiprelationshipswith uniqueedgesprovidesa higherperformance

whencomparedwith labelededges,compareComputeDistinctandComputeLabeledcolumnsin Table4. Both slow

down asa functionof anincreasingmix of write actions.With ComputeDistinct,whena memberconfirmsa pending

friendship invitation, the systemdeletesan edgeand insertsa new one. With ComputeLabeled,the sameaction

changesthevalueassociatedwith apropertyof anedge.Thisconsumesmoreof systemresourceswith ourworkloads,

resultingin a lowerSoAR.

C.3 Comparison with SQL-X

This sectioncomparesthe performanceof an industrialstrengthrelationaldatabasemanagementsystem(RDBMS)

named7 SQL-X with Neo4jusingBG. Theschemausedfor theRDBMS to representthesocialgraphis asfollows:

� Users(userid, username,pw, fname,lname,gender, dob, jdate,ldate,address,email, tel, profileImage,thumb-

nailImage,#Friends,#FriendInvitations,#Resources)

� Friendship(inviter, invitee, status)

� Resource(rid,creatorid,walluserid, type,body, doc)6BG is astatefulbenchmarkthatgeneratesvalid actionsonly. Whenit invokestheAFR actioninvolving MemberA and ��� , it doessobasedon

its knowledgeof � � having apendingfriend invitation from A. See[7] for details.7Dueto licensingagreement,wecannotdisclosetheidentity of this system.

11

BG VeryLow Low HighSocial Type (0.1%) (1%) (10%)Actions Write Write Write

View Profile,VP Read 40% 40% 35%List Friends,LF Read 5% 5% 5%View FriendRequests,VFR Read 5% 5% 5%Invite Friend,IF Write 0.04% 0.4% 4%AcceptFriendRequest,AFR Write 0.02% 0.2% 2%RejectFriendRequest,RFR Write 0.02% 0.2% 2%Thaw Friendship,TF Write 0.02% 0.2% 2%View Top-K Resources,VTR Read 40% 40% 35%View Commentsona Resource,VCR Read 9.9% 9% 1%

Table5: Threemixesof socialnetworkingactions.

Algorithm GSD(USERID1, USERID2, MAXDEPTH):If UserID1 equals UserID2 return 0If MaxDepth equals 0 return MAX-INTInitialize Visited ���Initialize SRC � UserID1 �Initialize CurrentDepth 0While (true):(1) CurrentDepth CurrentDepth+1(2) If (CurrentDepth ! MaxDepth) return MAX-INT(3) If (Visited contains all members) return MAX-INT(4) Qry "SELECT unique inviteeid FROM Friendship WHERE "(5) For each userid in SRC:

Extend Qry with the clause "inviterid=userid"using boolean or connective

(6) Visited Visited " SRC(7) Execute Qry using RDBMS to obtain a result set �(8) If (UserID2 #$� ) return CurrentDepth(9) SRC = � - ( �&% Visited)(10)If (SRC is empty) return MAX-INT

Figure5: GetShortestDistanceusingSQL-X.

� Manipulation(mid,rid, modifierid,creatorid, timestamp,type,content)

Underlinedattributesareindexedandserve astheprimarykey of a table. An italicizedattributerepresentsa foreign

key relationship.A confirmedfriendshipbetweentwo membersis representedastwo rows.

Exceptfor the LCF andthe GSD actions,an implementationof BG’s actionsusingthe SQL query languageis

straightforward anddescribedin [7, 10, 6]. We implementLCF(A,B) usinga singlequery: SELECT DISTINCT

f1.inviteeid FROM Friendship f1, f2 WHERE f1.inviteeid=f2.inviteeid and

f1.inviterid=A and f2.inviterid=B and f1.status=Confirmed and f2.status=Confirmed.

Figure5showsanimplementationof theBreadthFirstSearch(BFS)algorithmto implementGSDusingtheSQLquery

language.ThisalgorithmissuesaSQLqueryfor eachlevel of BFSstartingwith onememberof thesocialgraph,iden-

tified by UserID1. It terminatesonceit encounterstheothermemberof thesocialgraph(UserID2),exhaustsall the

12

SQL-X Neo4jGetShortestDistance(GSD) 718 2,588List CommonFriends(LCF) 14 4,317List Friends-of-Friends(LFF) 26 163

Table6: ��� in millisecondswith maximumdepth=1,000.

With Images No ImagesSQL-X Neo4j SQL-X Neo4j

0.1%Write Action 360 835 20,550 1,4601%Write Action 290 499 16,135 68810%Write Action 0 100 2,095 150

Table7: SoARwith 100Kmembers,� =100fpm, and =100rpm,with andwithout images.

membersof thesocialgraph,or exceedsits maximumalloweddepth.

Table6 shows theaverageresponsetime of GSD,LCF, andLFF with SQL-X andNeo4j for a socialgraphcon-

sistingof 100Kmembers,100fpm, and100rpm. SQL-X is fasterthanNeo4jfor processingeachof thesecommands.

An SQL implementationof thesecommandsreferencea singletable,Friendship,that is a vertical slice of the data.

For example,TheGSDalgorithmof Figure5 queriestheFriendshiptablerepeatedlyin Step5. Neo4j,on theother

hand,mayretrieve a vertex thatcontainsseveralpropertyvaluesof a memberincludinga 12 KB profile image. It is

possibleto furtherenhancethereportedGSDnumberswith SQL-X by implementingthealgorithmof Figure5 asa

storedprocedure.

Table7 shows theobservedSoARwith SQL-X andNeo4j (usingtheStoredDistinctdesign,seeFigure3) for the

threemix of write actionsshown in Table5. We considera BG databaseconfiguredwith eitherimagesor no images.

The latter lacks the 12 KB profile imageandthe 2 KB thumbnailimage. With both, the schemaof SQL-X stores

the simpleanalyticsof a memberasan attribute valueof a row andrequiresa write actionto maintainthesevalues

up-to-date[10].

SQL-X performspoorly whenrequiredto storeimageslarger than4 KB [19, 10] andNeo4j outperformsit by

a wide margin. With a socialgraphthat hasno images,SQL-X outperformsNeo4j by a wide margin, seelast two

columnsof Table7. SoAR of Neo4j is alsoenhancedwhenthe socialgraphhasno images.In [10], we show that

storingprofile imagesin thefile system,termedBoostedSQL-X design,enhancestheSoARof SQL-X by morethan

tenfolds. A futureresearchdirectionis to analyzeNeo4jwith imagesstoresin thefile system(similar to thediscussion

of BoostedSQL-X). We speculateits performanceto fall betweenthetwo extremesshown in Table7.

D Futur e Research Dir ection

We areextendingthis studyby consideringadditionalgraphdatastores,characterizingtheir scalabilityandtheir role

in processingmorecomplex socialnetworking actions.We describethesein turn.

13

WeareusingBG to completeanevaluationof Neo4jandothergraphdatabasessuchasG* [18] andOrientDB[23].

This includesananalysisof their scalabilitycharacteristicsanda comparisonwith datastoresthatsupportalternative

datamodels,e.g., documentstores,extensiblestores,key-valuestoresand relationalDBMSs. We also intend to

analyzetheoverheadof anObjectGraphModel(OGM) suchasBlueprintwhencomparedto usingthenativeinterface

of a graphdatastore[16].

Moreover, weintendto investigatealternativephysicalgraphdesignsfor processingmorecomplex socialnetwork-

ing actions,namely, feedfollowing actionssuchasShareResource(SR)andView New Feed(VNF) [9]. Thesemodel

a memberproducingeventsfor consumptionby othersanddisplayingtheeventsgeneratedby othermembersanden-

tities, typically their friendsor thosethatthey follow. Both thehighly variablefan-outof thefollowsgraphalongwith

its dynamicallychangingstructure(e.g.,a memberthaws friendshipwith anothermember)makesanimplementation

of feedfollowing challenging[20, 9]. Onemayintroducedifferentdesignsandimplementationsto addressthesechal-

lenges[21, 17, 5]. Oneis to materializethefeedof amemberandmaintainit up to datewhennew eventsareproduced

by thoseshefollows[21]. A graphdatabasesuchasNeo4jmaybesuitablefor thisPushparadigmbecauseit supports

extensionsof a vertex with new attributes.A designmaysplit a vertex into multiple verticesonceit increasesbeyond

a certainsize[13]. Finally, edgesmay maintainthe relationshipbetweenolder andnewer feedasa member’s feed

grows in size.An alternative to Pushis to Pull eventsandmayincludeclever designsthatsynergizesthosemembers

with mutualfriendsby maintainingonenews feedfor them.We planto investigatethesealternative implementations

with Neo4jandothergraphdatastores.

References

[1] Z. Amsden,N. Bronson,G. CabreraIII, P. Chakka,P. Dimov, H. Ding, J.Ferris,A. Giardullo,J.Hoon,S.Kulka-

rni, N. Lawrence,M. Marchukov, D. Petrov, L. Puzar, andV. Venkataramani.TAO: How FacebookServesthe

SocialGraph.In SIGMODConference, 2012.

[2] R. Angles, P. Boncz, J. Larriba-Pey, I. Fundulaki,T. Neumann,O. Erling, P. Neubauer, N. Martinez-Bazan,

V. Kostev, andI. Toma.TheLinkedDataBenchmarkCouncil:A GraphandRDFIndustryBenchmarkingEffort.

SIGMODRec., 43:27–31,March2014.

[3] R. Angles,A. Prat-Perez,D. Dominguez-Sal,andJ. Larriba-Pey. BenchmarkingDatabaseSystemsfor Social

Network Applications. In First InternationalWorkshopon GraphData ManagementExperiencesandSystems,

GRADES’13, 2013.

[4] T. Armstrong,V. Ponnekanti,D. Borthakur, andM. Callaghan.LinkBench: A DatabaseBenchmarkBasedon

theFacebookSocialGraph.ACM SIGMOD, June2013.

[5] X. Bai, F. P. Junqueira,andA. Silberstein.CacheRefreshingfor OnlineSocialNewsFeeds.In CIKM, 2013.

14

[6] S. Barahmand.BenchmarkingInteractive SocialNetworking Actions,Ph.D.thesis,ComputerScienceDepart-

ment,USC,2014.

[7] S. BarahmandandS. Ghandeharizadeh.BG: A Benchmarkto EvaluateInteractive SocialNetworking Actions.

Proceedingsof 2013CIDR, January2013.

[8] S. BarahmandandS. Ghandeharizadeh.BenchmarkingCorrectnessof Operationsin Big DataApplications.

Proceedingsof IEEE MASCOTS, 2014.

[9] S. BarahmandandS. Ghandeharizadeh.Extensionsof BG for TestingandBenchmarkingAlternative Imple-

mentationsof FeedFollowing. ACM SIGMODWorkshopon ReliableData ServicesandSystems(RDSS), June

2014.

[10] S. Barahmand,S. Ghandeharizadeh,andJ. Yap. A Comparisonof Two PhysicalDataDesignsfor Interactive

SocialNetworking Actions. In CIKM, 2013.

[11] P. Boncz.LDBC: Benchmarkfor GraphandRDFDataManagement.IDEAS, October2013.

[12] TransactionProcessingPerformanceCouncil.TPCBenchmarks,http://www.tpc.org/information/benchmarks.asp.

[13] R. Nishtalaet.al. ScalingMemcacheat Facebook.NSDI, 2013.

[14] S.GhandeharizadehandS.Barahmand.A Mid-Flight Synopsisof theBG SocialNetworkingBenchmark.Fourth

WorkshoponBig Data Benchmarking, October2013.

[15] J. Gray. The BenchmarkHandbookfor DatabaseandTransactionSystems(2nd Edition), MorganKaufmann

1993,ISBN 1055860-292-5.

[16] F. HolzschuherandR. Peinl. Performanceof GraphQueryLanguages:Comparisonof Cypher, Gremlin and

NativeAccessin Neo4J.In Proceedingsof theJoint EDBT/ICDT2013Workshops, EDBT ’13, 2013.

[17] F. P. Junqueira,V. Leroy, M. Serafini,andA. Silberstein.ShepherdingSocialFeedGenerationwith Sheep.In

SNS, 2012.

[18] A. Labouseur, P. Olsen,andJ. Hwang. ScalableandRobustManagementof DynamicGraphData. In VLDB

WorkshoponBig DynamicDistributedData, 2013.

[19] R. Sears,C. V. Ingen, and J. Gray. To BLOB or Not To BLOB: Large Object Storagein a Databaseor a

Filesystem.TechnicalReportMSR-TR-2006-45,MicrosoftResearch,2006.

[20] A. Silberstein,A. Machanavajjhala,andR. Ramakrishnan.FeedFollowing: TheBig DataChallengein Social

Applications.In DBSocial, 2011.

15

[21] A. Silberstein,J.Terrace,B. F. Cooper, andR. Ramakrishnan.FeedingFrenzy:Selectively MaterializingUsers’

EventFeeds.In SIGMODConference, 2010.

[22] TheNeo4jTeam.TheNeo4jManualV2.1.1,May 29,2014,http://www.neo4j.org.

[23] C. Tesoriero.GettingStartedwith OrientDB. PacktPublishingLtd, 2013.

16