Transcript of "The Little Engine(s) That Could: Scaling Online Social Networks"

Page 1:

The Little Engine(s) That Could: Scaling Online Social Networks
Josep M. Pujol, Vijay Erramilli, Georgos Siganos, Xiaoyuan Yang, Nikos Laoutaris, Parminder Chhabra, Pablo Rodriguez
Telefonica Research
From SIGCOMM 2010

Page 2:

Introduction

• Online Social Networks (OSNs) differ from traditional web applications in two ways:
  o They handle highly personalized content.
  o They deal with highly interconnected data, due to the strong community structure among their end users.

• Scaling OSNs is necessary. E.g., Twitter grew by 1382% between Feb and Mar 2009.

• Scaling solutions include "vertical" and "horizontal" scaling:
  o Vertical scaling: upgrading existing hardware. (infeasible)
  o Horizontal scaling: adding cheap commodity servers. (works on stateless, independent data)

Page 3:

Introduction

• OSN data is not independent, and it is hard to partition it into disjoint sets:
  o Most operations in OSNs are based on the data of a user and her neighbors, and most users belong to more than one community.

• Random partitioning is applied in many current OSNs:
  o Inter-server traffic is heavy.

• Replicating users' data:
  o Eliminates the inter-server traffic.
  o Has many side-effects. (storage overhead, query time, update traffic)

Page 4:

Outline

• Introduction
• SPAR And Problem Statement
• SPAR On The Fly
• Measurement
• SPAR System Architecture And Evaluation
• Related Work And Conclusions

Page 5:

SPAR

• The main contribution of this paper.
• A Social Partitioning And Replication middleware for social applications.
• What does SPAR do?
  1. Solves the Designer's Dilemma for early-stage OSNs. (adding features vs. building a highly scalable system)
  2. Avoids performance bottlenecks in established OSNs.
  3. Minimizes the effect of provider lock-in.
• How does SPAR do it?
  o "Local semantics".
  o Joint partitioning and replication.

Page 6:

SPAR

(a) Full replication.
(b) Partition using a DHT (random partitioning).
(c) Random partitioning (DHT) with replication of the neighbors.
(d) SPAR: socially aware partitioning and replication of the neighbors.

Page 7:

Problem Statement

• Requirements:
  1. Maintain local semantics.
  2. Balance loads.
  3. Be resilient to machine failures.
  4. Be amenable to online operations.
  5. Be stable.
  6. Minimize replication overhead.

• Replica: a copy of a user's data.
  o Master replica: serves reads and writes.
  o Slave replica: provides redundancy and "local semantics".

• Formulation:
  o The solution can be formulated as an optimization problem that minimizes the number of required replicas. (cast as an integer linear program)
  o Symbols:
    1. G(V,E): a social graph G with node set V and edge set E.
    2. N = |V|: the number of users. M: the number of servers.

Page 8:

Problem Statement

• Formulation
  o Symbols (continued):
    3. p_ij: a binary decision variable that is 1 if and only if the master of user i is assigned to partition j, 1 ≤ j ≤ M.
    4. r_ij: a similar decision variable for a replica of user i assigned to partition j.
    5. ε_ii′ = 1 if {i,i′} ∈ E, capturing the friendship relationships.
    6. K: the number of slaves required for redundancy.
  o Constraints:
    1) Ensure one master copy per user.
    2) Ensure local semantics.
    3) Ensure load balance.
    4) Ensure the redundancy requirement.
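Using the symbols above, the slide's constraint list can be written out as an ILP. The following is a reconstruction from the definitions on this slide, not a verbatim copy of the paper's formulation (the exact load-balance constant may differ):

```latex
\begin{aligned}
\min \;& \sum_{i=1}^{N}\sum_{j=1}^{M} r_{ij} \\
\text{s.t. }\;
& \sum_{j=1}^{M} p_{ij} = 1 \quad \forall i
   && \text{(one master copy)} \\
& p_{ij} \le p_{i'j} + r_{i'j} \quad \forall j,\ \forall i,i' \text{ with } \varepsilon_{ii'} = 1
   && \text{(local semantics)} \\
& \sum_{i=1}^{N} p_{ij} \approx \tfrac{N}{M} \quad \forall j
   && \text{(load balance)} \\
& \sum_{j=1}^{M} r_{ij} \ge K \quad \forall i
   && \text{(K-redundancy)} \\
& p_{ij},\, r_{ij} \in \{0,1\}
\end{aligned}
```

The local-semantics constraint reads: if user i's master sits on server j, every friend i′ must also have a copy (master or slave) on j.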

Page 9:

Problem Statement

• Graph partitioning algorithms and modularity optimization algorithms could be applied to this problem; however:
  o Most graph partitioning algorithms are not incremental. (offline)
  o Algorithms based on community detection are known to be very sensitive to input conditions. (not stable)
  o Does reducing the number of inter-partition edges directly reduce the number of replicas? Not necessarily. E.g.:
    o Partitions P1 and P2 result in 3 edges between partitions but 5 nodes to be replicated (e to i). Partitions P3 and P4 result in 4 inter-partition edges but only 4 nodes to be replicated (c to f).
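The mismatch between the two objectives is easy to reproduce. Below is a minimal sketch on a constructed toy graph (not the graph from the slide's figure): a dense 2x2 bipartite cluster plus three independent edges. The partition with fewer cut edges needs more replicas, because each replica is counted once per foreign server, not once per cut edge.

```python
from collections import defaultdict

def cut_edges(edges, part):
    """Edges whose endpoints land on different servers."""
    return sum(1 for u, v in edges if part[u] != part[v])

def replicas_needed(edges, part):
    """Slaves for local semantics: one copy of a node on every
    foreign server that hosts one of its neighbors."""
    foreign = defaultdict(set)          # node -> foreign servers needing a copy
    for u, v in edges:
        if part[u] != part[v]:
            foreign[u].add(part[v])
            foreign[v].add(part[u])
    return sum(len(s) for s in foreign.values())

# K2,2 between {u, w} and {v1, v2}, plus a 3-edge matching.
edges = [('u', 'v1'), ('u', 'v2'), ('w', 'v1'), ('w', 'v2'),
         ('a1', 'b1'), ('a2', 'b2'), ('a3', 'b3')]

# P cuts the matching: 3 cut edges, but 6 nodes must be replicated.
P = {n: ('A' if n in {'u', 'w', 'v1', 'v2', 'a1', 'a2', 'a3'} else 'B')
     for n in {'u', 'w', 'v1', 'v2', 'a1', 'b1', 'a2', 'b2', 'a3', 'b3'}}

# Q cuts the dense cluster: 4 cut edges, but only 4 replicas.
Q = {n: ('A' if n in {'u', 'w', 'a1', 'b1'} else 'B') for n in P}
```

Here `cut_edges(edges, P) == 3` with 6 replicas, while `cut_edges(edges, Q) == 4` with only 4 replicas, so minimizing cut edges is not the same as minimizing replicas.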

Page 10:

Outline

• Introduction
• SPAR And Problem Statement
• SPAR On The Fly
• Measurement
• SPAR System Architecture And Evaluation
• Related Work And Conclusions

Page 11:

SPAR On The Fly

• The online heuristic is applied whenever the social network or the set of servers changes. (six events)
  o Node addition, node removal, edge addition, edge removal, server addition, server removal.

• Edge addition: when a new edge is created between nodes u and v:
  1. Check whether both masters are already co-located with each other, or with a slave of the other; if not, continue.
  2. Calculate the number of replicas that would be generated under each of the three possible configurations: 1) no movement of masters; 2) the master of u moves to the partition containing the master of v; 3) the opposite.
  3. Choose the configuration that yields the smallest number of replicas, subject to the constraint of load-balancing the masters.
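The three-way comparison above can be sketched as follows. This is a simplified model of the heuristic, not the paper's implementation: it ignores K-redundancy and actual data movement, and models load balance as a hard per-server cap (`cap` is an assumed parameter):

```python
def slaves_required(adj, part):
    """Slaves implied by local semantics: for each user, one copy on
    every foreign server hosting a neighbor's master."""
    return sum(len({part[m] for m in neigh} - {part[n]})
               for n, neigh in adj.items())

def on_edge_add(adj, part, load, u, v, cap):
    """Pick the cheapest of: leave masters in place, move u to v's
    server, or move v to u's server (skipping moves that overfill)."""
    adj[u].add(v); adj[v].add(u)
    best = None
    for mover, target in ((None, None), (u, part[v]), (v, part[u])):
        trial = dict(part)
        if mover is not None:
            if load[target] + 1 > cap:   # would violate load balance
                continue
            trial[mover] = target
        cost = slaves_required(adj, trial)
        if best is None or cost < best[0]:
            best = (cost, trial)
    return best[1]

# u gains a friend v whose other friends all live on s1: moving v's
# master to s1 drops the slave count to zero.
adj = {'a': {'v'}, 'b': {'v'}, 'u': set(), 'v': {'a', 'b'}}
part = {'a': 's1', 'b': 's1', 'u': 's1', 'v': 's2'}
load = {'s1': 3, 's2': 1}
new_part = on_edge_add(adj, part, load, 'u', 'v', cap=4)
```

In this toy case the heuristic relocates v's master to s1, after which no slaves are needed at all.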

Page 12:

SPAR On The Fly

• Edge addition, three configurations: moving node 6 to server M1 minimizes the total number of replicas. However, that configuration violates the load-balancing condition.

Page 13:

SPAR On The Fly

• Server addition: two choices: 1) force a redistribution of masters from the other servers to the new one; 2) do nothing and wait for new arrivals.
  o Under the first choice, the algorithm:
    1. Selects the least-replicated masters from the M existing servers and moves an equal share of them to the new server M+1.
    2. For each moved master, ensures there is a slave replica in the original server.
    3. Replays the edge-creation event for a fraction of the edges.

• Server removal: the algorithm:
  1. Reallocates the master nodes hosted on that server equally across the remaining M-1 servers.
  o Highly connected nodes, with potentially many replicas to be moved, get to choose their destination server first.
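The two server events can be sketched as below. This is a hedged simplification: it tracks only master placement and an assumed per-node `replica_count`, and omits the slave creation and edge-replay steps described above:

```python
def add_server(part, load, replica_count, new_server):
    """Move an equal share of the least-replicated masters (cheapest
    to relocate) onto the freshly added server."""
    share = len(part) // (len(load) + 1)          # masters the newcomer should host
    movable = sorted(part, key=lambda n: replica_count[n])
    for n in movable[:share]:
        load[part[n]] -= 1
        part[n] = new_server
    load[new_server] = share
    return part

def remove_server(part, load, replica_count, dead):
    """Spread the dead server's masters over the survivors; the most
    replicated (most expensive to move) nodes pick their target first."""
    victims = sorted((n for n in part if part[n] == dead),
                     key=lambda n: -replica_count[n])
    survivors = [s for s in load if s != dead]
    for n in victims:
        target = min(survivors, key=lambda s: load[s])   # least-loaded survivor
        part[n] = target
        load[target] += 1
    del load[dead]
    return part

part = {0: 's1', 1: 's1', 2: 's1', 3: 's2', 4: 's2', 5: 's2'}
load = {'s1': 3, 's2': 3}
rc = {0: 1, 1: 2, 2: 3, 3: 1, 4: 2, 5: 3}
add_server(part, load, rc, 's3')       # s3 takes the two cheapest masters
remove_server(part, load, rc, 's3')    # and gives them back on removal
```

After the addition, each of the three servers hosts two masters; removing s3 again rebalances its two masters across the survivors.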

Page 14:

Outline

• Introduction
• SPAR And Problem Statement
• SPAR On The Fly
• Measurement
• SPAR System Architecture And Evaluation
• Related Work And Conclusions

Page 15:

Measurement

• Datasets
  o Twitter: crawled between Nov 25 and Dec 4, 2008; comprises 2,408,534 nodes, 48,776,888 edges, and 12M tweets.
  o Facebook: the New Orleans Facebook network, collected between Dec 2008 and Jan 2009; includes nodes, friendship links, and wall posts; comprises 59,297 nodes and 477,993 edges.
  o Orkut: collected between Oct 3 and Nov 11, 2006; comprises 3,073,441 nodes and 223,534,301 edges.

• Evaluation Procedure
  o Input: a social graph, the number of desired partitions M, and the desired minimum number of replicas per user's profile K.
  o The partitions are produced by executing each of the algorithms on the input.
  o For the offline algorithms, since local semantics are required, replicas are added as necessary.
  o An edge-creation trace is generated for each dataset.

Page 16:

Measurement

• Metric: replication overhead r0
  o The number of slave replicas that must be created to guarantee local semantics while observing the K-redundancy condition.
  o SPAR naturally achieves a low replication overhead, since this is its optimization objective; the competing algorithms optimize for minimal inter-server edges.
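A minimal sketch of the metric, read as "slaves per user": each user needs a copy on every foreign server hosting a neighbor's master, and at least K slaves overall. (The per-user `max` with K is an assumption about how the two requirements combine; the paper's exact accounting may differ.)

```python
def replication_overhead(adj, part, K):
    """Average number of slave replicas per user, given masters placed
    by `part` and friendships in `adj` (node -> set of neighbors)."""
    total = 0
    for n, neigh in adj.items():
        # foreign servers that must hold a copy of n for local semantics
        needed = {part[m] for m in neigh} - {part[n]}
        total += max(len(needed), K)     # K-redundancy floor
    return total / len(adj)

# Two friends co-located on one server still need K slaves each.
r0 = replication_overhead({1: {2}, 2: {1}}, {1: 'a', 2: 'a'}, K=2)
```

With everything co-located, r0 collapses to the redundancy floor K; scattered neighbors push it above K.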

Page 17:

Measurement

• Comparison of Random vs. SPAR for K = 2.

• Dynamic operations:
  1. Load balance of SPAR
    o Twitter dataset with 128 servers and K = 2.
    o Average replication overhead = 3.69; 75.8% of users have 3 replicas.
    o Aggregate read load of servers: the coefficient of variation (COV) of masters among servers is 0.0019.
    o The COV of writes per server is 0.37.
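The COV numbers above are the standard dispersion measure, standard deviation divided by mean; a COV of 0.0019 for masters means the per-server counts are nearly identical, while 0.37 for writes reflects the skew of user activity. A one-liner suffices to reproduce the metric:

```python
from statistics import mean, pstdev

def cov(xs):
    """Coefficient of variation: spread of xs relative to its mean."""
    return pstdev(xs) / mean(xs)

# Perfectly balanced master counts -> COV of 0.
balanced = cov([1000, 1000, 1000, 1000])
# Skewed write counts -> COV well above 0.
skewed = cov([100, 300])
```
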

Page 18:

Measurement

2. Actions taken by SPAR when an edge-creation event occurs. (K = 2 and 16 servers)

3. The movement cost on the system.

Page 19:

Measurement

• Adding servers:
  1. Wait for new arrivals:
    o Going from 16 to 32 servers by adding one server every 150K new users: overhead = 2.78, versus 2.74 had the system started with 32 servers. COV of the number of masters among servers = 0.004.
  2. Re-distribute existing masters:
    o Doubling the number of servers at once, forcing re-distribution of the masters, and replaying 50% of the edge creations: overhead = 2.82, 2% higher than starting with 32 servers.

• Removing a server:
  o Overhead increases only marginally, from 2.74 to 2.87.
  o Replaying the edge creations brings it to 2.77.

Page 20:

Outline

• Introduction
• SPAR And Problem Statement
• SPAR On The Fly
• Measurement
• SPAR System Architecture And Evaluation
• Related Work And Conclusions

Page 21:

SPAR System Architecture And Evaluation

• The application interacts with SPAR only through the middleware.
• Four components of SPAR: the Directory Service (DS), the Local Directory Service (LDS), the Partition Manager (PM), and the Replication Manager (RM).

Page 22:

SPAR System Architecture And Evaluation

1. Directory Service (DS): handles data distribution; implemented as a DHT.
  o Look-up table of key -> server hosting the user's master.
  o Look-up table of user -> replica servers.

2. Local Directory Service (LDS): contains a partial view of the DS.

3. Partition Manager (PM): runs the SPAR algorithm and performs the following functions:
  o Maps a user's key to its replicas, whether master or slaves.
  o Re-distributes replicas and schedules their movement.
  o Handles adding and removing servers.
  o It is the only component that can update the DS.
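The two DS look-up tables can be sketched as a small class. This is an illustrative in-memory stand-in for what SPAR backs with a DHT; the class and method names are assumptions for illustration, not SPAR's actual API:

```python
class DirectoryService:
    """Toy model of the DS's two look-up tables."""

    def __init__(self):
        self.master_of = {}     # user key -> server hosting the master
        self.replicas_of = {}   # user key -> set of servers with slaves

    def route(self, user):
        # Reads and writes are always served by the master's server.
        return self.master_of[user]

    def servers_with_copy(self, user):
        # Useful for failover: every server holding any replica.
        return {self.master_of[user]} | self.replicas_of.get(user, set())

ds = DirectoryService()
ds.master_of['alice'] = 's3'
ds.replicas_of['alice'] = {'s1', 's5'}
```

A request for alice is routed to s3; if s3 fails, s1 or s5 holds a slave that can be promoted.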

Page 23:

SPAR System Architecture And Evaluation

4. Ensures data consistency while moving and writing.
5. A heartbeat system monitors server failures:
  o Permanent failure: treated as a server removal.
  o Transient failure: handled without triggering the re-creation of neighbors, sacrificing some local data semantics.

• SPAR Evaluation
  o On two data stores: MySQL (RDBMS) and Cassandra (key-value).
  o Uses an open-source Twitter clone called StatusNet.
  o StatusNet's canonical operation: retrieving the last 20 updates for a given user.
  o The testbed consists of a cluster of 16 low-end commodity servers.

1. Evaluation with Cassandra
  o Against original Cassandra with random partitioning.

Page 24:

SPAR System Architecture And Evaluation

• Response-time comparison between SPAR and original Cassandra.
• Multi-gets increase the network load under random partitioning.
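Why multi-gets hurt random partitioning: fetching updates for F friends placed uniformly at random across M servers touches, in expectation, M(1 - (1 - 1/M)^F) distinct servers, whereas SPAR's local semantics answer from one. A back-of-envelope check (the standard occupancy formula, applied here as an illustration rather than a figure from the paper):

```python
def expected_servers_hit(friends, servers):
    """Expected distinct servers contacted by a multi-get over
    `friends` profiles placed uniformly at random on `servers`."""
    # Each server holds at least one friend's profile with
    # probability 1 - (1 - 1/servers)**friends; sum over servers.
    return servers * (1 - (1 - 1 / servers) ** friends)

# One friend touches exactly one server.
one = expected_servers_hit(1, 16)
# A hundred friends on 16 servers touch nearly all of them.
many = expected_servers_hit(100, 16)
```

So on the paper's 16-server testbed, almost every multi-get under random partitioning fans out to the whole cluster, which is the network load the slide refers to.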

Page 25:

SPAR System Architecture And Evaluation

2. Evaluation with MySQL
• Uses Tsung and two servers to emulate users.
• Comparison to full replication:
  o Full replication: 113ms at 16 req/s, 151ms at 150 req/s, 245ms at 320 req/s. (95th percentile of response time)
  o SPAR: 150ms at 2500 req/s. (99th percentile)

• Adding writes:
  o Full replication performs very poorly.
  o SPAR with 16 updates/s: 260ms at 50 read req/s and 380ms at 200 read req/s. (95th percentile)
  o Median response time = 2ms.

Page 26:

Outline

• Introduction
• SPAR And Problem Statement
• SPAR On The Fly
• Measurement
• SPAR System Architecture And Evaluation
• Related Work And Conclusions

Page 27:

Related Work

• This is the first work to address the scalability of the data back-end for OSNs. Other related areas:

• Scaling out:
  o Amazon EC2 and Google AppEngine (stateless).

• Key-value stores:
  o Many OSNs today rely on DHTs and key-value stores. SPAR improves performance over key-value stores.

• Distributed file systems and databases:
  o Ficus and Coda are distributed file systems that replicate files for high availability. Distributed RDBMSs: MySQL Cluster and Bayou.
  o SPAR does not distribute a user's relevant data across servers; it keeps it local via replication.

Page 28:

Conclusions

• The data of users in OSNs is highly interconnected and hard to subject to a clean partition.

• SPAR ensures "local semantics":
  o All relevant data of a user's direct neighbors is co-located on the same server hosting the user.

• "Local semantics" enables:
  o Queries resolved locally on a single server, breaking the dependency between users.
  o Transparent scaling of the OSN at a low cost.
  o Multifold increases in throughput (requests per second) while network I/O is avoided.

• Validation of SPAR:
  o Replication overhead is low.
  o The dynamics of OSNs are handled gracefully.
  o A Twitter-like application was implemented and SPAR evaluated with it.