SDEC2011 Going by TACC

Going by TACC: Beyond Key-Value to Fault-Tolerant Stores with Easily Customizable Semantics
Henk Goosen, CEO [email protected]


Key-value stores are widely used in applications that only require primary-key data access, which is common in many web applications. Because developing an industrial-grade key-value store is expensive, the conventional solution is to use an existing key-value store and layer application semantics on top of the primitives it provides. This approach can lead to inefficiencies, because application-specific semantics often allow optimizations in the implementation of the store.
We present an alternative approach, using the TACC platform to provide a key-value store implementation that is both performant and easily customizable. The TACC programming model separates state from logic: state is stored in a collection of distributed in-memory database instances, while logic is performed by distributed agents that react asynchronously to changes in objects stored in those instances. Agents can selectively subscribe to updates, using a fine-grained hierarchical directory system to mount objects into a local namespace. TACC provides performance comparable to hand-coded C while reducing the source code to a fraction of the size of the C equivalent. We describe the implementation and performance of a scalable, fault-tolerant key-value store built with TACC, pointing out the benefits of TACC's strong, user-defined types and its triggering/notification.
http://sdec.kr/

Transcript of SDEC2011 Going by TACC

Page 1: SDEC2011 Going by TACC

Going by TACC: Beyond Key-Value to Fault-Tolerant Stores with Easily Customizable Semantics

Henk Goosen, CEO [email protected]

Page 2

Key-value stores rule the Web

* Many applications only need primary-key data access
  * Examples: catalogs, shopping carts, web session state
  * No need for the complexity, performance overhead, and lack of scalability of a full database
* Hence: key-value stores are everywhere
  * Dynamo, CouchDB, Cassandra, Project Voldemort, Riak, Redis, memcached, MongoDB, …

Improving key-value stores is important

OptumSoft, Inc. Proprietary and Confidential

Page 3

Key-value stores in practice

* Developing a key-value store from scratch in a conventional language is expensive:
  * scalability, performance, and fault tolerance
* Conventional solution: use an existing key-value store
  * Layer application semantics on top of its get() and put()
* Mismatches between application requirements and the library: either accept them or extensively modify library code

Applications become more complex, and performance suffers

Page 4

TACC provides a different model

* Use a very high-level language to specify the key-value store
* Then customize the store, applying application-specific semantics
* Benefits:
  * Simplifies the application business logic
  * Improves the performance of both store and application

TACC model is better!

Page 5

TACC is an object-oriented, strongly typed language

* User-defined type: a list of attributes (nouns)
  * Read or write attributes (there are no methods/verbs)
* Logic primarily implemented via constraints
  * Imperative code is also supported
* Compact code
  * First-class high-level data types (e.g., queues, hash tables)
  * Several design patterns directly supported in the language (e.g., observer pattern)

Compact code: fewer bugs, quicker to market
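The slides do not show TACC syntax, so the following is only a rough Python analogy of the idea on this slide: a user-defined type is a set of attributes that callers read and write, and every write that changes a value notifies registered observers. The class `AttributeObject` and its `observe` method are hypothetical names, not TACC constructs.

```python
from typing import Any, Callable, List

# Hypothetical analogy, not TACC: an object is only named attributes (nouns).
# Callers read and write attributes; there are no domain methods (verbs).
Observer = Callable[[str, Any, Any], None]  # (attribute name, old, new)


class AttributeObject:
    def __init__(self, **attrs: Any) -> None:
        # object.__setattr__ bypasses notification during construction.
        object.__setattr__(self, "_observers", [])
        for name, value in attrs.items():
            object.__setattr__(self, name, value)

    def observe(self, callback: Observer) -> None:
        """Register an interested party (observer pattern)."""
        self._observers.append(callback)

    def __setattr__(self, name: str, value: Any) -> None:
        old = getattr(self, name, None)
        object.__setattr__(self, name, value)
        if old != value:
            # Only writes that actually change a value notify observers.
            for callback in self._observers:
                callback(name, old, value)


log: List[str] = []
user = AttributeObject(user_id="U1", location=None)
user.observe(lambda attr, old, new: log.append(f"{attr}: {old} -> {new}"))
user.location = "R5"
user.location = "R5"  # same value again: no notification
print(log)  # ['location: None -> R5']
```

In TACC the slide says this pattern is supported by the language itself; here the notification hook has to be written by hand in `__setattr__`.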

Page 6

TACC: efficient development of distributed systems

* Reduce development time by a factor of 2 to 3
* Reduce lines of code by 10x or more
* Eliminate most synchronization and concurrency bugs
* High, predictable performance using optimized code generation
* Fault tolerance built into the model, and easy to implement

TACC is a general-purpose language, focused on distributed systems

Page 7

Stateful remote proxy objects

* Proxy: local copy of data
* Writes are asynchronously copied to SysDB
* SysDB changes are copied to "interested" agents
* R/W access is local, fast
* No remote access exceptions

[Figure: Agents (LR 1, LR 2) and SysDB; collection; object added to collection]

Simple semantics, and fast
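A minimal single-threaded simulation of the proxy behavior listed above, assuming that "asynchronously copied" can be modeled as a queue that SysDB drains later. `SysDB`, `Agent`, `pump()`, and the path `collection/obj1` are hypothetical names; real propagation would run over the network.

```python
from collections import deque
from typing import Any, Deque, Dict, List, Tuple


class SysDB:
    """Hypothetical stand-in for SysDB: state plus a queue of pending writes."""

    def __init__(self) -> None:
        self.state: Dict[str, Any] = {}
        self.inbox: Deque[Tuple["Agent", str, Any]] = deque()
        self.agents: List["Agent"] = []

    def subscribe(self, agent: "Agent") -> None:
        self.agents.append(agent)
        agent.proxy.update(self.state)  # start from a copy of current state

    def pump(self) -> None:
        """Apply queued writes; copy each change to the other agents."""
        while self.inbox:
            writer, key, value = self.inbox.popleft()
            self.state[key] = value
            for agent in self.agents:
                if agent is not writer:
                    agent.on_change(key, value)


class Agent:
    def __init__(self, name: str, db: SysDB) -> None:
        self.name = name
        self.db = db
        self.proxy: Dict[str, Any] = {}  # local copy of the data
        self.notifications: List[Tuple[str, Any]] = []
        db.subscribe(self)

    def write(self, key: str, value: Any) -> None:
        self.proxy[key] = value                   # local write, no waiting
        self.db.inbox.append((self, key, value))  # propagated later by pump()

    def read(self, key: str) -> Any:
        return self.proxy.get(key)                # always local, never remote

    def on_change(self, key: str, value: Any) -> None:
        self.proxy[key] = value
        self.notifications.append((key, value))


db = SysDB()
lr1, lr2 = Agent("LR 1", db), Agent("LR 2", db)
lr1.write("collection/obj1", {"user": "U1"})
print(lr2.read("collection/obj1"))  # None: not yet propagated
db.pump()
print(lr2.read("collection/obj1"))  # {'user': 'U1'}
```

Because reads never leave the agent, there is no remote call that could fail, which matches the "no remote access exceptions" point; the price is that a reader may briefly see an older value.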

Page 8

SysDB: a hierarchical in-memory object database

* Stores state (ideally no logic)
  * Minimizes risk of program logic bugs, hence reliable
* Concise specification of user-defined types
* TACC compiler automatically generates all required code for remote access
* Agents receive automatic notification when values change

[Figure: Agents and SysDB]

Page 9

Distributed, hierarchical name space

* SysDB defines and exports a hierarchical name space (similar to a distributed file system)
* Remote agents can "mount" remote directories into a local namespace
* Each object is instantiated into a directory; its state is made available remotely via proxy objects
* Updates propagate asynchronously; notifications are delivered on changes

Simple, powerful, proven way to provide a large, structured name space
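To make "mounting" concrete, here is a small sketch assuming paths with "/" separators and a callback per mount: an agent receives a snapshot of the directory it mounts and is notified only about changes beneath it. `NameSpace`, `mount()`, `put()`, and the paths are hypothetical, not the TACC API.

```python
from typing import Callable, Dict, List, Tuple

Listener = Callable[[str, object], None]  # (path, new value)


class NameSpace:
    """Hypothetical path-keyed store standing in for SysDB's name space."""

    def __init__(self) -> None:
        self.objects: Dict[str, object] = {}
        self.mounts: List[Tuple[str, Listener]] = []

    @staticmethod
    def _under(path: str, directory: str) -> bool:
        directory = directory.rstrip("/")
        return path == directory or path.startswith(directory + "/")

    def mount(self, directory: str, listener: Listener) -> Dict[str, object]:
        """Register interest in a directory and return a snapshot of it."""
        self.mounts.append((directory, listener))
        return {p: v for p, v in self.objects.items() if self._under(p, directory)}

    def put(self, path: str, value: object) -> None:
        self.objects[path] = value
        for directory, listener in self.mounts:
            if self._under(path, directory):  # notify only mounted subtrees
                listener(path, value)


seen: List[str] = []
ns = NameSpace()
ns.put("/location/A-J/adams", "R2")
snapshot = ns.mount("/location/A-J", lambda path, value: seen.append(path))
ns.put("/location/S-Z/smith", "R1")   # outside the mount: no notification
ns.put("/location/A-J/baker", "R7")   # inside the mount: delivered
print(snapshot)  # {'/location/A-J/adams': 'R2'}
print(seen)      # ['/location/A-J/baker']
```

Filtering by subtree is what lets an agent subscribe selectively instead of receiving every update in the database.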

Page 10

Fault tolerance is built in

* When an agent restarts, it recovers its state from SysDB
* Agents implement invariants, and therefore can be restarted at any time, on any server
* Any number of backup SysDBs are supported

[Figure: SP, SB, and agents A1, A2, A3, A4]

Fast recovery for high availability
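One way to read "agents implement invariants, therefore can be restarted at any time" is sketched below, assuming the agent keeps only derived state defined by an invariant over SysDB contents. The dict `sysdb_status` stands in for SysDB, and `LocationIndexAgent` is a hypothetical agent, not code from the talk.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, Optional, Set

# Stand-in for SysDB: the authoritative state (user id -> location id).
sysdb_status: Dict[str, str] = {}


class LocationIndexAgent:
    """Invariant: users_at[loc] == {u for u, l in SysDB status if l == loc}."""

    def __init__(self, status: Dict[str, str]) -> None:
        self.users_at: DefaultDict[str, Set[str]] = defaultdict(set)
        for user, loc in status.items():  # recovery: rebuild from SysDB
            self.users_at[loc].add(user)

    def on_change(self, user: str, old: Optional[str], new: Optional[str]) -> None:
        """Incrementally maintain the same invariant on each notification."""
        if old is not None:
            self.users_at[old].discard(user)
            if not self.users_at[old]:
                del self.users_at[old]
        if new is not None:
            self.users_at[new].add(user)


agent = LocationIndexAgent(sysdb_status)
for user, loc in [("U1", "R5"), ("U8", "R9"), ("U1", "R9")]:
    old = sysdb_status.get(user)
    sysdb_status[user] = loc
    agent.on_change(user, old, loc)

# Simulated crash: a fresh instance, possibly on another server, rebuilds
# exactly the same derived state from SysDB alone.
restarted = LocationIndexAgent(sysdb_status)
print(dict(restarted.users_at) == dict(agent.users_at))  # True
```

Since nothing the agent holds is authoritative, losing it costs only the time to rebuild, which is the "fast recovery" the slide refers to.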

Page 11

Example: Location Service as a customized key-value store

* Application needs to track the real-time location of each user
* A user is allowed in only one location at a time
* Three operations:
  * ENTER <user id> <session id> <location id>
  * LEAVE <user id>
  * QUERY <user id>
* Throughput > 10,000 requests/sec, latency < 1 ms

High throughput, low latency required
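The three operations above can be captured as a typed request. The slides say only that the service is accessed over HTTP, so the whitespace-separated line format below is an assumption, and `Request`, `ARITY`, and `parse_request` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Request:
    op: str                            # "ENTER", "LEAVE", or "QUERY"
    user_id: str
    session_id: Optional[str] = None   # ENTER only
    location_id: Optional[str] = None  # ENTER only


# Number of arguments per operation, as listed on the slide.
ARITY = {"ENTER": 3, "LEAVE": 1, "QUERY": 1}


def parse_request(line: str) -> Request:
    """Parse a whitespace-separated request line (assumed wire format)."""
    parts = line.split()
    if not parts:
        raise ValueError("empty request")
    op, args = parts[0].upper(), parts[1:]
    if op not in ARITY:
        raise ValueError(f"unknown operation: {op}")
    if len(args) != ARITY[op]:
        raise ValueError(f"{op} expects {ARITY[op]} argument(s), got {len(args)}")
    if op == "ENTER":
        user_id, session_id, location_id = args
        return Request(op, user_id, session_id, location_id)
    return Request(op, args[0])


print(parse_request("ENTER smith s42 R5"))
print(parse_request("query smith"))
```

Validating arity up front keeps malformed requests out of the store logic, so the per-request handler only ever sees well-formed operations.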

Page 12

Location Service Overview

* HTTP access to service
* Application (GS) contacts any LR server via load balancer
* LR servers replicated for scalability and for fault tolerance

[Figure: six GS clients send Get location, Enter, and Leave requests through a load balancer to four LR servers]

Challenge: ensure responses from multiple LR servers are handled correctly

Page 13

Key-value store tracks location for each user

[Figure: GS clients send Enter Smith,1 and Enter Smith,2 through the load balancer to the LR servers; each LR server calls get() and put() on a key-value store partitioned into Shard A-J, Shard K-R, and Shard S-Z; the two updates to key Smith have to be atomic]
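Why the figure says the update "has to be atomic": if ENTER is layered on a generic get()/put() store as check-then-act, two LR servers can both pass the check before either writes. The sketch below replays that interleaving deterministically; `KeyValueStore` is a hypothetical stand-in for any store that exposes only get() and put().

```python
from typing import Dict, Optional


class KeyValueStore:
    """Hypothetical generic store that offers only get() and put()."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def put(self, key: str, value: str) -> None:
        self._data[key] = value


store = KeyValueStore()

# ENTER layered on get()/put() as check-then-act. Two LR servers handle
# "Enter Smith,1" and "Enter Smith,2"; both checks run before either put.
lr1_ok = store.get("Smith") is None   # LR 1: Smith is not in any location
lr2_ok = store.get("Smith") is None   # LR 2: still sees no location
if lr1_ok:
    store.put("Smith", "1")           # LR 1 accepts Enter Smith,1
if lr2_ok:
    store.put("Smith", "2")           # LR 2 accepts too, overwriting 1

# Both requests succeed, breaking "one location at a time".
print(lr1_ok, lr2_ok, store.get("Smith"))  # True True 2
```

Avoiding this with a generic store typically means adding locks or conditional writes around get()/put(); the following slides instead move the check and the update into a single agent per shard.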

Page 14

TACC allows easy customization of key-value update semantics

* Each partition stores a unique subset of the user state
* We directly implement ENTER, LEAVE, and QUERY semantics, using a TACC Constrainer
* No locking or inter-agent synchronization required
* Requests and responses sent asynchronously
* High performance: there is no waiting or blocking

Specializing the key-value store semantics simplifies the application and improves performance
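Since each partition stores a unique subset of the user state, every request for a given user can be routed to exactly one partition. The sketch below uses the ranges shown in the earlier figure (A-J, K-R, S-Z) and assumes routing by the first letter of the user id; the slides do not state the actual partitioning key, and `shard_for` is a hypothetical helper.

```python
# Shard ranges from the figure; routing by the first letter of the user
# id is an assumption made for this sketch.
SHARDS = [("A", "J"), ("K", "R"), ("S", "Z")]


def shard_for(user_id: str) -> str:
    """Return the name of the only shard allowed to hold this user's state."""
    first = user_id[:1].upper()
    for lo, hi in SHARDS:
        if first and lo <= first <= hi:
            return f"{lo}-{hi}"
    raise ValueError(f"no shard for user id {user_id!r}")


for user in ("Adams", "kim", "Smith"):
    print(user, "->", shard_for(user))
```

Because both "Enter Smith,1" and "Enter Smith,2" map to S-Z, a single agent sees both and can decide between them without coordinating with other partitions.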

Page 15

Single-writer collections: no need for synchronization

[Figure: LR servers and Shards A-J and K-R; legend: R = Request Collection, S = Response Collection]
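A sketch of the single-writer idea, assuming each LR server owns one append-only request collection and each shard owns one response collection; readers keep their own cursors and never write. The ownership layout and the names `SingleWriterCollection`, `read_from`, `LR1`, and `LR2` are illustrative assumptions, since the figure gives only the legend.

```python
from typing import Dict, Generic, List, TypeVar

T = TypeVar("T")


class SingleWriterCollection(Generic[T]):
    """Append-only log with exactly one designated writer (hypothetical)."""

    def __init__(self, owner: str) -> None:
        self.owner = owner
        self._items: List[T] = []

    def append(self, writer: str, item: T) -> None:
        if writer != self.owner:
            raise PermissionError(f"{writer} may not write {self.owner}'s collection")
        self._items.append(item)

    def read_from(self, index: int) -> List[T]:
        # Readers only read; each keeps its own cursor into the log.
        return self._items[index:]


# One request collection per LR server, one response collection per shard.
requests = {lr: SingleWriterCollection[str](lr) for lr in ("LR1", "LR2")}
responses = {"A-J": SingleWriterCollection[str]("A-J")}
cursors: Dict[str, int] = {lr: 0 for lr in requests}

requests["LR1"].append("LR1", "ENTER adams s1 R2")
requests["LR2"].append("LR2", "QUERY adams")

# The shard agent drains each request collection it reads ...
for lr, coll in requests.items():
    new = coll.read_from(cursors[lr])
    cursors[lr] += len(new)
    for req in new:
        # ... and writes results only into the collection it owns.
        responses["A-J"].append("A-J", f"{lr}: handled {req}")

print(responses["A-J"].read_from(0))
```

With one writer per collection, no two agents ever modify the same data, so no lock is needed to protect it; in this single-process sketch the ownership check only makes the rule explicit.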

Page 16

The Serializer Constrainer

[Figure: a notification on the Request Collection triggers the constrainer logic, which updates the user status and writes the result]

Request Collection    Response Collection    Status Collection
A: Enter U1, R5       A: OK                  U1: R5
K: Enter U1, R5       K: NOT ALLOWED         U8: R9
D: Enter U8, R9       D: OK

Really simple!

Page 17

Details of Constrainer implementation

* Code for the Serializer constrainer defines three collections:
  * Input collection: requests
  * Output collections: responses and user status
* A dependency constraint causes imperative code to be executed when a new request arrives from an LR server
* The imperative code in the constrainer implements the application-specific semantics

This code is a minor tweak on the put() implementation
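A Python rendering of what the per-request imperative code might look like, reproducing the Serializer example (A: OK, K: NOT ALLOWED, D: OK). It assumes requests for a shard are handled one at a time by a single agent; the result strings "NOT FOUND" and "BAD REQUEST", the session ids s1 to s3, and the class `Serializer` are hypothetical, since the slides show only OK and NOT ALLOWED.

```python
from typing import Dict, List, Optional, Tuple


class Serializer:
    """Hypothetical Python rendering of one shard's serializer logic."""

    def __init__(self) -> None:
        # Output collections: user status and responses.
        self.status: Dict[str, Tuple[str, str]] = {}  # user -> (session, location)
        self.responses: List[Tuple[str, str]] = []    # (requesting LR, result)

    def on_request(self, requester: str, op: str, user: str,
                   session: Optional[str] = None,
                   location: Optional[str] = None) -> None:
        """Runs once per new request, in arrival order, with no locks."""
        if op == "ENTER":
            if user in self.status:
                result = "NOT ALLOWED"          # only one location per user
            else:
                self.status[user] = (session or "", location or "")
                result = "OK"
        elif op == "LEAVE":
            result = "OK" if self.status.pop(user, None) is not None else "NOT FOUND"
        elif op == "QUERY":
            entry = self.status.get(user)
            result = entry[1] if entry is not None else "NOT FOUND"
        else:
            result = "BAD REQUEST"
        self.responses.append((requester, result))


s = Serializer()
s.on_request("A", "ENTER", "U1", "s1", "R5")
s.on_request("K", "ENTER", "U1", "s2", "R5")
s.on_request("D", "ENTER", "U8", "s3", "R9")
print(s.responses)  # [('A', 'OK'), ('K', 'NOT ALLOWED'), ('D', 'OK')]
```

The ENTER branch is essentially a put() preceded by a membership check; because one agent runs it sequentially for its shard, the check and the write cannot interleave with another request for the same user.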

Page 18

Constraints and strong typing improve event-handling code

* Constraint-handling code automatically inserted by the compiler
  * No need to manually maintain invariants at many call sites
* User-defined types organize constraint-handling code and protect against mistakes
* TACC coroutines further simplify event handling

TACC changes event-handling spaghetti into well-structured, type-safe code
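As a loose Python analogy to compiler-inserted constraint handling, a descriptor can run a declared constraint on every write, so no call site has to remember to update the invariant itself. `Constrained`, `update_occupancy`, and the per-location occupancy count are hypothetical examples, not taken from the talk.

```python
from typing import Any, Callable, Dict


class Constrained:
    """Runs constraint(obj, old, new) after every write to the attribute."""

    def __init__(self, constraint: Callable[[Any, Any, Any], None]) -> None:
        self.constraint = constraint

    def __set_name__(self, owner: type, name: str) -> None:
        self.slot = "_" + name  # where the value is actually stored

    def __get__(self, obj: Any, objtype: Any = None) -> Any:
        return self if obj is None else getattr(obj, self.slot, None)

    def __set__(self, obj: Any, value: Any) -> None:
        old = getattr(obj, self.slot, None)
        setattr(obj, self.slot, value)
        self.constraint(obj, old, value)


occupancy: Dict[str, int] = {}  # invariant: number of users per location


def update_occupancy(user: Any, old: Any, new: Any) -> None:
    if old is not None:
        occupancy[old] -= 1
    if new is not None:
        occupancy[new] = occupancy.get(new, 0) + 1


class User:
    location = Constrained(update_occupancy)  # declared once, enforced everywhere

    def __init__(self, name: str) -> None:
        self.name = name


a, b = User("U1"), User("U8")
a.location = "R5"
b.location = "R5"
a.location = "R9"   # moving U1 keeps both counts correct automatically
b.location = None   # leaving
print(occupancy)    # {'R5': 0, 'R9': 1}
```

The invariant lives next to the type declaration rather than being repeated wherever `location` is assigned, which is the structural benefit the slide attributes to TACC's compiler.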

Page 19

Instrumentation and Measurements

* Stress Agent and SysDB instrumented to collect timestamps (stored in memory, I/O after test)
* tcpdump run on Stress Agent and SysDB servers
* Correlate timestamps with tcpdump

Page 20

Low-latency pitfalls to avoid

* Network and TCP behavior
  * Many TCP settings have a dramatic and non-linear performance impact
* Memory management
  * Memory allocation/deallocation
  * Avoid garbage collection

“The devil is in the details”
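The slide does not say which TCP settings mattered. One well-known example for small request/response messages is Nagle's algorithm, which holds back small segments while earlier data is unacknowledged; combined with delayed ACKs on the peer, this can stall a small request for tens of milliseconds, far above the 1 ms target of the Location Service. The sketch below disables it with the standard `TCP_NODELAY` socket option; it is an illustration, not the configuration used in the talk.

```python
import socket

# Disable Nagle's algorithm so small writes are sent immediately instead of
# being coalesced while an earlier segment is still unacknowledged.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print("TCP_NODELAY enabled:", nodelay != 0)
sock.close()
```

In a real service the option would be set on each connected socket before request traffic starts; whether it helps depends on the message pattern, which is why measurements like those on the previous slide matter.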

Page 21

Zero-load Latency (μs)

SysDB                      Time    Latency
(3) Receive request          0.0
(4) Notification            42.3      42.3
(5) Response enqueued       75.1      32.8
(6) Response packet        108.5      33.4

End-to-end                 Time    Latency
(1) Request created            0
(2) Request packet            48        48
(7) Response packet          248       200
(8) Notification             288        40

Latencies are low and predictable
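Each Latency entry in the table is the difference between consecutive Time values, which this short check reproduces from the numbers above (all values in μs).

```python
sysdb_times = [0.0, 42.3, 75.1, 108.5]   # steps (3)-(6)
end_to_end_times = [0, 48, 248, 288]     # steps (1), (2), (7), (8)


def step_latencies(times):
    """Latency of each step = its timestamp minus the previous timestamp."""
    return [round(b - a, 1) for a, b in zip(times, times[1:])]


print(step_latencies(sysdb_times))       # [42.3, 32.8, 33.4]
print(step_latencies(end_to_end_times))  # [48, 200, 40]
```

The end-to-end time at step (8) is 288 μs, well inside the 1 ms latency requirement stated for the Location Service.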

Page 22

Latency, throughput vs. SysDBs

High scalability under strict latency bound

Latency converges to zero-load latency

Page 23

Summary

* TACC enables developers to efficiently create predictably high-performance, scalable, fault-tolerant distributed applications
  * Eliminates synchronization and locking bugs
  * Fewer lines of code
  * Faster to develop, shorter time to market
  * Easier to maintain
  * Fewer bugs

Page 24

[email protected]

Contact me for more information about TACC and OptumSoft!