    Riak Handbook

    Mathias Meyer

    Revision 1.1

    Table of Contents

    Introduction
    Thank You
    How to read the book
    Feedback
    Code
    Changelog
    CAP Theorem
    The CAP Theorem is Not Absolute
    Fine-Tuning CAP with Quorums
    N, R, W, Quorums, Oh My!
    How Quorums Affect CAP
    A Word of CAP Wisdom
    Further Reading
    Eventual Consistency
    Consistency in Quorum-Based Systems
    Consistent Hashing
    Sharding and Rehashing
    A Better Way
    Enter Consistent Hashing
    Looking up an Object
    Problems with Consistent Hashing
    Dealing with Overload and Data Loss
    Amazon's Dynamo
    Basics
    Virtual Nodes
    Master-less Cluster
    Quorum-based Replication
    Read Repair and Hinted Handoff
    Conflict Resolution using Vector Clocks
    Conclusion
    What is Riak?
    Riak: Dynamo, And Then Some
    Installation
    Installing Riak using Binary Packages
    Talking to Riak
    Buckets
    Fetching Objects
    Creating Objects
    Object Metadata
    Custom Metadata
    Linking Objects
    Walking Links
    Walking Nested Links
    The Anatomy of a Bucket
    List All Of The Keys
    How Do I Delete All Keys in a Bucket?
    How Do I Get the Number of All Keys in a Bucket?
    Querying Data
    MapReduce
    MapReduce Basics
    Mapping Tweet Attributes
    Using Reduce to Count Tweets
    Re-reducing for Great Good
    Counting all Tweets
    Chaining Reduce Phases
    Parameterizing MapReduce Queries
    Chaining Map Phases
    MapReduce in a Riak Cluster
    Efficiency of Buckets as Inputs
    Key Filters
    Using Riak's Built-in MapReduce Functions
    Intermission: Riak's Configuration Files
    Errors Running JavaScript MapReduce
    Deploying Custom JavaScript Functions
    Using Erlang for MapReduce
    Writing Custom Erlang MapReduce Functions
    On Full-Bucket MapReduce and Key-Filters Performance
    Querying Data, For Real
    Riak Search
    Enabling Riak Search
    Indexing Data
    Indexing from the Command-Line
    The Anatomy of a Riak Search Document
    Querying from the Command-Line
    Other Command-Line Features
    The Riak Search Document Schema
    Analyzers
    Writing Custom Analyzers
    Other Schema Options
    An Example Schema
    Setting the Schema
    Indexing Data from Riak
    Using the Solr Interface
    Paginating Search Results
    Sorting Search Results
    Search Operators
    Summary of Solr API Search Options
    Summary of the Solr Query Operators
    Indexing Documents using the Solr API
    Deleting Documents using the Solr API
    Using Riak's MapReduce with Riak Search
    The Overhead of Indexing
    Riak Secondary Indexes
    Indexing Data with 2i
    Querying Data with 2i
    Using Riak 2i with MapReduce
    Storing Multiple Index Values
    Managing Object Associations: Links vs. 2i
    How Does Riak 2i Compare to Riak Search?
    Riak Search vs. Riak 2i vs. MapReduce
    How Do I Index Data Already in Riak?
    Using Pre- and Post-Commit Hooks
    Validating Data
    Enabling Pre-Commit Hooks
    Pre-Commit Hooks in Erlang
    Modifying Data in Pre-Commit Hooks
    Accessing Riak Objects in Commit Hooks
    Enabling Post-Commit Hooks
    Deploying Custom Erlang Functions
    Updating External Sources in Post-Commit Hooks
    Riak in its Setting
    Building a Cluster
    Adding a Node to a Riak Cluster
    Configuring a Riak Node
    Joining a Cluster
    Anatomy of a Riak Node
    What Happens When a Node Joins a Cluster
    Leaving a Cluster
    Eventually Consistent Riak
    Handling Consistency
    Writing with a Non-Default Quorum
    Durable Writes
    Primary Writes
    Tuning Default-Replication and Quorum Per Bucket
    Choosing the Right N Value
    Reading with a Non-Default Quorum
    Read-Repair
    Modeling Data for Eventual Consistency
    Choosing the Right Data Structures
    Conflicts in Riak
    Siblings
    Reconciling Conflicts
    Modeling Counters and Other Data Structures
    Problems with Timestamps for Conflict Resolution
    Strategies for Reconciling Conflicts
    Reads Before Writes
    Merging Strategies
    Sibling Explosion
    Building a Timeline with Riak
    Multi-User Timelines
    Avoiding Infinite Growth
    Intermission: How to Fetch Multiple Objects in one Request
    Intermission: Paginating Using MapReduce
    Handling Failure
    Operating Riak
    Choosing a Ring Size
    Protocol Buffers vs. HTTP
    Storage Backends
    Innostore
    Bitcask
    LevelDB
    Load-Balancing Riak
    Placing Riak Nodes across a Network
    Monitoring Riak
    Request Times
    Number of Requests
    Read Repairs, Object Size, Siblings
    Monitoring 2i
    Miscellany
    Monitoring Reference
    Managing a Riak Cluster with Riak Control
    Enabling Riak Control
    Intermission: Generating an SSL Certificate
    Riak Control Cluster Overview
    Managing Nodes with Riak Control
    Managing the Ring with Riak Control
    To Be Continued
    When To Riak?
    Riak Use Cases in Detail
    Using Riak for File Storage
    File Storage Access Patterns
    Object Size
    Storing Large Files in Riak
    Riak Cloud Storage
    Using Riak to Store Logs
    Modeling Log Records
    Logging Access Patterns
    Indexing Log Data for Efficient Access
    Secondary Index Ranges as Key Filter Replacement
    Searching Logs
    Riak for Log Storage in the Wild
    Deleting Historical Data
    What about Analytics?
    Session Storage
    Modeling Session Data
    Session Storage Access Patterns
    Bringing Session Data Closer to Users
    URL Shortener
    URL Shortening Access Patterns
    Modeling Data
    Riak URL Shortening in the Wild
    Where to go from here

    Introduction

    I first heard about Riak in September 2009, right after it was unveiled to the public, at one of the early events around NoSQL in Berlin. I tip my hat to Martin Scholl for introducing the attendees (myself included) to this new database. It's distributed, written in Erlang, supports JSON, and MapReduce. That's all we needed to know.

    Riak fascinated me right from the beginning. Its roots in Amazon's Dynamo and the distributed nature were intriguing. It has been fun to see it develop since then; it's been more than two years now.

    Over that time, Riak went from a simple key-value store you can use to reliably store sessions to a full-blown database with lots of bells and whistles. I was more and more intrigued, and started playing with it more, diving into its feature set and into Dynamo too.

    Add to that the friendly Basho folks, makers of Riak, whom I had the great pleasure of meeting a few times and even working with.

    But something was missing. Every database should have a book dedicated to it. I never thought that it would even be possible to write a whole book about Riak, let alone that I would be the one to write it, yet here we are.

    What you're looking at is my collective brain dump on all things Riak, covering everything from basic usage, by way of MapReduce, full-text search and indexing data, to advanced topics like modeling data to fit in well with Riak's eventually consistent distribution model.

    So here we are. I hope you'll enjoy what you're about to read as much as I enjoyed writing it.

    This is a one-man operation, so please respect the time and effort that went into this book. If you came by a free copy and find it useful, please buy the book.

    Thank You

    This book wouldn't be here, on your screen, without the help and support of quite a few people. To be honest, I was surprised how much work goes into a book, and how many people are more than willing to help you finish it. For that I am incredibly grateful.

    First and foremost I want to thank my wife Jördis, who not only was very supportive, but also helped a great deal by doing all the design work in and around the book, the cover, the illustrations, and the website. She gave me that extra push when I needed it. My daughter Mari was supportive in her very own way, probably without realizing it, but supportive nonetheless. She was great to have around when writing this book.

    Thank you so very much to everyone who reviewed the initial and advanced versions of the book, devoting their valuable time to giving invaluable feedback. You never realize until later how many typos you end up creating.

    Thank you for your great feedback, for tirelessly answering my questions, and for all the support you guys gave me: Florian Ebeling, Eric Lindvall, Till Klampäckel, Steve Vinoski, Russell Brown, Sean Cribbs, Reid Draper, Ryan Zezeski, John Vincent, Rick Olson, Corey Donohoe, Mark Philips, Ralph von der Heyden, Patrick Hüsler, Robin Mehner, Stefan Schmidt, Kelly McLaughlin, Brian Shumate, Jeremiah Peschka, Marc Heiligers. I bow to you!

    How to read the book

    Start at the front, read the book all the way to the back.

    Feedback

    If you think you found a typo, have some suggestions for things you think are missing and whatnot, or generally would like to say hi, send an email to [email protected]. Be sure to include the revision you're referring to; it's printed on the second page.

    Code

    This book includes a lot of code, but only in small chunks that are easy to grasp. There are only two listings in the entire book that stretch close to a page. Most of the code doesn't build on top of each other but tries to stand alone, though there's the occasional assumption that some piece of code has been run at some point. What was worth breaking out into small programs or what would require tedious copy and paste has been moved into a code repository that accompanies this book. You can find it on GitHub at http://github.com/mattmatt/nosql-handbook-examples/tree/master/08-riak.

    Changelog

    Version 1.1

  • Added a section on load balancing
  • Added a section on network placement of Riak nodes
  • Added a section on monitoring

  • Added a section on storing multiple index values and using 2i to manage object relationships
  • Fixed code examples in the ePub and Kindle versions
  • Added a section on Riak Control
  • Added a section on pre- and post-commit hooks
  • Added a section on deploying custom Erlang code
  • Added a section describing an issue that may come up when running JavaScript MapReduce requests
  • Added a section on Riak use cases explained in detail. Includes file storage, log storage, session storage, and URL shortening.
  • Added a section explaining primary writes
  • The book is now included as a man page for easy reading and searching on the command line.

    CAP Theorem

    CAP is an abbreviation for consistency, availability, and partition tolerance. The basic idea is that in a distributed system, you can have only two of these properties, but not all three at once. Let's look at what each property means.

  • Consistency: Data access in a distributed database is considered to be consistent when an update written on one node is immediately available on another node. Traditional ways to achieve this in relational database systems are distributed transactions. A write operation is only successful when it's written to a master and at least one slave, or even to all nodes in the system. Every subsequent read on any node will always return the data written by the update.

  • Availability: The system guarantees availability for requests even though one or more nodes are down. For any database with just one node, this is impossible to achieve. Even when you add slaves to one master database, there's still the risk of unavailability when the master goes down. The system can still return data for reads, but can't accept writes until the master comes back up. To achieve availability, data in a cluster must be replicated to a number of nodes, and every node must be ready to claim master status at any time, with the cluster automatically rebalancing the data set.

  • Partition Tolerance: Nodes can be physically separated from each other at any given point and for any length of time. The time they're not able to reach each other, due to routing problems, network interface troubles, or firewall issues, is called a network partition. During the partition, all nodes should still be able to serve both read and write requests. Ideally the system automatically reconciles updates as soon as every node can reach every other node again.

    Given features like distributed transactions, it's easy to describe consistency as the prime property of relational databases. Think about it, though: in a master-slave setup, data is usually replicated down to slaves in a lazy manner. Unless your database supports it (like the semi-synchronous replication in MySQL 5.5) and you enable it explicitly, there's no guarantee that a write to the master will be immediately visible on a slave. It can take crucial milliseconds for the data to show up, and your application needs to be able to handle that. Unless of course you've chosen to ignore the potential inconsistency, which is fair enough; I'm certainly guilty of having done that myself in the past.

    While Brewer's original description of CAP was more of a conjecture, by now it's accepted and proven that a distributed database system can only allow for two of the three properties. For example, it's considered impossible for a database system to offer both full consistency and 100% availability at the same time; there will always be trade-offs involved. That is, until someone finds the universal cure against network partitions, network latency, and all the other problems computers and networks face.

    The CAP Theorem is Not Absolute

    While consistency and availability certainly aren't particularly friendly with each other, they should be considered tuning knobs instead of binary switches. You can have some of one and some of the other. This approach has been adopted by quorum-based, distributed databases.

    A quorum is the minimum number of parties that need to be successfully involved in an operation for it to be considered successful as a whole. In real life it can be compared to votes to make decisions in a democracy, only applied to distributed systems. By distributed systems I'm referring to systems that use more than one computer, a node, to get a job done. A job can be many things, but in our case we're dealing with storing a piece of data.

    Every node in a cluster gets a vote, and the number of required votes can be specified for the system as a whole, and for every operation separately. If the latter isn't specified, a sensible default is chosen based on a configured consensus, a path that oftentimes is not successfully applied to a democracy.

    In the world of quorum database systems, every piece of data is replicated to a number of nodes in a cluster. This number is specified using a value called N. It represents a default for the whole cluster, and the quorums for every read and write operation can be tuned against it.

    Consider a cluster with five nodes and an N value of 3. The N value is the number of replicas, and you can tune every operation with a quorum, which determines the number of nodes that are required for that operation to be successful.

    Fine-Tuning CAP with Quorums

    When you write (that is, update or create) a piece of data, you can specify a value W: the number of replicas the write must go to for it to be considered successful. That number can't be higher than N.

    The W value is a tuning knob for consistency and availability. The higher you pick your W, the more consistent the written data will be across all replicas, but an operation may fail because some nodes are currently unreachable or down due to maintenance.

    Lowering the W value will affect consistency of the data set, as a subsequent read on a different replica is not guaranteed to return the updated data. Choosing a higher W also affects speed: the more nodes that need to be involved in a single write, the more network latency is involved. A lower W involves fewer nodes, so it will take some time for a write to propagate to the replicas not affected by the quorum. Operations can be parallelized for speed, but an operation is still only as fast as its slowest link. Subsequent reads on other replicas may return an outdated value. When I say time, I'm talking milliseconds, but in an application with quickly-changing data, that may still be a factor.

    For reads, the value is called R. The first R nodes to return the requested value make up the read quorum. The higher the R value, the more nodes need to return the same value for the read to be considered successful.

    Again, choosing a higher value for R affects performance, but offers a stronger consistency level. Even with a low W value, a high R value can force the cluster to reconcile outdated pieces of data so that they're consistent. That way, there are no situations where a read will return outdated information. It's a trade-off between low write consistency and high read consistency. Choosing a lower R makes a read less prone to availability issues and lowers read latency. The optimum for consistency lies in choosing R and W such that R + W > N. That way, data will always be kept consistent.

    N, R, W, Quorums, Oh My!

    In the real world, it will depend on your particular use case which N, W, and R values you're going to pick. Need high insert and update speed? Pick a low W and maybe a higher R value. Care about consistent reads and a bit less about increased read latency? Pick a high R. If speed is all you're after in reads and writes, but you still want to have data replicated for availability, pick a low W and R value, but an N of 3 or higher. Apart from the N value, the other quorums are not written in stone; they can be tuned for every read and write operation separately.
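    Later in the book we'll talk to Riak through its HTTP interface; as a small preview of what per-operation tuning looks like there, the r and w query parameters override the quorum for a single request. The users bucket and the key are made up for illustration:

    $ curl 'localhost:8098/riak/users/roidrage?r=1'
    $ curl 'localhost:8098/riak/users/roidrage?r=3'
    $ curl -X PUT -H 'Content-Type: application/json' \
        -d '{"name": "Mathias"}' 'localhost:8098/riak/users/roidrage?w=3'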

    In a paper that takes a more detailed look at Brewer's conjecture, Gilbert and Lynch quite fittingly state that in the real world, most systems have settled on getting "most of the data, most of the time." You will see how this works out in the practical part of this book.

    How Quorums Affect CAP

    As you can see, a quorum offers a way to fine-tune both availability and consistency. You can pick a level that is pretty loose on both ends, making the whole cluster less prone to availability issues. Lower values also tune consistency to a level where the application and the user are more likely to be affected by outdated replicas. You can increase both to the same level, or use a low W and a high R for high speed and high consistency, but you'll be subject to higher read latency.

    Quorums allow fine-tuning partition tolerance against consistency and availability. With a higher quorum, you increase consistency but sacrifice availability, as more nodes are required to participate in an operation. If one replica required to win the quorum is not available, the operation fails. A more fitting name for this is yield, the percentage of requests answered successfully, coined by Brewer in a follow-up paper on CAP.

    With a lower quorum, you increase availability but lower your consistency expectations. You accept that a response may not include all the data, that your harvest varies. Harvest measures the completeness of a response by looking at the percentage of data included.

    Both scenarios have different trade-offs, but both are means to fine-tune partition tolerance. The lower the expectations an application has on yield or harvest, the more resilient it is to network partitions, and the lower the expectations towards consistency during normal operations. Which combination you pick depends on your use case; there is no one true combination of values.

    Tuning both up to 100% means a distributed system is not tolerant to partitions, as they'd result in either a decreased yield or decreased harvest, or maybe even a combination of both. As Coda Hale put it: "You can't sacrifice partition tolerance."

    A Word of CAP Wisdom

    While CAP is something I think you should be aware of, it's not worth wasting time fighting over which database falls into which category. What matters is how every database works in reality, how it works for your use cases, and what your requirements are. Collect assumptions and requirements, and compare them to what a database you're interested in has to offer. It's as simple as that. Which particular attributes of CAP it chose in theory is less important than that.

    Further Reading

    To dive deeper into the ideas behind CAP, read Seth Gilbert's and Nancy Lynch's dissection of Brewer's original conjecture (http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf). They're doing a great job of proving the correctness of CAP, all the while investigating alternative models, trying to find a sweet spot for all three properties along the way.

    Julian Browne wrote a more illustrated explanation of CAP (http://www.julianbrowne.com/article/viewer/brewers-cap-theorem), going as far as comparing the coinage of CAP to the creation of punk rock, something I can certainly get on board with. Coda Hale recently wrote an update on CAP (http://codahale.com/you-cant-sacrifice-partition-tolerance/), which is a lot less formal and aims towards practical applicability, a highly recommended read. And last but not least, you can peek at Brewer's original slides too (http://www.cs.berkeley.edu/%7Ebrewer/cs262b-2004/PODC-keynote.pdf).

    Daniel Abadi brings up some interesting points regarding CAP (http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html), arguing that CAP should consider latency as well. Eric Brewer and Armando Fox followed up the CAP discussion with a paper on harvest and yield (http://radlab.cs.berkeley.edu/people/fox/static/pubs/pdf/c18.pdf), which is also worth your while, as it argues for the need for a weaker version of CAP, one that focuses on dialing down one property while increasing another instead of considering them binary switches.

    Eventual Consistency

    In the last chapter we already talked about updates that are not immediately propagated to all replicas in a cluster. That can have lots of reasons, one being the chosen R or W value, while others may involve network partitions, making parts of the cluster unreachable or increasing latency. In other scenarios, you may have a database running on your laptop, which constantly synchronizes data with another node on a remote server. Or you have a master-slave setup for a MySQL or PostgreSQL database, where all writes go to a master, and subsequent reads only go to the slave. In this scenario the master will first accept the write and then propagate it to a number of slaves, which takes time. We're usually talking about a couple of milliseconds, but as you never know what happens, it could end up being hours. Sound familiar? It's what DNS does, a system you deal with almost every day.

    Consistency in Quorum-Based Systems

    In a truly distributed environment, and when writes involve quorums, you can tune how many nodes need to have successfully accepted a write so that the operation as a whole is a success. If you choose a W value less than the number of replicas, the remaining replicas that were not involved in the write will receive the data eventually. Again, we're talking milliseconds in common cases, but it can be a noticeable lag, and your application should be ready to deal with cases like that.

    In every scenario, the common thing is that a write will reach all the relevant nodes eventually, so that all nodes have the same data. It will take some time, but eventually the data in the whole cluster will be consistent for this particular piece of data, even after network partitions. Hence the name eventual consistency. Once again it's not really a specific feature of NoSQL databases; every time you have a setup involving masters and slaves, eventual consistency will strike with furious anger.

    The term was originally coined by Werner Vogels, Amazon's CTO, in 2007. The paper he wrote about it (http://www.allthingsdistributed.com/2008/12/eventually_consistent.html) is well worth reading. Being the biggest e-commerce site out there, Amazon had a big influence on a whole slew of databases.

    Consistent Hashing

    The invention of consistent hashing is one of those things that only happens once a century. At least that's how Andy Gross from Basho Technologies likes to think about it. When you deal with a distributed database environment and have to deal with an elastic cluster, where nodes come and go, I'm pretty sure you'll agree with him. But before we delve into detail, let's have a look at how data distribution is usually done in a cluster of databases or cache farms.

    Sharding and Rehashing

    Relational databases, or even just the caches you put in between your application and your database, don't really have a way to rebalance a cluster automatically as nodes come and go. In traditional setups you either had a collection of masters synchronizing data with each other, with you sitting on the other end, hoping that they never get out of sync (which they will). Or you started sharding your data.

    In a sharded setup, you split up your dataset using a predefined key. The simplest version of that could be to simply use the primary key of any table as your shard key. Using modulo math you calculate the modulo of the key and the number of shards (i.e. nodes) in the cluster. So the key 103 in a cluster of 5 nodes would go to the fourth node, as 103 % 5 = 3. This is the simplest way of sharding.

    To get a bit more fancy, add a hash function, which is applied to the shard key. Like before, calculate the modulo of the result and the number of servers. The problems start when you want to add a new node. Almost all of the data needs to be moved to another server, because the modulo needs to be recalculated for every record, and the result is very likely to be different; in fact, it's about N / (N + 1) likely to be different, with N being the number of nodes currently in the cluster. For going from three to four nodes that's 75% of the data affected, from four to five nodes it's 80%. The result gets worse as you add more nodes.
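    If you want to see the effect for yourself, here's a tiny JavaScript sketch (the number of keys is arbitrary) that counts how many keys end up on a different node when a modulo-sharded cluster grows from four to five nodes:

    // Assign a numeric key to a node using plain modulo sharding.
    function nodeFor(key, numberOfNodes) {
      return key % numberOfNodes;
    }

    var keys = [];
    for (var i = 0; i < 10000; i++) keys.push(i);

    // Count the keys whose node changes when going from 4 to 5 nodes.
    var moved = keys.filter(function (key) {
      return nodeFor(key, 4) !== nodeFor(key, 5);
    }).length;

    console.log(moved / keys.length); // 0.8, i.e. 80% of all keys have to move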

    Not only is that a very expensive operation, it also defeats the purpose of adding new nodes, because for a while your cluster will be mostly busy shuffling data around, when it should really deliver that data to your customers.

    A Better Way

    As you will surely agree, this doesn't pan out too well in a production system. It works, but it's not great.

    In the late nineties Akamai needed a way to increase and decrease caching capacity on demand without having to go through a full rebalancing process every time. Sounds like the scenario I just described, doesn't it? They needed it for caches, but it's easily applicable to databases too. The result is called consistent hashing, and it's a technique that's so beautifully simple, yet so incredibly efficient at avoiding moving unnecessary amounts of data around, that it blows my mind every time anew.

    Enter Consistent Hashing

    To understand consistent hashing, stop thinking of your data's keys as an infinite stream of integers. Consistent hashing's basic idea is to turn that stream into a ring that starts with 0 and ends with a number like 2^64, leaving room for plenty of keys in between. No really, that's a lot. Of course the actual ring size depends on the hash function you're using. To use SHA-1 for example, the ring must have a size of 2^160. The keys are ordered counter-clockwise, starting at 0, ending at 2^160 and then folding over again.

    The Ring.

    Consistent hashing, as the name suggests, uses a hash function to determine where an object with a given key belongs on the ring. Unlike with the modulus approach, the key is simply mapped onto the ring using its integer representation.

    Mapping a key to the ring using a hash function.
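    As a rough sketch of that mapping step (assuming SHA-1 and Node.js' built-in crypto module; the key is made up, and a real implementation would keep the full 160-bit integer instead of truncating it):

    var crypto = require('crypto');

    // Hash the key and interpret the digest as a position on the ring.
    var digest = crypto.createHash('sha1').update('users/roidrage').digest('hex');

    // JavaScript numbers can't hold 2^160, so truncate for illustration only.
    var position = parseInt(digest.slice(0, 8), 16);
    console.log(position);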

    When a node joins the cluster, it picks a random key on the ring. The node will then be responsible for handling all data between this and the next key chosen by a different node. If there's only one node, it will be responsible for all the keys in the ring.

    One node responsible for the entire ring.

    Add another node, and it will once again pick a random key on the ring. All it needs to do now is fetch the data between this key and the one picked by the first node.

    The ring is therefore sliced into what is generally called partitions. If a pizza slice is a nicer image to you, it works as well. The difference, though, is that with a pizza everyone loves to have the biggest slice, while in a database environment having that slice could kill you.

    Now add a third node, and it needs to transfer even less data, because the partitions created by the randomly picked keys on the ring get smaller and smaller as you add more nodes. See where this is going? Suddenly we're shuffling around much less data than with traditional sharding. Sure, we're still shuffling, but somehow data has to be moved around; there's no avoiding that part. You can only try to reduce the time and effort needed to shuffle it.

    Looking up an Object

    When a client goes to fetch an object stored in the cluster, it needs to be aware of the cluster structure and the partitions created in it. It uses the same hash function as the cluster to choose the correct partition and therefore the correct physical node the object resides on.

    To do that, it hashes the key and then walks clockwise until it finds a key that's mapped to a node, which will be the key the node randomly picked when it joined the cluster. Say your key hashes to the value 1234, and you have two nodes in the cluster, one claiming the key space from 0 to 1023, the other claiming the space from 1024 to 2048. Yes, that's indeed a rather small key space, but it's much better suited to illustrate the example.

    To find the node responsible for the data, you go clockwise from 1234 to 1024, the next lowest key picked by a node in the cluster, the second node in our example.
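    Here's a minimal JavaScript sketch of that lookup, using the tiny key space from the example. The positions are the keys the two nodes picked when they joined, and wrap-around at the end of the ring is left out to keep things short:

    // Each entry is the key a node picked when it joined the ring.
    var ring = [
      { position: 0,    node: 'node1' }, // claims 0 to 1023
      { position: 1024, node: 'node2' }  // claims 1024 to 2048
    ];

    function nodeForKey(hashedKey) {
      // Keep the highest position that's still at or below the hashed key,
      // the "next lowest key" from the example above.
      var owner = ring[0];
      ring.forEach(function (entry) {
        if (entry.position <= hashedKey) owner = entry;
      });
      return owner.node;
    }

    console.log(nodeForKey(1234)); // 'node2'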

    Problems with Consistent Hashing

    Even though consistent hashing itself is rather ingeniously simple, the randomness of it can cause problems if applied as the only technique, especially in smaller clusters.

    As each node picks a random key, there's no guarantee how close together or far apart the nodes really are on the ring. One may end up with only a million keys, while the other has to carry the weight of all the remaining keys. That can turn into a problem with load: one node gets swamped with requests for the majority of keys, while the other idles around desperately waiting for client requests to serve.

    Two nodes in a ring, one with fewer keys than the other.

    Also, when a node goes down (due to hardware failure, a network partition, or who knows what else happens in production), there is still the question of what happens to the data that it was responsible for. The solution once again is rather simple.

    Dealing with Overload and Data Loss

    Every node that joins the cluster not only grabs its own slice of the ring, it also becomes responsible for a number of slices from other nodes; it turns into a replica of their data. It now not only serves requests for its own data, it can also serve clients asking for data originally claimed by other nodes in the cluster.

    This simple concept is called a virtual node and solves two problems at once. It helps to spread request load evenly across the cluster, as more nodes are able to serve any given request, increasing capacity as you add more nodes. It also helps to reduce the risk of losing data by replicating it throughout the cluster.

    Some databases and commercial concepts take consistent hashing even further to reduce the potential of overloading and uneven spread of data, an idea first adopted (as far as I know) by Amazon's Dynamo database. We'll look into the details in the next chapter.

    Amazon's Dynamo

    One of the more influential products and papers in the field has been Amazon's Dynamo, responsible for, among other things, storing your shopping cart. It takes concepts like eventual consistency, consistent hashing, and the CAP theorem, and slaps a couple of niceties on top. The result is a distributed, fault-tolerant, and highly available data store.

    Basics

    Dynamo is meant to be easily scalable in a linear fashion by adding and removing nodes, to be fully fault-tolerant, highly available, and redundant. The goal was for it to survive network partitions and be easily replaceable even across data centers.

    All of this stemmed from actual business requirements, so either way, it pays off to read the full paper (http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html) to see how certain features relate to production use cases Amazon has.

    Dynamo is an accumulation of techniques and technologies, thrown together to offer just what Amazon wanted for some of their business use cases. Let's go through the most important ones, most notably virtual nodes, replication, read repairs, and conflict resolution using vector clocks.

    Virtual Nodes

    Dynamo takes the idea of consistent hashing and adds virtual nodes to the mix. We already came across them as a solution to spread load in a cluster using consistent hashing. Dynamo takes it a step further. When a cluster is defined, it splits up the ring into equally sized partitions. It's like an evenly sliced pizza, and the slice size never changes.

  • A hash ring with equally sized partitions.

    The advantage of choosing a partitioning scheme like that is that the ring setup is known and constant throughout the cluster's life. Whenever a node joins, it doesn't need to pick a random key; it picks random partitions instead, therefore avoiding the risk of having partitions that are either too small or too large for a single node.

    Say you have a cluster with 3 nodes and 32 partitions: every node will hold either 10 or 11 partitions. When you bring a fourth node into the ring, you will end up with 8 partitions on each node. A partition is hosted by a virtual node, which is only responsible for that particular slice of the data. As the cluster grows and shrinks, the virtual node may or may not move to other physical nodes.
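    A small JavaScript sketch of that claiming scheme, assuming partitions are simply handed out round-robin (real implementations are smarter about which node ends up with which partition):

    // Hand out a fixed number of partitions to a list of nodes, round-robin.
    function claimPartitions(numberOfPartitions, nodes) {
      var claims = {};
      nodes.forEach(function (node) { claims[node] = []; });
      for (var partition = 0; partition < numberOfPartitions; partition++) {
        claims[nodes[partition % nodes.length]].push(partition);
      }
      return claims;
    }

    var claims = claimPartitions(32, ['node1', 'node2', 'node3']);
    Object.keys(claims).forEach(function (node) {
      console.log(node, claims[node].length); // 11, 11 and 10 partitions
    });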

    Master-less Cluster

    No node in a Dynamo cluster is special. Every client can request data from any node and write data to any node. Every node in the cluster has knowledge of the partitioning scheme, that is, which node in the cluster is responsible for which partitions.

    Whenever a client requests data from a node, that node becomes the coordinator node, even if it's not holding the requested piece of data. When the data is stored in a partition on a different node, the coordinator node simply forwards the request to the relevant node and returns its response to the client.

    This has the added benefit that clients don't need to know about the way data is partitioned. They don't need to keep track of a table with partitions and their respective nodes. They simply ask any node for the data they're interested in.

    Quorum-based Replication

    As explained above in the section on consistent hashing, partitioning makes replicating data quite easy. A physical node not only holds the data in the partitions it picked; it will hold a total of up to P / PN * RE partitions, where P is the number of partitions in the ring, PN the number of physical nodes, and RE is the number of replicas configured for the cluster.

    So if every piece of data is replicated three times across the cluster, a single physical node in a cluster of four may hold up to 48 virtual nodes, given that the ring contains 64 partitions.
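    Plugging the numbers from this example into the formula from above:

    // 64 partitions, 4 physical nodes, 3 replicas of every piece of data.
    var P = 64, PN = 4, RE = 3;
    console.log(P / PN * RE); // up to 48 partitions hosted per physical node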

    The quorum is the consistency-availability tuning knob in a Dynamo cluster. Amazon leaves it up to a specific engineering team's preference how to deal with read and write consistency in their particular setup. As I mentioned already, it's a setting that's different for every use case.

    Read Repair and Hinted Handoff

    Read repair is a way to ensure consistency of data between replicas. It's a passive process that kicks in during a read operation to ensure all replicas have an up-to-date view of the data.

    Hinted handoff is an active process, used to transfer data that has been collected by other nodes while one or more nodes were down. While the node is down, others can accept writes for it to ensure availability. When the node comes back up, the others that collected data send hints to it that they currently have data that's not theirs to keep.

    Conflict Resolution using Vector Clocks

    Before you're nodding off with all the theoretical things we're going through here, let me just finish this part on Dynamo with the way it handles conflicts. In a distributed database system, a situation can easily arise where two clients update the same piece of data through two different nodes in the cluster.

    A vector clock is a pair of a server identifier and a version, an initial pair being assigned to a piece of data the moment it is created. Whenever a client updates an object it provides the vector clock it's referring to. Let's have a look at an example.

    A simplified view of a vector clock.

    When the object is updated, the coordinating node adds a new pair with server identifier and version, so an object's vector clock can grow significantly over time when it's updated frequently. As long as the path through the pairs is the same, an update is considered to be a descendant of the previous one. All of Bob's updates descend from one another.

    The fun starts when two different clients update the same object. Each client adds a new identifier to the list of pairs, and now there are two different lists of pairs, one from each node. We've run into a conflict. We now have two vector clocks that aren't descendants of each other, like the conflicts created by Alice and then Carol in the picture above.

    Dynamo doesn't really bother with the conflict; it can simply store both versions and let the next reading client know that there are multiple versions that need to be reconciled. Vector clocks can be pretty mind-bending, but they're actually quite simple. There are two great summaries on the Basho blog (http://blog.basho.com/2010/01/29/why-vector-clocks-are-easy/ and http://blog.basho.com/2010/04/05/why-vector-clocks-are-hard/), and Kresten Krab Thorup wrote another one (http://www.javalimit.com/2011/01/understanding-vector-clocks.html), where he refers to them as version vectors instead, which actually makes a lot of sense and, I'm sure, will help you understand vector clocks better.
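    To make the descendant check a bit more tangible, here's a minimal JavaScript sketch. It assumes a vector clock is a plain object mapping server identifiers to versions, which glosses over plenty of detail but shows how conflicting updates are detected:

    // a descends from b if a has seen everything b has seen.
    function descends(a, b) {
      return Object.keys(b).every(function (server) {
        return (a[server] || 0) >= b[server];
      });
    }

    var bobs   = { nodeA: 2 };
    var alices = { nodeA: 2, nodeB: 1 };
    var carols = { nodeA: 2, nodeC: 1 };

    console.log(descends(alices, bobs));   // true, a regular update
    console.log(descends(alices, carols)); // false...
    console.log(descends(carols, alices)); // ...and false: neither descends from
                                           // the other, so we have a conflict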

    The basic idea of vector clocks goes way back to the seventies, when Leslie Lamport wrote a paper on using time and version increments as a means to restore order in a distributed system (http://research.microsoft.com/en-us/um/people/lamport/pubs/time-clocks.pdf). That was in 1978; think about that for a minute. But it wasn't until 1988 that the idea of vector clocks that include both time and a secondary means of deriving ordering was published, in a paper by Colin J. Fidge (http://sky.scitech.qut.edu.au/%7Efidgec/Publications/fidge88a.pdf).

    Vector clocks are confusing, no doubt, and you hardly have to deal with their inner workings. They're just a means for a database to discover conflicting updates.

    Conclusion

    Dynamo throws quite a punch, don't you agree? It's a great collection of different algorithms and technologies, brought together to solve real life problems. Even though it's a lot to take in, you'll find that it influenced a good bunch of databases in the NoSQL field and is referenced or cited equally often.

    There have been several open source implementations, namely Dynomite (abandoned these days due to copyright issues, but the first open source Dynamo clone), Project Voldemort, and Riak. Cassandra also drew some inspiration from it.

    What is Riak?

    Riak does one thing, and one thing really well: it ensures data availability in the face of system or network failure, even when it has only the slightest chance of still serving a piece of data available to it, and even though parts of the whole dataset might be missing temporarily.

    At the very core, Riak is an implementation of Amazon's Dynamo, made by the smart folks from Basho. The basic way to store data is by specifying a key and a value for it. Simple as that. A Riak cluster can scale in a linear and predictable fashion, because adding more nodes increases capacity thanks to consistent hashing and replication. Throw on top the whole shebang of fault tolerance, no special nodes, and boom, there's Riak.

    A value stored with a key can be anything; Riak is pretty agnostic about it, but you're well advised to provide a proper content type for what you're storing. To no-one's surprise, for any reasonably structured data, using JSON is recommended.

    Riak: Dynamo, And Then Some

    There's more to Riak than meets the eye though. Over time, the folks at Basho added some neat features on top. One of the first things they added was the ability to have links between objects stored in Riak, to have a simpler way to navigate an association graph without having to know all the keys involved.

    Another noteworthy feature is MapReduce, which has traditionally been the preferred way to query data in Riak, based, for example, on the attributes of an object. Riak utilizes JavaScript, though if you're feeling adventurous you can also use Erlang to write MapReduce functions. As a means of indexing and querying data, Riak offers full-text search and secondary indexes.

    There are two ways I'm referring to Riak. Usually when I say Riak, I'm talking about the system as a whole. But when I mention Riak KV, I'm talking about Riak the key-value store (the original Riak, if you will). Riak's feature set has grown beyond just storing keys and values. We're looking at the basic feature set of Riak KV first, and then we'll look at things that were added over time, such as MapReduce, full-text search, and secondary indexes.

    Installation

    While you can use Homebrew and a simple brew install riak to install Riak, you can also use one of the binary packages provided by Basho (http://downloads.basho.com/riak/). Riak requires Erlang R14B03 or newer, but using the binary packages or Homebrew, that's already taken care of for you. As of this writing, 1.1.2 is the most recent version, and we'll stick to its feature set. Be aware that Riak doesn't run on Windows, so you'll need some flavor of Unix to make it through this book.

When properly installed and started using riak start, it should be up and running on port 8098, and you should be able to run the following command and get a response from Riak.

$ curl localhost:8098/riak
{}

While you're at it, install Node.js as well. We'll talk to Riak using Node.js and the riak-js library, a nice and clean asynchronous library for Riak, while we peek under the covers to figure out exactly what's going on.

Running npm install http://nosql-handbook.s3.amazonaws.com/pkg/riak-js-7d3b8bbf.tar.gz installs the latest version of riak-js (we're using the custom version as it includes some important fixes). After you're done, you should be able to start a Node shell by running the command node and executing the line below without causing any errors.

    var riak = require('riak-js').getClient()

As we work our way through its feature set we'll store tweets in Riak. First we'll just use the tweet's identifier to reference tweets, then we'll dig deeper and store tweets per user, making them searchable along the way.

Installing Riak using Binary Packages

Riak is known to be easy to handle from an operational perspective. That includes the installation process too. Basho provides a bunch of binary packages for common systems like Debian, Ubuntu, RedHat, and Solaris. All of them neatly include the Erlang distribution required to run Riak, so you don't have to install anything other than the package itself. That saves you the trouble of dealing with Linux distributions that come with outdated versions of Erlang. Which is most of them, really.



So when you're on Ubuntu or Debian, simply download the .deb file and install it using dpkg.

$ wget downloads.basho.com/riak/riak-1.1.2/riak_1.1.2-1_amd64.deb
$ dpkg -i riak_1.1.2-1_amd64.deb

    Now you can start Riak using the provided init script.

    $ sudo /etc/init.d/riak start

The procedures are pretty similar, no matter if you're on Ubuntu, Debian, RedHat, or Solaris. The beauty of this holistic approach to packaging Riak is that it's easy to automate.

Talking to Riak

The easiest way to become friends with Riak is to use its HTTP interface. Later, in production, you're more likely to turn to the Protocol Buffers interface for better performance and throughput, but HTTP is just a nice and visual way to explore the things you can do with Riak.

Riak's HTTP implementation is as RESTful as it gets. Important details (links, vector clocks, modification times, ETags, etc.) are nicely exposed through proper HTTP headers, and Riak utilizes multi-part responses where applicable.

Buckets

Other than a key and a value, Riak divides data into buckets. A bucket is nothing more than a way to logically separate physical data, so for example, all user objects can go into a bucket named users. A bucket is also a way to set different properties for things like replication for different types of data. This allows you to have stricter rules for objects that are of more importance in terms of consistency and replication than data for which a lack of immediate replication is acceptable, such as sessions.
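
To give you a rough idea of what setting such properties looks like, here's a minimal sketch that talks straight to the HTTP interface using nothing but Node's built-in http module. The {"props": ...} payload follows Riak's HTTP API for bucket properties; the sessions bucket and the n_val of 2 are made-up examples, not something the book prescribes.

var http = require('http');

// Example only: tell Riak to keep two replicas of everything in the
// sessions bucket instead of the default three.
var body = JSON.stringify({props: {n_val: 2}});

var request = http.request({
  host: 'localhost',
  port: 8098,
  path: '/riak/sessions',
  method: 'PUT',
  headers: {
    'Content-Type': 'application/json',
    'Content-Length': Buffer.byteLength(body)
  }
}, function(response) {
  console.log('Bucket properties updated, status ' + response.statusCode);
});

request.end(body);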

Fetching Objects

Now that we got that out of the way, let's talk to our database. That's why I love using HTTP to get to know it better; it's such a nice and human-readable format, with no special libraries required. We'll start off the basics using both a client library and curl, so you'll see what's going on under the covers.


When you're installing and starting Riak, it installs a bunch of URL handlers, one of them being /riak, which we'll play with for the next couple of sections. Again, the client libraries are hiding that from us, but when you're playing on your own, using curl, my favorite browser, it's good to know.

If you haven't done so already, fire up the Node shell, and let's start with some basics. After this example I'm assuming the riak variable is loaded in the Node.js console and points to the riak-js client.

var riak = require('riak-js').getClient()
riak.get('tweets', '41399579391950848')

We're looking for a tweet with, granted, a rather odd-looking key, but it's a real tweet, and the key conforms to Twitter's new scheme for tweet identifiers, so there you have it.

What riak-js does behind the curtains is send a GET request to the URL /riak/tweets/41399579391950848. Riak, being a good HTTP sport, returns a status code of 404. You can try this yourself using curl.

    $ curl localhost:8098/riak/tweets/41399579391950848

    As you'll see it doesn't return anything yet, so let's create the object in Riak.

Creating Objects

To create or update an object using riak-js, we'll simply use the save() function and specify the object to save.

riak.save('tweets', '41399579391950848', {
  user: "roidrage",
  tweet: "Using @riakjs for the examples in the Riak chapter!",
  tweeted_at: new Date(2011, 1, 26, 8, 0)
})

Under the covers, riak-js sends a PUT request to the URL /riak/tweets/41399579391950848, with the object we specified as the body. It also automatically uses application/json as the content type and serializes the object to a JSON string, as this is clearly what we're trying to store in Riak. Here's how you'd do that using curl.

curl -X PUT localhost:8098/riak/tweets/41399579391950848 \
  -H 'Content-Type: application/json' -d @-



  • {"user":"roidrage","tweet":"Using @riakjs for the examples in the Riak chapter!","tweeted_at":"Mon Dec 05 2011 17:31:40 GMT+0100 (CET)"}

Phew, this looks a tiny bit more confusing. We're telling curl to PUT to the specified URL, to add a header for the content type, and to read the request body from stdin (that's the odd-looking parameter -d @-). Type Ctrl-D after you're done with the body to send the request.

Riak will automatically create the bucket and use the key specified in the URL the PUT was sent to. Sending subsequent PUT requests to the same URL won't recreate the object, they'll update it instead. Note that you can't update single attributes of a JSON document in Riak. You always need to specify the full object when writing to it.

Object Metadata

Every object in Riak has a default set of metadata associated with it. Examples are the vector clock, links, date of last modification, and so on. Riak also allows you to specify your own metadata, which will be stored with the object. When HTTP is used, they'll be specified and returned as a set of HTTP headers.

To fetch the metadata in JavaScript, you can add a third parameter to the call to get(): a function to evaluate errors, the fetched object, and the metadata for that object. By default, riak-js dumps errors and the object to the console. Let's peek into the metadata and look at what we're getting.

riak.get('tweets', '41399579391950848', function(error, object, meta) {
  console.log(meta);
})

    The result will look something like the output below.

{ usermeta: {},
  debug: false,
  api: 'http',
  encodeUri: false,
  host: 'localhost',
  clientId: 'riak-js',
  accept: 'multipart/mixed, application/json;q=0.7, */*;q=0.5',
  binary: false,
  raw: 'riak',
  connection: 'close',
  responseEncoding: 'utf8',
  contentEncoding: 'utf8',
  links: [],
  port: 8098,
  bucket: 'tweets',
  key: '41399579391950848',
  headers:
   { Accept: 'multipart/mixed, application/json;q=0.7, */*;q=0.5',
     Host: 'localhost',
     Connection: 'close' },
  contentType: 'application/json',
  vclock: 'a85hYGBgzGDKBVIcypz/fvptYKvIYEpkymNl4NxndYIvCwA=',
  lastMod: 'Fri, 18 Nov 2011 11:31:21 GMT',
  contentRange: undefined,
  acceptRanges: undefined,
  statusCode: 200,
  etag: '68Ze86EpWbh8dbAcpMBpZ0' }

The vector clock is indeed a biggie, and as you update an object, you'll see it grow even more. Try updating our tweet a few times, just for fun and giggles.

for (var i = 0; i < 5; i++) {
  riak.get('tweets', '41399579391950848', function(error, object, meta) {
    riak.save('tweets', '41399579391950848', object);
  })
}

Now if you dump the object's metadata on the console one more time, you'll see that it has grown a good amount with just five updates.

Custom Metadata

You can specify a set of custom metadata yourself. riak-js makes that process fairly easy: simply specify a fourth parameter when calling save(). Let's attach some location information to the tweet.

var tweet = {
  user: 'roidrage',
  tweet: 'Using riakjs for the examples in the Riak chapter!',
  tweeted_at: new Date(2011, 1, 26, 8, 0)
}

riak.save('tweets', '41399579391950848', tweet,
  {latitude: '52.523324', longitude: '13.41156'})

When done via HTTP, you simply specify additional headers in the form of X-Riak-Meta-Y, where Y is the name of the metadata you'd like to be stored with the object. So in the example above, the headers would be X-Riak-Meta-Latitude and X-Riak-Meta-Longitude. If you don't believe me, we can ask our good friend curl for verification.

$ curl -v localhost:8098/riak/tweets/41399579391950848
...snip...
< X-Riak-Meta-Longitude: 13.41156
< X-Riak-Meta-Latitude: 52.523324
...snap...

Note that, just like with the object itself, you always need to specify the full set of metadata when updating an object, as it's always written anew. Which makes using riak-js all the better, because the meta object you get from the callback when fetching an object lends itself nicely to be reused when saving the object again later.
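
A minimal sketch of that pattern; it assumes riak-js accepts the meta object from the get() callback as the fourth argument to save(), the same slot we used for the custom metadata above. The favorited attribute is made up for the example.

riak.get('tweets', '41399579391950848', function(error, tweet, meta) {
  // Change the object, then write the full document back, handing the
  // fetched meta - and with it the custom metadata - straight to save().
  tweet.favorited = true;
  riak.save('tweets', '41399579391950848', tweet, meta);
})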

Linking Objects

Linking objects is one of the neat additions of Riak over and above Dynamo. You can create logical trees or even graphs of objects. If you fancy object-oriented programming, this can be used as the equivalent of object associations.

By default, every object has only one link: a reference to its bucket. When using HTTP, links are expressed using the syntax specified in the HTTP RFC. A link can be tagged to give the connection context. Riak doesn't enforce any referential integrity on links though; it's up to your application to catch and handle nonexistent ends of links.

In our tweets example however, one thing we could nicely express with links is a tweet reply. Say frank06, author of riak-js, responded to my tweet, saying something like "@roidrage Dude, totally awesome!" We'd like to store the reference to the original tweet as a link for future reference. We could of course simply store the original tweet's identifier, but where's the fun in that?

To store links, riak-js allows us to specify them as a list of JavaScript hashes (some call them objects, but I like to mix it up).

var reply = {
  user: 'frank06',
  tweet: '@roidrage Dude, totally awesome!',
  tweeted_at: new Date(2011, 1, 26, 8, 0)
};

riak.save('tweets', '41399579391950849', reply,
  {links: [{tag: 'in_reply_to',
            key: '41399579391950848',
            bucket: 'tweets'}]})

A link is a simple set consisting of a tag, a key, and a bucket. The tag in this case identifies this tweet as a reply to the one we had before; we're using the tag in_reply_to to mark it as such. This way we can store entire conversations as a combination of links and key-value data, walking the path up to the root tweet at any point.

Now when you fetch the new object via HTTP, you'll notice that the header for links has grown and contains the link we just defined.

$ curl -v localhost:8098/riak/tweets/41399579391950849
...
Link: </riak/tweets/41399579391950848>; riaktag="in_reply_to",
  </riak/tweets>; rel="up"

    ...

You can fetch them with riak-js too, using the metadata object, which will give you a nice array of objects containing bucket, tag and key.

riak.get('tweets', '41399579391950849', function(error, object, meta) {
  console.log(meta.links)
})

An object can have an arbitrary number of links attached to it, but there are some boundaries. It's not recommended to have more than 10000 links on a single object. Consider for example that all the links are sent through the HTTP API, which makes a couple of HTTP clients explode, because the single header for links is much larger than expected. The number of links on an object also adds to its total size, making an object with thousands of links more and more expensive to fetch and send over the network.

Walking Links

So now that we have links in place, how do we walk them, how can we follow the graph created by links? Riak's HTTP API offers a simple way to fetch linked objects through an arbitrary number of links. When you request a single object, you attach one or more additional parameters to the URL, specifying the target bucket, the tag and whether you would like the linked object to be included in the response.


riak-js doesn't have support to walk links from objects in this way yet, so we'll look at the URLs instead. Play along to see what the results look like. Let's have a look at an example.

    $ curl .../riak/tweets/41399579391950849/tweets,in_reply_to,_/

    There are three parameters involved in this link phase.

• tweets tells Riak that we only want to follow links pointing to the bucket tweets

• in_reply_to specifies the link tag we're interested in

• The last parameter (_ in this example) tells Riak whether or not you want the object pointed to by this link returned in the response. It defaults to 0, meaning false, but setting it to 1 gives you a more complete response as you walk deeper nested links.

When you run the command above you'll receive a multi-part response from Riak which is not exactly pretty to look at. The response includes all the objects that are linked to from this tweet.

Given the nature of a Twitter conversation it will usually be just one, but you could also include links to the people mentioned in this tweet, giving them a different tag and giving the whole tweet even more link data to work with.

If you have multiple tags you're interested in, or don't specifically care about the target bucket, you can replace both with _, and it will follow links to any bucket or with any tag respectively. The following query will simply return all linked objects.

    $ curl localhost:8098/riak/tweets/41399579391950849/_,_,_/
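Since riak-js can't walk links for us yet, here's a minimal sketch that issues the same request with nothing but Node's built-in http module; it simply dumps the raw multi-part body to the console as it streams in.

var http = require('http');

// Fetch all objects linked from the reply, regardless of bucket or tag.
http.get({
  host: 'localhost',
  port: 8098,
  path: '/riak/tweets/41399579391950849/_,_,_/'
}, function(response) {
  response.setEncoding('utf8');
  response.on('data', function(chunk) {
    console.log(chunk);
  });
});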

Walking Nested Links

You aren't limited to walking just one level of links; you can walk around the resulting graph of objects at any depth. Just add more link specifications to the URL. Before we try it out, let's throw in another tweet that's a reply to the reply, so we have a conversation chain of three tweets. We'll do this in our Node console.

var reply = {
  user: 'roidrage',
  tweet: "@frank06 Thanks for all the work you've put into it!",
  tweeted_at: new Date(2011, 1, 26, 10, 0)
};

riak.save('tweets', '41399579391950850', reply,
  {links: [{tag: 'in_reply_to', key: '41399579391950849', bucket: 'tweets'}]})

    $ curl localhost:8098/riak/tweets/41399579391950850/_,_,_/_,_,_/

This query will walk two levels of links, so given a conversation with one reply to another reply to the original tweet, you can get the original tweet from the second reply. Mind-bending in a way, but pretty neat, because with this query you'll also receive all the objects in between with the response, not just the original tweet, but all replies too.

The Anatomy of a Bucket

There is great confusion around what a bucket in Riak actually is, and what you can and cannot do with it. A bucket is just a name in Riak, that's it. It's a name that allows you to set some configuration properties like data distribution, quorum and such, but it's just a name.

A bucket is not a physical entity. Whenever you reference a bucket and key in Riak to fetch a value, the two are effectively one and the same. To look up data Riak always uses both the bucket and the key; only together do they make up the complete key, which is also used for hashing to find the node responsible for the data. The lookup order in Riak is always hash(bucket + key) and not bucket/hash(key).

A bucket is nothing like a table in a relational database. A table is a physical entity that's stored in a different location than other tables. So when you think of a bucket, don't think of it as a table or anything else that relates to a physical separation of data. It's just a namespace, nothing more. And yes, the name "bucket" is rather unfortunate in that regard, as it suggests a physical separation of data in the first place.

All this has a couple of implications, most of them easily thwarting the expectations people coming to Riak usually have.

• You can't just get all the keys of objects stored in a particular bucket. To do that, Riak has to go through its entire dataset, filtering out the ones that match the bucket.

• You can't just delete all data in a bucket, as there is no physical distinction between a bucket and a key. If you need to keep track of the data, you need to keep additional indexes on it, or you can list keys, though the latter is not exactly recommended either.


• You can set specific properties on a per-bucket basis, such as the number of replicas, quorum and other niceties, which override the defaults for all buckets in the cluster. The configuration for every bucket created over the lifetime of a cluster is part of the whole ring configuration that all nodes in a Riak cluster share.

List All Of The Keys

Now that we got that out of the way, you're bound to ask: "but how do I get all of my keys out of Riak?" Or: "how can I count all the keys in my Riak?" Before we dive into that, let me reply with this: "don't try this at home, or rather, don't use this in production, or at least keep using it to a necessary minimum."

For fetching all keys, even of a single bucket, the whole Riak cluster has to go through its entire key set, either reading it from disk or from memory, but through the whole set nonetheless, finding the ones belonging to that particular bucket by looking at the full key. Depending on the number of keys in your cluster, this can take time. Going through millions of keys is not a feat done in one second, and it puts a bit of load on your cluster too. Performance also depends on the storage backend chosen, as some keep all the keys in memory, while others have to load them from disk.

Now that we got the caveats out of the way, the way to fetch all keys in a bucket is to request the bucket with an additional query parameter keys=true. That will cause the whole cluster to load the keys and return them in one go. riak-js has a keys() method:

    riak.keys('tweets')

A word of warning though, this will choke with Node.js when there are a lot of objects in the bucket. This is because listing all keys generates a really long header with links to all the objects in the bucket. You'll probably want to use the streaming version of listing keys as shown further down.

    The same as a plain old HTTP request using curl:

    $ curl 'localhost:8098/riak/tweets?keys=true'

This will return pretty quickly if you have only a couple of objects stored in Riak; several tens of thousands are not a big problem either, but what you probably want to do instead is to stream the keys as they're read on each node in the cluster. You won't get all keys in one response, but the Riak node coordinating the request will send the keys to the clients as they are sent by all the other nodes. To do that set the parameter keys to the value stream.

    $ curl 'localhost:8098/riak/tweets?keys=stream'

With curl, it will keep dumping keys on your console as long as the connection is kept open. In riak-js, due to its asynchronous nature, things need some more care. It works with an EventEmitter object, a Node.js-specific type that triggers events when it receives data. We'll do the simplest thing possible and dump the keys onto the console.

    riak.keys('tweets', {keys: 'stream'}).on('keys', console.log).start()

If you really must list keys, you want to use the streaming version. riak-js uses the streaming mechanism to give you a means of counting all objects in a bucket by way of a count('tweets') function.
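
Usage is as simple as it sounds. A quick sketch; the callback signature is an assumption on my part, following the usual riak-js (error, result) convention.

riak.count('tweets', function(error, total) {
  // total should hold the number of objects in the tweets bucket
  console.log('Number of tweets:', total);
})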

In general, if you find yourself wanting to list keys in a bucket a lot, it's very likely you actually want to use something like a full-text search or secondary indexes. Thankfully, Riak comes with both. When you do list keys, keep a good eye on the load in your cluster. With tens of millions of keys, the load will increase for sure, and the request may eventually even time out. So you need to do your homework; tens of millions of keys are a lot to gather and collect over a network.

    How Do I Delete All Keys in a Bucket?

As you probably realize by now, this is no easy feat. As bucket and keys are one and the same, the only way to delete all data in a bucket is to list all the keys in that bucket, or to keep a secondary index of the keys, by using some secondary data store. Redis has been used for this in the past, for example. You can also keep a list of keys as a separate Riak object, or use some of Riak's built-in query features. As they are quite comprehensive, I'll give them the attention they deserve in the next section.

The approach of using key listings to delete data has certainly been used in the past, but again involves loading all keys in a bucket. If you use it cautiously with streaming key listings, it might work well enough.
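
A rough sketch of what that could look like with the streaming key listing; it assumes riak-js exposes a remove(bucket, key) call for deletes and that each keys event hands you an array of keys, so treat it as pseudocode if your client behaves differently.

riak.keys('tweets', {keys: 'stream'})
  .on('keys', function(keys) {
    // Delete every key in this chunk as it streams in.
    keys.forEach(function(key) {
      riak.remove('tweets', key);
    });
  })
  .start();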

There's one thing to be aware of when deleting based on listing keys. You may see ghost keys showing up when listing keys immediately after deleting objects. The list of keys is always an indication; it may not always be 100% accurate when it comes to the objects stored with the keys.

    How Do I Get the Number of All Keys in a Bucket?

The short version: see above. There is no built-in way of getting the exact number of keys in a bucket. Atomically incrementing an integer value is a feat that's not easy to achieve in a distributed system as it requires coordination. That won't help you right now, as the exact number is what you're after.

The longer version involves either building indices, using Riak Search or Riak Secondary Indexes (which we'll get to soon enough). You could use a range large enough, maybe by utilizing the object's key (assuming there's some sort of range there), and then feed the data into a reduce phase, avoiding loading the actual objects from Riak, counting the objects as you go. The downside of this approach is that it may not catch all keys, that data needs to be fully indexed, and that you need to use Erlang for a MapReduce query. The latter is simple enough, especially for this particular use case, and we'll look at the details in the MapReduce section.

You can stream the keys in a bucket to the client and keep counting the result, but it certainly won't give you an ad-hoc view if you're storing tens of millions of objects, as it will take time.

Or you keep track of the number of objects through some external means, for example using counters in Redis. If you need statistics on the number of objects, you should keep separate statistics around. You could feed them into a monitoring tool like Graphite or Munin, use a number of Redis instances to keep track of them, or something entirely different. You could even use built-in mechanisms, namely post-commit hooks, to update your counters when data was updated or deleted. If ad-hoc numbers are what you need, this is a good way to get them. Otherwise you'll pay with decreased performance numbers, as your cluster is busy combing through the keys.

The bottom line is, you need to think about these things upfront, before putting Riak in production. Retrofitting solutions gets harder and harder the more data you store in Riak.

Querying Data

Now that we got the basics out of the way, let's look at how you can get data out of Riak. We already covered how you can get an object out of Riak, simply by using its key. The problem with that approach is that you have to know the key. That's somewhat the dilemma of using a key-value store.

There are some inherent problems involved when wanting to run a query across the entire data set stored in a Riak cluster, especially when you're dealing with millions of objects.

Because Justin Bieber is so wildly popular, and because we need some data to play with, I whipped up a script to use Twitter's streaming API to fetch all the tweets mentioning him. You can change the search term to anything you want, but trust me, with Bieber in it, you'll end up having thousands of tweets in your Riak database in no time.

The script requires your Twitter username and password to be set as environment variables TWITTER_USER and TWITTER_PASSWORD respectively. Now you can just run node 08-riak/twitter-riak.js (you'll find the script at https://github.com/mattmatt/nosql-handbook-examples/blob/master/08-riak/twitter-riak.js) and watch as pure awesomeness is streaming into your database. Leave it running for an hour or so, believe me, it's totally worth it.

If you can't wait, five minutes will do. You'll still have at least a hundred tweets as a result. The script will also store replies as proper links, so the longer it runs the more likely you'll end up at least having some discussions in there.

MapReduce

Assuming you have a whole bunch of tweets in your local Riak, the easiest way to sift through them is by using MapReduce. Riak's MapReduce implementation supports using both JavaScript and Erlang to run MapReduce, with JavaScript being more suitable for ad hoc style queries, whereas Erlang code needs to be known to all physical nodes in the cluster before you can use it, but comes with some performance benefits.

Speaking of Riak's MapReduce as a means to query data is actually a bit of a lie, as it's rather a way to analyze and aggregate data. There are some caveats involved, especially when you're trying to run an analysis on all the data in your cluster, but we'll look at them in a minute.

A word of warning up-front: there is currently a bug in Riak that might come up when you have stored several thousand tweets, and you're running a JavaScript MapReduce request on them. Should you run into an error running the examples below, there is a section dedicated to the issue and workarounds.



MapReduce Basics

A MapReduce query consists of an arbitrary number of phases, each feeding data into the next. The first part is usually specifying an input, which can be an entire bucket or a number of keys. You can choose to walk links from the objects returned from that phase too, and use the results as the basis for a MapReduce request.

Following that can be any number of map phases, which will usually do any kind of transformation of the data fed into them from buckets, link walks or a previous map phase. A map phase will usually fetch attributes of interest and transform them into a format that is either interesting to the user, or that will be used and aggregated by a following reduce phase.

It can also transform these attributes into something else, like only fetching the year and month from a stored date/time attribute. A map phase is called for every object returned by the previous phase, and is expected to return a list of items, even if it contains only one. If a map phase is supposed to be chained with a subsequent map phase, it's expected to return a list of bucket and key pairs.

Finally, any number of reduce phases can aggregate the data handed to them by the map phases in any way, sort the results, group by an attribute, or calculate maximum and minimum values.

Mapping Tweet Attributes

Now it's time to sprinkle some MapReduce on our tweet collection. Let's start by running a simple map function. A MapReduce request sent to Riak using the HTTP API is nothing more than a JSON document specifying the inputs and the phases to be executed. For JavaScript functions, you can simply include their stringified source in the document, which makes it a bit tedious to work with. But as you'll see in a moment, riak-js handles this much more JavaScript-like.

Let's build a map function first. Say we're interested in tweets that contain the word "love", because let's be honest, everyone loves Justin Bieber. Riak.mapValuesJson(), used in the code snippet below, is a built-in function that extracts the value of an object and parses serialized JSON into JavaScript objects.

var loveTweets = function(value) {
  try {
    var doc = Riak.mapValuesJson(value)[0];
    if (doc.tweet.match(/love/i)) {
      return [doc];
    } else {
      return [];
    }
  } catch (error) {
    return [];
  }
}

Before we look at the raw JSON that's sent to Riak, let's run this in the Node console, feeding it all the tweets in the tweets bucket.

    riak.add('tweets').map(loveTweets).run()

Imagine a long list of tweets mentioning Justin Bieber scrolling by, or try it out yourself. The number of tweets you'll get will vary from day to day, but given that so many people are in love with Justin, I don't have the slightest doubt that you'll see a result here.

Using Reduce to Count Tweets

What if we want to count the tweets using the output we got from the map function above? Why, we write a reduce function of course.

Reduce functions will usually get a list of values from the map function, not just one value. So to aggregate the data in that list, you iterate over it and, well, reduce it. Thankfully JavaScript has got us covered here. Let's whip out the code real quick.

var countTweets = function(values) {
  return [values.reduce(function(total, value) {
    return total + 1;
  }, 0)];
}

Looks simple enough, right? We iterate over the list of values using JavaScript's built-in reduce function and keep a counter for all the results fed to the function from the map phase.

    Now we can run this in our console.

    riak.add('tweets').map(loveTweets).reduce(countTweets).run()

    // Output: [ 8 ]



The result is weird: the number is a lot smaller than expected. When you compare it to the list of actual tweets containing "love", you'll notice quite a gap. There's a reason for this, and it's generally referred to as re-reduce. We can fix this no problem, but let's look at what it actually is first.

Re-reducing for Great Good

It's not unlikely that a map function will return a pretty large number of results. For efficiency reasons, Riak's MapReduce doesn't feed all results into the reduce functions immediately; instead it splits them up into chunks. Say the list of tweets returned by the map function is split into chunks of 100. Each chunk is fed into the reduce function as an array, then the results are collected into a new array, which again is fed into the same reduce function.

This may or may not happen, depending on how large the initial combined results from the reduce functions are. But in general your reduce function should be prepared to receive two different inputs, unless it returns the same kind of result as the map function.

This can be the cause of great confusion, because it means your reduce function needs to be somewhat aware of its own output and the output of the map function, and be able to differentiate both to calculate a correct result.

Now, let's make the above reduce function safe for re-reducing. All we really need to do to make it work is make it aware that values can be either objects o