Call me maybe: Jepsen and flaky networks


Shalin Shekhar Mangar @shalinmangar Lucidworks Inc.

Typical first year for a new cluster

• ~5 racks out of 30 go wonky (50% packet loss)

• ~8 network maintenances (4 might cause ~30-minute random connectivity losses)

• ~3 router failures (have to immediately pull traffic for an hour)

— Jeff Dean, Google (LADIS 2009)

Reliable networks are a myth

• GC pause

• Process crash

• Scheduling delays

• Network maintenance

• Faulty equipment

Network

[Diagram: five nodes, n1–n5, connected over one network]

Network partition

[Diagram: the same five nodes split into two groups that cannot reach each other]

Messages can be lost, delayed, reordered and duplicated

[Diagram: message timelines between n1 and n2 illustrating the four failure modes: drop, delay, duplicate, and reorder]
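To make these failure modes concrete, here is a minimal Python sketch (every name is hypothetical) of a flaky channel that drops, duplicates, delays, and reorders messages:

```python
import random

rng = random.Random(42)  # seeded so the run is reproducible

def unreliable_send(channel, message):
    """Simulate one send over a flaky network: the message may be
    dropped, duplicated, or delivered out of order."""
    roll = rng.random()
    if roll < 0.2:
        return                              # drop: lost forever
    if roll < 0.4:
        channel.append(message)             # duplicate: an extra copy
    # delay/reorder: land at a random position instead of the tail
    channel.insert(rng.randrange(len(channel) + 1), message)

network = []                                # receiver drains this in order
for i in range(8):
    unreliable_send(network, f"msg-{i}")
print(network)   # some messages missing, repeated, or out of order
```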

CAP recap

• Consistency (Linearizability): A total order on all operations such that each operation looks as if it were completed at a single instant.

• Availability: Every request received by a non-failing node in the system must result in a response.

• Partition Tolerance: Arbitrarily many messages between two nodes may be lost. Mandatory unless you can guarantee that partitions don’t happen at all.

Have you planned for these?

[Diagram: Availability and Consistency, each crossed out]

• Errors

• Connection timeouts

• Hung requests (read timeouts)

• Stale results

• Dirty results

• Data lost forever!

During and after a partition

Jepsen: Testing systems under stress

• Network partitions

• Random process crashes

• Slow networks

• Clock skew

http://github.com/aphyr/jepsen

Anatomy of a Jepsen test

• Automated DB setup

• Test definitions, a.k.a. the Client

• Partition types, a.k.a. the Nemesis

• Scheduler of operations (client & nemesis)

• History of operations

• Consistency checker

The DB setup and client are data store specific (Mongo/Solr/Elastic); the nemesis, scheduler, history, and checker are provided by Jepsen.

[Diagram: clients c1–c3 run operations against datastore nodes n1–n3; each outcome (OK, failed, or unknown) is appended to the history]
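Putting those pieces together, the control loop looks roughly like this. A minimal Python sketch of the structure only (real Jepsen is written in Clojure and runs clients concurrently; every name here is hypothetical, and client_ops is a list of (name, callable) pairs):

```python
import random
import time

def run_test(client_ops, nemesis, duration_s=60):
    """Drive client operations while the nemesis injects faults,
    recording every outcome in the history."""
    history = []                          # [(t, op, result)]
    start = time.time()
    while time.time() - start < duration_s:
        t = time.time() - start
        if random.random() < 0.05:
            nemesis()                     # e.g. cut or heal a partition
        op_name, op = random.choice(client_ops)
        try:
            result = op()                 # 'ok', a value, or an error
        except TimeoutError:
            result = 'indeterminate'      # may or may not have applied!
        history.append((t, op_name, result))
    return history                        # handed to the checker
```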

nem·e·sis (noun): the inescapable agent of someone’s downfall

Nemesis

[Diagram: partition-random-node isolates one of n1–n5; kill-random-node and clock-scrambler each hit a random node]

Nemesis

[Diagram: partition-halves splits n1–n5 into two fixed halves; partition-random-halves picks the halves at random; bridge cuts the cluster in two but leaves n3 able to see both sides]
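Under the hood, a partition nemesis typically just installs firewall rules on the victim nodes. A hedged sketch of partition-halves (host names are hypothetical; Jepsen drives this over SSH from a control node):

```python
import subprocess

NODES = ["n1", "n2", "n3", "n4", "n5"]

def partition_halves(nodes=NODES):
    """Split the cluster into [n1, n2] and [n3, n4, n5] by installing
    iptables DROP rules on both sides of the cut."""
    left, right = nodes[:2], nodes[2:]
    for a in left:
        for b in right:
            for src, dst in ((a, b), (b, a)):
                # On host dst, drop every packet arriving from src.
                subprocess.run(["ssh", dst, "iptables", "-A", "INPUT",
                                "-s", src, "-j", "DROP"], check=True)

def heal(nodes=NODES):
    """Flush the firewall rules, restoring full connectivity."""
    for host in nodes:
        subprocess.run(["ssh", host, "iptables", "-F"], check=True)
```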

A set of integers: cas-set-client

• S = {1, 2, 3, 4, 5, …}

• Stored as a single document containing all the integers

• Update using compare-and-set

• Multiple clients try to update concurrently

• Create and restore partitions

• Finally, read the set of integers and verify consistency
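Each client's loop is simple, assuming the store exposes read and compare-and-set primitives (store.read and store.cas are hypothetical wrappers over the store's API):

```python
def add_with_cas(store, value, retries=10):
    """Add `value` to the shared set with compare-and-set; on a
    version conflict, re-read and retry."""
    for _ in range(retries):
        old = store.read()                    # current set, e.g. {1, 2}
        if store.cas(expected=old, update=old | {value}):
            return 'ok'                       # acknowledged write
    return 'fail'                             # never acknowledged
```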

Compare and Set client

[Diagram: two clients interleave cas operations over time. cas({}, 1) and cas(1, 2) succeed, yielding {1} and then {1, 2}; cas(1, 3) and cas(2, 4) fail on conflicts; cas(2, 5) succeeds, yielding {1, 2, 5}]


History = [(t, op, result)]
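Given such a history, the set checker is mechanical: every acknowledged add must appear in the final read. A minimal sketch (the tuple shapes are assumptions matching the history format above):

```python
def check_cas_set(history, final_read):
    """Every write the store acknowledged must appear in the final
    read; anything acknowledged but absent was lost."""
    acked = {n for (_t, (op, n), result) in history
             if op == 'add' and result == 'ok'}
    return acked - final_read    # non-empty set means data loss

# The timeline above: the adds of 3 and 4 failed or timed out.
history = [(0, ('add', 1), 'ok'),
           (1, ('add', 2), 'ok'),
           (2, ('add', 3), 'fail'),
           (3, ('add', 4), 'indeterminate'),
           (4, ('add', 5), 'ok')]
print(check_cas_set(history, final_read={1, 2, 5}))   # set() => no loss
```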

Solr

• Search server built on Lucene

• Lucene index + transaction log

• Optimistic concurrency, linearizable CAS ops

• Synchronous replication to all ‘live’ nodes

• ZooKeeper for ‘consensus’

• http://lucidworks.com/blog/call-maybe-solrcloud-jepsen-flaky-networks/
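Solr's CAS is its optimistic concurrency on the _version_ field: an update carrying a _version_ is rejected with HTTP 409 unless it matches the stored version. A minimal sketch using the requests library (URL and collection name are illustrative):

```python
import requests

SOLR = "http://localhost:8983/solr/jepsen"   # hypothetical collection

def cas_update(doc, expected_version):
    """Send an update that succeeds only if the stored _version_
    still equals expected_version; Solr answers 409 on conflict."""
    doc = {**doc, "_version_": expected_version}
    r = requests.post(f"{SOLR}/update?commit=true", json=[doc])
    if r.status_code == 409:
        return False          # version conflict: lost the race
    r.raise_for_status()
    return True

# Usage: read the doc (and its _version_), modify it, then CAS it back.
```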

The test: add an integer every second; partition the network every 30 seconds; run for 200 seconds

Solr - Are we safe?

• Leaders become unavailable for up to the ZK session timeout, typically 30 seconds (expected)

• Some writes ‘hang’ for a long time on partition. Timeouts are essential. (unexpected)

• Final reads under CAS are consistent but we haven’t proved linearizability (good!)

• Loss of availability for writes in minority partition. (expected)

• No data loss (yet!) which is great!

Solr - Bugs, bugs & bugs

• SOLR-6530: Commits under network partition can put any node into ‘down’ state.

• SOLR-6583: Resuming connection with ZK causes log replay

• SOLR-6511: Request threads hang under network partition

• SOLR-7636: A flaky cluster status API - times out during partitions

• SOLR-7109: Indexing threads stuck under network partition can mark leader as down

Elastic

• Search server built on Lucene

• It has a Lucene index and a transaction log

• Consistent single doc reads, writes & updates

• Eventually consistent search but a flush/commit should ensure that changes are visible

Elastic

• Optimistic concurrency control, a.k.a. CAS linearizability

• Synchronous acknowledgement from a majority of nodes

• “Instantaneous” promotion under a partition

• Homegrown ‘ZenDisco’ consensus
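In the Elasticsearch of this era, CAS means passing an expected version with the index request; a 409 version conflict comes back if the document changed underneath you. A hedged sketch (index, type, and URL are illustrative):

```python
import requests

ES = "http://localhost:9200/jepsen/doc/1"    # hypothetical index/doc

def cas_index(body, expected_version):
    """Reindex the document only if its stored version still matches;
    Elasticsearch answers 409 (version conflict) otherwise."""
    r = requests.put(ES, params={"version": expected_version}, json=body)
    if r.status_code == 409:
        return False          # someone else updated the doc first
    r.raise_for_status()
    return True
```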

Elastic - Are we safe?

• “Instantaneous” promotion is not: up to 90-second timeouts to elect a new primary (worse before 1.5.0)

• Bridge partition: 645/1961 acknowledged writes lost in 1.1.0. Better in 1.5.0: only 22/897 lost.

• Isolated primaries: 209/947 updates lost

• Repeated pauses (simulating GC): 200/2143 updates lost

• Getting better but not quite there. Good documentation on resiliency problems.

MongoDB

• Document-oriented database

• Replica set has a single primary which accepts writes

• Primary asynchronously replicates writes to secondaries

• Replicas decide among themselves when to promote/demote a primary

• Applies to 2.4.3 and 2.6.7

MongoDB

• Claims atomic writes per document and consistent reads

• But strict consistency only when reading from primaries

• Eventual consistency when reading from secondaries
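That split shows up directly in client code: the write concern and read preference pick your spot on the spectrum. A sketch with pymongo (modern API shown; the 2.4/2.6-era drivers spelled these options differently, and the connection string is hypothetical):

```python
from pymongo import MongoClient, ReadPreference
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://n1,n2,n3/?replicaSet=rs0")  # hypothetical
db = client.get_database("jepsen")

# Wait for a majority of the replica set to acknowledge the write...
docs = db.get_collection("docs",
                         write_concern=WriteConcern(w="majority"))
docs.insert_one({"_id": 1, "value": 42})

# ...yet a read routed to a secondary may still be stale.
stale = db.get_collection(
    "docs", read_preference=ReadPreference.SECONDARY_PREFERRED)
print(stale.find_one({"_id": 1}))   # eventual consistency
```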

MongoDB - Are we safe?

Source: https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads

MongoDB - Are we really safe?

• Inconsistent reads are possible even with majority write concern

• Read-uncommitted isolation

• A minority partition will allow both stale reads and dirty reads

Conclusion

• Network communication is flaky! Plan for it.

• Hacker News driven development (HDD) is not a good way of choosing data stores!

• Test the guarantees of your data stores.

• Help me find more Solr bugs!

References

• Kyle Kingsbury’s posts on Jepsen: https://aphyr.com/tags/jepsen

• Solr & Jepsen: http://lucidworks.com/blog/call-maybe-solrcloud-jepsen-flaky-networks/

• Jepsen on GitHub: https://github.com/aphyr/jepsen

• Solr fork of Jepsen: https://github.com/LucidWorks/jepsen

Solr/Lucene Meetup on 25th July 2015

Venue: Target Corporation, Manyata Embassy Business Park

Time: 9:30am to 1pm

Talks:

Crux of eCommerce Search and Relevancy

Creating Search Analytics Dashboards

Signup at http://meetu.ps/2KnJHM

Thank you!

[email protected]

@shalinmangar