Couchbase and Apache Spark

Efficient data crunching in a fast-moving world

©2015 Couchbase Inc.

Matt Ingenthron (@ingenthr)

• Worked on large site scalability problems at previous company…
• memcached contributor
• Joined Couchbase very early and helped define key parts of system

A Quick Architectural Introduction to Couchbase


Couchbase is a Document Oriented Database

• High availability cache
• Key-value store
• Document database
• Embedded database
• Sync management

Couchbase can be used in a number of ways. Developers often need a simple distributed hashtable, then grow to need secondary indexing, and are either mobile-first or need to address mobile deployment.


What makes Couchbase unique?

Enterprises choose Couchbase for several key advantages

• Performance & scalability leader: Sub-millisecond latency with high throughput; memory-centric architecture
• Multi-purpose: Cache, key value store, document database, and local/mobile database in single platform
• Simplified administration: Easy to deploy & manage; integrated Admin Console, single-click cluster expansion & rebalance
• Always-on availability (24x365): Data replication across nodes, clusters, and data centers


Multi-purpose database supports many uses

• Tunable built-in cache: Consolidated cache and database; tune memory required based on application requirements
• Flexible schemas with JSON: Represent data with varying schemas using JSON on the server or on the device; index and query data with JavaScript views
• Couchbase Lite: Lightweight embedded DB for always-available apps; Sync Gateway syncs data seamlessly with Couchbase Server


Couchbase leads in performance and scalability

• Auto sharding: No manual sharding; the database, not the user, manages data movement to scale out
• Memory-to-memory XDCR: Market’s only memory-to-memory database replication across clusters and geos; provides disaster recovery / data locality
• Single node type: Hugely simplifies management of clusters; easy to scale clusters by adding any number of nodes


Couchbase delivers always-on availability (24x365)

• High availability: In-memory replication with manual or automatic failover; rack-zone awareness to minimize data unavailability
• Disaster recovery: Memory-to-memory cross-cluster replication across data centers or geos; active-active topology with bi-directional setup
• Backup & restore: Full backup or incremental backup with online restore; delta node catch-ups for faster recovery after failures


Simplified administration for exceptional ease of use

• Online upgrades and operations: Online software, hardware and DB upgrades; indexing, compaction, rebalance, backup & restore
• Built-in enterprise-class admin console: Perform all administrative tasks with the click of a button; monitor status of the system visually at cluster level, database level, server level
• RESTful APIs: All admin operations available via UI, REST APIs or CLI commands; integrate third-party monitoring tools easily using REST


Couchbase Server Architecture

• Single-node type means easier administration and scaling
• Single installation
• Two major components/processes: data manager and cluster manager
• Data manager: C/C++; layer consolidation of caching and persistence
• Cluster manager: Erlang/OTP; administration UIs; out-of-band for data requests


Write Operation

[Diagram: APPLICATION SERVER, MANAGED CACHE, REPLICATION QUEUE, DISK QUEUE, DISK; DOC 1]

• Single-node type means easier administration and scaling
• Writes are async by default
• Application gets acknowledgement when successfully in RAM and can trade off waiting for replication or persistence per write
• Replication to 1, 2 or 3 other nodes
• Replication is RAM-based so extremely fast
• Off-node replication is primary level of HA
• Disk written to as fast as possible – no waiting
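
The write path on this slide can be sketched as a small in-memory simulation. This is an illustrative model only, not Couchbase code: the class NodeSketch, its method names, and the wait_for_replication / wait_for_persistence flags are hypothetical stand-ins for the per-write trade-off the slide describes.

```python
from collections import deque


class NodeSketch:
    """Toy model of one node's write path (hypothetical names, illustration only)."""

    def __init__(self):
        self.cache = {}                    # managed cache (RAM)
        self.replication_queue = deque()   # keys waiting to be copied to a replica
        self.disk_queue = deque()          # keys waiting to be persisted
        self.replica = {}                  # stands in for another node's RAM
        self.disk = {}                     # stands in for local storage

    def drain_replication(self):
        # Copy every queued document to the replica.
        while self.replication_queue:
            key = self.replication_queue.popleft()
            self.replica[key] = self.cache[key]

    def drain_disk(self):
        # Persist every queued document.
        while self.disk_queue:
            key = self.disk_queue.popleft()
            self.disk[key] = self.cache[key]

    def write(self, key, doc, wait_for_replication=False, wait_for_persistence=False):
        # The document lands in RAM first and is queued for replication and disk.
        self.cache[key] = doc
        self.replication_queue.append(key)
        self.disk_queue.append(key)
        # By default the caller is acknowledged now; it may opt in to waiting.
        if wait_for_replication:
            self.drain_replication()
        if wait_for_persistence:
            self.drain_disk()
        return "ack"
```

With the default flags, write returns as soon as the document is in the cache; in this sketch the queues are only drained when a caller waits or when the drain methods are called explicitly, which stands in for the background work a real node would do.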


Basic Operation

[Diagram: Couchbase Server 1, Couchbase Server 2 and Couchbase Server 3, each holding ACTIVE and REPLICA shards; SHARD 1 to SHARD 9 are spread across the three servers]

• Application has single logical connection to cluster (client object)

• Data is automatically sharded resulting in even document data distribution across cluster

• Each vbucket replicated 1, 2 or 3 times (“peer-to-peer” replication)

• Docs are automatically hashed by the client to a shard

• Cluster map provides location of which server a shard is on

• Every read/write/update/delete goes to same node for a given key

• Strongly consistent data access (“read your own writes”)

• A single Couchbase node can achieve 100k’s ops/sec so no need to scale reads
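
The key-to-shard lookup in the bullets above can be sketched in a few lines. This is a simplified model under stated assumptions: the CRC32-modulo hash, the 1024-vbucket count, the round-robin placement and all function names are chosen for illustration and are not taken from the slides or from any client library.

```python
import zlib

NUM_VBUCKETS = 1024  # assumed shard count, for illustration only


def vbucket_for_key(key, num_vbuckets=NUM_VBUCKETS):
    """Hash a document key to a vbucket (shard) number on the client side."""
    return zlib.crc32(key.encode("utf-8")) % num_vbuckets


def build_cluster_map(servers, num_vbuckets=NUM_VBUCKETS, replicas=1):
    """Assign each vbucket an active server and replica servers, round-robin.

    Assumes replicas < len(servers) so a replica never shares a node
    with its active copy.
    """
    cluster_map = []
    for vb in range(num_vbuckets):
        active = servers[vb % len(servers)]
        replica_nodes = [servers[(vb + i) % len(servers)] for i in range(1, replicas + 1)]
        cluster_map.append({"active": active, "replicas": replica_nodes})
    return cluster_map


def node_for_key(key, cluster_map):
    """Every read/write/update/delete for a key goes to the same active node."""
    return cluster_map[vbucket_for_key(key, len(cluster_map))]["active"]
```

Because the hash depends only on the key, a client holding the cluster map can route every operation for a given key to one node without asking the cluster first, which is what makes "read your own writes" straightforward in this model.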


Cache Ejection

[Diagram: APPLICATION SERVER, MANAGED CACHE, REPLICATION QUEUE, DISK QUEUE, DISK; DOC 1 to DOC 5]

• Single-node type means easier administration and scaling
• Layer consolidation means read-through and write-through cache
• Couchbase automatically removes data that has already been persisted from RAM
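
The ejection rule on this slide, that only data already persisted is removed from RAM, can be sketched as follows. The class and method names are hypothetical, and the oldest-first choice of which persisted item to eject is an assumption; the slide does not say how items are picked.

```python
class ManagedCacheSketch:
    """Toy cache that only ejects documents already written to disk."""

    def __init__(self, capacity):
        self.capacity = capacity   # max documents we would like to keep in RAM
        self.ram = {}              # insertion-ordered: oldest entries first
        self.dirty = set()         # keys written to RAM but not yet persisted
        self.disk = {}

    def write(self, key, doc):
        self.ram.pop(key, None)    # re-insert so the newest write moves to the end
        self.ram[key] = doc
        self.dirty.add(key)
        self._eject_if_needed()

    def persist(self):
        # Flush all dirty documents to disk; they become eligible for ejection.
        for key in list(self.dirty):
            self.disk[key] = self.ram[key]
        self.dirty.clear()
        self._eject_if_needed()

    def _eject_if_needed(self):
        # Walk from oldest to newest and drop persisted documents until we fit.
        for key in list(self.ram):
            if len(self.ram) <= self.capacity:
                break
            if key not in self.dirty:
                del self.ram[key]
```

Note that the cache may temporarily exceed its capacity while documents are still dirty: in this sketch nothing is dropped from RAM until a copy exists on disk.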


Cache Miss

[Diagram: APPLICATION SERVER issues GET DOC 1; MANAGED CACHE, REPLICATION QUEUE, DISK QUEUE, DISK; DOC 1 is returned]

• Single-node type means easier administration and scaling
• Layer consolidation means a single interface for the app to talk to and get its data back as fast as possible
• Separation of cache and disk allows for fastest access out of RAM while pulling data from disk in parallel
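
A cache miss handled behind a single interface can be sketched like this. The names are hypothetical, and the sketch is synchronous for brevity; it does not model the parallel disk fetch mentioned on the slide.

```python
class ReadThroughSketch:
    """Toy read-through cache: callers only ever talk to get()."""

    def __init__(self, disk):
        self.ram = {}            # managed cache
        self.disk = dict(disk)   # documents that were persisted and ejected
        self.misses = 0          # how many reads had to go to disk

    def get(self, key):
        if key in self.ram:
            return self.ram[key]       # cache hit: served from RAM
        self.misses += 1
        doc = self.disk.get(key)       # cache miss: fall back to disk
        if doc is not None:
            self.ram[key] = doc        # keep it in RAM for the next read
        return doc
```

The application never decides whether to read from cache or disk; the second get for the same key is served from RAM.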


Add Nodes to Cluster

[Diagram: Couchbase Server 4 and Couchbase Server 5 join the cluster; ACTIVE and REPLICA shards (SHARD 1 to SHARD 9) are redistributed while the application continues to READ/WRITE/UPDATE]

• Application has single logical connection to cluster (client object)
• Multiple nodes added or removed at once
• One-click operation
• Incremental movement of active and replica vbuckets and data
• Client library updated via cluster map
• Fully online operation, no downtime or loss of performance
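
The rebalance step can be sketched as computing the smallest set of vbucket moves that evens out the cluster. This is a simplified planner under stated assumptions: it handles active vbuckets only, the function name and data shapes are hypothetical, and it is not the algorithm the cluster manager actually uses.

```python
def rebalance(active_map, servers):
    """Return (new_map, moves) so each server holds a near-equal share.

    active_map: list where index = vbucket id and value = current owner.
    servers: the desired server list after nodes were added or removed.
    moves: list of (vbucket, from_server, to_server) tuples.
    """
    # Target share per server; the first few servers absorb the remainder.
    target = {s: len(active_map) // len(servers) for s in servers}
    for s in servers[: len(active_map) % len(servers)]:
        target[s] += 1

    # Let each server keep vbuckets up to its target; the rest must move.
    load = {s: 0 for s in servers}
    surplus = []
    for vb, owner in enumerate(active_map):
        if owner in load and load[owner] < target[owner]:
            load[owner] += 1
        else:
            surplus.append(vb)   # owner is over target or has left the cluster

    # Hand surplus vbuckets to servers that are still below their target.
    receivers = (s for s in servers for _ in range(target[s] - load[s]))
    new_map = list(active_map)
    moves = []
    for vb, dest in zip(surplus, receivers):
        moves.append((vb, active_map[vb], dest))
        new_map[vb] = dest
    return new_map, moves
```

Each entry in moves can be applied one at a time, which mirrors the incremental movement described on the slide; clients would pick up the new owner from the updated cluster map.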


Node Unresponsive / Lost


Fail Over Node

[Diagram: ACTIVE and REPLICA shards (SHARD 1 to SHARD 9) across the cluster, including Couchbase Server 4 and Couchbase Server 5, after a node has been lost]

• Application has single logical connection to cluster (client object)
• When node goes down, some requests will fail
• Failover is either automatic or manual
• Client library is automatically updated via cluster map
• Replicas not recreated to preserve stability
• Best practice to replace node and rebalance
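
Failover as described above can be sketched as a pure function over a cluster map: replicas of the lost node's active vbuckets are promoted, and no new replicas are created. The map layout and the function name are hypothetical and chosen for illustration.

```python
def fail_over(cluster_map, failed):
    """Return a new cluster map with the failed node removed.

    cluster_map: list of {"active": server, "replicas": [server, ...]} per vbucket.
    Replicas are promoted where needed but never recreated; restoring the
    replica count is left to a later rebalance.
    """
    new_map = []
    for entry in cluster_map:
        active = entry["active"]
        # Drop the failed node from the replica list.
        replicas = [r for r in entry["replicas"] if r != failed]
        if active == failed:
            if not replicas:
                raise RuntimeError("no replica left to promote for this vbucket")
            # Promote the first surviving replica to active.
            active, replicas = replicas[0], replicas[1:]
        new_map.append({"active": active, "replicas": replicas})
    return new_map
```

After failover some vbuckets have fewer replicas than before, which is why the slide recommends replacing the node and rebalancing.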

What about Hadoop?


Big Data = Operational + Analytic (NoSQL + Hadoop)

• Online Web/Mobile/IoT apps: millions of customers/consumers
• Offline analytics apps: hundreds of business analysts

[Diagram labels: TRACKING and COLLECTION (REST, FILTER, METRICS); REAL-TIME TRACK; BATCH TRACK; COMPLEX EVENT PROCESSING; Real Time REPOSITORY; PERPETUAL STORE; ANALYTICAL DB; BUSINESS INTELLIGENCE; MONITORING; CHAT/VOICE SYSTEM; DASHBOARD; ANALYSIS AND VISUALIZATION]


Apache Spark: The Big Picture


Apache Spark… is a fast and general purpose engine for small and large scale data processing …


Components: Spark Core

• Resilient Distributed Datasets
• Clustering
• Execution


Components: Spark SQL

Structured through DataFrames

Distributed querying with SQL


Components: Spark Streaming

Fault-tolerant streaming applications


Components: Spark MLlib

Built-In Machine Learning Algorithms


Components: Spark GraphX

Graph processing and graph-parallel computations


How does it work?

Source: http://spark.apache.org/docs/latest/cluster-overview.html


Spark Benefits

• Linearly scalable to 1000+ worker nodes
• Simpler to use than Hadoop MR
• Only partial recompute on failure
• For developers and data scientists
  – machine learning
  – R integration
• Tight but not mandatory Hadoop integration
  – Sources, Sinks
  – Scheduler
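
The "only partial recompute on failure" point can be illustrated with a toy model of lineage: each partition remembers how it is derived from its source, so a lost partition is rebuilt on its own. This is plain Python with hypothetical names, not Spark's API.

```python
class LineageSketch:
    """Toy partitioned dataset that can rebuild a single lost partition."""

    def __init__(self, source_partitions, fn):
        self.source = source_partitions   # list of lists: the input partitions
        self.fn = fn                      # per-element transformation (the lineage)
        self.computed = {}                # partition index -> computed result
        self.compute_count = 0            # how many partitions were (re)computed

    def partition(self, i):
        # Compute a partition only if its result is not available.
        if i not in self.computed:
            self.compute_count += 1
            self.computed[i] = [self.fn(x) for x in self.source[i]]
        return self.computed[i]

    def collect(self):
        return [x for i in range(len(self.source)) for x in self.partition(i)]

    def lose_partition(self, i):
        # Simulate a worker failure that drops one computed partition.
        self.computed.pop(i, None)
```

After lose_partition, the next collect recomputes only that one partition from its source and transformation; the other partitions are reused.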


Spark vs Hadoop

Spark is RAM-bound while Hadoop is mainly HDFS (disk) bound

Fully compatible with Hadoop Input/Output

Easier to develop against thanks to functional composition

Hadoop certainly more mature, but Spark ecosystem growing fast


Couchbase in the Spark Landscape

• Transparent generation and persistence of
  – RDDs
  – DataFrames
  – DStreams
• Spark SQL and N1QL are a natural fit
• Linearly scale your data and application layer
• Share data between Spark applications

The perfect storage companion for your Spark applications.

Source: http://spark.apache.org/docs/latest/cluster-overview.html


Cluster Communication

[Diagram: Couchbase Servers 1 to 6, each with a Cluster Manager, Managed Cache, Storage and shards, running the Data Service and Index Service; Servers 1 and 6 also show the Query Service. Two Spark Workers connect to the cluster.]


Ecosystem Flexibility

[Diagram labels: RDBMS, Streams, Web APIs; DCP, KV, N1QL, Views; Batching, Data Archive, OLTP Data]


Infrastructure Consolidation


The Connector


Couchbase Connector

• Spark Core
  – Automatic Cluster and Resource Management
  – Creating and Persisting RDDs
  – Java APIs in addition to Scala
• Spark SQL
  – Easy JSON handling and querying
  – Tight N1QL Integration
• Spark Streaming
  – Persisting DStreams
  – DCP source (experimental)


Facts

• Current Version: 1.0.0-beta
• Code: https://github.com/couchbaselabs/couchbase-spark-connector
• Docs until GA: http://developer.couchbase.com/documentation/server/4.0/connectors/spark-1.0/spark-intro.html




Connection Management


Creating RDDs


Persisting RDDs


Spark SQL Integration


Spark Streaming with DCP


What's next?


Couchbase Connector

• Learn more:
  – Couchbase and Spark at Couchbase Connect 2015: http://connect15.couchbase.com/agenda/spark-couchbase-electrify-data-processing/
• 1.1.0 plans
  – Upgrade to Spark 1.5
  – Stabilize DCP Support
  – Extend, Optimize, Fix bugs…

We need your feedback!


