Couchbase and Apache Spark

Transcript of Couchbase and Apache Spark

Page 1: Couchbase and Apache Spark

Couchbase and Apache Spark

efficient data crunching in a fast moving world

Page 2: Couchbase and Apache Spark

©2015 Couchbase Inc. 2

Matt Ingenthron (@ingenthr)
• Worked on large-site scalability problems at a previous company…
• memcached contributor
• Joined Couchbase very early and helped define key parts of the system

Page 3: Couchbase and Apache Spark

A Quick Architectural Introduction to Couchbase

Page 4: Couchbase and Apache Spark

Couchbase is a Document Oriented Database

• High availability cache
• Key-value store
• Document database
• Embedded database
• Sync management

Couchbase can be used in a number of ways.

Developers often start out needing a simple distributed hash table, then grow to need secondary indexing, and are either mobile-first or need to address mobile deployment.

Page 5: Couchbase and Apache Spark

What makes Couchbase unique?

Enterprises choose Couchbase for several key advantages:

• Performance & scalability leader: sub-millisecond latency with high throughput; memory-centric architecture
• Multi-purpose: cache, key-value store, document database, and local/mobile database in a single platform
• Simplified administration: easy to deploy & manage; integrated Admin Console, single-click cluster expansion & rebalance
• Always-on availability (24x365): data replication across nodes, clusters, and data centers

Page 6: Couchbase and Apache Spark

Multi-purpose database supports many uses

• Tunable built-in cache
  – Consolidated cache and database
  – Tune memory required based on application requirements
• Flexible schemas with JSON
  – Represent data with varying schemas using JSON on the server or on the device
  – Index and query data with JavaScript views
• Couchbase Lite
  – Lightweight embedded DB for always-available apps
  – Sync Gateway syncs data seamlessly with Couchbase Server

Page 7: Couchbase and Apache Spark

Couchbase leads in performance and scalability

• Auto sharding
  – No manual sharding
  – Database manages data movement to scale out, not the user
• Memory-to-memory XDCR
  – Market’s only memory-to-memory database replication across clusters and geos
  – Provides disaster recovery / data locality
• Single node type
  – Hugely simplifies management of clusters
  – Easy to scale clusters by adding any number of nodes

Page 8: Couchbase and Apache Spark

Couchbase delivers always-on availability (24x365)

• High Availability
  – In-memory replication with manual or automatic failover
  – Rack-zone awareness to minimize data unavailability
• Disaster Recovery
  – Memory-to-memory cross-cluster replication across data centers or geos
  – Active-active topology with bi-directional setup
• Backup & Restore
  – Full or incremental backup with online restore
  – Delta node catch-ups for faster recovery after failures

Page 9: Couchbase and Apache Spark

Simplified administration for exceptional ease of use

• Online upgrades and operations
  – Online software, hardware and DB upgrades
  – Indexing, compaction, rebalance, backup & restore
• Built-in enterprise-class admin console
  – Perform all administrative tasks with the click of a button
  – Monitor the status of the system visually at cluster level, database level, and server level
• RESTful APIs
  – All admin operations available via UI, REST APIs or CLI commands
  – Integrate third-party monitoring tools easily using REST

Page 10: Couchbase and Apache Spark

Couchbase Server Architecture

• Single-node type means easier administration and scaling
• Single installation
• Two major components/processes: data manager and cluster manager
• Data manager:
  – C/C++
  – Layer consolidation of caching and persistence
• Cluster manager:
  – Erlang/OTP
  – Administration UIs
  – Out-of-band for data requests

Page 11: Couchbase and Apache Spark

Write Operation

[Diagram: APPLICATION SERVER, MANAGED CACHE, REPLICATION QUEUE, DISK QUEUE and DISK, with DOC 1 shown at each stage of the write]

• Writes are async by default
• Application gets an acknowledgement when the write is successfully in RAM, and can trade off waiting for replication or persistence per write
• Replication to 1, 2 or 3 other nodes
• Replication is RAM-based, so extremely fast
• Off-node replication is the primary level of HA
• Disk is written to as fast as possible, with no waiting
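
The write path on this slide can be sketched as a toy, single-process model. This is only an illustration under stated assumptions, not Couchbase's implementation: the class `ToyNode`, its queues, and the flags `wait_for_replication` and `wait_for_persistence` are hypothetical names chosen for this sketch, and "replication" is simulated with a second dictionary.

```python
from collections import deque


class ToyNode:
    """Toy model of one node: a RAM cache plus replication and disk queues."""

    def __init__(self):
        self.cache = {}                    # managed cache (RAM)
        self.disk = {}                     # persisted documents
        self.replica = {}                  # stands in for another node's RAM
        self.disk_queue = deque()
        self.replication_queue = deque()

    def write(self, key, doc, wait_for_replication=False, wait_for_persistence=False):
        # The write counts as acknowledged once the document is in RAM.
        self.cache[key] = doc
        self.disk_queue.append(key)
        self.replication_queue.append(key)
        # Per-write trade-off: optionally block until a queue has drained.
        if wait_for_replication:
            self.drain_replication()
        if wait_for_persistence:
            self.drain_disk()
        return "ack"

    def drain_replication(self):
        while self.replication_queue:
            key = self.replication_queue.popleft()
            self.replica[key] = self.cache[key]

    def drain_disk(self):
        while self.disk_queue:
            key = self.disk_queue.popleft()
            self.disk[key] = self.cache[key]


node = ToyNode()
node.write("doc1", {"name": "DOC 1"})                             # acknowledged from RAM only
node.write("doc2", {"name": "DOC 2"}, wait_for_persistence=True)  # blocks until on disk
```

In this model the second write drains the whole disk queue, so `doc1` reaches disk as a side effect. A real server drains its queues in the background rather than on the caller's thread.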

Page 12: Couchbase and Apache Spark

Basic Operation

           Couchbase Server 1   Couchbase Server 2   Couchbase Server 3
ACTIVE     shards 5, 2, 9       shards 4, 7, 8       shards 1, 3, 6
REPLICA    shards 4, 1, 8       shards 6, 3, 2       shards 7, 9, 5

• Application has single logical connection to cluster (client object)
• Data is automatically sharded, resulting in even document data distribution across the cluster
• Each vbucket replicated 1, 2 or 3 times (“peer-to-peer” replication)
• Docs are automatically hashed by the client to a shard
• Cluster map provides the location of the server a shard is on
• Every read/write/update/delete goes to the same node for a given key
• Strongly consistent data access (“read your own writes”)
• A single Couchbase node can achieve 100k’s ops/sec, so no need to scale reads
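
The client-side routing described above (hash the key to a shard, then look the shard up in the cluster map) can be sketched as follows. This is a simplified illustration: the function names, the round-robin map, and the constant `NUM_VBUCKETS = 1024` are assumptions of this sketch, and the plain CRC32-modulo hash stands in for whatever hash the real client libraries use.

```python
import zlib

NUM_VBUCKETS = 1024  # assumption for this sketch


def vbucket_for(key, num_vbuckets=NUM_VBUCKETS):
    """Hash a document key to a shard (vbucket) id. Simplified CRC32 hash."""
    return zlib.crc32(key.encode("utf-8")) % num_vbuckets


def build_cluster_map(nodes, num_vbuckets=NUM_VBUCKETS):
    """Hypothetical cluster map: one active and one replica node per vbucket."""
    return {
        vb: {
            "active": nodes[vb % len(nodes)],
            "replica": nodes[(vb + 1) % len(nodes)],
        }
        for vb in range(num_vbuckets)
    }


def node_for(key, cluster_map):
    """Every operation on a given key is routed to the same active node."""
    return cluster_map[vbucket_for(key, len(cluster_map))]["active"]


nodes = ["server1", "server2", "server3"]
cluster_map = build_cluster_map(nodes)
target = node_for("user::42", cluster_map)  # same answer on every call
```

Because the hash is deterministic and the map is shared by all clients, every read and write for a key lands on one node, which is what makes "read your own writes" possible without coordination between nodes.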

Page 13: Couchbase and Apache Spark

Cache Ejection

[Diagram: APPLICATION SERVER, MANAGED CACHE, REPLICATION QUEUE, DISK QUEUE and DISK, with DOC 1 to DOC 5 in the cache and on disk]

• Layer consolidation means read-through and write-through cache
• Couchbase automatically removes data that has already been persisted from RAM

Page 14: Couchbase and Apache Spark

Cache Miss

[Diagram: the application server issues GET DOC 1; the managed cache holds DOC 2 to DOC 5, and DOC 1 is fetched from disk and returned]

• Layer consolidation means a single interface for the app to talk to and get its data back as fast as possible
• Separation of cache and disk allows for fastest access out of RAM while pulling data from disk in parallel
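
Cache ejection and the cache-miss path can be illustrated together with a small toy model. Everything here is an assumption made for the sketch: the class `ToyCache`, the `max_items` budget, and the oldest-first ejection order are hypothetical, and the slides do not say which ejection policy Couchbase actually applies. The one rule taken from the slides is that only already-persisted documents may leave RAM.

```python
class ToyCache:
    """Toy managed cache in front of a disk store, with a fixed RAM budget."""

    def __init__(self, max_items):
        self.max_items = max_items
        self.ram = {}        # dicts keep insertion order, oldest first
        self.disk = {}
        self.dirty = set()   # keys written to RAM but not yet persisted

    def set(self, key, doc):
        self.ram.pop(key, None)
        self.ram[key] = doc
        self.dirty.add(key)
        self._eject()

    def persist(self):
        """Flush all dirty documents to disk, then eject if over budget."""
        for key in self.dirty:
            self.disk[key] = self.ram[key]
        self.dirty.clear()
        self._eject()

    def get(self, key):
        if key in self.ram:
            return self.ram[key]
        doc = self.disk[key]     # cache miss: read through to disk
        self.ram[key] = doc
        self._eject()
        return doc

    def _eject(self):
        # Only documents that are already on disk may be removed from RAM.
        for key in list(self.ram):
            if len(self.ram) <= self.max_items:
                break
            if key not in self.dirty:
                del self.ram[key]


cache = ToyCache(max_items=2)
for i in range(1, 6):
    cache.set(f"doc{i}", {"id": i})
cache.persist()              # after this, older persisted docs are ejected
doc1 = cache.get("doc1")     # miss in RAM, served from disk and re-cached
```

The application only ever calls `set` and `get`; whether a document comes from RAM or disk is hidden behind that single interface.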

Page 15: Couchbase and Apache Spark

Add Nodes to Cluster

[Diagram: Couchbase Servers 4 and 5 join Servers 1 to 3; active and replica shards are spread across all five servers while the application continues to READ/WRITE/UPDATE]

• Application has single logical connection to cluster (client object)
• Multiple nodes added or removed at once
• One-click operation
• Incremental movement of active and replica vbuckets and data
• Client library updated via cluster map
• Fully online operation, no downtime or loss of performance
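
One way to picture the incremental movement of vbuckets is a rebalance function that moves only the vbuckets it has to. This is a hypothetical sketch, not the algorithm the cluster manager uses: the function `rebalance` and its greedy "least-loaded node" rule are assumptions, and replicas are left out to keep it short.

```python
def rebalance(active_map, nodes):
    """Return a new vbucket -> node map balanced over `nodes`.

    Only vbuckets whose owner is gone or holds more than its share are moved.
    """
    target = -(-len(active_map) // len(nodes))   # ceil(vbuckets / nodes)
    new_map = dict(active_map)
    load = {n: 0 for n in nodes}
    movers = []
    for vb in sorted(new_map):
        node = new_map[vb]
        if node in load and load[node] < target:
            load[node] += 1                      # stays where it is
        else:
            movers.append(vb)                    # owner removed or over its share
    for vb in movers:
        dest = min(nodes, key=lambda n: load[n])  # least-loaded node
        new_map[vb] = dest
        load[dest] += 1
    return new_map


# Nine active shards on three servers, as in the Basic Operation slide.
before = {5: "s1", 2: "s1", 9: "s1", 4: "s2", 7: "s2", 8: "s2", 1: "s3", 3: "s3", 6: "s3"}
after = rebalance(before, ["s1", "s2", "s3", "s4", "s5"])
moved = [vb for vb in before if before[vb] != after[vb]]
```

After the new map is computed, clients would pick it up as an updated cluster map and route requests for the moved vbuckets to their new owners.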

Page 16: Couchbase and Apache Spark

Node Unresponsive / Lost

Page 17: Couchbase and Apache Spark

Fail Over Node

[Diagram: Couchbase Servers 1 to 5 with their active and replica shards after one node is failed over]

• Application has single logical connection to cluster (client object)
• When a node goes down, some requests will fail
• Failover is either automatic or manual
• Client library is automatically updated via cluster map
• Replicas are not recreated, to preserve stability
• Best practice is to replace the node and rebalance
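
The failover behaviour listed above can be sketched as an update to the cluster map. The function `fail_over` and the map layout (`"active"` plus a list of `"replicas"`) are hypothetical names for this illustration. The sketch promotes a surviving replica for each affected vbucket and, as the slide says, does not recreate the lost copies.

```python
def fail_over(cluster_map, failed_node):
    """Promote replicas for every vbucket whose active copy was on `failed_node`."""
    new_map = {}
    for vb, entry in cluster_map.items():
        active = entry["active"]
        replicas = [n for n in entry["replicas"] if n != failed_node]
        if active == failed_node:
            if not replicas:
                raise RuntimeError(f"vbucket {vb} has no surviving replica")
            active = replicas.pop(0)   # promote the first surviving replica
        # Lost copies are not recreated here; that is left to a later rebalance.
        new_map[vb] = {"active": active, "replicas": replicas}
    return new_map


cluster_map = {
    1: {"active": "s3", "replicas": ["s1"]},
    2: {"active": "s1", "replicas": ["s2"]},
    5: {"active": "s1", "replicas": ["s3"]},
}
after_failover = fail_over(cluster_map, "s3")
```

In this example vbucket 1 becomes active on `s1`, and vbuckets 1 and 5 are left without a replica until a node is added and the cluster is rebalanced.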

Page 18: Couchbase and Apache Spark

Demo

Page 19: Couchbase and Apache Spark

What about Hadoop?

Page 20: Couchbase and Apache Spark

Big Data = Operational + Analytic (NoSQL + Hadoop)

• Online
  – Web/Mobile/IoT apps
  – Millions of customers/consumers
• Offline
  – Analytics apps
  – Hundreds of business analysts

Page 21: Couchbase and Apache Spark

[Diagram: COMPLEX EVENT PROCESSING, REAL-TIME REPOSITORY, PERPETUAL STORE, ANALYTICAL DB, BUSINESS INTELLIGENCE, MONITORING, CHAT/VOICE SYSTEM, BATCH TRACK, REAL-TIME TRACK, DASHBOARD]

Page 22: Couchbase and Apache Spark

TRACKING and COLLECTION

ANALYSIS AND VISUALIZATION

REST FILTER METRICS

Page 23: Couchbase and Apache Spark

Apache Spark: The Big Picture

Page 24: Couchbase and Apache Spark

Apache Spark… is a fast and general purpose engine for small and large scale data processing …

Page 25: Couchbase and Apache Spark

Components: Spark Core

• Resilient Distributed Datasets
• Clustering
• Execution

Page 26: Couchbase and Apache Spark

Components: Spark SQL

Structured through DataFrames

Distributed querying with SQL

Page 27: Couchbase and Apache Spark

Components: Spark Streaming

Fault-tolerant streaming applications

Page 28: Couchbase and Apache Spark

Components: Spark MLlib

Built-In Machine Learning Algorithms

Page 29: Couchbase and Apache Spark

Components: Spark GraphX

Graph processing and graph-parallel computations

Page 30: Couchbase and Apache Spark

How does it work?

Source: http://spark.apache.org/docs/latest/cluster-overview.html

Page 31: Couchbase and Apache Spark

Spark Benefits

• Linearly scalable to 1000+ worker nodes
• Simpler to use than Hadoop MR
• Only partial recompute on failure
• For developers and data scientists
  – Machine learning
  – R integration
• Tight but not mandatory Hadoop integration
  – Sources, sinks
  – Scheduler
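
"Only partial recompute on failure" follows from each dataset remembering how it was derived from its parent. The idea can be sketched in plain Python without Spark. The class `ToyRDD` below is a hypothetical stand-in written for this illustration; it is not Spark's RDD API, and it assumes the source partitions themselves are never lost.

```python
class ToyRDD:
    """Toy partitioned dataset that remembers how each partition was derived."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self.parent = parent
        self.fn = fn
        if partitions is not None:
            self.cache = dict(enumerate(partitions))   # source data
            self.num_partitions = len(partitions)
        else:
            self.cache = {}                            # filled lazily
            self.num_partitions = parent.num_partitions
        self.computed = 0   # number of partition computations, for illustration

    def map(self, fn):
        return ToyRDD(parent=self, fn=fn)

    def partition(self, i):
        if i not in self.cache:
            # Rebuild just this partition from its parent (the lineage).
            self.cache[i] = [self.fn(x) for x in self.parent.partition(i)]
            self.computed += 1
        return self.cache[i]

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


base = ToyRDD(partitions=[[1, 2], [3, 4], [5, 6]])
squares = base.map(lambda x: x * x)
first = squares.collect()      # computes all three partitions
del squares.cache[1]           # simulate a worker losing one partition
second = squares.collect()     # recomputes only the lost partition
```

After the second `collect`, `squares.computed` is 4 rather than 6: the two surviving partitions were reused and only the lost one was rebuilt.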

Page 32: Couchbase and Apache Spark

Spark vs Hadoop

• Spark is RAM-bound, while Hadoop is mainly HDFS (disk) bound
• Fully compatible with Hadoop Input/Output
• Easier to develop against thanks to functional composition
• Hadoop is certainly more mature, but the Spark ecosystem is growing fast

Page 33: Couchbase and Apache Spark

Couchbase in the Spark Landscape

The perfect storage companion for your Spark applications.

• Transparent generation and persistence of
  – RDDs
  – DataFrames
  – DStreams
• Spark SQL and N1QL are a natural fit
• Linearly scale your data and application layer
• Share data between Spark applications

Source: http://spark.apache.org/docs/latest/cluster-overview.html

Page 34: Couchbase and Apache Spark

Cluster Communication

[Diagram: Couchbase Servers 1 to 6, each with a cluster manager, managed cache, storage, and the Data, Index and Query services, alongside Spark workers]

Page 35: Couchbase and Apache Spark

Ecosystem Flexibility

[Diagram: RDBMS, Streams, Web APIs, DCP, KV, N1QL, Views, Batching, Data Archive, OLTP Data]

Page 36: Couchbase and Apache Spark

Infrastructure Consolidation

Page 37: Couchbase and Apache Spark

The Connector

Page 38: Couchbase and Apache Spark

Couchbase Connector

• Spark Core
  – Automatic cluster and resource management
  – Creating and persisting RDDs
  – Java APIs in addition to Scala
• Spark SQL
  – Easy JSON handling and querying
  – Tight N1QL integration
• Spark Streaming
  – Persisting DStreams
  – DCP source (experimental)

Page 40: Couchbase and Apache Spark

Connection Management

Page 41: Couchbase and Apache Spark

Connection Management

Page 42: Couchbase and Apache Spark

Creating RDDs

Page 43: Couchbase and Apache Spark

Persisting RDDs

Page 44: Couchbase and Apache Spark

Spark SQL Integration

Page 45: Couchbase and Apache Spark

Spark Streaming with DCP

Page 46: Couchbase and Apache Spark

What's next?

Page 47: Couchbase and Apache Spark

Couchbase Connector

• Learn more:
  – Couchbase and Spark at Couchbase Connect 2015: http://connect15.couchbase.com/agenda/spark-couchbase-electrify-data-processing/
• 1.1.0 plans
  – Upgrade to Spark 1.5
  – Stabilize DCP support
  – Extend, optimize, fix bugs…

We need your feedback!