Using Spark in a Couchbase Environment: Couchbase Connect 2015
Couchbase and Apache Spark
efficient data crunching in a fast moving world
©2015 Couchbase Inc. 2
Matt Ingenthron (@ingenthr)
• Worked on large site scalability problems at previous company…
• memcached contributor
• Joined Couchbase very early and helped define key parts of system
A Quick Architectural Introduction to Couchbase
Couchbase is a Document Oriented Database
• High availability cache
• Key-value store
• Document database
• Embedded database
• Sync management
Couchbase can be used in a number of ways.
Developers often need a simple distributed hashtable, then grow to need secondary indexing, and are either mobile-first or need to address mobile deployment.
What makes Couchbase unique?
Enterprises choose Couchbase for several key advantages
• Performance & scalability leader: Sub-millisecond latency with high throughput; memory-centric architecture
• Multi-purpose: Cache, key-value store, document database, and local/mobile database in a single platform
• Simplified administration: Easy to deploy & manage; integrated Admin Console, single-click cluster expansion & rebalance
• Always-on availability (24x365): Data replication across nodes, clusters, and data centers
Multi-purpose database supports many uses
Tunable built-in cache
• Consolidated cache and database
• Tune memory required based on application requirements
Flexible schemas with JSON
• Represent data with varying schemas using JSON on the server or on the device
• Index and query data with JavaScript views
Couchbase Lite
• Lightweight embedded DB for always-available apps
• Sync Gateway syncs data seamlessly with Couchbase Server
Couchbase leads in performance and scalability
Auto Sharding
• No manual sharding
• Database manages data movement to scale out – not the user
Memory-to-memory XDCR
• Market’s only memory-to-memory database replication across clusters and geos
• Provides disaster recovery / data locality
Single Node Type
• Hugely simplifies management of clusters
• Easy to scale clusters by adding any number of nodes
Couchbase delivers always-on availability
24x365
High Availability
• In-memory replication with manual or automatic failover
• Rack-zone awareness to minimize data unavailability
Disaster Recovery
• Memory-to-memory cross cluster replication across data centers or geos
• Active-active topology with bi-directional setup
Backup & Restore
• Full backup or incremental backup with online restore
• Delta node catch-ups for faster recovery after failures
Simplified administration for exceptional ease of use
Online upgrades and operations
• Online software, hardware and DB upgrades
• Indexing, compaction, rebalance, backup & restore
Built-in enterprise class admin console
• Perform all administrative tasks with the click of a button
• Monitor status of the system visually at cluster level, database level, server level
RESTful APIs
• All admin operations available via UI, REST APIs or CLI commands
• Integrate third party monitoring tools easily using REST
Couchbase Server Architecture
• Single-node type means easier administration and scaling
  – Single installation
• Two major components/processes: data manager and cluster manager
• Data manager:
  – C/C++
  – Layer consolidation of caching and persistence
• Cluster manager:
  – Erlang/OTP
  – Administration UIs
  – Out-of-band for data requests
Write Operation
[Diagram: application server, managed cache, replication queue, disk queue, and disk, with DOC 1 moving through them]
• Single-node type means easier administration and scaling
• Writes are async by default
• Application gets acknowledgement when successfully in RAM and can trade off waiting for replication or persistence per write
• Replication to 1, 2 or 3 other nodes
• Replication is RAM-based so extremely fast
• Off-node replication is primary level of HA
• Disk written to as fast as possible – no waiting
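
The write path described on this slide can be pictured with a small toy model. The following is only a sketch in plain Python, not Couchbase's implementation: the class `ToyNode` and its method names are hypothetical, and the two queues are drained by explicit calls rather than background threads.

```python
from collections import deque


class ToyNode:
    """Toy model of one node's write path (hypothetical, illustration only)."""

    def __init__(self):
        self.cache = {}                   # managed cache (RAM)
        self.disk = {}                    # persisted documents
        self.disk_queue = deque()         # keys waiting to be persisted
        self.replication_queue = deque()  # keys waiting to be replicated
        self.replicas = []                # other ToyNode instances holding replicas

    def set(self, key, doc):
        # The write is acknowledged as soon as the document is in RAM.
        self.cache[key] = doc
        self.disk_queue.append(key)
        self.replication_queue.append(key)
        return "ack"

    def drain_replication(self):
        # RAM-to-RAM replication: copy queued documents into each replica's cache.
        while self.replication_queue:
            key = self.replication_queue.popleft()
            for replica in self.replicas:
                replica.cache[key] = self.cache[key]

    def drain_disk(self):
        # Persistence happens asynchronously, after the acknowledgement.
        while self.disk_queue:
            key = self.disk_queue.popleft()
            self.disk[key] = self.cache[key]
```

In this model a caller that wants stronger durability for a particular write would wait until `drain_replication()` or `drain_disk()` has processed the key, which mirrors the per-write trade-off mentioned above.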
Basic Operation
[Diagram: three Couchbase Server nodes, each holding active and replica shards]
  Couchbase Server 1: active shards 5, 2, 9; replica shards 4, 1, 8
  Couchbase Server 2: active shards 4, 7, 8; replica shards 6, 3, 2
  Couchbase Server 3: active shards 1, 3, 6; replica shards 7, 9, 5
• Application has single logical connection to cluster (client object)
• Data is automatically sharded resulting in even document data distribution across cluster
• Each vbucket replicated 1, 2 or 3 times (“peer-to-peer” replication)
• Docs are automatically hashed by the client to a shard
• Cluster map provides location of which server a shard is on
• Every read/write/update/delete goes to same node for a given key
• Strongly consistent data access (“read your own writes”)
• A single Couchbase node can achieve 100k’s ops/sec so no need to scale reads
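
The client-side routing in these bullets (hash the key to a shard, then look the shard up in the cluster map) can be sketched as follows. This is a simplified illustration: the shard count of 1024, the plain CRC32-modulo hash, the round-robin map, and all function names are assumptions made for the example and are not taken from the slides or from a real client library.

```python
import zlib

NUM_VBUCKETS = 1024  # assumed shard count for this sketch


def vbucket_for(key: str) -> int:
    # Hash the key on the client; CRC32 modulo the shard count is a simplification.
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS


def build_cluster_map(servers):
    # Hypothetical cluster map: vbucket id -> server holding the active copy.
    return {vb: servers[vb % len(servers)] for vb in range(NUM_VBUCKETS)}


def server_for(key, cluster_map):
    # Every operation on a given key is routed to the same node.
    return cluster_map[vbucket_for(key)]
```

Because the hash depends only on the key, repeated reads and writes for one key always reach the same node, which is what makes "read your own writes" possible in this scheme.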
Cache Ejection
[Diagram: application server, managed cache, replication queue, disk queue, and disk, with DOC 1 through DOC 5 shown in cache and on disk]
• Single-node type means easier administration and scaling
• Layer consolidation means read through and write through cache
• Couchbase automatically removes data that has already been persisted from RAM
Cache Miss
[Diagram: application server issues GET DOC 1 against the managed cache, replication queue, disk queue, and disk; DOC 1 through DOC 5 shown]
• Single-node type means easier administration and scaling
• Layer consolidation means 1 single interface for App to talk to and get its data back as fast as possible
• Separation of cache and disk allows for fastest access out of RAM while pulling data from disk in parallel
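
Ejection and the cache-miss path can be combined in one toy read-through cache. The sketch below is an assumption-laden illustration in plain Python: `ToyManagedCache`, its capacity counted in documents, and its oldest-first ejection order are all hypothetical, and disk reads are synchronous here rather than happening in parallel.

```python
class ToyManagedCache:
    """Toy read-through cache over a dict standing in for disk (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity  # max documents kept in RAM
        self.ram = {}             # insertion-ordered dict used as the cache
        self.disk = {}            # stand-in for persisted storage
        self.dirty = set()        # keys in RAM that are not yet persisted

    def set(self, key, doc):
        self.ram[key] = doc
        self.dirty.add(key)
        self._eject()

    def persist(self):
        # Flush every dirty document to disk, then free RAM if needed.
        for key in list(self.dirty):
            self.disk[key] = self.ram[key]
        self.dirty.clear()
        self._eject()

    def _eject(self):
        # Only persisted documents may leave RAM; oldest clean entries go first.
        for key in list(self.ram):
            if len(self.ram) <= self.capacity:
                break
            if key not in self.dirty:
                del self.ram[key]

    def get(self, key):
        if key in self.ram:
            return self.ram[key]  # cache hit
        doc = self.disk[key]      # cache miss: read the document from disk
        self.ram[key] = doc       # bring it back into the cache
        self._eject()
        return doc
```

The application only ever calls `set` and `get`; whether a document is served from RAM or fetched from disk is hidden behind that single interface.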
Add Nodes to Cluster
[Diagram: Couchbase Servers 4 and 5 are added alongside Servers 1 through 3; active and replica shards are spread across all five servers while READ/WRITE/UPDATE traffic continues]
• Application has single logical connection to cluster (client object)
• Multiple nodes added or removed at once
• One-click operation
• Incremental movement of active and replica vbuckets and data
• Client library updated via cluster map
• Fully online operation, no downtime or loss of performance
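
The idea of incremental vbucket movement during a rebalance can be sketched as a loop that moves one vbucket at a time from the most loaded server to the least loaded one. This is a toy model, not Couchbase's rebalance algorithm: the function `rebalance`, the map layout, and the balancing rule are assumptions, and the sketch only handles adding servers (every server in the old map must appear in `servers`).

```python
def rebalance(active_map, servers):
    """Even out active vbuckets across servers, one move at a time.

    active_map: hypothetical cluster map, vbucket id -> server name.
    Returns (new_map, moves) where moves is a list of (vbucket, source, target).
    """
    new_map = dict(active_map)
    owned = {s: [] for s in servers}
    for vb in sorted(new_map):
        owned[new_map[vb]].append(vb)

    moves = []
    while True:
        donor = max(servers, key=lambda s: len(owned[s]))
        receiver = min(servers, key=lambda s: len(owned[s]))
        if len(owned[donor]) - len(owned[receiver]) <= 1:
            break  # balanced to within one vbucket
        vb = owned[donor].pop()
        owned[receiver].append(vb)
        new_map[vb] = receiver  # the updated map is what clients would receive
        moves.append((vb, donor, receiver))
    return new_map, moves
```

With nine vbuckets on three servers and two servers added, this sketch moves only three vbuckets; the rest stay where they are, which is the point of moving data incrementally.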
Node Unresponsive / Lost
Fail Over Node
[Diagram: Couchbase Servers 1 through 5 with active and replica shards after a node is failed over]
• Application has single logical connection to cluster (client object)
• When node goes down, some requests will fail
• Failover is either automatic or manual
• Client library is automatically updated via cluster map
• Replicas not recreated to preserve stability
• Best practice to replace node and rebalance
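
Failover as described here amounts to promoting replicas to active and publishing a new cluster map. The sketch below models that with two dictionaries; the function `fail_over`, the single-replica layout, and the use of `None` for a missing replica are assumptions made for illustration.

```python
def fail_over(active_map, replica_map, failed):
    """Promote replicas for every vbucket whose active copy was on `failed`.

    Both maps are hypothetical: vbucket id -> server name, one replica each.
    """
    new_active = dict(active_map)
    new_replica = dict(replica_map)
    for vb, server in active_map.items():
        if server == failed:
            new_active[vb] = replica_map[vb]  # promote the replica to active
            new_replica[vb] = None            # the replica is not recreated
    for vb, server in replica_map.items():
        if server == failed:
            new_replica[vb] = None            # replica copy lost with the node
    return new_active, new_replica
```

After failover some vbuckets have no replica at all, which is why replacing the node and rebalancing is the recommended follow-up.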
Demo
What about Hadoop?
Big Data = Operational + Analytic (NoSQL + Hadoop)
• Online: Web/Mobile/IoT apps; millions of customers/consumers
• Offline: Analytics apps; hundreds of business analysts
[Diagram labels: Complex Event Processing; Real-Time Repository; Perpetual Store; Analytical DB; Business Intelligence; Monitoring; Chat/Voice System; Batch Track; Real-Time Track; Dashboard; Tracking and Collection; Analysis and Visualization; REST; Filter; Metrics]
Apache Spark: The Big Picture
Apache Spark… is a fast and general purpose engine for small and large scale data processing …
Components: Spark Core
• Resilient Distributed Datasets
• Clustering
• Execution
Components: Spark SQL
• Structured through DataFrames
• Distributed querying with SQL
Components: Spark Streaming
• Fault-tolerant streaming applications
Components: Spark MLlib
• Built-in machine learning algorithms
Components: Spark GraphX
• Graph processing and graph-parallel computations
How does it work?
Source: http://spark.apache.org/docs/latest/cluster-overview.html
Spark Benefits
• Linearly scalable to 1000+ worker nodes
• Simpler to use than Hadoop MR
• Only partial recompute on failure
• For developers and data scientists
  – machine learning
  – R integration
• Tight but not mandatory Hadoop integration
  – Sources, Sinks
  – Scheduler
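
"Only partial recompute on failure" refers to the way a dataset remembers how each partition was derived, so a lost partition can be rebuilt on its own. The following plain-Python sketch imitates that idea without using Spark; `ToyRDD` and its methods are hypothetical names, and real RDDs are considerably more involved.

```python
class ToyRDD:
    """Toy dataset that remembers how each partition is derived (hypothetical)."""

    def __init__(self, compute_partition, num_partitions):
        self.compute_partition = compute_partition  # partition index -> list
        self.num_partitions = num_partitions
        self.cached = {}  # partition index -> materialized data

    def map(self, fn):
        # The child keeps a reference to its parent: that link is the lineage.
        parent = self
        return ToyRDD(
            lambda i: [fn(x) for x in parent.partition(i)],
            self.num_partitions,
        )

    def partition(self, i):
        if i not in self.cached:
            # Missing (or lost) partition: rebuild just this one from lineage.
            self.cached[i] = self.compute_partition(i)
        return self.cached[i]

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]
```

If one cached partition is dropped, the next `collect()` recomputes only that partition; the others are reused as they are.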
Spark vs Hadoop
Spark is mainly RAM bound, while Hadoop is mainly HDFS (disk) bound
Fully compatible with Hadoop Input/Output
Easier to develop against thanks to functional composition
Hadoop certainly more mature, but Spark ecosystem growing fast
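
As a rough illustration of what "functional composition" means in this comparison, the sketch below chains a map step and a reduce step over partitioned input in the style of a word count. It is plain Python and deliberately does not use Spark; the function name `word_counts` and the list-of-lists partition layout are assumptions made for the example.

```python
from collections import Counter
from functools import reduce


def word_counts(partitions):
    # Map step: each partition is processed independently, as a worker would.
    per_partition = map(
        lambda lines: Counter(w.lower() for line in lines for w in line.split()),
        partitions,
    )
    # Reduce step: merge the partial counts into one total.
    return reduce(lambda a, b: a + b, per_partition, Counter())
```

For example, `word_counts([["Spark and Couchbase"], ["spark and Hadoop"]])` counts "spark" and "and" twice each. The whole job is expressed as a composition of small functions rather than as separate mapper and reducer classes.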
Couchbase in the Spark Landscape
• Transparent generation and persistence of
  – RDDs
  – DataFrames
  – DStreams
• Spark SQL and N1QL are a natural fit
• Linearly scale your data and application layer
• Share data between Spark applications
The perfect storage companion for your Spark applications.
Source: http://spark.apache.org/docs/latest/cluster-overview.html
Cluster Communication
[Diagram: Couchbase Servers 1 through 6, each with a Cluster Manager, Managed Cache, Storage, and Data, Index, and Query Services, alongside Spark Workers]
Ecosystem Flexibility
[Diagram labels: RDBMS; Streams; Web APIs; DCP; KV; N1QL; Views; Batching; Data Archive; OLTP Data]
Infrastructure Consolidation
The Connector
Couchbase Connector
• Spark Core
  – Automatic Cluster and Resource Management
  – Creating and Persisting RDDs
  – Java APIs in addition to Scala
• Spark SQL
  – Easy JSON handling and querying
  – Tight N1QL Integration
• Spark Streaming
  – Persisting DStreams
  – DCP source (experimental)
Facts
• Current Version: 1.0.0-beta
• Code: https://github.com/couchbaselabs/couchbase-spark-connector
• Docs until GA: http://developer.couchbase.com/documentation/server/4.0/connectors/spark-1.0/spark-intro.html
Connection Management
Creating RDDs
Persisting RDDs
Spark SQL Integration
Spark Streaming with DCP
What's next?
Couchbase Connector
• Learn More:
  – Couchbase and Spark at Couchbase Connect 2015: http://connect15.couchbase.com/agenda/spark-couchbase-electrify-data-processing/
• 1.1.0 plans
  – Upgrade to Spark 1.5
  – Stabilize DCP Support
  – Extend, Optimize, Fix bugs…
We need your feedback!