OSCON TALK: Becoming Friends with Cassandra and Spark

Post on 13-Apr-2017

BECOMING FRIENDS WITH CASSANDRA & SPARK

DANI TRAPHAGEN & JON HADDAD

YOU

SPARK

C*

HOUSEKEEPING

RAISE YOUR HAND IF YOU DON’T HAVE THE VM OSCON2016.ZIP

VM INSTRUCTIONS

1. copy the vm files to a place of your choosing

2. open virtual ovf

3. import the .ovf as prompted

4. open the packer ovf in VirtualBox

5. check out the vm

LET’S GET STARTED

WHAT ARE WE GOING TO COVER?

1. CASSANDRA ARCHITECTURE, CQL, DATA MODELING

2. SPARK DATAFRAMES

RDBMS & YOU

SMALL DATA

SUCH AS?

SQLITE, PYTHON SCRIPTS, LOG FILES

MEDIUM DATA

RDBMS

MOST WEB SITES

CAN RDBMS WORK FOR BIG DATA?

YOU BIG DATA

VERTICAL SCALE

STARTING MY BUSINESS

YAY!

OH, WHOA, THINGS ARE KICKING UP

ACID IS A LIE

ATOMICITY

CONSISTENCY

ISOLATION

DURABILITY

ASYNC REPLICATION != CONSISTENCY

CLIENT MASTER SLAVE

REPLICATION LAG

CONSISTENT?

IDK?

LOL NO!
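To make the replication-lag problem concrete, here is a minimal Python sketch. The `Master` and `Replica` classes are hypothetical stand-ins, not any real database's API. The replica applies writes only when `apply_pending()` is called, which plays the role of replication lag, so a read that lands in between sees stale data.

```python
class Replica:
    """Hypothetical follower that applies writes only when it catches up."""

    def __init__(self):
        self.data = {}
        self.pending = []  # writes shipped by the master but not yet applied

    def apply_pending(self):
        # Replication finally catches up: apply everything that was queued.
        for key, value in self.pending:
            self.data[key] = value
        self.pending.clear()


class Master:
    """Hypothetical master that acks the client before the replica has the data."""

    def __init__(self, replica):
        self.data = {}
        self.replica = replica

    def write(self, key, value):
        self.data[key] = value
        self.replica.pending.append((key, value))  # shipped asynchronously
        return "ack"


replica = Replica()
master = Master(replica)

master.write("user:1", "new@example.com")
stale_read = replica.data.get("user:1")   # read during the lag window
replica.apply_pending()                   # the lag window closes
fresh_read = replica.data.get("user:1")

print(stale_read, fresh_read)  # None new@example.com
```

The client received an ack for the write, yet a read from the replica during the lag window returned nothing, which is the inconsistency the slide is pointing at.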

THIRD NORMAL FORM DOESN’T SCALE

▸ UNPREDICTABLE

▸ DATA > MEMORY?

▸ DISK SEEKS ALL DAY

▸ USERS = ANGRY

AWFUL

SHARDING

CLIENT

NIGHTMARE

AVAILABILITY?

NOT WITH THESE KNUCKLEHEADS

CONCLUSION: SCALING IS HARD

FRIEND #1: CASSANDRA

ARCHITECTURE

PEER TO PEER

▸ With Cassandra there is no master-slave hierarchy

▸ Every node is the captain of its own ship

▸ Processes within Cassandra make this possible

▸ Replication

▸ Consistency Level

NODE1

NODE2

NODE3

NODE4

WHAT DOES THIS GET US?

LINEAR SCALABILITY

HIGH AVAILABILITY

TOPOLOGY

OPERATION

CLIENT

ARCHITECTURE

REPLICATION IS HOW CASSANDRA DISTRIBUTES DATA

▸ Replication factor is the number of replicas/puppies

NODE1

NODE2

NODE3

NODE4

ARCHITECTURE

HOW DO WE ACKNOWLEDGE REPLICATION?

▸ The coordinator talks to the client, sending an ack for the write

NODE1

NODE2

NODE3

NODE4

COORDINATOR

ack

ARCHITECTURE

TUNABLE CONSISTENCY LEVELS

NODE1

NODE2

NODE3

NODE4

▸ One

▸ Quorum

▸ All

ARCHITECTURE

ONE

▸ One replica acks adorable puppy data

NODE1

NODE2

NODE3

NODE4

ARCHITECTURE

ALL

▸ All replicas ack adorable puppy data

NODE1

NODE2

NODE3

NODE4

ARCHITECTURE

QUORUM

▸ Quorum = (sum_of_replication_factors / 2) + 1, rounded down to a whole number

▸ How many nodes get puppies if our replication factor is 3, & we want quorum?

NODE1

NODE2

NODE3

NODE4
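The arithmetic behind ONE, QUORUM and ALL can be sketched in a few lines of Python. This assumes a single data center, so the sum of replication factors is just the replication factor. The function names `quorum` and `acks_required` are made up for illustration and are not part of any Cassandra driver.

```python
def quorum(sum_of_replication_factors: int) -> int:
    """Quorum = (sum_of_replication_factors / 2) + 1, rounded down."""
    return sum_of_replication_factors // 2 + 1


def acks_required(consistency_level: str, replication_factor: int) -> int:
    """Replica acks the coordinator waits for before answering the client."""
    levels = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return levels[consistency_level]


for level in ("ONE", "QUORUM", "ALL"):
    print(level, acks_required(level, replication_factor=3))
# ONE 1
# QUORUM 2
# ALL 3
```

With a replication factor of 3, quorum works out to 2 replicas.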

MULTI-DC PARAMETERS

▸ Quorum vs. Local_Quorum

▸ One vs. Local_One

US-EAST US-WEST

PARTITIONER

CONSISTENT HASHING

Just how is data actually distributed around the cluster?
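One way to picture the answer is a toy token ring in Python. This is a simplified sketch under stated assumptions, not Cassandra's implementation: it uses MD5 from the standard library as a stand-in hash, gives each node exactly one token, and places replicas on the next nodes clockwise. The `Ring` class and `token` function are hypothetical names.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Hash a string to a position on the ring (MD5 is only a stand-in here)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy consistent-hashing ring with one token per node."""

    def __init__(self, nodes):
        self._ring = sorted((token(name), name) for name in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, partition_key: str, replication_factor: int = 1):
        # The first node whose token is >= the key's token owns the key;
        # running past the last token wraps around to the first node.
        start = bisect.bisect_left(self._tokens, token(partition_key)) % len(self._ring)
        count = min(replication_factor, len(self._ring))
        # Additional replicas are simply the next nodes clockwise.
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = Ring(["NODE1", "NODE2", "NODE3", "NODE4"])
print(ring.replicas("adorable-puppy", replication_factor=3))
```

The property that matters is that adding a node moves only the keys that fall into the new node's token range. Every other key keeps its owner, which is what makes growing the cluster cheap compared with rehashing everything.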

CASSANDRA DATA MODELING SOUNDS HARD

NOT REALLY

GAIN QUERY POWERS WITH CQL

DATA STRUCTURES IN CASSANDRA

KEYSPACE

TABLE

PARTITIONS

ROWS

PRIMARY KEY = PARTITION KEY + CLUSTERING COLUMNS

PARTITION KEY

THIS IS HOW YOU RETRIEVE A PARTITION

CLUSTERING COLUMNS

THIS IS HOW YOU GET SORTING, ORDER AND UNIQUE IDENTIFICATION

WHY ARE CLUSTERING COLUMNS SO COOL?

HOW DO I USE CQL?

CQLSH

SOME EXAMPLES FROM A MOVIE DB

CREATE A KEYSPACE

CREATE KEYSPACE movielens_small WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};

CREATE A TABLE

CREATE TABLE movies (
    id uuid PRIMARY KEY,
    avg_rating float,
    genres set<text>,
    name text,
    release_date date,
    url text,
    video_release_date date
);

PRIMARY KEY IN WHITE

CREATE A TABLE

CREATE TABLE ratings_by_movie (
    movie_id uuid,
    user_id uuid,
    rating int,
    ts int,
    PRIMARY KEY (movie_id, user_id)
);

PRIMARY KEY IN WHITE
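To see what PRIMARY KEY (movie_id, user_id) buys you, here is a rough in-memory Python model of the ratings_by_movie table. It is only an analogy: the `RatingsByMovie` class is hypothetical, and short strings stand in for the uuid values. The partition key (movie_id) picks the partition, and the clustering column (user_id) keeps rows inside it sorted and unique.

```python
from collections import defaultdict


class RatingsByMovie:
    """Toy model of a table with PRIMARY KEY (movie_id, user_id)."""

    def __init__(self):
        # partition key (movie_id) -> {clustering column (user_id): row}
        self._partitions = defaultdict(dict)

    def insert(self, movie_id, user_id, rating, ts):
        # Writing the same primary key again overwrites the row (an upsert).
        self._partitions[movie_id][user_id] = {"rating": rating, "ts": ts}

    def select(self, movie_id):
        # Fetch one partition; rows come back ordered by the clustering column.
        partition = self._partitions.get(movie_id, {})
        return [(user_id, row["rating"]) for user_id, row in sorted(partition.items())]


table = RatingsByMovie()
table.insert("movie-a", "user-2", rating=4, ts=100)
table.insert("movie-a", "user-1", rating=5, ts=101)
table.insert("movie-a", "user-2", rating=3, ts=102)  # same key, replaces rating 4
table.insert("movie-b", "user-1", rating=1, ts=103)

print(table.select("movie-a"))  # [('user-1', 5), ('user-2', 3)]
```

All ratings for one movie live together and come back in a single lookup, which is the query this table was designed around.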

INSERT STATEMENT EXAMPLE

insert into movies (id, name, genres) values (976de5da-93ae-4bf0-b127-d19eea1c8ea4, 'My Awesome Movie (2016)', {'Comedy'});

THIS ALL LOOKS TOO FAMILIAR, DOESN’T IT?

BUT REMEMBER…

THIRD NORMAL FORM DOESN’T SCALE

▸ UNPREDICTABLE

▸ DATA > MEMORY?

▸ DISK SEEKS ALL DAY

▸ USERS = ANGRY

AWFUL

DATA MODELING PRO TIPS

▸ no joins

▸ query driven methodology, instead

▸ denormalize

▸ disks are cheap

JON & DANI, I’M STARTING TO GET COLD FEET!

I MISS THE WARM EMBRACE OF RDBMS

I DIDN’T HAVE TO DENORMALIZE BACK THEN

CHILL OUT

& PREPARE TO BE WOWED

CDM

ROLL UP YOUR SLEEVES

TYPE STUFF

REMEMBER THAT VM?

TRY IT OUT

1. use movielens_small;

2. desc tables;

3. desc movies;

4. select * from movies limit 10;

YOU SHOULD GET…

YOUR 10 MOVIES

ADDING ON

5. select id, name from movies limit 100;

6. PICK YOUR FAVORITE MOVIE

BONUS: CAN YOU FIND THE AVERAGE RATINGS FOR YOUR FAVORITE MOVIE?

MOVIE ID LIST

SELECT A MOVIE

TOP GUN EXAMPLE

FIFTH ELEMENT BECAUSE OBVIOUSLY

NICE WORK YOU!

FRIEND #2: SPARK

BATCH PROCESSING

LOTS OF DATA?

STREAMING & REAL TIME AGGREGATION

MACHINE LEARNING FOR THE INEVITABLE END OF TIMES

GRAPH ANALYTICS

2 WAYS OF WORKING

1. RDD

BASED ON FUNCTIONAL PROGRAMMING

blah.map( lambda x : x * 2 )
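The same functional style can be tried in plain Python without a cluster. The sketch below uses the built-in `map` and `functools.reduce` on an ordinary list. The name `blah` is borrowed from the slide and the sample values are made up. In PySpark the same lambda would be passed to an RDD's `map` method, which stays lazy until an action such as `collect()` runs.

```python
from functools import reduce

blah = [1, 2, 3]  # stand-in for a distributed dataset

# Transform every element, as the slide's lambda does.
doubled = list(map(lambda x: x * 2, blah))

# Fold the results down to a single value.
total = reduce(lambda a, b: a + b, doubled)

print(doubled, total)  # [2, 4, 6] 12
```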

COOL BUT NOT EASY

2. DATAFRAMES

PRETTY EASY

TODAY WE TALK BATCH WITH DATAFRAMES AND PYTHON

ROLL UP YOUR SLEEVES

OPEN THE OSCON TUTORIAL ON YOUR DESKTOP

FRIENDSHIP LEVELS

OTHER RESOURCES TO LEARN:

1. free courses - www.academy.datastax.com

2. our blogs - www.rustyrazorblade.com & www.dtrapezoid.com

3. our friend’s blog - https://lostechies.com/ryansvihla/

4. datastax blog - http://www.datastax.com/dev/blog

THANK YOU, MAGICAL HUMANS

@DTRAPEZOID @RUSTYRAZORBLADE