BIG data anti-patterns - Meetupfiles.meetup.com/3189882/bigdata-antipatterns.pdf · BIG data...

BIG data anti-patterns

Polyglot data integration

[Diagram: systems to integrate: OLTP, OLAP / EDW, HBase, Cassandra, Voldemort, Hadoop, Security, Analytics, Rec. Engine, Search, Monitoring, Social Graph]

Kafka background

• Apache project

• Originated from LinkedIn

• Open-sourced in 2011

• Written in Scala and Java

• Borrows concepts from messaging systems and logs

• Foundational data movement and integration technology

What’s the big whoop about Kafka?

Throughput

http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

[Diagram: three producers, three Kafka servers and three consumers, annotated with 2,024,032 TPS, 2ms and 2,615,968 TPS]

O.S. page cache is leveraged

[Diagram: a log with offsets 0 1 2 3 4 5 6 7 8 9 10 11 ...; the producer writes, and Consumer A and Consumer B read, through the OS page cache, which sits in front of the disk]
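
The idea in the diagram, one append-only log that each consumer reads at its own offset, can be sketched in a few lines. This is a toy in-memory illustration and not Kafka's API; the names `Log`, `append` and `read_from` are hypothetical.

```python
class Log:
    """Toy append-only log; records are addressed by their offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return the offset it was written at."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset, max_records=10):
        """Return up to max_records records starting at offset."""
        return self._records[offset:offset + max_records]


log = Log()
for i in range(12):                  # the producer writes offsets 0..11
    log.append(f"event-{i}")

# Each consumer tracks its own position; the log keeps no per-consumer state.
offset_a, offset_b = 9, 3
batch_a = log.read_from(offset_a)                 # Consumer A reads near the tail
batch_b = log.read_from(offset_b, max_records=4)  # Consumer B lags behind
```

In the slide's picture, a consumer reading near the tail is likely to be served from the page cache, while one that lags far behind may force reads from disk.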

Things to look out for

• Leverages ZooKeeper, which is tricky to configure

• Reads slow down when the requested data is not in the page cache and has to be fetched from disk

• Lack of security

Summary

• Don’t write your own data integration

• Use Kafka for lightweight, fast and scalable data integration

Full scans!

SELECT * FROM huge_table JOIN other_huge_table ON …

What’s the problem?

so partition your data according to how you will most commonly access it

hdfs:/data/tweets/date=20140929/

hdfs:/data/tweets/date=20140930/

hdfs:/data/tweets/date=20141001/

and then make sure to include a filter in your queries so that only those partitions are read

... WHERE DATE=20151027
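
Why the filter matters can be shown with a small sketch of partition pruning: given the partition directories and the date in the filter, only the matching directory needs to be read. The helper name `prune_partitions` is hypothetical, and a real query engine does this inside its planner rather than with string matching.

```python
def prune_partitions(paths, date):
    """Keep only the partition directories whose date= value matches the filter."""
    wanted = f"date={date}"
    return [p for p in paths if wanted in p.rstrip("/").split("/")]


partitions = [
    "hdfs:/data/tweets/date=20140929/",
    "hdfs:/data/tweets/date=20140930/",
]

# With a filter, one directory is read; without one, every directory is scanned.
to_read = prune_partitions(partitions, "20140930")
```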

include projections to reduce data that needs to be read from disk or pushed over the network

SELECT id, name FROM...

hash joins require network I/O, which is slow

and tell your query engine to use a sort-merge-bucket (SMB) join

-- Hive properties to enable a SMB join
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
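
The reason an SMB join helps is that when both tables are bucketed and sorted on the join key, matching buckets can be joined in a single merge pass. Below is a minimal sketch of that merge step for one pair of buckets; it is not Hive's implementation, the data is made up, and keys are assumed unique within each input to keep it short.

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs that are already sorted by key.

    Assumes keys are unique within each input, to keep the sketch short.
    """
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lk, lv = left[i]
        rk, rv = right[j]
        if lk == rk:
            out.append((lk, lv, rv))
            i += 1
            j += 1
        elif lk < rk:
            i += 1      # left key is smaller, so it has no match on the right
        else:
            j += 1      # right key is smaller, so it has no match on the left
    return out


# One bucket from each table, both sorted by user id (made-up data).
users = [(1, "ann"), (3, "bob"), (4, "cat")]
tweets = [(1, "hello"), (2, "orphan"), (4, "bye")]
joined = merge_join(users, tweets)
```

Each input is read once, in order, and no hash table has to be built for the merge.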

and look at using a columnar data format like Parquet

Column Storage

Column 1 (Symbol): GOOGL, MSFT

Column 2 (Date): 05-10-2014, 05-10-2014

Column 3 (Price): 526.62, 39.54
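
A small sketch of the difference between row and column layout, reading the slide's values as 526.62 and 39.54. It only illustrates the layout idea and is not how Parquet encodes data; the helper name `to_columns` is hypothetical.

```python
rows = [
    {"symbol": "GOOGL", "date": "05-10-2014", "price": 526.62},
    {"symbol": "MSFT", "date": "05-10-2014", "price": 39.54},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


columns = to_columns(rows)

# A query that only needs prices touches one column, not every field of every row.
max_price = max(columns["price"])
```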

Summary

• Partition, filter and project your data (same as you used to do with relational databases)

• Look at bucketing and sorting your data to support advanced join techniques such as sort-merge-bucket

• Consider storing your data in columnar form

Tombstones

What is Cassandra?

• Low-latency distributed database

• Apache project modeled after Dynamo and BigTable

• Data replication for fault tolerance and scale

• Multi-datacenter support

[Diagram: six nodes spread across two datacenters, East and West]

What’s the problem?

[Diagram: a row of key (K) / value (V) columns; tombstone markers indicate that the column has been deleted]

deletes in Cassandra are soft; deleted columns are marked with tombstones

these tombstoned columns slow down reads
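
Why tombstones slow reads can be seen in a toy model of soft deletes: a read still has to step over every tombstoned column to find the live ones. This is an illustration only and not Cassandra's storage engine; the `Row` class and its methods are hypothetical.

```python
TOMBSTONE = object()   # sentinel marking a deleted column


class Row:
    """Toy row whose deletes are soft: the column stays, marked with a tombstone."""

    def __init__(self):
        self._columns = {}        # column name -> value or TOMBSTONE

    def put(self, name, value):
        self._columns[name] = value

    def delete(self, name):
        self._columns[name] = TOMBSTONE    # nothing is physically removed

    def read_live(self):
        """Return live columns and how many entries were scanned to find them."""
        scanned = 0
        live = {}
        for name, value in self._columns.items():
            scanned += 1
            if value is not TOMBSTONE:
                live[name] = value
        return live, scanned


row = Row()
for i in range(1000):
    row.put(f"col{i}", i)
for i in range(990):
    row.delete(f"col{i}")

live, scanned = row.read_live()   # 10 live columns, but 1000 entries scanned
```

A workload that constantly writes and deletes, such as a queue, keeps producing tombstones, so the gap between entries scanned and live columns returned keeps growing.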

don’t use Cassandra, use Kafka

Counting with Java’s built-in collections

What’s the problem?

Use HyperLogLog to work with approximate distinct counts at scale

HyperLogLog

• Cardinality estimation algorithm

• Uses (a lot) less space than sets

• Doesn’t provide exact distinct counts (being “close” is probably good enough)

• Cardinality Estimation for Big Data: http://druid.io/blog/2012/05/04/fast-cheap-and-98-right-cardinality-estimation-for-big-data.html

1 billion distinct elements = 1.5 KB of memory

standard error = 2%
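
The two numbers above are roughly consistent with the commonly cited HyperLogLog error bound of about 1.04 / sqrt(m) for m registers. The sketch below checks the arithmetic for one configuration that fits; the 2048 registers of 6 bits each are an assumption, since the slide does not state the configuration, and the function names are hypothetical.

```python
import math


def hll_standard_error(num_registers):
    """Commonly cited HyperLogLog relative standard error: about 1.04 / sqrt(m)."""
    return 1.04 / math.sqrt(num_registers)


def hll_memory_bytes(num_registers, bits_per_register):
    """Size of the register array in bytes."""
    return num_registers * bits_per_register / 8


# Assumed configuration (not stated on the slide): 2048 registers of 6 bits each.
m, bits = 2048, 6
memory = hll_memory_bytes(m, bits)   # 1536.0 bytes, i.e. 1.5 KB
error = hll_standard_error(m)        # roughly 0.023, i.e. about 2%
```

The memory depends on the number of registers and not on how many elements are counted, which is why the footprint stays tiny even at a billion distinct values.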

Bit pattern observations

50% of hashed values will look like: 1xxxxxxxxx..x

25% of hashed values will look like: 01xxxxxxxx..x

12.5% of hashed values will look like: 001xxxxxxx..x

6.25% of hashed values will look like: 0001xxxxxx..x
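
These proportions can be checked empirically. The sketch below hashes made-up keys and counts how many leading zero bits each hash has; the helper names and the choice of truncated SHA-1 are assumptions for illustration, and any well-mixed hash would serve.

```python
import hashlib


def hash32(value):
    """Hash a string to a 32-bit integer (SHA-1 truncated to 4 bytes)."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def leading_zeros(x, width=32):
    """Number of zero bits before the first 1 bit in a width-bit integer."""
    return width - x.bit_length()


n = 100_000
counts = {}
for i in range(n):
    z = leading_zeros(hash32(f"user-{i}"))
    counts[z] = counts.get(z, 0) + 1

# Fractions of hashes starting with 1, 01, 001, 0001:
# expect about 0.5, 0.25, 0.125 and 0.0625.
fractions = [counts.get(z, 0) / n for z in range(4)]

# The longest run of leading zeros seen grows roughly like log2 of the number
# of distinct values, which is the signal HyperLogLog builds its estimate on.
max_zeros = max(counts)
```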

HLL Java library

• https://github.com/aggregateknowledge/java-hll

• Neat implementation: it automatically promotes its internal data structure to HLL once it grows beyond a certain size

Approximate count algorithms

• HyperLogLog (distinct counts)

• CountMinSketch (frequencies of members)

• Bloom Filter (set membership)
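
To give a feel for the second item in the list, here is a minimal Count-Min sketch. It is a toy under stated assumptions and not a production library: the class name, the default width and depth, and the salted SHA-1 hashing are all choices made for illustration.

```python
import hashlib


class CountMinSketch:
    """Toy Count-Min sketch: estimates never undercount, but may overcount."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash per row by salting the item with the row number.
        digest = hashlib.sha1(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the tightest estimate.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


sketch = CountMinSketch()
for word in ["kafka"] * 50 + ["hive"] * 7 + ["storm"]:
    sketch.add(word)

kafka_estimate = sketch.estimate("kafka")   # never below the true count of 50
```

Memory is fixed at width times depth counters regardless of how many distinct items are added, which is the same trade the other two structures make: bounded space in exchange for a bounded error.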

Summary

• Data skew is a reality when working at Internet scale

• Java’s built-in collections have a large memory footprint and don’t scale

• For high-cardinality data use approximate estimation algorithms

It’s open

Important questions to ask

• Is my data encrypted when it’s in motion?

• Is my data encrypted on disk?

• Are there ACLs defining who has access to what?

• Are these checks enabled by default?

How do tools stack up?

[Table comparing Oracle, Hadoop, Cassandra, ZooKeeper and Kafka on: ACLs, at-rest encryption, in-motion encryption, enabled by default, ease of use]

Summary

• Enable security for your tools!

• Include security as part of evaluating a tool

• Ask vendors and project owners to step up to the plate

Thanks for your time!

Best-in-class big data tools (in my opinion)

If you want …                                              Consider …
Low-latency lookups                                        Cassandra, memcached
Near real-time processing                                  Storm
Interactive analytics                                      Vertica, Teradata
Full scans, system of record data, ETL, batch processing   HDFS, MapReduce, Hive, Pig
Data movement and integration                              Kafka