Apache Kudu - Updatable Analytical Storage #rakutentech


Page 1: Apache Kudu - Updatable Analytical Storage #rakutentech

1 © Cloudera, Inc. All rights reserved.

Apache Kudu Updatable Analytical Storage for Modern Data Platform

Sho Shimauchi | Sales Engineer | Cloudera

Page 2: Apache Kudu - Updatable Analytical Storage #rakutentech

Who Am I?

Sho Shimauchi
Sales Engineer / Technical Evangelist

• Joined Cloudera in 2011
• The first employee in Cloudera APJ
• Email: [email protected]
• Twitter: @shiumachi

Page 3: Apache Kudu - Updatable Analytical Storage #rakutentech

•  Founded in 2008

•  1600+ Clouderans

•  Machine learning and analytics platform

•  Shared data experience

•  Cloud-native and cloud-differentiated

•  Open-source innovation and efficiency

Page 4: Apache Kudu - Updatable Analytical Storage #rakutentech

Rakuten Card + Cloudera

• Rakuten Card replaced its mainframe with Cloudera Enterprise in 2017
• Apache Spark improved the performance of the batch processes by more than 2x
• Please join Cloudera World Tokyo 2017 to see Kobayashi-san’s keynote!

www.clouderaworldtokyo.com

Page 5: Apache Kudu - Updatable Analytical Storage #rakutentech

Why Kudu? Use Cases and Motivation

Page 6: Apache Kudu - Updatable Analytical Storage #rakutentech

The modern platform for machine learning and analytics optimized for the cloud

[Figure: Cloudera Enterprise platform diagram]

• Core services: Data Engineering, Operational Database, Analytic Database, Data Science
• Extensible services: New offerings
• Data Catalog, Ingest & Replication, Security, Governance, Workload Management
• Storage services: Amazon S3, Microsoft ADLS, HDFS, Kudu

Page 7: Apache Kudu - Updatable Analytical Storage #rakutentech

Filling the Analytic Gap

[Figure: storage engines positioned on two axes, "Pace of Data" and "Pace of Analysis". Scale labels: Unchanging, Append-Only, Fast Changing / Frequent Updates, Real-Time.]

• HDFS: Fast Scans, Analytics and Processing of Stored Data; Arbitrary Storage (Active Archive)
• HBase: Fast On-Line Updates & Data Serving
• Kudu: Fast Analytics (on fast-changing or frequently-updated data)
• Analytic Gap: Modern analytic applications often require complex data flow & difficult integration work to move data between HBase & HDFS

Kudu fills the Gap

Page 8: Apache Kudu - Updatable Analytical Storage #rakutentech

Apache Kudu: Scalable and fast structured storage

Scalable
• Tested up to 300+ nodes (PBs cluster)
• Designed to scale to 1000s of nodes and tens of PBs

Fast
• Multiple GB/second read throughput per node
• Millions of read/write operations per second across cluster

Tabular
• Represents data in structured tables like a relational database
• Strict schema, finite column count, no BLOBs
• Individual record-level access to 100+ billion row tables

Page 9: Apache Kudu - Updatable Analytical Storage #rakutentech

Apache Kudu Community

Page 10: Apache Kudu - Updatable Analytical Storage #rakutentech

Why Kudu?

Time Series Data
• Can you insert time series data in real time?
• How long does it take to prepare it for analysis?
• Can you get results and act fast enough to change outcomes?

Machine Data Analytics
• Can you handle large volumes of machine-generated data?
• Do you have the tools to identify problems or threats?
• Can your system do machine learning?

Online Reporting
• How fast can you add data to your data store?
• Are you trading off the ability to do broad analytics for the ability to make updates?
• Are you retaining only part of your data?

Page 11: Apache Kudu - Updatable Analytical Storage #rakutentech

Next generation hardware

Solid-state Storage
• Cheaper and faster every year
• Persistent memory (3D XPoint™)
• Kudu can take advantage of SSD and NVM using Intel’s NVM Library

Cheaper, Bigger Memory
• RAM is cheaper and bigger every day
• Kudu runs smoothly with huge RAM
• Written in C++ to avoid GC issues

Efficiency on Modern CPUs
• Modern CPUs are adding cores and SIMD width, not GHz
• Kudu takes advantage of SIMD instructions and concurrent data structures

Page 12: Apache Kudu - Updatable Analytical Storage #rakutentech

How it Works: Replication and Fault Tolerance

Page 13: Apache Kudu - Updatable Analytical Storage #rakutentech

Tables, tablets, and tablet servers

• Each table is horizontally partitioned into tablets
  • Range or hash partitioning
  • PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS

• Each tablet has N replicas (3 or 5) with Raft consensus
  • Automatic fault tolerance
  • MTTR (mean time to repair): ~5 seconds
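
The hash partitioning above can be sketched in a few lines of Python. This is only an illustration: CRC32 stands in for whatever hash function Kudu actually uses, and the rows and the helper name `bucket_for` are hypothetical.

```python
import zlib

NUM_BUCKETS = 100  # matches "INTO 100 BUCKETS" on the slide


def bucket_for(hash_column_value: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a hash-partition column value to a bucket (one tablet per bucket).

    CRC32 is a stand-in chosen for illustration; it is not Kudu's hash function.
    """
    return zlib.crc32(hash_column_value.encode("utf-8")) % num_buckets


# Hypothetical rows using the slide's primary key (host, metric, timestamp).
rows = [
    ("host1", "cpu", "1442865158"),
    ("host1", "cpu", "1442865159"),
    ("host2", "mem", "1442865158"),
]

# The slide hashes on the timestamp column only, so rows that share a
# timestamp land in the same bucket regardless of host or metric.
assignments = {row: bucket_for(row[2]) for row in rows}
```

Because only `timestamp` is hashed, all writes for the same instant go to one tablet, while consecutive timestamps are spread across the buckets.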

Page 14: Apache Kudu - Updatable Analytical Storage #rakutentech

Metadata

• Replicated master
  • Acts as a tablet directory
  • Acts as a catalog (which tables exist, etc.)
  • Acts as a load balancer (tracks TS liveness, re-replicates under-replicated tablets)
• Caches all metadata in RAM for high performance
• Client configured with master addresses
  • Asks master for tablet locations as needed and caches them

Page 15: Apache Kudu - Updatable Analytical Storage #rakutentech

Client -> Master: Hey Master! Where is the row for ‘tlipcon’ in table “T”?

Master -> Client: It’s part of tablet 2, which is on servers {Z,Y,X}. BTW, here’s info on other tablets you might care about: T1, T2, T3, …

Client -> Tablet server: UPDATE tlipcon SET col=foo

Client Meta Cache: T1: … T2: … T3: …
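
The lookup-then-cache exchange above might look roughly like the following Python sketch. The `Master` and `Client` classes, the key ranges, and the server names for T1 and T3 are all hypothetical; the real Kudu client API is different.

```python
from bisect import bisect_right


class Master:
    """Toy tablet directory: a sorted list of (start_key, tablet_id, servers)."""

    def __init__(self, tablets):
        self.tablets = sorted(tablets, key=lambda t: t[0])
        self.lookups = 0  # counts round trips, to show the effect of caching

    def locate(self, key):
        # Like the slide's "BTW" reply, return info on every tablet, not just one.
        self.lookups += 1
        return list(self.tablets)


class Client:
    def __init__(self, master):
        self.master = master
        self.meta_cache = []  # filled on first lookup

    def find_tablet(self, key):
        if not self.meta_cache:
            self.meta_cache = self.master.locate(key)
        start_keys = [t[0] for t in self.meta_cache]
        # The owning tablet is the last one whose start key is <= key.
        return self.meta_cache[bisect_right(start_keys, key) - 1]


master = Master([
    ("", "T1", ["A", "B", "C"]),   # hypothetical range: keys below "m"
    ("m", "T2", ["Z", "Y", "X"]),  # hypothetical range: "m" up to "u"
    ("u", "T3", ["D", "E", "F"]),  # hypothetical range: "u" and above
])
client = Client(master)
first = client.find_tablet("tlipcon")
second = client.find_tablet("alice")  # served from the meta cache
```

After the first call the client routes writes directly to the tablet servers without contacting the master again.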

Page 16: Apache Kudu - Updatable Analytical Storage #rakutentech

Raft consensus

[Figure: a client and three tablet servers, each with its own WAL. TS A hosts Tablet 1 (LEADER); TS B and TS C host Tablet 1 (FOLLOWER).]

1a. Client->Leader: Write() RPC
2a. Leader->Followers: UpdateConsensus() RPC
2b. Leader writes local WAL
3. Follower: write WAL (on each follower)
4. Follower->Leader: success
5. Leader has achieved majority
6. Leader->Client: Success!
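
The majority rule in steps 2 to 6 can be sketched as below. This is a deliberately simplified model, not Raft itself: there are no terms, elections, or log-matching checks, and the `Replica` class and `write` function are hypothetical names.

```python
class Replica:
    """A tablet replica with an in-memory list standing in for its WAL."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.wal = []

    def append(self, entry):
        # An unhealthy replica never acknowledges the write.
        if not self.healthy:
            return False
        self.wal.append(entry)
        return True


def write(leader, followers, entry):
    """Return True once a majority of replicas (leader included) logged the entry."""
    acks = 1 if leader.append(entry) else 0          # step 2b
    for follower in followers:                       # steps 2a, 3, 4
        if follower.append(entry):
            acks += 1
    majority = (1 + len(followers)) // 2 + 1         # 2 of 3, or 3 of 5
    return acks >= majority                          # steps 5, 6


ts_a = Replica("TS A")
ts_b = Replica("TS B")
ts_c = Replica("TS C", healthy=False)  # one follower is down
ok_with_one_down = write(ts_a, [ts_b, ts_c], "UPDATE tlipcon SET col=foo")

ts_b.healthy = False                   # now both followers are down
ok_with_two_down = write(ts_a, [ts_b, ts_c], "UPDATE tlipcon SET col=bar")
```

With three replicas the write survives one failed follower but not two, which is why the slides mention 3 or 5 replicas.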

Page 17: Apache Kudu - Updatable Analytical Storage #rakutentech

How it Works: Columnar Storage

Page 18: Apache Kudu - Updatable Analytical Storage #rakutentech

Row Storage

Columns: tweet_id, user_name, created_at, text

{23059873, newsycbot, 1442865158, Visual exp…}
{22309487, RideImpala, 1442828307, Introducing …}
…

Scans have to read all the data, no encodings

Page 19: Apache Kudu - Updatable Analytical Storage #rakutentech

Columnar Storage

tweet_id: {25059873, 22309487, 23059861, 23010982}
user_name: {newsycbot, RideImpala, fastly, llvmorg}
created_at: {1442865158, 1442828307, 1442865156, 1442865155}
text: {Visual exp…, Introducing .., Missing July…, LLVM 3.7….}

Page 20: Apache Kudu - Updatable Analytical Storage #rakutentech

Columnar Storage

SELECT COUNT(*) FROM tweets WHERE user_name = 'newsycbot';

tweet_id (1GB): {25059873, 22309487, 23059861, 23010982}
user_name (2GB): {newsycbot, RideImpala, fastly, llvmorg}   <- Only read 1 column
created_at (1GB): {1442865158, 1442828307, 1442865156, 1442865155}
text (200GB): {Visual exp…, Introducing .., Missing July…, LLVM 3.7….}
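
A small Python sketch of why the columnar layout helps this query. It uses the four sample tweets from the slides (text values left truncated as shown); the function names are hypothetical and the in-memory lists only stand in for on-disk files.

```python
column_names = ("tweet_id", "user_name", "created_at", "text")

# Row layout: each record keeps all four fields together.
rows = [
    (25059873, "newsycbot", 1442865158, "Visual exp…"),
    (22309487, "RideImpala", 1442828307, "Introducing .."),
    (23059861, "fastly", 1442865156, "Missing July…"),
    (23010982, "llvmorg", 1442865155, "LLVM 3.7…."),
]

# Column layout: one list per column, values stored in row order.
columns = {name: [row[i] for row in rows] for i, name in enumerate(column_names)}


def count_row_store(rows, user_name):
    # Every whole record is visited, including the large text field.
    return sum(1 for row in rows if row[1] == user_name)


def count_column_store(columns, user_name):
    # Only the user_name column is touched; the other three are never read.
    return sum(1 for value in columns["user_name"] if value == user_name)


row_result = count_row_store(rows, "newsycbot")
column_result = count_column_store(columns, "newsycbot")
```

Both functions return the same count; the difference is how much data each has to scan to get it (2GB versus 204GB with the sizes on the slide).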

Page 21: Apache Kudu - Updatable Analytical Storage #rakutentech

Columnar Compression

created_at: {1442825158, 1442826100, 1442827994, 1442828527}

created_at      Diff(created_at)
1442825158      n/a
1442826100      942
1442827994      1894
1442828527      533
64 bits each    11 bits each

Many columns can compress to a few bits per row! Especially:
• Timestamps
• Time series values
• Low-cardinality strings

Massive space savings and throughput increase!
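
The delta encoding in the table above can be reproduced with a short Python sketch. It assumes the values are sorted so that every difference is non-negative; the function names are illustrative and this is not Kudu's actual encoder.

```python
def delta_encode(values):
    """Return the first value plus the differences between neighbours."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    return values[0], deltas


def delta_decode(first, deltas):
    """Rebuild the original values by summing the deltas back up."""
    values = [first]
    for delta in deltas:
        values.append(values[-1] + delta)
    return values


created_at = [1442825158, 1442826100, 1442827994, 1442828527]
first, deltas = delta_encode(created_at)

# Bits needed to store the largest delta (assumes non-negative deltas).
bits_per_delta = max(delta.bit_length() for delta in deltas)
```

The largest difference, 1894, fits in 11 bits, which is where the "11 bits each" figure on the slide comes from.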

Page 22: Apache Kudu - Updatable Analytical Storage #rakutentech

How it Works: Write and Read Paths

Page 23: Apache Kudu - Updatable Analytical Storage #rakutentech

LSM vs Kudu

LSM – Log Structured Merge (Cassandra, HBase, etc)
• Inserts and updates all go to an in-memory map (MemStore) and later flush to on-disk files (SSTable, HFile)
• Reads perform an on-the-fly merge of all on-disk HFiles

Kudu
• Shares some traits (memstores, compactions)
• More complex
• Slower writes in exchange for faster reads (especially scans)

Page 24: Apache Kudu - Updatable Analytical Storage #rakutentech

LSM Insert Path

INSERT -> MemStore:
  Row=r1 col=c1 val=“blah”
  Row=r1 col=c2 val=“1”

flush -> HFile 1:
  Row=r1 col=c1 val=“blah”
  Row=r1 col=c2 val=“1”

Page 25: Apache Kudu - Updatable Analytical Storage #rakutentech

LSM Insert Path

INSERT -> MemStore:
  Row=r2 col=c1 val=“blah2”
  Row=r2 col=c2 val=“2”

flush -> HFile 2:
  Row=r2 col=c1 val=“blah2”
  Row=r2 col=c2 val=“2”

HFile 1 (from the earlier flush):
  Row=r1 col=c1 val=“blah”
  Row=r1 col=c2 val=“1”

Page 26: Apache Kudu - Updatable Analytical Storage #rakutentech

LSM Update path

UPDATE -> MemStore:
  Row=r2 col=c1 val=“newval”

HFile 1:
  Row=r1 col=c1 val=“blah”
  Row=r1 col=c2 val=“2”

HFile 2:
  Row=r2 col=c1 val=“v2”
  Row=r2 col=c2 val=“5”

Note: all updates are “fully decoupled” from reads. Random-write workload is transformed to fully sequential!

Page 27: Apache Kudu - Updatable Analytical Storage #rakutentech

LSM Read path

MemStore:
  Row=r2 col=c1 val=“newval”

HFile 1:
  Row=r1 col=c1 val=“blah”
  Row=r1 col=c2 val=“2”

HFile 2:
  Row=r2 col=c1 val=“v2”
  Row=r2 col=c2 val=“5”

Merge based on string row keys:
  R1: c1=blah c2=2
  R2: c1=newval c2=5
  ….

• CPU intensive!
• Must always read rowkeys
• Any given row may exist across multiple HFiles: must always merge!
• The more HFiles to merge, the slower it reads
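
The LSM insert, update, and read paths from the last few slides can be condensed into a toy Python model. `LsmStore` is a hypothetical class using plain dictionaries; real stores such as HBase keep sorted files on disk and merge them with iterators.

```python
class LsmStore:
    """Toy LSM store: one in-memory MemStore plus a list of flushed 'HFiles'."""

    def __init__(self):
        self.memstore = {}
        self.hfiles = []  # oldest first

    def put(self, row, col, val):
        # Inserts and updates are the same operation: write to the MemStore.
        self.memstore[(row, col)] = val

    def flush(self):
        # Write the MemStore out as a new sorted, immutable file.
        self.hfiles.append(dict(sorted(self.memstore.items())))
        self.memstore = {}

    def read_row(self, row):
        # Every source must be consulted and merged; newer values win.
        merged = {}
        for source in self.hfiles + [self.memstore]:
            for (r, col), val in source.items():
                if r == row:
                    merged[col] = val
        return merged


store = LsmStore()
store.put("r1", "c1", "blah")
store.put("r1", "c2", "2")
store.flush()                      # HFile 1
store.put("r2", "c1", "v2")
store.put("r2", "c2", "5")
store.flush()                      # HFile 2
store.put("r2", "c1", "newval")    # the UPDATE stays in the MemStore

r1 = store.read_row("r1")
r2 = store.read_row("r2")
```

The update never touches HFile 2, which keeps writes sequential, but `read_row` pays for it by scanning every file on each read.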

Page 28: Apache Kudu - Updatable Analytical Storage #rakutentech

Kudu storage – Inserts and Flushes

INSERT ("todd", "$1000", "engineer") -> MemRowSet

flush -> DiskRowSet 1
  base data: name | pay | role

• A separate file for each column
• Base data holds the latest version of the data

Page 29: Apache Kudu - Updatable Analytical Storage #rakutentech

Kudu storage – Inserts and Flushes

INSERT (“doug”, “$1B”, “Hadoop man”) -> MemRowSet

flush -> DiskRowSet 2
  base data: name | pay | role

DiskRowSet 1
  base data: name | pay | role

Page 30: Apache Kudu - Updatable Analytical Storage #rakutentech

Kudu storage - Updates

MemRowSet (in memory)

DiskRowSet 1
  base data: name | pay | role (on disk)
  DeltaMemStore (in memory)

DiskRowSet 2
  base data: name | pay | role (on disk)
  DeltaMemStore (in memory)

Page 31: Apache Kudu - Updatable Analytical Storage #rakutentech

Kudu storage - Updates

UPDATE set pay=“$1M” WHERE name=“todd”

• Is the row in DiskRowSet 2? (check bloom filters) Bloom says: no!
• Is the row in DiskRowSet 1? (check bloom filters) Bloom says: maybe!
• Search key column to find offset: rowid = 150
• DeltaMemStore: 150: col 1=$1M

[Layout: MemRowSet; DiskRowSet 1 and DiskRowSet 2, each with base data (name | pay | role) and a DeltaMemStore]
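
The update path above (Bloom filter check, key search, delta record) can be sketched in Python. The `BloomFilter` here is a generic textbook version built on SHA-256, chosen for illustration and not Kudu's implementation; `DiskRowSet.try_update` is a hypothetical method, and the row id is simply the key's position in the sorted key list.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter: may say 'maybe' for absent keys, never 'no' for present ones."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class DiskRowSet:
    def __init__(self, name, keys):
        self.name = name
        self.keys = sorted(keys)          # stand-in for the key column
        self.bloom = BloomFilter()
        self.delta_mem_store = {}         # row id -> {column: new value}
        for key in self.keys:
            self.bloom.add(key)

    def try_update(self, key, column, value):
        if not self.bloom.might_contain(key):
            return False                  # "Bloom says: no!" -> skip this rowset
        if key not in self.keys:
            return False                  # false positive: "maybe" turned out to be no
        row_id = self.keys.index(key)     # search the key column for the offset
        self.delta_mem_store.setdefault(row_id, {})[column] = value
        return True


drs1 = DiskRowSet("DiskRowSet 1", ["todd"])
drs2 = DiskRowSet("DiskRowSet 2", ["doug"])
updated_in = [drs.name for drs in (drs2, drs1) if drs.try_update("todd", "pay", "$1M")]
```

The base data is never rewritten; the new value lives in the DeltaMemStore of the one rowset that actually holds the row.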

Page 32: Apache Kudu - Updatable Analytical Storage #rakutentech

Kudu storage – Delta flushes

[Layout: MemRowSet; DiskRowSet 1 and DiskRowSet 2, each with base data (name | pay | role) and a DeltaMemStore]

Flush: DeltaMemStore -> REDO DeltaFile (0: pay=foo)

A REDO delta indicates how to transform between the ‘base data’ (columnar) and a later version

Page 33: Apache Kudu - Updatable Analytical Storage #rakutentech

Kudu storage – Minor delta compaction

DiskRowSet (pre-compaction)
  base data: name | pay | role
  Delta MS
  REDO DeltaFile, REDO DeltaFile, REDO DeltaFile -> REDO DeltaFile

Page 34: Apache Kudu - Updatable Analytical Storage #rakutentech

Kudu storage – Major delta compaction

DiskRowSet
  base data: name | pay | role
  Delta MS
  REDO DeltaFile, REDO DeltaFile, REDO DeltaFile -> base data (pay) + Unmerged REDO DeltaFile
  UNDO Records

• Compaction can be performed only on frequently updated columns
• UNDO stores previous versions of data
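
A rough Python sketch of a major delta compaction on a single column, under simplifying assumptions: deltas are plain tuples ordered oldest first, the base data is a dict of lists, and the `role` update is a made-up example. The function name `major_delta_compact` is hypothetical.

```python
def major_delta_compact(base, redo_deltas, column):
    """Fold the REDO deltas for one column into the base data.

    base:        {column_name: [value per row id]}
    redo_deltas: [(timestamp, row_id, column_name, new_value)], oldest first
    Returns (new_base, undo_records, unmerged_redo).
    """
    new_base = {name: list(values) for name, values in base.items()}
    undo_records = []
    unmerged_redo = []
    for timestamp, row_id, name, new_value in redo_deltas:
        if name != column:
            # Deltas for other columns are left for a later compaction.
            unmerged_redo.append((timestamp, row_id, name, new_value))
            continue
        # Remember the value being overwritten so older readers can still see it.
        undo_records.append((timestamp, row_id, name, new_base[name][row_id]))
        new_base[name][row_id] = new_value
    return new_base, undo_records, unmerged_redo


base = {"name": ["todd"], "pay": ["$1000"], "role": ["engineer"]}
redo = [
    (5, 0, "pay", "$1M"),
    (6, 0, "role", "manager"),  # hypothetical update to a rarely changed column
]
new_base, undo, unmerged = major_delta_compact(base, redo, "pay")
```

Only the `pay` column file is rewritten; the `role` delta stays in an unmerged REDO file, matching the slide's point that compaction can target just the hot column.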

Page 35: Apache Kudu - Updatable Analytical Storage #rakutentech

Kudu storage – RowSet Compactions

Before compaction:
• DRS 1 (32MB), Range: A-Z: [PK=alice], [PK=iris], [PK=linda], [PK=zach]
• DRS 2 (32MB), Range: A-Z: [PK=bob], [PK=jon], [PK=mary], [PK=zeke]
• DRS 3 (32MB), Range: A-Z: [PK=carl], [PK=julie], [PK=omar], [PK=zoe]

Writes for “chris” have to perform bloom lookups on all 3 RS

Reorganize rows to avoid rowsets with overlapping key ranges

After compaction:
• DRS 4 (32MB), Range: A-I: [alice, bob, carl, iris]   <- “chris” is in this range!
• DRS 5 (32MB), Range: J-M: [jon, julie, linda, mary]
• DRS 6 (32MB), Range: O-Z: [omar, zach, zeke, zoe]
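
The rowset compaction above amounts to a merge of sorted runs followed by re-splitting. A minimal Python sketch using the keys from the slide; `compact` and `rowsets_to_check` are hypothetical helper names, and each rowset is just a sorted list of primary keys.

```python
import heapq


def compact(rowsets, rows_per_rowset):
    """Merge sorted rowsets and re-split them into non-overlapping rowsets."""
    merged = list(heapq.merge(*rowsets))
    return [merged[i:i + rows_per_rowset]
            for i in range(0, len(merged), rows_per_rowset)]


def rowsets_to_check(rowsets, key):
    """Rowsets whose [min, max] key range could contain the key."""
    return [rs for rs in rowsets if rs[0] <= key <= rs[-1]]


before = [
    ["alice", "iris", "linda", "zach"],   # DRS 1
    ["bob", "jon", "mary", "zeke"],       # DRS 2
    ["carl", "julie", "omar", "zoe"],     # DRS 3
]
after = compact(before, rows_per_rowset=4)  # DRS 4, 5, 6

checks_before = rowsets_to_check(before, "chris")
checks_after = rowsets_to_check(after, "chris")
```

Before compaction a write for "chris" falls inside all three key ranges; afterwards only the first rowset needs a Bloom filter lookup.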

Page 36: Apache Kudu - Updatable Analytical Storage #rakutentech

Kudu Storage - Compactions

Main Idea: Always be compacting!

• Compactions run continuously to prevent IO storms
• “Budgeted” RS compactions: What is the best way to spend X MBs IO?
• Physical/Logical decoupling: different replicas run compactions at different times

Page 37: Apache Kudu - Updatable Analytical Storage #rakutentech

Kudu storage – Read path

[Layout: MemRowSet; DiskRowSet 1 and DiskRowSet 2, each with base data (name | pay | role) and a DeltaMemStore]

DeltaMemStore: 150: pay=$1M

Just need to read this DiskRowSet!

Page 38: Apache Kudu - Updatable Analytical Storage #rakutentech

Kudu storage – Time Travel Read

DiskRowSet
  base data: name | pay | role
  Delta MS
  REDO DeltaFile, REDO DeltaFile, REDO DeltaFile
  UNDO Records (pay)

• T=0: a query starts to read “pay” in another DiskRowSet
• T=10: major delta compaction happens! The base file is updated, and UNDO is created
• T=20: the query starts to read “pay” in this DiskRowSet, but reads the version as of T=0 from the UNDO Records
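
The time-travel read in the timeline above can be sketched as rolling the base data back with UNDO records. This is a simplified model with hypothetical names; it assumes each UNDO record carries the timestamp of the change it reverses, and that the update to `pay` happened at T=5, between the query start and the compaction.

```python
def read_column_as_of(base_column, undo_records, snapshot_ts):
    """Return the column as it looked at snapshot_ts.

    undo_records: [(timestamp, row_id, old_value)], where old_value is what
    the row held just before the change made at `timestamp`.
    """
    values = list(base_column)
    # Walk backwards in time, undoing every change newer than the snapshot.
    for timestamp, row_id, old_value in sorted(undo_records, reverse=True):
        if timestamp > snapshot_ts:
            values[row_id] = old_value
    return values


# State after the major delta compaction at T=10.
pay_base = ["$1M"]
undo = [(5, 0, "$1000")]  # assumed: the pay update was made at T=5

old_query_view = read_column_as_of(pay_base, undo, snapshot_ts=0)   # query from T=0
new_query_view = read_column_as_of(pay_base, undo, snapshot_ts=20)  # a fresh query
```

The long-running query keeps a consistent view as of T=0 even though the base file was rewritten underneath it.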

Page 39: Apache Kudu - Updatable Analytical Storage #rakutentech

Takeaways

Page 40: Apache Kudu - Updatable Analytical Storage #rakutentech

Getting Started

• On the web: https://www.cloudera.com/documentation/kudu/latest.html, https://www.cloudera.com/downloads.html, https://blog.cloudera.com/?s=Kudu, kudu.apache.org
• Apache project user mailing list: [email protected]
• Quickstart VM
  • Easiest way to get started
  • Impala and Kudu in an easy-to-install VM
• CSD and Parcels
  • For installation on a Cloudera Manager-managed cluster
• Training classes available: https://www.cloudera.com/more/training.html

Page 41: Apache Kudu - Updatable Analytical Storage #rakutentech

Cloudera World Tokyo 2017

• Nov 7, 2017 (Tue), ANA Intercontinental Hotel
• Estimated attendees: 1000
• E-1: Apache Kudu on Analytical Data Platform

Register Now! www.clouderaworldtokyo.com

Page 42: Apache Kudu - Updatable Analytical Storage #rakutentech

[email protected]