Bulk Loading Data into Cassandra

Patricia Gorla (@patriciagorla), Cassandra Consultant, www.thelastpickle.com
Planet Cassandra, 2014

Description

Whether you are running load tests or migrating historic data, loading data directly into Cassandra can be very useful: it bypasses the system’s normal write path. In this webinar, we will look at how data is stored on disk in sstables, how to generate these structures directly, and how to load the resulting data rapidly into your cluster using sstableloader. We'll also review different use cases for when you should and shouldn't use this method.

Transcript of Bulk Loading Data into Cassandra

Page 1: Bulk Loading Data into Cassandra

Bulk-Loading Data into Cassandra

Patricia Gorla (@patriciagorla), Cassandra Consultant, www.thelastpickle.com

Planet Cassandra 2014

Page 2: Bulk Loading Data into Cassandra

About Us

• Work with clients to deliver and improve Apache Cassandra services

• Apache Cassandra committer, DataStax MVP, Hector maintainer, Apache Usergrid committer

• Based in New Zealand & USA

Page 3: Bulk Loading Data into Cassandra

Why is bulk loading useful?

• Performance tests

Page 4: Bulk Loading Data into Cassandra

Why is bulk loading useful?

• Performance tests

• Migrating historical data

Page 5: Bulk Loading Data into Cassandra

Why is bulk loading useful?

• Performance tests

• Migrating historical data

• Changing topologies

Page 6: Bulk Loading Data into Cassandra


• How Data is Stored

• Case Studies

- Generating Dummy Data

- Backfilling Historical Data

- Changing Topologies

• Conclusion

Page 7: Bulk Loading Data into Cassandra

Cassandra Write Path

[Diagram: an incoming write, write[0]]

Page 8: Bulk Loading Data into Cassandra

Cassandra Write Path

• Writes are written to both the commit log and the memtable.

[Diagram: write[0] → commit log + memtable]

Page 9: Bulk Loading Data into Cassandra

Cassandra Write Path

• Writes are written to both the commit log and the memtable.

• The memtable is sorted.

[Diagram: write[0] → commit log + memtable]

Page 10: Bulk Loading Data into Cassandra

Cassandra Write Path

• The memtable is flushed out to sstables.

[Diagram: memtable flushed to sstable[0], sstable[1], sstable[2]]

Page 11: Bulk Loading Data into Cassandra

Cassandra Write Path

• Compaction helps keep read latency low.

[Diagram: accumulated sstables sstable[0], sstable[1], sstable[2] … sstable[n]]

Page 12: Bulk Loading Data into Cassandra

Sorted String Tables

mykeyspace-mycf-jb-1-CompressionInfo.db
mykeyspace-mycf-jb-1-Data.db
mykeyspace-mycf-jb-1-Filter.db
mykeyspace-mycf-jb-1-Index.db
mykeyspace-mycf-jb-1-Statistics.db
mykeyspace-mycf-jb-1-Summary.db
mykeyspace-mycf-jb-1-TOC.txt

Page 13: Bulk Loading Data into Cassandra

Sorted String Tables

mykeyspace-mycf-jb-1-Data.db: contains all the data needed to regenerate the other components.

Page 14: Bulk Loading Data into Cassandra

Sorted String Tables

mykeyspace-mycf-jb-1-Index.db: index of row keys.

Page 15: Bulk Loading Data into Cassandra

Sorted String Tables

mykeyspace-mycf-jb-1-Summary.db: index summary, built from the Index.db file.

Page 16: Bulk Loading Data into Cassandra

Sorted String Tables

mykeyspace-mycf-jb-1-Filter.db: bloom filter over the sstable.

Page 17: Bulk Loading Data into Cassandra

Sorted String Tables

mykeyspace-mycf-jb-1-TOC.txt: table of contents listing all components.

Page 18: Bulk Loading Data into Cassandra


• How Data is Stored

• Case Studies

- Generating Dummy Data

- Backfilling Historical Data

- Changing Topologies

• Conclusion

Page 19: Bulk Loading Data into Cassandra

Set up keyspace and column family

create keyspace test
  with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
  and strategy_options = {replication_factor:1};

create column family test
  with comparator = 'AsciiType'
  and default_validation_class = 'AsciiType'
  and key_validation_class = 'AsciiType';

Page 20: Bulk Loading Data into Cassandra

SStableGen.java

AbstractSSTableSimpleWriter writer = new SSTableSimpleUnsortedWriter(
    directory,
    partitioner,
    keyspace,
    columnFamily,
    AsciiType.instance,
    null,                  // subcomparator, only used for super columns
    size_per_sstable_mb
);
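For context, here is a minimal, self-contained sketch of instantiating the writer with concrete arguments. The directory, keyspace and column family names, partitioner choice, and buffer size below are illustrative assumptions (they are not from the slides), using the Cassandra 2.0-era constructor shown above.

import java.io.File;

import org.apache.cassandra.db.marshal.AsciiType;
import org.apache.cassandra.dht.Murmur3Partitioner;
import org.apache.cassandra.io.sstable.AbstractSSTableSimpleWriter;
import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;

public class SStableGenExample {
    public static void main(String[] args) {
        // Target directory; by convention <keyspace>/<column family>, and it must already exist.
        File directory = new File("mykeyspace/mycf");

        // Some 2.0.x versions also require: org.apache.cassandra.config.Config.setClientMode(true);
        AbstractSSTableSimpleWriter writer = new SSTableSimpleUnsortedWriter(
                directory,
                new Murmur3Partitioner(),   // must match the partitioner of the target cluster
                "mykeyspace",               // keyspace name
                "mycf",                     // column family name
                AsciiType.instance,         // column name comparator
                null,                       // subcomparator (only used for super columns)
                64);                        // buffer size in MB; a new sstable is flushed when exceeded

        // ... call writer.newRow(...) / writer.addColumn(...) as shown on the next slide,
        // then writer.close() to flush the final sstable.
    }
}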


Page 23: Bulk Loading Data into Cassandra

ByteBuffer randomBytes = ByteBufferUtil.bytes(randomAscii(1024));
KeyGenerator keyGen = new KeyGenerator();
long dataSize = 0;
writer = new SSTableSimpleUnsortedWriter(…);

while (dataSize < max_data_bytes) {
    writer.newRow(key);                        // key: the partition key for this row (e.g. produced by keyGen)
    for (int j = 0; j < num_cols; j++) {
        ByteBuffer colName = ByteBufferUtil.bytes("col_" + j);
        ByteBuffer colValue = ByteBuffer.wrap(new byte[20]);
        randomBytes.get(colValue.array());     // copy 20 random bytes into the column value
        colValue.position(0);
        writer.addColumn(colName, colValue, timestamp);
        dataSize += colValue.limit();          // track how much data has been written
        if (randomBytes.remaining() < colValue.limit()) {
            randomBytes.position(0);           // wrap around the random buffer
        } else {
            randomBytes.position(randomBytes.position() + colValue.limit());
        }
    }
}
writer.close();                                // flush the final sstable to disk

Page 24: Bulk Loading Data into Cassandra

Examining sstable output

patricia@dev:~/../data$ ls -lh mykeyspace/mycf
total 64
-rw-r--r--  1 patricia  staff   43B Feb  2 15:31 mykeyspace-mycf-jb-1-CompressionInfo.db
-rw-r--r--  1 patricia  staff   79K Feb  2 15:31 mykeyspace-mycf-jb-1-Data.db
-rw-r--r--  1 patricia  staff   16B Feb  2 15:31 mykeyspace-mycf-jb-1-Filter.db
-rw-r--r--  1 patricia  staff   36B Feb  2 15:31 mykeyspace-mycf-jb-1-Index.db
-rw-r--r--  1 patricia  staff  4.3K Feb  2 15:31 mykeyspace-mycf-jb-1-Statistics.db
-rw-r--r--  1 patricia  staff   80B Feb  2 15:31 mykeyspace-mycf-jb-1-Summary.db
-rw-r--r--  1 patricia  staff   79B Feb  2 15:31 mykeyspace-mycf-jb-1-TOC.txt

Page 25: Bulk Loading Data into Cassandra

$ bin/sstableloader Keyspace1/ColFam1

patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d localhost
Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/127.0.0.1]
progress: [/127.0.0.1 1/1 (100)] [total: 100 - 0MB/s (avg: 0MB/s)]


Page 29: Bulk Loading Data into Cassandra

$ bin/sstableloader Keyspace1/ColFam1

• Run command on separate server

Page 30: Bulk Loading Data into Cassandra

$ bin/sstableloader Keyspace1/ColFam1

• Run command on separate server

• Throttle command

Page 31: Bulk Loading Data into Cassandra

$ bin/sstableloader Keyspace1/ColFam1

• Run command on separate server

• Throttle command

• Parallelise processes (see the sketch below)
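One way to parallelise is to split the generated sstables across several directories and run one sstableloader process per directory from the loading machine. Here is a rough sketch of driving that from Java; the directory names and target host are illustrative, and this framing is mine rather than from the talk. The loader invocation itself is the same one shown on the earlier slides.

import java.util.ArrayList;
import java.util.List;

public class ParallelLoad {
    public static void main(String[] args) throws Exception {
        // One directory of sstables per loader process; paths are illustrative.
        String[] sstableDirs = { "mykeyspace/mycf-batch1", "mykeyspace/mycf-batch2" };

        List<Process> loaders = new ArrayList<Process>();
        for (String dir : sstableDirs) {
            // Same invocation as before: bin/sstableloader <dir> -d <host>
            loaders.add(new ProcessBuilder("bin/sstableloader", dir, "-d", "localhost")
                    .inheritIO()
                    .start());
        }

        // Wait for every streaming session to finish.
        for (Process loader : loaders) {
            loader.waitFor();
        }
    }
}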

Page 32: Bulk Loading Data into Cassandra


• How Data is Stored

• Case Studies

- Generating Dummy Data

- Backfilling Historical Data

- Changing Topologies

• Conclusion

Page 33: Bulk Loading Data into Cassandra

// list of orders by customer
customerOrders = new SSTableSimpleUnsortedWriter(…);
// orders by order id
orders = new SSTableSimpleUnsortedWriter(…);

// assume orders are in date order
for (Order order : oldOrders) {
    customerOrders.newRow(ByteBufferUtil.bytes(order.customerId));
    customerOrders.addColumn(ByteBufferUtil.bytes(order.orderId),
                             ByteBufferUtil.EMPTY_BYTE_BUFFER, timestamp);

    orders.newRow(ByteBufferUtil.bytes(order.orderId));   // row key is the order id (the slide shows order.userId, which appears to be a typo)
    orders.addColumn(ByteBufferUtil.bytes("customer_id"),
                     ByteBufferUtil.bytes(order.customerId), timestamp);
    orders.addColumn(ByteBufferUtil.bytes("date"),
                     ByteBufferUtil.bytes(order.date), timestamp);
    orders.addColumn(ByteBufferUtil.bytes("total"),
                     ByteBufferUtil.bytes(order.total), timestamp);
}

customerOrders.close();
orders.close();
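For completeness, the loop above assumes a simple Order value object holding the historical rows being migrated. A minimal sketch follows; the field names come from the slide, but the types are my assumption.

public class Order {
    public final String customerId;   // partition key for the customerOrders writer
    public final String orderId;      // partition key for the orders writer
    public final String date;
    public final String total;

    public Order(String customerId, String orderId, String date, String total) {
        this.customerId = customerId;
        this.orderId = orderId;
        this.date = date;
        this.total = total;
    }
}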


Page 36: Bulk Loading Data into Cassandra


• How Data is Stored

• Case Studies

- Generating Dummy Data

- Backfilling Historical Data

- Changing Topologies

• Conclusion

Page 37: Bulk Loading Data into Cassandra

$ bin/sstableloader Keyspace1/ColFam1

patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d cass1,cass2,cass3
Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/cass1,cass2,cass3,cass4,cass5,cass6]
progress: [/cas1 3/3 (100)] [/cas2 0/4 (0)] [/cas3 0/0 (0)] [/cas4 0/0 (0)] [/cas5 0/0 (0)] [/cas6 1/2 (50)] [total: 50 - 0MB/s (avg: 5MB/s)]


Page 40: Bulk Loading Data into Cassandra

$ bin/sstableloader Keyspace1/ColFam1

patricia@dev:~/.../cassandra-2.0.4$ bin/nodetool compactionstats
pending tasks: 30
Active compaction remaining time :   n/a

Page 41: Bulk Loading Data into Cassandra


• How Data is Stored

• Case Studies

- Generating Dummy Data

- Backfilling Historical Data

- Changing Topologies

• Conclusion

Page 42: Bulk Loading Data into Cassandra

CQL: Keep schema consistent

cqlsh> CREATE KEYSPACE "test" WITH replication =
         {'class': 'SimpleStrategy', 'replication_factor': 1};

cqlsh> CREATE COLUMNFAMILY "test" (id text PRIMARY KEY);

Page 43: Bulk Loading Data into Cassandra

CQL3 Considerations

• Uses CompositeType comparator
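In practice this means that when writing sstables for a CQL3 table by hand, the writer's comparator has to be a CompositeType built from the clustering column types, plus the UTF8 component CQL3 uses internally for column names. A minimal sketch of constructing such a comparator is below; the bigint clustering column is illustrative (the single-key "test" table above would only have the UTF8 column-name component).

import java.util.Arrays;

import org.apache.cassandra.db.marshal.AbstractType;
import org.apache.cassandra.db.marshal.CompositeType;
import org.apache.cassandra.db.marshal.LongType;
import org.apache.cassandra.db.marshal.UTF8Type;

public class Cql3Comparator {
    public static void main(String[] args) {
        // One bigint clustering column, followed by the UTF8 component CQL3 uses
        // for the column name. Pass the result as the writer's comparator.
        CompositeType comparator = CompositeType.getInstance(
                Arrays.<AbstractType<?>>asList(LongType.instance, UTF8Type.instance));

        System.out.println(comparator);
    }
}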

Page 44: Bulk Loading Data into Cassandra

Q&A

Patricia Gorla (@patriciagorla), Cassandra Consultant, www.thelastpickle.com

Planet Cassandra 2014