ETL With Cassandra Streaming Bulk Loading

Alex Araujo [email protected]

Transcript of ETL With Cassandra Streaming Bulk Loading

Page 1: ETL With Cassandra Streaming Bulk Loading

Cassandra ETL: Streaming Bulk Loading

Alex Araujo [email protected]

Page 2: ETL With Cassandra Streaming Bulk Loading

Background

• Sharded MySQL ETL Platform on EC2 (EBS)

• Database Size - Up to 1TB

• Write latencies grew exponentially with data size

Page 3: ETL With Cassandra Streaming Bulk Loading

• Cassandra Thrift Loading on EC2 (Ephemeral RAID0)

• Database Size - ∑ available node space

• Write latencies ~linearly proportional to number of nodes

• 12 XL node cluster (RF=3): 6-125x Improvement over EBS backed MySQL systems

Background

Page 4: ETL With Cassandra Streaming Bulk Loading

Thrift ETL

• Thrift overhead: Converting to/from internal structures

• Routing from coordinator nodes

• Writing to commitlog

• Internal structures -> on-disk format

Source: http://wiki.apache.org/cassandra/BinaryMemtable
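The per-write work listed above can be sketched as a toy step trace. Everything here is illustrative only (none of these function names are real Cassandra APIs); it just contrasts which steps the Thrift path pays per write versus what streaming bulk load sends to the cluster:

```python
# Illustrative-only model of the two write paths; not real Cassandra APIs.
steps = []

def thrift_write(row, replica_nodes):
    steps.append("thrift-decode")            # Thrift -> internal structures
    steps.append("coordinator-routing")      # routing from the coordinator node
    for node in replica_nodes:
        steps.append("commitlog@" + node)    # durability append on each replica
        steps.append("memtable@" + node)     # internal structures, flushed to
                                             # on-disk format later

def streaming_bulk_load(keyspace_dir, replica_nodes):
    # SSTables were built client-side, already in on-disk format:
    # no decode, routing, commitlog, or memtable work on the cluster
    for node in replica_nodes:
        steps.append("stream@" + node)

thrift_write({"key": "user1"}, ["n1", "n2", "n3"])  # RF=3 -> 8 steps
```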

Page 5: ETL With Cassandra Streaming Bulk Loading

Bulk Load

• Core functionality

• Existing ETL Nodes for bulk loading

• Move data file & index generation off C* nodes

Page 6: ETL With Cassandra Streaming Bulk Loading

BMT Bulk Load

• Requires StorageProxy API (Java)

• Rows not live until flush

• Wiki example uses Hadoop

Source: http://wiki.apache.org/cassandra/BinaryMemtable

Page 7: ETL With Cassandra Streaming Bulk Loading

Streaming Bulk Load

• Cassandra as Fat Client

• BYO SSTables

• sstableloader [options] /path/to/keyspace_dir

• Can ignore list of nodes (-i)

• keyspace_dir should be named after the keyspace and contain the generated SSTable Data & Index files
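The invocation above can be wrapped in a small helper (a sketch with hypothetical function names; the `-v`, `--debug`, and `-i` flags follow the deck's own usage, and the example path is invented):

```python
import subprocess

def build_loader_cmd(keyspace_dir, ignore_nodes=None):
    # keyspace_dir is named after the keyspace and holds the
    # generated *-Data.db / *-Index.db files
    cmd = ["sstableloader", "-v", "--debug"]
    if ignore_nodes:                       # -i skips the listed nodes
        cmd += ["-i", ",".join(ignore_nodes)]
    cmd.append(keyspace_dir)
    return cmd

def stream_keyspace(keyspace_dir, ignore_nodes=None):
    # nonzero exit status means the stream failed
    return subprocess.call(build_loader_cmd(keyspace_dir, ignore_nodes))

# e.g. stream_keyspace("/raid0/etl/Prefs", ignore_nodes=["10.0.0.5"])
```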

Page 8: ETL With Cassandra Streaming Bulk Loading

Users
• Row key: UserId<Hash>
• Columns: email, name, ...

UserGroups
• Row key: GroupId<UUID>
• Columns: UserId -> {"date_joined":"<date>","date_left":"<date>","active":<true|false>}, ...

UserGroupTimeline
• Row key: GroupId<UUID>
• Columns: <TimeUUID> -> UserId, ...
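The three column families above can be modeled as plain Python dicts (toy data; the sample key values, email, and name are invented, but the keys and JSON field names follow the slide):

```python
import json
import uuid

user_id = "a1b2c3"                  # UserId<Hash>: hash of the user
group_id = str(uuid.uuid4())        # GroupId<UUID>
time_uuid = str(uuid.uuid1())       # TimeUUID column name

# Users: row key = UserId<Hash>, static columns per user
users = {user_id: {"email": "[email protected]", "name": "Alex"}}

# UserGroups: row key = GroupId<UUID>, one column per UserId whose
# value is a JSON blob of membership metadata
user_groups = {
    group_id: {
        user_id: json.dumps({
            "date_joined": "2011-07-01",
            "date_left": "",
            "active": True,
        })
    }
}

# UserGroupTimeline: row key = GroupId<UUID>, columns ordered by
# TimeUUID, each valued with the UserId active at that time
user_group_timeline = {group_id: {time_uuid: user_id}}
```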

Page 9: ETL With Cassandra Streaming Bulk Loading

Setup

• Opscode Chef 0.10.2 on EC2

• Cassandra 0.8.2-dev-SNAPSHOT (trunk)

• Custom Java ETL JAR

• The Grinder 3.4 (Jython) Test Harness

Page 10: ETL With Cassandra Streaming Bulk Loading

Chef 0.10.2

• knife-ec2 bootstrap with --ephemeral

• ec2::ephemeral_raid0 recipe

• Installs mdadm, unmounts the default /mnt, creates a RAID0 array on /mnt/md0
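The recipe's steps roughly correspond to the commands below (a sketch: the device names and the xfs filesystem choice are assumptions, and the real recipe drives this through Chef resources rather than a command list):

```python
def raid0_commands(devices, mount_point="/mnt/md0"):
    # devices: ephemeral block devices, e.g. ["/dev/xvdb", "/dev/xvdc"] (assumed names)
    return [
        "umount /mnt",                                  # free the default ephemeral mount
        "mdadm --create /dev/md0 --level=0 "
        "--raid-devices=%d %s" % (len(devices), " ".join(devices)),
        "mkfs.xfs /dev/md0",                            # assumed filesystem choice
        "mkdir -p %s" % mount_point,
        "mount /dev/md0 %s" % mount_point,
    ]
```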

Page 11: ETL With Cassandra Streaming Bulk Loading

Chef 0.10.2

• cassandra::default recipe

• Downloads/extracts apache-cassandra-<version>-bin.tar.gz

• Links /var/lib/cassandra to /raid0/cassandra

• Creates cassandra user & directories, increases file limits, sets up cassandra service, generates config files

Page 12: ETL With Cassandra Streaming Bulk Loading

Chef 0.10.2

• cassandra::cluster_node recipe

• Determines # nodes in cluster

• Calculates initial_token; generates cassandra.yaml

• Creates keyspace and column families
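The deck doesn't show the recipe's token arithmetic, but for an evenly balanced ring under the RandomPartitioner the standard calculation is:

```python
def initial_token(node_index, num_nodes):
    # RandomPartitioner tokens are integers in [0, 2**127);
    # spacing nodes evenly balances key ranges across the ring
    return node_index * (2 ** 127) // num_nodes

tokens = [initial_token(i, 12) for i in range(12)]  # the deck's 12-node cluster
```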

Page 13: ETL With Cassandra Streaming Bulk Loading

Chef 0.10.2

• cassandra::bulk_load_node recipe

• Generates same cassandra.yaml with empty initial_token

• Installs/configures grinder scripts; Java ETL JAR

Page 14: ETL With Cassandra Streaming Bulk Loading

ETL JAR

Page 15: ETL With Cassandra Streaming Bulk Loading

ETL JAR

for (File file : files) {
    CBLI importer = new CBLI(...);  // CBLI = CassandraBulkLoadImporter
    importer.open();
    // Processing omitted
    importer.close();
}

Page 16: ETL With Cassandra Streaming Bulk Loading

ETL JAR

CassandraBulkLoadImporter.initSSTableWriters():

File tempFiles = new File("/path/to/Prefs");
tempFiles.mkdirs();
for (String cfName : COLUMN_FAMILY_NAMES) {
    SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
        tempFiles,
        Model.Prefs.Keyspace.name,
        cfName,
        Model.COLUMN_FAMILY_COMPARATORS.get(cfName),
        null,            // no subcomparator: no Super CFs
        bufferSizeInMB);
    writers.put(cfName, writer);
}

Page 17: ETL With Cassandra Streaming Bulk Loading

ETL JAR

CassandraBulkLoadImporter.processSuppressionRecips():

for (User user : users) {
    String key = user.getUserId();
    SSTableSimpleUnsortedWriter writer = tableWriters.get(Model.Users.CF.name);
    // rowKey() converts String to ByteBuffer
    writer.newRow(rowKey(key));
    org.apache.cassandra.thrift.Column column = newUserColumn(user);
    writer.addColumn(column.name, column.value, column.timestamp);
    ...  // Repeat for each column family
}

Page 18: ETL With Cassandra Streaming Bulk Loading

ETL JAR

CassandraBulkLoadImporter.close():

for (String cfName : COLUMN_FAMILY_NAMES) {
    try {
        tableWriters.get(cfName).close();
    } catch (IOException e) {
        log.error("close failed for " + cfName);
        throw new RuntimeException(cfName + " did not close");
    }
    String streamCmd = "sstableloader -v --debug " + tempFiles.getAbsolutePath();
    Process stream = Runtime.getRuntime().exec(streamCmd);
    if (stream.waitFor() != 0)
        log.error("stream failed");
}

Page 19: ETL With Cassandra Streaming Bulk Loading

cassandra_bulk_load.py

import random
import sys
import uuid

from java.io import File
from net.grinder.script.Grinder import grinder
from net.grinder.script import Statistics
from net.grinder.script import Test
from com.mycompany import App
from com.mycompany.tool import SingleColumnBulkImport

Page 20: ETL With Cassandra Streaming Bulk Loading

cassandra_bulk_load.py

input_files = []  # files to load
site_id = str(uuid.uuid4())
import_id = random.randint(1, 1000000)
list_ids = []  # lists users will be loaded to

try:
    App.INSTANCE.start()
    dao = App.INSTANCE.getUserDAO()
    bulk_import = SingleColumnBulkImport(
        dao.prefsKeyspace, input_files, site_id, list_ids, import_id)
except:
    exception = sys.exc_info()[1]
    print exception.message
    print exception.stackTrace

Page 21: ETL With Cassandra Streaming Bulk Loading

cassandra_bulk_load.py

# Import stats
grinder.statistics.registerDataLogExpression("Users Imported", "userLong0")

grinder.statistics.registerSummaryExpression("Total Users Imported", "(+ userLong0)")

grinder.statistics.registerDataLogExpression("Import Time", "userLong1")

grinder.statistics.registerSummaryExpression("Import Total Time (sec)", "(/ (+ userLong1) 1000)")

rate_expression = "(* (/ userLong0 (/ (/ userLong1 1000) "+str(num_threads)+")) "+str(replication_factor)+")"

grinder.statistics.registerSummaryExpression("Cluster Insert Rate (users/sec)", rate_expression)
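The Lisp-style rate expression above can be restated in plain Python (a hedged restatement: userLong0 is users imported, userLong1 is import time in milliseconds, per the expressions registered above):

```python
def cluster_insert_rate(users_imported, import_time_ms,
                        num_threads, replication_factor):
    # users / (per-thread seconds) * RF
    # -> total replica writes per second across the cluster
    seconds_per_thread = (import_time_ms / 1000.0) / num_threads
    return (users_imported / seconds_per_thread) * replication_factor

# e.g. 100000 users in 50s of thread time split over 4 threads at RF=3
rate = cluster_insert_rate(100000, 50000, 4, 3)  # -> 24000.0
```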

Page 22: ETL With Cassandra Streaming Bulk Loading

cassandra_bulk_load.py

# Import and record stats
def import_and_record():
    bulk_import.importFiles()
    grinder.statistics.forCurrentTest.setLong(
        "userLong0", bulk_import.totalLines)
    grinder.statistics.forCurrentTest.setLong(
        "userLong1", grinder.statistics.forCurrentTest.time)

# Create an Import Test with a test number and a description
import_test = Test(1, "Recip Bulk Import").wrap(import_and_record)

# A TestRunner instance is created for each thread
class TestRunner:
    # This method is called for every run.
    def __call__(self):
        import_test()

Page 23: ETL With Cassandra Streaming Bulk Loading

Stress Results

• Once the Data and Index files are generated, streaming bulk load is FAST

• Average: ~2.5x increase over Thrift

• ~15-300x increase over MySQL

• Impact on cluster is minimal

• Observed downside: writing our own SSTables is slower than letting Cassandra write them

Page 24: ETL With Cassandra Streaming Bulk Loading

Q’s?