ZhangGang, Fabio, Deng Ziyan 2013.01.24. 2/31 NoSQL Introduction to Cassandra Data Model Design...

Using Cassandra in DIRAC Accounting System

ZhangGang, Fabio, Deng Ziyan

2013.01.24

2/31

Overview

• NoSQL
• Introduction to Cassandra
• Data Model Design
• Implementation

3/31

NoSQL

4/31

NoSQL

Learning NoSQL
• What NoSQL is, and how it differs from an RDBMS
• Use Redis to get familiar with some interesting features of NoSQL
• Compare the candidates and choose one for DIRAC

What do we need?
• High scalability, fast writes and reads, big data…

Four candidates: Riak, CouchDB, Hadoop HBase and Cassandra

First, we chose Hadoop HBase to explore

5/31

NoSQL

6/31

NoSQL

Hadoop HBase
• Modeled after Google's BigTable
• HBase must be installed on top of HDFS
• Deployment and maintenance are much more complicated than for Cassandra

Then we turned to Cassandra

7/31

Introduction to Cassandra

8/31


Some features of Cassandra
• Schema flexibility
• BigTable-like features: columns, column families
• Key/value pairs: row/column pairs
• Secondary indexes
• Writes are much faster than reads
• All nodes are similar: no single point of failure
• Tunable trade-offs for distribution and replication

All of this research is based on a standalone (single-node) setup. In production, a cluster is needed.

9/31


RDBMS: uses the join operation, which increases normalization and reduces redundancy.

NoSQL (Cassandra): to get better performance and high scalability, it drops the join operation, which means denormalizing the data and maintaining multiple copies of it (increasing redundancy).
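A minimal sketch of this trade-off, using plain Python dicts rather than a real database. The table names and the sample user are invented for illustration only.

```python
# Normalized (RDBMS style): each fact is stored once and combined at read time.
users = {1: {"name": "alice", "site_id": 10}}
sites = {10: {"site": "BES"}}

def site_of_user_normalized(user_id):
    # The "join": follow the foreign key from users to sites.
    return sites[users[user_id]["site_id"]]["site"]

# Denormalized (Cassandra style): the site name is copied into every row that
# needs it, so a read is a single lookup, at the cost of redundant copies.
users_denormalized = {1: {"name": "alice", "site": "BES"}}

def site_of_user_denormalized(user_id):
    return users_denormalized[user_id]["site"]
```

Both functions return the same answer; the denormalized one does so without touching a second table.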

10/31


Data model
• Cassandra stores data in a multidimensional hash table: [keyspace][columnfamily][row][column]

Some concepts in Cassandra: keyspace, column family, row, column
• keyspace -> database
• column family -> table
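The four-level addressing can be pictured as nested Python dicts. This is only an analogy, not how Cassandra stores data internally; the keyspace name "accounting", the row key "user1" and the value are made up.

```python
# Nested dicts as a stand-in for [keyspace][columnfamily][row][column].
store = {
    "accounting": {                          # keyspace (roughly a database)
        "cum_groupby_user": {                # column family (roughly a table)
            "user1": {                       # row key
                "CPUTime:2008-01-01": 120,   # column name -> column value
            },
        },
    },
}

def get_column(keyspace, column_family, row, column):
    """Look up one value by walking the four dimensions in order."""
    return store[keyspace][column_family][row][column]
```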

11/31


Another structure: super column family
• Can be thought of as a map of maps

12/31


Model the query first
• With Cassandra we model the queries and let the data be organized around them. Think of the most common query paths the application will use, and then create the column families needed to support them.

Aggregate key pattern
• This pattern fuses two scalar values together with a separator to create an aggregate key, like:

  Column name:   CPUTime:2008-01-01
  Column value:  value
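A small sketch of building and splitting such a key. The helper names are hypothetical; only the "CPUTime:2008-01-01" layout comes from the slide.

```python
SEPARATOR = ":"

def make_aggregate_key(quantity, day):
    """Fuse two scalar values into one column name, e.g. 'CPUTime:2008-01-01'."""
    return f"{quantity}{SEPARATOR}{day}"

def split_aggregate_key(key):
    """Recover (quantity, day); split once so the day part is kept whole."""
    quantity, day = key.split(SEPARATOR, 1)
    return quantity, day
```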

13/31


A special column: counter column
• A counter is a special kind of column used to store a number that incrementally counts the occurrences of a particular event or process
• Set the value type to “CounterColumnType” when creating the column family
• The value is increased or decreased, not replaced
• Example: for site “BES”, at “2012-10-10”, 10 jobs (meaning the day has 10 CPUTime values).

  Column name:   CPUTime:2012-10-01
  Column value:  val=sum(10 CPUTime)
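The increment-only behaviour can be imitated in plain Python. This is an in-memory stand-in, not a Cassandra client, and the ten CPUTime values are made up.

```python
from collections import defaultdict

# row key -> column name -> running total; a stand-in for a counter column family.
counter_cf = defaultdict(lambda: defaultdict(int))

def add(row_key, column, value=1):
    """Increment (or, with a negative value, decrement) a counter column."""
    counter_cf[row_key][column] += value
    return counter_cf[row_key][column]

# Ten jobs at site "BES" on the same day, each reporting some CPUTime.
for cpu_time in [5, 3, 8, 2, 7, 1, 4, 6, 9, 5]:
    add("BES", "CPUTime:2012-10-10", cpu_time)
```

After the loop the column holds the sum of the ten values rather than the last one written.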

14/31

Data Model Design

15/31

Data Model Design

Model the query first
• Four factors determine a plot: start time, end time, plot to generate, and groupby
• The data that a plot needs is grouped by something
• Preprocess the data

16/31

Data Model Design

Create a CF for CPUTime groupby user
• CF: standard column family
• Row key: user
• Column name: startTime
• Column value: CPUTime
• Problem: bad performance, and column names are not unique

user 1    startTime 1    …    startTime n
          value          …    value
…         …              …    …
user n    startTime 2    …    NULL
          value          …    NULL
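The "column name not unique" problem can be seen with a dict standing in for a standard column family: two jobs of one user that start in the same second share a column name, so the second write replaces the first. The timestamps and CPUTime values here are invented.

```python
# Standard column family: row key -> {column name: column value}.
standard_cf = {}

def insert(row_key, columns):
    """Insert columns; an existing column with the same name is overwritten."""
    standard_cf.setdefault(row_key, {}).update(columns)

# Two jobs of the same user that start in the same second.
insert("user1", {1357000000: 120.0})
insert("user1", {1357000000: 45.0})  # same startTime: the first CPUTime is lost
```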

17/31

Data Model Design

The first improvement
• CF: counter column family
• Row key: user
• Column name: startTime/86400 (aggregate by day)
• Column value: the sum of CPUTime within a day
• Problem: one CF for one plot

user 1    2008-01-01    …             2013-01-24
          value         …             value
…         …             …             …
user n    2009-05-17    2009-05-20    NULL
          value         value         NULL
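A sketch of the day bucketing, assuming startTime is a Unix timestamp in seconds and that days are cut at UTC midnight (the slides do not say which time zone is used). The function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400

def day_bucket(start_time):
    """Map a Unix timestamp to its day as a 'YYYY-MM-DD' column name (UTC)."""
    day_start = (int(start_time) // SECONDS_PER_DAY) * SECONDS_PER_DAY
    return datetime.fromtimestamp(day_start, tz=timezone.utc).strftime("%Y-%m-%d")

def aggregate_by_day(records):
    """records: iterable of (user, start_time, cpu_time) -> {user: {day: total}}."""
    rows = defaultdict(lambda: defaultdict(float))
    for user, start_time, cpu_time in records:
        rows[user][day_bucket(start_time)] += cpu_time
    return rows
```

Because all jobs of one day land in the same column, the number of columns per row is bounded by the number of days rather than the number of jobs.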

18/31

Data Model Design

In Cassandra, there are two methods to solve the problem
• Use a super column family
• Use an aggregate key

19/31

Data Model Design

The second improvement
• CF: counter column family
• Row key: user
• Column name: aggregate key, e.g. (CPUTime,2012-01-01), (DiskSpace,2012-01-01)
• Column value: the sum of the value within a day

user 1    CPUTime:2008-01-01    …                      DiskSpace:2013-01-24
          value                 …                      value
…         …                     …                      …
user n    CPUTime:2009-05-17    JobCount:2009-05-20    NULL
          value                 value                  NULL
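With aggregate keys, one accounting record can update several per-day counters in the same row, so a single CF can serve several plots. A plain-Python sketch follows; record_job and the sample numbers are assumptions, not part of the original implementation.

```python
from collections import defaultdict

# Stand-in for the counter CF: user -> {"Quantity:day": running total}.
cum_groupby_user = defaultdict(lambda: defaultdict(float))

def record_job(user, day, quantities):
    """Add each quantity of one record to its 'Quantity:day' counter column."""
    for name, value in quantities.items():
        cum_groupby_user[user][f"{name}:{day}"] += value

record_job("user1", "2012-01-01", {"CPUTime": 120.0, "DiskSpace": 2.5, "JobCount": 1})
record_job("user1", "2012-01-01", {"CPUTime": 30.0, "DiskSpace": 0.5, "JobCount": 1})
```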

20/31

Data Model Design

Create a CF to store the raw data
• Store raw data for future usage
• The columns are specified when creating the CF
• Row key: timestamp type (DoubleType)
• Disk space: 21GB (in MySQL: 4GB)

21/31

Data Model Design

It is a static column family

Biao     JobClass    User     …    Site     …        CPUTime
row 1    value       value    …    value    …        value
row 2    value       value    …    value    …        value
…        …           …        …    …        …        …
row n    value       value    …    value    value    value
…        …           …        …    …        …        …

22/31

Implementation

23/31

Implementation

Create column families for each “groupby”
• cum_groupby_user: 4.2MB
• cum_groupby_site: 11MB
• cum_groupby_processingtype: 428KB
• cum_groupby_country: 3.9MB
• cum_groupby_grid: 1.2MB
• cum_groupby_usergroup: 828KB


24/31

Implementation

Communicate with Cassandra
• pycassa: a Python client for Apache Cassandra

Input data
• When a new record comes, insert the data into all the CFs at the same time
• Performance: 4 CFs, 1100 records, about 18s
• Input data into raw_data_cf (a standard CF):
    pycassa.ColumnFamily.insert(key, columns)
• Input data into groupby_cfs (counter CFs):
    pycassa.ColumnFamily.add(key, column, value)
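The write path can be sketched without a running cluster by using an in-memory stand-in that mimics only the two calls named above. FakeColumnFamily, store_record and the sample record are hypothetical; with pycassa the objects would be real ColumnFamily instances obtained from a connection pool.

```python
from collections import defaultdict

class FakeColumnFamily:
    """In-memory stand-in exposing insert() and add() like the calls above."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def insert(self, key, columns):
        # Standard CF: set (or overwrite) the given columns.
        self.rows[key].update(columns)

    def add(self, key, column, value=1):
        # Counter CF: increment instead of replace.
        self.rows[key][column] = self.rows[key].get(column, 0) + value

raw_data_cf = FakeColumnFamily()
groupby_cfs = {"user": FakeColumnFamily(), "site": FakeColumnFamily()}

def store_record(timestamp, record, day):
    """Write one record to the raw CF and to every group-by CF."""
    raw_data_cf.insert(timestamp, record)
    for field, cf in groupby_cfs.items():
        cf.add(record[field], f"CPUTime:{day}", record["CPUTime"])

store_record(1357000000.0, {"user": "user1", "site": "BES", "CPUTime": 120}, "2013-01-01")
```

Each incoming record costs one insert plus one add per group-by CF, which is the redundancy that denormalization trades for fast reads.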

25/31

Implementation

The data in one row looks like:

(figure not preserved in the transcript)

26/31

Implementation

Retrieve data and generate a plot
• start_time, end_time: determine the time span
• generate: determines the columns within the time span
• groupby: decides which CF should be chosen
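How the four parameters could select the data for one plot, sketched over plain dicts and assuming the aggregate-key column names ("Quantity:YYYY-MM-DD") described earlier. select_plot_data is a hypothetical helper; a real client would issue a column slice against the chosen CF instead of scanning in Python.

```python
def select_plot_data(column_families, groupby, generate, start_day, end_day):
    """Return {row key: {day: value}} for one plot.

    groupby picks the column family, generate picks the quantity, and
    start_day/end_day (inclusive, 'YYYY-MM-DD') bound the columns.
    """
    cf = column_families[groupby]
    low, high = f"{generate}:{start_day}", f"{generate}:{end_day}"
    result = {}
    for row_key, columns in cf.items():
        # ISO dates sort lexicographically, so a string range is a date range.
        series = {name.split(":", 1)[1]: value
                  for name, value in sorted(columns.items())
                  if low <= name <= high}
        if series:
            result[row_key] = series
    return result
```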

27/31

Implementation

Badger01 hardware
• CPU: Intel(R) Xeon(R) CPU E5620 @2.40GHz
• CPU cores: 4
• Memory: 16GB

Comparison
• The left plot is taken from the LHCb web portal
• The right plot is generated by Cassandra on badger01

28/31

Implementation

Generating the same plot on badger01 using MySQL takes about 30s

29/31

Implementation

30/31

Implementation

31/31

Thanks
