ZhangGang, Fabio, Deng Ziyan 2013.01.24. 2/31 NoSQL Introduction to Cassandra Data Model Design...

Using Cassandra in DIRAC Accounting System ZhangGang, Fabio, Deng Ziyan 2013.01.24


Using Cassandra in DIRAC Accounting System

ZhangGang, Fabio, Deng Ziyan

2013.01.24


Overview

  NoSQL
  Introduction to Cassandra
  Data Model Design
  Implementation


NoSQL


Learning NoSQL
  What NoSQL is
  How it differs from an RDBMS
  Use Redis to get familiar with some interesting features of NoSQL
  Compare the options and choose one for DIRAC

What do we need? High scalability, fast writes and reads, big data…

Four candidates: Riak, CouchDB, Hadoop HBase and Cassandra

First, Hadoop HBase was chosen for exploration


NoSQL

Hadoop HBase
  Modeled after Google's BigTable
  HBase must be installed on top of HDFS
  Deployment and maintenance are much more complicated than for Cassandra

So we then turned to Cassandra


Introduction to Cassandra


Some features of Cassandra
  Schema flexibility
  BigTable-like features: columns, column families
  Key/value pairs: row/column pairs
  Secondary indexes
  Writes are much faster than reads
  All nodes are peers: no single point of failure
  Tunable trade-offs for distribution and replication

All of this research was done in standalone (single-node) mode. Production use would need a cluster.


Introduction to Cassandra

RDBMS:
  Relies on the join operation, which allows more normalization and less redundancy

NoSQL (Cassandra):
  For better performance and high scalability, drops the join operation, which means denormalizing the data and maintaining multiple copies of it (more redundancy)


Introduction to Cassandra

Data model
  Cassandra stores data in a multidimensional hash table
    [keyspace][columnfamily][row][column]

Some concepts in Cassandra: keyspace, column family, row, column
  keyspace -> database
  column family -> table
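
The four-level addressing above can be pictured with nested Python dictionaries. This is only an in-memory sketch of the lookup path, not how Cassandra stores data on disk. The keyspace name `accounting`, the row key `user1` and the stored value are made up for illustration.

```python
# Illustrative only: a Cassandra-like lookup path modeled as nested dicts.
# All names and the value below are hypothetical.
store = {
    "accounting": {                          # keyspace (~ database)
        "cum_groupby_user": {                # column family (~ table)
            "user1": {                       # row key
                "CPUTime:2008-01-01": 3600,  # column name -> column value
            },
        },
    },
}


def get_column(store, keyspace, column_family, row, column):
    """Follow [keyspace][columnfamily][row][column] and return the value."""
    return store[keyspace][column_family][row][column]


value = get_column(store, "accounting", "cum_groupby_user",
                   "user1", "CPUTime:2008-01-01")
```

Rows in the same column family need not share the same columns, which is why the innermost level is a per-row dictionary rather than a fixed schema.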


Introduction to Cassandra

Another structure: super column family
  Can be thought of as a map of maps


Introduction to Cassandra

Model the query first
  With Cassandra we model the queries and let the data be organized around them. Think of the most common query paths the application will use, and then create the column families needed to support them.

Aggregate key pattern
  This pattern fuses two scalar values together with a separator to create an aggregate, like:

  CPUTime:2008-01-01
  value
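
A minimal sketch of the aggregate key pattern, assuming the `:` separator shown in the example above. The helper names `make_aggregate_key` and `split_aggregate_key` are hypothetical and not part of any library.

```python
SEPARATOR = ":"  # separator used in the example CPUTime:2008-01-01


def make_aggregate_key(metric, day):
    """Fuse a metric name and a day string into one column name."""
    if SEPARATOR in metric:
        # Otherwise the key could not be split back unambiguously.
        raise ValueError("metric must not contain the separator")
    return metric + SEPARATOR + day


def split_aggregate_key(key):
    """Recover (metric, day) from an aggregate column name."""
    metric, day = key.split(SEPARATOR, 1)
    return metric, day


key = make_aggregate_key("CPUTime", "2008-01-01")
parts = split_aggregate_key(key)
```

Putting the metric first means all columns of one metric share a prefix, so they sort next to each other when column names are compared as strings.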


Introduction to Cassandra

A special column: the counter column
  A counter is a special kind of column used to store a number that incrementally counts the occurrences of a particular event or process
  Set the value type to “CounterColumnType” when creating the column family
  The value is increased or decreased, not replaced
  Example: for site “BES” on “2012-10-10” there are 10 jobs (so that day has 10 CPUTime values)

  CPUTime:2012-10-01
  val=sum(10 CPUTime)
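
The increment-only behaviour of a counter column can be sketched without a running cluster. The class below is a hypothetical in-memory stand-in, not Cassandra or pycassa code, and the ten CPUTime values are made up.

```python
from collections import defaultdict


class CounterRowSketch:
    """Hypothetical in-memory stand-in for one counter column family."""

    def __init__(self):
        # row key -> column name -> running total
        self.rows = defaultdict(lambda: defaultdict(int))

    def add(self, key, column, value=1):
        """Increment (or, with a negative value, decrement) a counter."""
        self.rows[key][column] += value
        return self.rows[key][column]


cf = CounterRowSketch()

# Ten jobs for site "BES" on the same day, each reporting some CPUTime.
job_cpu_times = [5, 7, 3, 9, 1, 4, 6, 2, 8, 10]  # made-up values
for cpu_time in job_cpu_times:
    cf.add("BES", "CPUTime:2012-10-10", cpu_time)

total = cf.rows["BES"]["CPUTime:2012-10-10"]
```

Each write only says "add this much", so the writer never has to read the current value first. That is the property that makes counters a fit for running sums.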


Data Model Design


Model the query first
  Four factors determine a plot: start time, end time, plot to generate, and groupby
  The data that a plot needs is always grouped by something
  So the data can be preprocessed


Data Model Design

Create a CF for CPUTime grouped by user
  CF: standard column family
  Row key: user
  Column name: startTime
  Column value: CPUTime
  Problem: poor performance, and column names are not unique

user 1 | startTime 1 | … | startTime n
       | value       | … | value
…      | …           | … | …
user n | startTime 2 | … | NULL
       | value       | … | NULL


Data Model Design

The first improvement
  CF: counter column family
  Row key: user
  Column name: startTime/86400 (aggregate by day)
  Column value: the sum of CPUTime within a day
  Problem: each plot needs its own CF

user 1 | 2008-01-01 | …          | 2013-01-24
       | value      | …          | value
…      | …          | …          | …
user n | 2009-05-17 | 2009-05-20 | NULL
       | value      | value      | NULL
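
The day aggregation (startTime/86400) can be sketched as follows. This assumes startTime is an epoch timestamp in seconds and that days are cut at UTC midnight, neither of which the slides state. The function names and sample records are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def day_bucket(start_time):
    """Map an epoch timestamp (seconds) to a UTC day label via startTime // 86400."""
    day_index = int(start_time) // SECONDS_PER_DAY
    day_start = datetime.fromtimestamp(day_index * SECONDS_PER_DAY, tz=timezone.utc)
    return day_start.strftime("%Y-%m-%d")


def aggregate_by_day(records):
    """Sum CPUTime per (user, day). records: iterable of (user, start_time, cpu_time)."""
    rows = defaultdict(lambda: defaultdict(int))
    for user, start_time, cpu_time in records:
        rows[user][day_bucket(start_time)] += cpu_time
    return rows


records = [
    ("user1", 1199145600, 100),  # 2008-01-01 00:00:00 UTC
    ("user1", 1199149200, 50),   # 2008-01-01 01:00:00 UTC
    ("user1", 1199232000, 25),   # 2008-01-02 00:00:00 UTC
]
rows = aggregate_by_day(records)
```

Bucketing by day fixes both problems of the first design: column names become unique per day, and a row holds at most one column per day instead of one per job.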


Data Model Design

In Cassandra, there are two ways to solve this problem
  Use a super column family
  Use an aggregate key


Data Model Design

The second improvement
  CF: counter column family
  Row key: user
  Column name: aggregate key, e.g. (CPUTime,2012-01-01), (DiskSpace,2012-01-01)
  Column value: the sum of the value within a day

user 1 | CPUTime:2008-01-01 | …                   | DiskSpace:2013-01-24
       | value              | …                   | value
…      | …                  | …                   | …
user n | CPUTime:2009-05-17 | JobCount:2009-05-20 | NULL
       | value              | value               | NULL
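
A rough sketch of why the aggregate key lets one CF serve several plots: every metric of a record lands in the same user row, under its own metric:day column. The in-memory dictionaries, the helper names and the numbers are assumptions for illustration only.

```python
from collections import defaultdict


def add_record(cf_rows, user, day, metrics):
    """Increment one counter column per metric for this user and day."""
    for metric, value in metrics.items():
        column = "%s:%s" % (metric, day)  # aggregate key, e.g. CPUTime:2012-01-01
        cf_rows[user][column] += value


def read_metric(cf_rows, user, metric, days):
    """Read one metric for the given days; missing columns count as 0."""
    row = cf_rows.get(user, {})
    return [row.get("%s:%s" % (metric, day), 0) for day in days]


# Hypothetical stand-in for one counter CF: row key -> column name -> total
cum_groupby_user = defaultdict(lambda: defaultdict(int))

add_record(cum_groupby_user, "user1", "2012-01-01", {"CPUTime": 120, "DiskSpace": 5})
add_record(cum_groupby_user, "user1", "2012-01-01", {"CPUTime": 30, "JobCount": 1})

cpu = read_metric(cum_groupby_user, "user1", "CPUTime", ["2012-01-01", "2012-01-02"])
```

A CPUTime plot and a DiskSpace plot grouped by user now read different columns of the same rows, so only one CF per groupby is needed.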


Data Model Design

Create a CF to store the raw data
  Store raw data for future use
  The columns are specified when the CF is created
  Row key: timestamp (DoubleType)
  Disk space: 21GB (in MySQL: 4GB)


Data Model Design

It is a static column family

Biao  | JobClass | User  | … | Site  | …     | CPUTime
row 1 | value    | value | … | value | …     | value
row 2 | value    | value | … | value | …     | value
…     | …        | …     | … | …     | …     | …
row n | value    | value | … | value | value | value
…     | …        | …     | … | …     | …     | …


Implementation


Create a column family for each “groupby”
  cum_groupby_user             4.2MB
  cum_groupby_site             11MB
  cum_groupby_processingtype   428KB
  cum_groupby_country          3.9MB
  cum_groupby_grid             1.2MB
  cum_groupby_usergroup        828KB


Implementation

Communicating with Cassandra
  pycassa: a Python client for Apache Cassandra

Inserting data
  When a new record arrives, insert it into all the CFs at the same time
  Performance: 4 CFs, 1100 records, about 18s
  Insert into raw_data_cf (a standard CF):
    pycassa.ColumnFamily.insert(key, columns)
  Insert into the groupby CFs (counter CFs):
    pycassa.ColumnFamily.add(key, column, value)
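
The fan-out write can be sketched as below. To keep the example runnable without a Cassandra server, `FakeColumnFamily` is a hypothetical in-memory stand-in that only mimics the `insert` and `add` calls named above; with a live cluster, pycassa `ColumnFamily` objects would take its place. The record layout (a precomputed `day` field, the field names `User` and `Site`) is also an assumption.

```python
from collections import defaultdict


class FakeColumnFamily:
    """In-memory stand-in exposing the two pycassa-style calls named above."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def insert(self, key, columns):
        """Standard CF: set (overwrite) the given columns."""
        self.rows[key].update(columns)

    def add(self, key, column, value=1):
        """Counter CF: increment the column by value."""
        self.rows[key][column] = self.rows[key].get(column, 0) + value


def store_record(record, raw_cf, groupby_cfs, metrics=("CPUTime",)):
    """Write one accounting record to the raw CF and to every groupby CF."""
    raw_cf.insert(record["timestamp"], record)
    day = record["day"]
    for field, cf in groupby_cfs.items():
        for metric in metrics:
            # Row key is the groupby value, column name is metric:day.
            cf.add(record[field], "%s:%s" % (metric, day), record[metric])


raw_data_cf = FakeColumnFamily()
groupby_cfs = {"User": FakeColumnFamily(), "Site": FakeColumnFamily()}

# Made-up records
record = {"timestamp": 1349827200.0, "day": "2012-10-10",
          "User": "user1", "Site": "BES", "CPUTime": 42}
store_record(record, raw_data_cf, groupby_cfs)
store_record(dict(record, timestamp=1349827260.0, CPUTime=8), raw_data_cf, groupby_cfs)
```

Every record costs one raw insert plus one counter increment per groupby CF and metric, which is the price of denormalizing for fast reads.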


Implementation

The data in one row looks like: [figure]


Implementation

Retrieve data and generate a plot
  start_time, end_time: determine the time span
  generate: together with the time span, determines the columns to read
  groupby: decides which CF is queried
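
How the four parameters turn into a read can be sketched like this, assuming day-granularity columns named metric:day as in the data model design. The CFs are plain dictionaries here, and `days_in_span`, `plot_series` and the sample values are hypothetical.

```python
from datetime import date, timedelta


def days_in_span(start_day, end_day):
    """List every day from start_day to end_day inclusive as YYYY-MM-DD strings."""
    count = (end_day - start_day).days + 1
    return [(start_day + timedelta(days=i)).isoformat() for i in range(count)]


def plot_series(groupby_cfs, groupby, generate, start_day, end_day):
    """Return {row key: [value per day]} for one plot.

    groupby picks the CF; generate plus the time span pick the columns.
    """
    cf_rows = groupby_cfs[groupby]
    columns = ["%s:%s" % (generate, day) for day in days_in_span(start_day, end_day)]
    return {key: [row.get(col, 0) for col in columns] for key, row in cf_rows.items()}


# Hypothetical preaggregated data, shaped like the counter CFs of the design.
groupby_cfs = {
    "user": {
        "user1": {"CPUTime:2012-10-10": 50, "CPUTime:2012-10-12": 7},
        "user2": {"CPUTime:2012-10-11": 3},
    },
}

series = plot_series(groupby_cfs, "user", "CPUTime",
                     date(2012, 10, 10), date(2012, 10, 12))
```

Because the sums were computed at write time, producing a plot is only a lookup of known column names, with no scan or aggregation over raw records.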


Implementation

Badger01 hardware
  CPU: Intel(R) Xeon(R) CPU E5620 @2.40GHz
  CPU cores: 4
  Memory: 16GB

Comparison
  Left: the plot taken from the LHCb web portal
  Right: the plot generated from Cassandra on badger01


Implementation

Generating the same plot on badger01 using MySQL takes about 30s


Thanks