Using Cassandra in DIRAC Accounting System
ZhangGang, Fabio, Deng Ziyan
2013.01.24
Overview
• NoSQL
• Introduction to Cassandra
• Data Model Design
• Implementation
NoSQL
• Learning NoSQL
    • What NoSQL is
    • How it differs from an RDBMS
    • Using Redis to get familiar with some interesting features of NoSQL
    • Comparing the options and choosing one for DIRAC
• What do we need? High scalability, fast writes and reads, big data…
• Four candidates: Riak, CouchDB, Hadoop HBase and Cassandra
• First, we chose Hadoop HBase to explore
• Hadoop HBase
    • Modeled after Google's BigTable
    • HBase must be installed on top of HDFS
    • Deployment and maintenance are much more complicated than for Cassandra
• We then turned to Cassandra
Introduction to Cassandra
• Some features of Cassandra
    • Schema flexibility
    • BigTable-like features: columns, column families
    • Key/value pairs: row/column pairs
    • Secondary indexes
    • Writes are much faster than reads
    • All nodes are similar: no single point of failure
    • Tunable trade-offs for distribution and replication
• All of this research is based on standalone (single-node) mode. Production use needs a cluster.
• RDBMS: uses the join operation, which increases normalization and reduces redundancy
• NoSQL (Cassandra): to get better performance and high scalability, it gets rid of the join operation, which means denormalizing the data and maintaining multiple copies of it (increasing the redundancy)
• Data model
    • Cassandra stores data in a multidimensional hash table: [keyspace][columnfamily][row][column]
• Some concepts in Cassandra: keyspace, column family, row, column
    • keyspace -> database (in RDBMS terms)
    • column family -> table (in RDBMS terms)
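The four-level lookup can be pictured as nested Python dictionaries. This is only an in-memory illustration of the addressing scheme, not of how Cassandra stores data; the keyspace, column family, row and column names below are made up for the example.

```python
# In-memory picture of [keyspace][columnfamily][row][column].
# All names and values here are illustrative assumptions.
store = {
    "accounting": {                      # keyspace (roughly a database)
        "jobs_by_user": {                # column family (roughly a table)
            "alice": {                   # row key
                "2012-10-10": 3600.0,    # column name -> column value
            },
        },
    },
}


def get_column(store, keyspace, column_family, row, column):
    """Follow the four keys in order and return the stored value."""
    return store[keyspace][column_family][row][column]


value = get_column(store, "accounting", "jobs_by_user", "alice", "2012-10-10")
```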
• Another structure: super column family
    • Can be thought of as a map of maps
• Model the query first
    • With Cassandra we model the queries and let the data be organized around them. Think of the most common query paths the application will use, and then create the column families needed to support them.
• Aggregate key pattern
    • This pattern fuses two scalar values together with a separator to create an aggregate key, like:
      column name:  CPUTime:2008-01-01
      column value: value
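A minimal sketch of the aggregate key pattern, assuming ":" as the separator (as in the example column name). The helper names `make_aggregate_key` and `split_aggregate_key` are hypothetical and not taken from the slides.

```python
SEPARATOR = ":"  # assumed from the example "CPUTime:2008-01-01"


def make_aggregate_key(field, day):
    """Fuse two scalar values into one column name."""
    if SEPARATOR in field:
        raise ValueError("field must not contain the separator")
    return f"{field}{SEPARATOR}{day}"


def split_aggregate_key(key):
    """Recover (field, day) by splitting on the first separator only."""
    field, day = key.split(SEPARATOR, 1)
    return field, day


key = make_aggregate_key("CPUTime", "2008-01-01")
```

Splitting on the first separator only keeps the second part intact even if it contains the separator itself.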
• A special column: the counter column
    • A counter is a special kind of column used to store a number that incrementally counts the occurrences of a particular event or process
    • Set the value type to “CounterColumnType” when creating the column family
    • The value is increased or decreased, not replaced
    • Example: for site “BES” on “2012-10-10” there are 10 jobs (so the day has 10 CPUTime values)
      column name:  CPUTime:2012-10-01
      column value: val=sum(10 CPUTime)
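The increment-only behaviour of a counter column can be imitated in memory as follows. `CounterRow` is a hypothetical stand-in for one row of a counter column family, and the ten CPUTime values are made-up numbers.

```python
from collections import defaultdict


class CounterRow:
    """In-memory stand-in for one row of a counter column family."""

    def __init__(self):
        self.columns = defaultdict(int)

    def add(self, column, value=1):
        # A counter is incremented (or decremented with a negative
        # value); the stored number is never simply overwritten.
        self.columns[column] += value
        return self.columns[column]


row = CounterRow()  # e.g. the row for site "BES"
cpu_times = [120, 300, 45, 60, 600, 30, 15, 90, 240, 500]  # 10 made-up jobs
for cpu_time in cpu_times:
    row.add("CPUTime:2012-10-01", cpu_time)
```

After the loop the single column holds the sum of all ten values, which is what the plot needs for that day.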
Data Model Design
• Model the query first
    • Four factors determine a plot: start time, end time, the plot to generate, and the groupby
    • The data that a plot needs is grouped by something
    • So the data is preprocessed
• Create a CF for CPUTime grouped by user
    • CF: standard column family
    • Row key: user
    • Column name: startTime
    • Column value: CPUTime
    • Problem: bad performance, and the column names are not unique
  user 1    startTime 1    …    startTime n
            value          …    value
  …         …              …    …
  user n    startTime 2    …    NULL
            value          …    NULL
• The first improvement
    • CF: counter column family
    • Row key: user
    • Column name: startTime/86400 (aggregate by day)
    • Column value: the sum of CPUTime within a day
    • Problem: one CF is needed for each plot
  user 1    2008-01-01    …             2013-01-24
            value         …             value
  …         …             …             …
  user n    2009-05-17    2009-05-20    NULL
            value         value         NULL
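The day aggregation (startTime/86400) could look roughly like this. The sketch assumes startTime is an epoch timestamp in seconds and that days are cut in UTC; neither detail is stated on the slide. The function names and the sample records are made up.

```python
from collections import defaultdict
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def day_bucket(start_time):
    """Map an epoch timestamp (seconds) to its UTC day as YYYY-MM-DD."""
    day_index = int(start_time) // SECONDS_PER_DAY  # startTime/86400
    day_start = day_index * SECONDS_PER_DAY
    return datetime.fromtimestamp(day_start, tz=timezone.utc).strftime("%Y-%m-%d")


def aggregate_by_user_and_day(records):
    """records: iterable of (user, start_time, cpu_time) -> {user: {day: sum}}."""
    rows = defaultdict(lambda: defaultdict(float))
    for user, start_time, cpu_time in records:
        rows[user][day_bucket(start_time)] += cpu_time
    return rows


records = [
    ("alice", 1199145600, 100.0),  # 2008-01-01 00:00:00 UTC
    ("alice", 1199181600, 50.0),   # 2008-01-01 10:00:00 UTC
    ("alice", 1199232000, 25.0),   # 2008-01-02 00:00:00 UTC
]
rows = aggregate_by_user_and_day(records)
```

Because every record of the same day lands in the same column, the column names are unique per row and the row stays short.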
• In Cassandra, there are two methods to solve the problem
    • Use a super column family
    • Use an aggregate key
• The second improvement
    • CF: counter column family
    • Row key: user
    • Column name: aggregate key, e.g. (CPUTime,2012-01-01), (DiskSpace,2012-01-01)
    • Column value: the sum of the value within a day
  user 1    CPUTime:2008-01-01    …                      DiskSpace:2013-01-24
            value                 …                      value
  …         …                     …                      …
  user n    CPUTime:2009-05-17    JobCount:2009-05-20    NULL
            value                 value                  NULL
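With aggregate keys, one counter column family per groupby can serve several plots at once. A small in-memory sketch, where `add_record` and the metric values are illustrative assumptions:

```python
from collections import defaultdict


def add_record(cf, row_key, day, metrics):
    """Increment one counter column per metric, named '<metric>:<day>'."""
    for metric, amount in metrics.items():
        cf[row_key][f"{metric}:{day}"] += amount


# One counter CF: {row key (user): {aggregate column name: running sum}}
groupby_user_cf = defaultdict(lambda: defaultdict(float))

add_record(groupby_user_cf, "alice", "2012-01-01", {"CPUTime": 100.0, "DiskSpace": 2.0})
add_record(groupby_user_cf, "alice", "2012-01-01", {"CPUTime": 50.0, "DiskSpace": 1.0})
```

The row for one user now carries the daily sums for both CPUTime and DiskSpace, so no extra column family is needed per plot type.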
• Create a CF to store the raw data
    • Store the raw data for future use
    • The columns are specified when the CF is created
    • Row key: timestamp (type DoubleType)
    • Disk space: 21GB (in MySQL: 4GB)
• It is a static column family
  Biao     JobClass    User     …    Site     …        CPUTime
  row 1    value       value    …    value    …        value
  row 2    value       value    …    value    …        value
  …        …           …        …    …        …        …
  row n    value       value    …    value    value    value
  …        …           …        …    …        …        …
Implementation
• Create a column family for each “groupby”
  cum_groupby_user              4.2MB
  cum_groupby_site              11MB
  cum_groupby_processingtype    428KB
  cum_groupby_country           3.9MB
  cum_groupby_grid              1.2MB
  cum_groupby_usergroup         828KB
• Communicate with Cassandra
    • pycassa: a Python client for Apache Cassandra
• Input data
    • When a new record comes in, insert the data into all the CFs at the same time
    • Performance: 4 CFs, 1100 records, about 18s
    • Input data into raw_data_cf (a standard CF): pycassa.ColumnFamily.insert(key,columns)
    • Input data into the groupby CFs (counter CFs): pycassa.ColumnFamily.add(key,column,value)
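The write path (one insert into the raw CF, one counter add per groupby CF) can be sketched without a running cluster. `FakeColumnFamily` is a hypothetical in-memory stand-in that mimics only the two calls named above; a real pycassa.ColumnFamily needs a Cassandra node to talk to. The record fields, the `store_record` helper and the sample values are assumptions made for the example.

```python
from collections import defaultdict


class FakeColumnFamily:
    """Hypothetical in-memory stand-in exposing only insert() and add()."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def insert(self, key, columns):
        # Standard CF: writing a column replaces its previous value.
        self.rows[key].update(columns)

    def add(self, key, column, value=1):
        # Counter CF: writing a column increments its previous value.
        self.rows[key][column] = self.rows[key].get(column, 0) + value


raw_data_cf = FakeColumnFamily()
groupby_cfs = {"User": FakeColumnFamily(), "Site": FakeColumnFamily()}


def store_record(timestamp, day, record):
    """Write one accounting record to the raw CF and to every groupby CF."""
    raw_data_cf.insert(timestamp, record)
    for field, cf in groupby_cfs.items():
        cf.add(record[field], f"CPUTime:{day}", record["CPUTime"])


store_record(1349827200.0, "2012-10-10", {"User": "alice", "Site": "BES", "CPUTime": 120.0})
store_record(1349830800.0, "2012-10-10", {"User": "bob", "Site": "BES", "CPUTime": 30.0})
```

Each incoming record costs one write per column family, which is the price of keeping a pre-aggregated copy for every groupby.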
[Figure: the data in one row]
• Retrieve data and generate a plot
    • start_time, end_time: determine the time span
    • generate: determines the columns within the time span
    • groupby: decides which CF should be chosen
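How the four parameters narrow down the data can be sketched over plain dictionaries. The sketch assumes column names of the form "<metric>:<YYYY-MM-DD>"; `plot_series` and the sample data are hypothetical.

```python
def plot_series(cfs, groupby, generate, start_day, end_day):
    """Return {row key: [(day, value), ...]} for one plot.

    cfs      : {groupby name: {row key: {"<metric>:<day>": value}}}
    groupby  : picks the column family
    generate : picks the metric, i.e. the column-name prefix
    start_day, end_day : inclusive YYYY-MM-DD bounds of the time span
    """
    first = f"{generate}:{start_day}"
    last = f"{generate}:{end_day}"
    series = {}
    for row_key, columns in cfs[groupby].items():
        # ISO dates sort lexicographically, so a range check on the
        # column name selects exactly the requested metric and span.
        names = sorted(name for name in columns if first <= name <= last)
        if names:
            series[row_key] = [(n.split(":", 1)[1], columns[n]) for n in names]
    return series


cfs = {
    "User": {
        "alice": {
            "CPUTime:2012-01-01": 10.0,
            "CPUTime:2012-01-02": 20.0,
            "CPUTime:2012-02-01": 99.0,
            "DiskSpace:2012-01-01": 5.0,
        },
        "bob": {"DiskSpace:2012-01-01": 7.0},
    },
}
result = plot_series(cfs, "User", "CPUTime", "2012-01-01", "2012-01-31")
```

The range check on sorted column names plays the role of a column slice: only the columns between the first and last name of the span are read.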
• Badger01 hardware
    • CPU: Intel(R) Xeon(R) CPU E5620 @2.40GHz
    • CPU cores: 4
    • Memory: 16GB
• Comparison
    • The left plot is taken from the LHCb web portal
    • The right plot is generated from Cassandra at badger01
• Generating the same plot at badger01 using MySQL takes about 30s
Thanks