Data Center Cooling Plan Document - Des Dessy


7/31/2019 Data Center Cooling Plan Document - Des Dessy

    Cooling Device Requirements

    Optimization Team

    8 March 2012


    Chapter 1

    Requirements for the cooling plant:

    1.1 General Requirements:

The Data Center is divided into four zones, so each of the four racks in the Data Center
will be associated with a particular zone, as shown below:

R1 -> Z1, R2 -> Z2, R3 -> Z3, R4 -> Z4

where Ri is a particular rack, and Zi is the zone associated with that rack.

A single CRAC will be used, but each zone can be cooled independently, using a
mechanical control system that includes a set of valves. The interaction of cooling between
different zones will not be taken into account¹.

The cost of the cooling depends on the CRAC usage level; it can be calculated as the
sum of the power needed to cool each rack.
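This cost computation can be sketched in a few lines of Java; the energy values per level are those of Table 1.1 (using the 1 = off to 4 numbering of chapter 2), while the class and method names, and the unit U, are illustrative assumptions:

```java
public class CoolingCost {
    // Power used by each cooling level, in units of U [W] (Table 1.1):
    // index 1 is the off level; levels 2, 3, 4 use 1U, 2U and 5U.
    static final double[] ENERGY_PER_LEVEL = {0.0, 0.0, 1.0, 2.0, 5.0};

    // levels[i] is the cooling level (1 = off .. 4) used in zone i;
    // the total CRAC cost is the sum of the per-rack power.
    public static double totalCost(int[] levels) {
        double sum = 0.0;
        for (int level : levels) {
            sum += ENERGY_PER_LEVEL[level];
        }
        return sum;
    }
}
```

For example, zones cooled at levels 3, 1 (off), 4 and 2 would cost 2U + 0 + 5U + 1U = 8U.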

Cool air will be pumped into the Data Center from the floor at a fixed temperature
Tin; hot air will be exhausted from the ceiling at a variable temperature Tout².

Each zone should have an independent emergency cooling device available, in order to
avoid dangerous black-outs. If the emergency device also fails, the affected zone will
be declared out of order, and so every job executing in that particular zone will be
suppressed.

¹It is important to discuss this problem with the thermal team.
²A good policy could be to exploit the lower CPUs more, as they receive cooler air, and so they can be
cooled more easily.


    1.2 Cooling device characterization:

Three cooling levels will be available for each zone, plus the off level, which will be
called level 0.

Transitions between different cooling levels, or turning the cooling on/off, will be possible
every Tc, where Tc is called the cooling epoch and is equal to five minutes.

Table 1.1: Example of cooling device characterization

Cooling Level | 5 min  | 10 min | 15 min | Energy used
      4       | -3 C   | -6 C   | -10 C  | 5U [W]
      3       | -1.8 C | -3 C   | -5 C   | 2U [W]
      2       | -1 C   | -1.6 C | -2.5 C | 1U [W]
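Table 1.1 can be encoded directly as a small lookup; the class and method names are illustrative assumptions:

```java
public class CoolingCharacterization {
    // Temperature drop [C] after 5, 10 and 15 minutes (columns)
    // for cooling levels 2, 3 and 4 (rows), as in Table 1.1.
    static final double[][] DROP = {
        { -1.0, -1.6, -2.5 },  // level 2
        { -1.8, -3.0, -5.0 },  // level 3
        { -3.0, -6.0, -10.0 }, // level 4
    };

    // minutes must be 5, 10 or 15; other levels do not cool.
    public static double tempDrop(int level, int minutes) {
        if (level < 2 || level > 4) return 0.0;
        return DROP[level - 2][minutes / 5 - 1];
    }
}
```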

The temperature will be a function of the current scheduling and of the current
cooling plan:

T = f( cooling plan, current scheduling )

The cooling level in each zone will be a function of the average temperature of the
CPUs in the rack, of the maximum temperature, and of the derivative of the
temperature in that particular area:

Cl = f( Tavg, Tmax, dT/dt )

    1.3 Cooling policy:

The Data Center should have an average temperature of 27 C when it is working.
The cooling system will be turned on in one of the zones when the average temperature
exceeds 30 C or the maximum temperature exceeds 60 C.
The derivative of the temperature will be used to determine the cooling level: the most
powerful cooling level will be used when the derivative is high.
The cooling system will be turned off when the temperature goes below 24 C.
If no information is retrieved from the thermal map, an additional safety mechanism
is implemented³.

³The way the safety mechanism works will be explained later in this report.


Figure 1.1: Example of what could happen with the adopted cooling policy. (Plot of
temperature [C] against time [min], showing the unsafe temperature, the desired
temperature, the start-cooling and stop-cooling temperatures, and the intervals where
max, med and min cooling power are used.)

    1.4 Inputs:

A [4 x 10] float matrix containing the temperature of each CPU will be received from
the thermal team. This matrix will be read once every Tc, in order to have enough
information to calculate the derivative of the temperature and the average temperature.
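A minimal sketch of how the per-rack statistics could be extracted from this matrix, assuming one matrix row per rack (the orientation and all names here are assumptions):

```java
public class ThermalStats {
    // Average CPU temperature of one rack (one matrix row per rack).
    public static float rackAvg(float[][] map, int rack) {
        float sum = 0;
        for (float t : map[rack]) sum += t;
        return sum / map[rack].length;
    }

    // Maximum CPU temperature of one rack.
    public static float rackMax(float[][] map, int rack) {
        float max = map[rack][0];
        for (float t : map[rack]) max = Math.max(max, t);
        return max;
    }

    // Derivative approximated as the change in average temperature
    // between two consecutive cooling epochs Tc.
    public static float rackDerivative(float prevAvg, float currAvg) {
        return currAvg - prevAvg;
    }
}
```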

    1.5 Outputs:

The cooling schedule will be sent to the database every Tc, in order to allow the Thermal
Team to update their thermal model, and to allow the Power Team to calculate the
amount of non-computing energy. The schedule will consist of an array of
four integers, in the interval [1, 4].

NOTE: more details on the input and output will be given in the parametric diagram.


    Chapter 2

    Formal models:

    2.1 Algebraic specifications:

The state variables used will be three:

The Cooling level { 1 (Off), 2, 3, 4 }

The average temperature { 0 C - 70 C }

The maximum temperature { 0 C - 70 C }

Writing dT/dt for the derivative of the temperature:

If Tmax > 60 C => Cooling level = 4 (2.1)

If Tmax < 60 C and Tavg > 30 C and dT/dt > 4 and Cooling level = 1 => Cooling level = 4 (2.2)

If Tmax < 60 C and Tavg > 30 C and 2 < dT/dt < 4 and Cooling level = 1 => Cooling level = 3 (2.3)

If Tmax < 60 C and Tavg > 30 C and dT/dt < 2 and Cooling level = 1 => Cooling level = 2 (2.4)

If Tmax < 60 C and Tavg < 24 C => Cooling level = 1 (2.5)

If Tmax < 60 C and dT/dt > 0 and Cooling level = 2 => Cooling level = 3 (2.6)

If Tmax < 60 C and dT/dt > 0 and Cooling level = 3 => Cooling level = 4 (2.7)
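Under the assumption that these rules map directly onto the calculateCooling(Tmax, Tavg, dT/dt, currentLevel) method exercised by the tests in chapter 4, a sketch could look as follows. Note that the test listing there numbers the levels 0 (off) to 3, i.e. one less than the 1 to 4 numbering of the equations above; this sketch follows the 0 to 3 numbering:

```java
public class Cooling {
    // Sketch of equations (2.1)-(2.7) with levels numbered 0 (off) to 3.
    public static int calculateCooling(float tMax, float tAvg,
                                       float dT, int current) {
        if (tMax > 60) return 3;              // (2.1): max cooling
        if (tAvg < 24) return 0;              // (2.5): turn cooling off
        if (current == 0 && tAvg > 30) {      // (2.2)-(2.4): from off
            if (dT > 4) return 3;
            if (dT > 2) return 2;
            return 1;
        }
        if (dT > 0 && current == 1) return 2; // (2.6): step up
        if (dT > 0 && current == 2) return 3; // (2.7): step up
        return current;                       // otherwise keep the level
    }
}
```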

    2.2 Parametric Diagram and Automata:


  • 7/31/2019 Data Center Cooling Plan Document - Des Dessy

    11/22

    Chapter 3

    Code implementation:

    3.1 Code:

The code needed for the cooling device implementation is listed and commented below:

ADD THE CODE HERE

    3.2 Code automata:

A code automaton is presented below; it basically explains how the code works. As can
be seen from the figure, the steps in the code are the following:

1. the variables are initialized;

2. the system enters a forever loop and waits;

3. the system checks whether a valid thermal map is retrievable;

4. if it is not, the safe plan is adopted, and the waiting state is entered again;

5. if it is, the data values are checked for consistency;

6. if the received data is not OK, the safe plan is adopted, and the waiting state is entered again;

7. otherwise the cooling plan is calculated, and finally the waiting state is entered.
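The steps above can be sketched as a single loop iteration; the method and message names are illustrative placeholders, not the actual code:

```java
public class CoolingLoop {
    // One iteration of the forever loop: returns the action taken,
    // so the automaton's control flow can be checked in isolation.
    public static String step(float[][] map, boolean mapRetrievable) {
        if (!mapRetrievable || map == null) {
            return "SAFE_PLAN";      // steps 3-4: no valid thermal map
        }
        if (!isConsistent(map)) {
            return "SAFE_PLAN";      // steps 5-6: inconsistent data
        }
        return "COOLING_PLAN";       // step 7: calculate the cooling plan
    }

    // Consistency check: every temperature must lie in the 0-70 C
    // range used by the state variables of chapter 2.
    static boolean isConsistent(float[][] map) {
        for (float[] row : map)
            for (float t : row)
                if (t < 0 || t > 70) return false;
        return true;
    }
}
```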


    Chapter 4

    Testing

    4.1 Testing plan:

The testing of the cooling device will be divided into two sections.

    4.1.1 Input-output data structure:

First of all, the format of the input-output data needs to be checked. This means that
if the incoming matrix does not contain [4 x 10] floats, the input is not valid.
The input validity is guaranteed by the integration group, which provided a method that
ensures a [4 x 10] float matrix is received every time.
On the other hand, if the output is not an array of four integers contained in the
specified interval, the cooling plan must be recalculated.
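The two format checks can be sketched as follows (class and method names are assumptions):

```java
public class FormatChecks {
    // The input must be a [4 x 10] float matrix.
    public static boolean validInput(float[][] map) {
        if (map == null || map.length != 4) return false;
        for (float[] row : map)
            if (row == null || row.length != 10) return false;
        return true;
    }

    // The output must be an array of four integers in [1, 4];
    // if this check fails, the cooling plan must be recalculated.
    public static boolean validOutput(int[] plan) {
        if (plan == null || plan.length != 4) return false;
        for (int level : plan)
            if (level < 1 || level > 4) return false;
        return true;
    }
}
```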

    4.1.2 Data consistency:

We will test the behaviour of the cooling device by giving as input a matrix containing
invalid values. The values we will use are going to be:

-20: in this case the software will send to the database an ERROR TYPE 1: Temperature
out of valid range message.

0: in this case the software should accept the incoming input.

30: in this case the software should accept the incoming input.

70: in this case the software should accept the incoming input.

100: in this case the software will send to the database an ERROR TYPE 1: Temperature
out of valid range message.
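The range check behind these test values could look like this, assuming the 0-70 C range of the state variables of chapter 2 (the class and method names are assumptions; the messages are those listed in section 4.2.1):

```java
public class RangeCheck {
    // Returns the message the software would send to the database
    // for a single temperature reading.
    public static String check(float t) {
        if (t < 0 || t > 70) {
            return "ERROR TYPE 1: Temperature out of valid range";
        }
        return "OK 1: Valid Input";
    }
}
```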


    4.1.3 Automata path coverage:

    A JUnit test suite will then be used in order to test some of the methods in the code. We will

    focus our effords in particular on one of them, the calculateCooling method, as it is the one

    that effectively implements the automata of section 2.2, and so the one that is at the core

    of the cooling device. In order to have an exaustive test, we will try each possible transition

    between the different states, in order to ensure a proper path coverage.

    4.2 Test results:

    In order to test the code we used JUnit inside the Eclipse environment.

    4.2.1 Data Consistency:

As stated before, the data consistency test is already done inside the code itself.
The software reads the input from the database. The database passes a matrix to the
cooling code: one element of this matrix is read at a time. While processing the input,
the cooling code can send back different messages. Among these messages, two are error
messages, while the other one is a validation signal:

1. ERROR TYPE 1: Temperature out of valid range: this is sent when one of
the temperatures in the file is out of the acceptable range. In this case the system uses
the safe policy.

2. ERROR TYPE 2: Map not updated: if the thermal map is not ready the system
sends this message, in order to ask for the input data again. In this case the system uses
the safe policy.

3. OK 1: Valid Input: if, while processing the input data, none of the errors listed
above occurs, then the input is correct. The system can now proceed to calculate the
cooling plan.


Results:

Input:

Rack 0   Rack 1   Rack 2   Rack 3
  32       27       61       1
  32       27       29       1
  32       27       29       1
  32       27       29       1
  32       27       29       1
  32       27       29       1
  32       27       29       1
  32       27       29       1
  32       27       29       1
  32       27       29       1

System output:

OK 1: valid input ...

On rack number 0 cooling level 3 have been used.
The average temperature here is of: 32.0 C.
The maximum temperature on this rack is 32.0 C

On rack number 1 cooling level 0 have been used.
The average temperature here is of: 27.0 C.
The maximum temperature on this rack is 27.0 C

On rack number 2 cooling level 3 have been used.
The average temperature here is of: 32.2 C.
The maximum temperature on this rack is 61.0 C

On rack number 3 cooling level 0 have been used.
The average temperature here is of: 1.0 C.
The maximum temperature on this rack is 1.0 C


Input:

Rack 0   Rack 1   Rack 2   Rack 3
  32       27       61       1
  32       27       29       1
  32       27       29       1
 100       27       29       1
  32       27       29       1
  32       27       29       1
  32       27       29       1
  32       27       29       1
  32       27       29       1
  32       27       29       1

System output:

ERROR TYPE 1: Temperature out of the valid range
Safe plan has been adopted due to invalid input

4.2.2 Path Coverage and Methods Testing:

Now that we know what happens when valid and invalid inputs are passed, it is possible
to check what happens to the system if proper inputs are used.
All the code used for this part of the testing is reported below:

import junit.framework.*;
import org.junit.After;
import org.junit.Before;

public class cooling1Test extends TestCase {

    @Before
    public void setUp() {
    }

    @After
    public void tearDown() {
    }

    public void testMax() {
        float[] a = new float[10];
        for (int i = 0; i < a.length; i++) {
            a[i] = i;
        }
        float max = 9;
        float Max = cooling1.calculateMax(a);
        assertEquals(max, Max);
    }

    public void testAVG() {
        float[] a = new float[10];
        for (int i = 0; i < a.length; i++) {
            a[i] = 10;
        }
        float avg = 10;
        float AVG = cooling1.calculateAVG(a);
        assertEquals(avg, AVG);
    }

    public void testCalculateCool() {
        float[] a = new float[10];
        for (int i = 0; i < a.length; i++) {
            a[i] = 31;
        }
        int OtoA = cooling1.calculateCooling(61, 24, 0, 0);
        assertEquals(3, OtoA);
        int AtoO = cooling1.calculateCooling(30, 23, -3, 3);
        assertEquals(0, AtoO);
        int BtoA1 = cooling1.calculateCooling(61, 23, -3, 2);
        assertEquals(3, BtoA1);
        int BtoA2 = cooling1.calculateCooling(40, 28, 1, 2);
        assertEquals(3, BtoA2);
        int OtoB = cooling1.calculateCooling(40, 31, 3, 0);
        assertEquals(2, OtoB);
        int BtoO = cooling1.calculateCooling(40, 23, -2, 2);
        assertEquals(0, BtoO);
        int OtoC = cooling1.calculateCooling(40, 31, 1, 0);
        assertEquals(1, OtoC);
        int CtoO = cooling1.calculateCooling(40, 23, -4, 1);
        assertEquals(0, CtoO);
        int CtoA1 = cooling1.calculateCooling(61, 25, -3, 1);
        assertEquals(3, CtoA1);
        int CtoA2 = cooling1.calculateCooling(40, 29, 2, 1);
        assertEquals(2, CtoA2);
    }
}

As it is easy to see, we focused particularly on the method calculateCooling, as it is the
core of the cooling software. In order to test it in the proper way, we analyzed the State Chart


Diagram of section 2.2, and created a test case for each possible state transition.
At the end of the test, JUnit declared that all the methods were tested successfully, and
so the code is supposed to correctly implement the cooling system model we've created.

    4.3 Errors vs Time:

(Plot of bugs found and executed tests on the cooling code against the days passed after
the first software release.)

Comments: As can be seen, errors and tests are concentrated around the days of the
software releases. In particular:

After the first release a few bugs were found, and it was fairly easy to fix them all,
but the code was still running as a standalone piece of software.

After the second release lots of bugs were found, both in the database and in the
software itself. At the beginning most of them were sorted out, but this required a lot
of testing effort.

With the third release new bugs were introduced, in addition to the unsolved ones.
This time, however, it was quite easy to fix them all.

The last release of the software gave very few problems, as just one bug was found, and
it was extremely easy to fix. This is probably due to the fact that this last release
is really simplified and the code was completely renewed. It is important to note that
the fourth release was introduced because the third was becoming too complicated
and not very manageable, so strange results were sometimes given.


    Chapter 5

    Safety mechanism:

A safety mechanism has been added to the code, in order to cool the datacenter even in
the case that a valid thermal map is not found.
The safety mechanism starts to work if a correct thermal map is not found within a minute.
In this case all the zones are cooled using level 4, which is of course the highest one.
The usage of this policy should ensure that the datacenter will not get too hot even if the
thermal map fails.
This safety strategy is really simple, but it seems to be the most effective. It is important to
note that it should be used only in extreme cases, as it is really expensive in terms of power
usage.
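A minimal sketch of this mechanism, assuming a timestamp of the last valid thermal map is available (class, method and parameter names are illustrative):

```java
public class SafetyMechanism {
    static final long TIMEOUT_MS = 60_000; // one minute

    // If no valid thermal map has arrived within one minute,
    // force every zone to level 4 (the safe plan); otherwise
    // keep the normally calculated plan.
    public static int[] plan(long nowMs, long lastValidMapMs,
                             int[] normalPlan) {
        if (nowMs - lastValidMapMs > TIMEOUT_MS) {
            return new int[]{4, 4, 4, 4};
        }
        return normalPlan;
    }
}
```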


    Chapter 6

    Software releases:

    6.1 Release 1:

Date: 25th January 2012

The first release of the software included all the basic functions needed to test whether
the module was working properly:

- Basic functionality to read/write input/output from/to .txt files;

- A simple trigger system, used to temporize the software;

- A procedure, used to determine whether the input is correct or not;

- A signaling system, used to notify the result;

- Additional methods used to calculate the cooling levels in all the racks of the datacenter.

The JUnit-related file included:

- A test case for each one of the methods used in the code.

- Some additional test cases for the applyCooling method.

    6.2 Release 2:

Date: 20th February 2012

The basic improvements of this second release, with respect to the first one, were:

- Input/output from/to the database, i.e. database interaction;

- A more sophisticated triggering system;

- Simplified and more readable code ( we passed from 306 lines to fewer than 250 );


The JUnit-related file included:

- A test case for each one of the methods used in the code.

- A test case for each possible transition of the applyCooling method; in order to make
that possible, we used the cooling device automaton as an oracle. We tested all the
possible transitions and even what happens if the cooling level remains constant.

    6.3 Release 3:

Date: 8th March 2012

The basic improvements of this third release, with respect to the second one, were:

- A safety mechanism added to the main class;

- A new secondary class, used to store past information about the cooling levels and the
past average temperatures in each rack.

The JUnit-related file included:

- A test case for each one of the methods used in the code.

- A test case for each possible transition of the applyCooling method.

- A test case for the safety mechanism.

    6.4 Release 4:

Date: 10th April 2012

The basic improvements of release 4 were:

- An extreme simplification of the code, based on the use of some new methods that
made the code simpler and more linear.

- A new synchronization system, using the method System.currentTimeMillis().

- A better use of the safety mechanism.

The JUnit-related file included:

- A test case for some of the methods used in the code.

- A test case for each possible transition of the applyCooling method.

- A test case for the safety mechanism.