DBMS Performance Monitoring

1

AOBD 08/09 H. Galhardas

DBMS Performance Monitoring

Performance Monitoring Goals

  Monitoring should check that the performance-influencing database parameters are correctly set and if they are not, it should point to where the problems are

  Alternatively, you can sit tight and wait for the inevitable angry users’ complaints

OR

Monitoring (preventive

method)

Correct problems when detected

(reactive method)


2

How does one monitor a DBMS?

  By extracting relevant performance indicators, (counters, gauges and details of internal DBMS’s activities)

  By comparing the obtained values of these indicators against ideal values

  But… there are many indicators!


A Consumer-Producer Chain of a DBMS’s Resources

High Level Consumers

Intermediate Resources/ Consumers

Primary Resources

PARSER OPTIMIZER

EXECUTION SUBSYSTEM

DISK SYBSYSTEM

CACHE MANAGER

LOGGING SUBSYSTEM

LOCKING SUBSYSTEM

NETWORK DISK/ CONTROLLER CPU MEMORY

sql commands

probing spots for

indicators


3

Recurrent Patterns of Problems

  An overloading high-level consumer   Ex: query accessing too

many rows   A poorly parameterized

subsystem   Ex: disk subsystem that

stores tables, indexes, logging in a single disk

  An overloaded primary resource   Ex: CPU busy because

non-DB processes are using it at the same time

Effects are not always felt first where the cause is!


A Systematic Approach to Monitoring

  Question 1: Are critical queries being served in the most efficient manner?

  Question 2: Are DBMS subsystems making optimal use of resources?

  Question 3: Are there enough primary resources available and are they configured adequately, given the current and expected workload?

Extract indicators to answer the following questions


4

Investigating High Level Consumers: Critical query monitoring

Find critical queries

Found any? Investigate lower levels

Answer Q1 over them

Overcon- sumption?

Tune problematic queries

yes

yes no

no


Investigating Intermediate and Primary Resources: routine monitoring

Answer Q3

Problems at OS level?

Answer Q2 Tune low-level resources

yes no

Problematic subsystems?

Tune subsystems

Investigate upper level

yes no


5

Tools used to gather information

  Event monitors   Query plan explainers   Performance monitors


Investigating High Level Consumers

  Answer question 1: “Are critical queries being served in the most efficient manner?”

1.  Identify the critical queries 2.  Analyze their access plans 3.  Profile their execution


6

Identifying Critical Queries

Critical queries are usually those that:   Take a long time

  Consume a great amount of resources (read a lot of data, perform heavy aggregation or sorting, lock an entire table, etc)

  Are frequently executed   Have to be answered in a short time

Often, a user complaint will tip us off.


Event Monitors to Identify Critical Queries

  If no user complains...   Record DBMS’s performance measurements, but only

when a system event happens   Ex: new user connection, or beginning/end of the

execution of an SQL statement   Logs the event time, duration and associated

performance indicator values   Less overhead than other type of tools because

indicators are usually by-product of operations monitored and are accessed in a convenient time   Particularly appropriate for capturing extraordinary

conditions, ex: deadlocks.   Typical measures include CPU used, IO used, locks

obtained etc.


7

What should be measured and evaluated

  Interested in the end-of-statement or end-of-transaction event   Carry relevant info that indicate the resources the

executed query consumed: SQL of the statement, duration of execution, CPU, IO, cache,lock consumption statistics

  Sort the collected data over a given resource consumption statistic and we can find the most expensive queries by that criterion


An example Event Monitor

•  Several CPU indicators sorted by Oracle’s Trace Data Viewer

•  Similar tools: DB2’s Event Monitor and MSSQL’s Server Profiler


8

What to do in case of problems?

  Make sure expensive queries are tuned in the best possible way

  May have to drill down through the consumption chain to search for the real cause of the problem






9

Diagnose Expensive Queries: analyzing access plans

  SQL commands are translated into an internal executable format before they are processed

  After parsing, the optimizer enumerates and estimates the cost of a number of possible ways to execute that query

  The best choice, according to the existing statistics, is chosen   But this choice may not be the best one!

sql parse rewrite enumerate/ cost plans

generate chosen plan plan

PARSER/OPTIMIZER


Query Plan Explainers to Analyze Plans

  Explainers usually depict an access plan as a (graphical) single-rooted tree in which sources at the leaf end are tables or indexes, and internal nodes are operators

  These form an assembly line (actually tree) of tuples!

  Most explainers will let you see the estimated costs for CPU consumption, I/O, and cardinality of each operator. If you see something strange, change an index.


10

An example Plan Explainer

•  Access plan according to MSSQL’s Query Analyzer

•  Similar tools: DB2’s Visual Explain and Oracle’s SQL Analyze Tool


Finding Strangeness in Access Plans What to pay attention to on a plan

  Access paths for each table   For every table involved in the query, determine how

it is accessed: scan, index?   Sorts or intermediary results

  Check if the plan includes a sorting (because of DISTINCT, ORDER BY, GROUP BY) and see if it can be eliminated

  Materialization of temporary results should be avoided

  Order of operations   May want to force an ordering if we otice the order

does not favor the early reduction in the number of tuples

  Algorithms used in the operators


11

To Index or not to index?

select c_name, n_name from CUSTOMER join NATION on c_nationkey=n_nationkey where c_acctbal > 0

Which plan performs best? (nation_pk is an non-clustered index over n_nationkey, and similarly for acctbal_ix over c_acctbal)


Non-clustering indexes can be trouble

Each access to the index generates a random access to the table – possibly duplicate! If !36,308 out of 150,000 customers have positive balance, it ends up that the number of pages read from the table is greater than its size, i.e., a table scan is way better

Table Scan Index Scan 5 sec

143,075 pages 6,777 pages

136,319 pages 7 pages

76 sec 272,618 pages 131,425 pages 273,173 pages

552 pages

CPU time data logical reads

data physical reads index logical reads

index physical reads


12





Profiling a Query’s Execution

  A query was found critical but its plan looks okay. What’s going on? Put it to run.

  Profiling it would determine the quantity of the resources used by a query and assessing how efficient this use was   Profile of a query execution consists of detailed

information about the duration and resource consumption

  Resources   DBMS subsystems: cache, disk, lock, log   OS raw resources: CPU


13

Query profiling indicators

  Duration information involves 3 indicators:   Elapsed time for the query: time it took to process it as

perceived by the user   CPU time: time the CPU was actually used to process the

query   Wait time: time the query was not processing and waiting

for a resource to become available   Resource consumption information includes:

  (I/O) Physical and logical reads/writes   (Locking) Maximum nb locks held, nb lock escalations,nb

deadlocks/timeouts, total time spent waiting for locks   (SQL activity) Nb sorts and temporary area usage: gives an

idea of the time spent in expensive overhead activity


Two common scenarios 1.  Elapsed query time close CPU time, wait time negligible.

Consumption fair: reasonable nb pages being accesses, most of them are logical accesses; nb locks low or inexistent and no deadlocks or lock escalations; if there were sorts, they didn’t augment the nb of logical/physical reads/writes

  Access plan is executed in the best way possible; if the rest of the chain is balanced, this is the best performance the system can deliver for this query

2.  Noticeable discrepancy between elapsed and CPU time, wait time fills the gap. So there must be a problem in resource consumption: contention pb (probably waiting for locks) or poorly performing resource (ex. If area allocated for sorting is small, then additional I/O).

  To distinguish between contention and physical resource consumption problem, run the query in isolation


14

Performance Monitors to Profiling Queries

  Access or compute performance indicators’ values at any time

  Many, many flavors   Scope of the indicators: Generic (all indicators from OS

and DBMS) or specific (correlated indicators of a given subsystem or a given query)   Choose the generic one when interested in how several

distinct indicators behave together   Frequency of data gathering:

  snapshot perf indicators: allow to freeze a situation at a specific moment in time

  Continuous perf. Indicator: snaphotting at regular intervals   Alarm modes: allows us to enter thresholds beyond which

the system is acting abnormally   Presentation of data: textual or graphical

  Textual good for seeing the precise values of indicators   Graphical chosen when wants to see the trends AOBD 08/09 H. Galhardas

An example Performance Monitor (query level)

•  Details of buffer and CPU consumption on a query’s report according to DB2’s Benchmark tool

•  Similar tools: MSSQL’s SET STATISTICS switch and Oracle’s SQL Analyze Tool

Statement number: 1 select C_NAME, N_NAME from DBA.CUSTOMER join DBA.NATION on C_NATIONKEY = N_NATIONKEY where C_ACCTBAL > 0

Number of rows retrieved is: 136308 Number of rows sent to output is: 0 Elapsed Time is: 76.349 seconds

… Buffer pool data logical reads = 272618 Buffer pool data physical reads = 131425 Buffer pool data writes = 0 Buffer pool index logical reads = 273173 Buffer pool index physical reads = 552 Buffer pool index writes = 0 Total buffer pool read time (ms) = 71352 Total buffer pool write time (ms) = 0 … Summary of Results ================== Elapsed Agent CPU Rows Rows Statement # Time (s) Time (s) Fetched Printed 1 76.349 6.670 136308 0


15

An example Performance Monitor (system level)

•  An IO indicator’s consumption evolution (qualitative and quantitative) according to DB2’s System Monitor

•  Similar tools: Window’s Performance Monitor and Oracle’s Performance Manager



Answer Q3



yes no


Tune subsystems


yes no


16

Investigating Primary Resources

•  Answer question 3: “Are there enough primary resources available for a DBMS to consume?”

•  Primary resources: CPU, disk/controllers, memory, and network

•  Analyze specific OS-level indicators to discover bottlenecks.

•  A system-level performance monitor is the right tool here


CPU Consumption Indicators at the OS Level

100%

CPU % of

utilization

70%

time

Sustained utilization over 70% should trigger the alert.

System utilization shouldn’t be more

than 40%. DBMS (in a non-

dedicated machine) should be getting a decent time share.

total usage

system usage


17

Disk Performance Indicators at the OS Level

Wait queue

Average Queue Size

New requests Disk Transfers

/second

Should be close to zero

Wait times should also be close to

zero Idle disk with pending requests? Check controller

contention. Also, transfers

should be balanced among disks/controllers


What to do in case of problem?

  If there are long waiting queues and long service times, check the number of bytes transferred to confirm the disk is busy.   Idle disks with pending requests imply controller problems

  Also compare the wait times from each disk, if we have a group of disks serving the same purpose

  Only reorganize data if we are sure the I/o problems are not dur to high-level consumer inefficiency


18

Memory Consumption Indicators at the OS Level

pagefile

real memory

virtual memory

Page faults/time should be close

to zero. If paging happens, at least

not DB cache pages.

% of pagefile in use (it’s used a fixed file/partition) will tell

you how much memory is “lacking”.


What to do in case of problems?

  If the DB buffer size was properly configured, paging should come from non-DBMS sources.

  Checking memory at the OS level should be done whenever we change the memory allocation parameters of the DBMS or there are significant changes in the nb of users


19


Answer Q3



yes no


Tune subsystems


yes no


Investigating Intermediate Resources/Consumers

•  Answer question 2: “Are subsystems making optimal use of resources?”

•  Main subsystems: Cache Manager, Disk subsystem, Lock subsystem, and Log/Recovery subsystem

•  Extract and analyze relevant perf. indicators •  A Performance Monitor is usually useful, but

sometimes specific tools apply


20

Cache Manager Performance Indicators

Table scan

readpage()

Free Page slots

Page reads/ writes

Pick victim

strategy Data Pages

Cache Manager

If page is not in the cache, readpage

(logical) generate an actual IO (physical).

Ratio of readpages that did not generate

physical IO (cache hit ration) should be 90%

or more

Pages are regularly saved to disk to make

free space. # of free slots should

always be > 0



  If indicators have bad values, the buffer manager is not saving as many disk accesses as it should   One or a small nb of queries may be responsible   May have slow disk subsystem   May have insufficient memory

  Tuning parameters accordingly


21

Disk Manager Performance Indicators

rows

page

extent

file

Storage Hierarchy

(simplified)

disk

Row displacement: should kept under 5% of rows

Free space fragmentation: pages with few space should

not be in the free list

Data fragmentation: ideally files that store DB objects

(table, index) should be in one or few (<5) contiguous extents

File position: should balance workload evenly among all

disks AOBD 08/09 H. Galhardas


  Bad indicator values usually mean inadequately chosen storage parameters   To choose good storage parameters, should be aware of rate

at which data grows and with what mix of inserts/updates/deletes that growth rate happens

  Also, identifiy occasional but predicatble fluctuations in the DB activity

  Then tune disk subsystem


22

Lock Manager Performance Indicators

Lock request

Object Lock Type TXN ID

Lock List

Locks pending list

Deadlocks and timeouts should be virtually

inexistent or extremely low (no more then 1% of the

transactions)

Lock wait time for a transaction should be a

small fraction of the whole transaction time.

Number of locks on wait should be a small fraction of the number of locks on the lock list, under 20%



  The lock subsystem is often a reflection of how well the transactions were designed and seldom a source of a problem itself

  Can move up the consumption chain and find expensive queries, including data locking ones.


23

Logging Subsystem Performance Indicatores   Log is used by every transaction that alters, inserts

or deletes data   Well tuned logging subsyst. Can write log records at

a pace that doesn’t slow active transactions   Indicatores to measure:

  Number of log waits to confirm that the disks storing the log files are keeping pace with the transaction workload – should be zero

  Nb of log expansions or log archives due to lack of space – if bigger than zero indicates a configuration problem

  Log cache-hit ratio to check whether the size of the log buffer is adequate


Conclusion

  Monitoring a DBMS’s performance can be done in a systematic way   The consumption chain helps distinguishing

problems’ causes from their symptoms   Existing tools help extracting relevant

performance indicators   The three questions guide the whole monitoring

process


DBMS Performance Monitoring

Documents

Transcript of DBMS Performance Monitoring