StableBuffer : Optimizing Write Performance for DBMS Applications on Flash Devices
DBMS Performance Monitoring
Transcript of DBMS Performance Monitoring
1
AOBD 08/09 H. Galhardas
DBMS Performance Monitoring
Performance Monitoring Goals
Monitoring should check that the performance-influencing database parameters are correctly set and if they are not, it should point to where the problems are
Alternatively, you can sit tight and wait for the inevitable angry users’ complaints
OR
Monitoring (preventive
method)
Correct problems when detected
(reactive method)
AOBD 08/09 H. Galhardas
2
How does one monitor a DBMS?
By extracting relevant performance indicators, (counters, gauges and details of internal DBMS’s activities)
By comparing the obtained values of these indicators against ideal values
But… there are many indicators!
AOBD 08/09 H. Galhardas
A Consumer-Producer Chain of a DBMS’s Resources
High Level Consumers
Intermediate Resources/ Consumers
Primary Resources
PARSER OPTIMIZER
EXECUTION SUBSYSTEM
DISK SYBSYSTEM
CACHE MANAGER
LOGGING SUBSYSTEM
LOCKING SUBSYSTEM
NETWORK DISK/ CONTROLLER CPU MEMORY
sql commands
probing spots for
indicators
AOBD 08/09 H. Galhardas
3
Recurrent Patterns of Problems
An overloading high-level consumer Ex: query accessing too
many rows A poorly parameterized
subsystem Ex: disk subsystem that
stores tables, indexes, logging in a single disk
An overloaded primary resource Ex: CPU busy because
non-DB processes are using it at the same time
Effects are not always felt first where the cause is!
AOBD 08/09 H. Galhardas
A Systematic Approach to Monitoring
Question 1: Are critical queries being served in the most efficient manner?
Question 2: Are DBMS subsystems making optimal use of resources?
Question 3: Are there enough primary resources available and are they configured adequately, given the current and expected workload?
Extract indicators to answer the following questions
AOBD 08/09 H. Galhardas
4
Investigating High Level Consumers: Critical query monitoring
Find critical queries
Found any? Investigate lower levels
Answer Q1 over them
Overcon- sumption?
Tune problematic queries
yes
yes no
no
AOBD 08/09 H. Galhardas
Investigating Intermediate and Primary Resources: routine monitoring
Answer Q3
Problems at OS level?
Answer Q2 Tune low-level resources
yes no
Problematic subsystems?
Tune subsystems
Investigate upper level
yes no
AOBD 08/09 H. Galhardas
5
Tools used to gather information
Event monitors Query plan explainers Performance monitors
AOBD 08/09 H. Galhardas
Investigating High Level Consumers
Answer question 1: “Are critical queries being served in the most efficient manner?”
1. Identify the critical queries 2. Analyze their access plans 3. Profile their execution
AOBD 08/09 H. Galhardas
6
Identifying Critical Queries
Critical queries are usually those that: Take a long time
Consume a great amount of resources (read a lot of data, perform heavy aggregation or sorting, lock an entire table, etc)
Are frequently executed Have to be answered in a short time
Often, a user complaint will tip us off.
AOBD 08/09 H. Galhardas
Event Monitors to Identify Critical Queries
If no user complains... Record DBMS’s performance measurements, but only
when a system event happens Ex: new user connection, or beginning/end of the
execution of an SQL statement Logs the event time, duration and associated
performance indicator values Less overhead than other type of tools because
indicators are usually by-product of operations monitored and are accessed in a convenient time Particularly appropriate for capturing extraordinary
conditions, ex: deadlocks. Typical measures include CPU used, IO used, locks
obtained etc.
AOBD 08/09 H. Galhardas
7
What should be measured and evaluated
Interested in the end-of-statement or end-of-transaction event Carry relevant info that indicate the resources the
executed query consumed: SQL of the statement, duration of execution, CPU, IO, cache,lock consumption statistics
Sort the collected data over a given resource consumption statistic and we can find the most expensive queries by that criterion
AOBD 08/09 H. Galhardas
An example Event Monitor
• Several CPU indicators sorted by Oracle’s Trace Data Viewer
• Similar tools: DB2’s Event Monitor and MSSQL’s Server Profiler
AOBD 08/09 H. Galhardas
8
What to do in case of problems?
Make sure expensive queries are tuned in the best possible way
May have to drill down through the consumption chain to search for the real cause of the problem
AOBD 08/09 H. Galhardas
Investigating High Level Consumers
Answer question 1: “Are critical queries being served in the most efficient manner?”
1. Identify the critical queries 2. Analyze their access plans 3. Profile their execution
AOBD 08/09 H. Galhardas
9
Diagnose Expensive Queries: analyzing access plans
SQL commands are translated into an internal executable format before they are processed
After parsing, the optimizer enumerates and estimates the cost of a number of possible ways to execute that query
The best choice, according to the existing statistics, is chosen But this choice may not be the best one!
sql parse rewrite enumerate/ cost plans
generate chosen plan plan
PARSER/OPTIMIZER
AOBD 08/09 H. Galhardas
Query Plan Explainers to Analyze Plans
Explainers usually depict an access plan as a (graphical) single-rooted tree in which sources at the leaf end are tables or indexes, and internal nodes are operators
These form an assembly line (actually tree) of tuples!
Most explainers will let you see the estimated costs for CPU consumption, I/O, and cardinality of each operator. If you see something strange, change an index.
AOBD 08/09 H. Galhardas
10
An example Plan Explainer
• Access plan according to MSSQL’s Query Analyzer
• Similar tools: DB2’s Visual Explain and Oracle’s SQL Analyze Tool
AOBD 08/09 H. Galhardas
Finding Strangeness in Access Plans What to pay attention to on a plan
Access paths for each table For every table involved in the query, determine how
it is accessed: scan, index? Sorts or intermediary results
Check if the plan includes a sorting (because of DISTINCT, ORDER BY, GROUP BY) and see if it can be eliminated
Materialization of temporary results should be avoided
Order of operations May want to force an ordering if we otice the order
does not favor the early reduction in the number of tuples
Algorithms used in the operators
AOBD 08/09 H. Galhardas
11
To Index or not to index?
select c_name, n_name from CUSTOMER join NATION on c_nationkey=n_nationkey where c_acctbal > 0
Which plan performs best? (nation_pk is an non-clustered index over n_nationkey, and similarly for acctbal_ix over c_acctbal)
AOBD 08/09 H. Galhardas
Non-clustering indexes can be trouble
Each access to the index generates a random access to the table – possibly duplicate! If !36,308 out of 150,000 customers have positive balance, it ends up that the number of pages read from the table is greater than its size, i.e., a table scan is way better
Table Scan Index Scan 5 sec
143,075 pages 6,777 pages
136,319 pages 7 pages
76 sec 272,618 pages 131,425 pages 273,173 pages
552 pages
CPU time data logical reads
data physical reads index logical reads
index physical reads
AOBD 08/09 H. Galhardas
12
Investigating High Level Consumers
Answer question 1: “Are critical queries being served in the most efficient manner?”
1. Identify the critical queries 2. Analyze their access plans 3. Profile their execution
AOBD 08/09 H. Galhardas
Profiling a Query’s Execution
A query was found critical but its plan looks okay. What’s going on? Put it to run.
Profiling it would determine the quantity of the resources used by a query and assessing how efficient this use was Profile of a query execution consists of detailed
information about the duration and resource consumption
Resources DBMS subsystems: cache, disk, lock, log OS raw resources: CPU
AOBD 08/09 H. Galhardas
13
Query profiling indicators
Duration information involves 3 indicators: Elapsed time for the query: time it took to process it as
perceived by the user CPU time: time the CPU was actually used to process the
query Wait time: time the query was not processing and waiting
for a resource to become available Resource consumption information includes:
(I/O) Physical and logical reads/writes (Locking) Maximum nb locks held, nb lock escalations,nb
deadlocks/timeouts, total time spent waiting for locks (SQL activity) Nb sorts and temporary area usage: gives an
idea of the time spent in expensive overhead activity
AOBD 08/09 H. Galhardas
Two common scenarios 1. Elapsed query time close CPU time, wait time negligible.
Consumption fair: reasonable nb pages being accesses, most of them are logical accesses; nb locks low or inexistent and no deadlocks or lock escalations; if there were sorts, they didn’t augment the nb of logical/physical reads/writes
Access plan is executed in the best way possible; if the rest of the chain is balanced, this is the best performance the system can deliver for this query
2. Noticeable discrepancy between elapsed and CPU time, wait time fills the gap. So there must be a problem in resource consumption: contention pb (probably waiting for locks) or poorly performing resource (ex. If area allocated for sorting is small, then additional I/O).
To distinguish between contention and physical resource consumption problem, run the query in isolation
AOBD 08/09 H. Galhardas
14
Performance Monitors to Profiling Queries
Access or compute performance indicators’ values at any time
Many, many flavors Scope of the indicators: Generic (all indicators from OS
and DBMS) or specific (correlated indicators of a given subsystem or a given query) Choose the generic one when interested in how several
distinct indicators behave together Frequency of data gathering:
snapshot perf indicators: allow to freeze a situation at a specific moment in time
Continuous perf. Indicator: snaphotting at regular intervals Alarm modes: allows us to enter thresholds beyond which
the system is acting abnormally Presentation of data: textual or graphical
Textual good for seeing the precise values of indicators Graphical chosen when wants to see the trends AOBD 08/09 H. Galhardas
An example Performance Monitor (query level)
• Details of buffer and CPU consumption on a query’s report according to DB2’s Benchmark tool
• Similar tools: MSSQL’s SET STATISTICS switch and Oracle’s SQL Analyze Tool
Statement number: 1 select C_NAME, N_NAME from DBA.CUSTOMER join DBA.NATION on C_NATIONKEY = N_NATIONKEY where C_ACCTBAL > 0
Number of rows retrieved is: 136308 Number of rows sent to output is: 0 Elapsed Time is: 76.349 seconds
… Buffer pool data logical reads = 272618 Buffer pool data physical reads = 131425 Buffer pool data writes = 0 Buffer pool index logical reads = 273173 Buffer pool index physical reads = 552 Buffer pool index writes = 0 Total buffer pool read time (ms) = 71352 Total buffer pool write time (ms) = 0 … Summary of Results ================== Elapsed Agent CPU Rows Rows Statement # Time (s) Time (s) Fetched Printed 1 76.349 6.670 136308 0
AOBD 08/09 H. Galhardas
15
An example Performance Monitor (system level)
• An IO indicator’s consumption evolution (qualitative and quantitative) according to DB2’s System Monitor
• Similar tools: Window’s Performance Monitor and Oracle’s Performance Manager
AOBD 08/09 H. Galhardas
Investigating Intermediate and Primary Resources: routine monitoring
Answer Q3
Problems at OS level?
Answer Q2 Tune low-level resources
yes no
Problematic subsystems?
Tune subsystems
Investigate upper level
yes no
AOBD 08/09 H. Galhardas
16
Investigating Primary Resources
• Answer question 3: “Are there enough primary resources available for a DBMS to consume?”
• Primary resources: CPU, disk/controllers, memory, and network
• Analyze specific OS-level indicators to discover bottlenecks.
• A system-level performance monitor is the right tool here
AOBD 08/09 H. Galhardas
CPU Consumption Indicators at the OS Level
100%
CPU % of
utilization
70%
time
Sustained utilization over 70% should trigger the alert.
System utilization shouldn’t be more
than 40%. DBMS (in a non-
dedicated machine) should be getting a decent time share.
total usage
system usage
AOBD 08/09 H. Galhardas
17
Disk Performance Indicators at the OS Level
Wait queue
Average Queue Size
New requests Disk Transfers
/second
Should be close to zero
Wait times should also be close to
zero Idle disk with pending requests? Check controller
contention. Also, transfers
should be balanced among disks/controllers
AOBD 08/09 H. Galhardas
What to do in case of problem?
If there are long waiting queues and long service times, check the number of bytes transferred to confirm the disk is busy. Idle disks with pending requests imply controller problems
Also compare the wait times from each disk, if we have a group of disks serving the same purpose
Only reorganize data if we are sure the I/o problems are not dur to high-level consumer inefficiency
AOBD 08/09 H. Galhardas
18
Memory Consumption Indicators at the OS Level
pagefile
real memory
virtual memory
Page faults/time should be close
to zero. If paging happens, at least
not DB cache pages.
% of pagefile in use (it’s used a fixed file/partition) will tell
you how much memory is “lacking”.
AOBD 08/09 H. Galhardas
What to do in case of problems?
If the DB buffer size was properly configured, paging should come from non-DBMS sources.
Checking memory at the OS level should be done whenever we change the memory allocation parameters of the DBMS or there are significant changes in the nb of users
AOBD 08/09 H. Galhardas
19
Investigating Intermediate and Primary Resources: routine monitoring
Answer Q3
Problems at OS level?
Answer Q2 Tune low-level resources
yes no
Problematic subsystems?
Tune subsystems
Investigate upper level
yes no
AOBD 08/09 H. Galhardas
Investigating Intermediate Resources/Consumers
• Answer question 2: “Are subsystems making optimal use of resources?”
• Main subsystems: Cache Manager, Disk subsystem, Lock subsystem, and Log/Recovery subsystem
• Extract and analyze relevant perf. indicators • A Performance Monitor is usually useful, but
sometimes specific tools apply
AOBD 08/09 H. Galhardas
20
Cache Manager Performance Indicators
Table scan
readpage()
Free Page slots
Page reads/ writes
Pick victim
strategy Data Pages
Cache Manager
If page is not in the cache, readpage
(logical) generate an actual IO (physical).
Ratio of readpages that did not generate
physical IO (cache hit ration) should be 90%
or more
Pages are regularly saved to disk to make
free space. # of free slots should
always be > 0
AOBD 08/09 H. Galhardas
What to do in case of problem?
If indicators have bad values, the buffer manager is not saving as many disk accesses as it should One or a small nb of queries may be responsible May have slow disk subsystem May have insufficient memory
Tuning parameters accordingly
AOBD 08/09 H. Galhardas
21
Disk Manager Performance Indicators
rows
page
extent
file
Storage Hierarchy
(simplified)
disk
Row displacement: should kept under 5% of rows
Free space fragmentation: pages with few space should
not be in the free list
Data fragmentation: ideally files that store DB objects
(table, index) should be in one or few (<5) contiguous extents
File position: should balance workload evenly among all
disks AOBD 08/09 H. Galhardas
What to do in case of problem?
Bad indicator values usually mean inadequately chosen storage parameters To choose good storage parameters, should be aware of rate
at which data grows and with what mix of inserts/updates/deletes that growth rate happens
Also, identifiy occasional but predicatble fluctuations in the DB activity
Then tune disk subsystem
AOBD 08/09 H. Galhardas
22
Lock Manager Performance Indicators
Lock request
Object Lock Type TXN ID
Lock List
Locks pending list
Deadlocks and timeouts should be virtually
inexistent or extremely low (no more then 1% of the
transactions)
Lock wait time for a transaction should be a
small fraction of the whole transaction time.
Number of locks on wait should be a small fraction of the number of locks on the lock list, under 20%
AOBD 08/09 H. Galhardas
What to do in case of problem?
The lock subsystem is often a reflection of how well the transactions were designed and seldom a source of a problem itself
Can move up the consumption chain and find expensive queries, including data locking ones.
AOBD 08/09 H. Galhardas
23
Logging Subsystem Performance Indicatores Log is used by every transaction that alters, inserts
or deletes data Well tuned logging subsyst. Can write log records at
a pace that doesn’t slow active transactions Indicatores to measure:
Number of log waits to confirm that the disks storing the log files are keeping pace with the transaction workload – should be zero
Nb of log expansions or log archives due to lack of space – if bigger than zero indicates a configuration problem
Log cache-hit ratio to check whether the size of the log buffer is adequate
AOBD 08/09 H. Galhardas
Conclusion
Monitoring a DBMS’s performance can be done in a systematic way The consumption chain helps distinguishing
problems’ causes from their symptoms Existing tools help extracting relevant
performance indicators The three questions guide the whole monitoring
process
AOBD 08/09 H. Galhardas