Estimation, Statistics and “Oh My!”
![Page 1: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/1.jpg)
Dave Ballantyne, Clear Sky SQL
Estimation, Statistics and “Oh My!”
![Page 2: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/2.jpg)
› Freelance Database Developer/Designer
  – Specializing in SQL Server for 15+ years
› SQLLunch
  – Lunchtime user group
  – London & Cardiff, 2nd & 4th Tuesdays
› TSQL Smells script author
  – http://tsqlsmells.codeplex.com/
› Email: [email protected]
› Twitter: @DaveBally
The ‘Me’ slide
![Page 3: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/3.jpg)
› This is also me
  – Far too often…
› Estimates are central
› Statistics provide estimates
“Oh my!”
![Page 4: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/4.jpg)
› Every journey starts with a plan
› Is this the ‘best’ way to Lyon?
  – Fastest
  – Most efficient
  – Shortest
› SQL Server makes similar choices
The plan
![Page 5: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/5.jpg)
› SQL Server is a cost-based optimizer
  – Cost-based = compares predicted costs
› Therefore estimations are needed
› Every choice is based on these estimations
› Has to form a plan
  – And the plan cannot change during execution if it turns out to be ‘wrong’
Estimation
![Page 6: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/6.jpg)
Estimation – Per execution
![Page 7: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/7.jpg)
› Costs are not actual costs
  – Little or no relevance to the execution costs
› Cost is not a metric
  – 1 <> 1 of anything
› Their purpose:
  – Pick between different candidate plans for a query
Estimation
![Page 8: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/8.jpg)
› Included within an index
› Auto-updated and auto-created
  – Optimizer decides “It would be useful if I knew…”
  – Only on a single column
  – Not in read-only databases
› Can be manually updated
› Auto-update can operate asynchronously
How is estimation calculated? - Statistics
![Page 9: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/9.jpg)
› DBCC SHOW_STATISTICS(tbl, stat)
  – WITH STAT_HEADER
    › Display statistics header information
  – WITH HISTOGRAM
    › Display detailed step information
  – WITH DENSITY_VECTOR
    › Display only density of columns
    › Density = 1 / count of distinct values
  – WITH STATS_STREAM
    › Binary stats blob (not supported)
Statistics Data
![Page 10: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/10.jpg)
Statistics – WITH STAT_HEADER
› Rows: total rows in table
› Rows Sampled: rows read and sampled
› Steps: number of steps in histogram (max 200)
› Density: 1 / distinct values (excluding boundaries), not used by the optimizer
› Average key length: average byte length for all columns
![Page 11: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/11.jpg)
Statistics – WITH HISTOGRAM
› Each step contains data on a range of values
› Range is defined by
  ◦ <= RANGE_HI_KEY
  ◦ > previous step’s RANGE_HI_KEY
› Row 3: > 2 and <= 407
› Row 4: > 407 and <= 470
› EQ_ROWS: 6 rows of data = 470
› RANGE_ROWS: 17 rows of data > 407 and < 470
› DISTINCT_RANGE_ROWS: 9 distinct values > 407 and < 470
› AVG_RANGE_ROWS: 17 / 9 = ~1.89
![Page 12: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/12.jpg)
Statistics in practice
› Step chosen: predicate <= RANGE_HI_KEY and > previous RANGE_HI_KEY
› As predicate = RANGE_HI_KEY, estimate = EQ_ROWS
![Page 13: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/13.jpg)
Statistics in practice
› Step chosen: predicate <= RANGE_HI_KEY and > previous RANGE_HI_KEY
› As predicate < RANGE_HI_KEY, estimate = AVG_RANGE_ROWS
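The two lookup cases above can be sketched in a few lines of Python. This is only an illustration of the rule on the slides, not how the engine is implemented. The `Step` class and `estimate_equality` function are hypothetical names, and only the counts for the 470 step come from the deck; the counts for the other two steps are made up.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class Step:
    """One histogram step, using the column names DBCC SHOW_STATISTICS shows."""
    range_hi_key: int
    eq_rows: float
    range_rows: float
    distinct_range_rows: int

    @property
    def avg_range_rows(self) -> float:
        # RANGE_ROWS / DISTINCT_RANGE_ROWS; DBCC reports 1 when the
        # step has no distinct values inside its range.
        if self.distinct_range_rows == 0:
            return 1.0
        return self.range_rows / self.distinct_range_rows


def estimate_equality(steps, value):
    """Estimate rows for `column = value` from a sorted list of steps."""
    keys = [s.range_hi_key for s in steps]
    i = bisect_left(keys, value)  # first step with RANGE_HI_KEY >= value
    if i == len(steps):
        return None  # beyond the last step: not modelled in this sketch
    step = steps[i]
    if value == step.range_hi_key:
        return step.eq_rows       # predicate hits the step boundary
    return step.avg_range_rows    # predicate falls inside the step


steps = [
    Step(2, 1, 0, 0),       # hypothetical counts
    Step(407, 4, 30, 12),   # hypothetical counts
    Step(470, 6, 17, 9),    # counts from the histogram slide
]

print(estimate_equality(steps, 470))  # boundary value: EQ_ROWS
print(estimate_equality(steps, 450))  # inside the step: AVG_RANGE_ROWS
```

A value equal to 470 returns the step's EQ_ROWS (6), while any value between 407 and 470 returns the same average of 17 / 9, regardless of how many rows that value really has.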
![Page 14: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/14.jpg)
› Greater accuracy on range boundary values
  – Based upon the ‘maxdiff’ algorithm
› Relevant for leading column
  – Estimate for Smiths
  – But not Smiths called John
› Additional columns: cumulative density only
  – 1 / (count of distinct values)
Statistics
![Page 15: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/15.jpg)
WITH DENSITY_VECTOR
Density = 1 / (count of distinct values)
Count of distinct values = 19,517
1 / 19,517 = ~5.123738E-05
![Page 16: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/16.jpg)
DENSITY_VECTOR in practice
![Page 17: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/17.jpg)
DENSITY_VECTOR in practice
Estimate = 19,972 (total rows) * 5.123738E-05 = ~1.02331
![Page 18: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/18.jpg)
› All Diazs will estimate to the same:
  – As will all Smiths, Jones & Ballantynes
  – The statistics do not contain detail on FirstName
  – Only how many distinct values there are
  – And assumes these are evenly distributed
› Not only across a single Surname
› But ALL Surnames
DENSITY_VECTOR in practice
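As a rough worked example of the density arithmetic, the sketch below reproduces the numbers from the slides. The function names are hypothetical, and the total of 19,972 rows is taken from the later "Non Literal Values" slide. Note that the estimate function deliberately ignores the names it is given, which is the point the slide makes.

```python
TOTAL_ROWS = 19_972          # total rows, as quoted later in the deck
DISTINCT_NAME_PAIRS = 19_517  # distinct values from the density vector slide


def density(distinct_values: int) -> float:
    """Density as defined on the slide: 1 / count of distinct values."""
    return 1.0 / distinct_values


def estimate_name_pair(last_name: str, first_name: str) -> float:
    """Estimate rows for LastName = ... AND FirstName = ... from density alone.

    The arguments are unused on purpose: the density vector holds no
    per-value detail, so every pair of names gets the same estimate.
    """
    return TOTAL_ROWS * density(DISTINCT_NAME_PAIRS)


print(f"{density(DISTINCT_NAME_PAIRS):.6E}")          # 5.123738E-05
print(round(estimate_name_pair("Diaz", "Ken"), 5))    # 1.02331
print(estimate_name_pair("Diaz", "Ken") == estimate_name_pair("Smith", "John"))
```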
![Page 19: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/19.jpg)
› So far we have only used a single statistic for estimations
› For this query:
› To provide the best estimate the optimizer ideally needs to know about LastName and FirstName
Multiple Statistics
![Page 20: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/20.jpg)
› Correlating multiple stats
› AND conditions
› Find the intersection
Multiple Statistics - Usage
LastName = ‘Sanchez’
FirstName = ‘Ken’
![Page 21: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/21.jpg)
› AND logic
  – Intersection Est = (Density 1 * Density 2)
› OR logic
  – Row Est 1 + Row Est 2 – (Intersection Estimate)
  – Avoid double counting
Multiple Statistics
![Page 22: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/22.jpg)
Multiple Stats – In action
10% * 10% = 1%
10% * 20% = 2%
No correlation in the data is assumed
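The AND and OR rules from the previous slides can be written out as a small Python sketch, treating each predicate's estimate as a fraction of the table. The function names and the 1,000-row table are hypothetical; the percentages are the ones on the slide.

```python
def and_selectivity(s1: float, s2: float) -> float:
    """AND: multiply the two fractions, assuming no correlation in the data."""
    return s1 * s2


def or_selectivity(s1: float, s2: float) -> float:
    """OR: add both fractions, then subtract the intersection so rows
    matching both predicates are not counted twice."""
    return s1 + s2 - and_selectivity(s1, s2)


total_rows = 1_000  # hypothetical table size, for illustration only

print(round(and_selectivity(0.10, 0.10), 4))              # 0.01
print(round(and_selectivity(0.10, 0.20), 4))              # 0.02
print(round(or_selectivity(0.10, 0.20) * total_rows, 2))  # 280.0
```

If the two columns are in fact correlated, for example a surname that is common in only one city, the multiplied estimate will be too low, which is the weakness the filtered statistics slides go on to address.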
![Page 23: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/23.jpg)
• To keep statistics fresh they get ‘aged’
  • 0 to > 0 rows
  • <= 6 rows (for temp tables)
    • 6 modifications
  • <= 500 rows
    • 500 modifications
  • >= 501 rows
    • 500 + 20% of table
• Will cause statistics to be updated on next use
• Will cause statements to be recompiled on next execution
• Temp tables in stored procedures are more complex
Aged Statistics
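The thresholds on this slide can be expressed as a small function. This is a sketch of the rules as listed, not the engine's own logic: the function names are hypothetical, and the integer rounding of the 20% term and the `>=` comparison are assumptions.

```python
def modification_threshold(row_count: int, is_temp_table: bool = False) -> int:
    """Modifications after which statistics count as aged, per the slide."""
    if row_count == 0:
        return 1                   # 0 to > 0 rows: the first row is enough
    if is_temp_table and row_count <= 6:
        return 6                   # small temp tables: 6 modifications
    if row_count <= 500:
        return 500                 # up to 500 rows: 500 modifications
    return 500 + row_count // 5    # 501+ rows: 500 + 20% of the table


def is_aged(row_count: int, modifications: int, is_temp_table: bool = False) -> bool:
    # Assumption: reaching the threshold is enough to mark the statistics aged.
    return modifications >= modification_threshold(row_count, is_temp_table)


print(modification_threshold(10_000))     # 2500
print(modification_threshold(1_000_000))  # 200500
print(is_aged(10_000, 2_499))             # False
```

The second call shows why large tables are a concern: a million-row table needs over 200,000 modifications before its statistics are refreshed automatically.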
![Page 24: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/24.jpg)
Large Tables
› Trace flag 2371: dynamically lower the statistics update threshold for large tables
› > 25,000 rows
› 2008 R2 (SP1) & 2012
![Page 25: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/25.jpg)
› When density vector is not accurate enough
› Manually created statistics only
› Additional WHERE clause can be utilised
Filtered Statistics
› Filter Expression = filter
› Unfiltered Rows = total rows in table before filter
› Rows Sampled = number of filtered rows sampled
![Page 26: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/26.jpg)
Filtered Statistics
Density of London * Density of Ramos
Filter is matched and histogram is used
![Page 27: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/27.jpg)
Sampled Data
› For ‘large’ data sets a smaller sample can be used
› Here 100% of the rows have been sampled
› Here ~52% of the rows have been sampled
› Statistics will assume the same distribution of values through the entire dataset
![Page 28: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/28.jpg)
Non Literal Values
› Also Auto/Forced Parameterization
![Page 29: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/29.jpg)
› Remember the density vector?
› 19,972 (total rows) * 0.0008285004 = 16.5468
Non Literal Values
On equality the average density is assumed
![Page 30: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/30.jpg)
› Stored Procedures
Non Literal Values
![Page 31: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/31.jpg)
› Enables a better plan to be built
  – (most of the time)
  – Uses specific values rather than average values
› Values can be seen in the properties pane
› Erratic execution costs are often parameter sniffing problems
Parameter Sniffing
![Page 32: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/32.jpg)
› Force a value to be used in optimization
› A literal value
› Or UNKNOWN
  – Falls back to density information
Optimize For
![Page 33: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/33.jpg)
› OPTION(RECOMPILE)
  – Recompile on every execution
› Because the plans aren’t cached
  – No point, as by definition the plan won’t be reused
› Uses variables as if literals
  – More accurate estimates
Forcing Recompiling
![Page 34: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/34.jpg)
› The Achilles heel
  – A plan is fixed
› But… the facts are:
  – More Smiths than Ballantynes
  – More customers in London than Leeds
Unbalanced data
|            | London | Leeds |
|------------|--------|-------|
| Smith      | 500    | 50    |
| Ballantyne | 10     | 1     |
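To make the cost of a fixed plan concrete, the sketch below uses the four counts from this slide and asks how far off the row estimate is when a plan compiled for one pair of values is reused for another. It assumes, for simplicity, that the estimate at compile time was exact; the names `ACTUAL_ROWS` and `misestimate_factor` are hypothetical.

```python
# Row counts from the slide: (surname, city) -> matching customers.
ACTUAL_ROWS = {
    ("Smith", "London"): 500,
    ("Smith", "Leeds"): 50,
    ("Ballantyne", "London"): 10,
    ("Ballantyne", "Leeds"): 1,
}


def misestimate_factor(compiled_for, executed_with) -> float:
    """Ratio of the rows the cached plan expects to the rows it actually gets.

    Assumes the plan was compiled with a perfect estimate for `compiled_for`.
    """
    estimated = ACTUAL_ROWS[compiled_for]
    actual = ACTUAL_ROWS[executed_with]
    return estimated / actual


# Plan built for the largest case, reused for the smallest: 500x overestimate.
print(misestimate_factor(("Smith", "London"), ("Ballantyne", "Leeds")))

# Plan built for the smallest case, reused for the largest: 500x underestimate.
print(misestimate_factor(("Ballantyne", "Leeds"), ("Smith", "London")))
```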
![Page 35: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/35.jpg)
› This is known as the ‘Plan Skyline’ problem
Unbalanced Data
[Chart: Summarised (20 Step) Surname Stats Distribution]
![Page 36: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/36.jpg)
Unbalanced Data
[Chart: Full Statistics distribution, by surname from Abbas to Zhou]
![Page 37: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/37.jpg)
› But wait…
› It gets worse
  – That was only EQ_ROWS
  – RANGE_ROWS?
Unbalanced Data
[Chart: Range Rows statistics distribution, by surname from Abbas to Zhou]
![Page 38: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/38.jpg)
› Variations in plans
  – Shape
    › Which is the ‘primary’ table?
  – Physical joins
  – Index (non)usage
    › Bookmark lookups
  – Memory grants
    › 500 rows need more memory than 50
  – Parallel plans
Unbalanced Data
![Page 39: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/39.jpg)
› Can the engine resolve this?
  – No!
› We can help though
  – And without recompiling
  – Aim is to prevent ‘car-crash’ queries
  – Not necessarily provide a ‘perfect’ plan
Unbalanced data
Demo
![Page 40: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/40.jpg)
› “There is always a trace flag”
  – Paul White (@SQL_Kiwi)
› TF 9292
  – Show when a statistic header is read
› TF 9204
  – Show when statistics have been fully loaded
› TF 8666
  – Display internal debugging info in the query plan
Which stats are used ?
![Page 41: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/41.jpg)
TF 9292 and 9204
![Page 42: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/42.jpg)
TF 8666
![Page 43: Estimation, Statistics and “Oh My!”](https://reader036.fdocuments.net/reader036/viewer/2022062410/568161ce550346895dd1bff2/html5/thumbnails/43.jpg)
› Statistics Used by the Query Optimizer in Microsoft SQL Server 2008
› Plan Caching in SQL Server 2008
› Microsoft SQL Server 2008 Internals (Microsoft Press)
References