Solutions with Practical SQL DBA : June 2013

Thursday, 27 June 2013

SQL Server: Part 2 : All About SQL Server Statistics :Histogram

In the Part 1 about SQL server Statistics, we have discussed about density vector information stored in the the statistics. In this post, let us discuss about the histogram. Let us create a copy of SalesOrderDetail table and two indexes on top of that as we did in our first part.

USE mydb
GO
SELECT * INTO SalesOrderDetail FROM AdventureWorks2008.Sales.SalesOrderDetail
CREATE UNIQUE CLUSTERED INDEX ix_SalesOrderDetailID ON SalesOrderDetail(SalesOrderDetailID)
CREATE NONCLUSTERED INDEX ix_productid ON SalesOrderDetail(productid)

Let us see the histogram information of the non clustered index.

DBCC SHOW_STATISTICS('dbo.SalesOrderDetail', 'ix_productid') WITH HISTOGRAM

You can see 200 records in the output. I have shown only the first 18 records. To create the histogram, SQL server split the data into different buckets (called steps) based on the value of first column of the index.Each record in the output is called as bucket or step.The maximum number of bucket is 200 based on the data distribution.Histogram is a statistical representation of your data.In other words it is the distribution of records based on the value of first column of the index. Histogram is always based only on the first column of the index even if the index is composite one.This is one of the reason why it is always suggested to have most selective column as the first column of the index, but there are exceptions.

Let us look at the output of the histogram. It tried to put the 121317 records in the table into 200 buckets (steps) based on the value of productid.

The RANGE_HI_KEY column represent the upper boundary of each bucket.The lower boundary of each bucket is the RANGE_HI_KEY+1 of the previous bucket. For the first bucket, the lower boundary is the smallest value of the column on which the histogram is generated.

The RANGE_ROWS column represent the number records in that bucket range but not equal to the value of RANGE_HI_KEY. The value 0 on the first record says, there is no record in the table whose productid value is less than 707. If you look into the 11th record with RANGE_HI_KEY value 718, we have 218 in the RANGE_ROWS column .This says ,there are 218 records with productid value is greater than 716 (previous RANGE_HI_KEY) and productid value is less than 718. The output of below query proves that:

SELECT COUNT(*) FROM SalesOrderDetail WHERE productid>716 AND productid<718

The EQ_ROWS is the number of records in the table matching with RANGE_HI_KEY. For the first records, 3083 in the EQ_ROWS says that there are 3083 records in the table with productid 707.The output of below query proves that:

SELECT COUNT(*) FROM SalesOrderDetail WHERE productid=707

The DISTINCT_RANGE_ROWS represent the number of distinct records (distinct productid ) between two RANGE_HI_KEY values. If you look into the 11th record with RANGE_HI_KEY value 718, we have value 1 in the DISTINCT_RANGE_ROWS column .This says ,there is only 1 distinct records withe productid value is greater than 716 (previous RANGE_HI_KEY) and productid value is less than 718. The output of below query proves that:


SELECT COUNT(distinct productid) FROM SalesOrderDetail WHERE productid>716 AND productid<718

The AVG_RANGE_ROWS column represent the average number of rows per distinct values.This is equivalent to RANGE_ROWS / DISTINCT_RANGE_ROWS when RANGE_ROWS value is greater than 0. Otherwise AVG_RANGE_ROWS is considered as 1.

How SQL server optimizer use the histogram for cardinality estimation ? Let us consider the execution plan of the below query.

SELECT productid FROM SalesOrderDetail WHERE productid>=716 AND productid<=718

From where the Estimated Number of rows (1513) is calculated ? Let us go to the histogram

Add the highlighted values which will match to 1513 which is the estimated number of rows in the execution plan .
1076 is the number of records with productid value 716
218 is the number of records with productid value greater than 716 and productid less than 718
219 is the number of records with productid value 718

Sum of these three values is the estimated number of rows in the above execution plan.

When there is complex where condition ,optimizer create required statistics called column statistics and use complex algorithm on top of the histogram data for cardinality estimation. We will discuss about that in the next post.

If you liked this post, do like my page on FaceBook

Wednesday, 26 June 2013

SQL Server: Identifying Edition Specifc features Used in a Database

I had a situation which forced me to do a feasibility study to move SQL server 2008 enterprise edition environment to standard edition. It is hard to check is there any edition specific features are implemented in any of the databases. It will be more complex process if the environment contain many databases.

There are features like data compression,partitioning,Transparent Data Encryption(TDE) and CDC which are available only on specific edition. The databases, that use these edition specific features can not be moved to an edition that does not support this feature.

Microsoft has provided a DMV, sys.dm_db_persisted_sku_features to list all the edition specific featured implemented in a database. This can be used as given below:

SELECT * FROM sys.dm_db_persisted_sku_features

For me, it given the below result.

If the query does not return any result , the specific database does not used any edition specific features and can be moved to any edition of SQL server.

If you liked this post, do like my page on FaceBook

Tuesday, 25 June 2013

SQL Server: Part1 : All About SQL Server Statistics :Density Vector

In our daily life, we use the statistics to take the decision. In the same way SQL server optimizer use the statistics to choose the right query plan. if the statistics are wrong or outdated , SQL server might choose a wrong query plan. In this post, let us try to understand the different aspects of statistics.

The query optimizer use the statistics to determine the rows returned in each step.The estimated rows information in the execution plan is calculated based on the statistics available on the columns.The statistics gives the distribution of data with in the column.With out statistics, the query optimizer can not be determine the efficiency of different plan.By using the information stored in the statistics , the query optimizer can make right choice in accessing the data.

Statistics are created automatically when we define the indexes.Apart from this statistics are getting created when a column is referenced in a query as part of where condition, in the group by clause or join criteria. To happen the auto creation of statistics, the AUTO_CREATE_STATISTICS setting in database level should be enabled. By default this setting is enabled.Apart from these , statistics can be created using the CREATE STATISTICS command.

Every statistics in SQL server stores information about density vector and Histogram (data distribution). Let us try to understand both before we discuss more about statistics.

Density:Density is the ratio of unique values with in the given column or a set of columns.

The formula to calculate density is : 1/Number of distinct values for a column or set of column

Density measure the uniqueness of column or selectivity of column.Density can have value between 0 and 1. If the column has density value 1, it means all the records has same value in that column and less selectivity. Higher the density lower the selectivity. If the column has density value 0.003, that means there are 1/0.003=333 distinct values in that column.

Let us consider a sample. I am going to create a table with two indexes using the below script.

USE mydb
GO
SELECT * INTO SalesOrderDetail FROM AdventureWorks2008.Sales.SalesOrderDetail
CREATE UNIQUE CLUSTERED INDEX ix_SalesOrderDetailID ON SalesOrderDetail(SalesOrderDetailID)
CREATE NONCLUSTERED INDEX ix_productid ON SalesOrderDetail(productid)

Let us see how the statistics will looks like for these two indexes. The below command will show the details of statistics associated with the clustered index Ix_SalesOrderDetailID.

DBCC SHOW_STATISTICS('dbo.SalesOrderDetail', 'ix_SalesOrderDetailID')

The output will have three sections: statistics header,density vector and Histogram. Let us concentrate only on the first two sections.

In the first section (Statistics header)

Name : The name of the index
Updated : Tells the date and time of last statistics update
Rows: The number of rows in the index.This is the number of rows in the indexes not in the table(in case of filtered index)
Rows Sampled: The number of entries that were sampled to generate the statistics. While creating an index , it will go for a full scan.
Steps: The number of steps in the histogram (the third section . we will discuss it later)
Density : This value is not used in SQL server 2008 and exists for backward compatibility
Average Key Length : The average length of index key
String Index: The statistics include string summery statistics or not string summery statistics is the Information about the frequency distribution of substrings is maintained for character columns. This helps the optimizer better estimate the selectivity of conditions that use the LIKE operator.
Filter Expression: The expression used to create filtered index
Unfiltered Rows: The number of rows in the underlying table.If it is filtered index , the value of this column will be greater than the value in the Rows column.

We can get information only about header using 'WITH STAT_HEADER' along with DBCC SHOW_STATISTICS

In the next section, density vector, we have only one record as we have one column in our index.

All Density column gives the density value of the column SalesOrderDetailsId (1/Number of distinct values for a column or set of column). The All density column show the value 8.242868E-06 = 0.000008242868. That means the column SalesOrderDetailsId has 1/0.000008242868 =121317 unique values.This value is equivalent to

SELECT COUNT(DISTINCT SalesOrderDetailID ) FROM SalesOrderDetail

We can get information only about density vector using 'WITH DENSITY_VECTOR' along with DBCC SHOW_STATISTICS.

Let us see the density vector of the non clustered index ix_productid

DBCC SHOW_STATISTICS('dbo.SalesOrderDetail', 'ix_productid') WITH DENSITY_VECTOR

You can see two rows in the density vector even if the the non clustered index is on single column. This is because the clustered index key is part of non clustered index (Read More). The first row says, the density value of the column ProductID is 0.003759399. In other words there are 1/0.003759399 =266 distinct values in the productID column.The output of the below query proves that is true.

SELECT COUNT(DISTINCT ProductID) FROM dbo.SalesOrderDetail

The second row says, the density of the combination of the columns ProductID and SalesOrderDetailID is 0.000008242868. In other words there are 121317 distinct combination of ProductID and SalesOrderDetailID which is equivalent to the total number of records in the table.

Where is density value is used ? Let us look into the execution plan of the below query.

SELECT ProductID,COUNT(*) FROM dbo.SalesOrderDetail GROUP BY ProductID

If you look into the execution plan, in the output of aggregate operator , the estimated number of rows value is 266. It is nothing but the distinct values available in the ProductID column which can be easily calculated from the density vector. But how we can prove that query optimizer use this value to calculate the estimated number of rows? Let us create another copy of this table with out any indexes and statistics.

USE mydb
GO
SELECT * INTO SalesOrderDetail_NoStats FROM SalesOrderDetail

The output of the below query tells us there is no statistics associated with this table.

EXEC SP_HELPSTATS 'SalesOrderDetail_NoStats', 'ALL'

Let us see estimated execution plan of the grouping query on this new table.

SELECT ProductID,COUNT(*) FROM dbo.SalesOrderDetail_NoStats GROUP BY ProductID

Again the optimizer estimated 266 rows in the output of the aggregate operator with out having any index or statistics. Let us again check the statistics associated with this table.

EXEC SP_HELPSTATS 'SalesOrderDetail_NoStats', 'ALL'

Yes, while estimating the plan , SQL server created a statistics on the productID column to help optimizer to choose the right execution plan. Let us see the details of this statistics.

DBCC SHOW_STATISTICS('dbo.SalesOrderDetail_NoStats', '_WA_Sys_00000005_32E0915F')

In the header section, you can notice that the value in the Rows Sampled is less than the value of the Rows. This is because SQL server will not scan the entire table instead it will use only a sample of the table while creating auto statistics.We will discuss about the sampling in detail in the later post. In short on a non indexed fields, the statistics help the optimizer to determine estimated number of record in each operation,what kind of join is appropriate and the order of the processing in the plan.

Let us stop this post here and will discuss about the histogram in the next post.

If you liked this post, do like my page on FaceBook