Thursday, 1 December 2011

Cool CROSS APPLY Tricks, Part 2

In Part 1, we looked at some fairly typical examples of using CROSS APPLY.

But the best is yet to come.

Consider this problem that Adam Haines brought up in his blog entry entitled Converting a Delimited String of Values into Columns. He presents the following table.
declare @t table(  ProductId int ,ProductName varchar(25) ,SupplierId int ,Descr varchar(50)
) insert @t select 1,'Product1',1,'A1,10in,30in,2lbs'union all select 2,'Product2',2,'T6,15in,30in'union all select 3,'Product3',1,'A2,1in,,0.5lbs'union all select 4,'Product4',1,'X5,3in'
Note the column called Descr. That column contains 4 attributes of a product (Type, Length, Height, and Width) in a comma-delimited format. Your job: extract those values. Adam uses a very nice XML method to extract the values from the Descr field. Unfortunately the problem with using XML is that one of the attributes might contain a character that would break it, like an ampersand (&) or a less-than sign (<) for example.

So CROSS APPLY to the rescue!

The Descr field may not have all the commas in place (see row 4 in the table), so we will artifically add 4 of them to the end, creating a string to work with:

select ProductID
      ,SupplierID
      ,stringfrom @tcross apply (select string=Descr+',,,,') f1/*
  ProductID  SupplierID string
----------- ----------- ---------------------
          1           1 A1,10in,30in,2lbs,,,,
          2           2 T6,15in,30in,,,,
          3           1 A2,1in,,0.5lbs,,,,
          4           1 X5,3in,,,,
*/
You see how I used CROSS APPLY to introduce a new column called string? Now that we have that string to work with, we need to find where the commas are. The first comma is easy. We just use CHARINDEX. But the second and third and fourth commas all depend on the location of the previous comma. So we must use CROSS APPLY 4 times:

select ProductID
      ,SupplierID
      ,string
      ,p1,p2,p3,p4from @tcross apply (select string=Descr+',,,,') f1cross apply (select p1=charindex(',',string)) f2cross apply (select p2=charindex(',',string,p1+1)) f3cross apply (select p3=charindex(',',string,p2+1)) f4cross apply (select p4=charindex(',',string,p3+1)) f5/*
  ProductID  SupplierID string                p1 p2 p3 p4
----------- ----------- --------------------- -- -- -- --
          1           1 A1,10in,30in,2lbs,,,,  3  8 13 18
          2           2 T6,15in,30in,,,,       3  8 13 14
          3           1 A2,1in,,0.5lbs,,,,     3  7  8 15
          4           1 X5,3in,,,,             3  7  8  9
*/
Now that we have all the locations of the commas, it’s easy to pull out the attributes that we want:

select ProductID
      ,SupplierID
      ,[Type]
      ,Length
      ,Height
      ,Weightfrom @tcross apply (select string=Descr+',,,,') f1cross apply (select p1=charindex(',',string)) f2cross apply (select p2=charindex(',',string,p1+1)) f3cross apply (select p3=charindex(',',string,p2+1)) f4cross apply (select p4=charindex(',',string,p3+1)) f5cross apply (select [Type]=substring(string,1,p1-1)                   ,Length=substring(string,p1+1,p2-p1-1)                   ,Height=substring(string,p2+1,p3-p2-1)                   ,Weight=substring(string,p3+1,p4-p3-1)) f6/*
  ProductID  SupplierID Type Length Height Weight
----------- ----------- ---- ------ ------ ------
          1           1 A1   10in   30in   2lbs
          2           2 T6   15in   30in   
          3           1 A2   1in           0.5lbs
          4           1 X5   3in     
*/
The terrific thing about this is that it doesn’t involve anything in terms of cost. If you look at a query plan, all it consists of is a Table Scan followed by a Compute Scalar, the latter of which has an estimated operator cost of 0.0000001… in other words, pretty much no cost at all! The Query Optimizer combined all the CROSS APPLY stuff I introduced and essentially does the following query, (which you can see if you look at the properties of the Compute Scalar icon in the plan):

select ProductID
      ,SupplierID
      ,[Type]=substring([Descr]+',,,,',1,charindex(',',[Descr]+',,,,')-1)      ,Length=substring([Descr]+',,,,',charindex(',',[Descr]+',,,,')+1,               (charindex(',',[Descr]+',,,,',charindex(',',[Descr]+',,,,')+1)-               charindex(',',[Descr]+',,,,'))-1)      ,Height=substring([Descr]+',,,,',charindex(',',[Descr]+',,,,',               charindex(',',[Descr]+',,,,')+1)+1,(charindex(',',[Descr]+',,,,',               charindex(',',[Descr]+',,,,',charindex(',',[Descr]+',,,,')+1)+1)-               charindex(',',[Descr]+',,,,',charindex(',',[Descr]+',,,,')+1))-1)      ,Weight=substring([Descr]+',,,,',charindex(',',[Descr]+',,,,',               charindex(',',[Descr]+',,,,',charindex(',',[Descr]+',,,,')+1)+1)+1,               (charindex(',',[Descr]+',,,,',charindex(',',[Descr]+',,,,',               charindex(',',[Descr]+',,,,',charindex(',',[Descr]+',,,,')+1)+1)+1)-               charindex(',',[Descr]+',,,,',charindex(',',[Descr]+',,,,',               charindex(',',[Descr]+',,,,')+1)+1))-1)from @t 
How would you like to parse through and troubleshoot that? What a nightmare! But you don’t have to… our CROSS APPLY query is very readable and easy to understand, and it's fast.

Here’s one more example…

Here’s a table of messages:

declare @t table (Message varchar(100))insert @t select 'cross apply is really cool'union all select 'and ntile is really cool too'union all select 'and iced tea is really cool too'
Your job is to pull out all the word pairs and report on how often they occur throughout the table. Here’s how to do it with our new pal CROSS APPLY:

;with Numbers as(  select N=Number 
  from master..spt_values
  where Type='P' and Number>0)select WordPair,Occurrences=count(*)from @tjoin Numbers on substring(' '+Message,N,1)=' ' and N<len(Message)+1cross apply (select string=substring(' '+Message+' ',N+1,len(Message)+1)) f1cross apply (select p1=charindex(' ',string)) f2cross apply (select p2=charindex(' ',string,p1+1)) f3cross apply (select WordPair=case when p1<p2 then left(string,p2-1) end) f4where WordPair is not nullgroup by WordPairorder by count(*) desc/*
WordPair    Occurrences
----------- -----------
is really             3
really cool           3
cool too              2
cross apply           1
iced tea              1
and iced              1
and ntile             1
apply is              1
tea is                1
ntile is              1
*/
I use a table of numbers (I “cheated” and just used the numbers in master..spt_values) and JOIN that to the Message column based on the location of its space characters. The first CROSS APPLY (f1) creates a string to work with, which is a SUBSTRING of the Message, starting at a word boundary (and ending with a space I artificially put at the end). The second and third CROSS APPLYs (f2 and f3) find the location of the first and second space characters in that string. The final CROSS APPLY creates the word pair, if one was found. If no word pair is found, then WordPair is set to NULL (because there is no ELSE in the CASE). That is filtered out by the WHERE clause.

(Edit on July11,2009: For a more in-depth discussion of processing the above word-pair query, see my blog post entitled Word Pairs Revisited (Cool CROSS APPLY Tricks, Part 2 ½)).

Note that all these new columns that we introduced (string, p1, p2, WordPair) can be used in a WHERE clause and a GROUP BY clause or pretty much anywhere within the query. That’s the best part.

The query below is what is really calculated behind the scenes:

;with Numbers as(  select N=Number 
  from master..spt_values
  where Type='P' and Number>0)select WordPair=case when charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1))<                  charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1),                  charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1))+1) 
                then left(substring(' '+Message+' ',N+1,len(Message)+1),                  charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1),                  charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1))+1)-1) 
                end      ,count(*)from @tjoin Numbers on substring(' '+Message,N,1)=' ' and N<len(Message)+1where case when charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1))<        charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1),        charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1))+1) 
      then left(substring(' '+Message+' ',N+1,len(Message)+1),        charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1),        charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1))+1)-1) 
      end is not nullgroup by case when charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1))<           charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1),           charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1))+1) 
         then left(substring(' '+Message+' ',N+1,len(Message)+1),           charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1),           charindex(' ',substring(' '+Message+' ',N+1,len(Message)+1))+1)-1) 
         end
order by count(*) desc
Yeesh!

So after all this, I hope I have converted you to the beauty of the APPLY operator. Go spread the word!

Monday, 28 November 2011

Cool CROSS APPLY Tricks, Part 1

In most SQL Server books or tutorials, the APPLY operator would get no more than a 1 or 2 page treatment, if at all. (A notable exception is Itzik Ben-Gan’s book Inside Microsoft SQL Server 2005: T-SQL Querying, which goes into some depth). But in my opinion it’s a really cool and extremely useful operator.

The example below is representative of about 97% of the examples you see in books and tutorials regarding APPLY in that it uses a table-valued function. Using the AdventureWorks database, it finds all Vista credit cards that expire in June of 2008, and gives information about the contact associated with the credit card.
select f.FirstName
      ,f.LastName
      ,f.JobTitle
      ,f.ContactType
      ,cc.CardNumberfrom Sales.CreditCard ccjoin Sales.ContactCreditCard ccc on cc.CreditCardID=ccc.CreditCardIDcross apply dbo.ufnGetContactInformation(ccc.ContactID) fwhere cc.ExpYear=2008
  and cc.ExpMonth=6
  and cc.CardType='Vista'/*
FirstName LastName JobTitle         ContactType   CardNumber
--------- -------- ---------------- ------------- --------------
Peggy     Justice  Owner            Store Contact 11119759315644
John      McClane  Purchasing Agent Store Contact 11119490672347
Laura     Cai      NULL             Consumer      11112813884091
Natalie   Gonzales NULL             Consumer      11114369564985
Jarrod    Sara     NULL             Consumer      11116045498593
Katherine Bailey   NULL             Consumer      11119100149656
Stephanie Gray     NULL             Consumer      11112324154556
Shawna    Sharma   NULL             Consumer      11116413893110
Mindy     Rai      NULL             Consumer      11115163407997
Jackson   Jai      NULL             Consumer      11112011871602
And so on... (74 rows total)
*/
The SQL Engine first must evaluate everthing on the left-hand side of the APPLY, and then, for each row, it is able to process (or apply) the right-hand side to produce some kind of a (table-valued) result.

So in the example above, it first evaluates the left-hand side (i.e. JOINs the CreditCard and ContactCreditCard tables and filters through only the Vista credit cards that expire in June2008), and then for each row, it will apply the right-hand side (i.e. call the function,passing it the ContactID) to produce a result (the function returns a 1-row table consisting of FirstName, LastName, JobTitle, and ContactType).

That’s kind of cool, and it was something that you were not able to do prior to SQL2005, when the APPLY operator was introduced.

But the right-hand side doesn’t have to be a function.

Let’s take a look at all the Stores in AdventureWorks whose Main Office is in Wisconsin (we have to trudge through a lot of tables to get that information):

select c.CustomerID
      ,s.Name
from Sales.Customer cjoin Sales.Store s on c.CustomerID=s.CustomerIDjoin Sales.CustomerAddress ca on c.CustomerID=ca.CustomerIDjoin Person.Address a on ca.AddressID=a.AddressIDjoin Person.StateProvince sp on a.StateProvinceID=sp.StateProvinceIDwhere ca.AddressTypeID=3 --MainOffice  and sp.StateProvinceCode='WI'/*
CustomerID Name
----------- ---------------------------
        418 Good Bike Shop
        453 Unique Bikes
        543 Friendly Neighborhood Bikes
        606 Little Bicycle Supply Shop
*/
And for each of those customers, let’s find out the top 3 products that they bought in terms of dollars (and while we’re at it, let’s show the dollars they spent on the product):

select c.CustomerID
      ,s.Name      ,f1.ProductID
      ,f1.PurchaseAmtfrom Sales.Customer cjoin Sales.Store s on c.CustomerID=s.CustomerIDjoin Sales.CustomerAddress ca on c.CustomerID=ca.CustomerIDjoin Person.Address a on ca.AddressID=a.AddressIDjoin Person.StateProvince sp on a.StateProvinceID=sp.StateProvinceIDcross apply (select top (3) 
                    ProductID
                   ,PurchaseAmt=sum(LineTotal)             from Sales.SalesOrderHeader soh
             join Sales.SalesOrderDetail sod
               on soh.SalesOrderID=sod.SalesOrderID
             where CustomerID=c.CustomerID
             group by ProductID
             order by sum(LineTotal) desc) f1where ca.AddressTypeID=3 --MainOffice  and sp.StateProvinceCode='WI'/*
 CustomerID Name                        ProductID PurchaseAmt
----------- --------------------------- --------- -----------
        418 Good Bike Shop                    795  30367.3500
        418 Good Bike Shop                    794  24136.8075
        418 Good Bike Shop                    792  23508.5175
        453 Unique Bikes                      773  16319.9520
        453 Unique Bikes                      771  12239.9640
        453 Unique Bikes                      772  12239.9640
        543 Friendly Neighborhood Bikes       782   9638.9580
        543 Friendly Neighborhood Bikes       868    671.9040
        543 Friendly Neighborhood Bikes       869    335.9520
        606 Little Bicycle Supply Shop        717   1717.8000
        606 Little Bicycle Supply Shop        838    858.9000
        606 Little Bicycle Supply Shop        738    809.3280
*/
Notice that the right-hand side of the CROSS APPLY is a correlated query where we refer to c.CustomerID, a column from the left-hand side. So each of the 4 customers from the left-hand side were applied to our correlated query on the right-hand side to produce 3 rows each.

But let’s not stop there… We can keep chaining additional information. For each of those products we found, let’s find out the top customer for that product (i.e. who placed the most orders on the product?):

select c.CustomerID
      ,s.Name      ,f1.ProductID
      ,f1.PurchaseAmt
      ,f2.TopCustomerID
      ,f2.NumOrdsfrom Sales.Customer cjoin Sales.Store s on c.CustomerID=s.CustomerIDjoin Sales.CustomerAddress ca on c.CustomerID=ca.CustomerIDjoin Person.Address a on ca.AddressID=a.AddressIDjoin Person.StateProvince sp on a.StateProvinceID=sp.StateProvinceIDcross apply (select top (3) 
                    ProductID
                   ,PurchaseAmt=sum(LineTotal)             from Sales.SalesOrderHeader soh
             join Sales.SalesOrderDetail sod
               on soh.SalesOrderID=sod.SalesOrderID
             where CustomerID=c.CustomerID
             group by ProductID
             order by sum(LineTotal) desc) f1cross apply (select top (1)                    TopCustomerID=CustomerID
                   ,NumOrds=count(*)             from Sales.SalesOrderHeader soh
             join Sales.SalesOrderDetail sod
               on soh.SalesOrderID=sod.SalesOrderID
             where ProductID=f1.ProductID
             group by CustomerID
             order by count(*) desc) f2where ca.AddressTypeID=3 --MainOffice  and sp.StateProvinceCode='WI'/*
 CustomerID Name                        ProductID PurchaseAmt TopCustomerID NumOrds
----------- --------------------------- --------- ----------- ------------- -------
        418 Good Bike Shop                    795  30367.3500             4       8
        418 Good Bike Shop                    794  24136.8075             4       8
        418 Good Bike Shop                    792  23508.5175            16       8
        453 Unique Bikes                      773  16319.9520            75       4
        453 Unique Bikes                      771  12239.9640            20       4
        453 Unique Bikes                      772  12239.9640           118       4
        543 Friendly Neighborhood Bikes       782   9638.9580            10       8
        543 Friendly Neighborhood Bikes       868    671.9040            23       4
        543 Friendly Neighborhood Bikes       869    335.9520            10       4
        606 Little Bicycle Supply Shop        717   1717.8000           166       8
        606 Little Bicycle Supply Shop        838    858.9000            24       6
        606 Little Bicycle Supply Shop        738    809.3280           166      12
*/
This second CROSS APPLY (f2) took the f1.ProductID that resulted from the first CROSS APPLY (f1), and it applied it to the correlated query on its right-hand side to produce a single row of information, namely the TopCustomerID and NumOrds columns.

Note that if we wanted to add product descriptions to this query, which we would get from the Production.Product table, we must perform the JOIN to that table AFTER the CROSS APPLY, because everything with APPLY is processed left-to-right.

select c.CustomerID
      ,s.Name      ,f1.ProductID
      ,ProductName=p.Name      ,f1.PurchaseAmt
      ,f2.TopCustomerID
      ,f2.NumOrdsfrom Sales.Customer cjoin Sales.Store s on c.CustomerID=s.CustomerIDjoin Sales.CustomerAddress ca on c.CustomerID=ca.CustomerIDjoin Person.Address a on ca.AddressID=a.AddressIDjoin Person.StateProvince sp on a.StateProvinceID=sp.StateProvinceIDcross apply (select top (3) 
                    ProductID
                   ,PurchaseAmt=sum(LineTotal)             from Sales.SalesOrderHeader soh
             join Sales.SalesOrderDetail sod
               on soh.SalesOrderID=sod.SalesOrderID
             where CustomerID=c.CustomerID
             group by ProductID
             order by sum(LineTotal) desc) f1cross apply (select top (1)                    TopCustomerID=CustomerID
                   ,NumOrds=count(*)             from Sales.SalesOrderHeader soh
             join Sales.SalesOrderDetail sod
               on soh.SalesOrderID=sod.SalesOrderID
             where ProductID=f1.ProductID
             group by CustomerID
             order by count(*) desc) f2join Production.Product p on f1.ProductID=p.ProductIDwhere ca.AddressTypeID=3 --MainOffice  and sp.StateProvinceCode='WI'/*
 CustomerID Name                        ProductID ProductName                PurchaseAmt TopCustomerID NumOrds
----------- --------------------------- --------- -------------------------- ----------- ------------- -------
        606 Little Bicycle Supply Shop        717 HL Road Frame - Red, 62      1717.8000           166       8
        606 Little Bicycle Supply Shop        738 LL Road Frame - Black, 52     809.3280           166      12
        453 Unique Bikes                      771 Mountain-100 Silver, 38     12239.9640            20       4
        453 Unique Bikes                      772 Mountain-100 Silver, 42     12239.9640           118       4
        453 Unique Bikes                      773 Mountain-100 Silver, 44     16319.9520            75       4
        543 Friendly Neighborhood Bikes       782 Mountain-200 Black, 38       9638.9580            10       8
        418 Good Bike Shop                    792 Road-250 Red, 58            23508.5175            16       8
        418 Good Bike Shop                    794 Road-250 Black, 48          24136.8075             4       8
        418 Good Bike Shop                    795 Road-250 Black, 52          30367.3500             4       8
        606 Little Bicycle Supply Shop        838 HL Road Frame - Black, 44     858.9000            24       6
        543 Friendly Neighborhood Bikes       868 Women's Mountain Shorts, M    671.9040            23       4
        543 Friendly Neighborhood Bikes       869 Women's Mountain Shorts, L    335.9520            10       4
*/
If we had tried to place that JOIN with all the other JOINs (i.e. BEFORE the first CROSS APPLY), we would get an error: The multi-part identifier “f1.ProductID” could not be bound.

This is all interesting, but the real magic of CROSS APPLY is yet to come in Part 2.

Thursday, 24 November 2011

It's the Natural Order of Things... NOT!

In moving from the Visual FoxPro (VFP) world to the SQL Server world, the biggest thing that I had to un-learn was the concept of a “natural order” of a table. In VFP, when you add rows to a table, they ALWAYS get appended to the end of the table. No exceptions. So when you do a SELECT FROM the table, the rows will ALWAYS appear in the same order no matter what. You can create as many indexes as you like on the table, and it still won’t make any difference, because VFP will ALWAYS use the “natural order” un-indexed rows when you do a SELECT FROM the table.

All of that gets thrown out the window when it comes to SQL Server. I see many posts on the MSDN T-SQL Forum from people who present some sample data and want some kind of solution that assumes a certain order in the table. For example, here’s a typical example of a post I’ve seen many times:

My data looks like this:
Code  Quantity
----  --------
A           23
A           15
A           11
B           14
B           18
A           43
C           12
And I want to sum up the quantities by Code and get this result:

Code  TotQuantity
----  -----------
A              49
B              32
A              43
C              12
The inevitable and immediate question from others on the forum trying to help is: “You can’t do what you want unless you order on some other column. Is there a datetime column or something?” And many times, the answer is no, there is no order. This is just what they see when they do a SELECT * and they assume that this is a “natural” order of the rows as they were chronologically INSERTed into the table.

Wrong-o.

I can certainly understand that this is a natural assumption to make. Coming from the VFP world, it’s certainly the assumption I would have made… until I learned more about how SQL Server works.

Let’s see just how wrong this assumption is.

Create a table and insert 10 rows into it and let’s take a look at what we get when we do a SELECT * from it:

create table #test (  col1 varchar(900) not null ,col2 char(1) not null
)
  insert #test values ('B'+replicate('.',899),'N')insert #test values ('R'+replicate('.',899),'A')insert #test values ('A'+replicate('.',899),'T')insert #test values ('D'+replicate('.',899),'U')insert #test values ('S'+replicate('.',899),'R')insert #test values ('C'+replicate('.',899),'A')insert #test values ('H'+replicate('.',899),'L')insert #test values ('U'+replicate('.',899),'O')insert #test values ('L'+replicate('.',899),'R')insert #test values ('Z'+replicate('.',899),'D')
 
select * from #test/*
col1      col2
--------- ----
B........ N
R........ A
A........ T
D........ U
S........ R
C........ A
H........ L
U........ O
L........ R
Z........ D
*/
Great. That makes sense. It’s the “natural” chronologically-entered order of the rows. You can see my name (Brad Schulz) spelled down vertically in the first column and “Natural Ord” spelled down vertically in the second column. That’s the order we put them in, and so it’s not surprising that’s the way they come out.

Now let’s insert two more rows and do another SELECT *:

insert #test values ('X','X')insert #test values ('Y','Y')select * from #test/*
col1      col2
--------- ----
B........ N
R........ A
A........ T
D........ U
S........ R
C........ A
H........ L
U........ O
X         X
Y         Y
L........ R
Z........ D
*/
Huh? Why didn’t it append them at the end? Well, because SQL Server just doesn’t do an append. Table data is stored as a collection of 8-KB (8192-byte) pages. If you subtract the space for page header information and other overhead, that leaves 8060 bytes for the data on each page. And here’s the key: a row of data can not span multiple pages.

The first 10 rows I created were 901 bytes each. The first 8 rows took up 8 * 901 = 7208 bytes, leaving 8060 - 7208 = 852 bytes left in the page. The ninth row of 901 bytes was too big to fit in that page, so it was placed on a whole new page, along with the tenth row (since it, too, was too big to fit on the first page).

But when I added an 11th and 12th row, I purposely made those rows only 2 bytes each. And SQL Server plopped them on the first data page of the table because there was easily enough room for them. So now when we do a SELECT *, we are getting the first page’s worth of data followed by the second page.

You can read more at Books Online about SQL’s data storage in pages.

Now let's stir things up and create a clustered index on col1:

create clustered index ix_test1 on #test (col1)select * from #test/*
col1      col2
--------- ----
A........ T
B........ N
C........ A
D........ U
H........ L
L........ R
R........ A
S........ R
U........ O
X         X
Y         Y
Z........ D
*/
The table is no longer a heap of unordered rows… the table is now ordered on the column we called col1. So a SELECT is now going to retrieve data from the table in that order. If you look at the query plan, you can see that it’s doing a Clustered Index Scan.

But wait, that’s not all…

Now create a nonclustered index on col2:

create nonclustered index ix_test2 on #test (col2)select * from #test/*
col1      col2
--------- ----
C........ A
R........ A
Z........ D
H........ L
B........ N
U........ O
L........ R
S........ R
A........ T
D........ U
X         X
Y         Y
*/
Hmmm… Why does it now come out in col2 order? After all, we did a SELECT on ALL columns of the table, and all the data is stored in the clustered index. So why does the query plan show that an Index Scan was done on the nonclustered index instead of doing a Clustered Index Scan?

Our nonclustered index on col2 (and any nonclustered index we create) automatically includes col1 in the index as well, because it needs that to do a lookup into the clustered index in case it needs to acquire any further columns. So our nonclustered index has both col1 and col2 in it… all the columns of our table are in it!

And why does SQL prefer the NonClustered Index Scan over the Clustered Index Scan, even though they both have exactly the same amount of data? Well, in normal real-life situations, a nonclustered index will have less data in its pages than a clustered one (since the clustered index contains ALL columns). The size of a nonclustered index will be either equal to or less than the size of a clustered index… it will never be bigger… so for that reason, SQL builds an execution plan that will always favor a covering nonclustered index.

Now let’s remove the two indexes… no ordering at all. What happens?

drop index #test.ix_test1drop index #test.ix_test2select * from #test/*
col1      col2
--------- ----
A........ T
B........ N
C........ A
D........ U
H........ L
L........ R
R........ A
S........ R
U........ O
X         X
Y         Y
Z........ D
*/
Wait a minute… We got rid of the clustered index. Why is the data in col1 order now? I assume that when we removed the clustered index, SQL scanned the clustered index (in col1 order) and re-wrote the data into a heap (a bunch o’ data without a clustered index).

I find it interesting, though, that it didn’t follow the rules of putting a new row into the first page it can find that has room. In other words, it wrote out the first 8 rows (the ones where col1 starts with A, B, C, D, H, L, R, and S), taking up 7208 bytes. Then row number 9’s 901 bytes was too big to fit, so it started a new page. But then row numbers 10 and 11 were our 2-byte rows, and they were apparently APPENDED AFTER row 9 in the second page and WERE NOT put into the first page where there was ample room for them. Perhaps, in order to remove the clustered index as quickly as possible, it’s just faster to spit out the rows onto successive pages rather than trying to do a “first page that fits” algorithm for every single row. (If anyone has any insight into this, please chime in).

If we add a new row, we can see that it follows the “first page that fits” rule, just as we had seen the last time we manually inserted records, putting the row into the first page:

insert #test values ('Z','Z')select * from #test/*
col1      col2
--------- ----
A........ T
B........ N
C........ A
D........ U
H........ L
L........ R
R........ A
S........ R
Z         Z
U........ O
X         X
Y         Y
Z........ D
*/
So, after all this, I think we can throw the “natural order” concept out the window, don't you? There’s just no such thing.