Last week, Peter Myers of Solid Quality Mentors gave a terrific presentation on Data Mining to our San Francisco SQL Server User Group. And I happened to have “mining” on my mind since I had just written my silly blog entry about the Seven Dwarfs, so I guess it was a sign of some kind.
In giving an overview of data mining, Peter brought up one example of it that all of us have seen many times: When you go to Amazon and many other websites and you are looking at an item, they give you suggestions for other products under the heading “Customers who bought this item also bought…”. This is a great marketing tool. I know I’ve clicked on those items many times… in fact, just after I read a good book, I look it up on Amazon to get ideas for others.
So let’s roll up our sleeves and fool around with the database that we’re all really sick of… AdventureWorks.
AdventureWorks sells bicycle equipment (bikes and their components, and related clothing and accessories), and their customers are both stores and individuals. Here’s our scenario. The marketing department of AdventureWorks really wants to push products from the Clothing and Accessories categories. When an individual customer visits the AdventureWorks website and looks at a particular Product, we should make suggestions of clothing and/or accessory items to them based on the product they’re looking at. In other words, customers who bought this item also bought… these wonderful clothing and accessory items! We will show 3 suggestions.
Our ultimate goal is to come up with an in-line table function that will accept a ProductID and a CustomerID and will return 3 suggested ProductID’s.
So let’s approach this one step at a time, building as we go.
I’m just going to pull an individual (i.e. non-store) customer out of of the database at random to play with… and the winner is… Customer 12203, Zachary Hughes. Congrats, Zack! And I’m going to pull a product at random… and that is Product 777, Mountain-100 Black, 44.
So Zack is looking at that Black Mountain bike. Which three clothing and/or accessory items will webeat him over the head with suggest to him?
First of all, let’s find all the other customers (i.e. not Zack himself) who have bought Product 777. That involves JOINing the SalesOrderDetail (where the ProductID lives) to the SalesOrderHeader (where we find the CustomerID):
Now we’re ready to make our recommendations of 3 other items, so we’ll just show the TOP (3) items:
But if Zack is looking at ProductID 777, we don’t want to just blindly recommend 870, 872, and 711. Our buddy Zack may have already bought one of those in the past, and we want to promote something he’s never bought before. So we’ll add a predicate to the WHERE clause so that we don’t recommend any products to the customer that he has bought in the past.
Let’s go ahead and create the in-line function out of what we have… we’re only going to return ProductID’s and not the names:
In my next blog entry, we’ll use this function as a basis to do some furtherbadgering marketing to the AdventureWorks victims customers.
In giving an overview of data mining, Peter brought up one example of it that all of us have seen many times: When you go to Amazon and many other websites and you are looking at an item, they give you suggestions for other products under the heading “Customers who bought this item also bought…”. This is a great marketing tool. I know I’ve clicked on those items many times… in fact, just after I read a good book, I look it up on Amazon to get ideas for others.
By the way, on a side note, there is a data mining anecdote (perhaps an urban legend?) that’s been going around for years that a surprising correlation was discovered in that people who bought diapers also tended to by beer as well. You can read about it in more detail here and here.Anyway, this got me to thinking how I would approach this “Customers Who Bought..” scenario. Even though the Data Mining Tool in Analysis Services can help you with this, I just thought it would be a fun little challenge to solve from scratch using good old-fashioned SQL.
So let’s roll up our sleeves and fool around with the database that we’re all really sick of… AdventureWorks.
AdventureWorks sells bicycle equipment (bikes and their components, and related clothing and accessories), and their customers are both stores and individuals. Here’s our scenario. The marketing department of AdventureWorks really wants to push products from the Clothing and Accessories categories. When an individual customer visits the AdventureWorks website and looks at a particular Product, we should make suggestions of clothing and/or accessory items to them based on the product they’re looking at. In other words, customers who bought this item also bought… these wonderful clothing and accessory items! We will show 3 suggestions.
Our ultimate goal is to come up with an in-line table function that will accept a ProductID and a CustomerID and will return 3 suggested ProductID’s.
So let’s approach this one step at a time, building as we go.
I’m just going to pull an individual (i.e. non-store) customer out of of the database at random to play with… and the winner is… Customer 12203, Zachary Hughes. Congrats, Zack! And I’m going to pull a product at random… and that is Product 777, Mountain-100 Black, 44.
So Zack is looking at that Black Mountain bike. Which three clothing and/or accessory items will we
First of all, let’s find all the other customers (i.e. not Zack himself) who have bought Product 777. That involves JOINing the SalesOrderDetail (where the ProductID lives) to the SalesOrderHeader (where we find the CustomerID):
declare @ProductID int, @CustomerID int set @ProductID=777set @CustomerID=12203select OrigProdID=d1.ProductID ,OrigSalesOrder=h1.SalesOrderID ,OrigCustomer=h1.CustomerIDfrom Sales.SalesOrderDetail d1 join Sales.SalesOrderHeader h1 on d1.SalesOrderID=h1.SalesOrderIDwhere d1.ProductID=@ProductID and h1.CustomerID<>@CustomerID/* OrigProdID OrigSalesOrder OrigCustomer ----------- -------------- ------------ 777 43853 11 777 45056 17 777 45796 17 777 43843 18 777 45266 18 777 46026 18 777 44126 20 777 44792 20 777 45570 20 And so on... (242 rows total) */However, remember these other customers must be individuals as opposed to stores, so we must JOIN in the Customer table in order to filter only those with CustomerType of ‘I’:
declare @ProductID int, @CustomerID int set @ProductID=777set @CustomerID=12203select OrigProdID=d1.ProductID ,OrigSalesOrder=h1.SalesOrderID ,OrigCustomer=h1.CustomerIDfrom Sales.SalesOrderDetail d1 join Sales.SalesOrderHeader h1 on d1.SalesOrderID=h1.SalesOrderIDjoin Sales.Customer c on h1.CustomerID=c.CustomerIDwhere d1.ProductID=@ProductID and h1.CustomerID<>@CustomerID and c.CustomerType='I'/* OrigProdID OrigSalesOrder OrigCustomer ----------- -------------- ------------ 777 43767 11001 777 43837 11009 777 43757 11017 777 43732 11025 777 44029 11050 777 44058 11056 777 44657 11120 777 45631 11237 777 45765 11245 777 44016 11247 And so on... (60 rows total) */Now that we have all the individuals who buy the product, let’s pull in all the orders that those individuals have ever made. That means JOINing in the SalesOrderHeader table once again, linking it ON CustomerID:
declare @ProductID int, @CustomerID int set @ProductID=777set @CustomerID=12203select OrigProdID=d1.ProductID ,OrigSalesOrder=h1.SalesOrderID ,OrigCustomer=h1.CustomerID ,OtherSalesOrder=h2.SalesOrderID from Sales.SalesOrderDetail d1 join Sales.SalesOrderHeader h1 on d1.SalesOrderID=h1.SalesOrderIDjoin Sales.Customer c on h1.CustomerID=c.CustomerIDjoin Sales.SalesOrderHeader h2 on h1.CustomerID=h2.CustomerID where d1.ProductID=@ProductID and h1.CustomerID<>@CustomerID and c.CustomerType='I'/* OrigProdID OrigSalesOrder OrigCustomer OtherSalesOrder ----------- -------------- ------------ --------------- 777 43767 11001 43767 777 43767 11001 51493 777 43767 11001 72773 777 43837 11009 43837 777 43837 11009 51562 777 43837 11009 57736 777 43757 11017 43757 777 43757 11017 51256 777 43757 11017 68396 And so on... (148 rows total) */And now that we have all those orders, we need to find out what products were in those orders. So JOIN the SalesOrderDetail table again:
declare @ProductID int, @CustomerID int set @ProductID=777set @CustomerID=12203select OrigProdID=d1.ProductID ,OrigSalesOrder=h1.SalesOrderID ,OrigCustomer=h1.CustomerID ,OtherSalesOrder=h2.SalesOrderID ,SuggProdID=d2.ProductIDfrom Sales.SalesOrderDetail d1 join Sales.SalesOrderHeader h1 on d1.SalesOrderID=h1.SalesOrderIDjoin Sales.Customer c on h1.CustomerID=c.CustomerIDjoin Sales.SalesOrderHeader h2 on h1.CustomerID=h2.CustomerID join Sales.SalesOrderDetail d2 on h2.SalesOrderID=d2.SalesOrderIDwhere d1.ProductID=@ProductID and h1.CustomerID<>@CustomerID and c.CustomerType='I'/* OrigProdID OrigSalesOrder OrigCustomer OtherSalesOrder SuggProdID ----------- -------------- ------------ --------------- ----------- 777 43767 11001 43767 777 777 43767 11001 51493 779 777 43767 11001 51493 878 777 43767 11001 51493 870 777 43767 11001 51493 871 777 43767 11001 51493 884 777 43767 11001 51493 712 777 43767 11001 72773 997 777 43767 11001 72773 870 777 43767 11001 72773 872 777 43767 11001 72773 708 And so on... (311 rows total) */But remember, we are only interested in those other products being part of the Clothing or Accessories categories. Products in the database are associated with a SubCategory (like Helmets, Locks, Caps, Gloves, etc) and those are, in turn, associated with a main Category (Bikes, Components, Clothing, Accessories), so we must JOIN in the Product and ProductSubCategory tables in order to filter the ProductCategoryID of 3 (Clothing) or 4 (Accessories). Since we’re pulling in the Product table, let’s spit out the name of the product also:
declare @ProductID int, @CustomerID int set @ProductID=777set @CustomerID=12203select OrigProdID=d1.ProductID ,OrigSalesOrder=h1.SalesOrderID ,OrigCustomer=h1.CustomerID ,OtherSalesOrder=h2.SalesOrderID ,SuggProdID=d2.ProductID ,SuggProdName=p.Name from Sales.SalesOrderDetail d1 join Sales.SalesOrderHeader h1 on d1.SalesOrderID=h1.SalesOrderIDjoin Sales.Customer c on h1.CustomerID=c.CustomerIDjoin Sales.SalesOrderHeader h2 on h1.CustomerID=h2.CustomerID join Sales.SalesOrderDetail d2 on h2.SalesOrderID=d2.SalesOrderIDjoin Production.Product p on d2.ProductID=p.ProductIDjoin Production.ProductSubCategory sc on p.ProductSubCategoryID=sc.ProductSubCategoryIDwhere d1.ProductID=@ProductID and h1.CustomerID<>@CustomerID and c.CustomerType='I' and sc.ProductCategoryID in (3,4)/* OrigProdID OrigSalesOrder OrigCustomer OtherSalesOrder SuggProdID SuggProdName ----------- -------------- ------------ --------------- ----------- ------------------------------- 777 43767 11001 51493 878 Fender Set - Mountain 777 43767 11001 51493 870 Water Bottle - 30 oz. 777 43767 11001 51493 871 Mountain Bottle Cage 777 43767 11001 51493 884 Short-Sleeve Classic Jersey, XL 777 43767 11001 51493 712 AWC Logo Cap 777 43767 11001 72773 870 Water Bottle - 30 oz. 777 43767 11001 72773 872 Road Bottle Cage 777 43767 11001 72773 708 Sport-100 Helmet, Black And so on... (163 rows total) */Great, that looks good. Now all we need to do is find out which products were ordered the most. That means we will GROUP BY the SuggProdID and sort the list in descending order of the number of orders in which it appeared (which is just the COUNT(DISTINCT SalesOrderID)). Note that I’m also outputting the Product Name, just so we can see it and satisfy our curiousity. In our final in-line function we create, we will just return a list of ProductID’s only.
declare @ProductID int, @CustomerID int set @ProductID=777set @CustomerID=12203select SuggProdID=d2.ProductID ,SuggProdName=p.Name ,NumOrds=count(distinct h2.SalesOrderID)from Sales.SalesOrderDetail d1 join Sales.SalesOrderHeader h1 on d1.SalesOrderID=h1.SalesOrderIDjoin Sales.Customer c on h1.CustomerID=c.CustomerIDjoin Sales.SalesOrderHeader h2 on h1.CustomerID=h2.CustomerID join Sales.SalesOrderDetail d2 on h2.SalesOrderID=d2.SalesOrderIDjoin Production.Product p on d2.ProductID=p.ProductIDjoin Production.ProductSubCategory sc on p.ProductSubCategoryID=sc.ProductSubCategoryIDwhere d1.ProductID=@ProductID and h1.CustomerID<>@CustomerID and c.CustomerType='I' and sc.ProductCategoryID in (3,4)group by d2.ProductID ,p.Name order by count(distinct h2.SalesOrderID) desc/* SuggProdID SuggProdName NumOrds ----------- --------------------------- ----------- 870 Water Bottle - 30 oz. 23 872 Road Bottle Cage 15 711 Sport-100 Helmet, Blue 13 712 AWC Logo Cap 10 871 Mountain Bottle Cage 10 934 Touring Tire 10 878 Fender Set - Mountain 9 873 Patch Kit/8 Patches 8 708 Sport-100 Helmet, Black 8 923 Touring Tire Tube 8 930 HL Mountain Tire 6 877 Bike Wash - Dissolver 5 707 Sport-100 Helmet, Red 5 880 Hydration Pack - 70 oz. 5 921 Mountain Tire Tube 4 882 Short-Sleeve Classic Jersey 4 883 Short-Sleeve Classic Jersey 3 884 Short-Sleeve Classic Jersey 3 860 Half-Finger Gloves, L 3 881 Short-Sleeve Classic Jersey 2 874 Racing Socks, M 1 865 Classic Vest, M 1 866 Classic Vest, L 1 869 Women's Mountain Shorts, L 1 715 Long-Sleeve Logo Jersey, L 1 716 Long-Sleeve Logo Jersey, XL 1 858 Half-Finger Gloves, S 1 859 Half-Finger Gloves, M 1 929 ML Mountain Tire 1 */That’s great, except for one thing. It’s only meaningful to suggest another product if it’s been on at least a few orders. For example, it may not be helpful to suggest an item that’s been ordered with Product 777 only 1 time in the past. So we will introduce a HAVING COUNT(DISTINCT SalesOrderID)>5 to make sure we only pay attention to those items that have been on more than 5 orders.
Now we’re ready to make our recommendations of 3 other items, so we’ll just show the TOP (3) items:
declare @ProductID int, @CustomerID int set @ProductID=777set @CustomerID=12203select top (3) SuggProdID=d2.ProductID ,SuggProdName=p.Name from Sales.SalesOrderDetail d1 join Sales.SalesOrderHeader h1 on d1.SalesOrderID=h1.SalesOrderIDjoin Sales.Customer c on h1.CustomerID=c.CustomerIDjoin Sales.SalesOrderHeader h2 on h1.CustomerID=h2.CustomerID join Sales.SalesOrderDetail d2 on h2.SalesOrderID=d2.SalesOrderIDjoin Production.Product p on d2.ProductID=p.ProductIDjoin Production.ProductSubCategory sc on p.ProductSubCategoryID=sc.ProductSubCategoryIDwhere d1.ProductID=@ProductID and h1.CustomerID<>@CustomerID and c.CustomerType='I' and sc.ProductCategoryID in (3,4)group by d2.ProductID ,p.Name having count(distinct h2.SalesOrderID)>5order by count(distinct h2.SalesOrderID) desc/* SuggProdID SuggProdName ----------- ---------------------- 870 Water Bottle - 30 oz. 872 Road Bottle Cage 711 Sport-100 Helmet, Blue */Looks great. Let’s say Zack is looking at ProductID 711 instead of 777. What comes out then?
/*
SuggProdID SuggProdName
----------- ----------------------
711 Sport-100 Helmet, Blue
921 Mountain Tire Tube
870 Water Bottle - 30 oz.
*/Well that’s strange. People who bought ProductID 711 tended most to buy ProductID 711 again? Oh, wait a minute. ProductID 711 is a Helmet, and therefore already part of the Accessory category. We’ll have to introduce a predicate to the WHERE clause to eliminate products from the suggestion list that match the product we’re initially inquiring on. In fact, maybe we should widen that restriction. For example, if a person is looking at a Blue Helmet, we don’t want to suggest a Red one. So let’s just make sure the SubCategoryID of the suggested products is different from the original product's SubCategoryID. So we’ll add that to the WHERE clause:declare @ProductID int, @CustomerID int set @ProductID=711set @CustomerID=12203select top (3) SuggProdID=d2.ProductID ,SuggProdName=p.Name from Sales.SalesOrderDetail d1 join Sales.SalesOrderHeader h1 on d1.SalesOrderID=h1.SalesOrderIDjoin Sales.Customer c on h1.CustomerID=c.CustomerIDjoin Sales.SalesOrderHeader h2 on h1.CustomerID=h2.CustomerID join Sales.SalesOrderDetail d2 on h2.SalesOrderID=d2.SalesOrderIDjoin Production.Product p on d2.ProductID=p.ProductIDjoin Production.ProductSubCategory sc on p.ProductSubCategoryID=sc.ProductSubCategoryIDwhere d1.ProductID=@ProductID and h1.CustomerID<>@CustomerID and c.CustomerType='I' and p.ProductSubCategoryID<>(select ProductSubCategoryID from Production.Product where ProductID=@ProductID) and sc.ProductCategoryID in (3,4) group by d2.ProductID ,p.Name having count(distinct h2.SalesOrderID)>5order by count(distinct h2.SalesOrderID) desc/* SuggProdID SuggProdName ----------- --------------------- 921 Mountain Tire Tube 870 Water Bottle - 30 oz. 922 Road Tire Tube */There, that’s better. I think we’ve got something here. So if Zack is looking at ProductID 711, we can suggest ProductID’s 921, 870, and 922. And if he's is looking at ProductID 777, we can suggest 870, 872, and 711.
But if Zack is looking at ProductID 777, we don’t want to just blindly recommend 870, 872, and 711. Our buddy Zack may have already bought one of those in the past, and we want to promote something he’s never bought before. So we’ll add a predicate to the WHERE clause so that we don’t recommend any products to the customer that he has bought in the past.
Let’s go ahead and create the in-line function out of what we have… we’re only going to return ProductID’s and not the names:
if object_id('ufn_GetSuggProducts') is not null drop function dbo.ufn_GetSuggProducts gocreate function dbo.ufn_GetSuggProducts( @ProductID int ,@CustomerID int)returns table as return select top (3) SuggProdID=d2.ProductIDfrom Sales.SalesOrderDetail d1 join Sales.SalesOrderHeader h1 on d1.SalesOrderID=h1.SalesOrderIDjoin Sales.Customer c on h1.CustomerID=c.CustomerIDjoin Sales.SalesOrderHeader h2 on h1.CustomerID=h2.CustomerID join Sales.SalesOrderDetail d2 on h2.SalesOrderID=d2.SalesOrderIDjoin Production.Product p on d2.ProductID=p.ProductIDjoin Production.ProductSubCategory sc on p.ProductSubCategoryID=sc.ProductSubCategoryIDwhere d1.ProductID=@ProductID and h1.CustomerID<>@CustomerID and c.CustomerType='I' and p.ProductSubCategoryID<>(select ProductSubCategoryID from Production.Product where ProductID=@ProductID) and sc.ProductCategoryID in (3,4) and d2.ProductID not in (select ProductID from Sales.SalesOrderDetail sod join Sales.SalesOrderHeader soh on sod.SalesOrderID=soh.SalesOrderID where CustomerID=@CustomerID)group by d2.ProductIDhaving count(distinct h2.SalesOrderID)>5order by count(distinct h2.SalesOrderID) descAnd let’s give it a try:
select * from dbo.ufn_GetSuggProducts(777,12203) /* SuggProdID ----------- 872 934 878 */Finally!
In my next blog entry, we’ll use this function as a basis to do some further
No comments:
Post a Comment