Wednesday, August 19, 2009

Creating XML Sitemap from MS SQL

As anyone that has tried to create a sitemap for a decent sized site knows, no tool you can pull off the shelf really works. When you are dealing with hundreds of thousands of URLs on your site, tools that follow your site and try and create an XML sitemap just don’t work very well.
If you want to provide a sitemap to the search engines that have just the URL’s you want and none of the junk or various permutations of a URL a BOT is bound to find, you have just one choice and that is to programmatically develop a file with the output you desire.
I could find no utility on the market that would allow you to feed it a T-SQL query and have it output an XML sitemap. Maybe some day I’ll write it. For now I can accomplish my needs with the stored procedure below.
So what we have here is a stored procedure that does the following:
  • Creates seven sitemaps based on seven different queries. Each sitemap has a specific type of page we want the search engines to follow.
  • Outputs the sitemap to an XML file encoded in UTF-16 (only format MS SQL supports unfortunately).
  • Converts the files from UTF-16 to UTF-8
  • Finally, deletes and names the files
I use the various sitemaps here each for a different purpose. I have one for each domain I manage of the following: product URL’s, category URL’s, and category URL’s restricted to brands.
I used the SQL XML RAW capability to output my query in an XML format. This works REALLY well.
The most time consuming part was trying to cheat my way into getting a UTF-8 formatted file since MS SQL just can’t do anything but UTF-16. There are easier ways to do this in Linux, in Windows I had to find a tool I could use from a command line. I ended up finding a great tool for converting various types of documents from one encoding to another. It’s called Unifier by Melody-Soft.
At this point all that was left was renaming the files to be in a standard format. I do this with xp_cmdshell as you can see in my SP below.
   1: USE [CART]
   2: GO
   3: /****** Object:  StoredProcedure [dbo].[Sp_CreateSitemaps2]    Script Date: 08/19/2009 12:16:34 ******/
   4: SET ANSI_NULLS ON
   5: GO
   6: SET QUOTED_IDENTIFIER ON
   7: GO
   8: -- =============================================
   9: -- Author:        Sean McWherter
  10: -- Create date: 8-10-09
  11: -- Description:    Generates all sitemaps
  12: -- =============================================
  13: ALTER PROCEDURE [dbo].[Sp_CreateSitemaps] 
  14: AS
  15: BEGIN
  16:     -- SET NOCOUNT ON added to prevent extra result sets from
  17:     -- interfering with SELECT statements.
  18:     SET NOCOUNT ON;
  19: --PS PRODUCTS
  20: DECLARE @x_psproduct xml
  21: DECLARE @strXML_psproduct varchar(MAX)
  22: SET @x_psproduct=(
  23:  
  24: select  --top 200
  25: 'http://www.petstore.com/' + replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(case when WebParent is not null then ParentDesc else description end, ' - ', '_'),'-','_') + 
  26: '_' + replace(LongName, '-', '_') +
  27: '-' + replace(ManufacturerName, '-', '_') +
  28: '-' + idProduct +
  29: '-' + a.GPCat4,
  30: '®',''),'.',''),'/','_'),',',''),'& ',''),' ','_'),'&','_'),'™',''),':',''),'>',''),'<',''),'%','_') +
  31: '-vi.html' as loc,
  32: 'daily' as changefreq
  33: --, idproduct, parentdesc, longname, manufacturername
  34: from MDCART_products a inner join MDCART_CategoriesGP b on a.GpCat4 = b.GPCat1 inner join MDCART_Manufacturers c on a.ManufacturerCode = c.ManufacturerCode
  35: where ((a.idProduct = WebParent and ParentDesc is not null)  or WebParent is null) and LongName is not null and description is not null and ManufacturerName is not null and a.GpCat4 is not null and Visable = 1
  36: and LEFT(gpcat4,2) in (select distinct LEFT(gpcat4,2) cats  from MDCART_products where active <> 0 and gpcat4 is not null)
  37: order by a.idProduct desc
  38: for xml raw('url'), ELEMENTS)
  39:  
  40: SET @strXML_psproduct= '<?xml version="1.0" encoding="UTF-8"?>
  41: <urlset      xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"      xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">'
  42: + CONVERT(varchar(MAX),@x_psproduct) + '</urlset>'
  43:  
  44: --PS CATEGORY FEED
  45: DECLARE @x_pscategory xml
  46: DECLARE @strXML_pscategory varchar(MAX)
  47: SET @x_pscategory=(
  48:  
  49: select  --top 250 
  50: 'http://www.petstore.com/' + replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(LongName, ' - ', '_'),'-','_') + 
  51: '-' + GPCat1,
  52: '®',''),'.',''),'/','_'),',',''),'& ',''),' ','_'),'&','_'),'™',''),':',''),'>',''),'<',''),'%','_') +
  53: '-ct.html' as loc,
  54: 'hourly' as changefreq
  55: from  MDCART_CategoriesGP 
  56: where  Visable = 1 and LongName is not null and GPCat1 is not null
  57: and LEFT(GPCat1,2) in (select distinct LEFT(gpcat4,2) cats  from MDCART_products where active <> 0 and gpcat4 is not null)
  58: order by gpcat1 
  59: for xml raw('url'), ELEMENTS)
  60:  
  61: SET @strXML_pscategory= '<?xml version="1.0" encoding="UTF-8"?>
  62: <urlset      xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"      xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">'
  63: + CONVERT(varchar(MAX),@x_pscategory) + '</urlset>'
  64:  
  65: --PS CATEGORY W/ BRAND FEED
  66: DECLARE @x_pscategory_brand xml
  67: DECLARE @strXML_pscategory_brand varchar(MAX)
  68: SET @x_pscategory_brand=(
  69:  
  70: select  --top 250 
  71: 'http://www.petstore.com/' + replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(ManufacturerName, ' - ', '_'),'-','_') + 
  72: '-' + LongName +
  73: '-' + ManufacturerCode +
  74: '-' + GPCat1,
  75: '®',''),'.',''),'/','_'),',',''),'& ',''),' ','_'),'&','_'),'™',''),':',''),'>',''),'<',''),'%','_') +
  76: '-ct.html' as loc,
  77: 'hourly' as changefreq
  78: from ( SELECT DISTINCT b.ManufacturerName, c.LongName, b.ManufacturerCode, c.GPCat1
  79:         FROM  MDCART_products AS a INNER JOIN MDCART_Manufacturers AS b ON a.ManufacturerCode = b.ManufacturerCode INNER JOIN 
  80:                 MDCART_CategoriesGP AS c ON a.GpCat4 = c.GPCat1 
  81:         where ManufacturerName is not null and LongName is not null and description is not null and b.ManufacturerCode is not null and c.GPCat1 is not null and Visable =1 
  82:         and LEFT(c.GPCat1,2) in (select distinct LEFT(gpcat4,2) cats  from MDCART_products where active <> 0 and gpcat4 is not null)) as a1
  83: order by ManufacturerName, LongName 
  84: for xml raw('url'), ELEMENTS)
  85:  
  86: SET @strXML_pscategory_brand= '<?xml version="1.0" encoding="UTF-8"?>
  87: <urlset      xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"      xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">'
  88: + CONVERT(varchar(MAX),@x_pscategory_brand) + '</urlset>'
  89:  
  90: --MD PRODUCTS
  91: DECLARE @x_mdproduct xml
  92: DECLARE @strXML_mdproduct varchar(MAX)
  93: SET @x_mdproduct=(
  94:  
  95: select  --top 200
  96: 'http://www.marinedepot.com/' + replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(case when WebParent is not null then ParentDesc else description end, ' - ', '_'),'-','_') + 
  97: '_' + replace(LongName, '-', '_') +
  98: '-' + replace(ManufacturerName, '-', '_') +
  99: '-' + idProduct +
 100: '-' + a.GPCat1,
 101: '®',''),'.',''),'/','_'),',',''),'& ',''),' ','_'),'&','_'),'™',''),':',''),'>',''),'<',''),'%','_') +
 102: '-vi.html' as loc,
 103: 'daily' as changefreq
 104: --, idproduct, parentdesc, longname, manufacturername
 105: from MDCART_products a inner join MDCART_CategoriesGP b on a.GPCat1 = b.GPCat1 inner join MDCART_Manufacturers c on a.ManufacturerCode = c.ManufacturerCode
 106: where ((a.idProduct = WebParent and ParentDesc is not null)  or WebParent is null) and LongName is not null and description is not null and ManufacturerName is not null and a.GPCat1 is not null and Visable = 1
 107: and LEFT(a.GPCat1,2) in (select distinct LEFT(gpcat1,2) cats  from MDCART_products where active = -1 and gpcat1 is not null)
 108: order by a.idProduct desc
 109: for xml raw('url'), ELEMENTS)
 110:  
 111: SET @strXML_mdproduct= '<?xml version="1.0" encoding="UTF-8"?>
 112: <urlset      xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"      xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">'
 113: + CONVERT(varchar(MAX),@x_mdproduct) + '</urlset>'
 114:  
 115: --MD CATEGORY FEED
 116: DECLARE @x_mdcategory xml
 117: DECLARE @strXML_mdcategory varchar(MAX)
 118: SET @x_mdcategory=(
 119:  
 120: select  --top 250 
 121: 'http://www.marinedepot.com/' + replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(LongName, ' - ', '_'),'-','_') + 
 122: '-' + GPCat1,
 123: '®',''),'.',''),'/','_'),',',''),'& ',''),' ','_'),'&','_'),'™',''),'>',''),'<',''),'%','_') +
 124: '-ct.html' as loc,
 125: 'hourly' as changefreq
 126: from  MDCART_CategoriesGP 
 127: where  Visable = 1 and LongName is not null  and GPCat1 is not null
 128: and LEFT(GPCat1,2) in (select distinct LEFT(gpcat1,2) cats  from MDCART_products where active = -1 and gpcat1 is not null)
 129: order by gpcat1 
 130: for xml raw('url'), ELEMENTS)
 131:  
 132: SET @strXML_mdcategory= '<?xml version="1.0" encoding="UTF-8"?>
 133: <urlset      xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"      xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">'
 134: + CONVERT(varchar(MAX),@x_mdcategory) + '</urlset>'
 135:  
 136: --MD CATEGORY W/ BRAND FEED
 137: DECLARE @x_mdcategory_brand xml
 138: DECLARE @strXML_mdcategory_brand varchar(MAX)
 139: SET @x_mdcategory_brand=(
 140:  
 141: select  --top 250 
 142: 'http://www.marinedepot.com/' + replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(ManufacturerName, ' - ', '_'),'-','_') + 
 143: '-' + LongName +
 144: '-' + ManufacturerCode +
 145: '-' + GPCat1,
 146: '®',''),'.',''),'/','_'),',',''),'& ',''),' ','_'),'&','_'),'™',''),':',''),'>',''),'<',''),'%','_') +
 147: '-ct.html' as loc,
 148: 'hourly' as changefreq
 149: from ( SELECT DISTINCT b.ManufacturerName, c.LongName, b.ManufacturerCode, c.GPCat1
 150:         FROM  MDCART_products AS a INNER JOIN MDCART_Manufacturers AS b ON a.ManufacturerCode = b.ManufacturerCode INNER JOIN 
 151:                 MDCART_CategoriesGP AS c ON a.gpcat1 = c.GPCat1 
 152:         where ManufacturerName is not null and LongName is not null and description is not null and b.ManufacturerCode is not null and c.GPCat1 is not null and Visable =1 
 153:         and LEFT(c.GPCat1,2) in (select distinct LEFT(gpcat1,2) cats  from MDCART_products where active = -1 and gpcat1 is not null)) as a1
 154: order by ManufacturerName, LongName 
 155: for xml raw('url'), ELEMENTS)
 156:  
 157: SET @strXML_mdcategory_brand= '<?xml version="1.0" encoding="UTF-8"?>
 158: <urlset      xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"      xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">'
 159: + CONVERT(varchar(MAX),@x_mdcategory_brand) + '</urlset>'
 160:  
 161: --MD AQUARIUM PAGE FEED
 162: DECLARE @x_mdaquarium xml
 163: DECLARE @strXML_mdaquarium varchar(MAX)
 164: SET @x_mdaquarium=(
 165:  
 166: select  --top 250 
 167: 'http://www.marinedepot.com/' + replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(idPage, ' - ', '_'),'-','_'),
 168: '®',''),'.',''),'/','_'),',',''),'& ',''),' ','_'),'&','_'),'™',''),':',''),'>',''),'<',''),'%','_') +
 169: '-ap.html' as loc,
 170: 'daily' as changefreq
 171: from MDCART_Aquarium_Page
 172: order by idPage
 173: for xml raw('url'), ELEMENTS)
 174:  
 175: SET @strXML_mdaquarium= '<?xml version="1.0" encoding="UTF-8"?>
 176: <urlset      xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"      xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">'
 177: + CONVERT(varchar(MAX),@x_mdaquarium) + '</urlset>'
 178: ------- OUTPUT
 179: exec xp_cmdshell 'del \\servername\sitemappath\SiteMaps\*.xml', no_output
 180: exec spWriteStringToFile @strXML_psproduct, '\\sangbe\WWW_Live\petstore.com\IMD\SiteMaps', 'utf16-psproduct.xml'
 181: exec spWriteStringToFile @strXML_pscategory, '\\sangbe\WWW_Live\petstore.com\IMD\SiteMaps', 'utf16-pscategory.xml'
 182: exec spWriteStringToFile @strXML_pscategory_brand, '\\sangbe\WWW_Live\petstore.com\IMD\SiteMaps', 'utf16-pscategory_brand.xml'
 183: exec spWriteStringToFile @strXML_mdproduct, '\\sangbe\WWW_Live\petstore.com\IMD\SiteMaps', 'utf16-mdproduct.xml'
 184: exec spWriteStringToFile @strXML_mdcategory, '\\sangbe\WWW_Live\petstore.com\IMD\SiteMaps', 'utf16-mdcategory.xml'
 185: exec spWriteStringToFile @strXML_mdcategory_brand, '\\sangbe\WWW_Live\petstore.com\IMD\SiteMaps', 'utf16-mdcategory_brand.xml'
 186: exec spWriteStringToFile @strXML_mdaquarium, '\\sangbe\WWW_Live\petstore.com\IMD\SiteMaps', 'utf16-mdaquarium.xml'
 187: exec xp_cmdshell 'C:\Unifier\xunifier.exe /c "\\servername\sitemappath\SiteMaps\config1.ucp" /l "\\servername\sitemappath\SiteMaps\sitemapconversion.log" /y "\\servername\sitemappath\SiteMaps\utf16*.xml"', no_output
 188: exec xp_cmdshell 'rename \\servername\sitemappath\SiteMaps\output_utf16-psproduct.xml psproduct.xml', no_output
 189: exec xp_cmdshell 'rename \\servername\sitemappath\SiteMaps\output_utf16-pscategory.xml pscategory.xml', no_output
 190: exec xp_cmdshell 'rename \\servername\sitemappath\SiteMaps\output_utf16-pscategory_brand.xml pscategory_brand.xml', no_output
 191: exec xp_cmdshell 'rename \\servername\sitemappath\SiteMaps\output_utf16-mdproduct.xml mdproduct.xml', no_output
 192: exec xp_cmdshell 'rename \\servername\sitemappath\SiteMaps\output_utf16-mdcategory.xml mdcategory.xml', no_output
 193: exec xp_cmdshell 'rename \\servername\sitemappath\SiteMaps\output_utf16-mdcategory_brand.xml mdcategory_brand.xml', no_output
 194: exec xp_cmdshell 'rename \\servername\sitemappath\SiteMaps\output_utf16-mdaquarium.xml mdaquarium.xml', no_output
 195: exec xp_cmdshell 'del \\servername\sitemappath\SiteMaps\utf16*.xml', no_output
 196: END
Finally this last snippet shows the syntax I used for pinging the biggest search engines to let them know I’ve updated my sitemap. I have a SQL job that runs nightly. It first executes this SP, then pings each search engine. Below is a sample for Google. Just create a step in your job formatted like you see below to run this javascript.
   1: var WebAddress = "http://www.google.com/webmasters/tools/ping?sitemap=http://www.marinedepot.com/sitemapfolder/mdaquarium.xml";
   2: function main()
   3: {
   4:     var xmlHttpRequest = CreateObject("Microsoft.XMLHTTP");   
   5:     xmlHttpRequest.open("GET", WebAddress, false);
   6: }
   7:  
   8: var WebAddress2 = "http://www.google.com/webmasters/tools/ping?sitemap=http://www.marinedepot.com/sitemapfolder/mdaquarium.xml";
   9: function main2()
  10: {
  11:     var xmlHttpRequest2 = CreateObject("Microsoft.XMLHTTP");   
  12:     xmlHttpRequest2.open("GET", WebAddress2, false);
  13: }
  14:  
  15: var WebAddress3 = "http://www.google.com/webmasters/tools/ping?sitemap=http://www.marinedepot.com/sitemapfolder/mdcategory_brand.xml";
  16: function main3()
  17: {
  18:     var xmlHttpRequest3 = CreateObject("Microsoft.XMLHTTP");   
  19:     xmlHttpRequest3.open("GET", WebAddress3, false);
  20: }
  21:  
  22: var WebAddress4 = "http://www.google.com/webmasters/tools/ping?sitemap=http://www.marinedepot.com/sitemapfolder/mdproduct.xml";
  23: function main4()
  24: {
  25:     var xmlHttpRequest4 = CreateObject("Microsoft.XMLHTTP");   
  26:     xmlHttpRequest4.open("GET", WebAddress4, false);
  27: }
  28:  
  29: var WebAddress5 = "http://www.google.com/webmasters/tools/ping?sitemap=http://www.petstore.com/sitemapfolder/pscategory.xml";
  30: function main5()
  31: {
  32:     var xmlHttpRequest5 = CreateObject("Microsoft.XMLHTTP");   
  33:     xmlHttpRequest5.open("GET", WebAddress5, false);
  34: }
  35:  
  36: var WebAddress6 = "http://www.google.com/webmasters/tools/ping?sitemap=http://www.petstore.com/sitemapfolder/pscategory_brand.xml";
  37: function main6()
  38: {
  39:     var xmlHttpRequest6 = CreateObject("Microsoft.XMLHTTP");   
  40:     xmlHttpRequest6.open("GET", WebAddress6, false);
  41: }
  42:  
  43: var WebAddress7 = "http://www.google.com/webmasters/tools/ping?sitemap=http://www.petstore.com/sitemapfolder/psproduct.xml";
  44: function main7()
  45: {
  46:     var xmlHttpRequest7 = CreateObject("Microsoft.XMLHTTP");   
  47:     xmlHttpRequest7.open("GET", WebAddress7, false);
  48: }
image

No comments:

Post a Comment