SQL Server Data Mining Capacities 2008 R2

I was wondering what the maximum capacities were for data mining, and could not find the answer in SQL Server Books Online. So, I asked the Microsoft Analysis Services product team for the answer. If you read this blog, sometimes you get insider information.

They provided me with the data mining capacities in the following table (the first four rows). This information is NOT yet in SQL Server Books Online, but Microsoft promised that it will be. I want to stress that these capacities are theoretical limits; in practice, other limitations (such as human management skill or the NTFS file system) prevent people from reaching them.

Predixion Software Invites Beta Testers

For those following this blog, I have been writing about SQL Server data mining, based on a book coauthored by Jamie MacLennan and Bogdan Crivat. Both were on the Microsoft product development team for this technology, and in recent months both have publicly blogged about their new venture, Predixion Software.

According to its website today, Predixion Software is asking for beta participants starting August 16. I have heard their product will have native PowerPivot support. They are asking for:

  • SQL Server Data Mining Add-In Users
  • PowerPivot Users
  • Excel Users

I have high confidence in Jamie MacLennan (now the Chief Technology Officer of Predixion Software) and Bogdan Crivat and their entire development team (see a photo at http://jamiemaclennan.blogspot.com/2010/04/cheers-from-predixion-dev-team.html and better yet the video at http://dai.ly/dwY2gA).

I encourage you to visit the Predixion Software website if you are interested in data mining and have ever used Excel. Apply to be part of the beta, and tell your friends.

Be on the inside track, starting August 16.

http://www.predixionsoftware.com

Microsoft Sequence Clustering

Data Mining with Microsoft SQL Server 2008 Review Chapter 10

This chapter pays homage to Andrey Markov, the 19th-century Russian mathematician who proposed what we today call Markov chains (page 334). The title reveals the nature of this machine learning algorithm: it combines sequence analysis with clustering. The outline for this blog post:

  • Revisiting Natural Groups
  • Recap of the Authors’ Web Click Solution
  • Further Discussion on Web Click Analysis
  • Authors’ DMX Code
  • Expanding the Applications for Microsoft Sequence Clustering
  • Persons Produce Wisdom
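
Before the first item, the Markov chain idea mentioned above can be made concrete with a small sketch. This is not the authors' code and not Microsoft's implementation; it only estimates first-order transition probabilities from a few made-up click paths. The function name and the page categories are assumptions for illustration.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Estimate first-order Markov transition probabilities from sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Count each adjacent pair (current state -> next state).
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    probs = {}
    for state, next_counts in counts.items():
        total = sum(next_counts.values())
        probs[state] = {nxt: n / total for nxt, n in next_counts.items()}
    return probs

# Hypothetical web click paths (page categories invented for illustration).
clicks = [
    ["Home", "News", "Sports"],
    ["Home", "News", "Weather"],
    ["Home", "Sports"],
]
probs = transition_probabilities(clicks)
print(probs["Home"])  # share of visits moving from Home to each next page
```

A sequence clustering model goes further by grouping whole sequences whose transition behavior is similar; the sketch stops at the single-chain step.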

Revisiting Natural Groups

As with the last chapter, this chapter also refers to natural groups, “The number of natural groups in a sequence clustering model” (page 339), and an example on page 335 explicitly mentions genetic sequencing. Data mining is an applied science, and therefore I recommend avoiding the word natural to describe grouping, since this word refers to a specific presupposed world view often attached to the genetic and biological sciences.

Microsoft Clustering

Data Mining with Microsoft SQL Server 2008 Review Chapter 9

This chapter discusses an unsupervised machine learning algorithm called clustering, and “Microsoft Clustering” implies a particular implementation (you get two for one, k-means and expectation maximization).  The term unsupervised means that the analyst does not provide a target (or dependent) variable.  My outline for this post:

  • Does clustering produce natural groups?
  • Chapter 9 Project (provided by the authors)
  • Chapter 9 DMX
  • Two Dimensional Cluster Text (Excel)
  • Scoring Sample DMX
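
To illustrate what "unsupervised" means in practice, here is a minimal k-means sketch on one-dimensional data. It is a plain textbook version, not the Microsoft Clustering implementation; the function name, the sample ages, and the starting centers are all assumptions. Note that no target variable is supplied anywhere.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on one-dimensional data, starting from given centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical customer ages; the analyst provides no labels.
ages = [22, 25, 27, 61, 64, 68]
centers, clusters = kmeans_1d(ages, centers=[20, 70])
print(centers, clusters)
```

The groups that come out depend on the starting centers and on the number of clusters requested, which is one reason to be careful about calling them "natural".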

Does clustering produce natural groups?

I start with a philosophical question, one important to the statistical interpretation of what clustering produces. Having pondered the word natural, I believe the discussion extends beyond the clustering algorithm to any unsupervised machine learning algorithm, even ones not included in SQL Server Data Mining.

Microsoft Time Series Algorithm

Data Mining with Microsoft SQL Server 2008 Review Chapter 8

I have commented several times that time series was an entire class when I was in graduate school. It was an appropriate topic for that stage (either graduate school or late in an undergraduate program) because calculus is required to communicate the mathematics. If I had to bet on a single data mining algorithm used across all situations, companies, countries, and industries, this one would be it. For the 2008 version, Microsoft has made good improvements to this algorithm, allowing analysts to tune parameters depending on the situation. Among all the available Microsoft data mining algorithms, I believe parameter choices affect the results of this one the most, which might justify building multiple models for comparison (since only empirical results can demonstrate which settings work best).

Time series was a big topic for W. Edwards Deming. He used this subject to demonstrate what variance is, and whether a system is in control.
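
As a rough illustration of the "in control" idea, the sketch below computes three-sigma limits from a baseline period and flags later points that fall outside them. It is a simplification under stated assumptions: the function names and the data are made up, and it uses the plain sample standard deviation, whereas control charts in practice often estimate spread from moving ranges.

```python
from statistics import mean, stdev

def control_limits(baseline, sigmas=3.0):
    """Return (lower, center, upper) limits from a baseline period."""
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation (a simplification)
    return center - sigmas * spread, center, center + sigmas * spread

def out_of_control(series, lower, upper):
    """Return (index, value) pairs that fall outside the limits."""
    return [(i, x) for i, x in enumerate(series) if x < lower or x > upper]

# Hypothetical daily measurements: a stable baseline, then new observations.
baseline = [10, 12, 11, 9, 10, 11, 12, 10, 9, 11]
lower, center, upper = control_limits(baseline)
flagged = out_of_control([10, 11, 18, 9], lower, upper)
print(flagged)
```

Points inside the limits reflect ordinary variation in the system; a flagged point suggests something changed and is worth investigating.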

Microsoft Decision Trees Algorithm

Data Mining with Microsoft SQL Server 2008 Review Chapter 7 

Decision Trees is one of the most useful algorithms. This algorithm conceptually extends modeling into a tree of nested models, where each branch provides a tailored understanding of the training data. This blog post will track the DMX code, which provides much of the discussion framework for the chapter. You can get this code for free from the authors’ (actually the publisher’s) website, but if you want to be a data mining professional you should also have the book. This same single algorithm encompasses both Microsoft Decision Trees and Microsoft Linear Regression.

The sample DMX code refers to the ASSprocs stored procedure. That code is available from http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470277742,descCd-DOWNLOAD.html. While I was looking for the code, I discovered that this book is available from Wiley in eBook format (see the previous link), and optionally you can read the eBook as part of Safari Books Online (a subscription service): http://my.safaribooksonline.com/9780470277744.


Microsoft Naïve Bayes

Data Mining with Microsoft SQL Server 2008 Review Chapter 6

The book now moves into a series of chapters, six through twelve, that take an in-depth look at the individual algorithms. I will repeat a comment from earlier in this series: this book was authored by the technology gurus who developed this software. The text supplements and extends what is free through the MSDN product documentation (separately downloadable as SQL Server Books Online). The book has two important features:

  • Detailed how-to tutorials and instructions on using the technology
  • Behind-the-scenes technical tips which, though authoritative, cannot and should not be in the product documentation, because Microsoft wants to promise functionality, not implementation. In other words, how a product is implemented may change, though the functions should stay consistent with the Microsoft documentation.

Now, let’s talk about the use of Microsoft in the chapter title (in this and subsequent chapters) to describe the algorithms. The Naïve Bayes machine learning algorithm is well known in the literature. Microsoft has made tweaks, ranging from minor to major, to each algorithm, allowing the company to rightfully claim the implementation as its own. I do not have personal knowledge of whether these changes amount to a patent level of unique creation, but they are certainly enough to qualify for a copyright. Later, chapter 17 will talk about extending this technology and developing your own algorithms. Thus, it’s fair for Microsoft to put its name on its algorithms, and that name persists through the data mining wizards and interfaces. Some future third-party developers might choose to make their own implementations of these same algorithms and add their own names. If you choose to make one, I encourage you to share it, or at least a free version of it, on the open-source community site codeplex.com.
