Microsoft Clustering
Data Mining with Microsoft SQL Server 2008 Review Chapter 9
This chapter discusses an unsupervised machine learning technique called clustering; the name “Microsoft Clustering” refers to a particular implementation (you get two for one: k-means and expectation maximization). The term unsupervised means that the analyst does not provide a target (or dependent) variable. My outline for this post:
- Does clustering produce natural groups?
- Chapter 9 Project (provided by the authors)
- Chapter 9 DMX
- Two Dimensional Cluster Text (Excel)
- Scoring Sample DMX
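Before getting into those topics, here is a minimal sketch of how the two methods named above differ when assigning a case to a cluster. It is not Microsoft's implementation: the centers, the shared `sigma`, and the function names are assumptions I made up for illustration. The k-means style gives each case to exactly one cluster, while the expectation maximization style gives each case a probability of membership in every cluster.

```python
import math

def hard_assign(x, centers):
    """k-means style: return the index of the single nearest center."""
    return min(range(len(centers)), key=lambda i: abs(x - centers[i]))

def soft_assign(x, centers, sigma=1.0):
    """EM style: return membership probabilities for every cluster.

    Assumes equal-weight Gaussian clusters sharing one sigma (an
    illustrative simplification, not the product's actual model).
    """
    weights = [math.exp(-((x - c) ** 2) / (2 * sigma ** 2)) for c in centers]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical one-dimensional cluster centers.
centers = [0.0, 4.0]
for x in (0.5, 2.0, 3.5):
    probabilities = [round(p, 3) for p in soft_assign(x, centers)]
    print(x, hard_assign(x, centers), probabilities)
```

A case halfway between the two centers shows the difference most clearly: the hard assignment still has to pick one cluster, while the soft assignment splits the membership evenly.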
Does clustering produce natural groups?
I start with a philosophical question, one important to the statistical interpretation of what clustering produces. As I pondered the word natural, I came to believe the discussion extends beyond the clustering algorithm to any unsupervised machine learning algorithm, even ones not included in SQL Server Data Mining. Other algorithms may not create groups, but the more general question about unsupervised methods is whether machine learning algorithms somehow uncover natural processes. Think about the word natural and what it means to science and business. The authors use the word several times:
Clustering is a simple, natural, and even automatic human operation (page 291)
The Microsoft Clustering algorithm finds natural groupings inside your data when these groupings are not obvious (page 292)
Identifying natural groups in your data frees you from simply analyzing your business based on the existing organization (page 292)
I consider myself a scientist, and if you have a “Computer Science” degree then I believe you are a scientist too. Culturally, it is common for people to reserve the word scientist for a particular type of person, perhaps one paid to do original research. I find that narrow restriction limiting, because even people without science degrees add to the science conversation in culture. Perhaps no single group adds more to what people consider science than marketers. Anyone who sells a product will find themselves using scientific language about the nature of the universe and causal effects.
The word natural is important to science: it implies a causal relationship between the known physical mathematics and laws governing the universe and the things we see happen. I have to mention a @shanselman Tweet I saw this week: a father had a question from his 4-year-old son about the Darth Vader character from the Star Wars movie. The kid asked his dad, “Why is Darth Vader so bad? Did someone make him bad?” The simple questions that kids ask underlie the important philosophical ideas about what is natural and what is unnatural. I challenge you with a deeper thought: some people promoting natural believe nothing is unnatural (ask them).
I have published research before, and have participated in a wide variety of business intelligence projects across industries. I am currently mentoring several doctoral learners at the University of Phoenix, some of whom are doing business intelligence research, and I anticipate continuing to encourage and support the use of the scientific method. There is a larger scientific debate about whether events happen through intentional action or through natural processes. That discussion includes almost any topic people might use clustering for, including problems in genetics, medicine, biology, psychology, sociology, marketing, or finance. I will provide a general rule here: not just clustering but all the data mining algorithms provide evidence of correlation, and correlation is not sufficient evidence to conclude causality. In practice, my observation is that people may apply scientific induction to conclude causality (rather than the typically higher standard of deductive science).
I have decided to examine whether other data mining authors use this word natural to describe clustering. This first group favors the term:
When human beings try to make sense of complex questions, our natural tendency is to break the subject into smaller pieces, each of which can be explained more simply (Berry & Linoff, 1997, page 187)
The whole point of automatic cluster detection is to find clusters that make sense to you. (Berry & Linoff, 1997, page 205)
Examples of undirected data mining include determining what products should be grouped together for a specialty catalog, finding groups of readers or listeners with similar tastes in books and music, and discovering natural customer segments for market analysis. (Berry & Linoff, 2000, page 103)
Clustering techniques apply when there is no class to be predicted but rather when the instances are to be divided into natural groups… Clustering naturally requires different techniques to the classification and association learning methods we have considered so far. (Witten & Frank, 2005, page 136; Chakrabarti et al., 2009, page 184)
Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. If meaningful groups are the goal, then the clusters should capture the natural structure of the data… Classes, or conceptually meaningful groups of objects that share common characteristics, play an important role in how people analyze and describe the world. Indeed human beings are skilled at dividing objects into groups (clustering) and assigning particular objects to these groups (classification). (Tan, Steinbach & Kumar, 2006, page 487)
I have a strong suspicion that the term natural crept into the scientific discussion through genetics or biology (see Tan, Steinbach & Kumar, 2006, page 488). However, apply clustering to crime investigation, and we lose all sense of what natural might mean. In a criminal investigation, the default assumption is that events happen because of intelligent or intentional causes. I believe I live in a world which mixes both natural and intelligent causes. This second group of authors takes a more neutral approach in defining clustering:
Clustering, like the dimensionality reduction methods discussed… can be used for two purposes: it can be used for data exploration [and] to understand the structure of data… If such groups are found, these may be named (by application experts) and their attributes be defined. (Alpaydin, 2010, page 255)
The purpose of clustering techniques is to detect similar subgroups among a large collection of cases and assign those observations to the clusters… The clusters are assigned a sequential number to identify them in results reports… Just as important as identifying such clusters is the need to determine how those clusters are different. (Nisbet, Elder & Miner, 2009, page 147)
The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters… By automated clustering, we can identify dense and sparse regions in object space and, therefore, discover overall distribution patterns and interesting correlation among data attributes… In machine learning, clustering is an example of unsupervised learning… Clustering is a form of learning by observation, rather than learning by examples. (Han & Kamber, 2006, pages 383-384)
I prefer the last definition by Han and Kamber (2006), and I make that call based not only on what I quoted but also on comparing the surrounding text of that and the other quotes I listed. Their quote clearly mentions the word correlation, and does not stray into the debate on natural versus intelligent agency. I return to my earlier mention of Darth Vader and the criminal investigation. In American courts, the job of defense attorneys is to show that what allegedly happened is the result of natural causes, not generally or specifically the fault of the defendant(s). The job of the prosecution is to provide evidence beyond reasonable doubt that some event happened through intelligent causes. My advice to you is to pay attention to the story people tell with data mining, and how they are using clustering: are they looking for intelligent causes or are they looking for natural causes?
Chapter 9 Project (Provided by the Authors)
I opened up the Chapter 9 project, based on the Movie Click database. I’m not sure what happened, but the solution includes an application of decision trees. I’m guessing the code was copied from other work, but not aligned to the topic of clustering:
Chapter 9 DMX
The DMX provided again leverages the ASSProcs.DLL provided by the authors (which would need to be added to your Analysis Services instance). I habitually run scripts as if they were authored to be run as a unit, and I believe most people make that assumption too. Sometimes, developers will put snippets into a code file, then select and run only sections. I suspect that process happened in this code, because I hit an error when I attempted to run the following section:
The issue here is that there should be a GO before the second SELECT command. Later, another section would not work:
To investigate, you need to log on to the database instance where your Adventure Works data is. In my case, here is a picture of the DimCustomer area:
In production environments, it is common to see databases, tables, or keys change names. In this case, changing the name to “CustomerKey” in the DMX will fix the issue. The DMX code overall shows how to create a structure, create a clustering mining model, and then obtain results from that model. I noticed this time that, unlike the SSMS 2008 T-SQL window, the DMX interface always sends the output to a grid. By contrast, the T-SQL window defaults to grid, but also allows output as plain text or to a file.
Two Dimensional Cluster Text (Excel)
I did NOT know we could do surface graphs in Excel. I typically find myself using SAS or other tools to make snazzy graphs. I changed the graph provided by the authors so that it was on a black background with different shading from what they provided:
The idea behind this graph is that the independent x and y variables (defining the floor dimensions) result in a different likelihood value (plotted in the z or vertical direction). The likelihood relationship produces the surface illustrated by the graph. Now, I did not spend a lot of time trying to understand the Excel workbook, but I like the concept because Excel can be used for Visual Analytics (as I have been arguing in this blog). From the front page, you have a panel where you can use the visual increment buttons to change values and generate sample data:
Also, look at the Visual Basic code. The “developer” tab does not show by default, so you will have to turn it on. Then, click the Visual Basic button, and look at the code under each of the sub-icons (some of the code screens are blank):
The individual sheets contain code which may be for controls on those sheets. The module has two macros. The two class modules provide callable functions which can be used in other areas. This project involves advanced Excel development skills, though as I said, I like using Excel as an interface for interacting with data mining because the interface is natural for data organization. How’s that for a natural defense?
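For readers without the workbook, the likelihood surface described in this section can be approximated in a few lines. This is a sketch under my own assumptions, not a port of the authors' Excel code: the two cluster centers, weights, and spreads are hypothetical, and I assume round (isotropic) Gaussian clusters. Each (x, y) point on the floor gets a z value equal to the weighted sum of the cluster densities at that point.

```python
import math

def gaussian_2d(x, y, mx, my, sigma):
    """Density of an isotropic two-dimensional Gaussian centered at (mx, my)."""
    d2 = (x - mx) ** 2 + (y - my) ** 2
    return math.exp(-d2 / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)

def mixture_surface(clusters, xs, ys):
    """Return z[row][col], the mixture likelihood at each (xs[col], ys[row])."""
    return [
        [sum(w * gaussian_2d(x, y, mx, my, s) for w, mx, my, s in clusters)
         for x in xs]
        for y in ys
    ]

# Hypothetical clusters as (weight, center_x, center_y, sigma).
clusters = [(0.5, 2.0, 2.0, 1.0), (0.5, 6.0, 5.0, 1.0)]
xs = [i * 0.5 for i in range(17)]  # floor axis x from 0 to 8
ys = [j * 0.5 for j in range(17)]  # floor axis y from 0 to 8
z = mixture_surface(clusters, xs, ys)

# The surface rises near each cluster center and falls away between them.
print(round(z[4][4], 4), round(z[8][8], 4))
```

The resulting grid has the same shape as the data behind a surface chart, so it could be pasted into a worksheet or handed to any plotting tool.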
Scoring Sample DMX
The last example shows how to use the NATURAL PREDICTION JOIN to match cluster predictions with source data. I like this example, though I find it better to use the Excel Data Mining Add-In to perform the same task. See if you can make the Data Mining add-in work with the model you create with this DMX code. See if you can replicate the DMX results in an Excel workbook.
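As I understand the NATURAL keyword, it tells the prediction join to match source columns to model columns by name instead of through an explicit ON clause. The sketch below is a loose analogy in Python rather than DMX, and it swaps in a nearest-center rule for the engine's real scoring. The model contents and the `Age` and `YearlyIncome` columns are hypothetical; only `CustomerKey` comes from the discussion above.

```python
import math

# Hypothetical model: cluster centers keyed by attribute name.
MODEL = {
    "Cluster 1": {"Age": 30.0, "YearlyIncome": 40.0},
    "Cluster 2": {"Age": 55.0, "YearlyIncome": 90.0},
}

def natural_join_score(row, model):
    """Score one source row using only columns whose names match the model."""
    def distance(center):
        # Columns the model does not know (such as CustomerKey) are ignored.
        shared = [name for name in center if name in row]
        return math.sqrt(sum((row[name] - center[name]) ** 2 for name in shared))

    best = min(model, key=lambda label: distance(model[label]))
    # Return the source columns alongside the predicted cluster label.
    return {**row, "Cluster": best}

# Hypothetical source rows standing in for the customer table.
source_rows = [
    {"CustomerKey": 1, "Age": 28, "YearlyIncome": 35},
    {"CustomerKey": 2, "Age": 60, "YearlyIncome": 95},
]
scored = [natural_join_score(row, MODEL) for row in source_rows]
for result in scored:
    print(result["CustomerKey"], result["Cluster"])
```

The point of the analogy is the name matching: the key column passes through to the output untouched, while the matching attribute columns drive the cluster prediction.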
Back to Darth Vader
I can’t leave Star Wars alone. The Emperor’s continual argument to Luke Skywalker was that it was his destiny to join the dark side. And yet, both Darth Vader and the Emperor tried to make Luke choose the dark side. As much as we make natural destiny arguments, the reality of free will and intelligent action will continue to challenge people who assume that the universe is a 100% product of natural action. The defense attorneys’ best clients will not have any intelligence and will act completely according to natural instinct, no decisions required.
References
The MSDN Documentation provides excellent information online about this algorithm:
- Mining Model Content for Clustering Models
  http://msdn.microsoft.com/en-us/library/cc645761.aspx
- Microsoft Clustering Algorithm
  http://msdn.microsoft.com/en-us/library/ms174879.aspx
- Microsoft Clustering Algorithm Technical Reference
  http://msdn.microsoft.com/en-us/library/cc280445.aspx
- Viewing a Mining Model with the Microsoft Cluster Viewer
  http://msdn.microsoft.com/en-us/library/ms174801.aspx
- Querying a Clustering Model
  http://msdn.microsoft.com/en-us/library/cc280440.aspx
Alpaydin, E. (2010). Introduction to Machine Learning (2nd ed.). Cambridge, MA: The MIT Press.
Berry, M. J. A., & Linoff, G. (1997). Data Mining Techniques: For Marketing, Sales, and Customer Support. New York, NY: John Wiley & Sons, Inc.
Berry, M. J. A., & Linoff, G. (2000). Mastering Data Mining: The Art and Science of Customer Relationship Management. New York, NY: John Wiley & Sons, Inc.
Chakrabarti, S., Cox, E., Frank, E., Guting, R. H., Han, J., Jiang, X., . . . Witten, I. H. (2009). Data Mining: Know It All. Burlington, MA: Elsevier Inc.
Keim, D. A., Mansmann, F., & Thomas, J. (2009). Visual Analytics: How much visualization and how much analytics? SIGKDD Explorations, 11(2), 5-8.
MacLennan, J., Tang, Z., & Crivat, B. (2009). Data Mining with Microsoft SQL Server 2008. Indianapolis, IN: Wiley Publishing Inc.
Nisbet, R., Elder, J., & Miner, G. (2009). Handbook of Statistical Analysis & Data Mining Applications. Burlington, MA: Elsevier, Inc.
Pyle, D. (1999). Data Preparation for Data Mining. San Diego, CA: Academic Press.
Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining. Boston, MA: Pearson Education Inc.
Witten, I. H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques (2nd ed.). San Francisco, CA: Morgan Kaufmann Publishers.










