Data Mining with Microsoft SQL Server 2008 Review Chapter 10
This chapter pays homage to Andrey Markov, the 19th-century Russian mathematician who proposed what we now call Markov chains (page 334). The combination of sequence and clustering in the title describes this machine learning algorithm: it clusters cases according to the sequences of states they contain. The outline for this blog post:
- Revisiting Natural Groups
- Recap of the Authors’ Web Click Solution
- Further Discussion on Web Click Analysis
- Authors’ DMX Code
- Expanding the Applications for Microsoft Sequence Clustering
- Persons Produce Wisdom
Revisiting Natural Groups
As in the last chapter, this chapter refers to natural groups, “The number of natural groups in a sequence clustering model” (page 339), and an example on page 335 explicitly mentions genetic sequencing. Data mining is an applied science, and therefore I recommend avoiding the word natural to describe grouping, since this word refers to a specific presupposed world view often attached to genetic and biological sciences. As a consultant analyzing general business intelligence questions, I am not likely to tell my clients that sequence clusters result from natural processes, since that conclusion would require scientific evidence.
Recap of the Authors’ Web Click Solution
This section covers the authors’ example solution in BIDS (Business Intelligence Development Studio). The book’s example database is based on web clicks, and the two formative tables provide information on how people are choosing (free will) to use a website. We say choosing, but I believe that people are guided by their own value systems, world view, and visual cues on a screen. The following two screenshots show the first few rows of tables provided by the authors:
The customer table has basic identification for customers, starting with customer zero. You can start with any number, since the important linkage is that the keys in the two tables match (they could even be alphanumeric keys). The ClickPath table has a specific structure. First, the URL Category was determined from the actual URL. Many URLs these days carry additional parameters, and unless a significant number of cases fall into the same category, this algorithm will not find usable patterns. If the web traffic were much larger, perhaps analyzing by subcategory would be possible. Higher traffic sites have a mathematical advantage in allowing more detailed pattern recognition, so traffic volume limits how detailed a data mining model can be. That pattern holds across the algorithms, since more data provides better analytic power.
Second, the Sequence ID column was perhaps determined from a date and time stamp. As a general rule, transactional databases should record the date and time of each transaction, and a single transactional entry may carry several datetimes, including the original datetime and the last modified datetime. In this example, we assume that each row has a single datetime, namely when a URL was requested. I suppose a more sophisticated web server could theoretically provide two times: the datetime of initial data transmission and the datetime of final data transmission. Once available, datetimes can be sorted and sequence numbers can be assigned.
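The two preparation steps described here (deriving a category from a URL, then numbering clicks by datetime) can be sketched in Python. This is only an illustration and not the authors' preparation code: the sample rows, the function names, and the rule that maps a URL to a category (first path segment, parameters dropped) are all assumptions.

```python
from collections import defaultdict
from datetime import datetime
from urllib.parse import urlparse

# Hypothetical raw log rows: (customer key, request datetime, requested URL).
raw_clicks = [
    (0, "2009-03-01 10:05:00", "http://example.com/news/story?id=7"),
    (0, "2009-03-01 10:01:00", "http://example.com/"),
    (1, "2009-03-01 09:30:00", "http://example.com/sports/scores"),
    (0, "2009-03-01 10:09:00", "http://example.com/weather/today"),
    (1, "2009-03-01 09:45:00", "http://example.com/news/"),
]


def url_category(url):
    """Assumed rule: the first path segment is the category; parameters are dropped."""
    path = urlparse(url).path.strip("/")
    return path.split("/")[0].capitalize() if path else "Frontpage"


def build_click_path(rows):
    """Sort each customer's clicks by datetime and number them 1, 2, 3, ..."""
    by_customer = defaultdict(list)
    for customer, stamp, url in rows:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        by_customer[customer].append((when, url_category(url)))
    click_path = []
    for customer in sorted(by_customer):
        ordered = sorted(by_customer[customer])
        for sequence_id, (_, category) in enumerate(ordered, start=1):
            click_path.append((customer, sequence_id, category))
    return click_path


# Rows shaped like the ClickPath table: (customer key, sequence ID, URL category).
click_path = build_click_path(raw_clicks)
```

Dropping the URL parameters is what keeps the number of distinct categories small enough for each one to collect a meaningful number of cases.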
I will next share some screen shots. The first screen shot describes the mining model in the authors’ example:
Note that “Click Path” is a nested table, again showcasing a feature of SQL Server Data Mining: analysis of nested tables. In other software programs, you might have to flatten the entries into a single table. As I showed in the table structure, this example does NOT require flattening before analysis, and is therefore a robust design for high volume analysis.
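To make the nested versus flattened distinction concrete, this Python sketch builds both shapes from the same rows. The customer keys, categories, and function names are invented for illustration; the sketch shows the data shapes only and says nothing about how Analysis Services stores a nested table internally.

```python
# Hypothetical ClickPath rows: (customer key, sequence ID, URL category).
click_path = [
    (0, 1, "Frontpage"), (0, 2, "News"), (0, 3, "Weather"),
    (1, 1, "Sports"), (1, 2, "News"),
]


def nest(rows):
    """One case per customer, each holding its own ordered list of clicks."""
    cases = {}
    for customer, sequence_id, category in sorted(rows):
        cases.setdefault(customer, []).append(category)
    return cases


def flatten(rows, width):
    """One row per customer with fixed columns; short paths are padded with None."""
    flat = []
    for customer, path in nest(rows).items():
        padded = (path + [None] * width)[:width]
        flat.append((customer, *padded))
    return flat


nested = nest(click_path)
flat = flatten(click_path, width=3)
```

The flattened form forces a fixed width, so longer paths are truncated and shorter ones padded, while the nested form keeps every path at its own length.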
The next table provides the mining legend. The software provides a color-coded legend, and the long list shows the number of distinct categories in the initial list. I am not color-blind, but I am sensitive to the possibility that this interface is challenging for color-blind people, especially since you do not have the option of changing the automatically assigned color palette. I asked Wolfram Alpha “percentage color blind” and it returned 0.0019% (estimates based on 131,748 patients to healthcare providers, 2006 and 2007). The list below is not even the whole list.
The next graphic shows cluster characteristics, with Cluster 9 showing high for “News”. This default view abbreviates the terms in the States column (as shown in the graphic), but you can show the full title by changing the column width (which I did NOT do). These default views are intended to communicate all the information, but people who want the story will only want the selected portions that a data analyst would choose.
The next graphic focuses on what the machine learning algorithm calls Cluster 9. The Values column shows the transitional states which favor this cluster. The whole list is long, and in application, a data analyst would provide decision makers with selected results from this screen (though I believe everyone should have access to the full list and results, since they may use the same data from the machine learning algorithm and come to a different conclusion).
This next graphic shows the cluster discrimination. The interactive interface allows comparing clusters with one another. One of the choices is “population”, the group reflecting the entire sample presented to the machine learning algorithm. You would want clusters which are different from one another, and also different from the population. Sequence clustering determines these differences based on the transition states. Note how this approach is different from regular clustering, since in this machine learning algorithm, the demographic (if you can use that word) is a dynamic motion from one state to the next.
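The “dynamic motion from one state to the next” can be summarized as first-order Markov chain transition probabilities. The following Python sketch estimates them by counting consecutive pairs of states. The sessions and the function name are invented, and the sketch does not reproduce the clustering step or the internals of the Microsoft algorithm.

```python
from collections import Counter, defaultdict

# Hypothetical sessions, each an ordered list of URL categories.
sessions = [
    ["Frontpage", "News", "News", "Weather"],
    ["Frontpage", "News", "Sports"],
    ["Frontpage", "Sports", "Sports"],
]


def transition_probabilities(paths):
    """Estimate P(next state | current state) from observed consecutive pairs."""
    counts = defaultdict(Counter)
    for path in paths:
        for current, following in zip(path, path[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
        for state, nexts in counts.items()
    }


probs = transition_probabilities(sessions)
```

In this toy data, two of the three sessions move from Frontpage to News, so that transition gets an estimated probability of two thirds. Comparing such tables between two groups of sessions is one way to picture what a cluster discrimination view contrasts.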
I am not showing the State Transitions tab (which appears in my screenshots) since the output produced a single bubble icon for each of the categories (and I did show part of that long list). I finish this example by stating again what it did:
- Microsoft Sequence Clustering accepted a nested table structure with the sequence and unique identifier (representing in this case web usage sessions)
- The machine learning algorithm used the sequences to produce clusters of similar sequences
- The analysis windows show the characteristics of these clusters
- Running new observations against this model would classify those new observations by cluster based on their sequencing
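The last point, classifying a new observation by its sequencing, can be sketched in Python as picking the cluster whose transition probabilities make the new sequence most likely. The two clusters, their probabilities, and the `FLOOR` value are hypothetical and not taken from the book's model. The sketch also ignores cluster sizes and starting-state probabilities, which a fuller treatment would include.

```python
import math

# Hypothetical transition probabilities for two clusters (not from the book's model).
clusters = {
    "News readers": {
        "Frontpage": {"News": 0.8, "Sports": 0.2},
        "News": {"News": 0.7, "Sports": 0.3},
        "Sports": {"News": 0.5, "Sports": 0.5},
    },
    "Sports fans": {
        "Frontpage": {"News": 0.2, "Sports": 0.8},
        "News": {"News": 0.4, "Sports": 0.6},
        "Sports": {"News": 0.1, "Sports": 0.9},
    },
}

FLOOR = 1e-6  # assumed small probability for transitions a cluster has never seen


def log_likelihood(path, transitions):
    """Sum of log transition probabilities along the path."""
    total = 0.0
    for current, following in zip(path, path[1:]):
        p = transitions.get(current, {}).get(following, FLOOR)
        total += math.log(p)
    return total


def classify(path):
    """Return the name of the cluster under which the path is most likely."""
    return max(clusters, key=lambda name: log_likelihood(path, clusters[name]))


new_session = ["Frontpage", "Sports", "Sports", "News"]
best = classify(new_session)
```

Working in log probabilities avoids multiplying many small numbers together, which matters once sessions grow longer than a few clicks.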
Further Discussion on Web Click Analysis
Since we are talking about web clicks and Russia, I have to make a short detour into web click information about this blog. Almost since the outset, this blog has received a good number of hits from Russia. The following map was generated by StatPress, a plug-in for WordPress.
The primary geographic location reported by StatPress is simply “English EN”, which could mean any of many different countries including the United States. The point of this specific graph is to map non English EN sources for web clicks. In particular, this website is receiving clicks from http://lenta.ru which appears to me to be a news portal. Not being a Russian speaker, I could not find the source of these referrals, though I do appreciate them. I mention this example because it provides a real-world view on analyzing web clicks. I believe http://marktab.net has traffic too low to make decisions based on detailed analysis. Rather, I look at the general trends provided by three sources to have an overall judgment of the website’s efficiency: 1) StatPress plug-in provided for WordPress, 2) Google Analytics, and 3) a preloaded webclick software provided by my ISP (internet service provider).
What is actionable on a website, from the point of view of a web hosting producer? This list provides some ideas:
- How to categorize web material
- What other non-related or related material to present on a webpage
- What advertising topics could be presented on a webpage
- What cross-promotional topics could be presented on a webpage
Even as I write the list, I know from both producing and consuming experience that website pages may have a specific title, but might not be clearly classified into a particular group. This story becomes murkier when we have web applications which can present dynamic content. In the case of SharePoint, the content might be completely tied to who the user is. I make these comments because the days of static HTML webpages (and I probably created hundreds in this category using text editing software, starting on Unix in the 1990s) are largely gone for organizations with heavy web traffic. Fortunately, the web click analysis problem can be tackled with a number of different machine learning algorithms, not just sequence clustering.
Ideally, I believe everyone should be moving toward real-time analysis. With the advantages that visual interfaces have in helping people make quick decisions, my ideal model has Visual Analytics on top of real-time, streaming analysis. As applied to the authors’ original example, an ideal implementation would be to classify new users based on their live usage, and use that information to dynamically present a web experience customized to how they interact with the web. What I am saying is more than just having a long list of categories on a page:
- How about sorting those categories into a different order based on the algorithm’s determination of a person’s likely usage?
- How about reordering content based on estimated interest?
- How about presenting advertising based not just on the static topics touched (the approach many people use now) but also on the dynamic sequence that people choose? This idea could set a web marketing company apart from its competition.
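As a rough illustration of the reordering idea, this Python sketch sorts a category list by the estimated probability that each category is the visitor's next click, given the last category visited. The probabilities, the default order, and the function name are all hypothetical. In practice the estimates would come from whatever model classifies the live session.

```python
# Hypothetical next-click probabilities for the cluster a visitor was assigned to.
next_click = {
    "Frontpage": {"News": 0.5, "Sports": 0.3, "Weather": 0.2},
    "News": {"Weather": 0.6, "News": 0.3, "Sports": 0.1},
}

# The site's default, static category order.
default_order = ["Business", "News", "Sports", "Weather"]


def reorder(categories, last_category):
    """Most likely next categories first; ties and unknowns keep the default order."""
    likely = next_click.get(last_category, {})
    return sorted(categories, key=lambda c: -likely.get(c, 0.0))


menu = reorder(default_order, "News")
```

Because Python's sort is stable, a visitor whose last category has no estimates simply sees the default order, which keeps the page predictable when the model has nothing to say.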
Many web applications already allow users to reorganize a “home page” or “portal” if these users log on. I know that when I arrange anything, whether the items in my kitchen or the files on my network drive, I am unlikely to change that order unless some substantial issue comes along. Some users may want the web application to dynamically order their portal based on parameters they specify (such as look at the last week or last month or last year), and in areas they choose (meaning allow dynamic reordering only on part of a web page but not the whole thing). My point is that even though a machine learning algorithm is available, we need not think of the situation as either user or machine controlled. Instead, the machine can provide users with a number of possible choices (each of which might represent a complex combination of settings) and the user can make a final decision. I believe data mining has best application in reducing a large set of complex information into a smaller manageable set which people can use to make decisions.
Authors’ DMX Code
The DMX code defines a structure, builds a model, and then runs through a series of queries for that model. My only comment is that the authors' code did not include the connection code. The following is my version (though you would have to change the database name, or initial catalog, to match yours):
// Create a data source using the utility stored procedure
// provided with the book
CALL ASSprocs.CreateDataSource(
    'Web Data',
    'Provider=SQLNCLI10.1;Data Source=localhost;Integrated Security=SSPI;Initial Catalog=DM2008_Chapter10',
    'ImpersonateCurrentUser', '', '')
GO
Expanding the Applications for Microsoft Sequence Clustering
I believe that, among the algorithms in SQL Server Data Mining, this is the one for which people will have the most trouble either finding appropriate problems or realizing that their data could be used. I repeat what I said earlier: regular clustering makes clusters based on static characteristics, but sequence clustering makes clusters based on sequences of actions. I have to comment again on natural, since I find it even harder to believe that sequence clustering makes natural groups than that regular clustering does. Rather, I believe we can and should look for applications dealing with sequences people can choose. Some of these examples may not have enough data for sequence clustering, but I hope this list of questions helps generate creativity:
- What sequence of training helps people prepare for a medical career?
- Is the path to the American Presidency always through the Ivy League?
- What shopping patterns do people have based on where they enter a shopping mall and where they go? (The same question can be applied to a single major retail store)
- What teams should a professional athlete play with to be more likely to win a national championship?
- What stock might someone purchase next?
- What items are people interested in next on an auction website?
- Do gamblers in a casino follow a usage pattern?
- Can we understand patient treatment by the sequence of medical interventions?
Persons Produce Wisdom
I want to make an important statement again about machine learning algorithms: these computations do NOT produce value statements and should NOT be considered advice or recommendations. Just because people follow a certain path, perhaps in large numbers, toward a certain outcome does not mean that people have to follow that specific path in the future. The Ivy League is not a written requirement for the American presidency, but it does represent a network of associations which might encourage a path toward that end. Thus, to followers: don’t be so sure that the path determined by science represents a path you would want to take. Also, to leaders: don’t be so sure that the path you are telling other people to follow is the only one among perhaps many (but not unlimited) similar paths. Sequence clustering attempts to discover snippets of similarity, but should not be a substitute for applied wisdom.
References
The MSDN documentation provides excellent information online about this algorithm:
- Mining Model Content for Sequence Clustering Models: http://msdn.microsoft.com/en-us/library/cc645747.aspx
- Microsoft Sequence Clustering Algorithm: http://msdn.microsoft.com/en-us/library/ms175462.aspx
- Microsoft Sequence Clustering Algorithm Technical Reference: http://msdn.microsoft.com/en-us/library/cc645866.aspx
- Viewing a Mining Model with the Microsoft Sequence Cluster Viewer: http://msdn.microsoft.com/en-us/library/ms174804.aspx
- Querying a Sequence Clustering Model: http://msdn.microsoft.com/en-us/library/cc645869.aspx
MacLennan, J., Tang, Z., & Crivat, B. (2009). Data Mining with Microsoft SQL Server 2008. Indianapolis, IN: Wiley Publishing Inc.