Data mining continues to be an area of active research. In this blog post, I summarize my notes and observations from a 2006 presentation titled Web Content Mining, given by Dr. Bing Liu of the University of Illinois at Chicago. The information I present summarizes Dr. Liu’s content. At the end of this post, I link to Dr. Liu’s outline of references, which also has the slides and the link to the webcast.
I am choosing to blog on this webcast because it introduces important terminology and concepts which help frame the scientific pursuit of new information in this active research area. Also, Dr. Liu wrote a book titled Web Data Mining, which went on sale November 23, 2010. I have not read this book, but from the outline at Amazon.com, the content includes all the key information from this webcast. I will provide a book link at the end of the blog post.
What is Web Content Mining?
Web Data Mining encompasses mining knowledge from the web or usage of the web. Dr. Liu presented three types of mining:
- Web Usage Mining = Access patterns from usage (may include IP addresses or user logins to correlate visits)
- Web Structure Mining = Discovering knowledge from hyperlinks (seeing how information may be clustered in groups of links from a related section or even from a single blog post)
- Web Content Mining = Mining knowledge from web pages
Dr. Liu conceded that this topic is large and that he was presenting only part of it. I believe there would be interesting problems in combining two or more of the above mining areas when framing a specific scientific problem. Also, Dr. Liu’s book covers more than just Web Content Mining; it covers the entire topic of Web Data Mining.
Dr. Liu’s Roadmap for Web Content Mining
- Structured Data Extraction
- Information Integration
- Opinion Mining (Information Extraction)
What is Structured Data Extraction?
Structured Data Extraction means obtaining regularly formatted data objects from the web, and creating a database based on this regular format. Dr. Liu talked about list pages and detail pages. On a list page, a product is summarized alongside a list of other objects; the detail page provides more specific information about a single object.
- Example of List Page: SQL Server Books from Amazon.Com
- Example of Detail Page: A Specific Book from Amazon.Com on SQL Server
My comment is that sometimes, the detail pages might have a tabbed interface, and the nesting can go quite deep. In the case of Amazon, a page which might go deep includes the comments from users, which might span multiple pages (and comments can have comments too). Another nested structure includes the “Look Inside” feature on Amazon.com, which previews certain pages from a specific book. Hierarchically, those images are part of the nested presentation of a specific book.
What should be the entity (unit element)? You might assume that in my Amazon.com example, the entity would be a specific listing, uniquely identified by the ISBN number (either 10 or 13 digits, or both). However, my answer to the entity question depends on the scientific question. If you are going to study the previews from the “Look Inside” feature, then perhaps the entity is the specific image presented for a specific book. How to structure the analysis depends on the specific question.
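If the entity is the book listing, one practical wrinkle is that the same book may appear under a 10-digit ISBN on one page and a 13-digit ISBN on another. The sketch below is my own illustration (the function names are hypothetical, not from Dr. Liu's talk) of normalizing both forms to a single 13-digit key using the standard 978 prefix and check-digit rule.

```python
def isbn10_to_isbn13(isbn10):
    """Convert a 10-digit ISBN to its 13-digit form (978 prefix)."""
    digits = isbn10.replace("-", "").replace(" ", "")
    if len(digits) != 10:
        raise ValueError("expected a 10-character ISBN")
    # Drop the old check digit and prepend the 978 prefix.
    core = "978" + digits[:9]
    # ISBN-13 check digit: weights alternate 1, 3 across the 12 digits.
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)


def entity_key(isbn):
    """Hypothetical helper: one key per book, whichever ISBN form was scraped."""
    digits = isbn.replace("-", "").replace(" ", "")
    return digits if len(digits) == 13 else isbn10_to_isbn13(digits)
```

With this in place, two records scraped with different ISBN forms for the same book collapse to one entity. If the scientific question is about preview images instead, the key would need to be something else entirely.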
Dr. Liu mentions wrapper induction as a method to mine structured data. Wrapper induction means learning, from labeled example pages, the text elements which signify the boundaries of a specific object. We are assuming that we are dealing with a text-based markup format such as HTML or XML, and if you notice in this post, I am using italics to signify a definition. Data mining can be used to determine these patterns, or specifically to provide output which probabilistically ranks likely candidates for patterns in wrappers. The word induction means inferring a general rule from specific examples, and therefore wrapper induction implies inferring the text which frames a structured data object.
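To make the idea of learned boundaries concrete, here is a minimal sketch of my own, assuming each training example is a page fragment paired with the value a person labeled in it. It learns one left delimiter and one right delimiter by keeping only the text the examples have in common. The function names and sample markup are hypothetical, and real wrapper induction systems handle multiple fields and noisier pages than this.

```python
import os


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def induce_wrapper(examples):
    """Learn (left, right) delimiters from (page, labeled_value) pairs."""
    lefts, rights = [], []
    for page, target in examples:
        start = page.index(target)  # position of the labeled value
        lefts.append(page[:start])  # everything before the value
        rights.append(page[start + len(target):])  # everything after it
    # The wrapper is the text shared by all examples around the value.
    return common_suffix(lefts), os.path.commonprefix(rights)


def apply_wrapper(wrapper, page):
    """Extract the value framed by the learned delimiters."""
    left, right = wrapper
    start = page.index(left) + len(left)
    end = page.index(right, start)
    return page[start:end]


# Hypothetical labeled examples: page fragment plus the title a person marked.
examples = [
    ("<li><b>Title:</b> <i>Book A</i> $10</li>", "Book A"),
    ("<li><b>Title:</b> <i>Another Book</i> $25</li>", "Another Book"),
]
wrapper = induce_wrapper(examples)
title = apply_wrapper(wrapper, "<li><b>Title:</b> <i>Third Book</i> $7</li>")
```

The more varied the labeled examples, the shorter and more general the learned delimiters become, which is the trade-off between a wrapper that is too specific and one that matches too much.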
Dr. Liu then discusses automatic extraction, the unsupervised mining of lists. I like the word list, and Microsoft is bringing that word into prominence with SharePoint. Solutions for automatic extraction focus on the data record, the element of interest, and the data region, the group of data records. Algorithms exist for this purpose, but I believe this area could be more customized when someone has a specific scientific goal in mind. I have long believed that algorithm writing is an important skill for someone practicing data mining, and this area is one where I can see someone writing code.
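As a rough illustration of data records and data regions, the sketch below (my own simplification, not an algorithm from the talk) parses a page into a tag tree and reports runs of adjacent siblings whose tag structure is identical. It assumes well-formed markup where every tag is explicitly closed, and it uses exact matching, whereas published approaches tolerate variation between records. All class and function names are hypothetical.

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Build a minimal tag tree; assumes every tag is explicitly closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def signature(node):
    """Tag structure of a subtree, ignoring text, e.g. 'li(b,span)'."""
    if not node.children:
        return node.tag
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def find_data_regions(node, min_records=2):
    """Return runs of adjacent siblings that share one signature."""
    regions, run = [], []
    for child in node.children:
        if run and signature(child) == signature(run[-1]):
            run.append(child)
        else:
            if len(run) >= min_records:
                regions.append(run)
            run = [child]
    if len(run) >= min_records:
        regions.append(run)
    for child in node.children:  # look deeper in the tree as well
        regions.extend(find_data_regions(child, min_records))
    return regions


def record_text(node):
    """Collect the text fields of one data record, in document order."""
    parts = [node.text] if node.text else []
    for child in node.children:
        parts.extend(record_text(child))
    return parts


page = ("<div><h1>Books</h1><ul>"
        "<li><b>Book A</b><span>$10</span></li>"
        "<li><b>Book B</b><span>$25</span></li>"
        "<li><b>Book C</b><span>$7</span></li>"
        "</ul><p>Footer</p></div>")
builder = TreeBuilder()
builder.feed(page)
regions = find_data_regions(builder.root)
records = [record_text(rec) for rec in regions[0]]
```

Here the three list items form one data region, and each item is a data record. Note that records whose own children repeat (table cells in a row, for example) would also be reported as regions, so a real solution needs a rule for choosing the right level.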
For this entire area of structured data mining, the problem is well-defined, and someone could determine a solution since the challenge is to match something which has a predetermined intelligent design behind its presentation.
What is Information Integration?
Dr. Liu describes information integration as aligning information from different websites. From my experience, that problem could exist within the same large organization, and my mind draws back to a problem I had with a large insurance company where they were attempting to align their own information from the United States with their Latin American subsidiaries (who were doing the same processes, but on a different system and with different terminology). Dr. Liu’s presentation focuses on how the external website looks to the outside user, but the same challenge exists within a large organization which may have separate systems collecting similar data.
Dr. Liu mentioned that much of the academic work has been in Web Query Interface Integration, which means using input fields among websites to determine best matches. He used an example of booking an airline flight using a single master interface, which would then look at different websites which each may have a slightly different way of obtaining the identical information. That’s a lot of words, but I have a website for you to try: FareCompare.com. This single website allows you to look for flights in multiple systems, and I have used it for my own travel.
Determining similarities with data mining involves applying clustering to see which attributes might be identical. Name on one website might be user on another website. Sometimes, attributes might be grouped, like first name and surname. In some problems, as Dr. Liu mentioned, we might be able to make a bridging assumption, where a relationship between attributes is assumed through a third proxy attribute. Another way to infer connections is through query probing: putting values into an input interface and seeing how the results correlate.
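A small sketch may help show how attributes with different names can be matched. Instead of full clustering, this example of mine uses a simpler stand-in: it scores each pair of attributes by how much their observed values overlap (Jaccard similarity) and keeps the best match above a threshold. The site data, attribute names, and the 0.3 threshold are all made up for illustration.

```python
def jaccard(a, b):
    """Share of distinct values that two value collections have in common."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_attributes(site_a, site_b, threshold=0.3):
    """Map each attribute of site_a to its best-overlapping attribute in site_b."""
    matches = {}
    for name_a, values_a in site_a.items():
        best_name, best_score = None, 0.0
        for name_b, values_b in site_b.items():
            score = jaccard(values_a, values_b)
            if score > best_score:
                best_name, best_score = name_b, score
        if best_score >= threshold:  # weak overlaps are left unmatched
            matches[name_a] = (best_name, best_score)
    return matches


# Hypothetical values sampled from two sites, e.g. gathered by query probing.
site_a = {
    "name": ["alice", "bob", "carol"],
    "city": ["Chicago", "Miami", "Austin"],
}
site_b = {
    "user": ["bob", "carol", "dave"],
    "location": ["Miami", "Austin", "Boston"],
    "plan": ["basic", "pro"],
}
matches = match_attributes(site_a, site_b)
```

In this toy data, name pairs with user and city pairs with location purely because their values overlap. With few observations the scores become unreliable, which is the same data-volume problem that makes synonyms hard to determine.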
Synonyms are difficult to determine, and as Dr. Liu mentions, the number of observations to determine a synonym might be high. In practical terms, there may not be enough attributes (columns) or observations (rows) or websites to be able to integrate information through data mining. The actual number required depends on the variability of the input data. For these reasons, information integration is an active research area.
What is Opinion Mining (Information Extraction)?
Information extraction differs from information integration because the goal is to discover or determine new knowledge or conclusions based on presented content. Dr. Liu uses the phrase opinion mining to describe this area, but I find myself liking information extraction as a phrase since the underlying content might not be opinion-oriented in nature. I concede that the resultant information extraction is, itself, an opinion, even if one highly supported by evidence through data mining.
Dr. Liu explicitly introduces this subject as either extracting information about factual data or about subjective opinions.
- Factual Problem – Obtain factual information about public high schools in the United States to determine an integrated picture, state-by-state, of high school education in America
- Subjective Opinion Problem – Obtain opinions of public high schools in the United States to determine public sentiment about the efficacy of teachers
Some problems, I have to believe, straddle the two worlds. One example I will pose is Google Flu Trends, which matches geography with terms entered in a search engine to determine flu outbreak patterns. As it turns out, self-reported searches in search engines become a highly correlated variable with flu epidemics. It may be argued that average people might not know the difference between the flu and the common cold, and the symptoms may be similar. Thus, there is an argument that terms in a search engine are actually an opinion and not necessarily (or intended to be) a fact. You could debate whether Google Flu Trends represents a factual problem or a subjective opinion problem, but both problems could be classified as information extraction. This example shows why I prefer the phrase information extraction.
Dr. Liu then mentions that, for opinions, people want summaries. I am currently working with a telecommunications client which is challenged both with summarizing service opinions and with providing the source data for further drill-down investigation. I believe part of the answer is in summary visualization, and I have offered advice on that topic for this client.
Let me list some websites I use which provide opinions:
- Amazon.com – books, media, and all types of other stuff
- TripAdvisor.com – hotels, flights, and restaurants — though I use it mostly for hotels
- DPReview.com – digital cameras
- WebHostingTalk.com – web host services
- ClarkHoward.com – consumer products and services
As with information integration, the data mining task in information extraction is to determine synonyms. Also, the two problems share the issue of having potentially large amounts of data to determine such patterns. Some of the websites I mentioned help people by providing another way beyond just qualitative text to classify feedback, perhaps by providing categories or perhaps by a numerical rating system. The additional quantitative information allows the qualitative data to have some anchoring against a known scale (even if self-reported).
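To show how a numerical rating can anchor qualitative text, here is a minimal sketch under strong assumptions: the synonym map is written by hand (in practice, discovering it is the data mining task), the reviews are invented, and each review's star rating is simply credited to every feature it mentions. All names are hypothetical.

```python
import re
from collections import defaultdict
from statistics import mean

# Hand-written synonym map (an assumption for illustration only).
SYNONYMS = {
    "photo": "picture quality",
    "image": "picture quality",
    "picture": "picture quality",
    "battery": "battery life",
    "charge": "battery life",
}


def summarize(reviews):
    """Return {feature: (mention_count, average_stars)} from (text, stars) pairs."""
    ratings = defaultdict(list)
    for text, stars in reviews:
        words = set(re.findall(r"[a-z]+", text.lower()))
        # Map each word to its canonical feature; count a feature once per review.
        features = {SYNONYMS[w] for w in words if w in SYNONYMS}
        for feature in features:
            ratings[feature].append(stars)
    return {f: (len(r), mean(r)) for f, r in ratings.items()}


# Invented reviews for a digital camera, each with a 1 to 5 star rating.
reviews = [
    ("The photo is sharp", 5),
    ("Image looks soft, battery drains fast", 2),
    ("Battery lasts all day", 4),
]
summary = summarize(reviews)
```

The output is the kind of per-feature summary a reader could drill into, with the original reviews still available as source data. Crediting the whole review's rating to each feature is crude, and it is one reason the sentence-level opinion problem needs natural language understanding.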
An information extraction project might start as structured data extraction (the first phase), then move on through information integration (to group similar information from among websites), and then result in information extraction (compiling the results into a single picture). All the websites I listed help collect information into a single source, and even in the last case (Clark Howard), which has a single expert moderator, users can add their own opinions and feedback too. Information integration is not just a data mining task but, as you see from my examples, can become a business. In all my cited cases, contributors “donate” content under website terms and conditions to clarify copyright and ownership issues. However, I can see a role for someone collecting information for private sale, or even charging subscribers to see collated opinions derived from other websites.
Dr. Liu mentions that information extraction is another area of active research. I add that this area is one for active entrepreneurship too.
Conclusion
Dr. Liu concludes that even the web mining techniques he did not discuss have elements of integration or extraction. He also emphasized that these problems require natural language understanding. I hope my notes provide a useful contextual summary if you want to view the webcast (requires installation of Cisco’s WebEx Player).
The link for Dr. Liu’s presentation: http://www.cs.uic.edu/~liub/WCM-Refs.html
Dr. Liu’s Book:
Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (Data-Centric Systems and Applications)