Clustering involves identifying a finite set of categories (clusters) to describe the data. The clusters can be mutually exclusive, hierarchical, or overlapping (Fayyad, et al., 1996, p. 44). Each member of a cluster should be very similar to the other members of its cluster and dissimilar to members of other clusters. Techniques for creating clusters include partitioning (often using the k-means algorithm) and hierarchical methods (which group objects into a tree of clusters), as well as grid-based, model-based, and density-based methods (Han & Kamber, 2001, pp. 346-348).
Outlier analysis is a form of cluster analysis that focuses on the items that don't fit neatly into other clusters (Han & Kamber, 2001). Sometimes these objects represent errors in the data, and other times they represent the most interesting pattern of all. Freitas (1999) focuses on outliers in his discussion of attribute surprisingness and suggests that surprisingness should be an additional criterion for interestingness measures.
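As a rough illustration of the partitioning approach mentioned above, the following Python sketch runs a basic k-means loop that produces mutually exclusive clusters. The sample points, the choice of two clusters, and the starting centroids are illustrative assumptions and are not taken from the cited sources.

```python
import math

def k_means(points, centroids, max_iterations=100):
    """Basic k-means: alternate between assigning points to the nearest
    centroid and recomputing each centroid as the mean of its cluster."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: each point joins exactly one cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments are stable
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical two-dimensional data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

In practice the starting centroids are usually chosen at random and the result can depend on that choice; fixed starting points are used here only to keep the sketch deterministic.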
Summarization maps data into subsets and then applies a compact description to each subset. Also called characterization or generalization, it derives summary data from the data or extracts actual portions of the data which "succinctly characterize the contents" (Dunham, 2003, p. 8).
Dependency modeling, or association rule mining, involves searching for interesting relationships between items in a data set. Market basket analysis is a good example of this model. An example of an association rule is "customers who buy computers tend to also buy financial software" (Han & Kamber, 2001, pp. 226-117). Since association rules are not always interesting or useful, constraints are applied which specify the type of knowledge to be mined, such as specific dates of interest, thresholds on statistical measures (rule interestingness, support, confidence), or other rules applied by end users (Han & Kamber, 2001, p. 262).
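The support and confidence thresholds mentioned above can be made concrete with a short Python sketch. The transactions and the threshold values below are hypothetical; only the rule itself (computers and financial software) echoes the example in the text.

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"computer", "financial software", "printer"},
    {"computer", "financial software"},
    {"computer", "printer"},
    {"printer", "paper"},
    {"computer", "financial software", "paper"},
]

def count_containing(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(antecedent, consequent, transactions):
    """Fraction of all transactions containing antecedent and consequent."""
    both = set(antecedent) | set(consequent)
    return count_containing(both, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return (count_containing(both, transactions)
            / count_containing(antecedent, transactions))

rule_support = support({"computer"}, {"financial software"}, transactions)
rule_confidence = confidence({"computer"}, {"financial software"}, transactions)

# Assumed user-supplied thresholds acting as constraints on the rules kept.
MIN_SUPPORT, MIN_CONFIDENCE = 0.5, 0.7
is_interesting = rule_support >= MIN_SUPPORT and rule_confidence >= MIN_CONFIDENCE
print(rule_support, rule_confidence, is_interesting)
```

A rule that falls below either threshold would simply be discarded, which is one way the constraints narrow the output to rules worth examining.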
6.5 Change and Deviation Detection
Also called sequential analysis and sequence discovery (Dunham, 2003, p. 9), change and deviation detection focuses on discovering the most significant changes in data. This involves establishing normative values and then evaluating new data against the baseline (Fayyad, et al., 1996, p. 45). Relationships based on time are discovered in the data.
The above methods form the basis for most data mining activities. Many variations on these basic approaches can be found in the literature, including algorithms specifically modified to apply to spatial data, temporal data mining, multi-dimensional databases, text databases, and the Web (Dunham, 2003; Han & Kamber, 2001).
7. Related Disciplines: Information Retrieval and Text Mining
Two disciplines closely related to data mining are information retrieval and text mining. The relationship between information retrieval and data mining techniques has been complementary. Text mining, however, represents a new discipline arising from the combination of information retrieval and data mining.
7.1 Information Retrieval (IR)
Many of the techniques used in data mining come from Information Retrieval (IR), but data mining goes beyond information retrieval. IR is concerned with the process of searching and retrieving information that exists in text-based collections (Dunham, 2003, p. 26). Data mining, on the other hand, is not concerned with retrieving data that exists in the repository. Instead, data mining is concerned with patterns that can be found that will tell us something new, something that isn't explicitly in the data (Han & Kamber, 2001).
IR techniques are applied to text-based collections (Baeza-Yates & Ribeiro-Neto, 1999). Data mining techniques can be applied to text documents as well as databases (KDD), Web-based content and metadata, and complex data such as GIS data and temporal data.
In terms of evaluating effectiveness, IR and data mining systems differ markedly. Per Dunham (2003, p. 26), the effectiveness of an IR system is based on precision and recall and can be represented by the following formulas:
Precision = |Relevant and Retrieved| / |Retrieved|
Recall = |Relevant and Retrieved| / |Relevant|
The effectiveness of a knowledge discovery system, by contrast, is judged by whether any useful or interesting information (knowledge) has been discovered. Usefulness and interestingness measures are much more subjective than IR measures (precision and recall).
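The precision and recall measures discussed in this section can be sketched in a few lines of Python. The document identifiers below are hypothetical; the calculation treats precision as the share of retrieved documents that are relevant and recall as the share of relevant documents that were retrieved.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for one query.

    relevant:  documents judged relevant to the query
    retrieved: documents the IR system actually returned
    """
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # relevant AND retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: four relevant documents, three retrieved.
relevant_docs = {"d1", "d2", "d3", "d4"}
retrieved_docs = {"d2", "d3", "d5"}
precision, recall = precision_recall(relevant_docs, retrieved_docs)
print(precision, recall)
```

Both values can be computed mechanically once relevance judgments exist, which is what makes them less subjective than usefulness or interestingness.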
7.2 IR Contributions to Data Mining
Many of the techniques developed in IR have been incorporated into data mining methods, including Vector Space Models, Term Discrimination Values, Inverse Document Frequency, Term Frequency-Inverse Document Frequency, and Latent Semantic Indexing.
Vector Space Models, or vector space information retrieval systems, represent documents as vectors in a vector space (Howland & Park, 2003, p. 3; Kobayashi & Aono, 2003, p. 105). Term Discrimination Value posits that a good discriminating term is one that, when added to the vector space, increases the distances between documents (vectors). Terms that appear in 1%-10% of documents tend to be good discriminators (Senellart & Blondel, 2003, p. 28). Inverse Document Frequency (IDF), which gives more weight to terms that appear in fewer documents, is used in measuring similarity and in data mining methods including clustering and classification (Dunham, 2003, pp. 26-27). Term Frequency-Inverse Document Frequency (TF-IDF) is an IR algorithm based on the idea that terms that appear often in a document but do not appear in many documents are more important and should be weighted accordingly (Senellart & Blondel, 2003, p. 28). Latent Semantic Indexing (LSI) is a dimensionality reduction process based on Singular Value Decomposition (SVD). It can be used to reduce noise in the database and help overcome synonymy and polysemy problems (Kobayashi & Aono, 2003, p. 107).
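To illustrate the TF-IDF idea described above, the following Python sketch weights each term by its relative frequency within a document multiplied by the logarithm of the inverse document frequency. This is one common variant among several; the exact weighting formula and the three sample documents are assumptions made for illustration and are not drawn from the cited sources.

```python
import math
from collections import Counter

# Hypothetical miniature corpus.
documents = {
    "doc1": "data mining finds patterns in data",
    "doc2": "information retrieval finds documents",
    "doc3": "text mining converts text to numbers",
}

def tf_idf(documents):
    """Return {document: {term: weight}} using tf * log(N / df)."""
    tokenized = {name: text.lower().split() for name, text in documents.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    weights = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        weights[name] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights

weights = tf_idf(documents)
for term, weight in sorted(weights["doc1"].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

In this sketch "data" receives the highest weight in doc1 because it occurs twice there and in no other document, while "mining" and "finds" are discounted because each appears in two of the three documents.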
7.3 Data Mining Contributions to IR
Although IR cannot utilize all the tools developed for data mining because IR is generally limited to unstructured documents, it has nonetheless benefited from advances in data mining. Han and Kamber (2001) describe Document Classification Analysis, which involves developing models that are then applied to other documents to classify them automatically. The process includes creating keywords and terms using standard information retrieval techniques such as TF-IDF and then applying association techniques from data mining to build concept hierarchies and classes of documents, which can be used to automatically classify subsequent documents (p. 434).
The data mining idea of creating a model instead of directly searching the original data can be applied to IR. Kobayashi & Aono (2003) describe using Principal Component Analysis (PCA) and Covariance Matrix Analysis (COV) to map an IR problem to a "subspace spanned by a subset of the principal components" (p. 108).
8. Text Mining
Text mining (TM) is related to information retrieval insofar as it is limited to text. Yet it is related to data mining in that it goes beyond search and retrieval. Witten and Frank (2005) explain that the information to be extracted in text mining is not hidden; however, it is unknown because in its text form it is not amenable to automatic processing. Some of the methods used in text mining are essentially the same methods used in data mining. However, one of the first steps in text mining is to convert text documents to numerical representations, which then allows for the use of standard data mining methods (Weiss, Indurkhya, Zhang & Damerau, 2005).
Per Weiss, et al. (2005), "one of the main themes supporting text mining is the transformation of text into numerical data, so although the initial presentation is different, at some intermediate stage, the data move into a classical data-mining encoding. The unstructured data becomes structured" (pp. 3-4).
Weiss, et al. (2005) use the spreadsheet analogy as the classical data mining model for structured data. Each cell contains a value that is one of two types: ordered numerical or categorical. Income and cost are examples of ordered numerical attributes. Categorical attributes are codes or true/false values. In text mining, the idea is to convert the text of each document into the values of one row of a spreadsheet, where each row represents a document and the columns correspond to words found in one or more documents. The values inside the spreadsheet can then be defined (categorically) as present (this word is in this document) or absent (this word is not in this document). The spreadsheet represents the entire set of documents, or corpus.
The collection of unique words found in the entire document collection represents the dictionary and will likely be a very large set. However, many of the cells in the spreadsheet will be empty (not present).
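The spreadsheet analogy can be sketched in Python as follows. The three sample documents and the simple whitespace tokenization are assumptions made for illustration; real text mining systems typically apply more elaborate preprocessing.

```python
# Hypothetical corpus of three short documents.
documents = [
    "data mining finds patterns",
    "text mining converts text to numbers",
    "information retrieval finds documents",
]

# Split each document into lowercase words (naive tokenization).
tokenized = [doc.lower().split() for doc in documents]

# The dictionary: every unique word in the corpus, one column per word.
dictionary = sorted({word for tokens in tokenized for word in tokens})

# One row per document; 1 means "present", 0 means "absent".
rows = []
for tokens in tokenized:
    present = set(tokens)
    rows.append([1 if word in present else 0 for word in dictionary])

filled = sum(sum(row) for row in rows)
total = len(rows) * len(dictionary)
empty = total - filled

print(dictionary)
for row in rows:
    print(row)
print(f"{empty} of {total} cells are empty")
```

Even with three short documents, most cells are empty, and the proportion of empty cells tends to grow as the dictionary grows.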
An empty cell in a data mining operation might pose a problem, as it would be interpreted as an incomplete record. However, in text mining, this sparseness of data works to reduce the processing requirements because only cells containing information need to be analyzed. The result is that the size of the spreadsheet is enormous but it is mostly empty. This "allows text mining programs to operate in what would be considered huge dimensions for regular data-mining applications" (Weiss, et al., 2