Clustering By Impact, Algorithms And Applications

Project Award Number: IIS-0208519


Principal Investigator

Daniel Barbará
Department of Information and Software Engineering
George Mason University
4400 University Dr
MSN 4A4
Fairfax
VA 22030
703-9931627
703-9931638
dbarbara@gmu.edu

Keywords

Clustering
Data Streams

Outlier detection
Adaptive models

Project Summary

As companies and organizations collect data continuosly, the need for incremental algorithms that can mine the data and track the change of patterns through time becomes pressing. In this project we investigate new methods for incremental clustering of data sets. We call these methods clustering by impact, because the decision of in which cluster we should place a new point is based on the impact that this point has over a function that naturally expresses the definition of clusters. Moreover, these methods have the virtue of requiring only a bounded amount of main memory to concisely represent the clusters that have been found so far, making the task of continous mining of data streams possible. We propose the usage of two functions for these incremental methods. The first uses the concept of self-similarity, which is, in escence, the property of being invariant with respect to the scale used to look at the data set. While ``pure'' fractals are self-similar at every scale, many data sets exhibit self-similarity over a range of scales. The second function is based on the notion of entropy and is used to cluster data sets with categorical attributes. One area that the project emphasizes is that of tracking the evolution of clusters in data streams. To that effect, we have developed a probabilistic model (based on Hoeffding bounds) that, along with a natural way in which our incremental algorithms are capable of detecting outliers, allows the automation of the decision of when it is time to re-examine the clusters obtained. We have also developed a principled, statistically-based technique to detect outliers, which are understood as points that do not fit any of the clusters. The technique is also incremental, making decisions about the points as they come along.

Publications and Products

  • ``Detecting Outliers using Transduction and Statistical Testing,'' Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, August 20-23, 2006. D. Barbará, C. Domeniconi, J. Rogers.
  • ``Categorization and Keyword Identification of Unlabeled Documents,'' Proceedings of the Fifth IEEE International Conference on Data Mining , Houston, Texas, November 27-30, 2005. N. Kang, C. Domeniconi, D. Barbará
  • ``Classifying Documents Without Labels,'' Proceedings of the SIAM International Conference on Data Mining , Lake Buena Vista, Florida, April 22-24, 2004. D. Barbará, C. Domeniconi, N. Kang
  • ``Self-similar mining of time association rules,'' Proceedings of the the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'04). Sidney, Australia May 2004. D. Barbará and P. Chen.
  • ``Mining Relevant Text from Unlabelled Documents,'' Proceedings of the Third IEEE International Conference on Data Mining , Melbourne, Florida, November 19-22, 2003. D. Barbará, C. Domeniconi, N. Kang.
  • ``Using Self-Similarity to Cluster Large Data Sets,'' Journal of Data Mining and Knowledge Discovery, 7(2) 123-152, April 2003. D. Barbará, P. Chen.
  • Fractal Characterization of Web Workloads,'' Proceedings of the 11th International World Wide Web Conference, May 2002. D. Menasce, V. Almeida, D. Barbará, B. Abrahao, and F. Ribeiro. (Best Paper Finalist.)
  • ``Bootstrapping a data mining intrusion detection system,'' ACM Symposium on Applied Computing, March 2003. D. Barbará, J. Couto, Y. Li, J. Lin, and S. Jajodia
  • ``COOLCAT: An entropy-based algorithm for categorical clustering,'' Eleventh International Conference on Information and Knowledge Management (CIKM'02) , Nov. 2002. D. Barbará, J. Couto and Y. Li.
  • ``Requirements for Clustering Data Streams,'' SIGKDD Explorations (Special Issue on Online, Interactive, and Anytime Data Mining), D. Barbará.Vol. 3, No. 2, Jan 2002.

Project Impact

This project aims to produce scalable, incremental clustering algorithms capable of tracking the evolution of clusters in data streams. As such, it can have a big impact in applications such as intrusion detection, web mining, and sensor data mining.

Goals, Objectives and Targeted Activities

We have developed scalable, incremental algorithms to cluster categorical and numerical data sets, using concepts from information theory and fractals. The incremental nature of the algorithms makes them ideal for clustering data streams (i.e., continuously arriving data). We have developed software to test these concepts, including a simple, but powerful GUI. We have conducted extensive experimentation, using both synthetic and real data sets, proving that the algorithm is robust and competes very favorably with previous algorithms in the literature. We have developed means to discover categorical data outliers, based on entropy meassures that allow us to decide whether a new point fits or not the current clusters. We have used these findings to separate unlabeled network packet data into two classes: points that correspond to normal activity, and points that correspond to activity that can be deemed suspicious.
Findings:
  • Incremental clustering of categorical data
  • COOLCAT is a concrete example of the type of clustering algorithms that we intended to develop under this grant. Its requirements were to be scalable, incremental, and therefore applicable to streams of incoming data.

    COOLCAT satisfies these requirements by using a simple function that can be calculated for a new point that we wish to cluster (given that we already have a set of defined clusters). In the case of COOLCAT, this function measures the impact that the point has over the entropy of each cluster. By clustering the point in the cluster whose entropy is reduced the most, we can make an incremental, fast decision. The reasoning behind such decision is that a good clustering should miminize the entropy of the individual clusterings by placing similar points together. COOLCAT is specifically designed for categorical data, where the entropy can be easily computed by using the discrete probability distributions of each of the categorical values of each attribute. The algorithm assumes independence among the attributes of the dataset.

  • Document classification without labels:
  • Concretely, the technique aims to address the problem of text classification, which is an important task with applications in the fields of document searching, forensics and others. Methods to perform classification of text rely on the existence of a sample of documents whose class labels are known. However, this is a strong assumption, since it many situations, those labeled documents do not exist. We focus on the classification of documents into two classes (although the work is extensible to many classes): relevant and irrelevant (assuming a target class of relevant documents).

    Originally, we divided the set of available (unlabeled documents) into buckets (this, for instance, could be what one obtains from different search engines in answer to the same query). Using association rule mining, we find common sets of words among the buckets and postulate that the class of relevant documents is defined by those words. This leads to obtaining a cluster (sample) of documents of which we can guarantee contains a large percentage of relevant text. We call the algorithm that obtains this sample DocMine.

    To test the feasibility of our approach, we used the Reuters text categorization collection, removing common and rare words and designing experiments with different targets (topics) of relevance. Concretely, we select a topic of interest (e.g., "crops") and a set of buckets (e.g., 5), and vary the percentage of relevant documents in each bucket from 50\% to 80\%. (We selected as topics of interest those with the largest number of documents in the dataset, one per experiment.) Varying the minimal support for the frequent set of words, we conducted experiments, measuring in each case, the precision (percentage of truly relevant documents) obtained in the final sample that our technique renders. Our experiments show that our method is capable of selecting sets of documents with a precision that exceeds 90\% in most cases. The sample obtained can be considered a labeled sample of documents and used for further classification.

    That was the focus of the next step: using the sample to train classification models that can separate the whole dataset of documents. We tried several methods of classification to separate the data, including SVM. To do this, we needed to develop a heuristic that helps identify a small sample of negative (non-relevant) documents. Using the Reuters document repository again, we conducted experiments to compare six different combinations of algorithms, as follows:

    • DocMine and 1-class SVM (which finds a cluster of relevant documents, using the sample given by DocMine)
    • . DocMIne and 2-class SVM, by using a novel heuristic that bootstrapped negative (non-relevant) examples.
    • DocMine and SPY,SVM
    • DocMine and SPY, EM
    • DocMine and Rocchio, SVM
    • DocMine and Rocchio, EM

    SPY, SVM, Rocchio and EM were tested using were tested using the classification system LPU (developed by B. Liu and X. Lee and publicly available). The results show that, as expected, different combinations of methods are well-suited for different datasets and topics. However, a conclusion that stands out is that the method of bootstrapping negative values, combined with DocMine is extremely robust across datasets and topics.In all cases the precision of this method exceeds 90\%. We emphasize that in all cases, the precision of the method tends to reach very high levels (over 95\%) as the cardinality of common itemsets grows, regardless of the support utilized.

    In the next step of this research, we wanted to remove the assumption of having buckets of documents, and therefore assumed that the documents were contained in a single set. We used then a global unsupervised feature selection approach to characterize the words that co-occur frequently in the corpus of documents, and postulate that those combinations of words define the relevant documents. (I.e., the relevant class shows coherence of words, while the irrelevant documents contain many other words.) We introduced a new unsupervised feature (word) selection algorithm to handle muticlass classification of documents without labels. We then apply a locally adaptive clustering algoririthm for documents to separate them. The experimental results, conducted using both the Reuters document repository and an Email spam dataset of 1431 documents - to test spam algorithms-, demonstrate the feasibility of our approach in terms of achieved accuracy measured against the ground truth. Furthermore, the analysis of the weights credited to words provide evidence that the identified keywords can guide the process of label assignments to clusters. The results outperform the use of more conventional clustering methods (K-means) by a large margin. For instance, in the case of the Email dataset, the average error committed by our method was 2.0\% while using K-means the error registered was 45\%.

  • Outlier detection
  • We have developed a principled, statistically-based method to detect outliers in datasets. The method is unique in many ways. First, it requires the setting of very few, intuitive parameters, and it is very resilient to the values of those. Secondly, it does not require a model of what normality is. Third, it uses cluster information in a very efficient way, although the absence of such information does not hinder the performance in a large extent.

    Our method is based in Transductive Confidence Machines, which have been previously proposed as a mechanism to provide individual confidence measures on classification decisions. Here, we use transduction to obtain a likelihood of a new point belonging to a cluster, and applying standard hypothesis testing we either accept or reject that hypothesis (the membership of the point to the cluster). If we reject all the hypotheses (one per cluster), the point is declared an outlier. Again, while the cluster information is used advantageously in the method, the absence of it does not stop the method from being applicable. In other words, we could diagnose outliers in datasets for which no cluster information is available.

    To apply transduction, a function that measures the strangeness of a point is needed. We have, up to date, used as such function the sum of the distances to the K nearest neighbors (K then becoming one of the two parameters that neeeds to be specified). However, it is possible that other functions are suitable for the task (for future work and for categorical data, we plan to use entropy as a strangeness measure, following our experience of using entropy to cluster categorical data in COOLCAT).

    The fitness of a point with respect to a cluster is computed by means of a p-value, which measures the fraction of points currently in the cluster which exhibit a strangeness value larger than the one for the point we wish to diagnose. That, along with a confidence level (e.g., 95\%), which corresponds to the second parameter of our method, allow us to do hypothesis testing: if the p-value is smaller than 1-confidence level, then the null hypothesis ("the point belongs to the cluster") is rejected in favor of the alternative hypothesis ("the point does not belong to the cluster"). One can test as many hypothesis as clusters exist in the dataset, and if all the null hypothesis are rejected, the point is considered an outlier. Since we are performing multiple tests (one per cluster) the actual confidence of the final decision is smaller (if tao = 1 - confidence level, then (1-tao)^c, where c is the number of clusters would be the final confidence). To compensate for this diminished confidence, we raise the individual confidence on the tests, in such a way that the final confidence reflects the value we wish to have.

    We have thoroughly evaluated our method with both synthetic and real datasets. They include a synthetically designed dataset using Gaussian clusters (to visualize and understand the results), the Fisher-Iris dataset (found in the UCI repository), an on-line bookstore dataset that was generated by an e-commerce site, a dataset of texture published by the ELENA Project in Europe, and National Hockey League data. We measure in each of the experiments the True Positive (TP) and False Positive (FP) rate of outliers detected.

    These results compete very favorable when compared with another popular technique (designed by E. Knorr and R. Ng), called DB (for distance-based outliers). Our technique outperform theirs in many datasets (in some cases by an order of magnitude in TP). Although we did not yet compare our technique with the density-based approach LOF (this is in the future plans), we have theoretically demonstrated that our technique addresses the weaknesses exhibited in distance-based algorithms that LOF was designed to counteract. Concretely, the issue is the presence of clusters of different density and the impossibility for distance-based outlier algorithms to correctly discern between a point that is inside the least-dense cluster (clearly a non-outlier) and one that is situated at a comparable distance of the densest cluster (an outlier). In distance-based outlier algorithms we are forced to declare both points either as outliers or as normal. Our method, on the other hand, can successfully deal with such an scenario, because it takes into consideration the behavior of the points in each cluster. For the denser cluster, points will exhibit a strangeness measure smaller than the outlier, while in the sparse cluster, the strangeness measure will be comparable to the point in question.

    Aside from the experiments mentioned above, we conducted experiments to self-clean datasets (obtaining TP above 95\% with FP less than 4\%) and to perform outlier detection using sets that have been contaminated by outliers themselves (obtaining TP rates larger than 95\% and FP rates smaller than 3\%). All these results point to a very robust, versatile, and practical method.

Area References

  • M. Breunig, H. Kriegel, R. Ng, J. Sander ``LOF: Identifying Density-Based Local Outliers.'' Proc. of the ACM SIGMOD Conference on Management of Data, 2000, 427-438.
  • P. Cheeseman, A. Stutz, `` Bayesian classification (AUTOCLASS): Theory and Results,'' In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, 1995.
  • R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification, 2nd Edition. Wiley-Interscience. 2001.
  • V. Ganti, J. Gehrke, and R. Ramakrishnan. ``CACTUS - Clustering Categorical Data Using Summaries.'' In Proceedings of the ACM-SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA , 1999.
  • D. Gibson, J. Kleinberg, and P. Raghavan. ``Clustering Categorical Data: An Approach Based on Dynamical Systems.'' In Proceedings of the International Conference on Very Large Databases (VLDB), New York, NY, September 1998.
  • S. Guha, R. Rastogi, and K. Shim. ``CURE: A clustering algorithm for large databases.'' In Proceedings of the ACM SIGMOD Conference on Management of Data, Seattle, WA, May 1998.
  • M. Hubert, P.J. Rousseeuw, and S. Van~Aelst `` Multivariate Outlier Detection and Robustness.'' In Handbook of Statistics, Vol. 24, C.R. Rao, E. Wegman, J. Solka, editors. Elsevier, 2005.
  • E. Knorr and R. Ng. ``Algorithms for Mining Distance-Based Outliers in Large Datasets,'' Proceedings of the 24th Intl. Conference on Very Large Databases, 392-403.
  • S. Guha, R. Rastogi, and K. Shim. ``ROCK: A Robust Clustering Algorithm for Categorical Attributes.'' In Proceedings of the 15th International Conference on Data Engineering, Sydney, Australia, April 1999.
  • A.K. Jain and R.C. Dubes. Algorithms for clustering data. Prentice Hall, 1988.
  • L. O'Callaghan, N. Misra ,A. Meyerson ,S. Guha, R., Motwani, ``High performance clustering of streams in large data sets.'' Proceedings of the International Conference on Data Engineering, 2002.
  • R. Zhang, R. Ramakrishnan, and M.Livny. ``Birch: An efficient data clustering method for very large databases.''In Proceedings of the ACM SIGMOD Conference on Data Management, Montreal, Canada, June 1996.
  • Project Websites


    COOLCAT (Categorical Clustering)

    Details of our categorical clustering techniques

    Fractal Mining.

    Details of our numerical clustering techniques