COOLCAT is a concrete example of the type of clustering algorithms that we intended to develop under this grant. Its requirements were to be scalable, incremental, and therefore applicable to streams of incoming data.
COOLCAT satisfies these requirements by using a simple function that can be calculated for a new point that we wish to cluster (given that we already have a set of defined clusters). In the case of COOLCAT, this function measures the impact that the point has over the entropy of each cluster. By clustering the point in the cluster whose entropy is reduced the most, we can make an incremental, fast decision. The reasoning behind such decision is that a good clustering should miminize the entropy of the individual clusterings by placing similar points together. COOLCAT is specifically designed for categorical data, where the entropy can be easily computed by using the discrete probability distributions of each of the categorical values of each attribute. The algorithm assumes independence among the attributes of the dataset.
Concretely, the technique aims to address the problem of text classification, which is an important task with applications in the fields of document searching, forensics and others. Methods to perform classification of text rely on the existence of a sample of documents whose class labels are known. However, this is a strong assumption, since it many situations, those labeled documents do not exist. We focus on the classification of documents into two classes (although the work is extensible to many classes): relevant and irrelevant (assuming a target class of relevant documents).
Originally, we divided the set of available (unlabeled documents) into buckets (this, for instance, could be what one obtains from different search engines in answer to the same query). Using association rule mining, we find common sets of words among the buckets and postulate that the class of relevant documents is defined by those words. This leads to obtaining a cluster (sample) of documents of which we can guarantee contains a large percentage of relevant text. We call the algorithm that obtains this sample DocMine.
To test the feasibility of our approach, we used the Reuters text categorization collection, removing common and rare words and designing experiments with different targets (topics) of relevance. Concretely, we select a topic of interest (e.g., "crops") and a set of buckets (e.g., 5), and vary the percentage of relevant documents in each bucket from 50\% to 80\%. (We selected as topics of interest those with the largest number of documents in the dataset, one per experiment.) Varying the minimal support for the frequent set of words, we conducted experiments, measuring in each case, the precision (percentage of truly relevant documents) obtained in the final sample that our technique renders. Our experiments show that our method is capable of selecting sets of documents with a precision that exceeds 90\% in most cases. The sample obtained can be considered a labeled sample of documents and used for further classification.
That was the focus of the next step: using the sample to train classification models that can separate the whole dataset of documents. We tried several methods of classification to separate the data, including SVM. To do this, we needed to develop a heuristic that helps identify a small sample of negative (non-relevant) documents. Using the Reuters document repository again, we conducted experiments to compare six different combinations of algorithms, as follows:
SPY, SVM, Rocchio and EM were tested using were tested using the classification system LPU (developed by B. Liu and X. Lee and publicly available). The results show that, as expected, different combinations of methods are well-suited for different datasets and topics. However, a conclusion that stands out is that the method of bootstrapping negative values, combined with DocMine is extremely robust across datasets and topics.In all cases the precision of this method exceeds 90\%. We emphasize that in all cases, the precision of the method tends to reach very high levels (over 95\%) as the cardinality of common itemsets grows, regardless of the support utilized.
In the next step of this research, we wanted to remove the assumption of having buckets of documents, and therefore assumed that the documents were contained in a single set. We used then a global unsupervised feature selection approach to characterize the words that co-occur frequently in the corpus of documents, and postulate that those combinations of words define the relevant documents. (I.e., the relevant class shows coherence of words, while the irrelevant documents contain many other words.) We introduced a new unsupervised feature (word) selection algorithm to handle muticlass classification of documents without labels. We then apply a locally adaptive clustering algoririthm for documents to separate them. The experimental results, conducted using both the Reuters document repository and an Email spam dataset of 1431 documents - to test spam algorithms-, demonstrate the feasibility of our approach in terms of achieved accuracy measured against the ground truth. Furthermore, the analysis of the weights credited to words provide evidence that the identified keywords can guide the process of label assignments to clusters. The results outperform the use of more conventional clustering methods (K-means) by a large margin. For instance, in the case of the Email dataset, the average error committed by our method was 2.0\% while using K-means the error registered was 45\%.
We have developed a principled, statistically-based method to detect outliers in datasets. The method is unique in many ways. First, it requires the setting of very few, intuitive parameters, and it is very resilient to the values of those. Secondly, it does not require a model of what normality is. Third, it uses cluster information in a very efficient way, although the absence of such information does not hinder the performance in a large extent.
Our method is based in Transductive Confidence Machines, which have been previously proposed as a mechanism to provide individual confidence measures on classification decisions. Here, we use transduction to obtain a likelihood of a new point belonging to a cluster, and applying standard hypothesis testing we either accept or reject that hypothesis (the membership of the point to the cluster). If we reject all the hypotheses (one per cluster), the point is declared an outlier. Again, while the cluster information is used advantageously in the method, the absence of it does not stop the method from being applicable. In other words, we could diagnose outliers in datasets for which no cluster information is available.
To apply transduction, a function that measures the strangeness of a point is needed. We have, up to date, used as such function the sum of the distances to the K nearest neighbors (K then becoming one of the two parameters that neeeds to be specified). However, it is possible that other functions are suitable for the task (for future work and for categorical data, we plan to use entropy as a strangeness measure, following our experience of using entropy to cluster categorical data in COOLCAT).
The fitness of a point with respect to a cluster is computed by means of a p-value, which measures the fraction of points currently in the cluster which exhibit a strangeness value larger than the one for the point we wish to diagnose. That, along with a confidence level (e.g., 95\%), which corresponds to the second parameter of our method, allow us to do hypothesis testing: if the p-value is smaller than 1-confidence level, then the null hypothesis ("the point belongs to the cluster") is rejected in favor of the alternative hypothesis ("the point does not belong to the cluster"). One can test as many hypothesis as clusters exist in the dataset, and if all the null hypothesis are rejected, the point is considered an outlier. Since we are performing multiple tests (one per cluster) the actual confidence of the final decision is smaller (if tao = 1 - confidence level, then (1-tao)^c, where c is the number of clusters would be the final confidence). To compensate for this diminished confidence, we raise the individual confidence on the tests, in such a way that the final confidence reflects the value we wish to have.
We have thoroughly evaluated our method with both synthetic and real datasets. They include a synthetically designed dataset using Gaussian clusters (to visualize and understand the results), the Fisher-Iris dataset (found in the UCI repository), an on-line bookstore dataset that was generated by an e-commerce site, a dataset of texture published by the ELENA Project in Europe, and National Hockey League data. We measure in each of the experiments the True Positive (TP) and False Positive (FP) rate of outliers detected.
These results compete very favorable when compared with another popular technique (designed by E. Knorr and R. Ng), called DB (for distance-based outliers). Our technique outperform theirs in many datasets (in some cases by an order of magnitude in TP). Although we did not yet compare our technique with the density-based approach LOF (this is in the future plans), we have theoretically demonstrated that our technique addresses the weaknesses exhibited in distance-based algorithms that LOF was designed to counteract. Concretely, the issue is the presence of clusters of different density and the impossibility for distance-based outlier algorithms to correctly discern between a point that is inside the least-dense cluster (clearly a non-outlier) and one that is situated at a comparable distance of the densest cluster (an outlier). In distance-based outlier algorithms we are forced to declare both points either as outliers or as normal. Our method, on the other hand, can successfully deal with such an scenario, because it takes into consideration the behavior of the points in each cluster. For the denser cluster, points will exhibit a strangeness measure smaller than the outlier, while in the sparse cluster, the strangeness measure will be comparable to the point in question.
Aside from the experiments mentioned above, we conducted experiments to self-clean datasets (obtaining TP above 95\% with FP less than 4\%) and to perform outlier detection using sets that have been contaminated by outliers themselves (obtaining TP rates larger than 95\% and FP rates smaller than 3\%). All these results point to a very robust, versatile, and practical method.