|
Clustering of categorical attributes is a difficult problem that
has not received as much attention as its numerical counterpart.
In this work we explore the connection
between clustering and entropy: clusters of similar points have
lower entropy than those of dissimilar ones. We use this connection
to design a heuristic algorithm, COOLCAT, which is capable
of efficiently clustering large data sets of records with categorical
attributes. In contrast with previously published categorical
clustering algorithms, COOLCAT's clustering results are
very stable across different sample sizes and parameter settings.
Also, the criterion for clustering
is a very intuitive one, since it is
deeply rooted in the well-known notion of entropy.
|