Clustering of categorical attributes is a difficult problem that has not received as much attention as its numerical counterpart. In this work we explore the connection between clustering and entropy: clusters of similar points have lower entropy than clusters of dissimilar ones. We use this connection to design a heuristic algorithm, COOLCAT, which is capable of efficiently clustering large data sets of records with categorical attributes. In contrast with other categorical clustering algorithms published in the past, COOLCAT's clustering results are very stable across different sample sizes and parameter settings. Also, the criterion for clustering is a very intuitive one, since it is deeply rooted in the well-known notion of entropy.
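To make the entropy connection concrete, the following is a minimal sketch, not the COOLCAT implementation itself. It assumes that attributes are independent, so that the entropy of a cluster can be taken as the sum of its per-attribute entropies; the function names and the toy records are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2


def attribute_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def cluster_entropy(records):
    """Entropy of a cluster of categorical records.

    Assumption (for illustration): attributes are independent, so the
    cluster entropy is the sum of the per-attribute entropies.
    """
    # zip(*records) turns a list of rows into a sequence of columns.
    return sum(attribute_entropy(column) for column in zip(*records))


def expected_entropy(clusters):
    """Size-weighted average entropy over a set of clusters."""
    total = sum(len(cluster) for cluster in clusters)
    return sum(len(cluster) / total * cluster_entropy(cluster)
               for cluster in clusters)


# Hypothetical toy data: (color, size) records.
similar = [("red", "small"), ("red", "small"), ("red", "medium")]
dissimilar = [("red", "small"), ("blue", "large"), ("green", "medium")]

print(cluster_entropy(similar) < cluster_entropy(dissimilar))
```

Under these assumptions, the cluster of similar records has the lower entropy, which is the property a heuristic can exploit: a grouping of the records is better when its size-weighted (expected) entropy is smaller.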
People:
  • Daniel Barbará
  • Julia Couto
  • Yi Li