Introducing Group Reduce, a K-Means Algorithm for Clustering Groups

2016-10-18

Check out the groupreduce.py python file for the algorithm and this jupyter notebook for an example of the groupreduce algorithm in action. Full github repo here.

Group Reduce is a k-means clustering algorithm for reducing the number of groups in a data set (i.e., for clustering groups). It takes in pandas dataframes categorical data in 0/1 form, and outputs lists of groups close to one another based on based on k-means clustering. Group Reduce is designed for data sets where instances have multiple group membership.

A naive k-means clustering algorithms, when adding to a cluster, simply finds the midpoint between the n points in the cluster (i.e., it finds the point in a space that is the smallest summed euclidean distance between the points). When clustering individual instances, this method makes perfect sense.

However, when trying to cluster groups, it must be taken into account that groups can have different size populations and that group memberships can overlap. As a result, instead of finding the minimum sum of unweighted euclidean distances from points represented by each group (one point per group), when clustering groups the search for a centroid can be done more effectively by finding the minimum sum of unweighted euclidean distances of the union of the populations of the groups in the cluster. This is exactly how Group Reduce works.

For example, let's say cluster D includes groups A, B, and C, and that group A includes the set of points (K, J, M), group B includes the set of points (M, N, P, Q, X, Y), and group C includes the set of points (X, Y, Z). Here Group Reduce will find the union of the groups' memberships--the set of points (K, J, M, N, P, Q, X, Y, Z)--and then the minimum sum of unweighted euclidean distances from this newly created set representing the members of the cluster.

While Group Reduce is by no means the most efficient or cleanly coded algorithm ever created, it works as advertised. I'm hoping in to continue working on Group Reduce to make it more flexible and resilient.

Basic steps in Group Reduce algorithm:

Designate seed randomly
Find groups with centroids furthest from seed (and each other)
Create a dataframe for each seed, with rows where df[seed]==1
Iteratively find group with centroid closest to a cluster and integrate it into the cluster by creating a dataframe where df[seed]==1 or df[groups_already_integrated_into_cluster]==1 or df[integrated_group]==1

Function:

k_means (df, n_clusters (optional, default = 8), n_iter (optional, default = 10))

Parameters:

df: dataframe used in analysis
n_clusters: number of clusters returned
n_iter: number of iterations of k-means performed

Returned Values:

The k-means function returns one variable, a dictionary with two keys, 'best_result' and 'all_results'. The 'best_result' key holds a list of the groupings created with the iteration of the clustering algorithm that had lowest centroid movement and the associated centroid movement (inertia). The 'all_result' key holds a list of [groupings,inertia] created through the algorithm.

Library dependencies:

Pandas, Random

Tags: data science, algorithms, python

[Return Home]