Question

Observation/case weighting in cluster analysis in R

danielle.newby • 0

Hi everyone,

I have a large matrix of over 100K observations that I want to cluster using hierarchical clustering. Because of its size, I do not have the computing power to calculate the full distance matrix.

To overcome this problem, I aggregated the matrix by merging identical observations, which reduces it to about 10K rows. I have the frequency of each row in this aggregated matrix, and I now need to incorporate that frequency as a weight in the hierarchical clustering.
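A minimal sketch of the aggregation step described above, in base R. The toy matrix `x` and the names `key`, `x_unique` and `freq` are made up for illustration and are not from my real data. Rows are matched through a pasted text key, so two rows that differ only beyond the printed precision of their values would be treated as identical.

```r
set.seed(1)

# Hypothetical toy data: 6 distinct rows, each repeated a known number of times
base_rows <- matrix(rnorm(6 * 3), nrow = 6)
freq_true <- c(4, 1, 3, 2, 5, 1)
x <- base_rows[rep(seq_len(6), times = freq_true), , drop = FALSE]
x <- x[sample(nrow(x)), , drop = FALSE]  # shuffle so duplicates are not adjacent

# Build one text key per row; identical rows get identical keys
key <- apply(x, 1, paste, collapse = "\r")

# Keep the first occurrence of each distinct row
first <- !duplicated(key)
x_unique <- x[first, , drop = FALSE]

# Count how often each distinct row occurs, in the same order as x_unique
freq <- as.vector(table(key)[key[first]])
```

Here `x_unique` plays the role of the aggregated matrix and `freq` the role of the frequency column (`df$freq` below).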

I want to use hclust in the stats package. From the help information for hclust the arguments are as follows:

hclust(d, method = "complete", members = NULL)

The help entry for the members argument reads: "NULL or a vector with length size of d. See the ‘Details’ section." The Details section then says: "If members != NULL, then d is taken to be a dissimilarity matrix between clusters instead of dissimilarities between singletons and members gives the number of observations per cluster. This way the hierarchical cluster algorithm can be ‘started in the middle of the dendrogram’, e.g., in order to reconstruct the part of the tree above a cut (see examples). Dissimilarities between clusters can be efficiently computed (i.e., without hclust itself) only for a limited number of distance/linkage combinations, the simplest one being squared Euclidean distance and centroid linkage. In this case the dissimilarities between the clusters are the squared Euclidean distances between cluster means."

From the above description, I am unsure whether I can pass my frequency weights to the members argument, as it is not clear whether this is what the argument is intended for. I would like to use it like this:

hclust(d, method = "complete", members = df$freq)

Where df$freq is the frequency of each row in the aggregated matrix. So if a row is duplicated 10 times this value would be 10.

If anyone can help me, or can put me in contact with the developers of hclust, that would be great, as it's not clear to me what this argument is for.

Thanks

Danielle

hclust hierarchical clustering clustering weight • 2.2k views

updated 7.5 years ago by Aaron Lun • written 7.5 years ago by danielle.newby

score 0 · Answer 1 · 2017-07-25

hclust is a function in the stats package, which is one of the base R packages. This question doesn't have anything to do with Bioconductor, so it would be better suited to a more general forum (e.g., the R mailing list). For what it's worth, it seems that you could think of each set of identical observations as a "cluster", in which case a vector containing the sizes of those sets would probably be an appropriate setting for members. That said, having identical observations is quite unusual in high-dimensional biological settings.
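One way to check this interpretation on a small example is to compare the tree built from every observation with the tree built from the unique rows plus members. The sketch below uses made-up toy data and average linkage, which is my own choice for the check because cluster sizes enter its update formula; it is not the linkage from the question. As far as I understand, complete and single linkage do not use cluster sizes when updating distances, so members should not change their merge heights. The names `x_unique`, `freq`, `hc_full` and `hc_agg` are hypothetical.

```r
set.seed(42)

# Hypothetical toy data: 5 distinct rows, repeated with known frequencies
x_unique <- matrix(rnorm(5 * 2), nrow = 5)
freq <- c(3, 1, 4, 2, 2)
x_full <- x_unique[rep(seq_len(5), times = freq), , drop = FALSE]

# Reference: average linkage on every observation, duplicates included
hc_full <- hclust(dist(x_full), method = "average")

# Candidate: average linkage on the unique rows, set sizes passed via members
hc_agg <- hclust(dist(x_unique), method = "average", members = freq)

# In the full tree the duplicates merge first, at height 0; the remaining
# k - 1 merges should occur at the same heights as in the aggregated tree
k <- nrow(x_unique)
heights_full <- tail(hc_full$height, k - 1)
print(all.equal(heights_full, hc_agg$height))
```

If the comparison holds, the aggregated tree matches the part of the full dendrogram above height 0 for this linkage. Centroid or Ward-type linkages would need the appropriate (e.g. squared Euclidean) dissimilarities as input, so they would need a separate check.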