Hi,
I was trying to replicate the BHC library example code (https://bioconductor.org/packages/release/bioc/html/BHC.html) with the Breast Cancer Wisconsin (Diagnostic) dataset (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic), with PCA applied), but the clustering result is not what I expected.
I understood from the example code that, since my data is continuous, it should be discretized first (as is done in the 3rd example), so I replicated that part of the example:
BiocManager::install("BHC")
library(BHC)
library(RCurl)
library(factoextra)
breastCancer <- getURL('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data')
names <- c('id_number', 'diagnosis', 'radius_mean',
           'texture_mean', 'perimeter_mean', 'area_mean',
           'smoothness_mean', 'compactness_mean',
           'concavity_mean', 'concave_points_mean',
           'symmetry_mean', 'fractal_dimension_mean',
           'radius_se', 'texture_se', 'perimeter_se',
           'area_se', 'smoothness_se', 'compactness_se',
           'concavity_se', 'concave_points_se',
           'symmetry_se', 'fractal_dimension_se',
           'radius_worst', 'texture_worst',
           'perimeter_worst', 'area_worst',
           'smoothness_worst', 'compactness_worst',
           'concavity_worst', 'concave_points_worst',
           'symmetry_worst', 'fractal_dimension_worst')
breastCancer <- read.table(textConnection(breastCancer),
                           sep = ',',
                           col.names = names)
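# Columns 3 to 32 are the 30 numeric predictors (columns 1 and 2 are id and diagnosis)
breastCancer.predictors <- breastCancer[3:32]
# PCA on centered and scaled predictors; keep the first 7 principal components
breastCancer.prcomp <- prcomp(breastCancer.predictors, scale = TRUE, center = TRUE)
breastCancer.PCA <- breastCancer.prcomp$x[, 1:7]
newData2 <- breastCancer.PCA
itemLabels2 <- breastCancer$diagnosis
# Discretize as in the 3rd package example: find the binning, apply it to the
# transposed matrix, then transpose the result back
percentiles <- FindOptimalBinning(newData2, itemLabels2, transposeData=TRUE, verbose=TRUE)
discreteData <- DiscretiseData(t(newData2), percentiles=percentiles)
discreteData <- t(discreteData)
# Run BHC on the discretized data and inspect the result
hc3 <- bhc(discreteData, itemLabels2, verbose=TRUE)
plot(hc3, axes=FALSE)
WriteOutClusterLabels(hc3, verbose=TRUE)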
However, although I get two clusters, the first one contains only a single item and the second one contains all the rest, which is far from my expected result. Am I doing something wrong?
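In case the discretization step is where things go wrong, a base-R sketch along the following lines could serve as a reference to compare against the output of DiscretiseData. It is not BHC's implementation: the helper name discretise_by_quantile, the 15%/85% cut points, and the simulated matrix standing in for the PCA scores are all assumptions made purely for illustration.

```r
# Hypothetical helper (NOT part of BHC): bin each column into 3 levels
# (0, 1, 2) using a lower and an upper quantile as cut points.
# The 0.15 / 0.85 defaults are arbitrary illustrative values.
discretise_by_quantile <- function(x, lower = 0.15, upper = 0.85) {
  apply(x, 2, function(col) {
    cuts <- quantile(col, probs = c(lower, upper))
    # findInterval gives 0 below the lower cut, 1 between the cuts,
    # and 2 at or above the upper cut
    findInterval(col, cuts)
  })
}

set.seed(1)
# Simulated stand-in for a matrix of PCA scores (50 items x 4 components)
toy <- matrix(rnorm(200), nrow = 50, ncol = 4)
toyDiscrete <- discretise_by_quantile(toy)

# Count how many items fall in each bin, per column. A column dominated by
# a single value would suggest the discretization is discarding the signal.
binCounts <- apply(toyDiscrete, 2, function(col) table(factor(col, levels = 0:2)))
print(binCounts)
```

Running the same kind of per-column count on discreteData (for example with table) would show whether almost all values end up in one bin, which might explain a near-trivial clustering.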
Thanks in advance.