WGCNA: cleaning for sample outliers
@genomic_region-13050

Hi there,  

I'm working with 363 samples and about 10K genes. My workflow is: load the data, transpose it, set the gene names, and cluster the samples with hclust. I plot the tree once to see where the cut line should go, then redraw it with an abline at the height I picked by eye.

I'm lost on how cutHeight and minSize work when removing outlier samples. My code and questions are below:

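library(WGCNA) # goodSamplesGenes, cutreeStatic, printFlush
library(Cairo) # CairoJPEG

exprs_data <- read.table("complete_genes_mapped", header = TRUE)
# transpose so rows are samples and columns are genes; drop the gene column
data_exprs.cleaned <- as.data.frame(t(exprs_data[, -c(1)]))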

#add row names, and col names
names(data_exprs.cleaned) = exprs_data$gene
rownames(data_exprs.cleaned) = names(exprs_data)[-c(1)]

#check data for excessive missing values and identification of outlier microarray samples
gsg = goodSamplesGenes(data_exprs.cleaned, verbose = 3);

#--everything OK with mapped genes
if (!gsg$allOK)
{
  # Optionally, print the gene and sample names that were removed:
  if (sum(!gsg$goodGenes) > 0)
    printFlush(paste("Removing genes:", paste(names(data_exprs.cleaned)[!gsg$goodGenes], collapse = ", ")));
  if (sum(!gsg$goodSamples) > 0)
    printFlush(paste("Removing samples:", paste(rownames(data_exprs.cleaned)[!gsg$goodSamples], collapse = ", ")));
  # Remove the offending genes and samples from the data:
  data_exprs.cleaned = data_exprs.cleaned[gsg$goodSamples, gsg$goodGenes]
}

#Check outliers
sampleTree = hclust(dist(data_exprs.cleaned ), method = "average"); #do clustering 

# Plot the sample tree: 
# The user should change the dimensions if the window is too large or too small.

CairoJPEG("sample_outliers_tree.jpeg",width=1200,height=900)
par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5,cex.axis = 1.5, cex.main = 2)
abline(h=90, col = "red")
dev.off()

But now comes the foggy part:

labels_min10 = cutreeStatic(sampleTree, cutHeight = 90,minSize=10)
table(labels_min10)

labels
  0   1   2   3
  3 298  34  28

labels_def = cutreeStatic(sampleTree, cutHeight = 90) # default minSize is 50
table(labels_def)
labels
  0   1
 65 298

With cutHeight = 90 and the default minSize of 50, I lose 65 samples (those labeled 0), which is about 18% of the 363 input samples. I don't know what to do.
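
As a quick check on the numbers above, the two tables can be compared directly. This is a small sketch that uses only the counts posted in the question; the variable names are made up for illustration.

```r
# Counts copied from the two table() outputs in the question
counts_min10 <- c("0" = 3, "1" = 298, "2" = 34, "3" = 28)  # minSize = 10
counts_def   <- c("0" = 65, "1" = 298)                     # default minSize = 50

n_samples <- sum(counts_min10)  # total number of samples

# Share of samples labeled 0 (unassigned) under each setting, in percent
pct_lost_min10 <- 100 * counts_min10[["0"]] / n_samples
pct_lost_def   <- 100 * counts_def[["0"]] / n_samples

# Label 0 at minSize = 50 has the same count as labels 0, 2 and 3 at minSize = 10
same_count <- counts_def[["0"]] == sum(counts_min10[c("0", "2", "3")])

print(round(c(min10 = pct_lost_min10, default = pct_lost_def), 1))
print(same_count)
```

The counts are consistent with clusters 2 and 3 (34 and 28 samples) being moved into label 0 once minSize exceeds their size, although the counts alone do not prove that the same samples are involved.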

Also, I can't work out how the minimum cluster size works here. My questions are:

  1. Does it mean that samples are dropped when they fall in a cluster with fewer than N samples (50 or 10) below cutHeight?
  2. What do labels 2 and 3 mean in labels_min10?

Tutorial link: https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-01-dataInput.pdf

 


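1. Yes. After the tree is cut at cutHeight, any branch with fewer than minSize samples gets label 0. Those samples are treated as "outliers" because too few of them follow the same pattern to be reliable.

2. Each nonzero label is a group of samples, so with minSize = 10 there are two other groups (labels 2 and 3, besides group 1) that behave differently.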
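
To make the labeling rule concrete, here is a minimal base-R sketch on toy data. It is an approximation of the behaviour described above, not WGCNA's own code: `static_cut_labels` is a hypothetical helper, and the toy values, cut height, and sizes are assumptions chosen so the result is easy to follow.

```r
# Hypothetical helper: cut a tree at a fixed height, number the branches by
# decreasing size, and give label 0 to samples in branches below min_size.
static_cut_labels <- function(tree, cut_height, min_size) {
  raw   <- cutree(tree, h = cut_height)            # one branch id per sample
  sizes <- sort(table(raw), decreasing = TRUE)     # branch sizes, largest first
  kept  <- names(sizes)[as.vector(sizes) >= min_size]
  match(as.character(raw), kept, nomatch = 0L)     # 1 = largest kept branch, 0 = dropped
}

# Toy data: 20 samples near 0, 8 samples near 100, and two isolated samples
x <- c((0:19) / 10, 100 + (0:7) / 10, 300, 400)
tree <- hclust(dist(matrix(x, ncol = 1)), method = "average")

labels_small <- static_cut_labels(tree, cut_height = 50, min_size = 5)
labels_large <- static_cut_labels(tree, cut_height = 50, min_size = 10)

print(table(labels_small))  # the group of 8 keeps its own label
print(table(labels_large))  # the group of 8 falls below min_size and joins label 0
```

Raising the minimum size does not change where the tree is cut; it only changes which branches are large enough to keep a nonzero label.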


Hi Lluis,  

Thank you very much. That helps. :)


Is it advisable to keep the samples that fall into clusters other than cluster 1?
