Question

WGCNA: cleaning for sample outliers

0

GENOMIC_region • 0

Hi there,

I'm working with 363 samples and 10K genes. My workflow is: load the data, transpose it, set the gene names, and cluster the samples with hclust. I plot the tree once to see where the cut line should go, then re-plot the clustering with an abline at that eyeballed height.

I'm lost on how cutHeight and minSize work when cleaning samples. My code and my doubts are below:

library(WGCNA)  # goodSamplesGenes, printFlush, cutreeStatic
library(Cairo)  # CairoJPEG

exprs_data<-read.table("complete_genes_mapped",header=TRUE)
data_exprs.cleaned<-as.data.frame(t(exprs_data[, -c(1)])); #remove gene column

#add row names, and col names
names(data_exprs.cleaned) = exprs_data$gene
rownames(data_exprs.cleaned) = names(exprs_data)[-c(1)]

#check data for excessive missing values and identification of outlier microarray samples
gsg = goodSamplesGenes(data_exprs.cleaned, verbose = 3);

#--everything OK with mapped genes
if (!gsg$allOK)
{
  # Optionally, print the gene and sample names that were removed:
  if (sum(!gsg$goodGenes)>0)
    printFlush(paste("Removing genes:", paste(names(data_exprs.cleaned)[!gsg$goodGenes], collapse = ", ")));
  if (sum(!gsg$goodSamples)>0)
    printFlush(paste("Removing samples:", paste(rownames(data_exprs.cleaned)[!gsg$goodSamples], collapse = ", ")));
  # Remove the offending genes and samples from the data:
  data_exprs.cleaned = data_exprs.cleaned[gsg$goodSamples, gsg$goodGenes]
}

#Check outliers
sampleTree = hclust(dist(data_exprs.cleaned ), method = "average"); #do clustering 

# Plot the sample tree: 
# The user should change the dimensions if the window is too large or too small.

CairoJPEG("sample_outliers_tree.jpeg",width=1200,height=900)
par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5,cex.axis = 1.5, cex.main = 2)
abline(h=90, col = "red")
dev.off()
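As a complement to eyeballing the abline, the merge heights stored in the tree can be inspected directly. The following is only a sketch on toy data, using base R: the matrix toy_expr, the shifted "outlier" sample, and the candidate heights are assumptions made for illustration, not values from the real data set.

```r
# Toy data (assumption): 20 samples x 5 genes, with the last sample shifted
# far away from the rest so that it behaves like an outlier.
set.seed(1)
toy_expr <- matrix(rnorm(100), nrow = 20,
                   dimnames = list(paste0("S", 1:20), paste0("g", 1:5)))
toy_expr[20, ] <- toy_expr[20, ] + 50

toy_tree <- hclust(dist(toy_expr), method = "average")

# The largest merge heights show where the big jumps in the tree are.
top_heights <- tail(sort(toy_tree$height), 3)
print(top_heights)

# Number of samples on each branch for a few candidate cut heights
# (hypothetical values chosen for this toy tree).
candidates <- c(1, 10, 60)
branch_sizes <- lapply(candidates, function(h) table(cutree(toy_tree, h = h)))
names(branch_sizes) <- paste0("h=", candidates)
print(branch_sizes)
```

A large gap between the last merge height and the ones before it suggests a cut height that separates a few distant samples from the bulk. The same two calls can be applied to sampleTree.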

But now comes the foggy part:

labels_min10 = cutreeStatic(sampleTree, cutHeight = 90,minSize=10)
table(labels_min10)

labels
  0   1   2   3
  3 298  34  28

labels_def = cutreeStatic(sampleTree, cutHeight = 90) #min size is 50
table(labels_def)

labels
  0   1
 65 298

With cutHeight = 90 and the default minSize of 50, I lose 65 samples (the ones labelled 0), which is about 18% of the 363 input samples. I don't know what to do.

I also can't work out how the minimum cluster size works here. My questions are:

1. Does it mean that I drop samples belonging to clusters smaller than N (50 or 10) below cutHeight?
2. What do labels 2 and 3 mean in labels_min10?

Tutorial link: https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-01-dataInput.pdf

microarray gene network WGCNA • 3.0k views

ADD COMMENT • link updated 6.2 years ago by Bioconductor Community 0 • written 6.2 years ago by GENOMIC_region • 0

1

1. Yes. The threshold is there to remove samples that are "outliers": any branch below cutHeight with fewer than minSize samples is labelled 0, because too few samples follow the same pattern for the group to be reliable.

2. Each label is a group of samples, so there are two other groups (besides group 1) that behave differently.
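
To make the role of minSize concrete, here is a minimal base-R sketch of the cut-then-filter logic described above. It is not WGCNA's implementation: cut_static_sketch is a hypothetical helper, and the toy data are constructed only so that the branch sizes match the 298/34/28/3 split reported in the question.

```r
# Hypothetical helper (not part of WGCNA): cut the tree at a fixed height,
# then label every branch with fewer than min_size members as 0.
# Remaining branches are numbered 1, 2, ... from largest to smallest.
cut_static_sketch <- function(tree, cut_height, min_size = 50) {
  raw <- cutree(tree, h = cut_height)            # branch id per sample
  sizes <- table(raw)
  sz <- setNames(as.integer(sizes), names(sizes))
  kept <- sz[sz >= min_size]                     # branches that are big enough
  kept <- kept[order(-kept)]                     # largest branch first
  ids <- as.integer(names(kept))
  labels <- integer(length(raw))                 # everything starts as 0
  for (i in seq_along(ids)) labels[raw == ids[i]] <- i
  labels
}

# Toy data (assumption): four well-separated 1-D groups of the same sizes
# as in the question, so that cutting at height 90 yields four branches.
group_sizes <- c(298, 34, 28, 3)
centers <- c(0, 1000, 2000, 3000)
toy <- unlist(Map(function(n, ctr) ctr + (seq_len(n) %% 10),
                  group_sizes, centers))
toy_tree <- hclust(dist(matrix(toy, ncol = 1)), method = "average")

tab10 <- table(cut_static_sketch(toy_tree, cut_height = 90, min_size = 10))
tab50 <- table(cut_static_sketch(toy_tree, cut_height = 90, min_size = 50))
print(tab10)  # labels 0, 1, 2, 3 with counts 3, 298, 34, 28
print(tab50)  # labels 0, 1 with counts 65, 298
```

With min_size = 50 the branches of 34, 28 and 3 samples all fall below the threshold, which accounts for the 65 samples labelled 0 (34 + 28 + 3 = 65).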

ADD REPLY • link 6.2 years ago Lluís Revilla Sancho ▴ 760

0

Hi Lluis,

Thank you very much. That helps. :)

ADD REPLY • link 6.2 years ago GENOMIC_region • 0

0

Is it advisable to keep the samples that fall into clusters other than cluster 1?
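
Mechanically, either choice is just a row subset on the label vector. The linked tutorial keeps only cluster 1; keeping every non-zero label is the alternative. The sketch below uses toy stand-ins (expr and labels are assumptions, not the real data) and does not settle which choice is appropriate for a given study.

```r
# Toy stand-ins (assumptions): 6 samples x 2 genes, plus a label vector of
# the kind cutreeStatic returns (0 = not assigned to any cluster).
expr <- data.frame(geneA = c(1, 2, 3, 4, 5, 6),
                   geneB = c(6, 5, 4, 3, 2, 1),
                   row.names = paste0("S", 1:6))
labels <- c(1, 1, 1, 2, 2, 0)

# Option A: keep only the main cluster, as the tutorial does.
keep_main <- labels == 1
expr_main <- expr[keep_main, , drop = FALSE]

# Option B: keep every sample that landed in some cluster (label != 0).
keep_clustered <- labels != 0
expr_clustered <- expr[keep_clustered, , drop = FALSE]

print(rownames(expr_main))
print(rownames(expr_clustered))
```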

ADD REPLY • link 5.1 years ago GENOMIC_region • 0