How to select top 10% highly variable genes in microarray data?
Biologist ▴ 120
@biologist-9801
Last seen 4.8 years ago

Hi,

I'm working with microarray data and using the "oligo" R package for background correction and normalisation of the expression values. After normalisation I want to calculate Z-scores to generate a heatmap.

As there are around 25,000 genes with expression values in the matrix, I want to create the heatmap with only the top 10% highly variable genes.

I'm looking for the best statistical way to select the top 10% highly variable genes for the heatmap.

After some Google searching I found the following approach, where "normdata" is a matrix with 25,000 genes after background correction and normalisation:

        x <- apply(normdata, 1, IQR) #Calculate IQR
        y <- normdata[x > quantile(x, 0.9), ] #selecting top 10% highly variable genes

Do you think the above code is the right way to select the top 10% highly variable genes?

Thank you

r microarray snp6.0 oligo • 4.7k views
@james-w-macdonald-5106
Last seen 10 minutes ago
United States

That's a way to do it, so long as you also account for NA values. Or you could use varFilter in the genefilter package, which will be much faster.

> library(genefilter)
> z <- matrix(rnorm(1e6), ncol = 10)
> system.time(varFilter(z, var.cutoff = 0.9))
   user  system elapsed
   0.05    0.00    0.05

> fun <- function(z){y <- apply(z, 1, IQR); z[y > quantile(y, 0.9),]}
> system.time(fun(z))
   user  system elapsed
   6.08    0.00    6.14

But even with 1e5 'genes' your way only requires you to wait six extra seconds...
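
The NA caveat mentioned in this answer can be sketched roughly as follows. This is only an illustration under stated assumptions: `select_top_variable` and `top_fraction` are hypothetical names, not part of any package, and the demo matrix is simulated rather than real normalised data.

```r
# Hypothetical helper: keep the rows (genes) whose IQR lies above the
# (1 - top_fraction) quantile, ignoring missing values along the way.
select_top_variable <- function(mat, top_fraction = 0.1) {
  # Per-gene IQR; na.rm = TRUE skips missing values within a row
  iqr <- apply(mat, 1, IQR, na.rm = TRUE)
  # Cutoff computed only from genes that have a usable IQR
  cutoff <- quantile(iqr, 1 - top_fraction, na.rm = TRUE)
  # Rows that are entirely NA give an NA IQR and are dropped here
  keep <- !is.na(iqr) & iqr > cutoff
  mat[keep, , drop = FALSE]
}

# Simulated stand-in for a normalised expression matrix (assumption)
set.seed(1)
demo <- matrix(rnorm(1000), ncol = 10)
demo[1, ] <- NA    # a gene with no usable values
demo[2, 3] <- NA   # a gene with a single missing value
top <- select_top_variable(demo, top_fraction = 0.1)
dim(top)
```

Without `na.rm = TRUE`, `IQR()` stops with an error on any row that contains an `NA`, which is why the plain `apply(normdata, 1, IQR)` version needs this adjustment when values are missing.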


Dear James,

Thanks for the reply. I'm not asking which approach is faster. I'm asking whether the code given above can be used to select the top 10% highly variable genes.

One more question: do I need to select the top 10% highly variable genes before or after normalisation?

Thank you

SamGG ▴ 360
@samgg-6428
Last seen 10 days ago
France/Marseille/Inserm

Hi,

I am not an expert, but IMHO your code is correct for your goal.

Selection should take place AFTER normalization, although if your samples are roughly similar there should not be much difference between filtering before and after.

Just a word concerning the Z-score: it rescales each gene by its own dispersion, so in the heatmap every row ends up with the same spread, and the differences in variability that the IQR selection was based on are no longer visible. I always look at row-centred data before using the Z-score.
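
To make the difference between row centring and Z-scoring concrete, here is a minimal sketch in base R. The helper names `row_center` and `row_zscore` are hypothetical and the two-gene matrix is made up for illustration.

```r
# Centre each row (gene) on its own mean; the spread is left untouched.
row_center <- function(mat) {
  sweep(mat, 1, rowMeans(mat, na.rm = TRUE), "-")
}

# Z-score each row: centre, then divide by the row's standard deviation.
# scale() works on columns, hence the transpose before and after.
row_zscore <- function(mat) {
  t(scale(t(mat)))
}

# Toy matrix (assumption): two genes with the same pattern, different spread
toy <- matrix(c(1, 2, 3,
                10, 20, 30), nrow = 2, byrow = TRUE)
row_center(toy)  # the second gene still varies ten times more than the first
row_zscore(toy)  # both genes now have unit standard deviation
```

After centring, the more variable gene still stands out in a heatmap; after Z-scoring, both rows look alike, so the choice depends on whether the magnitude of variation should remain visible.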

Best.
