Question

Any efficient workaround to identify highly variable genes in Affymetrix microarray expression data?

2

Entering edit mode

Jurat Shahidin ▴ 80

@jurat-shahidin-9488

Last seen 5.1 years ago

Chicago, IL, USA

Hi:

I am experimenting preprocessed Affymetrix microarrays expression data matrix (Affymetrix probe-sets in rows (32830 probesets), and RNA samples in columns (735 samples)) for my downstream analysis. Here is how the data looks like:

> dim(eset_HTA20)
[1] 32830   735
## load assayData
HTA20_rma <- load("HTA20_RMA.RData")

> eset_HTA20[1:10,1:3]
             Tarca_001_P1A01 Tarca_003_P1A03 Tarca_004_P1A04
1_at                6.062215        6.125023        5.875502
10_at               3.796484        3.805305        3.450245
100_at              5.849338        6.191562        6.550525
1000_at             3.567779        3.452524        3.316134
10000_at            6.166815        5.678373        6.185059
100009613_at        4.443027        4.773199        4.393488
100009676_at        5.836522        6.143398        5.898364
10001_at            6.330018        5.601745        6.137984
10002_at            4.922339        4.711765        4.628124
10003_at            2.689344        2.771010        2.556756

However, I am interested in to reduce the dimension of the dataset by tossing off low expressed genes in the experiment. To do so, I need to quantify the variation of the gene that expressed. I checked the possible approach and using the coefficient of variation (cv) could be one of the approaches I could try. It is not intuitive for me how to quantify the variation of genes that expressed.

My attempt with genefilter utilities

Here is my attempt to find out highly variable genes by using genefilter utilities:

eset_exprs <- ExpressionSet(eset_HTA20)
f1 <- kOverA(5, 200)
ffun <- filterfun(f1)
wh1 <- genefilter(exprs(eset_exprs), ffun)
sum(wh1)

but results are not very meaningful to me because I am expecting at least 5k genes after filtration, and if I set up expression measure as 200, then no gene is left after all. Why? How can I tune the parameter to get a correct number of filtered genes which shows high variation in expression? Any idea? I am expecting the output as data.frame for filtered genes with significant expression variation.

My attempt by using DESeq2 utilities

require(DESeq)
lib.size <- estimateSizeFactorsForMatrix(eset_HTA20)
ed <- t(t(eset_HTA20)/lib.size)

means <- rowMeans(ed)
vars <- apply(ed,1,var)
cv2 <- vars/means^2
par(mar=c(3.5,3.5,1,1),mgp=c(2,0.65,0),cex=0.9)
smoothScatter(log(means),log(cv2))

but I have a hard time to interpret the information on the scatter plot. Can anyone point me out how to refine the above approach? Can anyone point me out how to lay out this procedure on the above expression data? Is there any Bioconductor package to do this kind of task? any idea to make this happen? thanks

limma microarray genefilter deseq2 • 2.8k views

ADD COMMENT • link 5.9 years ago Jurat Shahidin ▴ 80

0

Entering edit mode

Jurat Shahidin ▴ 80

@jurat-shahidin-9488

Last seen 5.1 years ago

Chicago, IL, USA

Thanks for Gordon's help, I come across with my simple solution as follow:

SD <- apply(eset_HTA20,1, sd)
CV <- base::sqrt(exp(SD^2)-1)
eset_filtered <- eset_HTA20[CV > quantile(CV, probs = 0.1),]

ADD COMMENT • link 5.9 years ago Jurat Shahidin ▴ 80

1

Entering edit mode

Yes, that is the correct way to compute the unlog-scale CV from the log-scale SD, but do you appreciate that filtering on SD is exactly the same as filtering on CV, so there is actually no need to compute CV?

ADD REPLY • link 5.9 years ago Gordon Smyth 52k

0

Entering edit mode

yes, I do highly endorse your solution and already accepted as an answer. Here I just show what I did after learning your solution because I am still learning gene-expression analysis, so it could be good practice for me to come up my own solution. please don't me wrong, I could delete my solution if you want me to do. Thanks

ADD REPLY • link 5.9 years ago Jurat Shahidin ▴ 80

score 4 · Accepted Answer · 2019-06-19

4

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 2 hours ago

WEHI, Melbourne, Australia

Using CV is not appropriate for the type of data you have, which is on a log2-scale. CV is only appropriate for unlogged data.

Filtering low-expressed genes is very easy for Affymetrix data, and you don't need a specialist package to do it. However the appropriate filtering method depends on what downstream analysis you are intending.

As a first step, you need to decide whether are you after highly variable genes (as the title of your question says) or highly expressed genes (as the text of your question says). For PCA analysis I would go with the first whereas for a DE analysis I would go with the second.

One good procedure might be to keep probes for which the normalized log2-intensity is >4 for at least 10% of the samples. You would do that by:

keep <- rowSums( HTA20_rma > 4 ) > 735/10
HTA20_filtered <- HTA20_rma[keep, ]

This is just ordinary R programming.

ADD COMMENT • link 5.9 years ago Gordon Smyth 52k

0

Entering edit mode

Could you elaborate your point please? My intention is to use the coefficient of variation for each gene (a.k.a, probe), to toss away low expressed genes, ultimately to obtain expression matrix with only highly expressed genes in the experiment.

My downstream analysis, after properly deal with Affymetrix expression data (remove noise, normalize), to train the ML model with it. I am trying to do possible feature engineering on this Affymetrix expression data (such as using statistical methods to measure the proportion of highly expressed genes, find out the correlation between each gene with target experiment observation, and so on)

Would it be possible to extend your comment with possible example demo code instead? How can I refine my approach? Thank you

ADD REPLY • link 5.9 years ago Jurat Shahidin ▴ 80

2

Entering edit mode

Why do you think CV is a measure of how highly expressed a gene is?

You seem to want to do pretty sophisticated analyses, yet at the same time you seem to struggle with the simplest thing and want people to give you example code to help you along. These are mutually exclusive! You cannot do sophisticated things if you cannot do simple things.

Anyway, this support site is intended to help people with technical, programmatic aspects of using Bioconductor tools rather than an ongoing primer for people who don't really know how to use R. While R and Bioconductor are free to download, they are most certainly not free to use. The cost of both is the time and effort required to learn how to use the software, and to have (or gain) the statistical knowledge to know what you are doing.

You can't get that knowledge on a support site! The only way to do so is by either spending the time and effort to learn how to do what you want, or to find someone who already knows what to do and pay for their service. All things equal, the latter is probably better than the former, unless you have a real need to know this stuff.

ADD REPLY • link 5.9 years ago James W. MacDonald 68k

0

Entering edit mode

I do follow community rule and thanks again for your clarification, appreciated.

ADD REPLY • link 5.9 years ago Jurat Shahidin ▴ 80