Hi:
I am experimenting preprocessed Affymetrix microarrays expression data matrix (Affymetrix probe-sets in rows (32830 probesets), and RNA samples in columns (735 samples)) for my downstream analysis. Here is how the data looks like:
> dim(eset_HTA20)
[1] 32830 735
## load assayData
HTA20_rma <- load("HTA20_RMA.RData")
> eset_HTA20[1:10,1:3]
Tarca_001_P1A01 Tarca_003_P1A03 Tarca_004_P1A04
1_at 6.062215 6.125023 5.875502
10_at 3.796484 3.805305 3.450245
100_at 5.849338 6.191562 6.550525
1000_at 3.567779 3.452524 3.316134
10000_at 6.166815 5.678373 6.185059
100009613_at 4.443027 4.773199 4.393488
100009676_at 5.836522 6.143398 5.898364
10001_at 6.330018 5.601745 6.137984
10002_at 4.922339 4.711765 4.628124
10003_at 2.689344 2.771010 2.556756
However, I am interested in to reduce the dimension of the dataset by tossing off low expressed genes in the experiment. To do so, I need to quantify the variation of the gene that expressed. I checked the possible approach and using the coefficient of variation (cv
) could be one of the approaches I could try. It is not intuitive for me how to quantify the variation of genes that expressed.
My attempt with genefilter
utilities
Here is my attempt to find out highly variable genes by using genefilter
utilities:
eset_exprs <- ExpressionSet(eset_HTA20)
f1 <- kOverA(5, 200)
ffun <- filterfun(f1)
wh1 <- genefilter(exprs(eset_exprs), ffun)
sum(wh1)
but results are not very meaningful to me because I am expecting at least 5k genes after filtration, and if I set up expression measure as 200, then no gene is left after all. Why? How can I tune the parameter to get a correct number of filtered genes which shows high variation in expression? Any idea? I am expecting the output as data.frame for filtered genes with significant expression variation.
My attempt by using DESeq2
utilities
require(DESeq)
lib.size <- estimateSizeFactorsForMatrix(eset_HTA20)
ed <- t(t(eset_HTA20)/lib.size)
means <- rowMeans(ed)
vars <- apply(ed,1,var)
cv2 <- vars/means^2
par(mar=c(3.5,3.5,1,1),mgp=c(2,0.65,0),cex=0.9)
smoothScatter(log(means),log(cv2))
but I have a hard time to interpret the information on the scatter plot. Can anyone point me out how to refine the above approach? Can anyone point me out how to lay out this procedure on the above expression data? Is there any Bioconductor package to do this kind of task? any idea to make this happen? thanks
Could you elaborate your point please? My intention is to use the coefficient of variation for each gene (a.k.a, probe), to toss away low expressed genes, ultimately to obtain expression matrix with only highly expressed genes in the experiment.
My downstream analysis, after properly deal with Affymetrix expression data (remove noise, normalize), to train the ML model with it. I am trying to do possible feature engineering on this Affymetrix expression data (such as using statistical methods to measure the proportion of highly expressed genes, find out the correlation between each gene with target experiment observation, and so on)
Would it be possible to extend your comment with possible example demo code instead? How can I refine my approach? Thank you
Why do you think CV is a measure of how highly expressed a gene is?
You seem to want to do pretty sophisticated analyses, yet at the same time you seem to struggle with the simplest thing and want people to give you example code to help you along. These are mutually exclusive! You cannot do sophisticated things if you cannot do simple things.
Anyway, this support site is intended to help people with technical, programmatic aspects of using Bioconductor tools rather than an ongoing primer for people who don't really know how to use R. While R and Bioconductor are free to download, they are most certainly not free to use. The cost of both is the time and effort required to learn how to use the software, and to have (or gain) the statistical knowledge to know what you are doing.
You can't get that knowledge on a support site! The only way to do so is by either spending the time and effort to learn how to do what you want, or to find someone who already knows what to do and pay for their service. All things equal, the latter is probably better than the former, unless you have a real need to know this stuff.
I do follow community rule and thanks again for your clarification, appreciated.