Question

Is filtering necessary to extract significant DE genes with limma ?

0

Entering edit mode

giroudpaul ▴ 40

@giroudpaul-10031

Last seen 5.0 years ago

France

Hi,

So I have this dataset http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5099

and I simply want to extract the significant DE genes between some conditions.

I am new to this, but so far, following the advises to do the hgu133A chips first, I am there :

celpath = "C:/Users/Paul/Documents/Microarray/GSE5099/CEL/U133A/"
fns = list.celfiles(path=celpath,full.names=TRUE)
cat("Reading files:\n",paste(fns,collapse="\n"),"\n")
data = ReadAffy(celfile.path=celpath)

ph = dataA@phenoData
ph@data
ph@data[ ,1] = c("M0_1","M0_2","M0_3","Md_1","Md_2","Md_3","Mph_1","Mph_2","Mph_3","M1_1","M1_2","M1_3","M2_1","M2_2","M2_3")
ph@data[ ,2] = c("M0","M0","M0","Md","Md","Md","Mph","Mph","Mph","M1","M1","M1","M2","M2","M2")
colnames(ph@data)[2]="source"
ph@data

data.rma = rma(data)

###Annotate
ID <- featureNames(data.rma)
Symbol <- getSYMBOL(ID, "hgu133a.db")
fData(data.rma) <- data.frame(Symbol=Symbol)

#create a design matrix based on the sample annotation
groups = ph@data$source
f = factor(groups,levels = c("M0","Md","Mph","M1","M2"))
design = model.matrix(~0 + f)
colnames(design)=c("M0","Md","Mph","M1","M2")
data.fit = lmFit(data.rma,design)

##make pair-wise comparisons between groups
contrast.matrix = makeContrasts(Md-M0, Mph-M0, M1-M0, M2-M0, M1-Mph, M2-Mph, M2-M1, levels=design)
data.fit.con = contrasts.fit(data.fit,contrast.matrix)
data.fit.con.eb = eBayes(data.fit.con)

topTable(data.fit.con.eb, n=10, adjust="BH")

(I also did some Quality plots that I didn't included here since everything seems fine)

Then, I read a lot of posts and tutorial that suggest to filter probes. Since I just want to extract a list of significantly DE genes for each of my contrast. For now, I used this filter, that filter about 2500 lines form 22283 in data.rma

##Filtering:
data.filt <- nsFilter(data.rma, require.entrez=TRUE,
                      require.GOBP=FALSE, require.GOCC=FALSE,
                      require.GOMF=FALSE, require.CytoBand=FALSE,
                      remove.dupEntrez=FALSE, var.func=IQR,
                      var.cutoff=0.5, var.filter=FALSE,
                      filterByQuantile=TRUE, feature.exclude="^AFFX")$eset
dim(data.rma)
dim(data.filt)

However, it removes the AFFX probes, which I could use to determine the p.value cut off using this

results = decideTests(data.fit.con.eb,method='global',adjust.method="BH",p.value=0.05,lfc=1)
i <- grep("AFFX",featureNames(data.rma))
summary(data.fit.con.eb$F.p.value[i])
results <- classifyTestsF(data.fit.con.eb, p.value=0.00000018)
summary(results)

I read that the lesser the probes for the eBayes, the better it is, but I may not have understood everything so well.

My questions are :

Is everything I done correct ?
Should I filter first and then do the eBayes ? It change little, but still, the F.p.value with filtering is a little bit higher.
If I remove the AFFX probes, how do I choose my F.p.value cut off ?

Thanks for your help

> sessionInfo()
R version 3.2.4 Revised (2016-03-16 r70336)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252    LC_MONETARY=French_France.1252
[4] LC_NUMERIC=C                   LC_TIME=French_France.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] hgu133bcdf_2.18.0     hgu133bprobe_2.18.0   hgu133b.db_3.2.2      hgu133aprobe_2.18.0  
 [5] hgu133acdf_2.18.0     hgu133a.db_3.2.2      org.Hs.eg.db_3.2.3    KEGG.db_3.2.2        
 [9] GSEABase_1.32.0       annotate_1.48.0       XML_3.98-1.4          GOstats_2.36.0       
[13] graph_1.48.0          Category_2.36.0       GO.db_3.2.2           RSQLite_1.0.0        
[17] DBI_0.3.1             AnnotationDbi_1.32.3  IRanges_2.4.8         S4Vectors_0.8.11     
[21] affydata_1.18.0       simpleaffy_2.46.0     genefilter_1.52.1     affyPLM_1.46.0       
[25] preprocessCore_1.32.0 gcrma_2.42.0          affy_1.48.0           BiocInstaller_1.20.1 
[29] Biobase_2.30.0        BiocGenerics_0.16.1   ggplot2_2.1.0         rpart_4.1-10         
[33] Matrix_1.2-4          lattice_0.20-33       limma_3.26.9         

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4            plyr_1.8.3             XVector_0.10.0         tools_3.2.4           
 [5] zlibbioc_1.16.0        gtable_0.2.0           Biostrings_2.38.4      grid_3.2.4            
 [9] RBGL_1.46.0            survival_2.38-3        scales_0.4.0           splines_3.2.4         
[13] AnnotationForge_1.12.2 colorspace_1.2-6       xtable_1.8-2           munsell_0.4.3         
[17] affyio_1.40.0

hgu133a • 2.9k views

ADD COMMENT • link updated 8.7 years ago by Gordon Smyth 52k • written 8.7 years ago by giroudpaul ▴ 40

score 5 · Accepted Answer · 2016-04-13

5

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 2 hours ago

WEHI, Melbourne, Australia

limma will handle the DE analysis quite well whether you filter or not. The hgu133A arrays typically don't need a lot of filtering because they only contain relatively well annotated genes, which tend to be expressed in most samples. You could make limma even less dependent on filtering by turning trend=TRUE on for the eBayes step.

I don't understand why you are using control AFFX probes to choose the p-value cutoff. This doesn't seem sensible. Why don't you just use FDR like everyone else? The FDR value is given in the adj.P.value column in the topTable output.

Generally speaking, you can improve the DE analysis by removing probes that are consistently at low intensities, or which have very low average intensities. This is just common sense -- probes that are never expressed at all can't be differentially expressed to any worthwhile degree, and are just ballast to the analysis.

If you want to get fancy, you can use Affymetrix presence/absence calls to decide whether probes are expressed or not. A much simpler way that is just as effective way is to examine the plot from

plotSA(data.fit.con.eb)

Is there a tail of very low variances for low-expressed genes (on the left of the plot)? If there is, then do a bit of filtering like this:

data.fit.con.eb <- eBayes(data.fit.con[data.fit.con$Amean>5,] trend=TRUE)

Increase the cutoff until the trend line is nearly monotonic.

Unfortunately the nsFilter() function doesn't seem to do this obvious and natural thing (filtering non-expressed probes). Actually I'm not quite sure what nsFilter() is doing, which makes it very hard to recommend -- it seems mainly intended for variance filtering, which is entirely incompatible with a limma analysis.

ADD COMMENT • link 8.7 years ago Gordon Smyth 52k

0

Entering edit mode

Ok, thanks for the explanation. So basically, just by choosing a FDR like the very common 0.05 is sufficient, but it may be best to check for very low variance genes and maybe filter them, then redo the eBayes.

About the AFFX probes, I saw this example and thought it was logical, but if you say FDR is simpler and better, I'll go for it.

To answer about the Nsfilter, I understood from the limma userguide that it was incompatible due to the variance filtering, that's why I used the option var.filter=FALSE, but it is still useful to remove control probes and that that does'nt match an EntrezID.

Thanks for your answer !

ADD REPLY • link 8.7 years ago giroudpaul ▴ 40

0

Entering edit mode

it is still useful to remove probes that does'nt match an EntrezID

May be. But I don't see how it can be doing that. hgu133a.db is not an Entrez Gene Id orientated database.

ADD REPLY • link 8.7 years ago Gordon Smyth 52k

0

Entering edit mode

Just a question, at which point do you consider that the trend line is nearly monotonic ?

data.eb.trend <- eBayes(data.fit.con, trend=TRUE)
plotSA(data.eb.trend)

data.eb.A <- eBayes(data.fit.con[data.fit.con$Amean>5,], trend=TRUE)
plotSA(data.eb.A)

data.eb.A <- eBayes(data.fit.con[data.fit.con$Amean>8,], trend=TRUE)
plotSA(data.eb.A)

ADD REPLY • link 8.7 years ago giroudpaul ▴ 40

1

Entering edit mode

They're all ok, but I would probably go with the second (Amean > 5). You shouldn't need to filter more than about 25% of probe-sets for these arrays.

ADD REPLY • link 8.7 years ago Gordon Smyth 52k