Hi,
So I have this dataset http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5099
and I simply want to extract the significant DE genes between some conditions.
I am new to this, but so far, following the advises to do the hgu133A chips first, I am there :
celpath = "C:/Users/Paul/Documents/Microarray/GSE5099/CEL/U133A/" fns = list.celfiles(path=celpath,full.names=TRUE) cat("Reading files:\n",paste(fns,collapse="\n"),"\n") data = ReadAffy(celfile.path=celpath) ph = dataA@phenoData ph@data ph@data[ ,1] = c("M0_1","M0_2","M0_3","Md_1","Md_2","Md_3","Mph_1","Mph_2","Mph_3","M1_1","M1_2","M1_3","M2_1","M2_2","M2_3") ph@data[ ,2] = c("M0","M0","M0","Md","Md","Md","Mph","Mph","Mph","M1","M1","M1","M2","M2","M2") colnames(ph@data)[2]="source" ph@data data.rma = rma(data) ###Annotate ID <- featureNames(data.rma) Symbol <- getSYMBOL(ID, "hgu133a.db") fData(data.rma) <- data.frame(Symbol=Symbol) #create a design matrix based on the sample annotation groups = ph@data$source f = factor(groups,levels = c("M0","Md","Mph","M1","M2")) design = model.matrix(~0 + f) colnames(design)=c("M0","Md","Mph","M1","M2") data.fit = lmFit(data.rma,design) ##make pair-wise comparisons between groups contrast.matrix = makeContrasts(Md-M0, Mph-M0, M1-M0, M2-M0, M1-Mph, M2-Mph, M2-M1, levels=design) data.fit.con = contrasts.fit(data.fit,contrast.matrix) data.fit.con.eb = eBayes(data.fit.con) topTable(data.fit.con.eb, n=10, adjust="BH")
(I also did some Quality plots that I didn't included here since everything seems fine)
Then, I read a lot of posts and tutorial that suggest to filter probes. Since I just want to extract a list of significantly DE genes for each of my contrast. For now, I used this filter, that filter about 2500 lines form 22283 in data.rma
##Filtering: data.filt <- nsFilter(data.rma, require.entrez=TRUE, require.GOBP=FALSE, require.GOCC=FALSE, require.GOMF=FALSE, require.CytoBand=FALSE, remove.dupEntrez=FALSE, var.func=IQR, var.cutoff=0.5, var.filter=FALSE, filterByQuantile=TRUE, feature.exclude="^AFFX")$eset dim(data.rma) dim(data.filt)
However, it removes the AFFX probes, which I could use to determine the p.value cut off using this
results = decideTests(data.fit.con.eb,method='global',adjust.method="BH",p.value=0.05,lfc=1) i <- grep("AFFX",featureNames(data.rma)) summary(data.fit.con.eb$F.p.value[i]) results <- classifyTestsF(data.fit.con.eb, p.value=0.00000018) summary(results)
I read that the lesser the probes for the eBayes, the better it is, but I may not have understood everything so well.
My questions are :
- Is everything I done correct ?
- Should I filter first and then do the eBayes ? It change little, but still, the F.p.value with filtering is a little bit higher.
- If I remove the AFFX probes, how do I choose my F.p.value cut off ?
Thanks for your help
> sessionInfo() R version 3.2.4 Revised (2016-03-16 r70336) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200) locale: [1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 LC_MONETARY=French_France.1252 [4] LC_NUMERIC=C LC_TIME=French_France.1252 attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] hgu133bcdf_2.18.0 hgu133bprobe_2.18.0 hgu133b.db_3.2.2 hgu133aprobe_2.18.0 [5] hgu133acdf_2.18.0 hgu133a.db_3.2.2 org.Hs.eg.db_3.2.3 KEGG.db_3.2.2 [9] GSEABase_1.32.0 annotate_1.48.0 XML_3.98-1.4 GOstats_2.36.0 [13] graph_1.48.0 Category_2.36.0 GO.db_3.2.2 RSQLite_1.0.0 [17] DBI_0.3.1 AnnotationDbi_1.32.3 IRanges_2.4.8 S4Vectors_0.8.11 [21] affydata_1.18.0 simpleaffy_2.46.0 genefilter_1.52.1 affyPLM_1.46.0 [25] preprocessCore_1.32.0 gcrma_2.42.0 affy_1.48.0 BiocInstaller_1.20.1 [29] Biobase_2.30.0 BiocGenerics_0.16.1 ggplot2_2.1.0 rpart_4.1-10 [33] Matrix_1.2-4 lattice_0.20-33 limma_3.26.9 loaded via a namespace (and not attached): [1] Rcpp_0.12.4 plyr_1.8.3 XVector_0.10.0 tools_3.2.4 [5] zlibbioc_1.16.0 gtable_0.2.0 Biostrings_2.38.4 grid_3.2.4 [9] RBGL_1.46.0 survival_2.38-3 scales_0.4.0 splines_3.2.4 [13] AnnotationForge_1.12.2 colorspace_1.2-6 xtable_1.8-2 munsell_0.4.3 [17] affyio_1.40.0
Ok, thanks for the explanation. So basically, just by choosing a FDR like the very common 0.05 is sufficient, but it may be best to check for very low variance genes and maybe filter them, then redo the eBayes.
About the AFFX probes, I saw this example and thought it was logical, but if you say FDR is simpler and better, I'll go for it.
To answer about the Nsfilter, I understood from the limma userguide that it was incompatible due to the variance filtering, that's why I used the option
var.filter=FALSE,
but it is still useful to remove control probes and that that does'nt match an EntrezID.Thanks for your answer !
it is still useful to remove probes that does'nt match an EntrezID
May be. But I don't see how it can be doing that. hgu133a.db is not an Entrez Gene Id orientated database.
Just a question, at which point do you consider that the trend line is nearly monotonic ?
They're all ok, but I would probably go with the second (Amean > 5). You shouldn't need to filter more than about 25% of probe-sets for these arrays.