Hi experts,
I have done a gene set test using limma's camera function but am a little confused by my results: in the NGenes column (which is also the length of my index)
, I got a number that is larger than the number of probes in my gene set. How can my probes match a larger number of indices than exist for the gene set?
Here's what I did:
#expressedProbes.lumi is a non-specific filtered lumibatch from my experiment. I have already read in the results with lumi and done a DE analysis with limma. I will call this the "biopsy experiment." #CELLStt3 are the DE probes from a certain timepoint ofa different, similar experiment done in our lab. I will call this the "cells experiment". I have already read in the results with lumi and done a DE analysis with limma for this experiment.
I am using these probes as my gene set because I want to see what kind of overlap there is between the biopsy and cell experiments.
y <- exprs(expressedProbes.lumi)
# exract the entrez IDS from the the cells experiment gene set and put it into the proper list format.
gene.set <- list( as.character( CELLStt3$ENTREZ_GENE_ID ))
# length(gene.set [[1]] ) is 11076 and length (unique(gene.set [[1]] ) is 8873
# preparing the identifiers from biopsy experiment by extracting the entrez IDS from the lumibatch.
identifiers <- as.character( fData( expressedProbes.lumi ) [,"ENTREZ_GENE_ID"] )
#length (identifiers) is 27126 and length(unique(identifiers)) is 18936
#make index with CELLStt3 probes andbiopsy experiment identifiers index<-ids2indices(gene.sets = gene.set, identifiers = identifiers)
#length( unique( index[[ 1 ]] )) is 13476 which > length(gene.set)
#define a matrix for the contrast of my biopsy experiment that I want the results for.
contr <- makeContrasts(TreatV186.24-TreatMock.24, levels = design) #design is the design matrix that I used for DE analysis of my biopsy experiment.
#put arguments into camera
cameraResult<-camera( y = y, index = index, design = design, contrast = contr )
pander ( cameraResult )
-------------------------------------------------------------- NGenes Correlation Direction PValue -------- ------------- ----------- --------- ----------------- 13476 0.001321 Up 0.0001454 ------------------------------------------------------------- > sessionInfo() R version 3.1.2 (2014-10-31) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] parallel stats graphics grDevices utils datasets [7] methods base other attached packages: [1] pander_0.5.2 lumi_2.18.0 Biobase_2.26.0 [4] BiocGenerics_0.12.1 limma_3.22.7 loaded via a namespace (and not attached): [1] affy_1.44.0 affyio_1.34.0 [3] annotate_1.44.0 AnnotationDbi_1.28.2 [5] base64_1.1 base64enc_0.1-3 [7] BatchJobs_1.6 BBmisc_1.9 [9] beanplot_1.2 BiocInstaller_1.16.5 [11] BiocParallel_1.0.3 biomaRt_2.22.0 [13] Biostrings_2.34.1 bitops_1.0-6 [15] brew_1.0-6 bumphunter_1.6.0 [17] checkmate_1.6.2 codetools_0.2-14 [19] colorspace_1.2-6 DBI_0.3.1 [21] digest_0.6.8 doRNG_1.6 [23] fail_1.2 foreach_1.4.2 [25] genefilter_1.48.1 GenomeInfoDb_1.2.5 [27] GenomicAlignments_1.2.2 GenomicFeatures_1.18.7 [29] GenomicRanges_1.18.4 grid_3.1.2 [31] illuminaio_0.8.0 IRanges_2.0.1 [33] iterators_1.0.7 KernSmooth_2.23-15 [35] knitr_1.11 lattice_0.20-29 [37] locfit_1.5-9.1 magrittr_1.5 [39] MASS_7.3-44 Matrix_1.2-2 [41] matrixStats_0.14.2 mclust_5.0.2 [43] methylumi_2.12.0 mgcv_1.8-7 [45] minfi_1.12.0 multtest_2.22.0 [47] nleqslv_2.8 nlme_3.1-124 [49] nor1mix_1.2-1 pkgmaker_0.22 [51] plyr_1.8.3 preprocessCore_1.28.0 [53] quadprog_1.5-5 RColorBrewer_1.1-2 [55] Rcpp_0.12.0 RCurl_1.95-4.7 [57] registry_0.3 reshape_0.8.5 [59] rngtools_1.2.4 Rsamtools_1.18.3 [61] RSQLite_1.0.0 rtracklayer_1.26.3 [63] S4Vectors_0.4.0 sendmailR_1.2-1 [65] siggenes_1.40.0 splines_3.1.2 [67] stats4_3.1.2 stringi_0.5-5 [69] stringr_1.0.0 survival_2.38-3 [71] tools_3.1.2 XML_3.98-1.3 [73] xtable_1.7-4 XVector_0.6.0 [75] zlibbioc_1.12.0
Thanks x 10^6 !
Claire Levy
University of Washington/Fred Hutchinson Cancer Research Center
Seattle, WA, USA
Personally I like to leave all the probe-sets in the data even if more than one probe-set corresponds to the same gene. Using multiple probes does help with statistical power if the gene happens to be DE.
Thanks for the speedy response!
These data are from lllumina arrays and it turns out that in a topTable or fData, while the PROBE_ID column (ILMN_xxxx) is unique, the ENTREZ_GENE_ID column is not, because a single ENTREZ_GENE_ID can represent multiple PROBE_IDs. I repeated the analysis but limited the identifiers and gene set to unique ENTREZ_GENE_IDs (so now 18,926 unique identifiers and 8,873 in the gene set) and got the following result:
---------------------------------------------
NGenes Correlation Direction PValue
-------- ------------- ----------- --------
7916 0.003774 Down 0.5059
-------------------------------------------
Ok, now
NGenes
is <length (unique(gene.set [[1]]))
Then I was curious about using the PROBE_ID instead of ENTREZ_ID, since it is already unique, so I did the following:
and got this result:
Still a sane value for NGenes, but now my PValue is totally different and I've got the opposite direction. Both tests used a unique list for the gene.set and a unique vector of identifiers. Based on what Gordon said, I'm inclined to use the PROBE_ID version. I like idea of using all the data instead of filtering based on the entrez ids, but I don't think I could explain why I get such different results.
Side note: Isn't NGenes kind of a misnomer since there are usually multiple probes associated with the same gene, so the overlap is really over probes? I guess this depends on what your definition of "gene" is, but I've noticed that this takes some clarification sometimes, especially some people find it easier to look at (and look up, and think about) lists of gene symbols, rather than Probe ids.
Thanks, it is really helpful to get feedback on this stuff while I'm learning.
Claire
You will probably get the same result if you use the Entrez Gene IDs, but keep the duplicates. And why you get a sign flip probably has to do with a combination of a.) how you excluded the duplicate Entrez Gene IDs, and b.) the predominate direction of differential expression in the duplicates.
The former probably has more to do with the change in p-value (e.g., if you excluded the duplicates at random, or by doing something like
mystuff[!duplicated(mystuff$theentrezgeneids),]
, which isn't random, but is an uncontrolled selection process). If removing dups ended up excluding probes with larger differences between groups, then you will lose power. If however you went through all the genes and just kept the dups with the largest absolute t-statistic, then maybe the sign and p-values might more closely agree. Or at least the p-value. Maybe.The latter probably has more to do with the direction of the gene set. If most of the duplicates are down-regulated genes, then you are removing evidence for down-regulation.
And this is where I part ways with Gordon on the usefulness of including duplicates. While it undeniably increases power to include the duplicates, if you have a set of DE genes with a bunch of duplicates, this can have as much to do with the design of the array as it has with the underlying biology you are trying to explore.
As an example, say you have 20 genes, and 10 have duplicates, and the 10 genes with duplicates are almost all down-regulated. When you do the gene set test, you will say 'Hey, I have 30 things I measured that are, like, genes or something, and 20 of them are down-regulated, so I think my set of genes is perturbed in a down-regulated fashion'. But this directionality and maybe even the strength thereof is a direct artifact of the manufacturer duplicating those 10 down-regulated genes, and not the 10 up-regulated ones! But then I primarily work with Affy arrays, which have varied numbers of duplication. Maybe Illumina arrays have an equal number of probes for each gene, in which case I suppose it all comes out in the wash.
It makes sense that the results would be different because of non-random removal of ( or inclusion of on the part of the manufacturer) duplicates. I don't know the details about how this is done for Illumina though. I think I'm going to stick with vectors that start out unique I don't have to worry (so much) about this. Thanks!