Question

Confused by results from CAMERA gene set testing

0

Entering edit mode

claire.n.levy ▴ 10

@clairenlevy-8745

Last seen 4.9 years ago

United States

Hi experts,

I have done a gene set test using limma's camera function but am a little confused by my results: in the NGenes column (which is also the length of my index), I got a number that is larger than the number of probes in my gene set. How can my probes match a larger number of indices than exist for the gene set?

Here's what I did:

#expressedProbes.lumi is a non-specific filtered lumibatch from my experiment. I have already read in the results with lumi and done a DE analysis with limma. I will call this the "biopsy experiment."

#CELLStt3 are the DE probes from a certain timepoint of a different, similar experiment done in our lab. I will call this the "cells experiment". I have already read in the results with lumi and done a DE analysis with limma for this experiment. I am using these probes as my gene set because I want to see what kind of overlap there is between the biopsy and cell experiments. 

y <- exprs(expressedProbes.lumi) 

# exract the entrez IDS from the the  cells experiment gene set and put it into the proper list format.

gene.set <- list( as.character( CELLStt3$ENTREZ_GENE_ID )) 

# length(gene.set [[1]] ) is 11076 and length (unique(gene.set [[1]] ) is 8873

# preparing the identifiers from  biopsy experiment by extracting the entrez IDS from the lumibatch.

identifiers <- as.character( fData( expressedProbes.lumi ) [,"ENTREZ_GENE_ID"] )

#length (identifiers) is 27126 and length(unique(identifiers)) is 18936

#make index with CELLStt3 probes andbiopsy experiment identifiers

index<-ids2indices(gene.sets = gene.set, identifiers = identifiers)

#length( unique( index[[ 1 ]] )) is 13476 which > length(gene.set)

#define a matrix for the contrast of my biopsy experiment that I want the results for.

contr <- makeContrasts(TreatV186.24-TreatMock.24, levels = design) #design is the design matrix that I used for DE analysis of my biopsy experiment.

#put arguments into camera

cameraResult<-camera( y = y, index = index, design = design, contrast = contr )

pander ( cameraResult )

--------------------------------------------------------------
 NGenes   Correlation   Direction   PValue  
-------- ------------- ----------- --------- -----------------
 13476     0.001321        Up      0.0001454
-------------------------------------------------------------

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets 
[7] methods   base     

other attached packages:
[1] pander_0.5.2        lumi_2.18.0         Biobase_2.26.0     
[4] BiocGenerics_0.12.1 limma_3.22.7       

loaded via a namespace (and not attached):
 [1] affy_1.44.0             affyio_1.34.0          
 [3] annotate_1.44.0         AnnotationDbi_1.28.2   
 [5] base64_1.1              base64enc_0.1-3        
 [7] BatchJobs_1.6           BBmisc_1.9             
 [9] beanplot_1.2            BiocInstaller_1.16.5   
[11] BiocParallel_1.0.3      biomaRt_2.22.0         
[13] Biostrings_2.34.1       bitops_1.0-6           
[15] brew_1.0-6              bumphunter_1.6.0       
[17] checkmate_1.6.2         codetools_0.2-14       
[19] colorspace_1.2-6        DBI_0.3.1              
[21] digest_0.6.8            doRNG_1.6              
[23] fail_1.2                foreach_1.4.2          
[25] genefilter_1.48.1       GenomeInfoDb_1.2.5     
[27] GenomicAlignments_1.2.2 GenomicFeatures_1.18.7 
[29] GenomicRanges_1.18.4    grid_3.1.2             
[31] illuminaio_0.8.0        IRanges_2.0.1          
[33] iterators_1.0.7         KernSmooth_2.23-15     
[35] knitr_1.11              lattice_0.20-29        
[37] locfit_1.5-9.1          magrittr_1.5           
[39] MASS_7.3-44             Matrix_1.2-2           
[41] matrixStats_0.14.2      mclust_5.0.2           
[43] methylumi_2.12.0        mgcv_1.8-7             
[45] minfi_1.12.0            multtest_2.22.0        
[47] nleqslv_2.8             nlme_3.1-124           
[49] nor1mix_1.2-1           pkgmaker_0.22          
[51] plyr_1.8.3              preprocessCore_1.28.0  
[53] quadprog_1.5-5          RColorBrewer_1.1-2     
[55] Rcpp_0.12.0             RCurl_1.95-4.7         
[57] registry_0.3            reshape_0.8.5          
[59] rngtools_1.2.4          Rsamtools_1.18.3       
[61] RSQLite_1.0.0           rtracklayer_1.26.3     
[63] S4Vectors_0.4.0         sendmailR_1.2-1        
[65] siggenes_1.40.0         splines_3.1.2          
[67] stats4_3.1.2            stringi_0.5-5          
[69] stringr_1.0.0           survival_2.38-3        
[71] tools_3.1.2             XML_3.98-1.3           
[73] xtable_1.7-4            XVector_0.6.0          
[75] zlibbioc_1.12.0

Thanks x 10^6 !

Claire Levy

University of Washington/Fred Hutchinson Cancer Research Center

Seattle, WA, USA

limma camera gene set testing • 2.4k views

ADD COMMENT • link 9.0 years ago claire.n.levy ▴ 10

score 1 · Answer 1 · 2016-04-21

1

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 2 days ago

United States

The ids2indices function is intended to take a list of gene sets and then tell you which of the identifiers in your experiment are in a given set. If you have duplicates in your set of identifiers, then you will end up with more genes than what are in your gene set. As a simple example:

> gene.lst <- list(letters[1:5])
> identifiers <- letters[c(1,1,1,1,2,3,4,5,5,5,6,7)]
> gene.lst
[[1]]
[1] "a" "b" "c" "d" "e"

> identifiers
 [1] "a" "a" "a" "a" "b" "c" "d" "e" "e" "e" "f" "g"
> ids2indices(gene.lst, identifiers)
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

> identifiers[ids2indices(gene.lst, identifiers)[[1]]]
 [1] "a" "a" "a" "a" "b" "c" "d" "e" "e" "e"

Does that make sense? I guess an alternative way to state this is that you shouldn't have duplicated genes in your set of differentially expressed genes. Having evidence for differential expression of a given gene multiple times doesn't add any more information.

ADD COMMENT • link 9.0 years ago James W. MacDonald 68k

0

Entering edit mode

Personally I like to leave all the probe-sets in the data even if more than one probe-set corresponds to the same gene. Using multiple probes does help with statistical power if the gene happens to be DE.

ADD REPLY • link 9.0 years ago Gordon Smyth 52k

0

Entering edit mode

Thanks for the speedy response!

These data are from lllumina arrays and it turns out that in a topTable or fData, while the PROBE_ID column (ILMN_xxxx) is unique, the ENTREZ_GENE_ID column is not, because a single ENTREZ_GENE_ID can represent multiple PROBE_IDs. I repeated the analysis but limited the identifiers and gene set to unique ENTREZ_GENE_IDs (so now 18,926 unique identifiers and 8,873 in the gene set) and got the following result:

--------------------------------------------- NGenes Correlation Direction PValue -------- ------------- ----------- -------- 7916 0.003774 Down 0.5059 -------------------------------------------

Ok, now NGenes is < length (unique(gene.set [[1]]))

Then I was curious about using the PROBE_ID instead of ENTREZ_ID, since it is already unique, so I did the following:

gene.set<-list(as.character(CELLStt3$PROBE_ID)) # length is 11,076

identifiers = as.character(fData(expressedProbes.lumi)[,"PROBE_ID"]) #length is 27,126

and got this result:

--------------------------------------------
     NGenes   Correlation   Direction   PValue  
    -------- ------------- ----------- ---------
      9434     0.004606        Up      0.0001851
    --------------------------------------------

Still a sane value for NGenes, but now my PValue is totally different and I've got the opposite direction. Both tests used a unique list for the gene.set and a unique vector of identifiers. Based on what Gordon said, I'm inclined to use the PROBE_ID version. I like idea of using all the data instead of filtering based on the entrez ids, but I don't think I could explain why I get such different results.

Side note: Isn't NGenes kind of a misnomer since there are usually multiple probes associated with the same gene, so the overlap is really over probes? I guess this depends on what your definition of "gene" is, but I've noticed that this takes some clarification sometimes, especially some people find it easier to look at (and look up, and think about) lists of gene symbols, rather than Probe ids.

Thanks, it is really helpful to get feedback on this stuff while I'm learning.

Claire

ADD REPLY • link 9.0 years ago claire.n.levy ▴ 10

0

Entering edit mode

You will probably get the same result if you use the Entrez Gene IDs, but keep the duplicates. And why you get a sign flip probably has to do with a combination of a.) how you excluded the duplicate Entrez Gene IDs, and b.) the predominate direction of differential expression in the duplicates.

The former probably has more to do with the change in p-value (e.g., if you excluded the duplicates at random, or by doing something like mystuff[!duplicated(mystuff$theentrezgeneids),], which isn't random, but is an uncontrolled selection process). If removing dups ended up excluding probes with larger differences between groups, then you will lose power. If however you went through all the genes and just kept the dups with the largest absolute t-statistic, then maybe the sign and p-values might more closely agree. Or at least the p-value. Maybe.

The latter probably has more to do with the direction of the gene set. If most of the duplicates are down-regulated genes, then you are removing evidence for down-regulation.

And this is where I part ways with Gordon on the usefulness of including duplicates. While it undeniably increases power to include the duplicates, if you have a set of DE genes with a bunch of duplicates, this can have as much to do with the design of the array as it has with the underlying biology you are trying to explore.

As an example, say you have 20 genes, and 10 have duplicates, and the 10 genes with duplicates are almost all down-regulated. When you do the gene set test, you will say 'Hey, I have 30 things I measured that are, like, genes or something, and 20 of them are down-regulated, so I think my set of genes is perturbed in a down-regulated fashion'. But this directionality and maybe even the strength thereof is a direct artifact of the manufacturer duplicating those 10 down-regulated genes, and not the 10 up-regulated ones! But then I primarily work with Affy arrays, which have varied numbers of duplication. Maybe Illumina arrays have an equal number of probes for each gene, in which case I suppose it all comes out in the wash.

ADD REPLY • link 9.0 years ago James W. MacDonald 68k

0

Entering edit mode

It makes sense that the results would be different because of non-random removal of ( or inclusion of on the part of the manufacturer) duplicates. I don't know the details about how this is done for Illumina though. I think I'm going to stick with vectors that start out unique I don't have to worry (so much) about this. Thanks!

ADD REPLY • link 9.0 years ago claire.n.levy ▴ 10