Running time of piano's runGSA
rubi ▴ 110
@rubi-6462
Last seen 6.3 years ago

Hi,

I'm running piano's runGSA on a list of 9881 genes (with directional fold-changes) and 13244 GO BP gene sets, and it takes ~30 min to complete. I'm using the default geneSetStat option, and all other arguments are at their default values:

Final gene/gene-set association: 9881 genes and 13244 gene-sets
  Details:
  Calculating gene set statistics from 9881 out of 9881 gene-level statistics
  Using all 9881 gene-level statistics for significance estimation
  Removed 0 genes from GSC due to lack of matching gene statistics
  Removed 0 gene sets containing no genes after gene removal
  Removed additionally 0 gene sets not matching the size limits
  Loaded additional information for 0 gene sets

Gene statistic type: F-like
Method: mean
Gene-set statistic name: mean 
Significance: Gene sampling
Adjustment: fdr
Gene set size limit: (1,Inf)
Permutations: 10000 
Total run time: 29.75 min

In contrast, if I upload this gene list to the GORILLA GO enrichment analysis website at http://cbl-gorilla.cs.technion.ac.il/ it takes a couple of seconds, and the p-values it reports are not of a smaller order of magnitude.

Also, I'm not sure why all pDistinctDirUp and pDistinctDirDown values are NA.

> sessionInfo()

R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.12.1 (Sierra)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] snpEnrichment_1.7.0  BiocInstaller_1.24.0 dplyr_0.5.0          piano_1.14.5         Gviz_1.18.1          GenomicRanges_1.26.1 GenomeInfoDb_1.10.2  IRanges_2.8.1       
 [9] S4Vectors_0.12.0     BiocGenerics_0.20.0 

loaded via a namespace (and not attached):
 [1] bitops_1.0-6                  matrixStats_0.51.0            RColorBrewer_1.1-2            httr_1.2.1                    data.tree_0.6.2               tools_3.3.1                  
 [7] R6_2.2.0                      rpart_4.1-10                  KernSmooth_2.23-15            Hmisc_4.0-2                   DBI_0.5-1                     lazyeval_0.2.0               
[13] colorspace_1.2-7              nnet_7.3-12                   gridExtra_2.2.1               chron_2.3-47                  Biobase_2.34.0                htmlTable_1.7                
[19] influenceR_0.1.0              slam_0.1-40                   rtracklayer_1.34.1            caTools_1.17.1                scales_0.4.1                  relations_0.6-6              
[25] stringr_1.1.0                 digest_0.6.10                 Rsamtools_1.26.1              foreign_0.8-67                XVector_0.14.0                base64enc_0.1-3              
[31] dichromat_2.0-0               htmltools_0.3.5               ensembldb_1.6.2               limma_3.30.2                  BSgenome_1.42.0               htmlwidgets_0.8              
[37] rstudioapi_0.6                RSQLite_1.0.0                 shiny_0.14.2                  visNetwork_1.0.3              jsonlite_1.1                  BiocParallel_1.8.1           
[43] gtools_3.5.0                  acepack_1.4.1                 rgexf_0.15.3                  VariantAnnotation_1.20.2      RCurl_1.95-4.8                magrittr_1.5                 
[49] Formula_1.2-1                 Matrix_1.2-7.1                Rcpp_0.12.7                   munsell_0.4.3                 viridis_0.3.4                 stringi_1.1.2                
[55] yaml_2.1.14                   SummarizedExperiment_1.4.0    zlibbioc_1.20.0               gplots_3.0.1                  plyr_1.8.4                    AnnotationHub_2.6.4          
[61] gdata_2.17.0                  snpStats_1.24.0               lattice_0.20-34               Biostrings_2.42.0             splines_3.3.1                 GenomicFeatures_1.26.2       
[67] knitr_1.15.1                  fgsea_1.0.1                   igraph_1.0.1                  marray_1.52.0                 biomaRt_2.30.0                fastmatch_1.0-4              
[73] XML_3.98-1.5                  biovizBase_1.22.0             latticeExtra_0.6-28           data.table_1.9.6              httpuv_1.3.3                  gtable_0.2.0                 
[79] assertthat_0.1                ggplot2_2.2.1                 mime_0.5                      xtable_1.8-2                  survival_2.40-1               tibble_1.2                   
[85] GenomicAlignments_1.10.0      AnnotationDbi_1.36.0          sets_1.0-16                   cluster_2.0.5                 Rook_1.1-1                    DiagrammeR_0.9.0             
[91] brew_1.0-6                    interactiveDisplayBase_1.12.0


Hi, could you clarify this part: "pMixedDirUp is anti-correlated with pMixedDirUp. I'm guessing the p-value is really 1-pMixedDirUp. This is not true for  pMixedDirDown. Is this a bug?"
Is there a typo in one of the pMixedDirUp? I guess you mean something else?


Could you also clarify what input you are using? The run output indicates that your gene-level statistics are in the range [0,Inf] (are they maybe ranks?) but you also mention directional fold-changes, so I am not sure...


Sorry about the lack of clarity. I dropped the part about the anti-correlation between statMixedDirUp and pMixedDirUp. My question is only about the run time, which I guess is not solvable.

Leif Väremo ▴ 70
@leif-varemo-5897
Last seen 5.2 years ago
Sweden

Unfortunately, piano is slow for datasets with a large number of genes and gene sets because of the permutation steps (GORILLA uses a different approach without permutations). You can speed it up by settling for fewer than the default 10,000 permutations (nPerm), or by using the ncpus argument to parallelize the computations. You could also try the fgsea method (geneSetStat="fgsea"), which is very fast. It should yield results similar or identical to those of the fgsea package, since piano imports functions from that package. Just to clarify, piano expects gene-level statistics that correspond to e.g. fold-changes, so that, if sorted, up-regulated genes appear at the top and down-regulated genes at the bottom.
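The three speed-up options above can be sketched roughly as follows. This is a minimal illustration on made-up toy data (500 random gene statistics, 20 arbitrary gene sets), not the original poster's input; the object names are hypothetical. Two assumptions to check against the documentation of your piano version: that ncpus > 1 needs the snowfall package and an nPerm that is divisible by ncpus, and that geneSetStat = "fgsea" needs signed (t-like) gene-level statistics.

```r
# Toy inputs, invented for illustration only.
set.seed(1)
genes <- paste0("gene", 1:500)
gene_stats <- setNames(rnorm(500), genes)   # signed, fold-change-like values
gsc_df <- data.frame(
  gene = sample(genes, 500),                       # first column: genes
  set  = rep(paste0("set", 1:20), each = 25),      # second column: gene sets
  stringsAsFactors = FALSE
)

if (requireNamespace("piano", quietly = TRUE)) {
  library(piano)
  gsc <- loadGSC(gsc_df)

  # Option 1: default "mean" statistic, but 1,000 instead of 10,000 permutations.
  res_fewer <- runGSA(gene_stats, gsc = gsc, geneSetStat = "mean",
                      nPerm = 1000, verbose = FALSE)

  # Option 2: split the permutations over 2 CPUs.
  # Assumption: this path relies on the snowfall package being installed.
  if (requireNamespace("snowfall", quietly = TRUE)) {
    res_par <- runGSA(gene_stats, gsc = gsc, geneSetStat = "mean",
                      nPerm = 1000, ncpus = 2, verbose = FALSE)
  }

  # Option 3: the fgsea method.
  res_fgsea <- runGSA(gene_stats, gsc = gsc, geneSetStat = "fgsea",
                      nPerm = 1000, verbose = FALSE)
}
```

Any of the result objects can then be inspected with GSAsummaryTable(). With fewer permutations the smallest attainable p-value becomes coarser, so it may be worth rerunning the interesting gene sets with a larger nPerm.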

If the gene-level statistics are "F-like" (as indicated in your case), i.e. ranging from 0 to Inf with a higher value meaning a "better" score, only the non-directional and mixed-directional p-values can be calculated. (Note that using ranks will not work, since the number 1 ranked gene will be interpreted as the least important.) "F-like" statistics do not carry any information about direction, and distinct-directional p-values require gene-level statistics that range from negative to positive values. This is why NAs are given for pDistinctDirUp and pDistinctDirDown. The mixed-directional scores can still be calculated if fold-changes are supplied in the 'directions' argument, because piano then subsets the genes into up- and down-regulated.
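To make the distinction between statistic types concrete, here is a small base-R sketch. The helper guess_stat_type() is hypothetical and only mimics the value ranges described above; it is not piano's own detection code. The four fold-changes are invented.

```r
# Hypothetical helper: classify a statistics vector by its value range.
guess_stat_type <- function(x) {
  x <- x[!is.na(x)]
  if (all(x >= 0 & x <= 1)) {
    "p-like"        # everything within [0, 1]
  } else if (any(x < 0)) {
    "t-like"        # signed values, carry direction
  } else {
    "F-like"        # non-negative, larger means more significant
  }
}

fold_changes <- c(geneA = 2.1, geneB = -1.4, geneC = 0.3, geneD = -3.0)

# Signed fold-changes carry direction.
guess_stat_type(fold_changes)          # "t-like"

# Taking absolute values (or any other non-negative score) discards direction.
magnitudes <- abs(fold_changes)
guess_stat_type(magnitudes)            # "F-like"

# Direction can be kept separately, e.g. for a 'directions' argument.
directions <- sign(fold_changes)

# Rank pitfall: rank 1 is the best gene but has the smallest value,
# so an "F-like" reading would treat it as the least important.
ranks <- rank(-abs(fold_changes))
guess_stat_type(ranks)                 # "F-like"
names(which.min(ranks))                # "geneD", the strongest gene

# Reversing the ranks gives the strongest gene the highest value.
reversed <- length(ranks) + 1 - ranks
names(which.max(reversed))             # "geneD"
```

If distinct-directional p-values are wanted, the simplest route suggested by the explanation above is to pass the signed values themselves as gene-level statistics rather than their magnitudes or ranks.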

Hope this helps...
