Hi, I am running clusterprofiler using my differentially expressed genes from DESeq2. I am working with Cannabis sativa. I created the db using annotation hub. Then I changed the symbols from my gene matrix to entrez id. I did that for both my genes of interest and my gene universe. And then I attempted to run enrichGO, but for some reason I get the error message below, and I have no idea why. Please help
library(AnnotationDbi)
library(clusterProfiler)
library(ReactomePA)
ah <- AnnotationHub()
AnnotationHub::query(ah, c("Cannabis", "Sativa"))
Csativa <- ah[["AH101262"]]
columns(Csativa)
keytypes(Csativa)
select(Csativa, head(keys(Csativa)), c("SYMBOL", "GENENAME", "GO"))
annotated_significant_res_aga <- significant_res_aga
annotated_significant_res_aga$symbol <- mapIds(
Csativa,
keys = rownames(annotated_significant_res_aga),
keytype = "ALIAS",
column = "SYMBOL",
multiVals = "first"
)
annotated_significant_res_aga$entrez <- mapIds(
Csativa,
keys = rownames(annotated_significant_res_aga),
keytype = "ALIAS",
column = "ENTREZID",
multiVals = "first"
)
annotated_res_aga <- res_aga
annotated_res_aga$symbol <- mapIds(
Csativa,
keys = rownames(annotated_res_aga),
keytype = "ALIAS",
column = "SYMBOL",
multiVals = "first"
)
annotated_res_aga$entrez <- mapIds(
Csativa,
keys = rownames(annotated_res_aga),
keytype = "ALIAS",
column = "ENTREZID",
multiVals = "first"
)
ego_bp_aga <- enrichGO(gene = as.character(unique(annotated_significant_res_aga$entrez)),
universe = as.character(unique(annotated_res_aga$entrez)),
OrgDb = Csativa,
keyType = "ENTREZID",
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05,
readable = TRUE)
--> No gene can be mapped....
--> Expected input gene ID: 23630686,27215454,27215500,24573811,27215495,27215452
--> return NULL...
sessionInfo( )
```R version 4.2.0 (2022-04-22 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19043)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] GOstats_2.62.0 graph_1.74.0 Category_2.62.0 Matrix_1.4-1
[5] GO.db_3.15.0 clusterProfiler_4.4.1 AnnotationDbi_1.58.0 RColorBrewer_1.1-3
[9] pheatmap_1.0.12 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.9
[13] purrr_0.3.4 readr_2.1.2 tidyr_1.2.0 tibble_3.1.7
[17] ggplot2_3.3.6 tidyverse_1.3.1 DESeq2_1.36.0 SummarizedExperiment_1.26.1
[21] Biobase_2.56.0 MatrixGenerics_1.8.0 matrixStats_0.62.0 GenomicRanges_1.48.0
[25] GenomeInfoDb_1.32.1 IRanges_2.30.0 S4Vectors_0.34.0 AnnotationHub_3.4.0
[29] BiocFileCache_2.4.0 dbplyr_2.1.1 BiocGenerics_0.42.0 BiocManager_1.30.17
loaded via a namespace (and not attached):
[1] shadowtext_0.1.2 readxl_1.4.0 backports_1.4.1
[4] fastmatch_1.1-3 plyr_1.8.7 igraph_1.3.1
[7] lazyeval_0.2.2 GSEABase_1.58.0 splines_4.2.0
[10] BiocParallel_1.30.0 digest_0.6.29 yulab.utils_0.0.4
[13] htmltools_0.5.2 GOSemSim_2.22.0 viridis_0.6.2
[16] fansi_1.0.3 magrittr_2.0.3 memoise_2.0.1
[19] tzdb_0.3.0 Biostrings_2.64.0 annotate_1.74.0
[22] graphlayouts_0.8.0 modelr_0.1.8 vroom_1.5.7
[25] enrichplot_1.16.0 colorspace_2.0-3 blob_1.2.3
[28] rvest_1.0.2 rappdirs_0.3.3 ggrepel_0.9.1
[31] haven_2.5.0 crayon_1.5.1 RCurl_1.98-1.6
[34] jsonlite_1.8.0 scatterpie_0.1.7 genefilter_1.78.0
[37] ape_5.6-2 survival_3.3-1 glue_1.6.2
[40] polyclip_1.10-0 gtable_0.3.0 zlibbioc_1.42.0
[43] XVector_0.36.0 DelayedArray_0.22.0 Rgraphviz_2.40.0
[46] scales_1.2.0 DOSE_3.22.0 DBI_1.1.2
[49] Rcpp_1.0.8.3 viridisLite_0.4.0 xtable_1.8-4
[52] tidytree_0.3.9 gridGraphics_0.5-1 bit_4.0.4
[55] AnnotationForge_1.38.0 httr_1.4.3 fgsea_1.22.0
[58] ellipsis_0.3.2 pkgconfig_2.0.3 XML_3.99-0.9
[61] farver_2.1.0 locfit_1.5-9.5 utf8_1.2.2
[64] ggplotify_0.1.0 tidyselect_1.1.2 labeling_0.4.2
[67] rlang_1.0.2 reshape2_1.4.4 later_1.3.0
[70] munsell_0.5.0 BiocVersion_3.15.2 cellranger_1.1.0
[73] tools_4.2.0 cachem_1.0.6 downloader_0.4
[76] cli_3.3.0 generics_0.1.2 RSQLite_2.2.14
[79] broom_0.8.0 fastmap_1.1.0 yaml_2.3.5
[82] ggtree_3.4.0 bit64_4.0.5 fs_1.5.2
[85] tidygraph_1.2.1 KEGGREST_1.36.0 ggraph_2.0.5
[88] RBGL_1.72.0 nlme_3.1-157 mime_0.12
[91] aplot_0.1.4 DO.db_2.9 xml2_1.3.3
[94] compiler_4.2.0 rstudioapi_0.13 filelock_1.0.2
[97] curl_4.3.2 png_0.1-7 interactiveDisplayBase_1.34.0
[100] treeio_1.20.0 reprex_2.0.1 tweenr_1.0.2
[103] geneplotter_1.74.0 stringi_1.7.6 lattice_0.20-45
[106] vctrs_0.4.1 pillar_1.7.0 lifecycle_1.0.1
[109] data.table_1.14.2 bitops_1.0-7 patchwork_1.1.1
[112] httpuv_1.6.5 qvalue_2.28.0 R6_2.5.1
[115] promises_1.2.0.1 gridExtra_2.3 MASS_7.3-57
[118] assertthat_0.2.1 withr_2.5.0 GenomeInfoDbData_1.2.8
[121] parallel_4.2.0 hms_1.1.1 grid_4.2.0
[124] ggfun_0.0.6 ggforce_0.3.3 shiny_1.7.1
[127] lubridate_1.8.0 ```
Hi, thanks so much for your reply. This is the output of the commands you asked about:
[1] "115694767" "115694769" "115694771" "115694784" "115694787" "115694843"
and
[1] NA "115694672" "115694673" "115694674" "115694675" "115694676"
I guess that means the ids are not matching? I don't understand why this could be, since I took the gtf file from nbci and the C. sativa reference is from ncbi?
No, basically what is meant is that your input ids should match with those in the
OrgDb
. Also your significant genes should be part of all annotated genes. That seems not to be the case now.Also, I noticed that you have one or more NA in your gene vectors. Try to remove these before running
enrichGO()
.Lastly, does my example code work for you?
I don't understand why they don't match. I got the gtf and fasta from the ncbi website, the cs10 genome assembly, which is the same as that of annotationhub. I got my significant genes from running DESeq2, so they are definitely a subset of all the genes tested.
How can I find out the annotation file used by annotation hub? Like exactly what gtf/gff file and from where?
AFAIK the annotation file present at the
AnnotationHub
is not generated from a GTF/GFF file, but from data directly downloaded from the NCBI FTP server. If you look at my first post above, you will notice:Thus: the annotation data is derived from ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/. If you lookup all info NCBI has on the organism with taxonomy id 3483, you will see that the number of genes currently annotated by NCBI is 31,443. Note that this equals the number of genes in
background.genes
:so these numbers match...
To reverse the question: which GFT file did you use? Where did you get it from?
I used the gtf file from ncbi
Here: https://www.ncbi.nlm.nih.gov/assembly/GCF_900626175.2/
Everything was taken from there