Hi,
I am analysing my RNA-Seq data with DESeq2. At the end I would like to convert the Ensembl IDs of the significantly differentially expressed genes to gene symbols, and I am using AnnotationDbi for this. However, I realized that not all of the Ensembl IDs are converted: 25072 out of 48607 are returned as NA, and more than 11000 of these IDs are actually significantly differentially expressed. To double check, I submitted the IDs that got "NA" as gene symbol to BioMart, and it converted them to gene symbols (as seen in the figure). So now I am confused: am I doing something wrong? Or is there an alternative way to extract all gene symbols?
Thanks in advance,
Gokce
library("AnnotationDbi")
library("org.Hs.eg.db")
res <- results(dds, alpha = 0.05, contrast = c("Type", "Disease", "Control"))
res$symbol <- mapIds(org.Hs.eg.db, keys = row.names(res), column = "SYMBOL",
                     keytype = "ENSEMBL", multiVals = "first")
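A quick way to see how many IDs stayed unmapped, and which ones, is sketched below. This is only an illustration: `summarise_unmapped` is a hypothetical helper, and the toy vector stands in for `res$symbol` (two of its IDs are placeholders, not real Ensembl IDs).

```r
# Hypothetical helper: report how many IDs have no symbol, and which ones.
# `symbols` is a named character vector (names = Ensembl IDs), which is the
# shape of the value that mapIds() returns.
summarise_unmapped <- function(symbols) {
  missing <- is.na(symbols)
  list(total        = length(symbols),
       n_unmapped   = sum(missing),
       unmapped_ids = names(symbols)[missing])
}

# Toy stand-in for res$symbol; the last two IDs are placeholders
symbols <- c(ENSG00000141510    = "TP53",
             ENSG_placeholder_1 = NA,
             ENSG_placeholder_2 = NA)

unmapped <- summarise_unmapped(symbols)
print(unmapped$n_unmapped)
print(unmapped$unmapped_ids)
```

In the analysis above the call would be `summarise_unmapped(res$symbol)`, and the resulting `unmapped_ids` can be pasted into BioMart for the cross-check.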
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS release 6.5 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=en_US.UTF-8
 [9] LC_ADDRESS=en_US.UTF-8     LC_TELEPHONE=en_US.UTF-8
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] edgeR_3.10.5              limma_3.24.15
 [3] amap_0.8-14               sva_3.14.0
 [5] mgcv_1.8-10               nlme_3.1-122
 [7] doParallel_1.0.10         iterators_1.0.8
 [9] foreach_1.4.3             reshape_0.8.5
[11] cluster_2.0.3             matrixStats_0.50.1
[13] flashClust_1.01-2         WGCNA_1.51
[15] fastcluster_1.1.16        dynamicTreeCut_1.62
[17] xlsx_0.5.7                xlsxjars_0.6.1
[19] rJava_0.9-7               pheatmap_1.0.8
[21] genefilter_1.50.0         gplots_2.17.0
[23] RColorBrewer_1.1-2        vsn_3.36.0
[25] org.Hs.eg.db_3.1.2        RSQLite_1.0.0
[27] DBI_0.3.1                 DESeq2_1.8.2
[29] RcppArmadillo_0.6.400.2.2 Rcpp_0.12.7
[31] BiocParallel_1.2.22       GenomicAlignments_1.4.2
[33] GenomicFeatures_1.20.6    AnnotationDbi_1.30.1
[35] Biobase_2.28.0            Rsamtools_1.20.5
[37] Biostrings_2.36.4         XVector_0.8.0
[39] GenomicRanges_1.20.8      GenomeInfoDb_1.4.3
[41] IRanges_2.2.9             S4Vectors_0.6.6
[43] BiocGenerics_0.14.0       Hmisc_3.17-1
[45] ggplot2_2.1.0             Formula_1.2-1
[47] survival_2.38-3           lattice_0.20-33
[49] BiocInstaller_1.20.3

loaded via a namespace (and not attached):
 [1] splines_3.2.0         gtools_3.5.0          affy_1.46.1
 [4] latticeExtra_0.6-26   impute_1.42.0         colorspace_1.2-6
 [7] preprocessCore_1.30.0 Matrix_1.2-3          plyr_1.8.4
[10] XML_3.98-1.3          biomaRt_2.24.1        zlibbioc_1.14.0
[13] xtable_1.8-0          GO.db_3.1.2           scales_0.4.0
[16] gdata_2.17.0          affyio_1.36.0         annotate_1.46.1
[19] nnet_7.3-11           foreign_0.8-66        tools_3.2.0
[22] munsell_0.4.3         locfit_1.5-9.1        lambda.r_1.1.7
[25] caTools_1.17.1        futile.logger_1.4.1   grid_3.2.0
[28] RCurl_1.95-4.7        bitops_1.0-6          gtable_0.2.0
[31] codetools_0.2-14      gridExtra_2.0.0       rtracklayer_1.28.10
[34] futile.options_1.0.0  KernSmooth_2.23-15    geneplotter_1.46.0
[37] rpart_4.1-10          acepack_1.3-3.3
Thanks a lot for the suggestion and clarification, Johannes. I will apply it to my analysis as soon as I solve my R version problem.
When I ran the code with org.Hs.eg.db, I expected to get all the corresponding HGNC symbols, but it did not return them, which surprised me. So when I run it using EnsDb.Hsapiens.v79, will it return gene names or gene symbols?
Best regards,
Gokce
EnsDb.Hsapiens.v79 will return the gene names (even if you specify "SYMBOL"). I decided to go for the gene name in all cases, as that is species-independent. Here is the working link to the ensembldb vignette:
http://www.bioconductor.org/packages/release/bioc/vignettes/ensembldb/inst/doc/ensembldb.html
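For reference, here is a minimal sketch of such a lookup. It assumes that ensembldb and EnsDb.Hsapiens.v79 are installed, which needs a more recent R/Bioconductor than the session shown above, and that the row names of `res` are unversioned Ensembl gene IDs. `genes()` with `return.type = "data.frame"` is the ensembldb accessor. `map_gene_names` and the toy table are illustrative only, so the matching step can be tried without the annotation package.

```r
# Hypothetical helper: look up gene names for Ensembl gene IDs in a
# two-column table (gene_id, gene_name). IDs without an entry give NA.
map_gene_names <- function(ids, lookup) {
  lookup$gene_name[match(ids, lookup$gene_id)]
}

# With ensembldb installed, the lookup table could be built like this
# (not run here; assumes EnsDb.Hsapiens.v79 is available):
#   library(EnsDb.Hsapiens.v79)
#   lookup <- genes(EnsDb.Hsapiens.v79,
#                   columns = c("gene_id", "gene_name"),
#                   return.type = "data.frame")
#   res$symbol <- map_gene_names(rownames(res), lookup)

# Toy stand-in table so the helper can be tried without the package
lookup <- data.frame(
  gene_id   = c("ENSG00000141510", "ENSG00000012048"),
  gene_name = c("TP53", "BRCA1"),
  stringsAsFactors = FALSE
)

# The last ID is a placeholder that is deliberately absent from the table
ids <- c("ENSG00000012048", "ENSG00000141510", "ENSG_not_in_table")
gene_names <- map_gene_names(ids, lookup)
print(gene_names)
```

Because the table comes from the Ensembl annotation itself, every gene ID in that Ensembl release has a row, so NAs should only remain for IDs that are not part of release 79.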