Hi everyone!
So, I have experienced a problem retrieving annotation for prob IDs to obtain the Symbol names from Affymetrix HTA2.0 using biomaRt.
The ID's from my data appears in this format: "TC05002592.hg.1", but in biomart, the ID's do not contain the ".1": "TC05002592.hg.1"
The data I am interested in has the GEO identifier: "GSE137574". [Platform: Platforms (1) GPL17586 [HTA-2_0] Affymetrix Human Transcriptome Array 2.0 [transcript (gene) version]]
I did a work around, but I am not sure if this is correct or if I missing something.
Do you have any suggestions?
Thank you very much in advance!
Here it is my code:
> ensembl <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")
> mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
> listAttributes(mart)[grep("affy", listAttributes(mart)[,1]),]
> G_list <- getBM(filters= "with_affy_hta_2_0", attributes= c("affy_hta_2_0", "ensembl_gene_id",
"hgnc_symbol"),values=TRUE,mart= mart)
> head(G_list)
affy_hta_2_0 ensembl_gene_id hgnc_symbol
1 TC0M000002.hg ENSG00000210049 MT-TF
2 TC07000959.hg ENSG00000210049 MT-TF
3 TC11001412.hg ENSG00000210049 MT-TF
4 TC0M000002.hg ENSG00000211459 MT-RNR1
5 TC07000959.hg ENSG00000211459 MT-RNR1
6 TC11001412.hg ENSG00000211459 MT-RNR1
> head(limma_res_table) # my data
ID logFC AveExpr t P.Value adj.P.Val B
1 TC01006074.hg.1 10.387105 10.598539 64.98987 1.087062e-26 7.666286e-22 49.02568
2 TC09000652.hg.1 -6.714061 7.605793 -58.89460 9.425114e-26 3.245665e-21 47.38025
3 TC01003777.hg.1 8.687729 8.227119 57.87758 1.380684e-25 3.245665e-21 47.08051
4 TC05003313.hg.1 8.806027 8.610895 56.62782 2.227560e-25 3.927355e-21 46.70139
5 TC0Y000053.hg.1 -6.634259 8.264292 -54.74166 4.678188e-25 6.598397e-21 46.10554
6 TC09002259.hg.1 -7.784896 7.888734 -49.51870 4.197832e-24 4.934062e-20 44.29154
> limma_res_table <- rename(limma_res_table, c("affy_hta_2_0" = "ID"))
> limma_res_table <- merge(limma_res_table,G_list,by="affy_hta_2_0")
[1] affy_hta_2_0 logFC AveExpr t P.Value adj.P.Val B ensembl_gene_id.x
[9] GeneSymbol ensembl_gene_id.y hgnc_symbol
<0 rows> (or 0-length row.names)
# So... this is my work around:
> an <- c(G_list$affy_hta_2_0)
> G_list$anno = paste(an,".1", sep = "")
> G_list <- rename(G_list, c("old" = "affy_hta_2_0"))
> G_list <- rename(G_list, c("affy_hta_2_0" = "anno"))
> limma_res_table <- rename(limma_res_table, c("affy_hta_2_0" = "ID"))
> limma_res_table <- merge(limma_res_table,G_list,by="affy_hta_2_0")
> head(limma_res_table)
affy_hta_2_0 logFC AveExpr t P.Value adj.P.Val B old ensembl_gene_id hgnc_symbol
1 TC01000001.hg.1 -0.04142239 4.130454 -0.4328826 0.6693042 0.8680199 -7.574996 TC01000001.hg ENSG00000223972 DDX11L1
2 TC01000001.hg.1 -0.04142239 4.130454 -0.4328826 0.6693042 0.8680199 -7.574996 TC01000001.hg ENSG00000248472 DDX11L9
3 TC01000001.hg.1 -0.04142239 4.130454 -0.4328826 0.6693042 0.8680199 -7.574996 TC01000001.hg ENSG00000288170 DDX11L9
4 TC01000001.hg.1 -0.04142239 4.130454 -0.4328826 0.6693042 0.8680199 -7.574996 TC01000001.hg ENSG00000233614 DDX11L10
5 TC01000001.hg.1 -0.04142239 4.130454 -0.4328826 0.6693042 0.8680199 -7.574996 TC01000001.hg ENSG00000279928
6 TC01000002.hg.1 0.02323564 3.956609 0.2290388 0.8209495 0.9363863 -7.644989 TC01000002.hg ENSG00000283921 MIR1302-9
> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] es_ES.UTF-8/es_ES.UTF-8/es_ES.UTF-8/C/es_ES.UTF-8/es_ES.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] fgsea_1.14.0 Rsubread_2.2.6 limma_3.44.3 ggplot2_3.3.2 pheatmap_1.0.12 dplyr_1.0.2 GEOquery_2.56.0
[8] Biobase_2.48.0 BiocGenerics_0.34.0 biomaRt_2.44.4
loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 lattice_0.20-41 tidyr_1.1.2 prettyunits_1.1.1 assertthat_0.2.1 digest_0.6.27 BiocFileCache_1.12.1
[8] R6_2.5.0 stats4_4.0.3 RSQLite_2.2.1 httr_1.4.2 pillar_1.4.6 rlang_0.4.8 progress_1.2.2
[15] curl_4.3 rstudioapi_0.13 data.table_1.13.2 blob_1.2.1 S4Vectors_0.26.1 Matrix_1.2-18 labeling_0.4.2
[22] BiocParallel_1.22.0 statmod_1.4.35 readr_1.4.0 stringr_1.4.0 bit_4.0.4 munsell_0.5.0 compiler_4.0.3
[29] pkgconfig_2.0.3 askpass_1.1 openssl_1.4.3 tidyselect_1.1.0 gridExtra_2.3 tibble_3.0.4 IRanges_2.22.2
[36] XML_3.99-0.5 fansi_0.4.1 crayon_1.3.4 dbplyr_2.0.0 withr_2.3.0 rappdirs_0.3.1 grid_4.0.3
[43] gtable_0.3.0 lifecycle_0.2.0 DBI_1.1.0 magrittr_2.0.1 scales_1.1.1 cli_2.1.0 stringi_1.5.3
[50] farver_2.0.3 xml2_1.3.2 ellipsis_0.3.1 generics_0.1.0 vctrs_0.3.5 fastmatch_1.1-0 RColorBrewer_1.1-2
[57] tools_4.0.3 bit64_4.0.5 glue_1.4.2 purrr_0.3.4 hms_0.5.3 AnnotationDbi_1.50.3 colorspace_2.0-0
[64] memoise_1.1.0