Hi, I am trying to extract some basic information using the Homo.sapiens package. One of the columns I want is EXONRANK. When I query a single transcript (only one shown for simplicity), I get many rows in which EXONRANK repeats or restarts for the same transcript (REFSEQ), which I do not understand. Is this working as intended, or am I missing something obvious?
Thanks in advance
library("Homo.sapiens")
library(tidyr)
library(dplyr) #Load here so it does not interfere with the other select function
keys="NM_000341"
#Extract the relevant information from the database
raw_data <- AnnotationDbi::select(Homo.sapiens, keys=keys, columns=c("EXONCHROM","SYMBOL","REFSEQ",
"EXONRANK", "EXONSTART","EXONEND", "EXONSTRAND"), keytype="REFSEQ")
raw_data
      REFSEQ SYMBOL EXONCHROM EXONSTRAND EXONSTART  EXONEND EXONRANK
1  NM_000341 SLC3A1      chr2          +  44502597 44503104        1
2  NM_000341 SLC3A1      chr2          +  44507855 44508034        2
3  NM_000341 SLC3A1      chr2          +  44508526 44508680        3
4  NM_000341 SLC3A1      chr2          +  44513171 44513296        4
5  NM_000341 SLC3A1      chr2          +  44527110 44527229        5
6  NM_000341 SLC3A1      chr2          +  44528142 44528556        6
7  NM_000341 SLC3A1      chr2          +  44528142 44528266        6
8  NM_000341 SLC3A1      chr2          +  44531282 44531477        7
9  NM_000341 SLC3A1      chr2          +  44539725 44539929        8
10 NM_000341 SLC3A1      chr2          +  44539725 44539892        8
11 NM_000341 SLC3A1      chr2          +  44540974 44542382        9
12 NM_000341 SLC3A1      chr2          +  44540974 44541090        9
13 NM_000341 SLC3A1      chr2          +  44545257 44545894       10
14 NM_000341 SLC3A1      chr2          +  44547338 44547962       10
15 NM_000341 SLC3A1      chr2          +  44512222 44513296        1
16 NM_000341 SLC3A1      chr2          +  44527110 44527229        2
17 NM_000341 SLC3A1      chr2          +  44528142 44528266        3
18 NM_000341 SLC3A1      chr2          +  44531282 44531477        4
19 NM_000341 SLC3A1      chr2          +  44539725 44539892        5
20 NM_000341 SLC3A1      chr2          +  44540974 44541090        6
21 NM_000341 SLC3A1      chr2          +  44547338 44547962        7
22 NM_000341 SLC3A1      chr2          +  44530945 44531477        1
23 NM_000341 SLC3A1      chr2          +  44539725 44539892        2
24 NM_000341 SLC3A1      chr2          +  44540974 44541090        3
25 NM_000341 SLC3A1      chr2          +  44547338 44547962        4
sessionInfo( )
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=es_ES.UTF-8 LC_NUMERIC=C
[3] LC_TIME=es_ES.UTF-8 LC_COLLATE=es_ES.UTF-8
[5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=es_ES.UTF-8
[7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] dplyr_1.0.5
[2] tidyr_1.1.2
[3] Homo.sapiens_1.3.1
[4] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[5] org.Hs.eg.db_3.10.0
[6] GO.db_3.10.0
[7] OrganismDbi_1.28.0
[8] GenomicFeatures_1.38.2
[9] GenomicRanges_1.38.0
[10] GenomeInfoDb_1.22.1
[11] AnnotationDbi_1.48.0
[12] IRanges_2.20.2
[13] S4Vectors_0.24.4
[14] Biobase_2.46.0
[15] BiocGenerics_0.32.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 lattice_0.20-41
[3] prettyunits_1.1.1 Rsamtools_2.2.3
[5] Biostrings_2.54.0 assertthat_0.2.1
[7] utf8_1.1.4 BiocFileCache_1.10.2
[9] R6_2.5.0 RSQLite_2.2.4
[11] httr_1.4.2 pillar_1.5.0
[13] zlibbioc_1.32.0 rlang_0.4.10
[15] progress_1.2.2 curl_4.2
[17] blob_1.2.1 Matrix_1.3-2
[19] BiocParallel_1.20.1 stringr_1.4.0
[21] RCurl_1.98-1.3 bit_4.0.4
[23] biomaRt_2.42.1 DelayedArray_0.12.3
[25] compiler_3.6.3 rtracklayer_1.46.0
[27] pkgconfig_2.0.3 askpass_1.1
[29] openssl_1.4.3 tidyselect_1.1.0
[31] SummarizedExperiment_1.16.1 tibble_3.1.0
[33] GenomeInfoDbData_1.2.2 matrixStats_0.58.0
[35] XML_3.99-0.3 fansi_0.4.2
[37] crayon_1.4.1 dbplyr_2.1.0
[39] GenomicAlignments_1.22.1 bitops_1.0-6
[41] rappdirs_0.3.3 RBGL_1.62.1
[43] grid_3.6.3 lifecycle_1.0.0
[45] DBI_1.1.1 magrittr_2.0.1
[47] graph_1.64.0 stringi_1.5.3
[49] cachem_1.0.4 XVector_0.26.0
[51] ellipsis_0.3.1 generics_0.1.0
[53] vctrs_0.3.6 tools_3.6.3
[55] bit64_4.0.5 glue_1.4.2
[57] purrr_0.3.4 hms_1.0.0
[59] fastmap_1.1.0 BiocManager_1.30.10
[61] memoise_2.0.0
I see! Thanks for the information :)
I am trying to adapt your code to hg38, and I have successfully built and used TxDb.Hsapiens.UCSC.hg38.refGene. However, I think the Homo.sapiens package only supports hg19, so this line might not work:
TXNAME and REFSEQ do not coincide for some genes (like "CTNS"). Is there a way to use Homo.sapiens with hg38?
I found a solution using the biomaRt package; however, I would like to understand how to do it using a TxDb.
The main issue here is that you are using select, which is a valid thing to do, but if you have multiple columns you end up getting back more than you might have expected. An alternative is to use exonsBy instead, which you could coerce to something else if you like.
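For illustration, here is a minimal sketch of the exonsBy approach. It is not necessarily the code from the original answer. It assumes the hg19 knownGene TxDb listed in the sessionInfo above, and it assumes that a likely reason for the extra rows is that select links REFSEQ to exons through the gene ID, so the exons of every transcript of the gene come back without a transcript column to tell them apart. The object names (gene_id, tx_names, ex_by_tx, exon_table) are made up for this example.

```r
library(GenomicFeatures)
library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# Map the RefSeq accession to its Entrez gene ID
# (AnnotationDbi::select is spelled out so dplyr::select cannot mask it)
gene_id <- unique(AnnotationDbi::select(org.Hs.eg.db,
                                        keys = "NM_000341",
                                        columns = "ENTREZID",
                                        keytype = "REFSEQ")$ENTREZID)

# Names of all transcripts annotated to that gene in this TxDb
tx_by_gene <- transcriptsBy(txdb, by = "gene")
tx_names <- tx_by_gene[[gene_id]]$tx_name

# Exons grouped by transcript: one list element per transcript,
# so exon_rank is unambiguous within each element
ex_by_tx <- exonsBy(txdb, by = "tx", use.names = TRUE)[tx_names]

# Coerce to a data.frame; group_name holds the transcript name
exon_table <- as.data.frame(ex_by_tx)
```

Because the result keeps one group per transcript, exon_rank should run from 1 upwards within each group_name instead of restarting with no visible reason. Note that the transcript names in this TxDb are UCSC knownGene identifiers, not RefSeq accessions, so matching a specific NM_ accession would still need an extra mapping step.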
Much clearer now, thanks a lot!