Hi,
I am trying to obtain the exact coordinates of transcriptome-wide 3'UTRs. To this end, I obtained a list of all mm10 GENCODE-annotated transcript isoforms and planned to use biomaRt to retrieve the exact 3'UTR coordinates for them.
I am running into a surprising issue: for many transcripts (e.g. ENSMUST00000000001, ENSMUST00000000003, ENSMUST00000000028), I obtain two different sets of 3'UTR coordinates that do not even overlap with each other.
Below is a snippet of my code and the sessionInfo() output.
#create a vector with ENSMUST00000000001, ENSMUST00000000003, ENSMUST00000000010, ENSMUST00000000028
>mat1.data <- c("ENSMUST00000000001", "ENSMUST00000000003", "ENSMUST00000000010", "ENSMUST00000000028")
>mat1 <- matrix(mat1.data,nrow=4,ncol=1,byrow = T)
>mat1
     [,1]
[1,] "ENSMUST00000000001"
[2,] "ENSMUST00000000003"
[3,] "ENSMUST00000000010"
[4,] "ENSMUST00000000028"
>library(biomaRt)
>db <- useMart(host="uswest.ensembl.org",biomart = "ENSEMBL_MART_ENSEMBL",dataset = "mmusculus_gene_ensembl")
>attributes = listAttributes(db)
> coordinates <- getBM(attributes=c("ensembl_transcript_id","3_utr_start","3_utr_end","chromosome_name","strand","transcript_biotype"),filters="ensembl_transcript_id", values=mat1[,1],mart=db)
For all the IDs in the example except ENSMUST00000000010, I obtain more than one set of 3'UTR coordinates. Can someone help me understand where this issue arises from and which of the listed coordinates are correct? I list the output of sessionInfo() below. Thank you in advance for your help.
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomaRt_2.48.3
loaded via a namespace (and not attached):
[1] KEGGREST_1.32.0 progress_1.2.2 tidyselect_1.1.2 purrr_0.3.4 vctrs_0.4.1 generics_0.1.2 stats4_4.1.0
[8] BiocFileCache_2.0.0 utf8_1.2.2 blob_1.2.3 XML_3.99-0.9 rlang_1.0.2 pillar_1.7.0 withr_2.5.0
[15] glue_1.6.2 DBI_1.1.2 rappdirs_0.3.3 BiocGenerics_0.38.0 bit64_4.0.5 dbplyr_2.1.1 GenomeInfoDbData_1.2.6
[22] lifecycle_1.0.1 stringr_1.4.0 zlibbioc_1.38.0 Biostrings_2.60.2 memoise_2.0.1 Biobase_2.52.0 IRanges_2.26.0
[29] fastmap_1.1.0 GenomeInfoDb_1.28.4 parallel_4.1.0 curl_4.3.2 AnnotationDbi_1.54.1 fansi_1.0.3 Rcpp_1.0.8.3
[36] filelock_1.0.2 cachem_1.0.6 S4Vectors_0.30.2 XVector_0.32.0 bit_4.0.4 hms_1.1.1 png_0.1-7
[43] digest_0.6.29 stringi_1.7.6 dplyr_1.0.9 cli_3.3.0 tools_4.1.0 bitops_1.0-7 magrittr_2.0.3
[50] RCurl_1.98-1.6 RSQLite_2.2.14 tibble_3.1.7 crayon_1.5.1 pkgconfig_2.0.3 ellipsis_0.3.2 xml2_1.3.3
[57] prettyunits_1.1.1 assertthat_0.2.1 httr_1.4.3 rstudioapi_0.13 R6_2.5.1 compiler_4.1.0
Hi Steve,
Thanks for your answer! As a rule of thumb, I think I can just focus on the longest "3'UTR exon" then. I think what confused me is that when I tried doing exactly what you suggest for ENSMUST00000000010, the coordinates listed by biomaRt did not overlap in any way with the UCSC Genome Browser annotation, and that sent me into a bout of confusion :) I have now realised that this simply arises from having obtained the 3'UTRs from the mm10 (GRCm38) version while the Genome Browser has since updated to mm39 (XD). When I use the archived version from Apr 2022 for biomaRt and the GRCm38 Genome Browser, the issue is resolved.
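For anyone who wants to apply the same rule of thumb, here is a minimal sketch of how I would pick the longest 3'UTR segment per transcript. It assumes a data frame shaped like my getBM() result above (columns ensembl_transcript_id, 3_utr_start, 3_utr_end). The helper name longest_utr_segment and the coordinates in the toy data frame are made up for illustration; they are not real annotations. Rows with NA coordinates are simply dropped before comparing lengths.

```r
# Toy data frame shaped like the getBM() result above.
# NOTE: these coordinates are invented for illustration only.
utr <- data.frame(
  ensembl_transcript_id = c("ENSMUST00000000001", "ENSMUST00000000001",
                            "ENSMUST00000000001", "ENSMUST00000000010"),
  `3_utr_start` = c(NA, 100, 500, 1000),
  `3_utr_end`   = c(NA, 150, 900, 1200),
  check.names = FALSE,        # keep column names that start with a digit
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the longest 3'UTR segment for each transcript.
longest_utr_segment <- function(df) {
  # Drop rows that carry no 3'UTR coordinates
  keep <- !is.na(df[["3_utr_start"]]) & !is.na(df[["3_utr_end"]])
  df <- df[keep, , drop = FALSE]
  # Segment length, treating start and end as inclusive
  df$utr_length <- df[["3_utr_end"]] - df[["3_utr_start"]] + 1
  # Sort by transcript, longest segment first, then keep the first row of each
  df <- df[order(df$ensembl_transcript_id, -df$utr_length), , drop = FALSE]
  df <- df[!duplicated(df$ensembl_transcript_id), , drop = FALSE]
  rownames(df) <- NULL
  df
}

longest <- longest_utr_segment(utr)
print(longest)
```

With the real query result, I would call longest_utr_segment(coordinates) instead of using the toy data frame. Note that this keeps only one segment, so a 3'UTR that is split by an intron will be truncated to its longest piece.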
I am thinking the NA arises from the intron within the 3'UTR, but I am not sure. Thanks again for your help Steve!