Question

Biomart - upstream region of 'coding'

andrebolerbarros ▴ 20

@andrebolerbarros-16788

Last seen 11 months ago

Portugal

Hi everyone,

I want to perform motif enrichment analysis, so I want to use the 500 bp upstream of the coding region plus the first 100 bp of the coding region itself. I think I am set for the coding region, but I wanted to confirm that I am retrieving the upstream part correctly with coding_gene_flank. Currently, I am doing this:

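# getSequence() returns a data frame: a sequence column named after seqType, plus the ID column
seq1 = biomaRt::getSequence(id=i, type="ensembl_gene_id", seqType="coding_gene_flank", upstream = 500, mart = ensembl, verbose = TRUE)
seq2 = biomaRt::getSequence(id=i, type="ensembl_gene_id", seqType="coding", mart = ensembl, verbose = TRUE)

# 500 bp upstream flank followed by the first 100 bp of the coding sequence
seq <- paste0(seq1$coding_gene_flank, substr(seq2$coding, 1, 100))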

Thanks in advance!

sessionInfo()
R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Matrix products: default


locale:
[1] LC_COLLATE=Portuguese_Portugal.utf8  LC_CTYPE=Portuguese_Portugal.utf8   
[3] LC_MONETARY=Portuguese_Portugal.utf8 LC_NUMERIC=C                        
[5] LC_TIME=Portuguese_Portugal.utf8    

time zone: Europe/Lisbon
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomaRt_2.56.1

loaded via a namespace (and not attached):
 [1] KEGGREST_1.40.0         gtable_0.3.4            xfun_0.40              
 [4] ggplot2_3.4.2           rstatix_0.7.2           Biobase_2.60.0         
 [7] vctrs_0.6.3             tools_4.3.0             bitops_1.0-7           
[10] generics_0.1.3          stats4_4.3.0            curl_5.0.2             
[13] tibble_3.2.1            fansi_1.0.4             AnnotationDbi_1.62.2   
[16] RSQLite_2.3.1           blob_1.2.4              pkgconfig_2.0.3        
[19] dbplyr_2.3.3            S4Vectors_0.38.1        lifecycle_1.0.3        
[22] GenomeInfoDbData_1.2.10 compiler_4.3.0          stringr_1.5.0          
[25] Biostrings_2.68.1       progress_1.2.2          munsell_0.5.0          
[28] carData_3.0-5           GenomeInfoDb_1.36.2     htmltools_0.5.6        
[31] yaml_2.3.7              RCurl_1.98-1.12         car_3.1-2              
[34] tidyr_1.3.0             pillar_1.9.0            ggpubr_0.6.0           
[37] crayon_1.5.2            cachem_1.0.8            abind_1.4-5            
[40] tidyselect_1.2.0        zip_2.3.0               digest_0.6.33          
[43] stringi_1.7.12          purrr_1.0.2             dplyr_1.1.2            
[46] forcats_1.0.0           fastmap_1.1.1           grid_4.3.0             
[49] colorspace_2.1-0        cli_3.6.1               magrittr_2.0.3         
[52] XML_3.99-0.14           utf8_1.2.3              broom_1.0.5            
[55] backports_1.4.1         prettyunits_1.1.1       filelock_1.0.2         
[58] scales_1.2.1            rappdirs_0.3.3          bit64_4.0.5            
[61] rmarkdown_2.24          XVector_0.40.0          httr_1.4.7             
[64] bit_4.0.5               ggsignif_0.6.4          png_0.1-8              
[67] hms_1.1.3               openxlsx_4.2.5.2        evaluate_0.21          
[70] memoise_2.0.1           knitr_1.43              IRanges_2.34.0         
[73] BiocFileCache_2.8.0     rlang_1.1.1             Rcpp_1.0.10            
[76] glue_1.6.2              DBI_1.1.3               xml2_1.3.5             
[79] BiocGenerics_0.46.0     rstudioapi_0.15.0       R6_2.5.1               
[82] zlibbioc_1.46.0

biomaRt • 832 views

ADD COMMENT • link updated 19 months ago by Mike Smith ★ 6.6k • written 19 months ago by andrebolerbarros ▴ 20

score 0 · Answer 1 · 2023-09-11

Mike Smith ★ 6.6k

@mike-smith

Last seen 22 hours ago

EMBL Heidelberg

This looks like a good start. However, keep in mind that your seq2 object might contain multiple sequences, because seqType="coding" returns one sequence per transcript rather than one per gene. Given that transcripts can start in different places, it might not make sense to paste a single upstream flank onto all of them. If you want to do this on a per-transcript basis, you probably want to use type="ensembl_transcript_id" in the first query.
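
A minimal sketch of the per-transcript pairing described above. The helper name join_flank_and_cds and the toy sequences are hypothetical and not part of biomaRt; it only assumes that both inputs are data frames shaped like getSequence() output (one sequence column named after seqType, plus an ensembl_transcript_id column), as you would get by querying with type="ensembl_transcript_id" and seqType="coding_transcript_flank" or seqType="coding".

```r
# Hypothetical helper (not a biomaRt function): pair each transcript's own
# upstream flank with the first `n_cds` bases of its coding sequence.
join_flank_and_cds <- function(flank_df, cds_df, n_cds = 100,
                               flank_col = "coding_transcript_flank",
                               cds_col = "coding",
                               id_col = "ensembl_transcript_id") {
  # Match rows by transcript ID so a flank is never pasted onto another
  # transcript's coding sequence.
  merged <- merge(flank_df, cds_df, by = id_col)
  data.frame(
    id = merged[[id_col]],
    sequence = paste0(merged[[flank_col]],
                      substr(merged[[cds_col]], 1, n_cds)),
    stringsAsFactors = FALSE
  )
}

# Toy data standing in for two getSequence() results (made-up sequences,
# deliberately listed in a different row order).
flank_df <- data.frame(
  coding_transcript_flank = c("AAAAA", "CCCCC"),
  ensembl_transcript_id = c("T1", "T2"),
  stringsAsFactors = FALSE
)
cds_df <- data.frame(
  coding = c("ATGTTTGGG", "ATGCCCGGG"),
  ensembl_transcript_id = c("T2", "T1"),
  stringsAsFactors = FALSE
)

result <- join_flank_and_cds(flank_df, cds_df, n_cds = 6)
print(result)
```

Merging on the ID column, rather than relying on row order, matters because the two queries are not guaranteed to return transcripts in the same order.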

ADD COMMENT • link 19 months ago Mike Smith ★ 6.6k

Thanks Mike Smith! What I am doing right now is checking whether the upstream + coding sequence appears in the gene sequence (with an upstream buffer added as well):

seq3 = biomaRt::getSequence(id=i, type="ensembl_gene_id", seqType="gene_exon_intron", mart = ensembl, verbose = TRUE)
seq4 = biomaRt::getSequence(id=i, type="ensembl_gene_id", seqType="gene_flank", upstream = 500, mart = ensembl, verbose = TRUE)
# upstream flank first, then the gene body, so the string follows genomic order
full = paste0(seq4$gene_flank, seq3$gene_exon_intron)

Then, if the upstream + coding sequence is found in this full region, I keep the combination; otherwise, I discard it.
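
The filtering step could be sketched roughly as follows. The helper name keep_if_contained and the toy strings are hypothetical; the check itself is just a literal substring match with base R's grepl(fixed = TRUE). One caveat under this approach: seqType="coding" is a spliced sequence, so a candidate whose first 100 coding bases span an exon boundary will not appear contiguously in the unspliced gene sequence and would be dropped.

```r
# Hypothetical helper: keep only the candidate sequences that occur as an
# exact, contiguous substring of the reference (flank + gene) sequence.
keep_if_contained <- function(candidates, reference) {
  hits <- vapply(
    candidates,
    function(s) grepl(s, reference, fixed = TRUE),  # literal match, no regex
    logical(1),
    USE.NAMES = FALSE
  )
  candidates[hits]
}

# Toy strings standing in for the real sequences (made up for illustration).
reference  <- "GGGGAAAAATGCCCTTT"
candidates <- c("AAAAATGCCC", "AAAAATGTTT")

kept <- keep_if_contained(candidates, reference)
print(kept)
```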

ADD REPLY • link 19 months ago andrebolerbarros ▴ 20