ISO a solution to rank-order introns by transcript
1
0
Entering edit mode
mat149 ▴ 80
@mat149-11450
Last seen 1 hour ago
United States

I need help with a method/operation to rank each intron by transcript -- based on the granges start/end. Specifically, I'm looking for a way to generate a unique identifier corresponding to: 1st intron, 2nd intron, ... last intron for each transcript, and place them in a granges mcols() Thanks

tx<-txdbmaker::makeTxDbFromGFF("genomic.gff",format="auto",chrominfo=s,dataSource="GCA_007735645.1",organism="Venturia.effusa",taxonomyId=50376,dbxrefTag="GENEID")

int<-GenomicFeatures::intronsByTranscript(tx,use.names=TRUE)
tail(int)

GRangesList object of length 6:
$`rna-gnl|PRJNA551043|FKW77_001900-T1_mrna`
GRanges object with 1 range and 0 metadata columns:
        seqnames        ranges strand
           <Rle>     <IRanges>  <Rle>
  [1] CP042204.1 460547-460594      -
  -------
  seqinfo: 21 sequences from ASM773564v1 genome

$`rna-gnl|PRJNA551043|FKW77_001938-T1_mrna`
GRanges object with 6 ranges and 0 metadata columns:
        seqnames        ranges strand
           <Rle>     <IRanges>  <Rle>
  [1] CP042204.1 466517-467022      -
  [2] CP042204.1 467072-467209      -
  [3] CP042204.1 467341-468214      -
  [4] CP042204.1 468703-468750      -
  [5] CP042204.1 469072-469125      -
  [6] CP042204.1 469246-469709      -
  -------
  seqinfo: 21 sequences from ASM773564v1 genome

$`rna-gnl|PRJNA551043|FKW77_001959-T1_mrna`
GRanges object with 2 ranges and 0 metadata columns:
        seqnames        ranges strand
           <Rle>     <IRanges>  <Rle>
  [1] CP042204.1 471047-471094      -
  [2] CP042204.1 471781-471826      -
  -------
  seqinfo: 21 sequences from ASM773564v1 genome

...
<3 more elements>



sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] grid      stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] BSgenome.Vinaequalis.NCBI.ASM976944v1_1.0.0
 [2] BSgenome.Vnashicola.NCBI.ASM452265v2_1.0.0 
 [3] BSgenome.Veffusa.NCBI.ASM773564v1_1.0.0    
 [4] shape_1.4.6.1                              
 [5] RMariaDB_1.3.2                             
 [6] modeest_2.4.0                              
 [7] seqLogo_1.70.0                             
 [8] GenomicDistributions_1.12.0                
 [9] MotifDb_1.46.0                             
[10] rlist_0.4.6.2                              
[11] msaR_0.6.0                                 
[12] deepredeff_0.1.1                           
[13] seqinr_4.2-36                              
[14] ape_5.8                                    
[15] msa_1.36.1                                 
[16] OrganismDbi_1.46.0                         
[17] BSgenomeForge_1.4.0                        
[18] BSgenome_1.72.0                            
[19] BiocIO_1.14.0                              
[20] karyoploteR_1.30.0                         
[21] regioneR_1.36.0                            
[22] biovizBase_1.52.0                          
[23] Gviz_1.48.0                                
[24] AnnotationForge_1.46.0                     
[25] AnnotationHub_3.12.0                       
[26] BiocFileCache_2.12.0                       
[27] dbplyr_2.5.0                               
[28] UniProt.ws_2.44.0                          
[29] RSQLite_2.3.7                              
[30] gridExtra_2.3                              
[31] ggbio_1.52.0                               
[32] Rsubread_2.18.0                            
[33] ShortRead_1.62.0                           
[34] BiocParallel_1.38.0                        
[35] Rhisat2_1.20.0                             
[36] trackViewer_1.40.0                         
[37] rtracklayer_1.64.0                         
[38] R.utils_2.12.3                             
[39] R.oo_1.26.0                                
[40] R.methodsS3_1.8.2                          
[41] RColorBrewer_1.1-3                         
[42] scatterplot3d_0.3-44                       
[43] ggplot2_3.5.1                              
[44] gplots_3.1.3.1                             
[45] Cairo_1.6-2                                
[46] motifStack_1.48.0                          
[47] openxlsx_4.2.6.1                           
[48] GenomicFeatures_1.56.0                     
[49] AnnotationDbi_1.66.0                       
[50] GenomicAlignments_1.40.0                   
[51] Rsamtools_2.20.0                           
[52] Biostrings_2.72.1                          
[53] XVector_0.44.0                             
[54] SummarizedExperiment_1.34.0                
[55] Biobase_2.64.0                             
[56] MatrixGenerics_1.16.0                      
[57] matrixStats_1.3.0                          
[58] GenomicRanges_1.56.1                       
[59] GenomeInfoDb_1.40.1                        
[60] IRanges_2.38.1                             
[61] S4Vectors_0.42.1                           
[62] BiocGenerics_0.50.0                        
[63] devtools_2.4.5                             
[64] usethis_3.0.0                              

loaded via a namespace (and not attached):
  [1] fs_1.6.4                    ProtGenerics_1.36.0        
  [3] bitops_1.0-8                DirichletMultinomial_1.46.0
  [5] TFBSTools_1.42.0            httr_1.4.7                 
  [7] InteractionSet_1.32.0       profvis_0.3.8              
  [9] tools_4.4.1                 backports_1.5.0            
 [11] utf8_1.2.4                  R6_2.5.1                   
 [13] lazyeval_0.2.2              rhdf5filters_1.16.0        
 [15] urlchecker_1.0.1            withr_3.0.1                
 [17] prettyunits_1.2.0           GGally_2.2.1               
 [19] cli_3.6.3                   SGSeq_1.38.0               
 [21] grImport_0.9-7              readr_2.1.5                
 [23] txdbmaker_1.0.1             foreign_0.8-87             
 [25] dichromat_2.0-0.1           sessioninfo_1.2.2          
 [27] plotrix_3.8-4               rstudioapi_0.16.0          
 [29] generics_0.1.3              hwriter_1.3.2.1            
 [31] gtools_3.9.5                dplyr_1.1.4                
 [33] zip_2.3.1                   GO.db_3.19.1               
 [35] Matrix_1.7-0                interp_1.1-6               
 [37] fansi_1.0.6                 abind_1.4-5                
 [39] lifecycle_1.0.4             yaml_2.3.10                
 [41] rhdf5_2.48.0                SparseArray_1.4.8          
 [43] blob_1.2.4                  promises_1.3.0             
 [45] crayon_1.5.3                pwalign_1.0.0              
 [47] miniUI_0.1.1.1              lattice_0.22-6             
 [49] annotate_1.82.0             KEGGREST_1.44.1            
 [51] pillar_1.9.0                knitr_1.48                 
 [53] statip_0.2.3                rjson_0.2.21               
 [55] codetools_0.2-20            strawr_0.0.92              
 [57] glue_1.7.0                  data.table_1.15.4          
 [59] remotes_2.5.0               vctrs_0.6.5                
 [61] png_0.1-8                   gtable_0.3.5               
 [63] poweRlaw_0.80.0             cachem_1.1.0               
 [65] xfun_0.46                   S4Arrays_1.4.1             
 [67] mime_0.12                   pracma_2.4.4               
 [69] timeDate_4032.109           ellipsis_0.3.2             
 [71] nlme_3.1-165                bit64_4.0.5                
 [73] progress_1.2.3              filelock_1.0.3             
 [75] fBasics_4032.96             KernSmooth_2.23-24         
 [77] rpart_4.1.23                splitstackshape_1.4.8      
 [79] colorspace_2.1-1            DBI_1.2.3                  
 [81] Hmisc_5.1-3                 nnet_7.3-19                
 [83] ade4_1.7-22                 tidyselect_1.2.1           
 [85] timeSeries_4032.109         bit_4.0.5                  
 [87] compiler_4.4.1              curl_5.2.1                 
 [89] httr2_1.0.2                 graph_1.82.0               
 [91] rjsoncons_1.3.1             htmlTable_2.4.3            
 [93] bezier_1.1.2                xml2_1.3.6                 
 [95] DelayedArray_0.30.1         checkmate_2.3.2            
 [97] scales_1.3.0                caTools_1.18.2             
 [99] spatial_7.3-17              RBGL_1.80.0                
[101] rappdirs_0.3.3              stringr_1.5.1              
[103] digest_0.6.36               rmarkdown_2.28             
[105] htmltools_0.5.8.1           pkgconfig_2.0.3            
[107] jpeg_0.1-10                 base64enc_0.1-3            
[109] stabledist_0.7-2            fastmap_1.2.0              
[111] ensembldb_2.28.0            rlang_1.1.4                
[113] htmlwidgets_1.6.4           UCSC.utils_1.0.0           
[115] shiny_1.9.1                 jsonlite_1.8.8             
[117] VariantAnnotation_1.50.0    RCurl_1.98-1.16            
[119] magrittr_2.0.3              Formula_1.2-5              
[121] GenomeInfoDbData_1.2.12     Rhdf5lib_1.26.0            
[123] munsell_0.5.1               Rcpp_1.0.13                
[125] reticulate_1.38.0           bamsignals_1.36.0          
[127] stringi_1.8.4               stable_1.1.6               
[129] zlibbioc_1.50.0             MASS_7.3-61                
[131] plyr_1.8.9                  pkgbuild_1.4.4             
[133] ggstats_0.6.0               parallel_4.4.1             
[135] deldir_2.0-4                CNEr_1.40.0                
[137] hms_1.1.3                   igraph_2.0.3               
[139] RUnit_0.4.33                rmutil_1.1.10              
[141] reshape2_1.4.4              biomaRt_2.60.1             
[143] pkgload_1.4.0               TFMPvalue_0.0.9            
[145] BiocVersion_3.19.1          XML_3.99-0.17              
[147] evaluate_0.24.0             latticeExtra_0.6-30        
[149] BiocManager_1.30.23         tzdb_0.4.0                 
[151] httpuv_1.6.15               tidyr_1.3.1                
[153] purrr_1.0.2                 clue_0.3-65                
[155] BiocBaseUtils_1.6.0         xtable_1.8-4               
[157] restfulr_0.0.15             AnnotationFilter_1.28.0    
[159] later_1.3.2                 tibble_3.2.1               
[161] memoise_2.0.1               cluster_2.1.6
GenomicRanges GenomicFeatures • 349 views
ADD COMMENT
2
Entering edit mode
@james-w-macdonald-5106
Last seen 24 minutes ago
United States

Something like

> z <- intronsByTranscript(tx, use.names = TRUE)
> z <- z[lengths(z) > 0L]
> znam <- sapply(strsplit(names(z), "\\|"), "[", 3)
> zz <- unlist(z)
> zz$intron <- paste(rep(znam, lengths(z)), sapply(lengths(z), seq_len), sep = "_")
> head(zz$intron, 30)
 [1] "FKW77_000178-T1_mrna_1"  
 [2] "FKW77_000228-T1_mrna_1"  
 [3] "FKW77_000243-T1_mrna_1"  
 [4] "FKW77_000249-T1_mrna_1:3"
 [5] "FKW77_000249-T1_mrna_1:2"
 [6] "FKW77_000249-T1_mrna_1:3"
 [7] "FKW77_000264-T1_mrna_1:2"
 [8] "FKW77_000264-T1_mrna_1:4"
 [9] "FKW77_000300-T1_mrna_1"  
[10] "FKW77_000300-T1_mrna_1:2"
[11] "FKW77_000300-T1_mrna_1:2"
[12] "FKW77_000331-T1_mrna_1:2"
[13] "FKW77_000331-T1_mrna_1"  
[14] "FKW77_000372-T1_mrna_1:4"
[15] "FKW77_000372-T1_mrna_1:3"
[16] "FKW77_000372-T1_mrna_1"  
[17] "FKW77_000372-T1_mrna_1:2"
[18] "FKW77_000383-T1_mrna_1"  
[19] "FKW77_000400-T1_mrna_1:5"
[20] "FKW77_000400-T1_mrna_1:5"
[21] "FKW77_000451-T1_mrna_1:2"
[22] "FKW77_000451-T1_mrna_1:2"
[23] "FKW77_000460-T1_mrna_1:7"
[24] "FKW77_000460-T1_mrna_1:8"
[25] "FKW77_000469-T1_mrna_1"  
[26] "FKW77_000584-T1_mrna_1:2"
[27] "FKW77_000584-T1_mrna_1:2"
[28] "FKW77_000584-T1_mrna_1:2"
[29] "FKW77_000584-T1_mrna_1:2"
[30] "FKW77_000602-T1_mrna_1"
1
Entering edit mode

There's an error in the last line. Note that sapply(lengths(z), seq_len) will return a list because the lengths vary. We need a vector, so have to convert that list to a vector. There are (at least) two ways to do that, using either unlist or (my preference) do.call, which is (probably) more likely to do what you expect.

> zz$intron <- paste(rep(znam, lengths(z)), do.call(c, sapply(lengths(z), seq_len)), sep = "_")
> head(zz$intron, 30)
 [1] "FKW77_000178-T1_mrna_1"
 [2] "FKW77_000228-T1_mrna_1"
 [3] "FKW77_000243-T1_mrna_1"
 [4] "FKW77_000249-T1_mrna_1"
 [5] "FKW77_000249-T1_mrna_2"
 [6] "FKW77_000249-T1_mrna_3"
 [7] "FKW77_000264-T1_mrna_1"
 [8] "FKW77_000264-T1_mrna_2"
 [9] "FKW77_000300-T1_mrna_1"
[10] "FKW77_000300-T1_mrna_2"
[11] "FKW77_000300-T1_mrna_3"
[12] "FKW77_000331-T1_mrna_1"
[13] "FKW77_000331-T1_mrna_2"
[14] "FKW77_000372-T1_mrna_1"
[15] "FKW77_000372-T1_mrna_2"
[16] "FKW77_000372-T1_mrna_3"
[17] "FKW77_000372-T1_mrna_4"
[18] "FKW77_000383-T1_mrna_1"
[19] "FKW77_000400-T1_mrna_1"
[20] "FKW77_000400-T1_mrna_2"
[21] "FKW77_000451-T1_mrna_1"
[22] "FKW77_000451-T1_mrna_2"
[23] "FKW77_000460-T1_mrna_1"
[24] "FKW77_000460-T1_mrna_2"
[25] "FKW77_000469-T1_mrna_1"
[26] "FKW77_000584-T1_mrna_1"
[27] "FKW77_000584-T1_mrna_2"
[28] "FKW77_000584-T1_mrna_3"
[29] "FKW77_000584-T1_mrna_4"
[30] "FKW77_000602-T1_mrna_1"
ADD REPLY
0
Entering edit mode

Hey, In revisiting this question, the following code works for the '+' strand. I am having issues with demarcating intron rank by transcript on '-' strand introns. Negative strand intron ranks by transcript are in the reverse orientation....(e.g. the output is 1-2-3-4-5 when it should be 5-4-3-2-1) as granges 'end' and 'start' should be inverted for the negative strand intron ranking convention. Is there a subsetting operation that can address this?

z <- intronsByTranscript(tx, use.names = TRUE)
z <- z[lengths(z) > 0L]
znam <- sapply(strsplit(names(z), "\\|"), "[", 3)
zz <- unlist(z)
ind[,7]<-unlist(sapply(lengths(z), seq_len))
colnames(ind)[7]<-"IntronRankByTranscript"

head(ind)
            seqnames  start    end width strand          GID IntronRankByTranscript
VEintron1 CP042185.1 102664 102719    56      + FKW77_000178                      1
VEintron2 CP042185.1 114417 114472    56      + FKW77_000228                      1
VEintron3 CP042185.1 117912 117961    50      + FKW77_000243                      1
VEintron4 CP042185.1 118970 119025    56      +         CYT1                      1
VEintron5 CP042185.1 119184 119241    58      +         CYT1                      2
VEintron6 CP042185.1 119905 119957    53      +         CYT1                      3

tail(ind)
                seqnames  start    end width strand          GID IntronRankByTranscript
VEintron20729 CP042204.1 501893 502450   558      - FKW77_002098                      1
VEintron20730 CP042204.1 502552 502598    47      - FKW77_002098                      2
VEintron20731 CP042204.1 502620 502666    47      - FKW77_002098                      3
VEintron20732 CP042204.1 502765 502810    46      - FKW77_002098                      4
VEintron20733 CP042204.1 502858 503688   831      - FKW77_002098                      5
VEintron20734 CP042204.1 512952 513671   720      - FKW77_002145                      1
ADD REPLY

Login before adding your answer.

Traffic: 537 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6