Could GenomicFeature include genes that are on multiple chromosomes?
1
0
Entering edit mode
hxlei613 • 0
@hxlei613-16299
Last seen 3.3 years ago
China

Hi, I am working with macaque, whose annotation and reference are not complete. I download Refseq annotation from UCSC and load the gtf using this.

annotation_m = makeTxDbFromGFF(gtfFile,
                              format = 'gtf',
                              dataSource = 'ensemble',
                              organism = 'Macaca fascicularis',
                              chrominfo = chromInfo,
                              metadata = metaData)
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In makeTxDbFromGRanges(gr, metadata = metadata) :
  The following transcripts were dropped because their exon ranks could
  not be inferred (either because the exons are not on the same
  chromosome/strand or because they are not separated by introns):
  NM_001283298, NM_001283311, NM_001283379, NM_001283387, NM_001283401,
  NM_001283462, NM_001283504, NM_001283522, NM_001283551, NM_001283594,
  NM_001283671, NM_001283708, NM_001283746, NM_001283802, NM_001283855,
  NM_001283893, NM_001284027, NM_001284076, NM_001284114, NM_001284173,
  NM_001284607, NM_001284630, NM_001284689, NM_001284707, NM_001284756,
  NM_001284835, NM_001284840, NM_001284890, NM_001284912, NM_001285101,
  NM_001285216, NM_001285318, NM_001319465, NM_001319481, NM_001319512,
  NM_001319514, NM_001319538, NM_001319588, NM_001319591

These genes may be paralogous genes .I don't want to drop them. Is there any way to include them ? I haven't find any answers. Thank you very much for helping me!

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: /usr/local/lib/libblas.so.3.2.1
LAPACK: /usr/local/lib/liblapack.so.3.2.1

locale:
 [1] LC_CTYPE=en_HK.UTF-8       LC_NUMERIC=C               LC_TIME=en_HK.UTF-8       
 [4] LC_COLLATE=en_HK.UTF-8     LC_MONETARY=en_HK.UTF-8    LC_MESSAGES=en_HK.UTF-8   
 [7] LC_PAPER=en_HK.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_HK.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] stats4    parallel  splines   stats     graphics  grDevices utils     datasets  methods  
[10] base     

other attached packages:
 [1] GenomicFeatures_1.30.3 AnnotationDbi_1.40.0   GenomicRanges_1.30.3   GenomeInfoDb_1.14.0   
 [5] IRanges_2.12.0         S4Vectors_0.16.0       bladderbatch_1.16.0    Biobase_2.38.0        
 [9] BiocGenerics_0.24.0    limma_3.34.9           sva_3.26.0             BiocParallel_1.12.0   
[13] genefilter_1.60.0      mgcv_1.8-12            nlme_3.1-128           survival_2.39-4       
[17] BiocInstaller_1.28.0   RColorBrewer_1.1-2     gplots_3.0.1           scales_0.5.0          
[21] cqn_1.24.0             quantreg_5.35          SparseM_1.77           preprocessCore_1.40.0 
[25] nor1mix_1.2-3          mclust_5.4             ggfortify_0.4.5        ggplot2_2.2.1         
[29] hash_2.2.6             pheatmap_1.0.8         stringr_1.3.0         

loaded via a namespace (and not attached):
 [1] bitops_1.0-6               matrixStats_0.53.1         bit64_0.9-7               
 [4] progress_1.2.0             httr_1.3.1                 tools_3.4.3               
 [7] R6_2.2.2                   KernSmooth_2.23-15         DBI_1.0.0                 
[10] lazyeval_0.2.1             colorspace_1.3-2           tidyselect_0.2.5          
[13] gridExtra_2.3              prettyunits_1.0.2          RMySQL_0.10.15            
[16] bit_1.1-14                 compiler_3.4.3             DelayedArray_0.4.1        
[19] rtracklayer_1.38.3         caTools_1.17.1             digest_0.6.15             
[22] Rsamtools_1.30.0           XVector_0.18.0             pkgconfig_2.0.1           
[25] rlang_0.2.2                rstudioapi_0.7             RSQLite_2.1.1             
[28] bindr_0.1.1                gtools_3.5.0               dplyr_0.7.6               
[31] RCurl_1.95-4.10            magrittr_1.5               GenomeInfoDbData_1.0.0    
[34] Matrix_1.2-6               Rcpp_0.12.19               munsell_0.4.3             
[37] stringi_1.1.6              SummarizedExperiment_1.8.1 zlibbioc_1.24.0           
[40] plyr_1.8.4                 grid_3.4.3                 blob_1.1.1                
[43] gdata_2.18.0               crayon_1.3.4               lattice_0.20-33           
[46] Biostrings_2.46.0          annotate_1.56.2            hms_0.4.2                 
[49] pillar_1.2.1               biomaRt_2.34.2             XML_3.98-1.11             
[52] glue_1.3.0                 MatrixModels_0.4-1         gtable_0.2.0              
[55] purrr_0.2.5                tidyr_0.8.1                assertthat_0.2.0          
[58] xtable_1.8-3               tibble_1.4.2               GenomicAlignments_1.14.2  
[61] memoise_1.1.0              bindrcpp_0.2.2            

 

 


GenomicFeature • 762 views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 4 hours ago
United States

You could use makeTxDbFromUCSC instead, but that has issues as well, for transcripts with weird CDS lengths:

> z <- makeTxDbFromUCSC("macFas5", "refGene")
Download the refGene table ... OK
Download the hgFixed.refLink table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .extractCdsLocsFromUCSCTxTable(ucsc_txtable) :
  UCSC data anomaly in 202 transcript(s): the cds cumulative length is
  not a multiple of 3 for transcripts 'NM_001283848' 'NM_001285023'
  'NM_001283570' 'NM_001285318' 'NM_001283532' 'NM_001283699'
  'NM_001284065' 'NM_001284063' 'NM_001283962' 'NM_001284085'
  'NM_001283287' 'NM_001283322' 'NM_001285249' 'NM_001283901'
  'NM_001283331' 'NM_001283810' 'NM_001283551' 'NM_001284115'
  'NM_001284114' 'NM_001284919' 'NM_001283862' 'NM_001284758'
  'NM_001285021' 'NM_001284944' 'NM_001283699' 'NM_001284840'
  'NM_001291917' 'NM_001292005' 'NM_001291870' 'NM_001285136'
  'NM_001285032' 'NM_001285082' 'NM_001283449' 'NM_001285200'
  'NM_001284085' 'NM_001284637' 'NM_001283244' 'NM_001283657'
  'NM_001284707' 'NM_001283842' 'NM_001285191' 'NM_001285202'
  'NM_001319510' 'NM_001284092' 'NM_001284038' 'NM_001284161'
  'NM_001283240' 'NM_001284745' 'NM_001284986' 'NM_001285081'
  'NM_001284997' 'NM_001285097' 'NM_001285074' 'NM_001283802'
  'NM_001285211' 'NM_001283316' 'NM_001284054' 'NM_001283 [... truncated]

But it doesn't have problems for transcripts on separate chromosomes. For example, NM_001283298 is Entrez Gene ID 101926689:

> zz <- transcriptsBy(z)
> zz["101926689"]
GRangesList object of length 1:
$101926689
GRanges object with 2 ranges and 2 metadata columns:
      seqnames              ranges strand |     tx_id      tx_name
         <Rle>           <IRanges>  <Rle> | <integer>  <character>
  [1]     chr1 138243716-138244832      - |       183 NM_001283298
  [2]    chr11   40858952-40874389      - |      1243 NM_001283298

-------

I should note here that the problems with makeTxDbFromUCSC are simply that you get some warnings, not that you lose transcripts like you do when using the GFF.

 

ADD COMMENT
0
Entering edit mode

First thank you very much. I think this solution is good however I also have a manual annotation which cannot be imported using makeTxDbFromUCSC(). The same gene on multiple chromosome also exists in it. How should I deal with this ?

ADD REPLY
0
Entering edit mode

The TxDb packages are intended to provide information about genomic locations based on a particular annotation service. When  you use an accessor to get those data you get (depending on the accessor) either a GRanges or GRangesList object.

Your question could be formulated as 'How can I put some arbitrary genomic locations into a TxDb object?' or alternatively 'How can I add some arbitrary genomic locations to the data I get from a TxDb object?'. The second question is obviously a simpler thing to answer, and you can do that yourself by reading about the GRanges and GRangesList objects.

ADD REPLY

Login before adding your answer.

Traffic: 656 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6