Question

NA transcript values in my Txdb

1

Entering edit mode

A.J. ▴ 20

@aj-24333

Last seen 3.0 years ago

United States

Hi,

I have had this problem for a while now, I kind of solved it by using Ensembl's GFF file for the prairie vole, but it'd really like to get MakeTxdbFromGFF working from the NCBI GFF file (Here)

When I try to make a txdb out of that GFF file, I get

> txdb_maybe <- makeTxDbFromGFF("/mnt/Prairie_Vole_Data/Arjen_Folder/Arjen/genomefiles_mo/NCBI/GTF/GCF_000317375.1_MicOch1.0_genomic.gff") 

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
  some transcripts have no "transcript_id" attribute ==> their name ("tx_name" column in the
  TxDb object) was set to NA
2: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
  the transcript names ("tx_name" column in the TxDb object) imported from the "transcript_id"
  attribute are not unique
3: In .find_exon_cds(exons, cds) :
  The following transcripts have exons that contain more than one CDS (only the first CDS was
  kept for each exon): rna-NM_001289870.1, rna-NM_001290102.1, rna-NM_001290499.1,

So that if I do:

> transcripts(txdb_maybe)[10:5000]
GRanges object with 4991 ranges and 2 metadata columns:
            seqnames            ranges strand |     tx_id        tx_name
               <Rle>         <IRanges>  <Rle> | <integer>    <character>
     [1] NC_022009.1   3465941-3467344      + |        10 XM_005343117.2
     [2] NC_022009.1   3877767-3886973      + |        11 XM_026782825.1
     [3] NC_022009.1   3880864-3885230      + |        12 XM_026782826.1
     [4] NC_022009.1   3880864-3886973      + |        13 XM_005343118.2
     [5] NC_022009.1   3925279-3992427      + |        14 XM_013349067.2
     ...         ...               ...    ... .       ...            ...
  [4987] NC_022011.1 29621640-29622008      - |      4996           <NA>
  [4988] NC_022011.1 29655176-29690024      - |      4997 XM_005345785.3
  [4989] NC_022011.1 29698713-29726003      - |      4998 XM_005345787.3
  [4990] NC_022011.1 30128495-30135431      - |      4999 XM_005345790.3
  [4991] NC_022011.1 30128495-30135932      - |      5000 XM_026789513.1
  -------
  seqinfo: 571 sequences from an unspecified genome; no seqlengths

Some entries don't get a transcript name. But when I look for this specific entry in the GTF file, I find:

NC_022011.1 Gnomon  gene    29621640    29622008    .   -   .   gene_id "LOC101999479"; db_xref "GeneID:101999479"; gbkey "Gene"; gene "LOC101999479"; gene_biotype "pseudogene"; pseudo "true";

So according to my GTF this entry is a gene without a transcript. But why does it gets marked up as a transcript in my txdb then? And is there any way to prevent that from happening (like exclude those "NA" transcipts from my txdb)?

Many thanks, Arjen

GFF txdb • 3.0k views

ADD COMMENT • link updated 3.5 years ago by James W. MacDonald 68k • written 3.5 years ago by A.J. ▴ 20

score 2 · Answer 1 · 2021-10-13

Hi Arjen,

Hmm.. I'm surprised you can import this GTF file at all. Are we talking about the GTF file GCF_000317375.1_MicOch1.0_genomic.gtf located here? I get the following error when I try to import it with makeTxDbFromGFF():

library(GenomicFeatures)
txdb <- makeTxDbFromGFF("GCF_000317375.1_MicOch1.0_genomic.gtf")
# Import genomic features from the file as a GRanges object ... OK
# Prepare the 'metadata' data frame ... OK
# Make the TxDb object ... Error in .make_splicings(exons, cds, stop_codons) : 
#   some CDS cannot be mapped to an exon
# In addition: Warning message:
# In .get_cds_IDX(mcols0$type, mcols0$phase) :
#   The "phase" metadata column contains non-NA values for features of type
#   stop_codon. This information was ignored.

But if I use the GFF3 file (from the same location), everything is fine:

txdb <- makeTxDbFromGFF("GCF_000317375.1_MicOch1.0_genomic.gff")
# Import genomic features from the file as a GRanges object ... OK
# Prepare the 'metadata' data frame ... OK
# Make the TxDb object ... OK
# Warning messages:
# 1: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
#   some transcripts have no "transcript_id" attribute ==> their name
#   ("tx_name" column in the TxDb object) was set to NA
# 2: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
#   the transcript names ("tx_name" column in the TxDb object) imported
#   from the "transcript_id" attribute are not unique
# 3: In .find_exon_cds(exons, cds) :
#   The following transcripts have exons that contain more than one CDS
#   (only the first CDS was kept for each exon): rna-NM_001289870.1,
#   rna-NM_001290102.1, rna-NM_001290499.1, rna-NM_001291243.1

See my sessionInfo() below.

About the unnamed transcripts: I do see an unnamed transcript at coordinates 29621640-29622008 like you did:

transcripts(txdb)[4994:4998]
# GRanges object with 5 ranges and 2 metadata columns:
#          seqnames            ranges strand |     tx_id        tx_name
#             <Rle>         <IRanges>  <Rle> | <integer>    <character>
#   [1] NC_022011.1 29591627-29617240      - |      4994 XM_005345783.2
#   [2] NC_022011.1 29591875-29614709      - |      4995 XM_013354377.2
#   [3] NC_022011.1 29621640-29622008      - |      4996           <NA>
#   [4] NC_022011.1 29655176-29690024      - |      4997 XM_005345785.3
#   [5] NC_022011.1 29698713-29726003      - |      4998 XM_005345787.3
#   -------
#   seqinfo: 571 sequences from an unspecified genome; no seqlengths

This is actually what I would expect based on what I see in the GFF3 file:

NC_022011.1     Gnomon  pseudogene      29621640        29622008        .       -       .       ID=gene-LOC101999479;Dbxref=GeneID:101999479;Name=LOC101999479;gbkey=Gene;gene=LOC101999479;...
NC_022011.1     Gnomon  exon    29621640        29622008        .       -       .       ID=id-LOC101999479;Parent=gene-LOC101999479;Dbxref=GeneID:101999479;gbkey=exon;gene=LOC101999479;...

This is a gene (pseudogene to be precise) with no reported transcript but one reported exon. What happened here is that makeTxDbFromGFF() treated it as if it had one transcript that spans the entire gene/exon. This is a feature. However it seems that makeTxDbFromGFF() didn't assign a name to this inferred transcript, which may sound unfortunate, but the name of the associated gene is not far away:

transcripts(txdb, columns=c("gene_id", "tx_id", "tx_name"))[4994:4998]
# GRanges object with 5 ranges and 3 metadata columns:
#         seqnames            ranges strand |         gene_id     tx_id
#            <Rle>         <IRanges>  <Rle> | <CharacterList> <integer>
#   [1] NC_022011.1 29591627-29617240      - |          Klhdc4      4994
#   [2] NC_022011.1 29591875-29614709      - |          Klhdc4      4995
#   [3] NC_022011.1 29621640-29622008      - |    LOC101999479      4996
#   [4] NC_022011.1 29655176-29690024      - |          Slc7a5      4997
#   [5] NC_022011.1 29698713-29726003      - |            Ca5a      4998
#              tx_name
#          <character>
#   [1] XM_005345783.2
#   [2] XM_013354377.2
#   [3]           <NA>
#   [4] XM_005345785.3
#   [5] XM_005345787.3
#   -------
#   seqinfo: 571 sequences from an unspecified genome; no seqlengths

So it's easy to assign the gene name to all the unnamed transcripts with something like:

tx <- transcripts(txdb, columns=c("gene_id", "tx_id", "tx_name"))
unnamed_tx_idx <- which(is.na(mcols(tx)$tx_name))
mcols(tx)$tx_name[unnamed_tx_idx] <- as.character(mcols(tx)$gene_id[unnamed_tx_idx])
tx[4994:4998]
# GRanges object with 5 ranges and 3 metadata columns:
#          seqnames            ranges strand |         gene_id     tx_id
#             <Rle>         <IRanges>  <Rle> | <CharacterList> <integer>
#   [1] NC_022011.1 29591627-29617240      - |          Klhdc4      4994
#   [2] NC_022011.1 29591875-29614709      - |          Klhdc4      4995
#   [3] NC_022011.1 29621640-29622008      - |    LOC101999479      4996
#   [4] NC_022011.1 29655176-29690024      - |          Slc7a5      4997
#   [5] NC_022011.1 29698713-29726003      - |            Ca5a      4998
#              tx_name
#          <character>
#   [1] XM_005345783.2
#   [2] XM_013354377.2
#   [3]   LOC101999479
#   [4] XM_005345785.3
#   [5] XM_005345787.3
#   -------
#   seqinfo: 571 sequences from an unspecified genome; no seqlengths

As for why makeTxDbFromGFF() choked on the GTF file, I'm not sure, I didn't investigate. Maybe the file is too broken for makeTxDbFromGFF()'s taste? Even if NCBI, UCSC, and Ensembl are still using it, GTF is a deprecated format and GFF3 should be used whenever possible. See http://gmod.org/wiki/GFF2.

Hope this helps,

H.

sessionInfo():

R version 4.1.0 beta (2021-05-06 r80268)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 21.04

Matrix products: default
BLAS:   /home/hpages/R/R-4.1.r80268/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.1.r80268/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] GenomicFeatures_1.44.2 AnnotationDbi_1.54.1   Biobase_2.52.0        
[4] GenomicRanges_1.44.0   GenomeInfoDb_1.28.4    IRanges_2.26.0        
[7] S4Vectors_0.31.0       BiocGenerics_0.38.0   

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7                  lattice_0.20-45            
 [3] prettyunits_1.1.1           png_0.1-7                  
 [5] Rsamtools_2.8.0             Biostrings_2.60.2          
 [7] assertthat_0.2.1            digest_0.6.28              
 [9] utf8_1.2.2                  BiocFileCache_2.0.0        
[11] R6_2.5.1                    RSQLite_2.2.8              
[13] httr_1.4.2                  pillar_1.6.3               
[15] zlibbioc_1.38.0             rlang_0.4.11               
[17] progress_1.2.2              curl_4.3.2                 
[19] rstudioapi_0.13             blob_1.2.2                 
[21] Matrix_1.3-4                BiocParallel_1.26.2        
[23] stringr_1.4.0               RCurl_1.98-1.5             
[25] bit_4.0.4                   biomaRt_2.48.3             
[27] DelayedArray_0.18.0         compiler_4.1.0             
[29] rtracklayer_1.52.1          pkgconfig_2.0.3            
[31] tidyselect_1.1.1            KEGGREST_1.32.0            
[33] SummarizedExperiment_1.22.0 tibble_3.1.5               
[35] GenomeInfoDbData_1.2.6      matrixStats_0.61.0         
[37] XML_3.99-0.8                fansi_0.5.0                
[39] crayon_1.4.1                dplyr_1.0.7                
[41] dbplyr_2.1.1                GenomicAlignments_1.28.0   
[43] bitops_1.0-7                rappdirs_0.3.3             
[45] grid_4.1.0                  lifecycle_1.0.1            
[47] DBI_1.1.1                   magrittr_2.0.1             
[49] stringi_1.7.5               cachem_1.0.6               
[51] XVector_0.32.0              xml2_1.3.2                 
[53] ellipsis_0.3.2              filelock_1.0.2             
[55] generics_0.1.0              vctrs_0.3.8                
[57] rjson_0.2.20                restfulr_0.0.13            
[59] tools_4.1.0                 bit64_4.0.5                
[61] glue_1.4.2                  purrr_0.3.4                
[63] MatrixGenerics_1.4.3        hms_1.1.1                  
[65] fastmap_1.1.0               yaml_2.2.1                 
[67] memoise_2.0.0               BiocIO_1.2.0