Question

Warning in makeTxDbFromGFF

weir ▴ 10

Hi, I'm getting some warnings from makeTxDbFromGFF(). Here is the full output:

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: /home/weir/RNAedit/human_test/reference/GCF_000001405.38_GRCh38.p12_genomic.gff
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# transcript_nrow: 178581
# exon_nrow: 1945509
# cds_nrow: 1460272
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2019-06-17 22:31:22 +0800 (Mon, 17 Jun 2019)
# GenomicFeatures version at creation time: 1.34.8
# RSQLite version at creation time: 2.1.1
# DBSCHEMAVERSION: 1.2
Warning messages:
1: In .extract_exons_from_GRanges(exon_IDX, gr, ID, Name, Parent, feature = "exon",  :
  The following orphan exon were dropped (showing only the 6 first):
         seqid     start       end strand                     ID
1 NC_000001.11  15542166  15542304      +     exon-NR_135613.1-1
2 NC_000001.11  27834401  27834566      +     exon-NR_002997.1-1
3 NC_000001.11 109100193 109100612      +     exon-NR_003023.1-1
4 NC_000001.11 144875032 144875095      - exon-id-LOC107985528-1
5 NC_000001.11 144874355 144874907      - exon-id-LOC107985528-2
6 NC_000001.11 155679108 155679255      -     exon-NR_132762.1-1
           Parent                   Name
1 rna-NR_135613.1     exon-NR_135613.1-1
2 rna-NR_002997.1     exon-NR_002997.1-1
3 rna-NR_003023.1     exon-NR_003023.1-1
4 id-LOC107985528 exon-id-LOC107985528-1
5 id-LOC107985528 exon-id-LOC107985528-2
6 rna-NR_132762.1     exon-NR_132762.1-1
2: In .extract_exons_from_GRanges(cds_IDX, gr, ID, Name, Parent, feature = "cds",  :
  The following orphan CDS were dropped (showing only the 6 first):
         seqid     start       end strand               ID          Parent Name
1 NC_000001.11 144875032 144875080      - cds-LOC107985528 id-LOC107985528 <NA>
2 NC_000001.11 144874585 144874907      - cds-LOC107985528 id-LOC107985528 <NA>
3 NC_000002.12  88857361  88857683      -         cds-IGKC         id-IGKC <NA>
4 NC_000002.12  88860568  88860605      -        cds-IGKJ5        id-IGKJ5 <NA>
5 NC_000002.12  88860886  88860923      -        cds-IGKJ4        id-IGKJ4 <NA>
6 NC_000002.12  88861221  88861258      -        cds-IGKJ3        id-IGKJ3 <NA>
3: In .find_exon_cds(exons, cds) :
  The following transcripts have exons that contain more than one CDS
  (only the first CDS was kept for each exon): rna-NM_001134939.1,
  rna-NM_001172437.2, rna-NM_001184961.1, rna-NM_001301020.1,
  rna-NM_001301302.1, rna-NM_001301371.1, rna-NM_002537.3,
  rna-NM_004152.3, rna-NM_015068.3, rna-NM_016178.2


> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS release 6.5 (Final)

Matrix products: default
BLAS/LAPACK: /home/weir/anaconda3/lib/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] GenomicFeatures_1.34.8 AnnotationDbi_1.44.0   Biobase_2.40.0
[4] GenomicRanges_1.34.0   GenomeInfoDb_1.16.0    IRanges_2.16.0
[7] S4Vectors_0.20.1       AnnotationHub_2.12.1   BiocGenerics_0.28.0

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.12.0   progress_1.2.0
 [3] lattice_0.20-38               htmltools_0.3.6
 [5] rtracklayer_1.42.0            yaml_2.2.0
 [7] interactiveDisplayBase_1.18.0 blob_1.1.1
 [9] XML_3.98-1.12                 rlang_0.3.4
[11] later_0.8.0                   DBI_1.0.0
[13] BiocParallel_1.16.0           bit64_0.9-7
[15] matrixStats_0.54.0            GenomeInfoDbData_1.1.0
[17] stringr_1.4.0                 zlibbioc_1.26.0
[19] Biostrings_2.48.0             memoise_1.1.0
[21] biomaRt_2.38.0                httpuv_1.5.1
[23] BiocInstaller_1.30.0          curl_3.3
[25] Rcpp_1.0.1                    xtable_1.8-3
[27] promises_1.0.1                DelayedArray_0.8.0
[29] XVector_0.22.0                mime_0.6
[31] bit_1.1-12                    Rsamtools_1.34.0
[33] hms_0.4.2                     digest_0.6.18
[35] stringi_1.4.3                 shiny_1.2.0
[37] grid_3.5.1                    tools_3.5.1
[39] bitops_1.0-6                  magrittr_1.5
[41] RCurl_1.95-4.12               RSQLite_2.1.1
[43] crayon_1.3.4                  pkgconfig_2.0.2
[45] Matrix_1.2-17                 prettyunits_1.0.2
[47] assertthat_0.2.1              httr_1.4.0
[49] R6_2.4.0                      GenomicAlignments_1.18.0
[51] compiler_3.5.1

The GFF file was downloaded from https://www.ncbi.nlm.nih.gov/genome/?term=human

Can someone help me? Best wishes, weir

txdb • 2.5k views

updated 3.7 years ago by Hervé Pagès 16k • written 5.5 years ago by weir ▴ 10

score 1 · Answer 1 · 2019-06-22

Hi,

The first 2 warnings indicate that the file contains exons and CDS that were dropped because they couldn't be linked to a transcript. I just improved the warning message (the change is in GenomicFeatures 1.36.3) so it displays the number of exons or CDS that get dropped:

library(GenomicFeatures)
txdb <- makeTxDbFromGFF("ref_GRCh38.p12_top_level.gff3")
# Import genomic features from the file as a GRanges object ... OK
# Prepare the 'metadata' data frame ... OK
# Make the TxDb object ... OK
# Warning messages:
# 1: In .extract_exons_from_GRanges(exon_IDX, gr, mcols0, tx_IDX, feature="exon",:
#   1558 exons couldn't be linked to a transcript so were dropped
#   (showing only the first 6):
#          seqid     start       end strand       ID     Name ...
# 1 NC_000001.11 144875032 144875095      - id105387 id105387 ...
# 2 NC_000001.11 144874355 144874907      - id105388 id105388 ...
# 3 NC_000002.12  88857361  88857683      - id241515 id241515 ...
# 4 NC_000002.12  88860568  88860605      - id241517 id241517 ...
# 5 NC_000002.12  88860886  88860923      - id241519 id241519 ...
# 6 NC_000002.12  88861221  88861258      - id241521 id241521 ...
# 2: In .extract_exons_from_GRanges(cds_IDX, gr, mcols0, tx_IDX, feature="cds",:
#   1553 CDS couldn't be linked to a transcript so were dropped
#   (showing only the first 6):
#          seqid     start       end strand       ID Name ...
# 1 NC_000001.11 144875032 144875080      -  cds6180 <NA> ...
# 2 NC_000001.11 144874585 144874907      -  cds6180 <NA> ...
# 3 NC_000002.12  88857361  88857683      - cds14156 <NA> ...
# 4 NC_000002.12  88860568  88860605      - cds14157 <NA> ...
# 5 NC_000002.12  88860886  88860923      - cds14158 <NA> ...
# 6 NC_000002.12  88861221  88861258      - cds14159 <NA> ...
# 3: In .find_exon_cds(exons, cds) :
#   The following transcripts have exons that contain more than
#   one CDS (only the first CDS was kept for each exon): rna116402,
#   rna116403, rna137565, rna137566, rna63759, rna63761,
#   rna63764, rna9689, rna9690, rna9691
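
To make "couldn't be linked to a transcript" concrete, here is a minimal base R sketch on a made-up feature table. The table, the set of transcript types, and the helper name find_orphans() are assumptions for illustration only; this is not the code makeTxDbFromGFF() runs internally.

```r
# Toy GFF3-like table (made-up rows; a real file has many more columns).
features <- data.frame(
  ID     = c("gene-A", "rna-A1", "exon-A1-1", "id-SEG1", "exon-SEG1-1"),
  type   = c("gene", "mRNA", "exon", "J_gene_segment", "exon"),
  Parent = c(NA, "gene-A", "rna-A1", "gene-A", "id-SEG1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the features of type `feature` whose Parent
# is not the ID of any feature whose type is listed in `tx_types`.
find_orphans <- function(features, feature, tx_types) {
  tx_ids <- features$ID[features$type %in% tx_types]
  is_feature <- features$type == feature
  features[is_feature & !(features$Parent %in% tx_ids), , drop = FALSE]
}

# Assumed (illustrative) set of types treated as transcripts.
orphans <- find_orphans(features, "exon", tx_types = c("mRNA", "transcript"))
print(orphans$ID)
```

In this toy table the exon under the J_gene_segment feature is reported as an orphan, because its parent's type is not in the assumed transcript list.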

Note that the file contains some rare transcript types (scRNA, guide_RNA, telomerase_RNA, vault_RNA, Y_RNA; these are all valid Sequence Ontology terms) that makeTxDbFromGFF() didn't recognize as transcripts, which is why the exons and CDS linked to these transcripts were getting dropped. In GenomicFeatures 1.36.3 I added these types to the list of types that should be treated as transcripts, so makeTxDbFromGFF() now drops fewer exons. As a consequence, the TxDb object I get contains 44 more transcripts and 44 more exons than the one you got with your version of GenomicFeatures:

> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: ref_GRCh38.p12_top_level.gff3
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# transcript_nrow: 178625
# exon_nrow: 1945553
# cds_nrow: 1460272
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2019-06-22 17:05:56 -0700 (Sat, 22 Jun 2019)
# GenomicFeatures version at creation time: 1.37.3
# RSQLite version at creation time: 2.1.1
# DBSCHEMAVERSION: 1.2

The exons and CDS that still get dropped with GenomicFeatures 1.36.3 are linked to features of type C_gene_segment, D_gene_segment, J_gene_segment, and V_gene_segment. However, these Sequence Ontology terms are not descendants of the transcript term, so I'm reluctant to add them to the list of types that makeTxDbFromGFF() should treat as transcripts. But if someone wants to make the case for adding these terms, I'm open to it.
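
If you want to check for yourself which feature types the exons and CDS in a file hang off, one possible approach is to look up each feature's parent and tabulate the parent's type. The sketch below does this in base R on a made-up table; the rows and column names are assumptions, and on a real file you would first need to load the GFF3 attributes into a similar data frame.

```r
# Toy table (made-up rows) with one recognized transcript and one gene segment.
features <- data.frame(
  ID     = c("rna-A1", "exon-A1-1", "id-SEG1", "exon-SEG1-1", "cds-SEG1"),
  type   = c("mRNA", "exon", "V_gene_segment", "exon", "CDS"),
  Parent = c("gene-A", "rna-A1", "gene-B", "id-SEG1", "id-SEG1"),
  stringsAsFactors = FALSE
)

# Type of each feature's parent (NA when the parent ID is not in the table).
parent_type <- features$type[match(features$Parent, features$ID)]

# Cross-tabulate child type against parent type, for exons and CDS only.
is_part <- features$type %in% c("exon", "CDS")
counts <- table(child = features$type[is_part], parent = parent_type[is_part])
print(counts)
```

Any parent type in the resulting table that is not treated as a transcript points at the exons and CDS that would end up as orphans.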

Finally, the 3rd warning should be self-explanatory: on rare occasions a GFF3 file can contain a few exons with more than one CDS. makeTxDbFromGFF() does not know how to import more than one CDS per exon at the moment, so the warning just says that only the first CDS was kept for each such exon.
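
As an illustration of the situation the 3rd warning describes, here is a small base R sketch that counts how many CDS ranges fall inside each exon of the same transcript. The coordinates and the transcript name rna-X are made up, and this is only a sketch of the idea, not how .find_exon_cds() is implemented.

```r
# Made-up exon and CDS ranges for a single hypothetical transcript.
exons <- data.frame(
  tx    = c("rna-X", "rna-X"),
  start = c(100, 500),
  end   = c(300, 700),
  stringsAsFactors = FALSE
)
cds <- data.frame(
  tx    = c("rna-X", "rna-X", "rna-X"),
  start = c(120, 220, 510),
  end   = c(180, 300, 650),
  stringsAsFactors = FALSE
)

# For each exon, count the CDS rows of the same transcript lying within it.
cds_per_exon <- vapply(seq_len(nrow(exons)), function(i) {
  sum(cds$tx == exons$tx[i] &
      cds$start >= exons$start[i] &
      cds$end <= exons$end[i])
}, integer(1))

# Transcripts with at least one exon that contains more than one CDS.
multi_tx <- unique(exons$tx[cds_per_exon > 1])
print(multi_tx)
```

Here the first exon contains two CDS ranges, so the toy transcript would be one of those listed in the warning.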

GenomicFeatures 1.36.3 should become available to Bioconductor 3.9 users in about 24-48 hours via BiocManager::install(). Note that you're using Bioconductor 3.8 which is not the current release and is no longer supported.
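
A quick way to tell whether an installation already has these changes is to compare version numbers. The sketch below hard-codes the version from your sessionInfo() so that it runs anywhere; in a real session you would use packageVersion("GenomicFeatures") instead of the hard-coded value.

```r
# Version reported in the question's sessionInfo() (hard-coded for the sketch).
installed <- package_version("1.34.8")

# Version that contains the improved warnings and the extra transcript types.
required <- package_version("1.36.3")

# package_version objects compare component by component, not as strings.
needs_update <- installed < required
print(needs_update)
```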

Cheers,

H.