Question

error with: .make_splicings <- function(exons, cds, stop_codons=NULL)

0

Entering edit mode

rsnorman • 0

@e7235ccc

Last seen 2.9 years ago

United States

I am trying to use genomicFeatures as part of a epi2me Labs Differential Gene Expression pipeline using a bacterial genome and GTF annotation files downloaded from GenBank. When run, the following error message is generated:


Import genomic features from the file as a GRanges object ... OK

Prepare the 'metadata' data frame ... OK

Make the TxDb object ... Error in .make_splicings(exons, cds, stop_codons) : 
  some CDS cannot be mapped to an exon

Calls: makeTxDbFromGFF -> makeTxDbFromGRanges -> .make_splicings

In addition: Warning message:

In .get_cds_IDX(mcols0$type, mcols0$phase) :

  The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.

Execution halted

[Sun May 22 22:23:50 2022]

Error in rule de_analysis:
    jobid: 0
    output: de_analysis/results_dge.tsv, de_analysis/results_dge.pdf, de_analysis/results_dtu_gene.tsv, de_analysis/results_dtu_transcript.tsv, de_analysis/results_dtu_stageR.tsv, merged/all_counts_filtered.tsv, merged/all_gene_counts.tsv
    shell:

epi2melabs/differential-expression/pipeline-transcriptome-de/scripts/de_analysis.R

(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

A few lines of the gtf annotation file are included here:

#gtf-version 2.2
#!genome-build ASM222426v1
#!genome-build-accession NCBI_Assembly:GCA_002224265.1
#!annotation-date 10/06/2015 12:48:27
#!annotation-source NCBI

CP012881.1  Genbank gene    479 742 .   -   .   gene_id "AOT11_07205"; transcript_id ""; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "AOT11_07205"; 
CP012881.1  GeneMarkS+  CDS 482 742 .   -   0   gene_id "AOT11_07205"; transcript_id "unassigned_transcript_1"; gbkey "CDS"; inference "COORDINATES: ab initio prediction:GeneMarkS+"; locus_tag "AOT11_07205"; product "hypothetical protein"; protein_id "ASM95055.1"; transl_table "11"; exon_number "1"; 
CP012881.1  GeneMarkS+  start_codon 740 742 .   -   0   gene_id "AOT11_07205"; transcript_id "unassigned_transcript_1"; gbkey "CDS"; inference "COORDINATES: ab initio prediction:GeneMarkS+"; locus_tag "AOT11_07205"; product "hypothetical protein"; protein_id "ASM95055.1"; transl_table "11"; exon_number "1"; 
CP012881.1  GeneMarkS+  stop_codon  479 481 .   -   0   gene_id "AOT11_07205"; transcript_id "unassigned_transcript_1"; gbkey "CDS"; inference "COORDINATES: ab initio prediction:GeneMarkS+"; locus_tag "AOT11_07205"; product "hypothetical protein"; protein_id "ASM95055.1"; transl_table "11"; exon_number "1"; 
CP012881.1  Genbank gene    938 1585    .   -   .   gene_id "AOT11_07210"; transcript_id ""; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "AOT11_07210"; 
CP012881.1  GeneMarkS+  CDS 941 1585    .   -   0   gene_id "AOT11_07210"; transcript_id "unassigned_transcript_2"; gbkey "CDS"; inference "COORDINATES: ab initio prediction:GeneMarkS+"; locus_tag "AOT11_07210"; product "hypothetical protein"; protein_id "ASM95056.1"; transl_table "11"; exon_number "1"; 
CP012881.1  GeneMarkS+  start_codon 1583    1585    .   -   0   gene_id "AOT11_07210"; transcript_id "unassigned_transcript_2"; gbkey "CDS"; inference "COORDINATES: ab initio prediction:GeneMarkS+"; locus_tag "AOT11_07210"; product "hypothetical protein"; protein_id "ASM95056.1"; transl_table "11"; exon_number "1"; 
CP012881.1  GeneMarkS+  stop_codon  938 940 .   -   0   gene_id "AOT11_07210"; transcript_id "unassigned_transcript_2"; gbkey "CDS"; inference "COORDINATES: ab initio prediction:GeneMarkS+"; locus_tag "AOT11_07210"; product "hypothetical protein"; protein_id "ASM95056.1"; transl_table "11"; exon_number "1"; 
CP012881.1  Genbank gene    1554    1868    .   -   .   gene_id "AOT11_07215"; transcript_id ""; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "AOT11_07215"; 
CP012881.1  GeneMarkS+  CDS 1557    1868    .   -   0   gene_id "AOT11_07215"; transcript_id "unassigned_transcript_3"; gbkey "CDS"; inference "COORDINATES: ab initio prediction:GeneMarkS+"; locus_tag "AOT11_07215"; product "hypothetical protein"; protein_id "ASM95057.1"; transl_table "11"; exon_number "1";

Any suggestions on how to clear this error will be greatly appreciated!

GenomicFeatures • 2.8k views

ADD COMMENT • link updated 2.9 years ago by Hervé Pagès 16k • written 2.9 years ago by rsnorman • 0

score 0 · Answer 1 · 2022-05-24

That's a very old build, and an old version of the GFF standard. You would be better off getting something more recent.

> txdb <- makeTxDbFromGFF("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/204/915/GCF_002204915.1_ASM220491v1/GCF_002204915.1_ASM220491v1_genomic.gff.gz")
Import genomic features from the file as a GRanges object ... trying URL 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/204/915/GCF_002204915.1_ASM220491v1/GCF_002204915.1_ASM220491v1_genomic.gff.gz'
Content type 'application/x-gzip' length 407136 bytes (397 KB)
downloaded 397 KB

OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
  the transcript names ("tx_name" column in the TxDb object) imported
  from the "Name" attribute are not unique
2: In makeTxDbFromGRanges(gr, metadata = metadata) :
  The following transcripts were dropped because their exon ranks could
  not be inferred (either because the exons are not on the same
  chromosome/strand or because they are not separated by introns):
  gene-FORC37_RS07565

> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/204/915/GCF_002204915.1_ASM220491v1/GCF_002204915.1_ASM220491v1_genomic.gff.gz
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# Nb of transcripts: 4709
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2022-05-24 12:48:11 -0400 (Tue, 24 May 2022)
# GenomicFeatures version at creation time: 1.48.1
# RSQLite version at creation time: 2.2.14
# DBSCHEMAVERSION: 1.2
> transcripts(txdb)
GRanges object with 4709 ranges and 2 metadata columns:
              seqnames      ranges strand |     tx_id        tx_name
                 <Rle>   <IRanges>  <Rle> | <integer>    <character>
     [1] NZ_CP016321.1   1399-2499      + |         1           dnaN
     [2] NZ_CP016321.1   2509-3588      + |         2           recF
     [3] NZ_CP016321.1   3607-6024      + |         3           gyrB
     [4] NZ_CP016321.1   6447-6881      + |         4 FORC37_RS00025
     [5] NZ_CP016321.1 10698-11060      + |         5 FORC37_RS00040
     ...           ...         ...    ... .       ...            ...
  [4705] NZ_CP016323.1 12430-12579      - |      4705 FORC37_RS24350
  [4706] NZ_CP016323.1 21223-22191      - |      4706 FORC37_RS23960
  [4707] NZ_CP016323.1 29490-30458      - |      4707 FORC37_RS24060
  [4708] NZ_CP016323.1 32558-33526      - |      4708 FORC37_RS24075
  [4709] NZ_CP016323.1 38662-39629      - |      4709 FORC37_RS24145
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

OR if you like EBI/EMBL better

> txdb2 <- makeTxDbFromGFF("http://ftp.ensemblgenomes.org/pub/bacteria/release-53/gff3/bacteria_73_collection/vibrio_vulnificus_gca_004214765/Vibrio_vulnificus_gca_004214765.ASM421476v1.49.gff3.gz")
Import genomic features from the file as a GRanges object ... trying URL 'http://ftp.ensemblgenomes.org/pub/bacteria/release-53/gff3/bacteria_73_collection/vibrio_vulnificus_gca_004214765/Vibrio_vulnificus_gca_004214765.ASM421476v1.49.gff3.gz'
downloaded 352 KB

OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
> transcripts(txdb2)
GRanges object with 4985 ranges and 2 metadata columns:
                       seqnames        ranges strand |     tx_id     tx_name
                          <Rle>     <IRanges>  <Rle> | <integer> <character>
     [1] NODE_10_length_16019..     1983-2492      + |         1    RZQ00879
     [2] NODE_10_length_16019..     2812-5349      + |         2    RZQ00880
     [3] NODE_10_length_16019..     7162-8211      + |         3    RZQ00883
     [4] NODE_10_length_16019..    8631-10067      + |         4    RZQ00884
     [5] NODE_10_length_16019..   10752-11360      + |         5    RZQ00885
     ...                    ...           ...    ... .       ...         ...
  [4981] NODE_9_length_176042.. 166011-166829      - |      4981    RZQ01380
  [4982] NODE_9_length_176042.. 166950-167969      - |      4982    RZQ01381
  [4983] NODE_9_length_176042.. 174534-175073      - |      4983    RZQ01387
  [4984] NODE_9_length_176042.. 175076-175396      - |      4984    RZQ01388
  [4985] NODE_9_length_176042.. 175398-175982      - |      4985    RZQ01389
  -------
  seqinfo: 82 sequences from an unspecified genome; no seqlengths

There are like 8 or so different versions at Ensembl bacteria that you could choose from.