Error when using ORFik: findORFsFasta gives wrong start and stop codon
1
0
Entering edit mode
@levasseurmaxence-8970
Last seen 3.8 years ago
Canada

I am trying to extract ORFs from the human mitochondrial genome (circular genome) using the ORFik package. Running the following example works properly:

require(BSgenome.Hsapiens.UCSC.hg38)
require(ORFik)

# Load example fasta file
example_genome <- system.file("extdata", "genome.fasta", package = "ORFik")

# Find ORFs
orfs <- findORFsFasta(example_genome, 
                      startCodon = "ATG", 
                      stopCodon = stopDefinition(2),
                      is.circular = TRUE)

# Get ORF sequences
genome <- readDNAStringSet(example_genome, format = "fasta")
names(genome) <- "Chromosome"
BSgenome::getSeq(genome, orfs[2])

Extracted ORF sequence:

DNAStringSet object of length 1: width seq 585 ATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAAT...GGGGTCTATACCTGCGACCCGCGTCAGGTGCCCGATGCGAGG

This sequence is an appropriate ORF. It starts with an ATG, the start codon I requested, and ends with an AGG which is a termination codon for vertebrate mitochondrial code as specified by stopDefinition(2).

However, the approach fails when using a fasta file of the human mitochondrial genome. The resulting ORFs do not have the right start and/or stop codon and I can't explain the difference as the only thing that differs is the fasta file.

# Get human mitochondrial genome
mtDNA <- DNAStringSet(BSgenome.Hsapiens.UCSC.hg38$chrM)
names(mtDNA) <- "chrM"
Biostrings::writeXStringSet(mtDNA, "mitochondrial_genome.fasta")

# Find ORFs from local copy of mitochondrial genome saved as fasta file
mtDNA_orfs <- findORFsFasta("mitochondrial_genome.fasta", 
                            startCodon = "ATG", 
                            stopCodon = stopDefinition(2),
                            is.circular = TRUE)

# Load mitochondrial genome and get ORF sequences
BSgenome::getSeq(mtDNA, mtDNA_orfs[3])

Extracted ORF sequence:

DNAStringSet object of length 1: width seq 135 ATCCCCGTTCCAGTGAGTTCACCCTCTAAATCACCACGATCA...GCCACACCCCCACGGGAAACAGCAGTGATTAACCTTTAGCAA

This sequence does not meet the criteria. It does not start with an ATG codon and does not stop with a known vertebrate mitochondrial termination codon as desired.

Any help would be much appreciated! Thanks!

Max


sessionInfo( )

R version 4.0.4 (2021-02-15) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages: [1] grid parallel stats4 stats graphics grDevices utils datasets methods
[10] base

other attached packages: [1] BSgenome.Hsapiens.UCSC.hg38_1.4.3 BSgenome_1.58.0
[3] rtracklayer_1.50.0 ORFik_1.10.11
[5] GenomicAlignments_1.26.0 Rsamtools_2.6.0
[7] Biostrings_2.58.0 XVector_0.30.0
[9] SummarizedExperiment_1.20.0 Biobase_2.50.0
[11] MatrixGenerics_1.2.1 matrixStats_0.58.0
[13] OrgMassSpecR_0.5-3 seqinr_4.2-5
[15] GenomicRanges_1.42.0 GenomeInfoDb_1.26.2
[17] IRanges_2.24.1 S4Vectors_0.28.1
[19] BiocGenerics_0.36.0

loaded via a namespace (and not attached): [1] biomartr_0.9.2 bitops_1.0-6 bit64_4.0.5 RColorBrewer_1.1-2
[5] progress_1.2.2 httr_1.4.2 tools_4.0.4 utf8_1.2.1
[9] R6_2.5.0 DBI_1.1.1 colorspace_2.0-0 ade4_1.7-16
[13] gridExtra_2.3 tidyselect_1.1.0 prettyunits_1.1.1 GGally_2.1.1
[17] DESeq2_1.30.1 bit_4.0.4 curl_4.3 compiler_4.0.4
[21] xml2_1.3.2 DelayedArray_0.16.2 scales_1.1.1 genefilter_1.72.1
[25] askpass_1.1 rappdirs_0.3.3 stringr_1.4.0 R.utils_2.10.1
[29] pkgconfig_2.0.3 fst_0.9.4 dbplyr_2.1.0 fastmap_1.1.0
[33] rlang_0.4.10 rstudioapi_0.13 RSQLite_2.2.4 generics_0.1.0
[37] BiocParallel_1.24.1 dplyr_1.0.5 R.oo_1.24.0 RCurl_1.98-1.2
[41] magrittr_2.0.1 GenomeInfoDbData_1.2.4 Matrix_1.3-2 Rcpp_1.0.6
[45] munsell_0.5.0 fansi_0.4.2 lifecycle_1.0.0 R.methodsS3_1.8.1
[49] stringi_1.5.3 MASS_7.3-53 zlibbioc_1.36.0 plyr_1.8.6
[53] BiocFileCache_1.14.0 blob_1.2.1 crayon_1.4.1 lattice_0.20-41
[57] cowplot_1.1.1 splines_4.0.4 GenomicFeatures_1.42.1 annotate_1.68.0
[61] hms_1.0.0 locfit_1.5-9.4 pillar_1.5.1 geneplotter_1.68.0
[65] biomaRt_2.46.3 XML_3.99-0.5 glue_1.4.2 data.table_1.14.0
[69] vctrs_0.3.6 gtable_0.3.0 openssl_1.4.3 purrr_0.3.4
[73] reshape_0.8.8 assertthat_0.2.1 cachem_1.0.4 ggplot2_3.3.3
[77] xtable_1.8-4 survival_3.2-7 tibble_3.1.0 AnnotationDbi_1.52.0
[81] memoise_2.0.0 ellipsis_0.3.1

GRanges ORFik openreadingframe circulargenome findORFsFasta • 1.2k views
ADD COMMENT
0
Entering edit mode

We are currently solving this issue on GitHub. Will add anything here if it is relevant.

ADD REPLY
1
Entering edit mode
@hauken_heyken-13992
Last seen 2.2 years ago
Bergen

This bug was fixed, see https://github.com/Roleren/ORFik/issues/105 for more information.

ADD COMMENT

Login before adding your answer.

Traffic: 456 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6