I am trying to extract ORFs from the human mitochondrial genome (circular genome) using the ORFik package. Running the following example works properly:
require(BSgenome.Hsapiens.UCSC.hg38)
require(ORFik)
# Load example fasta file
example_genome <- system.file("extdata", "genome.fasta", package = "ORFik")
# Find ORFs
orfs <- findORFsFasta(example_genome,
startCodon = "ATG",
stopCodon = stopDefinition(2),
is.circular = TRUE)
# Get ORF sequences
genome <- readDNAStringSet(example_genome, format = "fasta")
names(genome) <- "Chromosome"
BSgenome::getSeq(genome, orfs[2])
Extracted ORF sequence:
DNAStringSet object of length 1: width seq 585 ATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAAT...GGGGTCTATACCTGCGACCCGCGTCAGGTGCCCGATGCGAGG
This sequence is an appropriate ORF. It starts with an ATG, the start codon I requested, and ends with an AGG which is a termination codon for vertebrate mitochondrial code as specified by stopDefinition(2).
However, the approach fails when using a fasta file of the human mitochondrial genome. The resulting ORFs do not have the right start and/or stop codon and I can't explain the difference as the only thing that differs is the fasta file.
# Get human mitochondrial genome
mtDNA <- DNAStringSet(BSgenome.Hsapiens.UCSC.hg38$chrM)
names(mtDNA) <- "chrM"
Biostrings::writeXStringSet(mtDNA, "mitochondrial_genome.fasta")
# Find ORFs from local copy of mitochondrial genome saved as fasta file
mtDNA_orfs <- findORFsFasta("mitochondrial_genome.fasta",
startCodon = "ATG",
stopCodon = stopDefinition(2),
is.circular = TRUE)
# Load mitochondrial genome and get ORF sequences
BSgenome::getSeq(mtDNA, mtDNA_orfs[3])
Extracted ORF sequence:
DNAStringSet object of length 1: width seq 135 ATCCCCGTTCCAGTGAGTTCACCCTCTAAATCACCACGATCA...GCCACACCCCCACGGGAAACAGCAGTGATTAACCTTTAGCAA
This sequence does not meet the criteria. It does not start with an ATG codon and does not stop with a known vertebrate mitochondrial termination codon as desired.
Any help would be much appreciated! Thanks!
Max
sessionInfo( )
R version 4.0.4 (2021-02-15) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] grid parallel stats4 stats graphics grDevices utils datasets methods
[10] base
other attached packages:
[1] BSgenome.Hsapiens.UCSC.hg38_1.4.3 BSgenome_1.58.0
[3] rtracklayer_1.50.0 ORFik_1.10.11
[5] GenomicAlignments_1.26.0 Rsamtools_2.6.0
[7] Biostrings_2.58.0 XVector_0.30.0
[9] SummarizedExperiment_1.20.0 Biobase_2.50.0
[11] MatrixGenerics_1.2.1 matrixStats_0.58.0
[13] OrgMassSpecR_0.5-3 seqinr_4.2-5
[15] GenomicRanges_1.42.0 GenomeInfoDb_1.26.2
[17] IRanges_2.24.1 S4Vectors_0.28.1
[19] BiocGenerics_0.36.0
loaded via a namespace (and not attached):
[1] biomartr_0.9.2 bitops_1.0-6 bit64_4.0.5 RColorBrewer_1.1-2
[5] progress_1.2.2 httr_1.4.2 tools_4.0.4 utf8_1.2.1
[9] R6_2.5.0 DBI_1.1.1 colorspace_2.0-0 ade4_1.7-16
[13] gridExtra_2.3 tidyselect_1.1.0 prettyunits_1.1.1 GGally_2.1.1
[17] DESeq2_1.30.1 bit_4.0.4 curl_4.3 compiler_4.0.4
[21] xml2_1.3.2 DelayedArray_0.16.2 scales_1.1.1 genefilter_1.72.1
[25] askpass_1.1 rappdirs_0.3.3 stringr_1.4.0 R.utils_2.10.1
[29] pkgconfig_2.0.3 fst_0.9.4 dbplyr_2.1.0 fastmap_1.1.0
[33] rlang_0.4.10 rstudioapi_0.13 RSQLite_2.2.4 generics_0.1.0
[37] BiocParallel_1.24.1 dplyr_1.0.5 R.oo_1.24.0 RCurl_1.98-1.2
[41] magrittr_2.0.1 GenomeInfoDbData_1.2.4 Matrix_1.3-2 Rcpp_1.0.6
[45] munsell_0.5.0 fansi_0.4.2 lifecycle_1.0.0 R.methodsS3_1.8.1
[49] stringi_1.5.3 MASS_7.3-53 zlibbioc_1.36.0 plyr_1.8.6
[53] BiocFileCache_1.14.0 blob_1.2.1 crayon_1.4.1 lattice_0.20-41
[57] cowplot_1.1.1 splines_4.0.4 GenomicFeatures_1.42.1 annotate_1.68.0
[61] hms_1.0.0 locfit_1.5-9.4 pillar_1.5.1 geneplotter_1.68.0
[65] biomaRt_2.46.3 XML_3.99-0.5 glue_1.4.2 data.table_1.14.0
[69] vctrs_0.3.6 gtable_0.3.0 openssl_1.4.3 purrr_0.3.4
[73] reshape_0.8.8 assertthat_0.2.1 cachem_1.0.4 ggplot2_3.3.3
[77] xtable_1.8-4 survival_3.2-7 tibble_3.1.0 AnnotationDbi_1.52.0
[81] memoise_2.0.0 ellipsis_0.3.1
We are currently solving this issue on GitHub. Will add anything here if it is relevant.