Hi,
I'm running into an error when trying to extract sequences from fasta files using Rsamtools.
I have a fasta file and a GRanges object, and I'd like to extract the sequences corresponding to the ranges from the fasta file. When using the fasta file for the whole genome, scanFa throws an error. Sequences up to chr 5: 32500021 can be retrieved, but trying to retrieve any sequence after chr 5: 32500021 results in an error.
However, when the fasta file for only chromosome 5 is used, the sequences corresponding to both ranges can be retrieved.
fasta files can be downloaded here: ftp://ftp.ensembl.org/pub/release-91/fasta/homo_sapiens/dna/
> library(Rsamtools)
> library(GenomicRanges)
## Create reference to fasta files - one for whole genome, another containing just chr 5
> allSeq <- Rsamtools::FaFile('./Homo_sapiens.GRCh38.dna.primary_assembly.fa')
> chr5Seq <- Rsamtools::FaFile('./Homo_sapiens.GRCh38.dna.chromosome.5.fa')
## Create GRanges object
> gr <- GRanges(seqnames = '5',
+ ranges = IRanges(start = c(32500021, 32500022), width = 1))
## When using the whole fasta file, record 1 works but record 2 fails
> Rsamtools::scanFa(allSeq, gr)
Error in value[[3L]](cond) : record 2 (5:32500022-32500022) failed
file: ./Homo_sapiens.GRCh38.dna.primary_assembly.fa
## When using the chr 5 fasta file, both records can be retrieved
> Rsamtools::scanFa(chr5Seq, gr)
DNAStringSet object of length 2:
width seq names
[1] 1 T 5
[2] 1 A 5
Hope this example helps with finding the issue and thanks in advance! Session Info:
> sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
locale:
[1] LC_COLLATE=English_Singapore.1252 LC_CTYPE=English_Singapore.1252 LC_MONETARY=English_Singapore.1252
[4] LC_NUMERIC=C LC_TIME=English_Singapore.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] Rsamtools_2.4.0 Biostrings_2.56.0 XVector_0.28.0 GenomicRanges_1.40.0 GenomeInfoDb_1.24.0
[6] IRanges_2.22.2 S4Vectors_0.26.1 BiocGenerics_0.34.0
loaded via a namespace (and not attached):
[1] crayon_1.3.4 bitops_1.0-6 zlibbioc_1.34.0 rstudioapi_0.11 BiocParallel_1.22.0
[6] tools_4.0.0 RCurl_1.98-1.2 compiler_4.0.0 GenomeInfoDbData_1.2.3
Hello,
Strangely, I ran your exact example (after pulling the genome files you linked) with no issues. That is, when scanning in the full genome I get the expected same output as when scanning the 5th chromosome:
We are using the same package versions for
GenomicRanges
andRsamtools
and a similar (but not precisely the same) version of R, though the OS is different. I'd recommend experimenting with different versions of R, and running:to check for possible installation issues (related to dependencies of
GenomicRanges
andRsamtools
, for example). I've attached my full session info at the bottom of this comment.Best,
-Nick
P.S. I'm learning to help others.