You could reduce the overall cost by creating one FaFile instance and not worrying about coercing to character (DNAStringSet objects are actually very useful to work with).
library(Rsamtools)

## open the indexed FASTA once, query it, then close
fa = open(FaFile("file.fa"))
idx = scanFaIndex(fa)
mySeq = getSeq(fa, idx[seqnames(idx) == "seqid"])
close(fa)
Remember also that getSeq() is vectorized, so it is easy and efficient to query many sequences in a single call, as in the sketch below.
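For instance, a minimal sketch reusing fa and idx from above (the seqids here are hypothetical placeholders):

## one getSeq() call retrieves all matching records
wanted = c("scaffold_1", "scaffold_2", "scaffold_3")
seqs = getSeq(fa, idx[seqnames(idx) %in% wanted])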
I played around a bit using resources from AnnotationHub, for instance
> library(AnnotationHub)
> hub = AnnotationHub()
> query(hub, c("Homo_sapiens", "release-81", "dna"))
AnnotationHub with 4 records
# snapshotDate(): 2017-02-07
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: FaFile
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH49183"]]'
title
AH49183 | Homo_sapiens.GRCh38.cdna.all.fa
AH49184 | Homo_sapiens.GRCh38.dna_rm.toplevel.fa
AH49185 | Homo_sapiens.GRCh38.dna_sm.toplevel.fa
AH49186 | Homo_sapiens.GRCh38.dna.toplevel.fa
> fa = hub[["AH49186"]]
downloading from 'https://annotationhub.bioconductor.org/fetch/55651'
'https://annotationhub.bioconductor.org/fetch/55652'
retrieving 2 resources
|======================================================================| 100%
|======================================================================| 100%
> file.size(path(fa), index(fa))
[1] 1099580445 20991
> system.time({ open(fa); idx = scanFaIndex(fa); getSeq(fa, idx[seqnames(idx) == "17"]) })
user system elapsed
1.884 0.016 1.901
Which is about 83M nucleotides from somewhere in the middle of the file, or
> fa = hub[["AH49183"]]
loading from cache '/home/mtmorgan//.AnnotationHub/55645'
'/home/mtmorgan//.AnnotationHub/55646'
> length(scanFaIndex(fa))
[1] 175372
> id = which.max(width(scanFaIndex(fa))); id
[1] 116918
> width(scanFaIndex(fa))[id]
[1] 109224
> system.time({ open(fa); idx = scanFaIndex(fa); getSeq(fa, idx[id]) })
user system elapsed
0.788 0.012 0.801
which is a smaller sequence but from towards the end of a more complicated file. And
> id = sample(length(idx), 100)
> system.time({ open(fa); idx = scanFaIndex(fa); res <- getSeq(fa, idx[id]) })
user system elapsed
0.732 0.000 0.733
Which is somehow faster (measurement error, I guess), even though it queries 100 sequences totalling (for the test above) about 213,000 nucleotides. The performance seems ok to me.
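As an aside, nucleotide totals like these can be read straight from the index, without touching the FASTA itself, e.g., with the objects from the last example:

## total width of the 100 sampled records, computed from the index alone
sum(width(idx[id]))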
I think the issue with my particular dataset may be that the FASTA index is actually pretty large, with many, many scaffolds. I might have to bite the bullet and accept that scanFaIndex() is a little slower, though only by a couple of seconds. Thanks for the feedback!
To be more specific on the sizes of the FAI files: the smallest is 1.5 MB, the largest is 17 MB... darn ultrafragmented genome assemblies
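If it comes to that, maybe I can just scan each large index once per session and reuse it across queries; a minimal sketch along those lines (the file name and scaffold id are hypothetical):

library(Rsamtools)

## scan the large .fai once (the slow step), keep the GRanges around
fa = open(FaFile("ultrafragmented_assembly.fa"))
idx = scanFaIndex(fa)

## ...then run any number of queries against the cached index
one = getSeq(fa, idx[seqnames(idx) == "scaffold_00001"])
many = getSeq(fa, idx[sample(length(idx), 100)])
close(fa)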