Question

Possible bug in Biostrings::readAAStringSet

0

Entering edit mode

asis.hallab • 0

@asishallab-11502

Last seen 8.6 years ago

Dear Biostrings-Experts and Developers,

thank you for supporting the Bioinformatics community with Biostrings. It is highly useful and has not left my tool-box since some years now.

In a package that I am currently writing I have encountered a possible bug though:

When using Biostrings::readAAStringSet on the latest UniprotKB/Swissprot FASTA-File many sequences are read in wrong. Only random sub-sequences remain after reading in the FASTA File.

I confirmed this manually and also with a different tool from the R package "seqinr". Its method seqinr::read.fasta reads in the AA-Sequences correctly.

Using Biostrings I get the warning:

Warning message:
In .Call2("fasta_index", filexp_list, nrec, skip, seek.first.rec, :
reading FASTA file /opt/share/blastdb/uniprotkb/FASTA/uniprot_sprot.fasta: ignored 68720968 invalid one-letter sequence codes

Possibly the bug has to do with this message?

I cannot easily reproduce the bug. I try to explain with an example:

Take the Uniprot/Swissprot gene accession Q5V0J7. The AA-Sequence read in with Biostrings is both shorter and thus not identical with the original one, found in the Fasta-File. As I said, I confirmed that manually and by comparison with the sequence read in with seqinr::read.fasta.

However, if I copy paste the part of Swissprot only containing Q5V0J7 thenBiostrings reads in the sequence correctly.

I am using Biostrings version 2.38.4 in R 3.2.0 on Debian Wheezy.

My package has the following DESCRIPTION file:

"

Package: evaluation
Type: Package
Title: Computes performance scores of AHRD and its competitors based on the F1-
Score. DO READ THE README FILE FOR INSTRUCTIONS
Version: 1.0
Date: 2016-07-01
Author: Dr. Asis Hallab
Maintainer: Dr. Asis Hallab <asis.hallab@gmail.com>
Description: Computes performance scores of AHRD and its competitors based on
the F1-Score. DO READ THE README FILE FOR INSTRUCTIONS
License:
Depends:
R (>= 3.2.0),
Rcpp (>= 0.12.3),
RMySQL (>= 0.10.8),
data.table (>= 1.9.6),
Biostrings (>= 2.38.4)
Imports: Rcpp
LinkingTo: Rcpp
RoxygenNote: 5.0.1

"

I hope this helps you.

Thank you very much and have a nice day!

Cheers!

bug biostrings readAAStringSet • 5.8k views

ADD COMMENT • link updated 8.6 years ago by Guangchuang Yu ★ 1.2k • written 8.6 years ago by asis.hallab • 0

score 0 · Answer 1 · 2016-09-19

We don't support outdated versions of Bioconductor. Please update before posting what you think are errors. I have no problems with this using the current release versions:

 > z <- readAAStringSet("/data/tmp2/blast_databases/uniprot/uniprot_sprot.fasta")
 > z
   A AAStringSet instance of length 545536
          width seq                                          names
      [1]   256 MAFSAEDVLKEYDRRRRMEAL...LYDDSFRKIYTDLGWKFTPL sp|Q6GZX4|001R_FR...
      [2]   320 MSIIGATRLQNDKSDTYSAGP...VMFFVAGAVLVAILISTVRW sp|Q6GZX3|002L_FR...
      [3]   458 MASNTVSAQGGSNRPVRDFSN...EDFEYDSDSEDDDSDSEDDC sp|Q197F8|002R_II...
      [4]   156 MYQAINPCPQSWYGSPQLERE...SKTRKNNYGTCRLEPLTYGI sp|Q197F7|003L_II...
      [5]   438 MARPLLGKTSSVRRRLESLSA...KRKAKIQEMFDNMVSRMVTS sp|Q6GZX2|003R_FR...
      ...   ... ...
 [545532]   100 MGNSKSKSKLSANQYEQQTVN...EELPTSIVVPIEPSAPPPED sp|Q6UY62|Z_SABVB...
 [545533]    79 MSSSLEITSFYSFIWTPHIGP...SLRLRVIVLFLAIFFPLLNI sp|P08105|Z_SHEEP...
 [545534]    95 MGNCNRTQKPSSSSNNLEKPP...CWKPLPTSIRVPLEASAPDL sp|Q88470|Z_TACVF...
 [545535]    95 MGLRYSKEVRDRHGDKDPEGR...EVLPKRLTFETTPTAPPYTP sp|A9JR22|Z_TAMVU...
 [545536]    95 MGLRYSKDVKDRYGDREPEGR...EVLPKKLVFENSPSAPPYEA sp|B2ZDY1|Z_WWAVU...
 
 > z[grep("Q5V0J7", names(z)),]
   A AAStringSet instance of length 1
     width seq                                               names
 [1]   173 MHVDIVPVGEVSAQVKREASDGL...SPTVRQVDVKEVSLCGSCQRNVL sp|Q5V0J7|AMZA_HA...
> as.character(z[grep("Q5V0J7", names(z)),])
                                         sp|Q5V0J7|AMZA_HALMA Archaemetzincin OS=Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8966 / VKM B-1809) GN=amzA PE=3 SV=2
 "MHVDIVPVGEVSAQVKREASDGLRSVYDCEVSMHEPQSIPAGAYDGDRDQYRAEEFIDLARRVGSGGKNIAITPKDLFYRRRNYVFGLAYLGGSGSVISTYRLQTSSDGGFSNRSAGEIFSQRVRKEVVHEVGHTLGLEHCDNKRCVMNFSPTVRQVDVKEVSLCGSCQRNVL"

Which is identical to what I get when I grep the FASTA file directly.

Edit:

> sessionInfo()
 R version 3.3.0 (2016-05-03)
 Platform: x86_64-pc-linux-gnu (64-bit)
 Running under: Debian GNU/Linux 8 (jessie)

 locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
 [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

 attached base packages:
 [1] stats4    parallel  stats     graphics  grDevices utils     datasets
 [8] methods   base

 other attached packages:
 [1] Biostrings_2.40.2   XVector_0.12.1      IRanges_2.6.1
 [4] S4Vectors_0.10.3    BiocGenerics_0.18.0

 loaded via a namespace (and not attached):
 [1] zlibbioc_1.18.0 compiler_3.3.0  tools_3.3.0

score 0 · Answer 2 · 2016-09-19

0

Entering edit mode

Guangchuang Yu ★ 1.2k

@guangchuang-yu-5419

Last seen 5 months ago

China/Guangzhou/Southern Medical Univer…

ignored 68720968 invalid one-letter sequence codes

I think you need to use readBStringSet for your fasta file contains invalid alphabet.

ADD COMMENT • link 8.6 years ago Guangchuang Yu ★ 1.2k