DNAStringSet(x, start=, end=) loses quality information
1
0
Entering edit mode
@gerhard-thallinger-1552
Last seen 10 weeks ago
Austria

When trimming a DNAStringSet containing qualites using DNAStringSet(x, start=, end=), the quality information is lost:

library(Biostrings)
library(ShortRead) # For example data only

fl <- system.file(package="ShortRead", "extdata", "E-MTAB-1147","ERR127302_1_subset.fastq.gz") 
fq <- readDNAStringSet(fl, format="fastq", with.qualities=TRUE)[1:4]
print(class(elementMetadata(fq)))
# [1] "DFrame"
# attr(,"package")
# [1] "S4Vectors"
fq.trimmed <- DNAStringSet(fq, start=30, end=50)
print(class(elementMetadata(fq.trimmed)))
# [1] "NULL"

The quality information is present after reading the data with readDNAStringSet(), after trimming the elementMetadata slot is empty. It would be helpfull, if either a warning is issued that the quality information is discarded, or - even better - if the quality information is trimmed as well.

The environment is as follows:

R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ShortRead_1.52.0            GenomicAlignments_1.30.0    SummarizedExperiment_1.24.0
 [4] Biobase_2.54.0              MatrixGenerics_1.6.0        matrixStats_0.61.0         
 [7] Rsamtools_2.10.0            GenomicRanges_1.46.1        BiocParallel_1.28.2        
[10] Biostrings_2.62.0           GenomeInfoDb_1.30.0         XVector_0.34.0             
[13] IRanges_2.28.0              S4Vectors_0.32.3            BiocGenerics_0.40.0        

loaded via a namespace (and not attached):
 [1] zlibbioc_1.40.0        lattice_0.20-45        jpeg_0.1-9             hwriter_1.3.2         
 [5] tools_4.1.2            parallel_4.1.2         grid_4.1.2             png_0.1-7             
 [9] latticeExtra_0.6-29    crayon_1.4.2           Matrix_1.3-4           GenomeInfoDbData_1.2.7
[13] RColorBrewer_1.1-2     bitops_1.0-7           RCurl_1.98-1.5         DelayedArray_0.20.0   
[17] compiler_4.1.2
Biostrings • 2.0k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 2 days ago
United States

You didn't trim a DNAStringSet. You created a new one, so why would there be a warning? You can narrow the DNAStringSet, which would be 'trimming' it. But do note that by default it doesn't narrow the quality scores

> fq.trimmed <- DNAStringSet(fq, start=30, end=50)
> fq.trimmed2 <- narrow(fq, 30, 50)
> fq.trimmed2
DNAStringSet object of length 4:
    width seq                                               names               
[1]    21 GGGACATGAAGTCAATGAAGG                             ERR127302.8493430...
[2]    21 AATGGGTAGCCAGTGGCTTTT                             ERR127302.2140653...
[3]    21 TAAAACCCAGTGGCATTAAGA                             ERR127302.2217310...
[4]    21 TTTCAATGTTCAAAGCATCCC                             ERR127302.1040226...

> all.equal(fq.trimmed, fq.trimmed2, check.attributes = FALSE)
[1] TRUE

> mcols(fq.trimmed2)$qualities
BStringSet object of length 4:
    width seq
[1]    72 HHHHHHHHHHHHHHHHHHHHEBDBB?B:BBGG<D...ABFEFBDBD@DDECEE3>:?;@@@>?=BAB?##
[2]    72 IIIIHIIIGIIIIIIIHIIIIEGBGHIIIIHGII...IIIHIIIHIIIIIGIIIEGIIGBGE@DDGGGIG
[3]    72 GGHBHGBGGGHHHHDHHHHHHHHHFGHHHHHHHH...HHHHHHHHGHFHHHHHHHHHHHHHH8AGDGGG>
[4]    72 IIIIIIIIIIIIIIIIIHIIIGHIIIIIIIIIII...HIIIIIIIIIIHIIIIHIIHIGDIIHIHIIIHF

But you could do that by hand, or write a function, or put in a bug report on the GitHub, or fix it yourself and do a PR.

> z <- mcols(fq.trimmed2)
> z$qualities <- narrow(z$qualities, 30, 50)
> mcols(fq.trimmed2) <- z
> mcols(fq.trimmed2)
DataFrame with 4 rows and 1 column
                                                                    qualities
                                                                 <BStringSet>
ERR127302.8493430 HWI-EAS350_0441:1:34:16191:2123#0/1   BGG<DDAA?AABFEFBDBD@D
ERR127302.21406531 HWI-EAS350_0441:1:88:9330:2587#0/1   IHGIIHIIIIIIIHIIIHIII
ERR127302.22173106 HWI-EAS350_0441:1:91:10434:14757#0/1 HHHHHHHHHHHHHHHHHHGHF
ERR127302.10402268 HWI-EAS350_0441:1:42:13757:9559#0/1  IIIIIIIGIIHIIIIIIIIII
ADD COMMENT
0
Entering edit mode

Thank you for your answer.

You created a new one, so why would there be a warning? You can narrow the DNAStringSet, which would be 'trimming' it. But do note that by default it doesn't narrow the quality scores.

I am not sure why the creation of a new DNAStringSet object from an existing one should make any difference. IMHO, expected behaviour of DNAStringSet(x, start=, end=) would be to modify the quality strings in the same way as the sequences and not to discard them silently.

BTW: narrow() does create a new DNAStringSet object as well, but keeps the qualities (albeit in an incorrect form):

tracemem(fq)
# [1] "<000000002F115F08>"
tracemem(fq.trimmed)
# [1] "<000000002F089BA0>"
tracemem(fq.trimmed2)
# [1] "<000000002D88D3D0>"

But you could do that by hand, or write a function, or put in a bug report on the GitHub, or fix it yourself and do a PR.

As a workaround, I implemented a function to do that and additionally decided to post the issue on the Bioconductor support site so that others are aware of it.

All in all your answer leaves me with the feeling of not being welcome and somehow intimidated to post further questions/comments.

ADD REPLY
0
Entering edit mode

I am sorry if I made you feel intimidated. I was actually asking a question! You used a constructor function, which by definition makes a new thing. If you are making a new thing, should there be a warning that the new thing is different from the old thing you made it from? I mean, it's a new thing and all. That's different from using a transformation function like narrow or subseq, which are intended to transform an existing thing.

But maybe I approach things differently? When using a function, the first thing I do is read the help page to see what the developer says it should do. I sort of assume that's what most people do, and I made the possibly unwarranted assumption that you had read the help page and I was confused that you expected a constructor function (which doesn't even have an argument for the mcols slot) to do or say anything about the mcols slot of the object you started with.

Way back in the day I posted something on the old Bioconductor listserv about the ellipsis argument and made a statement about how (I thought) it works. Gordon Smyth replied with a question 'Why do you think that?', which I took to be an insult to my intelligence. But then I thought about it for a while and decided that he was actually asking me why I thought that, and if I told him my thought process he was offering to explain where I was going wrong. You should take my question in the same spirit.

As a side note, that's not really how you use tracemem. Instead it's something like this:

> tracemem(fq)
[1] "<000000000F8F31E0>"
> narrow(fq, 30, 50)
tracemem[0x000000000f8f31e0 -> 0x000000001cc45ba0]: windows windows narrow narrow 
DNAStringSet object of length 4:
    width seq                                               names               
[1]    21 GGGACATGAAGTCAATGAAGG                             ERR127302.8493430...
[2]    21 AATGGGTAGCCAGTGGCTTTT                             ERR127302.2140653...
[3]    21 TAAAACCCAGTGGCATTAAGA                             ERR127302.2217310...
[4]    21 TTTCAATGTTCAAAGCATCCC                             ERR127302.1040226...
> DNAStringSet(fq, 30, 50)
tracemem[0x000000000f8f31e0 -> 0x000000001cf9fa10]: windows windows narrow narrow XStringSet XStringSet DNAStringSet 
DNAStringSet object of length 4:
    width seq                                               names               
[1]    21 GGGACATGAAGTCAATGAAGG                             ERR127302.8493430...
[2]    21 AATGGGTAGCCAGTGGCTTTT                             ERR127302.2140653...
[3]    21 TAAAACCCAGTGGCATTAAGA                             ERR127302.2217310...
[4]    21 TTTCAATGTTCAAAGCATCCC                             ERR127302.1040226...

Where you can see that narrowing the object does make copies, but that using the constructor does all that, plus makes an entirely new thing.

ADD REPLY
1
Entering edit mode

Actually DNAStringSet(x, start=, end=) calls narrow() internally so we should fix narrow().

@gerhard-thallinger-1552: Can you please open an issue here with a minimal reproducible example and sessionInfo()? Thanks!

Best,

H.

ADD REPLY
0
Entering edit mode

Done. You can find them here and here.

Best,

Gerhard

ADD REPLY
0
Entering edit mode

Thanks. See discussion and proposed resolution in issue #61.

ADD REPLY

Login before adding your answer.

Traffic: 520 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6