I am writing a unitary test with testthat for a function handling VCF files. However, I encounter a problem with ref(vcf)
, which comes from DNAStringSet@ranges@start
.
Edit: more specifically, expect_equal(as.character(ref(observed)), as.character(ref(expected)))
passes, but expect_equal(ref(observed), ref(expected))
fails, with the following message
Attributes: < Component "ranges": Attributes: < Component "start": Mean relative difference: 3 > >
To understand how this structure works, here is the example given at the bottom of the CollapsedVCF
help page.
vcf <- VCF(rowRanges = GRanges("chr1", IRanges(1:4*3, width=c(1, 2, 1, 1)))) ref(vcf) <- DNAStringSet(c("G", c("AA"), "T", "G"))
When I do str(ref(vcf))
, here is what I get:
Formal class 'DNAStringSet' [package "Biostrings"] with 5 slots ..@ pool :Formal class 'SharedRaw_Pool' [package "XVector"] with 2 slots .. .. ..@ xp_list :List of 1 .. .. .. ..$ :<externalptr> .. .. ..@ .link_to_cached_object_list:List of 1 .. .. .. ..$ :<environment: 0x2f8b6d0> ..@ ranges :Formal class 'GroupedIRanges' [package "XVector"] with 7 slots .. .. ..@ group : int [1:4] 1 1 1 1 .. .. ..@ start : int [1:4] 1 2 4 5 .. .. ..@ width : int [1:4] 1 2 1 1 .. .. ..@ NAMES : NULL .. .. ..@ elementType : chr "integer" .. .. ..@ elementMetadata: NULL .. .. ..@ metadata : list() ..@ elementType : chr "DNAString" ..@ elementMetadata: NULL ..@ metadata : list()
Can someone explain to me what does the vector 1 2 4 5
from @ranges@start
correspond to, as it doesn't correspond to the start coordinates of the variants?
Edit: thanks to the comments by Martin Morgan below, I may have found the problem. When I filter some records like this, vcf[-c(1)]
, the @ranges@start
vector becomes 2 4 5
whereas I would have expected it to become 1 3 4
. That is, the first element was correctly removed, but the DNAString
wasn't updated. This doesn't seem to be the case when using VariantAnnotation::filterVcf()
, hence creating the difference between the two objects in my test, observed
and expected
. Is there a way to avoid this?
Generally, the internal structure of an object (anything accessed by
@
) should be considered 'none of your business'; in your unit test and code, use the accessors to extract relevant information.I understand. But when I test two VCF objects for equality with
expect_equal
fromtestthat
, it fails because of a difference in@ranges@start
. I thought that a few hints about what@ranges@start
corresponds to may be enough to help me. Or do you prefer me to show the whole code of the tested function as well as the test?It isn't really helpful to show the full code, but rather to figure out how to illustrate the problem with a minimal example (e.g., by starting with the full code and simplifying it); often in the process of creating this simple reproducible example the problem becomes apparent.
I'd explore the object using it's API, e.g., comparing
ref(vcf)
and thenas.character(ref(vcf))
.The underlying model of a DNAStringSet is of a single "DNAString" that has offsets corresponding to each element in the DNAStringSet. So in the example the DNAString is "GAATG" and the interpretation is that this DNAString represents a DNAStringSet with elements starting at position 1, 2, 4, 5. This is actually a fundamental scheme underlying the Compressed*List class (e.g., IntegerList(), GRangesList()), where there is a single underlying object that obeys a 'vector-like' interface, and then a partitioning of that vector into elements; I believe the implementation differs between DNAStringSet and the *List, but the fact that I don't know is the beauty of using the API rather than relying on the underlying structure.
Thanks for your explanations. Yours as well as the one from Hervé helped me a lot.