Question

VariantAnnotation produces duplicate names

0

Entering edit mode

Daniel Cameron ▴ 40

@daniel-cameron-10086

Last seen 2.6 years ago

Australia

VCF requires unique row names, but the formula used to generate placeholder row names can produce duplicates (Such as for pindel with default parameters run on the NA12878 Illumina platinum genomic WGS data). Is this intentional, or just an oversight due to the lack of structural variant data out there?

Reproduction test case:

library(VariantAnnotation)

library(testthat)

#' loads a VCF containing the given records #' @param record string vector of record to write .testrecord <- function(record) { filename=tempfile(fileext=".vcf") write(paste0(c( "##fileformat=VCFv4.2", "##ALT=<ID=DEL,Description=\"Deletion\">", "##INFO=<ID=SVLEN,Number=.,Type=Integer,Description=\"Difference in length between REF and ALT alleles\">", "##contig=<ID=chr1,length=249250621>", "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT", record), collapse="\n"), file=filename) vcf <- readVcf(.testfile(filename), "") # readVcf holds a file handle open so we won't be able to delete the # file even if we tried #file.remove(filename) return(vcf) }

# the forum will likely mangle the TABs in the in-line VCF lines test_that("Placeholder names are unique", { expect_false(any(duplicated(row.names(.testrecord(c(

"chr1 100 . A <DEL> . . SVLEN=-1", "chr1 100 . A <DEL> . . SVLEN=-2")))))) })

sessionInfo()

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 LC_MONETARY=English_Australia.1252 LC_NUMERIC=C LC_TIME=English_Australia.1252

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] StructuralVariantAnnotation_0.1 VariantAnnotation_1.16.4 Rsamtools_1.22.0 Biostrings_2.38.4 XVector_0.10.0 SummarizedExperiment_1.0.2
[7] Biobase_2.30.0 GenomicRanges_1.22.4 GenomeInfoDb_1.6.3 IRanges_2.4.8 S4Vectors_0.8.11 BiocGenerics_0.16.1
[13] assertthat_0.1 roxygen2_5.0.1 testthat_0.11.0 devtools_1.10.0

loaded via a namespace (and not attached):
[1] Rcpp_0.12.4 futile.logger_1.4.1 GenomicFeatures_1.22.13 bitops_1.0-6 futile.options_1.0.0 tools_3.2.2
[7] zlibbioc_1.16.0 biomaRt_2.26.1 digest_0.6.9 memoise_1.0.0 RSQLite_1.0.0 BSgenome_1.38.0
[13] DBI_0.3.1 rtracklayer_1.30.4 withr_1.0.1 stringr_1.0.0 BSgenome.Hsapiens.UCSC.hg19_1.4.0 AnnotationDbi_1.32.3
[19] XML_3.98-1.4 BiocParallel_1.4.3 lambda.r_1.1.7 magrittr_1.5 GenomicAlignments_1.6.3 stringi_1.0-1
[25] RCurl_1.95-4.8 crayon_1.3.1

variantannotation • 1.8k views

ADD COMMENT • link updated 8.6 years ago by Valerie Obenchain ★ 6.8k • written 8.6 years ago by Daniel Cameron ▴ 40

score 0 · Answer 1 · 2016-04-12

Hi,

I don't have .testfile() so I've modified this line in .testrecord():

vcf <- readVcf(.testfile(filename), "")

to this:

vcf <- readVcf(filename, "")

With that change we create the test vcf which emits a warning:

> testvcf1 <- 
+ .testrecord(c("chr1    100    .    A    <DEL>    .    .    SVLEN=-1",
+               "chr1    100    .    A    <DEL>    .    .    SVLEN=-2"))
Warning message:
In .Seqinfo.mergexy(x, y) :
  The 2 combined objects have no sequence levels in common. (Use
  suppressWarnings() to suppress this warning.)

The VCF rowRanges() data is not parsed properly. Everything from 'chr1' to 'SVLEN' has been parsed into the seqnames:

> rowRanges(testvcf1)
GRanges object with 2 ranges and 5 metadata columns:
                                               seqnames    ranges strand |
                                                  <Rle> <IRanges>  <Rle> |
   chr1    100    .    A    <DEL>    .    .    SVLEN=-1   [0, -1]      * |
   chr1    100    .    A    <DEL>    .    .    SVLEN=-2   [0, -1]      * |
   paramRangeID            REF                ALT      QUAL      FILTER
       <factor> <DNAStringSet> <DNAStringSetList> <numeric> <character>
           <NA>                                           0           
           <NA>                                           0           
  -------
  seqinfo: 3 sequences from  genome

fixed() is also not parsed properly:

> fixed(testvcf1)
DataFrame with 2 rows and 4 columns
             REF                ALT      QUAL      FILTER
  <DNAStringSet> <DNAStringSetList> <numeric> <character>
1                                           0           
2                                           0

To avoid format issues I'd recommend writeVcf() to make a test file but if you must use .testrecord() you need to add the tabs in the character string.

Create a new test vcf (no warnings this time):

> testvcf2 <-
+ .testrecord(c("chr1\t100\t.\tA\tG\t.\t.\tSVLEN=-1",
+               "chr1\t100\t.\tA\tG\t.\t.\tSVLEN=-2"))

Now we have a properly parsed VCF object with rownames and fixed(), info() etc.:

> rowRanges(testvcf2)
GRanges object with 2 ranges and 5 metadata columns:
               seqnames     ranges strand | paramRangeID            REF
                  <Rle>  <IRanges>  <Rle> |     <factor> <DNAStringSet>
  chr1:100_A/G     chr1 [100, 100]      * |         <NA>              A
  chr1:100_A/G     chr1 [100, 100]      * |         <NA>              A
                              ALT      QUAL      FILTER
               <DNAStringSetList> <numeric> <character>
  chr1:100_A/G                  G      <NA>           .
  chr1:100_A/G                  G      <NA>           .
  -------
  seqinfo: 1 sequence from  genome
> fixed(testvcf2)
DataFrame with 2 rows and 4 columns
             REF                ALT      QUAL      FILTER
  <DNAStringSet> <DNAStringSetList> <numeric> <character>
1              A                  G        NA           .
2              A                  G        NA           .

Back to your question about duplicate rownames on the object. It's not illegal to have duplicate rownames or ranges in a GRanges object:

> GRanges("chr1", IRanges(rep(1, 3), width=1, names=rep("FOO", 3)))
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  FOO     chr1    [1, 1]      *
  FOO     chr1    [1, 1]      *
  FOO     chr1    [1, 1]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Before GRanges supported duplicate rownames that restriction was carried over to VCF. I think it's been several releases now that duplicates have been allowed. Maybe you're remembering this restriction from an older version? If you're getting an error that says VCF data must have unique rownames please let me know and show a small example.

Thanks.

Valerie