Problem with summarizeVariants function (VariantAnnotation) when using a multi-sample vcf
1
0
Entering edit mode
juls • 0
@juls-11275
Last seen 3.5 years ago
Austria

Hi all,

I  am using the R package VariantAnnotation which I like very much. 

I hope you can help me with an issue though -  I have a problem concerning the summarizeVariants function when using a multi-sample vcf. 

I have 6 samples in my vcf file. 

# reading vcf 
chromInfo <- seqinfo(scanVcfHeader(fl))
vcf <- readVcf(fl, chromInfo)


The samples were read in fine - see examples below: 
info(vcf) 

DataFrame with 2 rows and 18 columns
            AC            AF        AN BaseQRankSum ClippingRankSum        DP        DS        FS HaplotypeScore InbreedingCoeff         MLEAC         MLEAF        MQ       MQ0 MQRankSum        QD
 <IntegerList> <NumericList> <integer>    <numeric>       <numeric> <integer> <logical> <numeric>      <numeric>       <numeric> <IntegerList> <NumericList> <numeric> <integer> <numeric> <numeric>
1             5         0.833         6       -0.354          -0.825        40     FALSE        NA             NA              NA            NA            NA        NA         0     2.003        NA
2             4             1         4           NA              NA        26     FALSE         0             NA              NA             2             1        NA         0        NA        NA
 ReadPosRankSum                        set
      <numeric>                <character>
1          1.061 variant4-variant5-variant6
2             NA          variant5-variant6

# transcripts 
tx <- transcripts(ann2)
txlst <- splitAsList(tx, seq_len(length(tx)))

# summarize 
promoterVar.sum <- summarizeVariants(txlst, vcf, PromoterVariants())
head(assays(promoterVar.sum)$counts)
colSums(assays(promoterVar.sum)$counts)

# example output 

 Sample1 Sample2 Sample3 Sample4 Sample5 Sample6
25      51      51      51      51      51      51
26       7       7       7       7       7       7
2       44      44      44      44      44      44
3       97      97      97      97      97      97
27      97      97      97      97      97      97
28      97      97      97      97      97      97

As you can see the output is for all samples the same - which is not the case! 
This happens also when I use findOverlaps or any other mode. 

Am I doing something wrong? I very much appreciate your help!

Kind Regards,
Julia

variantannotation • 1.5k views
ADD COMMENT
0
Entering edit mode

Not sure about that issue, but it would be easier to get the transcript annotations like this:

txlst <- exonsBy(ann2)
ADD REPLY
0
Entering edit mode
@valerie-obenchain-4275
Last seen 3.0 years ago
United States

Hi Julia,

I'm not able to reproduce this problem using another VCF file with multiple samples, e.g., chr22.vcf.gz used in the ?summarizeOverlaps man page. You also mention you're having this problem with findOverlaps which might point to a problem with the data and not the functions.

Regions in 'txlst' are being overlapped with regions in rowRanges(vcf). You may want to inspect each of those more closely. Are the ranges 'vcf' the same?

sum(duplicated(rowRanges(vcf)))

How about 'txlst'? 

Valerie

ADD COMMENT
0
Entering edit mode

Hi Valerie,

Thank you so much for answer. 

The individual VCF file have been generated with GATK. I used GATK's CombineVariants for merging into one multi-sample VCF. 

table(sum(duplicated(txlst)))

    0 
28872 

rowRanges(vcf)

GRanges object with 60022 ranges and 5 metadata columns:

                                       seqnames           ranges strand   | paramRangeID            REF                ALT      QUAL      FILTER
                                          <Rle>        <IRanges>  <Rle>   |     <factor> <DNAStringSet> <DNAStringSetList> <numeric> <character>
     chr1_scaffold_2:1084_A/AT  chr1_scaffold_2     [1084, 1084]      *   |         <NA>              A                 AT    282.73           .
     chr1_scaffold_2:6286_CT/C  chr1_scaffold_2     [6286, 6287]      *   |         <NA>             CT                  C    390.73           .

sum(duplicated(rowRanges(vcf)))

[1] 46

Best, Julia 

 

 

ADD REPLY
0
Entering edit mode

Can you make 'txlst' and 'vcf' available to me off-line (valerie.obenchain@roswelpark.org) with dropbox?

Valerie

ADD REPLY
0
Entering edit mode

One more thing you can look at is the GT field. Is there variation across samples?

Valerie

ADD REPLY
0
Entering edit mode

You should have access. I uploaded a test vcf file because the complete file is so huge. Thank you!!!

ADD REPLY
1
Entering edit mode

Thanks for sending the test files.

This is the last line in the test.vcf:

chr1_scaffold_43    929893    .    GA    G    328.73    .    AC=2;AF=0.500;AN=4;DP=40;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;set=variant4-variant5    GT:AD:DP:GQ:PL    ./.    ./.    ./.    0/1:3,10:13:82:366,0,82    0/1:16,11:27:99:364,0,563    ./.

The genotype fields are GT, AD, DP, GQ and PL. It looks like all are missing for samples 1,2,3 and 6. Instead of representing this information as the usual string of colon separated missing values e.g., ./.:.,.:.:.,.,. it's represented as ./.

Because there were no fields to parse, the vcf reader in VariantAnnotation assigned a GT value of '.' to samples with no GT data. Consequently, summarizeVariants() did not recognize '.' as a missing GT value (was expecting "0|0", "0/0", "./.", or ".|."). I've modified summarizeVariants() to treat '.' as missing so it now works on your data. I've made this change in VariantAnnotation 1.19.10 (devel) and 1.18.7 (release).

> assays(promoterVar.sum)$counts[1:5,]
      
       Sample.variant Sample.variant2 Sample.variant3 Sample.variant4
  1764              1               0               0               0
  1740              0               0               0               0
  1741              0               6               0               6
  1765              0               1               0               0
  1742              4               4               6               5
      
       Sample.variant5 Sample.variant6
  1764               0               0
  1740               1               1
  1741               0               6
  1765               0               0
  1742               5               6

I was not aware that ./. was a valid representation of 'all fields missing' and I don't see it mentioned in the vcf specs (https://samtools.github.io/hts-specs/VCFv4.3.pdf). I'm not sure how to best modify the vcf parser (or if we should) to guess at the ploidy of the missing GT value. The header is no help -

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

One could look at other non-missing genotypes or the ALT field but that seems like a lot of fishing. If the 'all fields missing' ./. string is now standard output from GATK we probably need a reasonable default value or documented best guess. Open to ideas.

Valerie

ADD REPLY
0
Entering edit mode

Thank you so much!!

ADD REPLY
0
Entering edit mode

Access where? Did you send me (valerie.obenchain@roswellpark.org) a link?

ADD REPLY

Login before adding your answer.

Traffic: 688 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6