Question

How to prepare .vcf for seqCAT

0

Entering edit mode

abe • 0

@abe-18006

Last seen 6.0 years ago

I ran gatk followed by snpEff and snpSift. Then I combined the individual .vcf's with CombineVariants and gzipped the .vcf. I read the .vcf.gz into R with vcf <- readVcf("combined.calls.filtered.vcf.gz","hg38"). When I run scanVcfHeader(vcf) or create_profile(vcf, "sampleID", "sampleID.profile.txt"), I get the following error:

Reading VCF file ...
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘scanVcfHeader’ for signature ‘"CollapsedVCF"’

I'm having trouble finding a solution.

seqCAT variantannotation gatk • 2.0k views

ADD COMMENT • link updated 6.4 years ago by erikfas ▴ 30 • written 6.4 years ago by abe • 0

0

Entering edit mode

I'm in a similar situation, but the suggestion below brings no joy. I've got germline VCFs from haplotypecaller annotated with snpEff and VEP, but they give the error:

> vcf <-("../VCFs/haplotypecaller/haplotypecaller_T0_CTL_CTL_rep1_snpEff_VEP.ann.vcf.gz")
> create_profile(vcf, 'T0_CTL_CTL_rep1', 'T0_CTL_CTL_rep1.txt')

Reading VCF file ...
Creating SNV profile ...
Error in `[<-.data.frame`(`*tmp*`, !grepl("^rs[0-9]+", data$rsID), "rsID",  : 
  replacement has 1 row, data has 0
In addition: Warning messages:
1: In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames
2: In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames
3: In .bcfHeaderAsSimpleList(header) :
  duplicate keys in header will be forced to unique rownames

The duplicate keys might be referring to two "source=" lines:

## source=GenotypeGVCFs
## source=HaplotypeCaller

I've tried the same with tumor-normal VCFs generated with mutect2, but get an error similar to abe's below:

> vcfm<-('../VCFs/mutect/mutect2_T2_CTL_CTL_rep1_vs_T0_CTL_CTL_rep1_snpEff_VEP.ann.vcf.gz')
> create_profile(vcfm, 'T2_CTL_CTL_rep1', 'T2_CTL_CTL_rep1.txt')
Reading VCF file ...
Creating SNV profile ...
Error in `[<-.data.frame`(`*tmp*`, !grepl("^rs[0-9]+", data$rsID), "rsID",  : 
  replacement has 1 row, data has 0

I've checked the sample names correspond using VariantAnnotation::scanVcfHeader, but no luck. Any other suggestions?

ADD REPLY • link 6.4 years ago omeally • 0

0

Entering edit mode

Hi, omeally! As this seems to be a different issue, please create a new question and tag it with `seqCAT` and I'll see what I can do to help you troubleshoot it. In that question, could you also include a couple of lines from your VCF?

ADD REPLY • link 6.4 years ago erikfas ▴ 30

score 0 · Answer 1 · 2018-11-30

0

Entering edit mode

erikfas ▴ 30

@erikfas-13696

Last seen 6.0 years ago

Hi, abe!

It looks like you are trying to create a SNV profile from a VCF object that you have already read into the R memory. As described in section 2.1 in the seqCAT vignette the create_profile function works directly on files, and you do not need to first read the VCF with the readVCF function. All you should need to do is something like this:

create_profile("combined.calls.filtered.vcf.gz", "sampleID", "sampleID.profile.txt")

Does that work for you?

ADD COMMENT • link 6.4 years ago erikfas ▴ 30

0

Entering edit mode

Thanks for your reply. I’m one step in the right direction, but I still haven’t got something right within the .vcf.

I ran into errors when running all samples, so I simplified by looking at one sample.

create_profile("SRR1233685.vcf", "SRR1233685", "SRR1233685.profile.txt")

Reading VCF file ...

Creating SNV profile ...

Error in `[<-.data.frame`(`*tmp*`, !grepl("^rs[0-9]+", data$rsID), "rsID",  :

  replacement has 1 row, data has 0

I tried stepping through the function and noticed that Granges is empty.

gr <- SummarizedExperiment::rowRanges(vcf)

GRanges object with 0 ranges and 4 metadata columns:

   seqnames    ranges strand |            REF                ALT      QUAL      FILTER

      <Rle> <IRanges>  <Rle> | <DNAStringSet> <DNAStringSetList> <numeric> <character>

  -------

  seqinfo: no sequences

I checked the seqCAT example data and the above function does output multiple ranges. So, I know the problem is with my .vcf. I’m just not sure why GRanges is empty.

For reference:

vcf_header <- VariantAnnotation::scanVcfHeader(vcf_file)

vcf_header

class: VCFHeader 

samples(2): SRR1233685.variant SRR1233685.variant2

meta(6): SnpSiftCmd SnpSiftVersion ... source contig

fixed(1): FILTER

info(80): AC AF ... SSR PM

geno(5): GT AD DP GQ PL


vcf <- VariantAnnotation::readVcf(vcf_file, param = svp)

vcf

class: CollapsedVCF 

dim: 0 2 

rowRanges(vcf):

  GRanges with 4 metadata columns: REF, ALT, QUAL, FILTER

info(vcf):

  DataFrame with 80 columns: AC, AF, AN, BaseQRankSum, ClippingRankSum, DP, DS, ExcessHet, FS, HaplotypeScore, InbreedingCoeff, MLEAC, MLEAF, MQ...

info(header(vcf)):

                   Number Type    Description                                                                                                      

   --omitted, too many characters

ADD REPLY • link 6.4 years ago abe • 0

0

Entering edit mode

Ah, I see the problem. It's not anything wrong with your VCF per se (at least so far), but you are trying to create profiles for a sample (SRR1233685) that does not exist. Looking at the output of your scanVcfHeader it looks like you have two samples named SRR1233685.variant and SRR1233685.variant2, respectively. Changing the sample name should do the trick!

ADD REPLY • link 6.4 years ago erikfas ▴ 30