Question

VariantAnnotation _ problem in merging files, mutationspectrum per sample

0

Entering edit mode

Anna N. • 0

@anna-n-6889

Last seen 10.3 years ago

France

Hello again

I have encountered some problems and I would like your valuable input.

My analysis pipeline produces either tsv file per patient/sample or vcf file.

Since I wanted to have all my patients in one file I tried several approaches.

I have samples that are different and also they don't have the same ranges, so rbind and cbind would not be very helpfull in my situation since they are recommended either for same sample/ different variants or same variants/ different samples.

One approach was to use readMutect for tsv files; create a VRange object for each tsv file, and then merge the VRanges objects.

Ac216test=readMutect("Ac216_out_test.tsv", strip=TRUE)

Ac295test=readMutect("Ac295_out_test.tsv", strip=TRUE)

Ac375test=readMutect("Ac375_out_test.tsv", strip=TRUE)

c(Ac216test, Ac295test, Ac375test)

vr_total=c(Ac216test, Ac295test, Ac375test)

vr_motifs= mutationContext(vr_total,BSgenome.Hsapiens.UCSC.hg19, unify = TRUE)

vr_mm=motifMatrix(vr_motifs, group="sampleNames", normalize=TRUE)

svg(file="report/p_mutation_spectrum.svg")

plotMutationSpectrum(vr_motifs, "sampleNames")

dev.off()

Then proceed with the rest

Since I don’t have only tsv files but also vcf and at some point I will need to use merged vfs I tried the following in order to compare results:

I used vcf tools to merge my vcf files and produce one vcf file with all samples

My main problem is that when I start from vcf file to make the VRange object the sampleNames column is full of NAs.

When I use for example readVcfAsVRanges I end up with a vrange object with 324 ranges and O metadata columns, whereas sampleNames are all NA, so I cannot plot anything per sample.

fl="out2.vcf"

vcf=readVcf(fl, "hg19")

vr_test=readVcfAsVRanges(fl, "hg19")

Warning messages:

1: In .bcfHeaderAsSimpleList(header) :

  duplicate keys in header will be forced to unique rownames

2: In .vcf_usertag(map, tag, ...) :

  ScanVcfParam ‘geno’ fields not present: ‘AD’

vr_test

VRanges object with 324 ranges and 0 metadata columns:

If I use the as(vcf, “VRanges”) I end up with 1944 ranges and 101 metadata columns, but the sampleNames are retained.

fl="out2.vcf"

v Warning message:

In .bcfHeaderAsSimpleList(header) :

  duplicate keys in header will be forced to unique rownames

cf=readVcf(fl, "hg19")

vr=as(vcf,"VRanges")

vr

VRanges object with 1944 ranges and 101 metadata columns:

When I plot the mutationspectrum per sample I end up with identical spectra for all samples (exactly the same) and very different from what I end up with the tsv approach.

I feel a little lost. Could you point out what I am doing wrong and maybe the best approach in order to be able to plot per sample.

Thank you in advance
Anna

variantannotation • 2.4k views

ADD COMMENT • link updated 10.4 years ago by Julian Gehring ★ 1.3k • written 10.4 years ago by Anna N. • 0

0

Entering edit mode

Let me try to separate and summarize the problems that you describe:

If you import a VCF file, you can maintain the sample names with as(readVcf(fl), "VCF"). So this part looks fine.

What exactly do you mean by "When I plot the mutationspectrum per sample I end up with identical spectra for all samples (exactly the same) and very different from what I end up with the tsv approach."? Is the "tsv approach" not per sample?

Can you inspect the imported variant calls and check if they are identical? This would most likely explain why the spectra are identical.

In your statement "I have samples that are different and also they don't have the same ranges, so rbind and cbind would not be very helpfull in my situation since they are recommended either for same sample/ different variants or same variants/ different samples.", what do you mean by that different samples have different ranges?

ADD REPLY • link 10.4 years ago Julian Gehring ★ 1.3k

0

Entering edit mode

Could you please edit your post and add the markup for the code? That makes it much easier to read. Thanks.

ADD REPLY • link 10.4 years ago Julian Gehring ★ 1.3k

0

Entering edit mode

Hi Julian

I edited the post, hope it is fine now.

I will try to explain a little better my problem.

The tsv files are Mutect's output, so I can use the readMutect command.

When I merge, the sample names are retained and I can plot the spectrum per sample _ meaning that I end up with a spectrum that has three distinctive spectra _one for each sample. So far so good.

I don't always end up with tsv files, I have also vcf files. For the vcf files I first merge the vcfs with vcftools (vcf merge) and then i try to import my file.

If I use the recommended approach readVcfAsVRanges, my sampleNames are full of NAs and I cannot plot my spectrum per sample (if i change the NA=sample, then I can have a spectrum per total, as everything is seen as "sample"). In this case, comparison is not easy, as I am unable to distinguish between samples, everything is considered as one sample.

If I use the vr=as(vcf,"VRanges") command and also use the param to specify the samples I want, my sample names are retained and I am able to plot the spectrum per sample. But in this case when I compare with the tsv approach, I see that I don't get the same results, and as I mentioned above the spectra are identical while the imported variant calls are not (not very different, but still not identical).

For your last question, I paste an example for chr1 for two different samples :As you can see I have two different samples that do not have identical variant calls.

One question that I still have, is how can one predefine the number of signatures he expects? And what if I expect only one signature?

Thank you in advance
Anna

#CHROM	POS
chr1	1309199
chr1	1309213
chr1	9770606
chr1	21036222
chr1	40535851
chr1	42048764
chr1	150975012
chr1	155294225
chr1	156816374
chr1	175334300
chr1	180905240
chr1	182791267
chr1	185119609
chr1	197297719
chr1	200619583
chr1	206331168
chr1	226180206
chr1	241886740

#CHROM	POS
chr1	1309199
chr1	1309213
chr1	42048764
chr1	64089396
chr1	150975012
chr1	156816374
chr1	162558481
chr1	175306721
chr1	182791267
chr1	200619583
chr1	206331168
chr1	235602158

ADD REPLY • link 10.4 years ago Anna N. • 0

score 0 · Answer 1 · 2014-12-10

I used a different set of samples and I encountered the same problems- when I use the tsv output, after readMutect and merging, my matrix from the motifMatrix function looks like that:

sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8
CA A.A 0.000000000 0.003007519 0.00000000 0.01775148 0.01219512 0.01754386 0.02061856 0.01587302
CA A.C 0.015151515 0.000000000 0.00000000 0.01183432 0.01219512 0.00000000 0.03092784 0.01587302
CA A.G 0.000000000 0.000000000 0.00000000 0.00000000 0.00000000 0.00000000 0.01030928 0.00000000
CA A.T 0.000000000 0.000000000 0.00000000 0.00000000 0.00000000 0.00000000 0.01030928 0.00000000
CA C.A 0.000000000 0.001503759 0.00000000 0.00000000 0.01219512 0.03508772 0.00000000 0.01587302
CA C.C 0.000000000 0.000000000 0.00000000 0.01775148 0.03658537 0.01754386 0.03092784 0.00000000
CA C.G 0.007575758 0.000000000 0.00000000 0.00591716 0.01219512 0.00000000 0.00000000 0.00000000
CA C.T 0.000000000 0.000000000 0.00000000 0.00000000 0.00000000 0.00877193 0.00000000 0.01587302
CA G.A 0.015151515 0.001503759 0.02380952 0.00000000 0.03658537 0.01754386 0.02061856 0.00000000
CA G.C 0.000000000 0.000000000 0.02380952 0.00591716 0.02439024 0.05263158 0.01030928 0.01587302

while when I use the merged vcf (for the same samples/analysis) and the command

vr=as(vcf,"VRanges")

my matrix looks like that:

sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8
CA A.A 0.0080775444 0.0080775444 0.0080775444 0.0080775444 0.0080775444 0.0080775444 0.0080775444 0.0080775444
CA A.C 0.0072697900 0.0072697900 0.0072697900 0.0072697900 0.0072697900 0.0072697900 0.0072697900 0.0072697900
CA A.G 0.0008077544 0.0008077544 0.0008077544 0.0008077544 0.0008077544 0.0008077544 0.0008077544 0.0008077544
CA A.T 0.0008077544 0.0008077544 0.0008077544 0.0008077544 0.0008077544 0.0008077544 0.0008077544 0.0008077544
CA C.A 0.0056542811 0.0056542811 0.0056542811 0.0056542811 0.0056542811 0.0056542811 0.0056542811 0.0056542811
CA C.C 0.0088852989 0.0088852989 0.0088852989 0.0088852989 0.0088852989 0.0088852989 0.0088852989 0.0088852989
CA C.G 0.0024232633 0.0024232633 0.0024232633 0.0024232633 0.0024232633 0.0024232633 0.0024232633 0.0024232633
CA C.T 0.0016155089 0.0016155089 0.0016155089 0.0016155089 0.0016155089 0.0016155089 0.0016155089 0.0016155089
CA G.A 0.0088852989 0.0088852989 0.0088852989 0.0088852989 0.0088852989 0.0088852989 0.0088852989 0.0088852989
CA G.C 0.0080775444 0.0080775444 0.0080775444 0.0080775444 0.0080775444 0.0080775444 0.0080775444 0.0080775444
CA G.G 0.0048465267 0.0048465267 0.0048465267 0.0048465267 0.0048465267 0.0048465267 0.0048465267 0.0048465267
CA G.T 0.0016155089 0.0016155089 0.0016155089 0.0016155089 0.0016155089 0.0016155089 0.0016155089 0.0016155089

As you can see all rows have identical values, and this why I end up with identical spectra...Apparently something is wrong with my merged vcf...Any ideas as what I should be looking for?

Thank you
Anna

score 0 · Answer 2 · 2014-12-11

As you suspected, there is mostly likely a problem in merging your VCF files manually. Since we don't have access to the exact code on the command line and R as well as the data itself, there is little we can do from the outside. I would suggest to examine the imported 'VRanges' object with each other, and also to compare equivalent entries of the mutect tsv and your merged vcf files.