Private Germline Count is mostly 0
twtoal ▴ 10
@twtoal-15473
Last seen 4 months ago
United States

Most of my samples show a private.germline.all count of 0. However, the PureCN log files show many germline mutations that are not filtered out. Why is the private germline count 0? I manually checked one sample's germline mutations to see whether all of them also occur in another sample, and they do not, although I did not apply all of the filtering that PureCN does. PureCN's filtering, however, removed only a small number of mutations relative to the overall total.

PureCN private_germline mutation_burden • 1.5k views
@markusriester-9875
Last seen 2.4 years ago
United States

This is a feature for tumor-only analyses. It is the count of mutations not in dbSNP (or whatever database you use for annotation), but predicted to be germline. If you have matched normals, dbSNP status is ignored and the SOMATIC info flag is used instead. 

I use this mainly as a QC check in tumor-only analyses. Private germline rates should be roughly constant across individuals, so if you have outliers here, the germline vs. somatic classification is likely poor. This most often happens in hypermutated samples of very high purity, where lots of somatic variants are wrongly classified as germline.

But good question, that should be made clear in the documentation.

 

That doesn't explain it, then. I have matched normals and the SOMATIC flag, and there are MANY mutations not marked SOMATIC. PureCN sees them, according to its log of variant counts seen and filtered. Below are the private.germline.all counts for all tumors. For the first tumor ID in that list, here are the numbers of lines in its VCF file that are flagged SOMATIC, not flagged SOMATIC, flagged DB but not SOMATIC, and flagged neither SOMATIC nor DB. The last count suggests to me that this sample should have around 47 private germline variants:


> bcftools view -H 3-CG-214T1.PureCN.vcf.gz | wc -l
1580
> bcftools view -H 3-CG-214T1.PureCN.vcf.gz | grep SOMATIC | wc -l
9
> bcftools view -H 3-CG-214T1.PureCN.vcf.gz | grep -v SOMATIC | wc -l
1571
> bcftools view -H 3-CG-214T1.PureCN.vcf.gz | grep -v SOMATIC | grep DB | wc -l
1524
> bcftools view -H 3-CG-214T1.PureCN.vcf.gz | grep -v SOMATIC | grep -v DB | wc -l
47
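
A plain `grep SOMATIC` or `grep DB` matches those strings anywhere on the line, so the counts above could in principle be inflated by other fields that happen to contain them. As a minimal sketch, assuming the INFO field is column 8 and that SOMATIC and DB appear as INFO keys as they do in this thread, the same tally can be done on exact keys. The function name `count_by_flag` and the toy records are made up for illustration:

```shell
# Classify VCF records by exact INFO keys rather than substring matches.
# Reads VCF body lines (no header) on stdin, e.g. from `bcftools view -H`.
# Prints counts of: SOMATIC records, DB-only records, records with neither.
count_by_flag() {
  awk -F'\t' '
    {
      somatic = 0; db = 0
      n = split($8, fields, ";")      # column 8 is INFO
      for (i = 1; i <= n; i++) {
        key = fields[i]
        sub(/=.*/, "", key)           # drop "=value" if present
        if (key == "SOMATIC") somatic = 1
        if (key == "DB") db = 1
      }
      if (somatic)   s++              # SOMATIC wins, with or without DB
      else if (db)   d++
      else           u++
    }
    END { printf "somatic=%d db=%d neither=%d total=%d\n", s, d, u, NR }
  '
}

# Toy records (made up): one SOMATIC, two DB, one with neither flag.
printf 'chr1\t100\t.\tA\tG\t.\tPASS\tSOMATIC;DP=50\nchr1\t200\trs1\tC\tT\t.\tPASS\tDB;DP=40\nchr1\t300\t.\tG\tA\t.\tPASS\tDP=30\nchr2\t500\trs2\tG\tC\t.\tPASS\tDP=35;DB\n' | count_by_flag
```

On a real file this would be used as `bcftools view -H 3-CG-214T1.PureCN.vcf.gz | count_by_flag`.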

 

 

> df[, c("tumorID", "private.germline.all")]
                 tumorID private.germline.all
          3-CG-214T1   0
          3-CG-214T2   0
          3-CG-217T1   0
          3-CG-217T2   0
          3-CG-217T3   0
          3-CG-217T4   1
          3-CG-218T1   0
          3-CG-218T2   0
          3-CG-220T1   0
          3-CG-220T2   0
          3-CG-220T3   0
          3-CG-220T4   0
          3-CG-222T1   0
          3-CG-222T2   0
          3-CG-222T3   0
          3-CG-222T4   0
          3-CG-223T1   0
          3-CG-223T2   0
          3-CG-223T3   0
          3-CG-223T4   0
          3-CG-226T1   2
          3-CG-226T2   0
          3-CG-226T3   0
          3-CG-226T4   0
          3-CG-231T2   0
          3-CG-231T3   0
          3-CG-231T4   0
          3-CG-232T1   0
          3-CG-232T2   0
          3-CG-232T3   0
          3-CG-232T4   0
    CG-HFLLA-JG-74T1   2
    CG-HFLLA-JG-74T2   3
    CG-HFLLA-JG-74T3   0
    CG-HFLLA-JG-74T4   4
   CG-HFLLA-JJR-72T1   0
   CG-HFLLA-JJR-72T2   0
   CG-HFLLA-JJR-72T3   0
   CG-HFLLA-JJR-72T4   0
   CG-HFLLA-MLR-76T1   0
   CG-HFLLA-MLR-76T2   0
   CG-HFLLA-MLR-76T3   0
   CG-HFLLA-MLR-76T4   0
    CG-HFLLA-YG-71T1   0
    CG-HFLLA-YG-71T2   0
    CG-HFLLA-YG-71T3   0
    CG-HFLLA-YG-71T4   0
             CH-59T1   1
             CH-59T2   0
             CH-59T3   0
             CH-59T4   1
   CT-IBG-2-CG-187T1   0
   CT-IBG-2-CG-187T2   1
   CT-IBG-2-CG-187T3   0
   CT-IBG-2-CG-187T4   0
   CT-IBG-2-CG-187T5   1
   CT-IBG-2-CG-191T1   0
   CT-IBG-2-CG-191T2   0
   CT-IBG-2-CG-191T3   0
   CT-IBG-2-CG-191T4   0
   CT-IBG-2-CG-191T5   0
             DC-52T1   0
             DC-52T2   3
             DC-52T3   0
             DC-52T4   1
    GC-HFLLA-AV-68T1   0
    GC-HFLLA-AV-68T2   2
    GC-HFLLA-AV-68T3   0
    GC-HFLLA-AV-68T4   0
    GC-HFLLA-LR-65T1   0
    GC-HFLLA-LR-65T2   0
    GC-HFLLA-LR-65T3   0
    GC-HFLLA-LR-65T4   0
             HC-55T1   0
             HC-55T2   0
             HC-55T3   0
             HC-55T4   0
HFLLA-IBG-2-CG-189T1   0
HFLLA-IBG-2-CG-189T2   1
HFLLA-IBG-2-CG-189T3   0
HFLLA-IBG-2-CG-189T4   0
      IBG-2-CG-154T1   0
      IBG-2-CG-154T2   0
      IBG-2-CG-154T3   0
      IBG-2-CG-154T4   0
      IBG-2-CG-178T1   0
      IBG-2-CG-178T2   0
      IBG-2-CG-178T3   0
      IBG-2-CG-178T4   0
            JES-61T1   0
            JES-61T2   0
            JES-61T3   0
            NCL-54T1   0
            NCL-54T2   0
            NCL-54T3   0
            NCL-54T4   0
            NCL-54T5   0
             PC-60T1   0
             PC-60T2   2
             PC-60T3   0
             PC-60T4   0
            ST-180T1   0
            ST-180T2   1
            ST-186T1   5
            ST-186T2   1
            ST-194T1   0
            ST-194T2   0
             SV-64T1   0
             SV-64T2   0
             SV-64T3   0
             SV-64T4   0
              T-10T1   0
              T-10T2   0
             VO-56T1   0
             VO-56T2   0
             VO-56T3   1
             VO-56T4   0
             YL-62T1   0
             YL-62T2   0
             YL-62T3   0
             YL-62T4   0

 


I think I may see the problem, but I don't understand it.  In function callMutationBurden(), there is the following statement:

p <- p[p$prior.somatic >= min.prior.somatic & p$prior.somatic <= 
        max.prior.somatic]

This eliminates most germline mutations from consideration, because they all have a very low prior.somatic (it looks like 9.9e-05). I would think that is an appropriate prior.somatic value for mutations not marked SOMATIC?

What effect, if any, would this have on the analysis? Since this is the final stage, where mutation burden is computed, it seems it probably has no effect on the CNV calculations?
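
To make the effect of the quoted subsetting line concrete, here is a minimal sketch of the same window filter outside of R. It is not PureCN code: the function name `filter_by_prior`, the variant IDs, and the 0.5 prior are made up, and the 0.1 lower bound is an assumed value for illustration only (check the callMutationBurden help page for the actual defaults). The 9.9e-05 prior is the value reported above.

```shell
# Keep rows whose prior (column 2) lies inside [min, max], mirroring the
# quoted R subsetting on p$prior.somatic. Input: "variant_id prior" per line.
filter_by_prior() {
  awk -v min="$1" -v max="$2" \
    '($2 + 0) >= (min + 0) && ($2 + 0) <= (max + 0) { print $1 }'
}

# Two variants with the very low prior seen for non-SOMATIC calls and one
# variant with a high prior; only the last one survives a 0.1 lower bound.
printf 'var1 9.9e-05\nvar2 9.9e-05\nvar3 0.5\n' | filter_by_prior 0.1 1
```

With a lower bound of 0 instead of 0.1, all three toy variants would pass, which is why the threshold decides whether the non-SOMATIC calls are considered at all.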

 


Private germline means germline, but not in public germline databases.

The point of the mutation burden function is to remove known AND private SNPs from variant calls in tumor-only analyses. By this point we have hopefully removed all artifacts, so after germline filtering we should end up with only somatic calls. Mutation burden is the somatic mutation rate.

The line you quoted takes care of removing known germline variants. There can be somatic variants at known germline sites, but this should be rare. The next lines in this function remove predicted germline variants from the novel mutations (those not in dbSNP).

It's completely downstream, see the vignette section about callMutationBurden.

If you have matched normals, then calculating mutation burden is trivial: you simply count the somatic calls and normalize by the callable region. PureCN will do that for you, but this function is written for tumor-only analyses, where this isn't as easy because you don't know in advance whether a call is germline or somatic.
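
As a rough illustration of the matched-normal arithmetic (count the somatic calls, divide by the callable region size), here is a small sketch. The function name `mutations_per_mb` is made up, and the 1,500,000 bp callable region is a hypothetical value; the 9 is the SOMATIC line count from the bcftools output earlier in the thread.

```shell
# Somatic mutations per megabase = somatic call count / callable bases * 1e6.
mutations_per_mb() {
  awk -v n="$1" -v bases="$2" 'BEGIN { printf "%.2f\n", n / bases * 1e6 }'
}

# 9 somatic calls over a hypothetical 1.5 Mb callable region.
mutations_per_mb 9 1500000   # prints 6.00
```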

In matched tumor/normal, if you get a non-zero number in private germline, it means annotated as SOMATIC, but fits germline much better. This should be rare and is probably an artifact (or the coverage in normal was poor). 

Yes, this is downstream of everything.
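
For matched tumor/normal data, where non-zero private germline counts are the ones worth a second look, a quick way to pull them out of a two-column table like the one posted above might look like this. It is only a sketch: the function name `nonzero_samples` is made up, and it assumes whitespace-separated rows with a single header line. The three sample rows are taken from the table in this thread.

```shell
# Print samples whose private germline count (column 2) is non-zero.
# Input: whitespace-separated "tumorID count" rows, preceded by one header row.
nonzero_samples() {
  awk 'NR > 1 && ($2 + 0) > 0 { print $1, $2 }'
}

printf 'tumorID private.germline.all\n3-CG-214T1 0\n3-CG-217T4 1\nST-186T1 5\n' | nonzero_samples
```

The variants behind those counts can then be checked individually, for example for poor coverage in the matched normal.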


>> In matched tumor/normal, if you get a non-zero number in private germline, it means annotated as SOMATIC, but fits germline much better. This should be rare and is probably an artifact (or the coverage in normal was poor). 

Ahh, okay, that's the crucial thing I wanted to know.  You might want to add that as a comment for the private.germline.all column description.

 


Are you saying that what you are doing is checking whether any of the variants that PureCN judges to be SOMATIC are not marked as such in the VCF? In that case, you would HOPE to see 0s, and non-zero values indicate a possible problem? If so, I think the term "private germline" threw me off. In this MSEQ project of mine, somatic mutations are either clonal (all tumors have it), subclonal (more than one tumor has it, but not all), or private (only one tumor has it). So I naturally thought that a private GERMLINE mutation is one that is present in only one normal sample out of all the normals in the project.

 
