I inherited what I believe to be a good implementation of PureCN for analyzing WXS in a cohort. We see that while there are only a few gene or regional CN/LOH datapoints, some values occur very frequently (a CN of 7.00 is observed in 90% of the ~150 alterations across the cohort). This occurs in multiple cohorts that share the same sequencing and downstream analyses, which leads us to believe it is either something in our data source or a misstep in our analysis.
Has this been observed before? We have some samples/regions which are less peculiar (CN is still above and below 7 in some cases). I am naively trying to identify if this issue could be an artifact/bias or misstep in my implementation.
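To make the "CN of 7.00 in 90% of alterations" observation easier to reproduce and share, here is a minimal sketch of how the tally could be done in Python. It assumes the per-sample gene-level calls have been read into dictionaries (for example with csv.DictReader) and that the copy number sits in a column named C; that column name, the gene names, and the toy rows are assumptions for illustration, so adjust them to whatever your output files actually contain.

```python
from collections import Counter


def cn_frequencies(rows, cn_column="C"):
    """Count how often each copy-number value occurs.

    rows: iterable of dicts, e.g. from csv.DictReader over a per-sample
    gene-level table. cn_column is an assumed column name.
    Returns {copy_number: (count, fraction_of_non_missing_calls)}.
    """
    counts = Counter()
    for row in rows:
        value = row.get(cn_column, "")
        if value in ("", "NA"):
            continue  # skip missing calls
        counts[float(value)] += 1
    total = sum(counts.values())
    return {cn: (n, n / total) for cn, n in sorted(counts.items())}


# Hypothetical toy rows standing in for pooled calls from several samples.
toy_rows = [
    {"gene.symbol": "GENE_A", "C": "7"},
    {"gene.symbol": "GENE_B", "C": "7"},
    {"gene.symbol": "GENE_C", "C": "2"},
    {"gene.symbol": "GENE_D", "C": "NA"},
]

for cn, (n, frac) in cn_frequencies(toy_rows).items():
    print(f"CN={cn:g}\tn={n}\tfraction={frac:.2f}")
```

Running the same tally per sample as well as over the pooled cohort should show whether the recurring value is spread across all samples or driven by a few of them.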
Also, feel free to post an example log file and I can check if setup is fine.
Log file is here: https://drive.google.com/file/d/11VCot5RzNijxzpgZ63qvkQ-AZK19GqGu/view?usp=sharing
Looks great. The only thing you can improve is running Mutect2 with --interval-padding 50 if you don't already. Ideally then also do that on the normal samples and recreate the mapping bias file. This typically increases the number of SNPs quite a lot, thus improving power to call LOH.
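As a rough illustration of why padding increases the SNP count, the sketch below extends each target interval by a fixed number of bases on both sides and merges any intervals that then overlap. This is only a conceptual sketch in Python, not GATK's implementation; the function name, the BED-style 0-based half-open coordinates, and the example intervals are assumptions, and it does not clip at chromosome ends.

```python
def pad_and_merge(intervals, padding=50):
    """Pad (chrom, start, end) intervals on both sides and merge overlaps.

    Coordinates are treated as 0-based, half-open (BED-like); starts are
    clipped at 0, ends are not clipped at the chromosome length.
    """
    padded = sorted(
        (chrom, max(0, start - padding), end + padding)
        for chrom, start, end in intervals
    )
    merged = []
    for chrom, start, end in padded:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or touches the previous interval: extend it.
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def covered_bases(intervals):
    """Total number of bases covered by non-overlapping intervals."""
    return sum(end - start for _, start, end in intervals)


# Hypothetical targets: two nearby exons on chr1 and one near the start of chr2.
targets = [("chr1", 100, 200), ("chr1", 260, 300), ("chr2", 10, 20)]
padded_targets = pad_and_merge(targets, padding=50)
print(padded_targets)
print(covered_bases(targets), "->", covered_bases(padded_targets))
```

Heterozygous SNPs in the flanking bases, where coverage is often still usable, then fall inside the called region, which is where the extra SNPs come from.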
Most of the fixes that might affect artifacts should be already in, but you can try updating to 2.0.1.
Would appreciate lists of artifacts.
You might also want to try PureCN.R --fun-segmentation GATK4 (just make sure the gatk binary is in your PATH). Some users have reported cleaner profiles in WES than with PSCBS. I optimized the PSCBS-based function for our panels with cfDNA, and I think GATK is more tuned towards WES/WGS.