I see that the _loh.csv file sometimes contains fewer segments than the corresponding _dnacopy.seg file. Usually it contains one or two MORE segments. I would think it would have exactly the same number of segments in the two files. In one case I have 5 segments in the _dnacopy.seg file covering most of chr18 while the _loh.csv file has only 1 segment covering only a small part of it. What is the cause of this?
I discovered a possible part of the answer. The "variants.csv" file does not have any variants within one of the segments found in the _dnacopy.seg file but not in the "loh.csv" file. Maybe the latter file omits segments not having any variants within them. The problem here is that the "num.mark" column in the "_dnacopy.seg" file has the value 6 for this segment, and I would have thought it would be 0.
I think I remember that an option was added to omit segments from the LOH file if there are no SNPs in it. However, since I'm creating that file using PureCN.R and not by calling callLOH directly, I don't think I have access to that option.
But those two changes would not reduce the COVERAGE LENGTH of the LOH file, but that is what I'm seeing. See my late addition to my question.
Could it be that after splitting a segment at the centromere, one of the two pieces has no SNPs in it and so is discarded? No, that isn't it, as the segment I'm looking at does not span or include the centromere.
Would you mind re-running one of those samples with wrong num.marks with the PSCBS segmentation?
Then simply add --funsegmentation PSCBS to the PureCN.R call? I think this should fix it. I never bothered to fix the num.marks for these forced breaks since it's not necessary with PSCBS.
And yes, the coverage length can be shorter at the ends, since the chromsome start and ends are defined by the first and last available SNP.
Well, I don't have time at the moment, will try to get to it later. However, it sounds like you are saying that num.marks could be wrong in cases where a segment that straddled the centromere was split into two segments? But that doesn't explain my data. Let me just give you the _dnacopy.seg segments on chr 13 of this sample:
and the _loh.csv segments (there is only 1):
According to my data, the centromere lies at positions 16000000-18051248 and chromosome length is 114364328. That would put all 5 _dnacopy.seg segments in the q arm.
I need to figure out quickly what to do here as I am working on a poster for next week. The situation arose because I only recently decided to start paying attention to LOH, and I decided I wanted to estimate LOH on each chromosome arm, so I switched from using the _dnacopy.seg file to the _loh.csv file for the segments I was using to look at copy number changes in chromosomes. The results for my calls of gains and losses on full chromosome arms were quite similar but not exactly identical in the two cases, and when I delved into it I found this chr13 difference that is particularly striking.
By the way, I don't suppose PSCBS is able to simultaneously segment multiple samples from the same patient, to try to put segment boundaries at the same position if they appear to be derived from a CNV event in a common ancestor?