Question

Why is PureCN LOH file missing some segments

0

Entering edit mode

twtoal ▴ 10

@twtoal-15473

Last seen 9 months ago

United States

I see that the _loh.csv file sometimes contains fewer segments than the corresponding _dnacopy.seg file. Usually it contains one or two MORE segments. I would think it would have exactly the same number of segments in the two files. In one case I have 5 segments in the _dnacopy.seg file covering most of chr18 while the _loh.csv file has only 1 segment covering only a small part of it. What is the cause of this?

I discovered a possible part of the answer. The "variants.csv" file does not have any variants within one of the segments found in the _dnacopy.seg file but not in the "loh.csv" file. Maybe the latter file omits segments not having any variants within them. The problem here is that the "num.mark" column in the "_dnacopy.seg" file has the value 6 for this segment, and I would have thought it would be 0.

I think I remember that an option was added to omit segments from the LOH file if there are no SNPs in it. However, since I'm creating that file using PureCN.R and not by calling callLOH directly, I don't think I have access to that option.

purecn loh • 1.4k views

ADD COMMENT • link 5.1 years ago twtoal ▴ 10

score 0 · Answer 1 · 2020-03-05

0

Entering edit mode

markus.riester ▴ 130

@markusriester-9875

Last seen 2.8 years ago

United States

There are two differences:

The LOH file merges neighboring segments when they have the same predicted major and minor copy number state. This makes it more robust to noise in the segmentation, but might remove true events, most likely sub-clonal gains and losses.
The default segmentation does not add a breakpoint at the centromeres (the soon to be new default PSCBS does, though), but the LOH output file forces one to report chromosome arm events. That would result in more segments.

ADD COMMENT • link 5.1 years ago markus.riester ▴ 130

0

Entering edit mode

But those two changes would not reduce the COVERAGE LENGTH of the LOH file, but that is what I'm seeing. See my late addition to my question.

Could it be that after splitting a segment at the centromere, one of the two pieces has no SNPs in it and so is discarded? No, that isn't it, as the segment I'm looking at does not span or include the centromere.

ADD REPLY • link 5.1 years ago twtoal ▴ 10

0

Entering edit mode

Would you mind re-running one of those samples with wrong num.marks with the PSCBS segmentation?

# patched PSCBS with support of interval weights
BiocManager::install("lima1/PSCBS", ref="add_dnacopy_weighting")

Then simply add --funsegmentation PSCBS to the PureCN.R call? I think this should fix it. I never bothered to fix the num.marks for these forced breaks since it's not necessary with PSCBS.

And yes, the coverage length can be shorter at the ends, since the chromsome start and ends are defined by the first and last available SNP.

ADD REPLY • link 5.1 years ago markus.riester ▴ 130

0

Entering edit mode

Well, I don't have time at the moment, will try to get to it later. However, it sounds like you are saying that num.marks could be wrong in cases where a segment that straddled the centromere was split into two segments? But that doesn't explain my data. Let me just give you the _dnacopy.seg segments on chr 13 of this sample:

X3.CG.226T4     13      18171749        108949296       765     -0.109074637681159      3
X3.CG.226T4     13      108949297       110124688       6       3.1174  66.4779670043654
X3.CG.226T4     13      110124689       110712385       3       -0.1661 3
X3.CG.226T4     13      110714084       111506698       9       0.114   4
X3.CG.226T4     13      111506699       114353729       13      -0.1241 3

and the _loh.csv segments (there is only 1):

3-CG-226T4,chr13,18171749,102875652,q,3,1,,-0.109074637681159,765,22,FALSE,0.432900432900433,0.452344630429296

According to my data, the centromere lies at positions 16000000-18051248 and chromosome length is 114364328. That would put all 5 _dnacopy.seg segments in the q arm.

I need to figure out quickly what to do here as I am working on a poster for next week. The situation arose because I only recently decided to start paying attention to LOH, and I decided I wanted to estimate LOH on each chromosome arm, so I switched from using the _dnacopy.seg file to the _loh.csv file for the segments I was using to look at copy number changes in chromosomes. The results for my calls of gains and losses on full chromosome arms were quite similar but not exactly identical in the two cases, and when I delved into it I found this chr13 difference that is particularly striking.

By the way, I don't suppose PSCBS is able to simultaneously segment multiple samples from the same patient, to try to put segment boundaries at the same position if they appear to be derived from a CNV event in a common ancestor?

ADD REPLY • link 5.1 years ago twtoal ▴ 10