Question

Reproducibility of DNAcopy segmentation

Ross Patterson ▴ 20

@ross-patterson-3886

Last seen 10.6 years ago

While performing copy number analysis on data segmented with the DNAcopy package, I noticed some variation in the output and was hoping someone here could help shed some light on it. Specifically, when running the DNAcopy segmentation on the exact same input data multiple times, the resulting segment output sometimes contains "extra" segments, caused by the discovery of "extra" breakpoints. In fact, the output is different on every run.

Digging into the source code a little, I saw what appeared to be calls to random number generating functions. Not being very familiar with Fortran, I could not tell how or why these numbers are used, or even whether they are the source of the segmentation discrepancies.

I know that in the last few years there have been some changes to the segmentation algorithm to allow it to run in near-linear time. Did that require introducing non-deterministic behavior? Is there a way to force the segmentation algorithm to run deterministically, so that the output can be reproduced identically every time the segmentation is run?

Thank you in advance for your help,
Ross Patterson

DNAcopy • 1.3k views

updated 15.3 years ago by Sean Davis 21k • written 15.3 years ago by Ross Patterson ▴ 20

score 0 · Answer 1 · 2010-01-13

On Wed, Jan 13, 2010 at 1:42 PM, Ross Patterson <rossjp at gmail.com> wrote:
> While performing some copy number analysis on data segmented with the
> DNAcopy package, I have noticed some variations in the output data, and was
> hoping someone here could help shed some light on that. Specifically, while
> running the DNAcopy segmentation on the exact same input data multiple
> times, I have noticed that the resultant segment data output sometimes
> contains "extra" segments, caused by the discovery of "extra" breakpoints.
> In fact, the resultant output data is always different. Digging into the
> source code a little bit, I saw what appeared to be calls to some random
> number generating functions, although not being very familiar with Fortran
> code I could not tell how or why these numbers were being used, or even if
> that is the source of segmentation discrepancies. I know that in the last
> few years there have been some changes to the segmentation algorithm to
> allow it to run in near linear time. Did that require introducing
> non-deterministic behavior? Is there a way to force the segmentation
> algorithm to run deterministically, such that the output data can be
> identically reproduced every time the segmentation is run?

Hi, Ross.

DNAcopy uses an empirical distribution for determining significance. The help for segment() gives some details. The authors can perhaps comment on whether or not there is a way to make things run deterministically.

Sean
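
To illustrate why a significance test based on an empirical (permutation) distribution can give different answers on identical input, here is a minimal Python sketch. It is not DNAcopy's algorithm: the statistic `max_mean_shift`, the function `permutation_p_value`, and the toy data are hypothetical stand-ins chosen for illustration. The point is only that the p-value is estimated from random shuffles, so a breakpoint near the significance threshold can be accepted in one run and rejected in the next unless the random number generator is seeded. In R the analogous step would presumably be a call to `set.seed()` before `segment()`, but whether DNAcopy's Fortran code draws from R's generator is an assumption here, not something confirmed in this thread.

```python
import random


def max_mean_shift(values):
    """Largest absolute difference between left and right means over all split points.

    A simplified change-point statistic, NOT the one used by DNAcopy.
    """
    best = 0.0
    for k in range(1, len(values)):
        left = sum(values[:k]) / k
        right = sum(values[k:]) / (len(values) - k)
        best = max(best, abs(left - right))
    return best


def permutation_p_value(values, n_perm=200, seed=None):
    """Empirical p-value for a change-point, estimated by shuffling the data.

    With seed=None the generator is seeded from system entropy, so repeated
    calls on the same data can return different p-values.
    """
    rng = random.Random(seed)
    observed = max_mean_shift(values)
    hits = 0
    for _ in range(n_perm):
        shuffled = values[:]
        rng.shuffle(shuffled)
        if max_mean_shift(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Toy data: 20 points around 0 followed by 20 points around 0.8 (hypothetical).
data_rng = random.Random(0)
data = [data_rng.gauss(0.0, 1.0) for _ in range(20)]
data += [data_rng.gauss(0.8, 1.0) for _ in range(20)]

# Same seed, same input: the estimate is reproduced exactly.
p_fixed_a = permutation_p_value(data, seed=42)
p_fixed_b = permutation_p_value(data, seed=42)

# No seed: each run draws different permutations, so the estimates may differ.
p_unseeded = [permutation_p_value(data) for _ in range(5)]

print("seeded:  ", p_fixed_a, p_fixed_b)
print("unseeded:", p_unseeded)
```

If each p-value is compared against a fixed cutoff, the unseeded runs can disagree about whether a borderline breakpoint is kept, which would match the "extra" segments described in the question.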