wrong output of DNAcopy::segment() for chr. with empty copy number data
2
0
Entering edit mode
@alexandre-kuhn-4386
Last seen 4.5 years ago

Hello,

I encountered a rare problem with segment() in the DNAcopy package. Specifically I see artifactual segments with ouptut$loc.start greater than output$end.start and output$num.mark=0 and output$seg.mean=NA.

In my case, it seems to be caused by "chromosomes" (contigs) with empty copy number data point (NAs). Here is a minimal example to try and illustrate the problem:

 

> segment(CNA(genomdat=c(1,2,1,NA,3),chrom=c(rep("chr1",3),"chr2","chr3"),maploc=c(1:3,1,1)))

Analyzing: Sample.1 

Call:

segment(x = CNA(genomdat = c(1, 2, 1, NA, 3), chrom = c(rep("chr1", 

    3), "chr2", "chr3"), maploc = c(1:3, 1, 1)))


        ID chrom loc.start loc.end num.mark seg.mean

1 Sample.1  chr1         1       3        3   1.3333

2 Sample.1  chr1         1       3        0       NA

3 Sample.1  chr3         1       1        1   3.0000

 

So the second segment in the output is assigned to chr1 and has num.mark of 0. It should be omitted I believe. SegRows in segment() output would also show a problem as startRow is greater than endRow for that second segment (with real data I have also seen examples where, in addition, loc.start would be greater than loc.end, which is not the case here).

A simple workaround is to eliminate segments with num.mark=0 and seg.mean=NA. However, it would be great to fix the segment() function directly (if needed). Thanks!

 

> sessionInfo()

R version 3.4.1 (2017-06-30)

Platform: x86_64-pc-linux-gnu (64-bit)

Running under: CentOS release 6.8 (Final)


Matrix products: default

BLAS: /home/akuhn/R/R-3.4.1/lib/libRblas.so

LAPACK: /home/akuhn/R/R-3.4.1/lib/libRlapack.so


locale:

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              

 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    

 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   

 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 

 [9] LC_ADDRESS=C               LC_TELEPHONE=C            

[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       


attached base packages:

[1] stats     graphics  grDevices utils     datasets  methods   base     


other attached packages:

[1] DNAcopy_1.50.1


loaded via a namespace (and not attached):

[1] compiler_3.4.1

 

DNAcopy dnacopy • 1.7k views
ADD COMMENT
0
Entering edit mode
@markusriester-9875
Last seen 2.6 years ago
United States

I think you answered your question, segment() doesn't like NAs anywhere. 

ADD COMMENT
0
Entering edit mode
@alexandre-kuhn-4386
Last seen 4.5 years ago

It does not seem to be true. segment() handles NAs correctly if there are other, non-NA values on the same chromosome:

> segment(CNA(genomdat=c(1,2,1,NA,2,3),chrom=c(rep("chr1",3),rep("chr2",2),"chr3"),maploc=c(1:3,1:2,1)))

Analyzing: Sample.1 

Call:

segment(x = CNA(genomdat = c(1, 2, 1, NA, 2, 3), chrom = c(rep("chr1", 

    3), rep("chr2", 2), "chr3"), maploc = c(1:3, 1:2, 1)))


        ID chrom loc.start loc.end num.mark seg.mean

1 Sample.1  chr1         1       3        3   1.3333

2 Sample.1  chr2         2       2        1   2.0000

3 Sample.1  chr3         1       1        1   3.0000

 

However, the (problematic) case where read count data are NAs for the entire contig length does occur in real world analyses (and, in particular these days with draft genome assemblies that may have numerous, smaller contigs). It would thus be helpful to users if segment() would handle this case properly.

ADD COMMENT
0
Entering edit mode

I stand corrected. I'm sure though there are other corner cases where it doesn't work with NAs, since I've stumbled across this "negative segment sizes with NAs" a few times before. But you are probably right that this is a bug worth fixing.

 

ADD REPLY

Login before adding your answer.

Traffic: 627 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6