Hi
I have some very large bam-files (>200TB in total) and cn.mops is taking a while, so I've been thinking of using mosdepth instead - is my approach like this valid/am I breaking any assumptions in cn.mops?
1) Run mosdepth with a 25000 bp window, identical to the ranges in the cn.mops manual (mosdepth -b 25000 –n output_prefix input_bam
)
2) Load that into R –
library(GenomicRanges) df <- read.table('sample1.bed') colnames(df) <- c('chr', 'start', 'end', 'sample1') #make GenomicRanges object gr <- GRanges(seqnames=df$chr, ranges=IRanges(df$start+1, df$end), strand='*', sample1=df$sample1) # make one big genomicranges object for all individuals df <- read.table(‘sample2.bed’) colnames(df) <- c('chr', 'start', 'end', 'sample2’) gr2 <- GRanges(seqnames=df$chr, ranges=IRanges(df$start+1, df$end), strand='*', sample2=df$sample2) gr$sample2 <- gr2$sample2 # mosdepth guarantees that all windows/IRanges are the same for all individuals # and so on for all individuals in a loop resCNMOPS <- cn.mops(gr) # follow manual from here
gr looks like this:
> gr GRanges object with 2 ranges and 2 metadata columns: seqnames ranges strand | sample1 sample2 <rle> <iranges> <rle> | <integer> <integer> [1] chr1 1-10000 * | 5 10 [2] chr1 10002-20000 * | 10 1000
Would this work?
Thank you for your fast response - you're right, looking at mosdepth's regions.bed.gz it gives out the DOC:
So I will use your formula to get an integer for that, and then it's done - thank you!
Hi
that worked perfectly - here's a crappy Python script that takes all mosdepth output and writes one table:
this prints to STDOUT so redirect using >
Load into R using
Edit: Replaced the matrix input by bed input so that the resulting GRanges input has nice chromosome and basepair positions, instead of
undef
, and fixed a bug with the range calculation when a contig is smaller than 25000