Question

problem with QDNAseq's createBins for new BSgenome package

0

Entering edit mode

penington.j • 0

@peningtonj-8581

Last seen 3.8 years ago

Australia

Hi. I want to use QDNAseq to look for copy number changes in a Plasmodium falciparum sequence. I have made a BSgenome package from the sequence, which looks OK. It gave 2 Notes and 2 Warnings when I ran R CMD check, but it loads successfully:

> library("BSgenome.Pfalciparum3D7.PlasmoDB.3D7v3")
> pfg <- BSgenome.Pfalciparum3D7.PlasmoDB.3D7v3
> seqinfo(pfg)
Seqinfo object with 16 sequences (1 circular) from Pf3D7v3 genome:
  seqnames   seqlengths isCircular  genome
  chrom04       1200490      FALSE Pf3D7v3
  chrom05       1343557      FALSE Pf3D7v3 
...

> head(pfg[['chrom01']])
  6-letter "DNAString" instance
seq: TGAACC

But QDNAseq createBins function creates an empty structure:

> pfBins10k <- createBins(pfg, 10)
Creating bins of 10 kbp for genome pfg
> pfBins10k
[1] chromosome start      end        bases      gc        
<0 rows> (or 0-length row.names)

createBins() worked when I tested it with BSgenome.Celegans.UCSC.ce2, so I think the problem must be in the way I forged the package, but I can't find it. Any advice?

Thank you, Jocelyn

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS release 6.4 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BSgenome.Pfalciparum3D7.PlasmoDB.3D7v3_0.1-0 QDNAseq_1.4.1                               
 [3] BSgenome_1.36.3                              rtracklayer_1.28.6                          
 [5] Biostrings_2.36.2                            XVector_0.8.0                               
 [7] GenomicRanges_1.20.5                         GenomeInfoDb_1.4.1                          
 [9] IRanges_2.2.5                                S4Vectors_0.6.3                             
[11] BiocGenerics_0.14.0

bsgenome biostrings qdnaseq pfalciparum • 3.8k views

ADD COMMENT • link updated 9.7 years ago by ilari.scheinin ▴ 50 • written 9.7 years ago by penington.j • 0

0

Entering edit mode

Problem solved, sort-of: it worked when I changed to ignoreMitochondria=FALSE

> pfBins10k <- createBins(pfg, 10, ignoreMitochondria=FALSE)
Creating bins of 10 kbp for genome pfg
    Processing chrom04 ...
...
    Processing chrom11 ...
    Processing chromMito ...
    Processing chromApico ...
> str(pfBins10k)
'data.frame':	2342 obs. of  5 variables:
 $ chromosome: chr  "om04" "om04" "om04" "om04" ...
 $ start     : num  1 10001 20001 30001 40001 ...
 $ end       : num  1e+04 2e+04 3e+04 4e+04 5e+04 6e+04 7e+04 8e+04 9e+04 1e+05 ...
 $ bases     : num  100 100 100 100 100 100 100 100 100 100 ...
 $ gc        : num  33 32.5 30.3 29.7 22.2 ...

Follow-up question: how is Mitochondrial status determined?

And I can see that I chose poor names for the chromosomes - I will change to more standardised. I thought I would use 'chrom' because 'chr' was confusable with 'character'. The original fasta file calls them "Pf3D7_04_v3", and "PFC10_API_IRAB", etc., and I thought the underscores might be the problem.

ADD REPLY • link 9.7 years ago penington.j • 0

score 1 · Accepted Answer · 2015-08-10

1

Entering edit mode

ilari.scheinin ▴ 50

@ilarischeinin-7020

Last seen 9.2 years ago

Finland

Hi Jocelyn,

The way the mitochondria is detected is with this regexp: "^(chr)?M(T)?$", so I'm actually not sure why it matches here. In general QDNAseq strips "^chr" from the chromosome names, as that seemed to me to be the common naming convention for the BSgenome packages.

createBins() calculates the GC content for each bin, but in order to use correctBins(), you will also need the mappability values for each bin. For human that was done from (bigWig) mappability tracks from ENCODE, but here you might need to create that data yourself. For a FASTA file called genome.fa, they can be created like this:

gem-indexer -i genome.fa -o genome
gem-mappability -I genome.gem -l 50 -o genome.50 -m 2
gem-2-wig -I genome.gem -i genome.50.mappability -o genome.50
wigToBigWig genome.50.wig genome.50.sizes genome.50.bigWig

Ilari

ADD COMMENT • link 9.7 years ago ilari.scheinin ▴ 50

0

Entering edit mode

Thanks! That regexp doesn't match my chromosome names, so it is a mystery. Maybe somewhere else there is a setting that makes it case insensitive?

I will change them anyway, and try that code for making the mappability track

Jocelyn

ADD REPLY • link 9.7 years ago penington.j • 0

0

Entering edit mode

Hi Ilari,

Now I've made the bigWig-format mappability file, how do I add it to my bins? In QDNAseq/doc/QDNAseq.pdf, and in help re createBins, the example command is:

bins$mappability <- calculateMappability(bins,
    bigWigFile='/path/to/wgEncodeCrgMapabilityAlign50mer.bigWig',
    bigWigAverageOverBed='/path/to/bigWigAverageOverBed')

I have access to the utility

bigWigAverageOverBed - Compute average score of big wig over each bed, which may have introns.

usage:

bigWigAverageOverBed in.bw in.bed out.tab

but I don't know what the 'bed's are.

There is no help for calculateMappability, and getBinAnnotations has a reference to a dead link, and no description of the format expected for annotation files. Thanks for your help so far,

Jocelyn

ADD REPLY • link 9.7 years ago penington.j • 0

1

Entering edit mode

Just give the paths to your bigWig file and the bigWigAverageOverBed binary as arguments for calculateMappability(). There is no need to run it from the commandline, just this one command from within R.

bins$mappability <- calculateMappability(bins,
    bigWigFile='/path/to/wgEncodeCrgMapabilityAlign50mer.bigWig',
    bigWigAverageOverBed='/path/to/bigWigAverageOverBed')

Ilari

ADD REPLY • link 9.7 years ago ilari.scheinin ▴ 50

0

Entering edit mode

Can anyone give me an example or instructions for making the 'in.bed' file input needed for the bigWigAverageOverBed utility? Is in.bed a description of the bin locations?

Ilari above has given example commands for making a bigWig file from a genome fasta file, but I don't know how to make the last input, bigWigAverageOverBed, to calculate mappability

ADD REPLY • link 9.7 years ago penington.j • 0

1

Entering edit mode

The 'in.bed' is indeed a list of bin locations in BED format. But there is no need to create this file manually, as QDNAseq will create it for you when you run calculateMappability(). It will first create that file, then run bigWigAverageOverBed (that's why it needs the be provided with the path to that binary), and then process the output.

ADD REPLY • link 9.7 years ago ilari.scheinin ▴ 50

0

Entering edit mode

Thanks for clarifying! :-) Sadly, after all your help and my efforts, I still don't get an output:

> file.access('/usr/local/bioinf/bin/bigWigAverageOverBed', mode=1)
/usr/local/bioinf/bin/bigWigAverageOverBed 
                                         0 
> file.exists('Pfalciparum3D7_Genome50mer.bigWig')

[1] TRUE

> bins10k$mappability <- calculateMappability(bins10k, bigWigFile='Pfalciparum3D7_Genome50mer.bigWig', bigWigAverageOverBed='/usr/local/bioinf/bin/bigWigAverageOverBed')
Calculating mappabilities per bin from file
    Pfalciparum3D7_Genome50mer.bigWig
    Error in orig[[nm]][, j, ..., drop = drop] : subscript out of bounds

This same error is given if the file name is wrong, hence the file.exists and file.access checks, which also confirm I have execute permission for the binary.

ADD REPLY • link 9.7 years ago penington.j • 0

1

Entering edit mode

Hmm, this could also be due to chromosome names. Try specifying parameter chrPrefix="" for calculateMappability().

(The purpose of this parameter is that chromosome names in the BED file QDNAseq creates for bigWigAverageOverBed have to match chromosome names in the bigWig file. For human, chromosome names in the latter will either be of the format "chr1" if it was created from a FASTA file with UCSC style chromosome names, or "1" if from a FASTA file with NCBI/Ensembl style chromosome names. QDNAseq itself uses the NCBI ones (and strips "^chr" from BSgenome chromosome names when running createBins()). Therefore, if the bigWig file was created from a UCSC style FASTA (like the bigWig files provided by ENCODE are), chromosome names need to be prefixed with "chr", which is the default value for chrPrefix. Hence, what happened here was probably that as your chromosome names where prefixed with "chr", they did not match the chromosome names used in your bigWig file.)

If that doesn't help, try typing "calculateMappability" followed by <enter> to see the source code for the calculateMappability() function. It's a rather short one, so easy to run manually line by line and see where the error really comes from.

ADD REPLY • link 9.7 years ago ilari.scheinin ▴ 50

0

Entering edit mode

Thank you! Looking at the source code instantly showed my problem - I was using the wrong input data structure. My 'bins10k' was the formal class QDNAseqReadCounts output of binReadCounts(). I needed to supply the simple data-frame output of createBins(). All fixed now (except for a little embarrassment). Thank you again for your help

Jocelyn

ADD REPLY • link 9.7 years ago penington.j • 0

0

Entering edit mode

I have now succeeded in forging the BSgenome, calculated mappability, read the bam files, run correctBins, but I can't plot with my original sequence names. I have got plot() to work, with a work-around that involves changing all my nuclear chromosome names to just digits. Looking at the code in https://github.com/ccagc/QDNAseq/blob/master/R/plot-methods.R , it looks like the problem is with getting the chromosome as an integer. I can't figure out how chromosomes(x) is working - the only code I can find requires x to be a Cgh object.

I had reverted to the original sequence names in the BSgenome when I couldn't read counts from bam files because the sequence names didn't match.

> plot(counts10k)
Error in pos[chrom > i] + chrom.lengths[as.character(i)] : 
  non-numeric argument to binary operator
In addition: Warning message:
In chromosomes(x) : NAs introduced by coercion

> head(rownames(assayDataElement(counts10k, 'counts')))
[1] "Pf3D7_04_v3:1-10000"     "Pf3D7_04_v3:10001-20000" "Pf3D7_04_v3:20001-30000"
[4] "Pf3D7_04_v3:30001-40000" "Pf3D7_04_v3:40001-50000" "Pf3D7_04_v3:50001-60000"

> head(chromosomes(counts10k))
[1] NA NA NA NA NA NA
Warning message:
In chromosomes(counts10k) : NAs introduced by coercion

newnames <- sapply(seqnames(pfg), function(x) {
if (x=="M76611") {'Mito'} else
if (x=="PFC10_API_IRAB") {'Apico'} else
if (strsplit(x, '_')[[1]][1]=="Pf3D7") {strsplit(x, '_')[[1]][2]}
} )
counts10kchrNew <- counts10k
counts10kchrNew@featureData@data$chromosome <- newnames[counts10kchrNew@featureData@data$chromosome]

I made matching changes to

rownames(assayDataElement(counts10kchrNew, 'counts')) and

featureNames(featureData(counts10kchrNew))

As I said, plot() and correctBins() now work on counts10kchrNew, though I still get warnings about NAs even though I set use=FALSE on the non-nuclear bins.

Cheers,

Jocelyn

ADD REPLY • link 9.7 years ago penington.j • 0

1

Entering edit mode

QDNAseq was originally developed as an extension to CGHcall. The latter uses integers to represent chromosomes, and as a remnant of that, chromosomes() also returns integers for QDNAseq objects. But since QDNAseq internally calls human sex chromosomes "X" and "Y" (instead of 23 and 24), it needed to do conversions between the two, and ended up having that hard-coded in a few places. This essentially made it dependent on chromosome names being only integers, "X", "Y", or "MT".

I now went through the code base and removed these assumptions. This should fix plot() and calculateMappability(), which also had the same conversion hard-coded in.

The changes are currently in my own GitHub clone, and I have created a pull request to have them merged to the main code base by the current package maintainer.

ADD REPLY • link 9.7 years ago ilari.scheinin ▴ 50

0

Entering edit mode

The changes have been merged into the development version of QDNAseq. You can either wait for it to propagate and install with the development version of Bioconductor, or you can directly grab it from github: https://github.com/ccagc/QDNAseq https://github.com/Bioconductor-mirror/QDNAseq Grtz, Daoud On 08/25/2015 03:01 PM, ilari.scheinin [bioc] wrote: > Activity on a post you are following on support.bioconductor.org > <https: support.bioconductor.org=""> > > User ilari.scheinin <https: support.bioconductor.org="" u="" 7020=""/> wrote > Comment: problem with QDNAseq's createBins for new BSgenome package > <https: support.bioconductor.org="" p="" 70925="" #71439="">: > > QDNAseq was originally developed as an extension to CGHcall. The latter > uses integers to represent chromosomes, and as a remnant of that, > chromosomes() also returns integers for QDNAseq objects. But since > QDNAseq internally calls human sex chromosomes "X" and "Y" (instead of > 23 and 24), it needed to do conversions between the two, and ended up > having that hard-coded in a few places. This essentially made it > dependent on chromosome names being only integers, "X", "Y", or "MT". > > I now went through the code base and removed these assumptions. This > should fix plot() and calculateMappability(), which also had the same > conversion hard-coded in. > > The changes are currently in my own GitHub clone, > <https: github.com="" ilarischeinin="" qdnaseq=""> and I have created a pull > request <https: github.com="" ccagc="" qdnaseq="" pull="" 12=""> to have them merged > to the main code base by the current package maintainer. > > ------------------------------------------------------------------------ > > Post tags: bsgenome, biostrings, qdnaseq, pfalciparum > > You may reply via email or visit > C: problem with QDNAseq's createBins for new BSgenome package >

ADD REPLY • link 9.7 years ago d.sie • 0

0

Entering edit mode

Hi Ilari,

Just to clarify, BSgenome packages use whatever naming convention is used by the genome provider i.e. they don't try to enforce any particular convention. So, for example, if the genome comes from UCSC, then most of the times the chromosome names are prefixed with chr. However BSgenome packages can contain genomes from other providers like TAIR, BeeBase, NCBI, etc.... For example:

library(BSgenome.Athaliana.TAIR.TAIR9)
# seqlevels(BSgenome.Athaliana.TAIR.TAIR9)
# [1] "Chr1" "Chr2" "Chr3" "Chr4" "Chr5" "ChrM" "ChrC"

library(BSgenome.Amellifera.BeeBase.assembly4)
# seqlevels(BSgenome.Amellifera.BeeBase.assembly4)
#  [1] "Group1"  "Group2"  "Group3"  "Group4"  "Group5"  "Group6"  "Group7" 
#  [8] "Group8"  "Group9"  "Group10" "Group11" "Group12" "Group13" "Group14"
# [15] "Group15" "Group16"

library(BSgenome.Mfascicularis.NCBI.5.0)
head(seqlevels(BSgenome.Mfascicularis.NCBI.5.0), 30)
# [1] "MFA1"         "MFA2"         "MFA3"         "MFA4"         "MFA5"        
# [6] "MFA6"         "MFA7"         "MFA8"         "MFA9"         "MFA10"       
# [11] "MFA11"        "MFA12"        "MFA13"        "MFA14"        "MFA15"       
# [16] "MFA16"        "MFA17"        "MFA18"        "MFA19"        "MFA20"       
# [21] "MFAX"         "MT"           "Scaffold1372" "Scaffold1442" "Scaffold1519"
# [26] "Scaffold3048" "Scaffold3501" "Scaffold27"   "Scaffold47"   "Scaffold56"

Note that createBins() doesn't seem to work properly with some of these genomes:

createBins(BSgenome.Athaliana.TAIR.TAIR9, 15)
# Creating bins of 15 kbp for genome BSgenome.Athaliana.TAIR.TAIR9
# [1] chromosome start      end        bases      gc        
# <0 rows> (or 0-length row.names)

createBins(BSgenome.Amellifera.BeeBase.assembly4, 15)
# Creating bins of 15 kbp for genome BSgenome.Amellifera.BeeBase.assembly4
# [1] chromosome start      end        bases      gc        
# <0 rows> (or 0-length row.names)

Hope this is helpful,

H.

ADD REPLY • link 9.7 years ago Hervé Pagès 16k

0

Entering edit mode

It seems that filtering out mitochondrial chromosomes is resulting in an empty vector of chromosome names. This only occurs when there is no match with: "^(chr)?M(T)?$" I'm making an issues to address this. It's an easy fix, but I think we should think of a more generic way of filtering seqnames. Grtz, Daoud On 08/11/2015 05:06 AM, Hervé Pagès [bioc] wrote: > Activity on a post you are following on support.bioconductor.org > <https: support.bioconductor.org=""> > > User Hervé Pagès <https: support.bioconductor.org="" u="" 1542=""/> wrote > Comment: problem with QDNAseq's createBins for new BSgenome package > <https: support.bioconductor.org="" p="" 70925="" #70974="">: > > Hi Ilari, > > Just to clarify, BSgenome packages use whatever naming convention is > used by the genome provider i.e. they don't try to enforce any > particular convention. So, for example, if the genome comes from UCSC, > then most of the times the chromosome names are prefixed with |chr|. > However BSgenome packages can contain genomes from other providers like > TAIR, BeeBase, NCBI, etc.... For example: > > library(BSgenome.Athaliana.TAIR.TAIR9) > # seqlevels(BSgenome.Athaliana.TAIR.TAIR9) > # [1] "Chr1" "Chr2" "Chr3" "Chr4" "Chr5" "ChrM" "ChrC" > > library(BSgenome.Amellifera.BeeBase.assembly4) > # seqlevels(BSgenome.Amellifera.BeeBase.assembly4) > # [1] "Group1" "Group2" "Group3" "Group4" "Group5" "Group6" "Group7" > # [8] "Group8" "Group9" "Group10" "Group11" "Group12" "Group13" "Group14" > # [15] "Group15" "Group16" > > library(BSgenome.Mfascicularis.NCBI.5.0) > head(seqlevels(BSgenome.Mfascicularis.NCBI.5.0), 30) > # [1] "MFA1" "MFA2" "MFA3" "MFA4" "MFA5" > # [6] "MFA6" "MFA7" "MFA8" "MFA9" "MFA10" > # [11] "MFA11" "MFA12" "MFA13" "MFA14" "MFA15" > # [16] "MFA16" "MFA17" "MFA18" "MFA19" "MFA20" > # [21] "MFAX" "MT" "Scaffold1372" "Scaffold1442" "Scaffold1519" > # [26] "Scaffold3048" "Scaffold3501" "Scaffold27" "Scaffold47" "Scaffold56" > > Note that |createBins()| doesn't seem to work properly with some of > these genomes: > > createBins(BSgenome.Athaliana.TAIR.TAIR9, 15) > # Creating bins of 15 kbp for genome BSgenome.Athaliana.TAIR.TAIR9 > # [1] chromosome start end bases gc > # <0 rows> (or 0-length row.names) > > createBins(BSgenome.Amellifera.BeeBase.assembly4, 15) > # Creating bins of 15 kbp for genome BSgenome.Amellifera.BeeBase.assembly4 > # [1] chromosome start end bases gc > # <0 rows> (or 0-length row.names) > > Hope this is helpful, > > H. > > ------------------------------------------------------------------------ > > Post tags: bsgenome, biostrings, qdnaseq, pfalciparum > > You may reply via email or visit > C: problem with QDNAseq's createBins for new BSgenome package >

ADD REPLY • link 9.7 years ago d.sie • 0