(Big)wig file of complete chromosome
1
0
Entering edit mode
Assa Yeroslaviz ★ 1.5k
@assa-yeroslaviz-1597
Last seen 9 days ago
Germany

Hi,

 

I have a list of bigwig files from different data sets. We would like to create a table of all the genomic positions per chromosome to be able to compare the samples with each other. 

As the peaks found in each samples are at different chromosomal positions, I would like to complete each chromosomes and add all the positions which have no reads with a 0. As an example

this is my file:

variableStep chrom=chrV
86      1.00
87      1.00
88      1.00
89      1.00
90      1.00
....
variableStep chrom=chrII
1       2.00
2       2.00
3       2.00
4       3.00
5       4.00

In this sample I would like for Chr V to add positions 1-85 with a 0.

And than split the bigwig file into separate chromosomes and creates a data.frame for each ofthe chromosomes. 

I know it is possible to read bigwig files into R using rtracklayer and also to subset them per chromosome. What I don't know is if it is possible to add the empty positions into the GRanges object created from the bigwig file.

thanks

Assa

 

P.S.

I was thinking about it a bit longer now and was wondering if it is possible to create a GRangeList object from all samples and than modify it within this object so that I can modify all samples at once and than reduce the list into a data.frame for the separate chromosomes

thanks again

 

bigwig rtracklayer import granges grangeslist • 3.4k views
ADD COMMENT
1
Entering edit mode
@michael-lawrence-3846
Last seen 2.9 years ago
United States

Compute the coverage using the coverage from the file as the weight, and then coerce that back into a GRanges. 

Like:

gr_filled <- GRanges(coverage(gr, weight="score"))

 

ADD COMMENT
0
Entering edit mode

Thanks Michael for the neat solution. it is still not what i need though. The coverage function summarize the similar values into one row like that:

GRanges object with 5793227 ranges and 1 metadata column:
            seqnames         ranges strand |     score
               <Rle>      <IRanges>  <Rle> | <numeric>
        [1]     chrI       [ 1, 11]      * |         1
        [2]     chrI       [12, 12]      * |         2
        [3]     chrI       [13, 13]      * |         5
        [4]     chrI       [14, 14]      * |         7
        [5]     chrI       [15, 15]      * |         9
        ...      ...            ...    ... .       ...
  [5793223]    chrmt [85674, 85675]      * |         7
  [5793224]    chrmt [85676, 85677]      * |         6
  [5793225]    chrmt [85678, 85679]      * |         5
  [5793226]    chrmt [85680, 85682]      * |         2
  [5793227]    chrmt [85683, 85779]      * |         0

Is there a way to expand the GRanges object to contain all the positions separately. I need something like this one here (done manually):

GRanges object with 5793227 ranges and 1 metadata column:
            seqnames         ranges strand |     score
               <Rle>      <IRanges>  <Rle> | <numeric>
        [1]     chrI       [ 1, 1]      * |         1
        [2]     chrI       [ 2, 2]      * |         1
        [3]     chrI       [ 3, 3]      * |         1
        [4]     chrI       [ 4, 4]      * |         1
        [5]     chrI       [ 5, 5]      * |         1
        ...      ...            ...    ... .       ...
        [1]     chrI       [10, 10]      * |         1
        [2]     chrI       [11, 11]      * |         1
        [3]     chrI       [12, 12]      * |         2
        [4]     chrI       [13, 13]      * |         5
        [5]     chrI       [14, 14]      * |         5
        ...      ...            ...    ... .       ...

thanks in advance

ADD REPLY
1
Entering edit mode

You want to use a GPos object. I thought this would be simple, but it turns out to require a couple lines:

gpos <- GPos(gr)
gpos$score <- rep(gr$score, width(gr))

 

ADD REPLY
0
Entering edit mode

thanks for the suggestion, but it is still not exactly what i need. The code above doesn't take the ends of each chromosome into account. This leads to different number of entries in the two GRanges objects. e.g.

> InputIP.control[seqnames(InputIP.control) == "I"]
GRanges object with 92413 ranges and 1 metadata column:
          seqnames           ranges strand |            score
             <Rle>        <IRanges>  <Rle> |        <numeric>
      [1]        I     [   1, 6303]      * | 17.5913696289062
      [2]        I     [6304, 6304]      * | 17.6069602966309
      [3]        I     [6305, 6351]      * | 17.5913696289062
      [4]        I     [6352, 6352]      * | 17.6825294494629
      [5]        I     [6353, 6353]      * | 17.7580890655518
      ...      ...              ...    ... .              ...
  [92409]        I [220379, 222823]      * | 17.5913696289062
  [92410]        I [222824, 222824]      * | 17.6069602966309
  [92411]        I [222825, 222825]      * | 17.5913696289062
  [92412]        I [222826, 222826]      * | 17.6069602966309
  [92413]        I [222827, 230210]      * | 17.5913696289062

> InputIgG.control[seqnames(InputIgG.control) == "I"]
GRanges object with 59889 ranges and 1 metadata column:
          seqnames           ranges strand |            score
             <Rle>        <IRanges>  <Rle> |        <numeric>
      [1]        I       [  1, 450]      * |  18.625129699707
      [2]        I       [451, 451]      * |  18.647180557251
      [3]        I       [452, 452]      * |  18.625129699707
      [4]        I       [453, 453]      * | 18.8590793609619
      [5]        I       [454, 454]      * |  19.176929473877
      ...      ...              ...    ... .              ...
  [59885]        I [223907, 223907]      * |  18.647180557251
  [59886]        I [223908, 223908]      * | 18.7531299591064
  [59887]        I [223909, 223910]      * |  18.647180557251
  [59888]        I [223911, 223912]      * | 18.7531299591064
  [59889]        I [223913, 230207]      * |  18.625129699707

or when changed into GPos objects

> gpos.InputIgG.control[seqnames(gpos.InputIgG.control) == "I"]
GPos object with 230207 positions and 1 metadata column:
           seqnames       pos strand |           score
              <Rle> <integer>  <Rle> |       <numeric>
       [1]        I         1      * | 18.625129699707
       [2]        I         2      * | 18.625129699707
       [3]        I         3      * | 18.625129699707
       [4]        I         4      * | 18.625129699707
       [5]        I         5      * | 18.625129699707
       ...      ...       ...    ... .             ...
  [230203]        I    230203      * | 18.625129699707
  [230204]        I    230204      * | 18.625129699707
  [230205]        I    230205      * | 18.625129699707
  [230206]        I    230206      * | 18.625129699707
  [230207]        I    230207      * | 18.625129699707

> gpos.InputIP.control[seqnames(gpos.InputIP.control) == "I"]
GPos object with 230210 positions and 1 metadata column:
           seqnames       pos strand |            score
              <Rle> <integer>  <Rle> |        <numeric>
       [1]        I         1      * | 17.5913696289062
       [2]        I         2      * | 17.5913696289062
       [3]        I         3      * | 17.5913696289062
       [4]        I         4      * | 17.5913696289062
       [5]        I         5      * | 17.5913696289062
       ...      ...       ...    ... .              ...
  [230206]        I    230206      * | 17.5913696289062
  [230207]        I    230207      * | 17.5913696289062
  [230208]        I    230208      * | 17.5913696289062
  [230209]        I    230209      * | 17.5913696289062
  [230210]        I    230210      * | 17.5913696289062

Even though the seqlengths() of the two Granges objects is the same for this chromosome, both of them contain less elements as the sequence length

> seqlengths(gpos.InputIP.control)
      I   
 230218 ...
> seqlengths(gpos.InputIgG.control)
      I  
 230218 ...

Is there a method to expand (extend) every chromosome in each of the GRanges objects based on the seqlengths() of this chromosome? ( What I would like to have for this example are two GPos objects for chromosome I with 230218 positions instead of 230207 and 230210 positions).

I need it so that I can compare the two bigwig files with each other.

 

ADD REPLY
1
Entering edit mode

Hi Assa,

Make sure you start with a GRanges object that has the seqlengths information on it:

library(GenomicRanges)
gr <- GRanges(c("chr1", "chr2", "chr2"),
              IRanges(c(4, 3, 21), c(16, 10, 38)),
              score=1:3/10,
              seqlengths=c(chr1=20, chr2=40))

gr
# GRanges object with 3 ranges and 1 metadata column:
#       seqnames    ranges strand |     score
#          <Rle> <IRanges>  <Rle> | <numeric>
#   [1]     chr1  [ 4, 16]      * |       0.1
#   [2]     chr2  [ 3, 10]      * |       0.2
#   [3]     chr2  [21, 38]      * |       0.3
#   -------
#   seqinfo: 2 sequences from an unspecified genome

seqlengths(gr)
# chr1 chr2 
#   20   40 

Insert the "no score regions":

gr_filled <- GRanges(coverage(gr, weight="score"))
gr_filled
# GRanges object with 8 ranges and 1 metadata column:
#       seqnames    ranges strand |     score
#          <Rle> <IRanges>  <Rle> | <numeric>
#   [1]     chr1  [ 1,  3]      * |         0
#   [2]     chr1  [ 4, 16]      * |       0.1
#   [3]     chr1  [17, 20]      * |         0
#   [4]     chr2  [ 1,  2]      * |         0
#   [5]     chr2  [ 3, 10]      * |       0.2
#   [6]     chr2  [11, 20]      * |         0
#   [7]     chr2  [21, 38]      * |       0.3
#   [8]     chr2  [39, 40]      * |         0
#   -------
#   seqinfo: 2 sequences from an unspecified genome

Turn into a GPos object:

gpos <- GPos(gr_filled)
## Edit: it is MUCH preferred to use an Rle when propagating the
## score to the GPos object: 
#mcols(gpos)$score <- rep(mcols(gr_filled)$score, width(gr_filled))
mcols(gpos)$score <- Rle(mcols(gr_filled)$score, width(gr_filled))
gpos
# GPos object with 60 positions and 1 metadata column:
#        seqnames       pos strand | score
#           <Rle> <integer>  <Rle> | <Rle>
#    [1]     chr1         1      * |     0
#    [2]     chr1         2      * |     0
#    [3]     chr1         3      * |     0
#    [4]     chr1         4      * |   0.1
#    [5]     chr1         5      * |   0.1
#    ...      ...       ...    ... .   ...
#   [56]     chr2        36      * |   0.3
#   [57]     chr2        37      * |   0.3
#   [58]     chr2        38      * |   0.3
#   [59]     chr2        39      * |     0
#   [60]     chr2        40      * |     0
#   -------
#   seqinfo: 2 sequences from an unspecified genome

Then:

gpos[seqnames(gpos) == "chr1"]
# GPos object with 20 positions and 1 metadata column:
#        seqnames       pos strand | score
#           <Rle> <integer>  <Rle> | <Rle>
#    [1]     chr1         1      * |     0
#    [2]     chr1         2      * |     0
#    [3]     chr1         3      * |     0
#    [4]     chr1         4      * |   0.1
#    [5]     chr1         5      * |   0.1
#    ...      ...       ...    ... .   ...
#   [16]     chr1        16      * |   0.1
#   [17]     chr1        17      * |     0
#   [18]     chr1        18      * |     0
#   [19]     chr1        19      * |     0
#   [20]     chr1        20      * |     0
#   -------
#   seqinfo: 2 sequences from an unspecified genome

gpos[seqnames(gpos) == "chr2"]
# GPos object with 40 positions and 1 metadata column:
#        seqnames       pos strand | score
#           <Rle> <integer>  <Rle> | <Rle>
#    [1]     chr2         1      * |     0
#    [2]     chr2         2      * |     0
#    [3]     chr2         3      * |   0.2
#    [4]     chr2         4      * |   0.2
#    [5]     chr2         5      * |   0.2
#    ...      ...       ...    ... .   ...
#   [36]     chr2        36      * |   0.3
#   [37]     chr2        37      * |   0.3
#   [38]     chr2        38      * |   0.3
#   [39]     chr2        39      * |     0
#   [40]     chr2        40      * |     0
#   -------
#   seqinfo: 2 sequences from an unspecified genome

Hope this helps.

H.

ADD REPLY
0
Entering edit mode

I made a small edit to my answer above to use an Rle when propagating the score to the GPos object.

ADD REPLY
0
Entering edit mode

Thanks, this looks very nice. I will try it. I have a question though. I am familiar with the seqlengths when creating a GRanges object. But here I am reading a bedgraph file into R. How do I give the import.bw or import.bedgraph the seqlength() parameter? 

ADD REPLY
0
Entering edit mode

I don't know about getting the seqlengths info thru import.bw() or import.bedgraph(), Michael might be able to help with this. However please note that you can always use the seqlengths() setter to add the seqlengths "manually" to the GRanges object returned by import.bw() or import.bedgraph(). For example, since you seem to be working with yeast:

library(GenomeInfoDb)
si <- Seqinfo(genome="sacCer3")
seqlengths(si)
#   chrI   chrII  chrIII   chrIV ...   chrXV  chrXVI    chrM
# 230218  813184  316620 1531933 ... 1091291  948066   85779 
seqlengths(gr) <- seqlengths(si)

H.

ADD REPLY
0
Entering edit mode

Thanks, this I know. I thought there might be a better way.

ADD REPLY
0
Entering edit mode

It should come automatically from a BigWig file (it's embedded within the file). For bedGraph, unless the file has the genome indicated in a UCSC-style "track line", you will need to tell it the genome via the genome= argument (for example, "hg38").  It will look up the seqlengths automatically from the genome.

ADD REPLY
0
Entering edit mode

But what if I'm using t he Ensembl annotations? so my chromosome names differs from the standard UCSC naming?

ADD REPLY
0
Entering edit mode

The approach described by Herve is not so bad, in general. You can pass the Seqinfo object to import, e.g.,

gr <- import(file, seqinfo=Seqinfo(genome="sacCer3"))

 

ADD REPLY
0
Entering edit mode

after trying this example and not understanding, why it doesn't do what I want, I figured it out. The Rle() options looks better, but if I use it, I can't export the data as bedgraph (or bigwig). I get an error message claiming

Scores must be non-NA numeric values

Is there a way to coerce the Rle score column into a numeric values vector?

ADD REPLY
0
Entering edit mode

I fixed this in devel. Work around is just not to use Rles, or coerce them to numeric vectors on export. The export routines are not that optimized for GPos anyway.

ADD REPLY
0
Entering edit mode

yes thanks, I have figured it out by now. I am using now the rep() command instead. 

ADD REPLY

Login before adding your answer.

Traffic: 667 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6