Question

Creating the right number/distribution of sequences in bootranges

0

Entering edit mode

sam • 0

@11ddb0c4

Last seen 22 months ago

United Kingdom

I am using the block bootstrapping functionality from the nullranges package in order to determine the significance of the overlap between genomic regions undergoing selection (n=117) and genes encoding a particular class of protein (n=394).

> table(seqnames(lassi))

 chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12 chr13
   10    14     7     5     3     3     4     4     7     2     7     5    10
chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22
    6     5     3     5     6     1     3     3     4
> summary(width(lassi))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   5922  101154  175522  301892  272527 4351131

I have followed the example which looks at DNase hypersensitivity sites (DHS) here and tried to adapt the code to my own case. Using the example data, the result of the line:

stats = combined %>% 
  group_by(iter) %>%
  summarize(n = n()) %>%
  as_tibble()
head(stats)

is:

## # A tibble: 6 × 2
##   iter      n
##   <fct> <int>
## 1 0      6214
## 2 1      6144
## 3 2      6265
## 4 3      6233
## 5 4      6291
## 6 5      6357

Which I assume is to make sure the number of ranges in each of the bootstraps is the same as in dhs?

When I exactly the same on my data, i.e. changing dhs to a GenomicRanges object of my regions of selection, the result of head(stats) is:

> head(stats)
# A tibble: 6 × 2
  iter      n
  <fct> <int>
1 0       117
2 1       372
3 2       371
4 3       318
5 4       343
6 5       316

and the first 3 iterations look like this, which seems like something has gone wrong. Is there a parameter that I am supposed to change to make the number of ranges in the bootstrap the same as in my target set of sequences (n=117) and is that the reason for the figure looking different (i.e. with the small peaks at zero in the bootstraps).

enter image description here

bootranges nullranges • 1.2k views

ADD COMMENT • link updated 22 months ago by Michael Love 43k • written 22 months ago by sam • 0

score 0 · Answer 1 · 2023-06-16

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 2 days ago

United States

Can you show table(seqnames(...)) and summary(width(...)).

I'm wondering if genomic regions undergoing selection are of different size or distribution that the peaks examples.

ADD COMMENT • link 22 months ago Michael Love 43k

0

Entering edit mode

Thanks, I've edited my question - it does seem like there is quite a large variance in the size of the selection regions.

ADD REPLY • link 22 months ago sam • 0

0

Entering edit mode

We have performed block bootstrapping on peaks, SNPs, etc. where the features tend to be either ~200-1000 bp or 1 bp for the latter. These features make sense with respect to sampling of blocks of size 500kb. But your features include one that is 4 Mb, which is larger than the block sizes we show in the vignette.

What is the question here? Whether the genes are close to the selection regions? Do you want to compute distance by TSS or the gene body? I'm thinking it would be easier to bootstrap the genes instead of the regions.

ADD REPLY • link 22 months ago Michael Love 43k

0

Entering edit mode

Thanks, that makes sense. Yeah the question is whether or not a particular classes of genes are enriched for regions of selection (and thus likely to be under selection). I guess we are interested in the TSS as well, but we usually just added a generic window around the coordinates of the structural gene. I will try bootstrapping the genes. Otherwise, if I am bootstrapping the selection regions, is there any issue with just increasing the block size to e.g. 8Mb or something else >4Mb?

ADD REPLY • link 22 months ago sam • 0

0

Entering edit mode

The problem with increasing the block size to accommodate the large selection regions is that, the idea of the blocks is to capture e.g. structure at the level of isochores. These are around 300kb - 1MB:

enter image description here

from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1457033/

Gene density follows the isochore pattern to some degree. I would recommend boostrapping the genes then. You can bootstrap the entire gene region if you want, e.g. what you get with genes().

ADD REPLY • link 22 months ago Michael Love 43k

0

Entering edit mode

What do you mean by "generic window"?

Note that you can specify maxgap when you do a join_overlap_* in plyranges, following bootRanges.

ADD REPLY • link 22 months ago Michael Love 43k