remove set number of nucleotides from each gene's cds region in GRangesList
1
0
Entering edit mode
Jake ▴ 90
@jake-7236
Last seen 2.4 years ago
United States

Hi,

I imported a gtf file and got a GRangesList with all of the cds by gene using the cdsBy() command. I'd like to remove the first 45 nucleotides of cds region from each gene. I initially was going to just use lapply() to resize the first cds exon to width()-45.

lapply(grl, function(x) resize(x[1],width(x[1])-45))

However, a number of first cds ranges are less than 45nt. I could set up a number of conditions by segregating genes by the length of their first few cds ranges, but I was wondering if there is a better/easier way to do this?

Thanks

genomicranges genomicfeatures • 2.0k views
ADD COMMENT
2
Entering edit mode
@michael-lawrence-3846
Last seen 3.1 years ago
United States

In devel (upcoming version 1.21.20) of the GenomicFeatures package, we can do this:

first45 <- pmapFromTranscripts(IRanges(1, 45), grl)
psetdiff(grl, first45)

This maps nt 1-45 into transcript/cds space and subtracts it from the cds regions.

 

ADD COMMENT
0
Entering edit mode

On the devel package website GenomicFeatures is only up to 1.21.18 (https://www.bioconductor.org/packages/devel/bioc/html/GenomicFeatures.html). Do you have any estimate for when 1.21.20 will be available? Thanks

ADD REPLY
0
Entering edit mode

Hi Michael,

Problem with this is that it doesn't trim the CDS located on the minus strand on the correct side. It also re-orders the CDS, which is problematic:

x <- GRangesList(GRanges("chr1", IRanges(c(201, 101), width=20), "-"),
                 GRanges("chr1", IRanges(501, width=35), "+"))
x
# GRangesList object of length 2:
# [[1]] 
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames     ranges strand
#          <Rle>  <IRanges>  <Rle>
#   [1]     chr1 [201, 220]      -
#   [2]     chr1 [101, 120]      -
#
# [[2]] 
# GRanges object with 1 range and 0 metadata columns:
#       seqnames     ranges strand
#   [1]     chr1 [501, 535]      +

psetdiff(x, pmapFromTranscripts(IRanges(1, 5), x))
# GRangesList object of length 2:
# [[1]] 
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames     ranges strand
#          <Rle>  <IRanges>  <Rle>
#   [1]     chr1 [101, 120]      -
#   [2]     chr1 [206, 220]      -
#
# [[2]] 
# GRanges object with 1 range and 0 metadata columns:
#       seqnames     ranges strand
#   [1]     chr1 [506, 535]      +

trimCdsHead() below corrects this:

## The CDS in 'x' must be ordered by ascending rank for each transcript,
## that is, by their position in the transcript. This means that, for a
## transcript located on the minus strand, the CDS should typically be
## ordered by descending position on the reference genome. If 'x' was
## obtained with GenomicFeatures::cdsBy() then the CDS are guaranteed to
## be ordered by ascending rank. See '?cdsBy' for more information.
trimCdsHead <- function(x, n=0)
{
    stopifnot(
        is(x, "GRangesList"),
        is.numeric(n),
        length(n) == 1L || length(n) == length(x),
        all(!is.na(n)),
        all(n >= 0)
    )
    if (!is.integer(n))
        n <- as.integer(n)

    keep_range <- cumsum(width(x)) > n
    ans <- x[keep_range]
    n <- n - sum(width(x)[!keep_range])
    cds1_strand <- as.character(phead(strand(ans), n=1L))
    if (any(n != 0L & cds1_strand %in% "*"))
        stop("some CDS to trim are on the \"*\" strand")

    unlisted_ans <- unlist(ans)
    plus_idx <- which(cds1_strand %in% "+")
    if (length(plus_idx) != 0L) {
        pidx <- start(PartitioningByEnd(ans))[plus_idx]
        start(unlisted_ans)[pidx] <- start(unlisted_ans)[pidx] + n[plus_idx]
    }
    minus_idx <- which(cds1_strand %in% "-")
    if (length(minus_idx) != 0L) {
        midx <- start(PartitioningByEnd(ans))[minus_idx]
        end(unlisted_ans)[midx] <- end(unlisted_ans)[midx] - n[minus_idx]
    }
    relist(unlisted_ans, ans)
}

Then:

trimCdsHead(x, 5)
# GRangesList object of length 2:
# [[1]] 
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames     ranges strand
#          <Rle>  <IRanges>  <Rle>
#   [1]     chr1 [201, 215]      -
#   [2]     chr1 [101, 120]      -
#
# [[2]] 
# GRanges object with 1 range and 0 metadata columns:
#       seqnames     ranges strand
#  [1]     chr1 [506, 535]      +

H.

ADD REPLY
0
Entering edit mode

The pmapFromTranscripts() call should map the first 45bp according to the direction of transcription, right? I agree the psetdiff might re-order the exons (left to right), but the regions should be correct.  If not, then we should fix the mapping functions.

ADD REPLY
0
Entering edit mode

Thanks. However, for my case if the amount that I want to trim is longer than the first cds region, trimCdsHead() throws an error and doesn't work. 

trimCdsHead(x, 25)
GRangesList object of length 2:
[[1]] 
GRanges object with 1 range and 0 metadata columns:
      seqnames     ranges strand
         <Rle>  <IRanges>  <Rle>
  [1]     chr1 [101, 115]      -

[[2]] 
GRanges object with 1 range and 0 metadata columns:
      seqnames     ranges strand
  [1]     chr1 [501, 535]      +

-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Warning message:
In x@start[] <- S4Vectors:::numeric2integer(value) :
  number of items to replace is not a multiple of replacement length
ADD REPLY
0
Entering edit mode

Hi Jake,

Oops... PartitioningByEnd() needed to be called on ans, not on x, sorry. I edited the definition of trimCdsHead() in my previous answer.

Also I forgot to mention that n can be an integer vector parallel to x (i.e. of the same length as x) so you can specify the trimming you want for each CDS.

Cheers,

H.

ADD REPLY
0
Entering edit mode

Hi Herve,

My cdsBy() function is not returning cds in the correct order. They are sorted by position along the reference genome rather than position within the transcript. 

GRangesList object of length 22458:
$ENSMUSG00000000001.4 
GRanges object with 8 ranges and 2 metadata columns:
      seqnames                 ranges strand |    cds_id    cds_name
         <Rle>              <IRanges>  <Rle> | <integer> <character>
  [1]     chr3 [108109422, 108109612]      - |     39069        <NA>
  [2]     chr3 [108111935, 108112088]      - |     39070        <NA>
  [3]     chr3 [108112473, 108112602]      - |     39071        <NA>
  [4]     chr3 [108115763, 108115891]      - |     39072        <NA>
  [5]     chr3 [108118301, 108118458]      - |     39073        <NA>
  [6]     chr3 [108123542, 108123683]      - |     39074        <NA>
  [7]     chr3 [108123795, 108123837]      - |     39075        <NA>
  [8]     chr3 [108145888, 108146005]      - |     39076        <NA>

$ENSMUSG00000000003.13 
GRanges object with 6 ranges and 2 metadata columns:
      seqnames               ranges strand | cds_id cds_name
  [1]     chrX [77841883, 77841911]      - | 205797     <NA>
  [2]     chrX [77842515, 77842616]      - | 205798     <NA>
  [3]     chrX [77842897, 77843007]      - | 205799     <NA>
  [4]     chrX [77845019, 77845086]      - | 205800     <NA>
  [5]     chrX [77847975, 77848114]      - | 205801     <NA>
  [6]     chrX [77853409, 77853483]      - | 205802     <NA>

...
<22456 more elements>

I made a TxDb using the following command:

gencodeBasicvM6 <- makeTxDbFromGFF('~/Downloads/gencode.vM6.basic.annotation.gtf', format='gtf')
cdsBy(gencode, by='gene')

Below is my sessionInfo()

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rtracklayer_1.28.9     GenomicFeatures_1.20.3 AnnotationDbi_1.30.1   Biobase_2.28.0         GenomicRanges_1.20.6   GenomeInfoDb_1.4.2    
[7] IRanges_2.2.7          S4Vectors_0.6.4        BiocGenerics_0.14.0   

loaded via a namespace (and not attached):
 [1] XML_3.98-1.3            Rsamtools_1.20.4        Biostrings_2.36.4       GenomicAlignments_1.4.1 bitops_1.0-6            futile.options_1.0.0   
 [7] DBI_0.3.1               RSQLite_1.0.0           zlibbioc_1.14.0         XVector_0.8.0           futile.logger_1.4.1     lambda.r_1.1.7         
[13] BiocParallel_1.2.20     tools_3.2.2             biomaRt_2.24.0          RCurl_1.95-4.7   
ADD REPLY
0
Entering edit mode

There is no way the CDS can be ordered by position within each transcript if you group them by gene. Of course they need to be grouped by transcript:

cdsBy(gencode, by="tx")  # NOT by="gene"

See ?cdsBy for more information about this.

H.

ADD REPLY
0
Entering edit mode

Makes sense. If I want to eventually do RNA-Seq counts over genes rather than transcripts, is the best way to edit the start of each cds by transcript, then combine all these cds by gene, and then use the reduce() command?

ADD REPLY
0
Entering edit mode

That sounds reasonable to me except that I've some hesitations about the reduce() step. Why do you need it? Note that depending on the counting mode you choose when calling summarizeOverlaps(), the reduce() step can impact the counting. For example, after reducing the CDS you'll have more reads that fall completely "within" a given feature:

library(GenomicAlignments)
features <- GRangesList(
    GRanges("chr1", IRanges(c(8, 12), c(14, 16))),
    GRanges("chr1", IRanges(100, 300)))
reads <- GRanges("chr1", IRanges(11, 15))

Then:

assay(summarizeOverlaps(features, reads, mode=IntersectionStrict))
#      reads
# [1,]     0
# [2,]     0

assay(summarizeOverlaps(reduce(features), reads, mode=IntersectionStrict))
#      reads
# [1,]     1
# [2,]     0

My understanding is that the default counting mode (Union) should not be affected by the reduce() step. See ?summarizeOverlaps for more information about the counting modes.

Anyway, unless you know what you are doing and have a good reason for performing the reduce() step, I would recommend you avoid it.

Hope this makes sense,

H.

ADD REPLY
0
Entering edit mode

Thanks. I count with featureCounts which counts via union as well. I'm pretty sure featureCounts accounts for double counting, but I just wanted to reduce() to be sure that a read that fell into two overlapping features was not double counted.

ADD REPLY
0
Entering edit mode

I should have added this: extracting the CDS grouped by transcripts, then editing the start of the CDS within each transcript, and then re-grouping the edited CDS by gene will certainly introduce duplicated CDS within some genes. OTOH when you do cdsBy(, by="gene") you don't get any duplicated CDS within a given gene. So maybe this is something that you are worried about and that you intended to fix with the reduce() step. However, I would argue that the right way to get rid of these duplicates is with unique(), not reduce():

features <- GRangesList(
    GRanges("chr1", IRanges(c(8, 12, 8), c(14, 16, 14))),
    GRanges("chr1", IRanges(100, 300)))
features
# GRangesList object of length 2:
# [[1]] 
# GRanges object with 3 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chr1  [ 8, 14]      *
#   [2]     chr1  [12, 16]      *
#   [3]     chr1  [ 8, 14]      *
# 
# [[2]] 
# GRanges object with 1 range and 0 metadata columns:
#       seqnames     ranges strand
#   [1]     chr1 [100, 300]      *

unique(features)
# GRangesList object of length 2:
# [[1]] 
# GRanges object with 2 ranges and 0 metadata columns:
#      seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chr1  [ 8, 14]      *
#   [2]     chr1  [12, 16]      *
#
# [[2]] 
# GRanges object with 1 range and 0 metadata columns:
#       seqnames     ranges strand
#   [1]     chr1 [100, 300]      *

Anyway my understanding is that these duplicates should not affect the result of GenomicAlignments::summarizeOverlaps() and I would be surprised if they affected the output of Rsubread::featureCounts(). Furthermore, it seems to me that no read counting tool should count more than 1 hit for a given read, whatever the counting mode is (the underlying assumption being that a given read comes from at most 1 place on the reference genome). But don't take my word for it...

H.

ADD REPLY

Login before adding your answer.

Traffic: 578 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6