Question

summarizeOverlaps counting exons and 5UTR reads

0

Entering edit mode

Walter F. Baumann ▴ 10

@walter-f-baumann-12439

Last seen 7.4 years ago

Hey,

I try to compare counts from 2 RNA seq data. I am interested in the 5'UTR and want to compare this number with the total number of reads per transcript. Here my code:

txdb is the gtf file from Gencode (v19) "Comprehensive gene annotation"

txdb_5utr_transcript <- fiveUTRsByTranscript(txdb, use.names=T)
txdb_exons_transcripts <- exonsBy(txdb, by="tx", use.name=T)

# Count reads per 5UTR in transcript for
## RNAseq exp 1
     x1 <- summarizeOverlaps(features= txdb_5utr_transcript, reads = myrnaseq1,    mode=IntersectionNotEmpty, inter.feature=T) # uses same counting algorithm as in HTSeq; double counted in case of overlapping
     x1 <- assay(x1)
     x1 <- data.frame(Tx=as.character(rownames(x1)), count_rna15UTRt=as.numeric(x1[,1]), row.names = NULL)
## RNAseq exp 2
     x2 <- summarizeOverlaps(features= txdb_5utr_transcript, reads = myrnaseq2, mode=IntersectionNotEmpty, inter.feature=T) # uses same counting algorithm as in HTSeq
     x2 <- assay(x2)
     x2 <- data.frame(Tx=as.character(rownames(x2)), count_rna25UTRt=as.numeric(x2[,1]), row.names = NULL)

# Count reads per exons in transcript for
## RNAseq exp 1
     y1 <- summarizeOverlaps(features= txdb_exons_transcripts, reads = myrnaseq1, mode=IntersectionNotEmpty, inter.feature=T) # uses same counting algorithm as in HTSeq; double counted in case of overlapping
     y1 <- assay(y1)
     y1 <- data.frame(Tx=as.character(rownames(y1)), count_rna1t=as.numeric(y1[,1]), row.names = NULL)
## RNAseq exp 2
     y2 <- summarizeOverlaps(features= txdb_exons_transcripts, reads = myrnaseq2, mode=IntersectionNotEmpty, inter.feature=T) # uses same counting algorithm as in HTSeq
     y2 <- assay(y2)
     y2 <- data.frame(Tx=as.character(rownames(y2)), count_rna2t=as.numeric(y2[,1]), row.names = NULL)

# Merge tables
     z <- merge(x1,x2, by="Tx")
     z <- merge(z,y1, by="Tx")
     z <- merge(z,y2, by="Tx")

# Sum up counts of both reads and calculate difference (this is due to my research question)
     z[,"sum1"] <- apply(z[,2:3], 1, function(i) i[1]+i[2])
     z[,"sum2"] <- apply(z[,4:5], 1, function(i) i[1]+i[2])
     z[,"diff"] <- apply(z[,6:7], 1, function(i) i[2]-i[1])

However, there are multiple cases in which the difference is negative, which means that there are more reads in the 5'UTR than in the whole transcript, which does not make sense.

I double-checked my txdb_5utr_transcript and my txdb_exons_transcripts object: txdb_5utr_transcript is a subset of txdb_exons_transcripts, so this is not the problem. Any ideas?

summarizeoverlaps rnaseq • 1.6k views

ADD COMMENT • link updated 8.2 years ago by Hervé Pagès 16k • written 8.2 years ago by Walter F. Baumann ▴ 10

score 1 · Answer 1 · 2017-02-28

It looks like it's possible:

## 2 transcripts tx1 (2 exons) and tx2 (1 exon):
tx <- GRangesList(tx1=GRanges(c("chr1:1-10", "chr1:21-30")),
                  tx2=GRanges("chr1:1-30"))

## The 5' UTRs for transcripts tx1 and tx2:
utr <- GRangesList(tx1=GRanges("chr1:1-8"),
                   tx2=GRanges("chr1:1-4"))

## A single read:
reads <- GRanges("chr1:6-15")

## Count reads per transcript:
assay(summarizeOverlaps(features=tx, reads=reads,
                        mode=IntersectionNotEmpty,
                        inter.feature=T))
#     reads
# tx1     0
# tx2     1

## Count reads per 5' UTR:
assay(summarizeOverlaps(features=utr, reads=reads,
                        mode=IntersectionNotEmpty,
                        inter.feature=T))
#     reads
# tx1     1
# tx2     0

The exact details of how the IntersectionNotEmpty mode works are explained in ?summarizeOverlaps. The key is that "regions that are shared between features are discarded". This means that when counting reads per transcript, the chr1:1-10 and chr1:21-30 regions are removed from tx1 and tx2 so tx1 is replaced by an empty region and tx2 by region chr1:11-20. This explains the counts: 0 on tx1 and 1 on tx2. Now when counting reads per 5' UTR, the chr1:1-4 region is removed from tx1 and tx2 so tx1 is replaced by region chr1:5-8 and tx2 by an empty region. This explains the counts: 1 on tx1 and 0 on tx2.

Hope this helps,

H.