Hi,
I am trying to figure out how to calculate the expression of a gene via GenomicFeatures and IRanges.
My solution to this is based on the idea where I find group the TxDb by genes, then see if how many transcripts are "above" the gene, that is how many are overlapping, within the range of the gene. This includes obtaining the RNA-seq data from UCSC, then making it a IRange or GRange object, and then using the findOverlaps function from GenomicFeatures to see how many hits I get. Is this the right way to do it?
This is the a small code example:
library(IRanges) library(GenomicFeatures) library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb.test=transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene,by="gene") gr <- GRanges( seqnames=Rle(c("chr1")), ranges=IRanges(start=c(2244435, 3231243, 2142344), end=c(21444351,3331243, 2342344)),) findOverlaps(gr,txdb.test)
The results then show:
> findOverlaps(gr,txdb.test) Hits of length 251 queryLength: 3 subjectLength: 23459 queryHits subjectHits <integer> <integer> 1 1 275 2 1 481 3 1 484 4 1 637 5 1 885 ... ... ... 247 3 275 248 3 3115 249 3 14052 250 3 17865 251 3 20150
Here, we can see that there a 251 hits, and where it was detected. More hits would mean more expression, which means that these genes are more expressed than the others. I hope I am using the right way of thinking here. Am I doing this right?
Probably you would rather
countOverlaps
in your own code, or useGenomicRanges::summarizeOverlaps()
for a more general purpose tool. As Jim mentions this type of counting operation might be a first step in, e.g., differential expression between groups.