Intergenic coordinates with Genomic Feaures
2
0
Entering edit mode
cookie ▴ 20
@cookie-7508
Last seen 9.8 years ago
USA

Hi,

I would like to extract the coordinates for a chromosome. I made a chunk of code, but as I am new to these packages, I am not sure if I follow the right logic here:
 

library(IRanges)
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb = transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by = "gene",use.names=TRUE)

#For example, only I am interested in intergenic coordinates in chromosome 1
seqlevels(txdb, force=TRUE) = c("chr1")

#Creates IRanges 

ir = IRanges(start=unlist(start(txdb)), end=unlist(end(txdb)), names=names(txdb)) 

# Reduce overlapping gene positions to continuous range
ir.red = reduce(ir) # Collapses ranges of overlapping genes to continuous range.

#Identify overlaps among the initial and reduced range data sets
overlap = findOverlaps(ir, ir.red) 

So, the overlap should contain the intergenic coordinate of these genes. Is this right?

A subquestions:

What does the Txdb object really contain? By this I am mean: I am confused what grouping really means. For
example, if I group by genes how is that different if I group by exons or introns? Are not introns and exons part of
the gene, so is not that the same? Maybe the problem is just fundamental and revolves around the definition of a gene, but I am confused : if we have transcripts, isnt that the same as genes?


 

genomicranges genomicfeatures iranges • 6.2k views
ADD COMMENT
2
Entering edit mode

The by argument to transcriptsBy() specifies how the output should be grouped.  When you specify 'gene' you get back a GRangesList with one entry per gene - again this is ~23,500.  For each gene there maybe one or more transcripts and you get the start and end position for each.  This is certainly the right direction for what you're trying to do.

Alternatively, if you specified by = 'exon' you get back a much longer GRangesList (~290,000 entries, that's the total number of exons in the genome).  For each of these you get list of the transcripts that exon could be part of.  That might be interesting in some situations, but isn't very helpful in identifying intergenic regions.

I'm not sure that using IRanges is appropriate for the next step though. By doing this we lose the strand information, which may or may not be important to you.  It depends what you're trying to do.  In the example I gave I was working with a stranded RNA-seq library and looking to see how many mapped to intergenic regions.  Since it was a stranded library I wanted to retain this information - you may not.

Assuming you don't care about strand, I'm also not sure of the logic in the findOverlaps() step.  This is asking 'what elements in ir overlap with elements in ir.red ?'.  Since you generated ir.red by reducing ir the answer will be everything.  What you really want to do is find the spaces between elememts of ir.red.  You can do this with gaps(ir.red), with the caveat that you will miss the regions between the start of the chromosome and the first gene and the end of the last gene and the chromosome end.

ADD REPLY
0
Entering edit mode

Thank you again for answering again.

Well, I can see what you are trying to tell me. But, I many more questions:

1) When I group by a gene, does that mean in the output I have just the genes and nothing else? If yes, then this does not make any sense to me. First, this is a TxDb object, so it has only transcripts. How can I get genes from them? Most genes
have overlapping transcripts, therefore how can we know that a transcript belongs to a certain gene, when again two genes share it! I know we can locate the start of a gene from the TSS, but again, the promoter is around that same point, so should this mean that twogene can share a promoter ( of course this would mean that a promoter is upstream and downstream from the TSS). 

2) "For each of these you get list of the transcripts that exon could be part of" 
So this means that many transcripts can be a part of the same exon, but does that mean they are part of the same gene?

3) You did not use the funciton gaps in your answer. Why? Would that be better?

4) What is a stranded RNA-seq library? Is that something good or bad? I know about single strand and pair end strategies, so is this connect to this approach of manipulating the RNA-seq as you did?

 

ADD REPLY
0
Entering edit mode


1) A gene contains a fixed set of exons and as a general statement we can say that an exon is only associated with one gene.  However you can create various forms of the same gene by including only some of the exons.  It is these sets of exons that are called transcripts.  Thus each gene can contain many transcripts and a specifc exon can appear in one, several or all of those transcripts.

The TxDb object contains information on all of these different transcripts.  When you group by gene this gets the start and end coordinates for each the the transcripts and then returns them to you sorted in a way that says 'this bunch come from gene A' and 'this bunch come from gene B'.

2) It's not that many transcripts can be part of the same exon, it's the otherway around.  An exon can be part of many transcripts, and we would say that all those transcripts come from the same gene.

3) gaps() is used in the third stages of my solution, and since we're working with GRanges there (rather than IRanges), it is able to take into account both strands and the start/ends of chromosomes.

4) Stranded RNA-seq is a technique that allows you to know which strand of DNA was transcribed to generate the read.  Sometimes you can get two different genes encoded on opposite strands and so mapping a read to that position may be ambiguous - you don't know which of the two genes was actually being transcribed. Having a stranded library can be useful as it allows you to determine this.

It's hard to say whether you need to be concerned by this without knowing what you're trying to do.

ADD REPLY
5
Entering edit mode
Mike Smith ★ 6.6k
@mike-smith
Last seen 10 hours ago
EMBL Heidelberg

This is how I have extracted the intergenic regions in the past. First load the transcript database.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)

Then get the coordinates of all the exons in each gene.  

exbygene <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")

The exbygene object here is a GRangesList which has the same number of entries as there are genes (~23,500).  You'll notice that if you choose to do this by transcript the length of the GRangesList will be much larger as some genes have many transcripts. 

However, you aren't interested in the specific exons that make up genes, you want the regions between genes.  To do this we can use the set of functions here:

intergenicRegions <- gaps(unlist(range(exbygene)))

Breaking that down:

range() does away with the exons and gives the start and end of each gene, but we still have a list of length 23,500

unlist() transforms this into a single GRanges object

gaps() inverts the GRanges it's provided with and returns what's between them. In our case thats the space between the genes i.e. the intergenic regions.

Then if you want to select only a subset of chromosomes you can do something like this.

intergenic.chr1 <- intergenicRegions[seqnames(intergenicRegions) == "chr1",]
ADD COMMENT
1
Entering edit mode

Hi,

A slightly simpler way (taking advantage of the genes() extractor):

genes <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene, single.strand.genes.only=FALSE)
intergenicRegions2 <- gaps(unlist(genes))
identical(intergenicRegions, intergenicRegions2)  # TRUE

But more importantly, please note that, when using the knownGene track (a.k.a. "UCSC Genes"), what you get in intergenicRegions (or intergenicRegions2) is only an approximation of the intergenic regions. This is due to the fact that sometimes this track reports more than 1 transcript for a gene not because the gene actually has more than 1 transcript (biologically speaking), but because the location of its (single) transcript is not known with certainty. In that case UCSC reports one transcript for each possible site. See this post

A: Problem with TxDb.Mmusculus.UCSC.mm9.knownGene

for an example of such gene (in Mouse) where the 2 sites are 36Mb apart!

The consequence of this is that, when using the knownGene track, any method that tries to infer the intergenic regions based on the range spanned by all the exons (or transcripts) in each gene can miss some regions. It might be possible to infer the intergenic regions more accurately, maybe by using other UCSC tracks or by getting the TxDb directly from Ensembl (with makeTranscriptDbFromBiomart(), which in BioC devel now also retrieves the transcript type), and/or by using some heuristic to get rid of the genes with ambiguous transcript locations, I don't know. It will probably get slightly more complicated than the above methods.

H.

 

ADD REPLY
0
Entering edit mode

Its interesting but I tried both methods and it worked to get these regions, then I did the following:

 intersect_intergenic<-intersect(intergenicRegions2, resMa, ignore.strand=TRUE)

I intersected some data I had which I know contains intervals that overlap a small number of promoters and a large number of introns yet when I intersected with the intergenic regions it said all of my intervals intersect. So I am wondering whats happening? 

ADD REPLY
0
Entering edit mode

Perhaps you can include one or two of the entries in resMa here so we can run it ourselves?

ADD REPLY
0
Entering edit mode

FWIW you can get the subset of ranges in resMa that overlap with at least one intergenic region with subsetByOverlaps(resMa, intergenicRegions2). So you could just show us the output of head(subsetByOverlaps(resMa, intergenicRegions2)) plus your sessionInfo(). Thanks!

ADD REPLY
0
Entering edit mode

Thanks you for your comment!

Can you please also comment my way? Is that the right logic applied to the problem?

ADD REPLY
0
Entering edit mode

And do I have to take notice about the "+" or "-" strand?

ADD REPLY
0
Entering edit mode

Mike Smith according to your info i found the intergenic coordination and also I have the Grange format of my data that I want to find the intergenic peaks based on the reference intergenic coordinate. So I have a question, in refrence and also Grange file i have start and end positions and I do not know how can I compare according to just position? could you please help me in this isuue that how can I continue. I am really really begginer in this field. Thanks in advance.

ADD REPLY
0
Entering edit mode
mohancm89 • 0
@mohancm89-22777
Last seen 4.9 years ago

@rbronste I had the same issue. The reason is both intergenicRegions2 have a GRange for the entire chromosome (for all the chromosomes). This entry you can find at the end of each chromosome. chr start end chr1 0 195471971 . 0 . chr2 0 182113224 . 0 . etc... I am not sure why these are included.

I filtered out these intergenicRegions_new=intergenicRegions2[!strand(intergenicRegions2)=="*"] This is true with intergenicRegions. Thanks for those oneliners to get the intergenic regions.

ADD COMMENT

Login before adding your answer.

Traffic: 505 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6