Is there a way to remove or take the first overlapping gene(s) when using makeTxDbFromGFF?
2
0
Entering edit mode
endrebak85 ▴ 40
@endrebak85-10660
Last seen 5.4 years ago
github.com/endrebak/

If there are multiple genes that overlap, I only want the first gene. Is there a way to do this in genomicfeatures already or do I need to preprocess the GTF file before calling makeTxDbFromGFF? 

 

Thanks.

genomicfeatures • 1.8k views
ADD COMMENT
2
Entering edit mode
@michael-lawrence-3846
Last seen 3.1 years ago
United States

No need to preprocess any files. Just import the file as a GRanges, subset it, and pass it to makeTxDbFromGRanges.
 

ADD COMMENT
0
Entering edit mode

How do I remove overlapping genes by subsetting granges?

ADD REPLY
1
Entering edit mode

It would be difficult, because I think it would need to be iterative. For example, you could have this situation:

----
  ----
    ----
      ----

You do not want to simply remove ranges 2,3,4. Rather, I think you want to keep 1 and 3.  Thus, the algorithm would need to remove range 2, decide to keep range 3 since 2 is gone, and remove range 4.

Maybe something this (untested) will help you get started:

m <- as.matrix(findOverlaps(gr))
drop <- integer(0L)
while(nrow(m) > 0L) {
    last <- max(m)
    drop <- c(drop, last)
    m <- m[m[,1] != last & m[,2] != last,]
}
gr <- gr[-drop]

Please consider whether findOverlaps should have ignore.strand=TRUE or not.

ADD REPLY
2
Entering edit mode
@herve-pages-1542
Last seen 9 hours ago
Seattle, WA, United States

Hi,

makeTxDbFromGFF() has nothing to do with finding overlaps. It just imports the genes, transcripts, exons, and CDS from a GFF file into a TxDb object. Once you've done this, you can extract the genomic coordinates of the genes, transcripts, exons, or CDS from the TxDb object with genes(), transcripts(), exons(), or cds():

txdb <- makeTxDbFromGFF("path/to/GFF/file")
gn <- genes(txdb)  # extract the genes (see ?genes for more information)

Then you can use findOverlaps() to find the overlaps between a set of genomic ranges (e.g. some aligned reads) and gn:

findOverlaps(reads, gn)

If you only want the 1st overlapping gene for each range in reads, call findOverlaps() with select="first":

findOverlaps(reads, gn, select="first")

Please see ?findOverlaps for more information.

Cheers,

H.

ADD COMMENT

Login before adding your answer.

Traffic: 648 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6