I'm trying to figure out if it's possible to subset a TxDb object made with GenomicFeatures to only include features on a specific strand. I'm using makeTxDbFromEnsembl.
I see that the authors describe how to do a similar split based on chromosome in the package vignette, but it isn't clear to me if that function can be used to this specific end. Maybe some functionality with the tx_attrib parameter
It's not super clear what you're after. You don't really do anything directly with a TxDb object/package, but instead extract information that you can use to do things. For example, a GRangesList of the transcripts:
> library(TxDb.Hsapiens.UCSC.hg38.knownGene)
> tx <- transcriptsBy(TxDb.Hsapiens.UCSC.hg38.knownGene)
## just get all those transcripts on the positive strand
> txpos <- tx[strand(tx) == "+"]
## and get rid of the resulting empty GRanges
> txpos <- txpos[lengths(txpos) > 0L]
> txpos
GRangesList object of length 14100:
$`10`
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr8 18391282-18401218 + | 92729 ENST00000286479.4
[2] chr8 18391287-18400993 + | 92730 ENST00000520116.1
-------
seqinfo: 595 sequences (1 circular) from hg38 genome
$`100009676`
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr3 101676475-101679217 + | 40065 ENST00000609682.1
-------
seqinfo: 595 sequences (1 circular) from hg38 genome
$`10002`
GRanges object with 3 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr15 71792638-71818259 + | 158012 ENST00000621736.4
[2] chr15 71810554-71818253 + | 158013 ENST00000617575.5
[3] chr15 71810592-71814929 + | 158014 ENST00000621098.1
-------
seqinfo: 595 sequences (1 circular) from hg38 genome
...
<14097 more elements>
Thanks for your reply. I'll try to add some detail on my goals to help clarify, since most of my issue is coming from inexperience. I'm not interested in extracting transcripts by strand per se, but splitting the entire TxDb into two subsets.
I'm using Chipseeker to identify genes proximate to motifs in a custom bed file (being used in place of ChIP peaks). The strand information for the bed file and the TxDb aren't playing nicely, so I'm frequently identifying matches that fall on opposite strand (which I do not want).
To get around this, I'm manually splitting the custom bed into positive and negative strand features, then feeding those two individually into Chipseeker to identify matches. However, because of how I'm using Chipseeker, I'm still getting stuck with some opposite strand matches. Short of (1) manually validating that all proximate genes fall on the same strand as the features in my custom bed or (2) remedying whatever formatting issue is stopping Chipseeker from matching by strand, I presumed I could split the TxDb file such that it only contained positive strand features, then run that with the positive strand custom bed file.
The Chipseeker function being used here is annotatePeak, which takes as input (1) a peak file (my custom bed) and (2) a txdb object.
Correct. I don't think it's any fault with annotatePeak, unless it's an error reading in the peak file. I've attempted manually adjusting headers and symbols (e.g., +/- versus 1/2). It's worth noting that the custom bed/peak file and TxDb are from the same Ensembl gtf, but the custom bed is coming out of a PWM motif identification tool.
The TxDb is basically an SQLite database with a bunch of tables, so subsetting that by strand would be ... tedious. The first argument to annotatePeak can be either a file or a GRanges object, so you don't have to rely on whatever annotatePeak is doing - you can just read it in yourself and ensure that the strand information is OK. In addition, the TxDb argument doesn't have to be a TxDb. It can be a GRanges object as well, so you could use genes from GenomicFeatures or unlist on the GRangesList object you get from transcriptsBy, which is probably what I would do.
In that situation you could subset the GRanges you pass in as the 'TxDb' argument to one strand or the other to force the issue.
Hi James,
Thanks for your reply. I'll try to add some detail on my goals to help clarify, since most of my issue is coming from inexperience. I'm not interested in extracting transcripts by strand per se, but splitting the entire TxDb into two subsets.
I'm using Chipseeker to identify genes proximate to motifs in a custom bed file (being used in place of ChIP peaks). The strand information for the bed file and the TxDb aren't playing nicely, so I'm frequently identifying matches that fall on opposite strand (which I do not want).
To get around this, I'm manually splitting the custom bed into positive and negative strand features, then feeding those two individually into Chipseeker to identify matches. However, because of how I'm using Chipseeker, I'm still getting stuck with some opposite strand matches. Short of (1) manually validating that all proximate genes fall on the same strand as the features in my custom bed or (2) remedying whatever formatting issue is stopping Chipseeker from matching by strand, I presumed I could split the TxDb file such that it only contained positive strand features, then run that with the positive strand custom bed file.
The Chipseeker function being used here is annotatePeak, which takes as input (1) a peak file (my custom bed) and (2) a txdb object.
Hope this clarifies things
And the sameStrand argument to
annotatePeak
doesn't do what you expect?Correct. I don't think it's any fault with annotatePeak, unless it's an error reading in the peak file. I've attempted manually adjusting headers and symbols (e.g., +/- versus 1/2). It's worth noting that the custom bed/peak file and TxDb are from the same Ensembl gtf, but the custom bed is coming out of a PWM motif identification tool.
The
TxDb
is basically an SQLite database with a bunch of tables, so subsetting that by strand would be ... tedious. The first argument toannotatePeak
can be either a file or aGRanges
object, so you don't have to rely on whateverannotatePeak
is doing - you can just read it in yourself and ensure that the strand information is OK. In addition, theTxDb
argument doesn't have to be aTxDb
. It can be aGRanges
object as well, so you could usegenes
fromGenomicFeatures
orunlist
on theGRangesList
object you get fromtranscriptsBy
, which is probably what I would do.In that situation you could subset the
GRanges
you pass in as the 'TxDb' argument to one strand or the other to force the issue.Awesome -- thank you. I'll give these a shot.