Question

Splitting TxDb by Strand

0

Entering edit mode

rproendo ▴ 30

@rproendo-17985

Last seen 13 months ago

United States

Hi all,

I'm trying to figure out if it's possible to subset a TxDb object made with GenomicFeatures to only include features on a specific strand. I'm using makeTxDbFromEnsembl.

I see that the authors describe how to do a similar split based on chromosome in the package vignette, but it isn't clear to me if that function can be used to this specific end. Maybe some functionality with the tx_attrib parameter

Any suggestions would be very appreciated.

GenomicFeatures TxDb • 1.7k views

ADD COMMENT • link 4.2 years ago rproendo ▴ 30

score 0 · Answer 1 · 2021-02-03

0

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 21 minutes ago

United States

It's not super clear what you're after. You don't really do anything directly with a TxDb object/package, but instead extract information that you can use to do things. For example, a GRangesList of the transcripts:

> library(TxDb.Hsapiens.UCSC.hg38.knownGene)
> tx <- transcriptsBy(TxDb.Hsapiens.UCSC.hg38.knownGene)
## just get all those transcripts on the positive strand
> txpos <- tx[strand(tx) == "+"]
## and get rid of the resulting empty GRanges
 > txpos <- txpos[lengths(txpos) > 0L]
 > txpos
 GRangesList object of length 14100:
 $`10`
 GRanges object with 2 ranges and 2 metadata columns:
       seqnames            ranges strand |     tx_id           tx_name
          <Rle>         <IRanges>  <Rle> | <integer>       <character>
   [1]     chr8 18391282-18401218      + |     92729 ENST00000286479.4
   [2]     chr8 18391287-18400993      + |     92730 ENST00000520116.1
   -------
   seqinfo: 595 sequences (1 circular) from hg38 genome

 $`100009676`
 GRanges object with 1 range and 2 metadata columns:
       seqnames              ranges strand |     tx_id           tx_name
          <Rle>           <IRanges>  <Rle> | <integer>       <character>
   [1]     chr3 101676475-101679217      + |     40065 ENST00000609682.1
   -------
   seqinfo: 595 sequences (1 circular) from hg38 genome

 $`10002`
 GRanges object with 3 ranges and 2 metadata columns:
       seqnames            ranges strand |     tx_id           tx_name
          <Rle>         <IRanges>  <Rle> | <integer>       <character>
   [1]    chr15 71792638-71818259      + |    158012 ENST00000621736.4
   [2]    chr15 71810554-71818253      + |    158013 ENST00000617575.5
   [3]    chr15 71810592-71814929      + |    158014 ENST00000621098.1
   -------
   seqinfo: 595 sequences (1 circular) from hg38 genome

 ...
 <14097 more elements>

Is that the sort of thing you mean?

ADD COMMENT • link 4.2 years ago James W. MacDonald 68k

0

Entering edit mode

Hi James,

Thanks for your reply. I'll try to add some detail on my goals to help clarify, since most of my issue is coming from inexperience. I'm not interested in extracting transcripts by strand per se, but splitting the entire TxDb into two subsets.

I'm using Chipseeker to identify genes proximate to motifs in a custom bed file (being used in place of ChIP peaks). The strand information for the bed file and the TxDb aren't playing nicely, so I'm frequently identifying matches that fall on opposite strand (which I do not want).

To get around this, I'm manually splitting the custom bed into positive and negative strand features, then feeding those two individually into Chipseeker to identify matches. However, because of how I'm using Chipseeker, I'm still getting stuck with some opposite strand matches. Short of (1) manually validating that all proximate genes fall on the same strand as the features in my custom bed or (2) remedying whatever formatting issue is stopping Chipseeker from matching by strand, I presumed I could split the TxDb file such that it only contained positive strand features, then run that with the positive strand custom bed file.

The Chipseeker function being used here is annotatePeak, which takes as input (1) a peak file (my custom bed) and (2) a txdb object.

Hope this clarifies things

ADD REPLY • link 4.2 years ago rproendo ▴ 30

0

Entering edit mode

And the sameStrand argument to annotatePeak doesn't do what you expect?

ADD REPLY • link 4.2 years ago James W. MacDonald 68k

0

Entering edit mode

Correct. I don't think it's any fault with annotatePeak, unless it's an error reading in the peak file. I've attempted manually adjusting headers and symbols (e.g., +/- versus 1/2). It's worth noting that the custom bed/peak file and TxDb are from the same Ensembl gtf, but the custom bed is coming out of a PWM motif identification tool.

ADD REPLY • link 4.2 years ago rproendo ▴ 30

0

Entering edit mode

The TxDb is basically an SQLite database with a bunch of tables, so subsetting that by strand would be ... tedious. The first argument to annotatePeak can be either a file or a GRanges object, so you don't have to rely on whatever annotatePeak is doing - you can just read it in yourself and ensure that the strand information is OK. In addition, the TxDb argument doesn't have to be a TxDb. It can be a GRanges object as well, so you could use genes from GenomicFeatures or unlist on the GRangesList object you get from transcriptsBy, which is probably what I would do.

In that situation you could subset the GRanges you pass in as the 'TxDb' argument to one strand or the other to force the issue.