Hi,
I imported a gtf file and got a GRangesList with all of the cds by gene using the cdsBy()
command. I'd like to remove the first 45 nucleotides of cds region from each gene. I initially was going to just use lapply()
to resize the first cds exon to width()-45.
lapply(grl, function(x) resize(x[1],width(x[1])-45))
However, a number of first cds ranges are less than 45nt. I could set up a number of conditions by segregating genes by the length of their first few cds ranges, but I was wondering if there is a better/easier way to do this?
Thanks
On the devel package website GenomicFeatures is only up to 1.21.18 (https://www.bioconductor.org/packages/devel/bioc/html/GenomicFeatures.html). Do you have any estimate for when 1.21.20 will be available? Thanks
Hi Michael,
Problem with this is that it doesn't trim the CDS located on the minus strand on the correct side. It also re-orders the CDS, which is problematic:
trimCdsHead()
below corrects this:Then:
H.
The
pmapFromTranscripts()
call should map the first 45bp according to the direction of transcription, right? I agree thepsetdiff
might re-order the exons (left to right), but the regions should be correct. If not, then we should fix the mapping functions.Thanks. However, for my case if the amount that I want to trim is longer than the first cds region, trimCdsHead() throws an error and doesn't work.
Hi Jake,
Oops...
PartitioningByEnd()
needed to be called onans
, not onx
, sorry. I edited the definition oftrimCdsHead()
in my previous answer.Also I forgot to mention that
n
can be an integer vector parallel tox
(i.e. of the same length asx
) so you can specify the trimming you want for each CDS.Cheers,
H.
Hi Herve,
My
cdsBy()
function is not returning cds in the correct order. They are sorted by position along the reference genome rather than position within the transcript.I made a TxDb using the following command:
Below is my sessionInfo()
There is no way the CDS can be ordered by position within each transcript if you group them by gene. Of course they need to be grouped by transcript:
See
?cdsBy
for more information about this.H.
Makes sense. If I want to eventually do RNA-Seq counts over genes rather than transcripts, is the best way to edit the start of each cds by transcript, then combine all these cds by gene, and then use the
reduce()
command?That sounds reasonable to me except that I've some hesitations about the
reduce()
step. Why do you need it? Note that depending on the counting mode you choose when callingsummarizeOverlaps()
, thereduce()
step can impact the counting. For example, after reducing the CDS you'll have more reads that fall completely "within" a given feature:Then:
My understanding is that the default counting mode (Union) should not be affected by the
reduce()
step. See?summarizeOverlaps
for more information about the counting modes.Anyway, unless you know what you are doing and have a good reason for performing the
reduce()
step, I would recommend you avoid it.Hope this makes sense,
H.
Thanks. I count with featureCounts which counts via union as well. I'm pretty sure featureCounts accounts for double counting, but I just wanted to
reduce()
to be sure that a read that fell into two overlapping features was not double counted.I should have added this: extracting the CDS grouped by transcripts, then editing the start of the CDS within each transcript, and then re-grouping the edited CDS by gene will certainly introduce duplicated CDS within some genes. OTOH when you do
cdsBy(, by="gene")
you don't get any duplicated CDS within a given gene. So maybe this is something that you are worried about and that you intended to fix with thereduce()
step. However, I would argue that the right way to get rid of these duplicates is withunique()
, notreduce()
:Anyway my understanding is that these duplicates should not affect the result of
GenomicAlignments::summarizeOverlaps()
and I would be surprised if they affected the output ofRsubread::featureCounts()
. Furthermore, it seems to me that no read counting tool should count more than 1 hit for a given read, whatever the counting mode is (the underlying assumption being that a given read comes from at most 1 place on the reference genome). But don't take my word for it...H.