Hi all,
I'm looking for some help when analysing RNA-Seq data for differential exon usage/splicing. I've previously used edgeR for DE analysis and was happy to see diffSpliceDGE as part of that package. What I don't know is:
- how to obtain exon level counts
- whether there is a normalisation step involved like there is in a DGE pipeline (i.e. TMM)
For the first point, I have BAM files for the samples I'd like to analyse that were mapped using STAR to the GRCh38 reference. Is there a way to extract exon-level counts from these files? In case it matters I'm analysing data from an exon skipping experiment and so will be looking at differential splicing for our gene of interest as well as any off-target effects.
For the second point, I've been through the example on the documents page (https://rdrr.io/bioc/edgeR/man/diffSpliceDGE.html) and while that made perfect sense I noticed there was no normalisation step used. How, if at all, should the exon level count data be normalised prior to analysis?
Many thanks for your help and time.
Thank you for your reply. I had an old copy of the edgeR manual and hadn't seen the new case studies. Excellent! I've been through the Case Study on the Pasilla knockdown, which looks exactly what I am after. Although I have a couple of questions.
Here's my code and the resulting error:
error:
My counts data looks like this:
annotation looks like this:
As these are exon-level counts, I'd expect the same ID appearing on multiple rows. Any ideas why there's an error?
I see how the use of
Batch <- factor(c(1,3,4,1,3,4))
is trying to account for both with sequencing type beingc('SE','PE','PE','SE','PE','PE')
but the date is more complex. It looks to me that N1 is run over two dates, N3 is run over two dates and N4 is run on one date, it's the D samples where I'm confused, D1 is run over 4 different days but D3 and D4 are both run on the same day. Given that D3 and D4 are also both PE, should it beBatch <- factor(c(1,3,4,1,3,3))
or something similar?Many thanks again for your help.
The error is caused by having duplicate
row.names
, which is not allowed. To get rid of this error, you can simply drop therow.names
of the count matrix before creating theDGEList
object.The exon information is still preserved in
y.all$genes
.I don't see anything wrong with the design. There are 6 RNA-seq libraries from 3 individual samples under 2 conditions. It is a very classic paired two group comparison. The design matrix accounts for the batch effect between different samples. It doesn't take into account the sequencing type/date.
Thank you again. Regarding the design matrix, I guess my confusion comes from the manual saying
and then:
So I couldn't work out how the
Batch <- factor(c(1,3,4,1,3,4))
accounted for the largest source of variation in this data, the sequencing type. To clarify, is that accounted for simply by the fact that the N* and D* samples are paired and also paired in terms of their sequencing type and so the batch effect seen from sequencing type can be accounted for by simply providing a paired design? Would it be redundant to add another level in there specifically for sequencing type e.g.SeqType <- c(0,1,1,0,1,1)
and then usedesign <- model.matrix(~ Batch + SeqType + Pasilla)
?Apologies if this is obvious, I'm just trying to learn and get it right, so thank you for you help :) .
The
Batch
is nested within yourSeqType
. It would be redundant to addSeqType
in the design matrix whenBatch
is already included.So to be clear on the nesting, the
1
group in theSeqType
is further split into3,4
within theBatch
and so the use ofSeqType
is redundant as it's not adding any more information than just includingBatch
. If that's correct then thank you, you have been very helpful indeed and I'll stop pestering you!Yes, that is correct.