Entering edit mode
Hi Alejandro,
When I say "shrunken FCs" I mean the logFC(case/control) returned by
the
estimate2FoldChanges function
When I say "raw fold changes" I'm talking about manually working out
the
fold change from the raw data.
Is what you describe not essentially calculating a shrunken fold
change
value given that I have effectively 2 exons per gene?
So - to calculate what I call the "shrunken fold change" I can do the
following:
res.nofilt[res.nofilt$geneID == g & res.nofilt$exonID == "E001",
"log2fold(case/control)"] # For exons
res.nofilt[res.nofilt$geneID == g & res.nofilt$exonID == "E002",
"log2fold(case/control)"] # For introns
Then, to calculate the "raw fold change" I do this:
log2(mean(counts(bare.nofilt)[paste(g,"E001",sep=":"), 4:6]) /
mean(counts(bare.nofilt)[paste(g,"E001",sep=":"), 1:3])) # For exons
log2(mean(counts(bare.nofilt)[paste(g,"E002",sep=":"), 4:6]) /
mean(counts(bare.nofilt)[paste(g,"E002",sep=":"), 1:3])) # For introns
For these examples,
bare.nofilt is the exonCountSet, r
es.nofilt is derived from:
bare.nofilt <- estimatelog2FoldChanges(bare.nofilt,
averageOutExpression=FALSE, .denominator="control")
res.nofilt <- DEUresultTable(bare.nofilt)
Then if I plot them, it appears at though the estimatelog2FoldChanges
is in
effect, indirectly producing a "shrunken" fold change? Or something
similar
to that? At least that is how I understand it, I may be way off!
plot here:
http://postimg.org/image/vig3slnht/
Thanks,
Jim
On 30 September 2013 17:29, Alejandro Reyes <alejandro.reyes@embl.de>
wrote:
> Hi James,
>
> Before saying anything... one question, what do you mean with
"shrunken
> FCs" and with "raw fold changes"?
>
> It is a bit confusing, since DEXSeq does not do any shrinkage in the
fold
> changes (so far).
> The fold change from the "estimatelog2FoldChanges" function
calculates the
> interaction
> coefficient between your condition of interest and the exons (using
a glm
> fit) to estimate
> the "fitted splicing", which already should be independent from
overall
> gene expression
> effects. This is what you see plotted in the plotDEXSeq function!
After a
> variance stabilizing
> transformation and a log2 transformation are used and this is what
> estimatelog2FoldChanges
> adds to the fData.
>
> Alejandro
>
>
>
>
>
> Hi Wolfgang,
>>
>> Thanks a lot for all your help on this. One final thing I noticed,
which
>> may be of value to anyone following this problem:
>>
>> I noticed that when I use the raw FCs instead of the shrunken FCs
to make
>> the volcano plot, the pattern is much clearer:
>> http://postimg.org/image/**4ri7pgdht/<http: postimg.org="" image="" 4ri7="" pgdht=""/>
>>
>> Please ignore the x axis label.
>>
>> I also notice that the negative clump remains, which is suggestive
that
>> the
>> majority of "changing" genes show generally an increased expression
in
>> intronic regions compared to exonic regions.
>>
>> Wrt your reply, we am now going through individual examples, there
are
>> some
>> potentially interesting patterns of intersting patterns of
expression. I
>> am
>> also using quite large cut offs e.g. at least 200 reads align to
intronic
>> regions for at least n/2 samples (where n is the total number of
samples).
>> This is quite important given the length of intronic regions and
seems to
>> lead to more interesting results when going through the list of top
genes.
>>
>> Cheers again for your help and guidance with thtis,
>>
>> Jim
>>
>>
>> On 29 September 2013 10:30, Wolfgang Huber <whuber@embl.de> wrote:
>>
>> Dear James
>>>
>>> thanks. What you do below makes sense, as far as I can see. What
DEXSeq
>>> considers is, applied to your situation:
>>>
>>> number of reads mapping to E001
>>> ------------------------------**-
>>> number of reads mapping to both
>>>
>>> (and similarly for E002), and it tries to find significant changes
of
>>> that
>>> ratio along with your treatment. Maybe making the Volcano plot
with the
>>> difference of the logarithm of these ratios across condition as
'effect
>>> size' would also be informative, but I am not sure it has
additional
>>> info.
>>> The plots you sent below look OK to me, I'd indeed now go down the
list
>>> of
>>> genes and introns and see whether you can independently verify it
(e.g.
>>> through a biological hypothesis).
>>>
>>> Best wishes
>>> Wolfgang
>>>
>>>
>>>
>>>
>>>
>>> Il giorno Sep 26, 2013, alle ore 5:00 pm, James Perkins <
>>> jimrperkins@gmail.com> ha scritto:
>>>
>>> Hi again,
>>>>
>>>> Maybe I should rephrase the last part better.
>>>>
>>>> We have a fold change for the E001 "exon" i.e. the summed exonic
>>>>
>>> expression
>>>
>>>> we have a fold change for the E002 "exon" i.e. the summed
intronic
>>>>
>>> expression
>>>
>>>> However what we are really interested in is the discrepancy
between the
>>>>
>>> two, i.e. introns go one way, exons the other, or exons don't
change but
>>> introns do, or whatever.
>>>
>>>> I've made a "sort of" volcano plot - I've subtracted the log2
fold
>>>>
>>> change calculated using intron only counting from the log2FC
calculated
>>> using exon only counting.
>>>
>>>> http://postimg.org/image/**o9hi6o4pj/<http: postimg.org="" image="" o9="" hi6o4pj=""/>
>>>>
>>>> I've plotted this against -log10 p val. You see there is some
"volcano"
>>>>
>>> activity although it's quite 1 sides - as I understand it, this
is
>>> suggesting that more genes are increasing in intronic expression
in case
>>> samples compared to exonic expression, since the cluster of
negative
>>> values
>>> with low p values suggests that the FC is higher for introns than
exons.
>>>
>>>> However the "effect size" as I say is pretty small. This is
probably in
>>>>
>>> part due to the way FCs are shrunk by DEXSeq, should I use the FC
>>> shrinking
>>> in this case?
>>>
>>>> BTW There are a couple of outliers in the volcano plot, the real
plot is
>>>>
>>> like this:
>>>
>>>> http://postimg.org/image/**r4c7uhf5t/<http: postimg.org="" image="" r4="" c7uhf5t=""/>
>>>>
>>>> BTW I've not used the plotDEXSeq function because this is exon
focussed
>>>>
>>> if I understand correctly, instead I've used the ggbio package.
>>>
>>>> Still unsure what criteria to use to say whether a gene is really
>>>>
>>> showing increased intronic expression. Probably I can't I can just
go
>>> down
>>> the list of p-values looking at the patterns of expression and try
to say
>>> which are interesting!
>>>
>>>> Cheers,
>>>>
>>>> Jim
>>>>
>>>>
>>>> On 24 September 2013 11:17, James Perkins <jimrperkins@gmail.com>
>>>> wrote:
>>>> Dear Wolfgang, Martin, list
>>>>
>>>> Sorry for the delay
>>>>
>>>> Thanks for the explanation re: counting per exons vs. counting
per gene
>>>>
>>> wrt multiple transcripts. This doesn't effect my "real data"
example
>>> where
>>> I counted reads mapping to all exons and reads mapping to all
introns and
>>> classed them as two different "exons" per gene for the sake of
runnign
>>> DEXSeq. Martin - my approach is very simple so I would say you
idea is
>>> not
>>> quite relevant in this case. However this method could be possibly
>>> extended
>>> to look for individual introns showing an increase in expression,
rather
>>> than detecting genes who show a general relative increase in
intronic
>>> expression summed accross exons/introns.
>>>
>>>> Wolfgang - I agree with your points re: plotting the data - this
find
>>>>
>>> some nice instances of changes in 'intronic expression' in the
case vs.
>>> control samples.
>>>
>>>> Re: this point:
>>>>
>>>>
>>>> - an effect size (fold-change) cutoff might help focus the
results on
>>>>
>>> the most interesting instances. How does the volcano plot look
like?
>>>
>>>>
>>>> I like the idea of the effect size cutoff, however what do you
mean by
>>>>
>>> fold change (for plotting the volcano) do you mean the change in
ratios
>>> of
>>> exon to intronic, or what should I plot? and vs. what? p-value I
assume
>>> but
>>> just to be clear.
>>>
>>>> Thanks a lot again for your help.
>>>>
>>>> Jim
>>>>
>>>>
>>>>
>>>> On 30 August 2013 15:35, Wolfgang Huber <whuber@embl.de> wrote:
>>>> Dear Jim, Martin & list
>>>>
>>>> Martin's proposal of using splice graphs is definitely the way to
go, it
>>>>
>>> is conceptually appealing, would have more power, and results are
>>> potentially more readily interpretable. That said, the current
state of
>>> what we have in DEXSeq does the testing exon-by-exon, and lack of
power
>>> does not seem to be James' problem. Regarding James' questions:
>>>
>>>> 1. It is fine, and to be expected, that the sum of per-exon
counts is
>>>>
>>> larger than the per-gene count. This is OK because a read that
covers
>>> multiple exons is counted multiple times, for each of the
overlaps. This
>>> makes sense for the testing setup of DEXSeq. It tests, marginally,
exon
>>> by
>>> exon, whether its relative usage among all transcripts from the
locus is
>>> condition-dependent. Since the tests are marginal, correlations
between
>>> counts of neighbouring exons are not a problem. Each overlap of a
read
>>> is a
>>> postiive piece of evidence for it being expressed.
>>>
>>>> 2. Regarding your high number of results, here are some
considerations:
>>>> - the fact that swapping sample labels leads to greatly reduced
sets of
>>>>
>>> hits is reassuring
>>>
>>>> - definitely do go ahead with visualising the results to see what
is
>>>>
>>> going on and whether this large number of calls is justified by
the data
>>> (please keep us in the loop)
>>>
>>>> - did you see the "plotDEXSeq" function
>>>> - an effect size (fold-change) cutoff might help focus the
results on
>>>>
>>> the most interesting instances. How does the volcano plot look
like?
>>>
>>>> Hope this helps
>>>> Wolfgang
>>>>
>>>>
>>>> On 29 Aug 2013, at 21:27, James Perkins <jimrperkins@gmail.com>
wrote:
>>>>
>>>> Hi Wolfgang, list
>>>>>
>>>>> Thank you very much for the helpful tip.
>>>>>
>>>>> I did exactly as you suggested, the code is below. I wanted to
do it
>>>>>
>>>> using the pasilla package in order to create an instructive
example,
>>> and so
>>> that you could potentially recreate the analysis, but strangely
sometimes
>>> the gene read counts from pasillaGenes are lower than the summed
exon
>>> counts from the pasillaExon object. Or at least that's the case
according
>>> to my code!
>>>
>>>> If you (or anyone else who is reading this) can think of a good
public
>>>>>
>>>> data set for this please let me know, ideally one that is already
in
>>> bioconductor and fairly easy to manipulate. Otherwise I am happy
to
>>> submit
>>> my current dataset to bioc or make available some other way once
its
>>> corresponding paper has been published
>>>
>>>> So, code below. Any comments or suggestions would be appreciated,
if
>>>>>
>>>> you think there are ways I could tune DEXSeq for this purpose:
>>>
>>>> To start with I loaded in all the necessary files and made a
counts
>>>>>
>>>> object with Gend ID:E001 for intronic and GeneID:E002 for exonic.
>>>
>>>> # c50tx - table of counts using reads mapping to anywhere in gene
>>>>>
>>>> c50ex - reads mapping to exons only
>>>
>>>> intMat <- as.data.frame(c50tx) - as.data.frame(c50ex)
>>>>> exMat <- as.data.frame(c50ex)
>>>>>
>>>>> # Remove genes for which intron count is below 100
>>>>> intMat.2 <- intMat[! apply(intMat, 1, function(x) sum(x < 100))
> 3, ]
>>>>> exMat <- exMat[! apply(intMat, 1, function(x) sum(x < 100)) > 3,
]
>>>>> intMat <- intMat.2
>>>>>
>>>>> row.names(intMat) <- paste(row.names(intMat), ":E001", sep="")
>>>>> row.names(exMat) <- paste(row.names(exMat), ":E002", sep="")
>>>>> # Now I combine the different tables in order to get a combined
table
>>>>>
>>>> as so:
>>>
>>>> both <- rbind(intMat, exMat)
>>>>> both <- both[order(row.names(both)), ]
>>>>> head(both)
>>>>>
>>>>> # Now I turn this into an exon count set
>>>>> gIDs <- matrix(unlist(strsplit(row.**names(both), ":")), ncol=2,
>>>>>
>>>> byrow=TRUE)[,1]
>>>
>>>> names(gIDs) <- row.names(both)
>>>>> gIDs <- as.factor(gIDs)
>>>>> eIDs <- matrix(unlist(strsplit(row.**names(both), ":")), ncol=2,
>>>>>
>>>> byrow=TRUE)[,2]
>>>
>>>> names(eIDs) <- row.names(both)
>>>>> des <- data.frame(condition=c(rep("**case",3), rep("control",
3)))
>>>>> row.names(des) <- colnames(both)
>>>>> ecS <- newExonCountSet(
>>>>> countData = both,
>>>>> design = des,
>>>>> geneIDs=gIDs,
>>>>> exonIDs=eIDs)
>>>>>
>>>>> Once I had this I ran normalisation etc. and looked for genes
with
>>>>>
>>>> 'differential exon/intron usage':
>>>
>>>> ecS <- estimateSizeFactors(ecS)
>>>>> ecS <- estimateDispersions(ecS)
>>>>> ecS <- fitDispersionFunction(ecS)
>>>>> head(fData(ecS)$**dispBeforeSharing)
>>>>>
>>>>> ecS <- testForDEU(ecS)
>>>>> head(fData(ecS)[, c("pvalue", "padjust")])
>>>>> ecS <- estimatelog2FoldChanges(ecS, averageOutExpression=FALSE)
>>>>> res2 <- DEUresultTable(ecS)
>>>>> sum(ecS$pad
>>>>>
>>>>>
>>>>> So - out of 29516 genes 'tested', ~7000 had > 100 intronic
counts for
>>>>>
>>>> at least 3 conditions and ~4000 of these were DEU (FDR<0.1). This
seems
>>> quite a high number of genes, so I wonder if the method is being
>>> conservative enough.
>>>
>>>> That said, tried swapping the sample labels a few times, which
lead to
>>>>>
>>>> 0 genes being called as DEU, which is a good sign I think.
>>>
>>>> Interestingly, comparing intron:exon for case and controls for
the
>>>>>
>>>> significant DEU genes shows a 4:1 ratio of genes that with an
increased
>>> intronic expression in case samples, suggesting the experimental
>>> intervention is leading to increased intronic expression.
>>>
>>>> So, does this approach seem valid? My main question is whether it
is a
>>>>>
>>>> little unconservative. I'm currently using ggbio to plot the read
>>> alignments for the significant genes and looking for interesting
looking
>>> patterns of intronic expression that are consistent across
replicates of
>>> the same class (and that differ between classes). I've found a few
things
>>> that look interesting for follow up.
>>>
>>>> Thanks a lot for your help,
>>>>>
>>>>> Jim
>>>>>
>>>>> On 9 August 2013 16:02, Wolfgang Huber <whuber@embl.de> wrote:
>>>>> Dear James
>>>>> have you already checked the DEXSeq package and the paper
>>>>>
>>>> http://genome.cshlp.org/**content/22/10/2008.long<http: genome.c="" shlp.org="" content="" 22="" 10="" 2008.long="">
>>>
>>>> This does not apply 1:1 to what you are asking for, but afaIcs
the
>>>>>
>>>> main modification would be for you to define "counting bins" (on
which
>>> the
>>> input to DEXSeq is computed by read overlap counting) that
represent (i)
>>> exons and (ii) introns and then check for changes in relative
intro usage
>>> (the 'ratios' you mention below).
>>>
>>>> Let me know how it goes
>>>>> Wolfgang.
>>>>>
>>>>> On 9 Aug 2013, at 10:02, James Perkins <j.perkins@ucl.ac.uk>
wrote:
>>>>>
>>>>> Dear list,
>>>>>>
>>>>>> I would like to know if an experimental treatment leads to a
>>>>>>
>>>>> significant
>>>
>>>> shift in intronic expression for some genes.
>>>>>>
>>>>>> Imagine I have an experiment with 6 biological replicates of a
given
>>>>>> tissue. I believe that the treatment might lead to an increased
>>>>>>
>>>>> intronic
>>>
>>>> expression for some genes, unrelated to exonic expression.
>>>>>>
>>>>>> 3 of these receive no treament, they are used as control. The
other 3
>>>>>> receive experimental treatment.
>>>>>>
>>>>>> I then sequence the mRNA from these samples (Illumina, single
end
>>>>>>
>>>>> reads,
>>>
>>>> ~40 million reads per sample), to obtain 6 fastq files, I align
>>>>>>
>>>>> these to a
>>>
>>>> refernce genome and get bam files.
>>>>>>
>>>>>> I was thinking that a fairly easy way to see if some genes show
a
>>>>>> consistent increased intronic expression following treatment
would
>>>>>>
>>>>> be to
>>>
>>>> count intronically aligning reads for each gene (e.g. using
>>>>>>
>>>>> GenomicRanges)
>>>
>>>> and use something like DESeq to look for genes showing a
significant
>>>>>>
>>>>> change
>>>
>>>> in intronic "expression".
>>>>>>
>>>>>> However, the problem is that this might be due to exonic
expression,
>>>>>>
>>>>> due to
>>>
>>>> premature mRNAs etc., so I might end up finding genes that are
>>>>>> differentially expressed at the exon level, and as a result the
>>>>>>
>>>>> increased
>>>
>>>> exon expression has caused increased intronic expression as a by
>>>>>>
>>>>> product.
>>>
>>>> Obviously I am not so interested in these genes wrt this method,
I
>>>>>>
>>>>> can find
>>>
>>>> these using "traditional" DE methods.
>>>>>>
>>>>>> In addition, when I tried profiling intronic regions using
reads
>>>>>>
>>>>> mapping to
>>>
>>>> introns, (using DESeq) it led to dodgy MA plots, where the 0 FC
line
>>>>>>
>>>>> was
>>>
>>>> quite far above the minimum mean expression point, i.e. it didn't
go
>>>>>> through the middle of the clump of data points (if that makes
>>>>>>
>>>>> sense). I
>>>
>>>> wonder if this is due to the size factor calculation being based
on
>>>>>> intronic "expression" (well, reads mapping to introns), which
is
>>>>>>
>>>>> generally
>>>
>>>> much lower than exon expression and therefore being somewhat
>>>>>>
>>>>> unreliable.
>>>
>>>> So I would like to take this exon expression out of the
equation,
>>>>>>
>>>>> so to
>>>
>>>> speak.
>>>>>>
>>>>>> I thought that one way might be to compare the ratios of exonic
to
>>>>>>
>>>>> intronic
>>>
>>>> reads between samples, for each gene.
>>>>>>
>>>>>> For example one gene might have 30, 35 and 33 exonically
mapping
>>>>>>
>>>>> reads and
>>>
>>>> 10,11 and 12 intronically mapping reads for control samples
>>>>>>
>>>>>> For case samples it might have 33, 32 and 34 exonically mapping
>>>>>>
>>>>> reads;
>>>
>>>> 20,21 and 19 intronically mapping reads.
>>>>>>
>>>>>> So we could compare 10/30, 11/35 and 12/33 for control to
20/33,
>>>>>>
>>>>> 21/32 and
>>>
>>>> 19/34 for case.
>>>>>>
>>>>>> Does this methodology sound reasonable? It is necessarily based
on
>>>>>>
>>>>> the
>>>
>>>> assumption that intronic "expression" due to unspliced RNAs is
>>>>>>
>>>>> correlated
>>>
>>>> with exon expression.
>>>>>>
>>>>>> If it sounds reasonable, is there a test that is recommended to
>>>>>>
>>>>> compare the
>>>
>>>> ratios in such a way, that takes into account the biological
>>>>>>
>>>>> replication of
>>>
>>>> samples? I could do a simple test (chi squared) to compare the
>>>>>>
>>>>> relative
>>>
>>>> frequencies, but this wouldn't take into account the replicates.
>>>>>>
>>>>>> I realise this isn't really a specific bioconductor question,
but
>>>>>>
>>>>> hopefully
>>>
>>>> it might be of interest to some of the list subscribers.
>>>>>>
>>>>>> Many thanks,
>>>>>>
>>>>>> Jim
>>>>>>
>>>>>> --
>>>>>> James R Perkins, PhD
>>>>>>
>>>>>> [[alternative HTML version deleted]]
>>>>>>
>>>>>> ______________________________**_________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor@r-project.org
>>>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: st="" at.ethz.ch="" mailman="" listinfo="" bioconductor="">
>>>>>> Search the archives:
>>>>>>
>>>>> http://news.gmane.org/gmane.**science.biology.informatics.**cond
uctor<http: news.gmane.org="" gmane.science.biology.informatics.conducto="" r="">
>>>
>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> James R Perkins PhD
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> James R Perkins PhD
>>>>
>>>>
>>>>
>>>> --
>>>> James R Perkins PhD
>>>>
>>>
>>>
>>
>
--
James R Perkins PhD
[[alternative HTML version deleted]]