Hi there,
I'm an amateur edgeR user, and I'm having trouble generating a heat map for
the differentially expressed genes. All the examples that I've looked at
require that I normalize the counts, but I've already normalized them prior
to doing the analysis in R. I'm running a glm with blocking and have generated
my topTags. From here, I'm not sure how to generate a heatmap. Could you
offer any advice or suggestions?
Best,
Eleanor Su
M.S. Candidate
Department of Biology
University of Nevada, Reno
Reno, Nevada 89577
775-742-4391
Hi Eleanor,
On Thu, Mar 20, 2014 at 4:36 PM, Eleanor Su <eleanorjinsu at gmail.com> wrote:
> Hi there,
>
> I'm an amateur edgeR user, and I'm having trouble generating a heat map for
> the differentially expressed genes. All examples that I've looked at
> requires that I normalize the counts but I've already normalized them prior
> to doing analysis in R.
Can you explain what you mean by that a bit more? You aren't doing any
normalization of your actual counts prior to feeding them to edgeR, are
you?
> I'm running a glm with blocking and have generated
> my topTags. From here, I'm not sure how to generate a heatmap. Could you
> offer any advice or suggestions?
Look at section 2.10 of the edgeR User's Guide (Clustering, heatmaps,
etc.), where the authors note that this is still a matter of research,
but suggest using "moderated log-counts-per-million".
HTH,
-steve
--
Steve Lianoglou
Computational Biologist
Genentech
Hi Eleanor,
Please CC (use "reply-all") the bioconductor mailing list on all
correspondence so that everyone can help with (and benefit from) this
discussion.
Comments in line:
On 21 Mar 2014, at 11:02, Eleanor Su wrote:
>> Can you explain what you mean with that a bit more. You shouldn't be
>> doing any normalization of your actual counts prior to feeding them to
>> edgeR, are you?
>
> I'm only working with small non-coding RNAs of a non-model organism. Since
> this is a fairly new kind of analysis, I'm following someone else's
> pipeline. Thus I've normalized my samples prior to doing analysis in R.
> I've normalized all my counts based on the reads generated.
What I mean is that you shouldn't do that :-)
Have you read through the edgeR User's Guide? The `calcNormFactors`
function does the step that it sounds like you are doing before analysis,
but it also keeps the count data intact, which is what you want. I guess
you are dividing your counts by some normalization constant prior to edgeR
analysis, which is a big no-no.
The (expression) input to edgeR should be the raw count matrix of
features x samples. Many people choose to use only uniquely mapping
reads for this purpose, so it is probably a good idea for you to ensure
that is the case (at least for your first analysis).
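As a rough sketch of what that looks like in practice (not copied from the User's Guide): the raw counts go into a `DGEList`, and `calcNormFactors` stores scaling factors next to them instead of altering them. The simulated `counts` matrix, the `group` labels and the sample names below are made-up placeholders, and `prior.count = 2` for the moderated log-CPM is just an assumed value.

```r
library(edgeR)

# Hypothetical raw counts: 500 features x 6 samples (simulated for illustration).
set.seed(42)
counts <- matrix(rnbinom(500 * 6, mu = 50, size = 5), nrow = 500,
                 dimnames = list(paste0("seq", 1:500),
                                 paste0("sample", 1:6)))

# Hypothetical two-group layout; a blocked design would add a second factor.
group <- factor(rep(c("control", "treated"), each = 3))

# Build the edgeR container from the RAW counts.
y <- DGEList(counts = counts, group = group)

# Compute normalization factors. They are stored in y$samples$norm.factors
# and used by the downstream model fitting; the counts are not rescaled.
y <- calcNormFactors(y)
print(y$samples)

# Check that the stored counts still match the input.
print(all(y$counts == counts))

# Moderated log2 counts-per-million, intended for clustering and heatmaps only.
logcpm <- cpm(y, log = TRUE, prior.count = 2)
```

The `logcpm` matrix is only for visualization; the differential expression test itself should still run on `y`, which holds the raw counts.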
>> Look at section 2.10 of the edgeR User's Guide (Clustering, heatmaps,
>> etc.) where the authors identify this to still be a matter of
>> research, but they suggest to use "moderated log-counts-per-million"
>
> I've generated a heatmap already using this script, but I only want a
> heatmap of the significant differentially expressed sequences.
What script?
> When I generate the heatmap according to section 2.10, I end up with a
> heatmap that I can't even read because it's plotting all the sequences.
> Would you suggest just generating a new file with only significant
> sequences and then generating a heatmap according to section 2.10?
When you call the `heatmap` function (or whatever function you are using
to generate these things; the `aheatmap` function from the NMF package is
quite nice, btw), you should only pass it a matrix that consists of the
rows you want to plot.
You do not have to generate an intermediary new file to do this.
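A minimal sketch of that row subsetting, using base R only. The `logcpm` matrix and the `sig_ids` vector are hypothetical stand-ins for your own expression matrix and your own list of significant sequences:

```r
# Hypothetical stand-in for a sequences x samples matrix of log-CPM values.
set.seed(1)
logcpm <- matrix(rnorm(200 * 6), nrow = 200, ncol = 6,
                 dimnames = list(paste0("seq", 1:200),
                                 paste0("sample", 1:6)))

# Hypothetical stand-in for the IDs of the significant sequences.
sig_ids <- c("seq3", "seq17", "seq42", "seq101", "seq150")

# Index the matrix by row name: keep only the wanted rows and all columns.
# drop = FALSE keeps the result a matrix even if only one row is selected.
sig_logcpm <- logcpm[sig_ids, , drop = FALSE]

# Pass only the subset to the plotting function; no intermediate file needed.
heatmap(sig_logcpm, scale = "row", margins = c(8, 6))
```

With edgeR results, `sig_ids` would usually come from the table returned by `topTags`, for example the row names of the rows whose FDR is below a cutoff you choose.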
Don't take this the wrong way, but it sounds like you are quite new to
not just this analysis, but to R as a whole, since indexing things
(vectors, lists, matrices) is something very basic that you need to
master before being conversant with the language.
If this is the case, I'd strongly recommend you spend some time reading
up on introductory R material (R comes with "An Introduction to R")
before trying to do anything more advanced.
Doing so will not only reduce the chances of you shooting yourself in the
foot by doing something silly, but it will also allow you to get better
(and more considered) help here, since you will be able to ask the type of
questions that leverage the expertise of the people subscribed to this
list.
For instance, if you have questions regarding fundamental "R
programming" type of things (indexing a matrix, for example), you should
direct those to R-help, which you can subscribe to here:
https://stat.ethz.ch/mailman/listinfo/r-help
HTH,
-steve
--
Steve Lianoglou
Computational Biologist
Genentech
Hi Steve,
> Don't take this the wrong way, but it sounds like you are quite new to not
> just this analysis, but to R as a whole since indexing things (vectors,
> lists, matrices) is something very basic that you need to master before
> being conversant with the language.
Indeed, I have limited knowledge of R and edgeR. Thanks for the
suggestion to contact R-help for these questions. Unfortunately, my
graduate program offers very little help with R statistics and even less
with bioinformatics, especially that of small RNAs. With these limited
resources, I feel like I'm working in the dark and my analysis, to say the
least, is cryptic. I'll take a step back before I jump the gun with the
analysis. Thanks for the insight.
Best,
Eleanor
Hi,
On 21 Mar 2014, at 11:39, Eleanor Su wrote:
> Hi Steve,
>
>> Don't take this the wrong way, but it sounds like you are quite new to not
>> just this analysis, but to R as a whole since indexing things (vectors,
>> lists, matrices) is something very basic that you need to master before
>> being conversant with the language.
>
> Indeed, I have limited knowledge in using R and edgeR. Thanks for the
> suggestion to contacting R-help for these questions. Unfortunately my
> graduate program offers very little help with R statistics and even fewer
> with bioinformatics especially that of small RNAs. With these limited
> resources, I feel like I'm working in the dark and my analysis, to say the
> least, is cryptic. I'll take a step back before I jump the gun with the
> analysis. Thanks for the insight.
This exact issue has been making the rounds on the internet due to this
recent blog post:
http://biomickwatson.wordpress.com/2014/03/20/is-this-a-realistic-portrait-of-a-modern-studentpost-doc-in-biology/
So you are not alone ... but rest assured that many of us are here to
help (and happy to do so ;-)
Your analysis is on the right track. You should follow along with the
examples in the edgeR user's guide (or the limma user's guide, for
limma::voom and its extensive linear modeling material) to get an idea of
how to set up analyses for differential expression. Both of these manuals
are very thorough and well worth digesting (be sure to read the relevant
primary publications as well). You should also take a look at the DESeq2
vignette, as similar material is presented there and perhaps this (third)
treatment of the material might help it all to click.
The fact that you are working with small RNAs doesn't change the picture
*too much* for the "simple" differential expression stage of the game
(putting mapping issues for small molecules aside).
Lastly, and this is important, you are also fortunate to be "in
training" during the era of MOOCs. Coursera has a data analysis "track"
that covers many things that will be relevant to you:
https://www.coursera.org/specialization/jhudatascience/1
(and other courses of interest):
https://www.coursera.org/jhu
And ESPECIALLY take note of this class that is starting shortly:
Data Analysis for Genomics
https://www.edx.org/course/harvardx/harvardx-ph525x-data-analysis-genomics-1401
Don't miss it!
The material is exactly the type of stuff that you need to know and, as
a special treat, is taught by top-notch instructors. I'm planning to
audit the class, and I (should ;-) know most of this stuff already!
HTH,
-steve
--
Steve Lianoglou
Computational Biologist
Genentech