Hi,
We counted the reads in our RNA-seq data using HT-seq and removed any
isoforms that have <5 reads/sample. We then used DESeq for
differential expression analysis.
Here's an example of a transcript that has the following read counts:
GeneA_cases counts:
85.78942
19.11753
1471.813
61.71464
GeneA_control counts:
2088.722
2681.746
2413.892
1628.187
DESeq p-value for GeneA is 10^-4. Do we have to filter out transcripts
(that have high variance between samples as shown in the above
example) before giving the data to DESeq or will DESeq take this into
account while calculating the normalization?
Thank you very much.
Regards,
Nirmala
----------------------------------------------------------------------
--------------------------------------------------------
Contractor
Building 35, Room 1A-205
35 Convent Drive,
National Institute of Mental Health/NIH
Bethesda
MD - 20892
Phone# 301-451-4258
On Fri, Dec 14, 2012 at 2:42 PM, Akula, Nirmala (NIH/NIMH) [C] <
akulan@mail.nih.gov> wrote:
> DESeq p-value for GeneA is 10^-4. Do we have to filter out transcripts
> (that have high variance between samples as shown in the above example)
> before giving the data to DESeq or will DESeq take this into account while
> calculating the normalization?
>
Hi, Nirmala.
If you mean filtering out transcripts that show one or more outliers within
a given group, then you should ABSOLUTELY NOT do that as this will bias
your statistical results. If you mean filtering based on overall variance
(across groups) to find highly-variable transcripts, that is a different
story and is acceptable.
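To make the distinction concrete, here is a minimal sketch of the acceptable kind of filter: variance is computed over all samples pooled together, so the group labels never enter the filter. It is written in plain Python purely for illustration (DESeq itself is an R package); the function name, the keep_fraction cutoff, and the flat "GeneB" transcript are assumptions made up for this example, and the GeneA values are the rounded counts from the original post.

```python
from statistics import variance


def filter_by_overall_variance(counts, keep_fraction=0.5):
    """Keep the most variable transcripts, ignoring group labels.

    counts: dict mapping transcript id -> counts for ALL samples
            (cases and controls pooled into one list).
    keep_fraction: hypothetical cutoff, the fraction of transcripts kept.
    """
    # Variance over all samples at once; cases/controls are never separated.
    overall_var = {tid: variance(vals) for tid, vals in counts.items()}
    # Rank transcripts from most to least variable.
    ranked = sorted(overall_var, key=overall_var.get, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return {tid: counts[tid] for tid in ranked[:n_keep]}


counts = {
    # 4 cases followed by 4 controls, rounded from the original post
    "GeneA": [86, 19, 1472, 62, 2089, 2682, 2414, 1628],
    # made-up transcript with almost no variation across samples
    "GeneB": [100, 101, 99, 100, 100, 102, 98, 100],
}
kept = filter_by_overall_variance(counts, keep_fraction=0.5)
```

Note that GeneA is kept with its within-group outlier (1472 among the cases) left in place; dropping it because of that outlier would be the kind of filtering that biases the test.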
Sean
Thanks Sean for your response.
Regards,
Nirmala
From: Davis, Sean (NCI) On Behalf Of Davis, Sean (NIH/NCI) [E]
Sent: Friday, December 14, 2012 4:45 PM
To: Akula, Nirmala (NIH/NIMH) [C]
Cc: bioconductor@r-project.org
Subject: Re: [BioC] filtering before using DESeq
Hi, Nirmala.
If you mean filtering out transcripts that show one or more outliers
within a given group, then you should ABSOLUTELY NOT do that as this
will bias your statistical results. If you mean filtering based on
overall variance (across groups) to find highly-variable transcripts,
that is a different story and is acceptable.
Sean
Dear Akula, Sean
Besides overall variance, the overall sum is also a good filter statistic.
Akula, please note that DESeq expects counts, which need to be
non-negative integer values. The values you state are not integers.
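As a rough sketch of both points, the snippet below flags transcripts whose values are not non-negative integers and then filters on the overall sum across all samples. It is plain Python for illustration only (DESeq is an R package); the function names, the min_total threshold of 10, and the "GeneLow" transcript are hypothetical and not a recommended cutoff. The GeneA values are the ones quoted in the original post.

```python
def non_integer_transcripts(counts):
    """Return ids of transcripts with negative or fractional values."""
    bad = []
    for tid, vals in counts.items():
        # A valid count is >= 0 and equal to its own integer part.
        if any(v < 0 or v != int(v) for v in vals):
            bad.append(tid)
    return bad


def filter_by_overall_sum(counts, min_total=10):
    """Keep transcripts whose total over ALL samples reaches min_total.

    min_total is a hypothetical threshold chosen for this example.
    """
    return {tid: vals for tid, vals in counts.items() if sum(vals) >= min_total}


counts = {
    # values as posted: estimated abundances, not integer counts
    "GeneA": [85.78942, 19.11753, 1471.813, 61.71464,
              2088.722, 2681.746, 2413.892, 1628.187],
    # made-up transcript with very few reads in total
    "GeneLow": [0, 1, 0, 0, 2, 0, 1, 0],
}
flagged = non_integer_transcripts(counts)
kept = filter_by_overall_sum(counts, min_total=10)
```

Like the overall variance, the overall sum is computed without looking at which samples are cases and which are controls.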
Best wishes
Wolfgang
_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
Hi,
What would be a reasonable/widely used cut-off for overall variance
and overall sum?
Thanks for pointing out the number format. The example I gave is from
the eXpress software, and I rounded the numbers to the closest integer
before inputting them into DESeq.
Regards,
Nirmala
________________________________________
From: Wolfgang Huber [whuber@embl.de]
Sent: Saturday, December 15, 2012 11:05 AM
To: Davis, Sean (NIH/NCI) [E]
Cc: Akula, Nirmala (NIH/NIMH) [C]; bioconductor at r-project.org
Subject: Re: [BioC] filtering before using DESeq
Dear Akula, Sean
Besides overall variance, the overall sum is also a good filter statistic.
Akula, please note that DESeq expects counts, which need to be
non-negative integer values. The values you state are not integers.
Best wishes
Wolfgang
On Dec 15, 2012, at 5:53 PM, "Akula, Nirmala (NIH/NIMH) [C]" <akulan at mail.nih.gov> wrote:
> Hi,
>
> What would be a reasonable/widely used cut-off for overall variance
> and overall sum?
>
> Thanks for pointing out the number format. The example I gave is from
> the eXpress software, and I rounded the numbers to the closest integer
> before inputting them into DESeq.
Nirmala,
It's a bit more subtle than that. DESeq expects actual counts of
fragments; please do read the DESeq vignette.
I have no experience with combining eXpress and DESeq, and I cannot say
whether what you are doing is scientifically valid. Unless you are
comfortable with making your own statistical models and strategies, I'd
recommend following an established path rather than cutting your own,
where you would be on your own.
Best wishes
Wolfgang