DESeq2; FoldChange values with only 0's

0

Entering edit mode

Hanneke van Deutekom ▴ 20

@hanneke-van-deutekom-6220

Last seen 10.5 years ago

Hi, My experiment consist of control and diseased samples; looking like this: colData<-DataFrame(condition=factor(c("A","A","B","B","C","C","D","D", "E","E","F","F","H","H","I","I")), type=factor(c("plasma","plasma","plasma","plasma","plasma","plasma","p lasma","plasma","tissue","tissue","tissue","tissue","tissue","tissue", "tissue","tissue")), diseased=factor(c("no","no","no","no","yes","yes","yes","yes","no","no ","no","no","yes","yes","yes","yes"))) To search for differential expression I compare AvsB, AvsC, BvsD and BvsD then EvsF, EvsH, FvsI and HvsI, So I use the results(dds,contrast=c("condition","H","I")) option in DESeq2. And then, these are the results I do not get; look at this example: geneid comparison log2foldchange padj countA1 countA2 countC1 countC2 (counts are the normalized values) xxx AC 2.40588491829641 0.281844992240883 0 0 0 0 yyy AB 3.98976903585264 0.0455960023534269 0 0 0 0 or .. a log2fold change has been found between A and C of 2.4 and almost 4 in AvsB, while all of these samples are. I noticed that in all cases where the counts and the normalized counts of the two conditions that I tested were all zero I do get a fold change, even up to (2log) 5. Often I see a padj of "NA". So I was satisfied with that, as pseudocounts could result in some kind of fold change. Leaving those with padj "NA" out would be good then. But as the example above, I sometimes get a padj, which is even significant in the AB example above. This propably has something to do with all the other values (A1, A2, B1, B2,....,I1,I2); yyy 0 0 0 0 0 0 0 11 4897697 4167458 4337956 4139125 3731919 9711412 4043504 3796036 (i.e. massively present in tissue samples but not in blood, because of the discrepancies between blood and tissue I am not comparing them directly) I would like to understand why there is a significance here between A and B?? Or should I 'just' split my data in tissue and plasma?? But then still. I will find the AC example above, as B and D have counts in that example, however small numbers (E-I are also small numbers) Hanneke > sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] DESeq2_1.2.2 RcppArmadillo_0.3.910.0 Rcpp_0.10.4 [4] GenomicRanges_1.14.3 XVector_0.2.0 IRanges_1.20.4 [7] BiocGenerics_0.8.0 loaded via a namespace (and not attached): [1] annotate_1.40.0 AnnotationDbi_1.24.0 Biobase_2.22.0 [4] DBI_0.2-7 genefilter_1.44.0 grid_3.0.2 [7] lattice_0.20-23 locfit_1.5-9.1 RColorBrewer_1.0-5 [10] RSQLite_0.11.4 splines_3.0.2 stats4_3.0.2 [13] survival_2.37-4 XML_3.98-1.1 xtable_1.7-1 [[alternative HTML version deleted]]

• 965 views

ADD COMMENT • link updated 11.2 years ago by Michael Love 43k • written 11.2 years ago by Hanneke van Deutekom ▴ 20

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 9 hours ago

United States

hi Hanneke, That looks like a bug. Could you email me your data and a reproducible example so I can look into this? Can you also try rerunning DESeq() with betaPrior=FALSE? thanks, Mike On Fri, Nov 22, 2013 at 9:48 AM, Hanneke van Deutekom < hwmvandeutekom@gmail.com> wrote: > Hi, > > > My experiment consist of control and diseased samples; looking like this: > > > colData<-DataFrame(condition=factor(c("A","A","B","B","C","C","D","D ","E","E","F","F","H","H","I","I")), > > type=factor(c("plasma","plasma","plasma","plasma","plasma","plasma", "plasma","plasma","tissue","tissue","tissue","tissue","tissue","tissue ","tissue","tissue")), > > diseased=factor(c("no","no","no","no","yes","yes","yes","yes","no"," no","no","no","yes","yes","yes","yes"))) > > To search for differential expression I compare > AvsB, AvsC, BvsD and BvsD > then > EvsF, EvsH, FvsI and HvsI, > > So I use the results(dds,contrast=c("condition","H","I")) option in DESeq2. > > And then, these are the results I do not get; > look at this example: > > geneid comparison log2foldchange padj countA1 countA2 countC1 countC2 > (counts are the normalized values) > xxx AC 2.40588491829641 0.281844992240883 0 0 0 0 > yyy AB 3.98976903585264 0.0455960023534269 0 0 0 0 > or .. a log2fold change has been found between A and C of 2.4 and almost 4 > in AvsB, while all of these samples are. > > I noticed that in all cases where the counts and the normalized counts of > the two conditions that I tested were all zero I do get a fold change, even > up to (2log) 5. Often I see a padj of "NA". So I was satisfied with that, > as pseudocounts could result in some kind of fold change. Leaving those > with padj "NA" out would be good then. > But as the example above, I sometimes get a padj, which is even significant > in the AB example above. This propably has something to do with all the > other values (A1, A2, B1, B2,....,I1,I2); > yyy 0 0 0 0 0 0 0 11 4897697 4167458 > 4337956 4139125 3731919 9711412 4043504 3796036 (i.e. > massively present in tissue samples but not in blood, because of the > discrepancies between blood and tissue I am not comparing them directly) > > I would like to understand why there is a significance here between A and > B?? > Or should I 'just' split my data in tissue and plasma?? But then still. I > will find the AC example above, as B and D have counts in that example, > however small numbers (E-I are also small numbers) > > > Hanneke > > > > > > > sessionInfo() > R version 3.0.2 (2013-09-25) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] DESeq2_1.2.2 RcppArmadillo_0.3.910.0 Rcpp_0.10.4 > [4] GenomicRanges_1.14.3 XVector_0.2.0 IRanges_1.20.4 > [7] BiocGenerics_0.8.0 > > loaded via a namespace (and not attached): > [1] annotate_1.40.0 AnnotationDbi_1.24.0 Biobase_2.22.0 > [4] DBI_0.2-7 genefilter_1.44.0 grid_3.0.2 > [7] lattice_0.20-23 locfit_1.5-9.1 RColorBrewer_1.0-5 > [10] RSQLite_0.11.4 splines_3.0.2 stats4_3.0.2 > [13] survival_2.37-4 XML_3.98-1.1 xtable_1.7-1 > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 11.2 years ago Michael Love 43k

0

Entering edit mode

hi Hanneke, Never mind about the example, I can reproduce this and will look into it. thanks, Mike On Fri, Nov 22, 2013 at 10:01 AM, Michael Love <michaelisaiahlove@gmail.com>wrote: > hi Hanneke, > > That looks like a bug. Could you email me your data and a reproducible > example so I can look into this? > > Can you also try rerunning DESeq() with betaPrior=FALSE? > > thanks, > > Mike > > > On Fri, Nov 22, 2013 at 9:48 AM, Hanneke van Deutekom < > hwmvandeutekom@gmail.com> wrote: > >> Hi, >> >> >> My experiment consist of control and diseased samples; looking like this: >> >> >> colData<-DataFrame(condition=factor(c("A","A","B","B","C","C","D"," D","E","E","F","F","H","H","I","I")), >> >> type=factor(c("plasma","plasma","plasma","plasma","plasma","plasma" ,"plasma","plasma","tissue","tissue","tissue","tissue","tissue","tissu e","tissue","tissue")), >> >> diseased=factor(c("no","no","no","no","yes","yes","yes","yes","no", "no","no","no","yes","yes","yes","yes"))) >> >> To search for differential expression I compare >> AvsB, AvsC, BvsD and BvsD >> then >> EvsF, EvsH, FvsI and HvsI, >> >> So I use the results(dds,contrast=c("condition","H","I")) option in >> DESeq2. >> >> And then, these are the results I do not get; >> look at this example: >> >> geneid comparison log2foldchange padj countA1 countA2 countC1 countC2 >> (counts are the normalized values) >> xxx AC 2.40588491829641 0.281844992240883 0 0 0 0 >> yyy AB 3.98976903585264 0.0455960023534269 0 0 0 0 >> or .. a log2fold change has been found between A and C of 2.4 and almost 4 >> in AvsB, while all of these samples are. >> >> I noticed that in all cases where the counts and the normalized counts of >> the two conditions that I tested were all zero I do get a fold change, >> even >> up to (2log) 5. Often I see a padj of "NA". So I was satisfied with that, >> as pseudocounts could result in some kind of fold change. Leaving those >> with padj "NA" out would be good then. >> But as the example above, I sometimes get a padj, which is even >> significant >> in the AB example above. This propably has something to do with all the >> other values (A1, A2, B1, B2,....,I1,I2); >> yyy 0 0 0 0 0 0 0 11 4897697 4167458 >> 4337956 4139125 3731919 9711412 4043504 3796036 (i.e. >> massively present in tissue samples but not in blood, because of the >> discrepancies between blood and tissue I am not comparing them directly) >> >> I would like to understand why there is a significance here between A and >> B?? >> Or should I 'just' split my data in tissue and plasma?? But then still. I >> will find the AC example above, as B and D have counts in that example, >> however small numbers (E-I are also small numbers) >> >> >> Hanneke >> >> >> >> >> >> > sessionInfo() >> R version 3.0.2 (2013-09-25) >> Platform: x86_64-pc-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> [8] base >> >> other attached packages: >> [1] DESeq2_1.2.2 RcppArmadillo_0.3.910.0 Rcpp_0.10.4 >> [4] GenomicRanges_1.14.3 XVector_0.2.0 IRanges_1.20.4 >> [7] BiocGenerics_0.8.0 >> >> loaded via a namespace (and not attached): >> [1] annotate_1.40.0 AnnotationDbi_1.24.0 Biobase_2.22.0 >> [4] DBI_0.2-7 genefilter_1.44.0 grid_3.0.2 >> [7] lattice_0.20-23 locfit_1.5-9.1 RColorBrewer_1.0-5 >> [10] RSQLite_0.11.4 splines_3.0.2 stats4_3.0.2 >> [13] survival_2.37-4 XML_3.98-1.1 xtable_1.7-1 >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > [[alternative HTML version deleted]]

ADD REPLY • link 11.2 years ago Michael Love 43k

0

Entering edit mode

hi Hanneke, Thanks for pointing this out. It is a general issue with asymmetric behavior of the prior on log2 fold changes for factors with 3 or more levels, which we are interested in fixing. Other users had noted that the log2 fold changes and p-values for a factor with 3 or more levels can change if the base level of that factor is changed. I will start implementing a solution for this in the development branch (v1.3), but I don't want to disrupt functionality in the release branch (v1.2). My recommendation for version 1.2 is to use DESeq(dds, betaPrior=FALSE) if the factor you are interested in generating results for has 3 or more levels. If you have a design "~ group + condition", where group has 3 levels and condition has 2 levels, and you were only interested in results for condition, then you could continue to use the default settings. I pushed out a new version to the release branch, 1.2.6, which provides this message to the user if a DESeqDataSet is constructed with a factor with 3 or more levels. Usage note: the following factors have 3 or more levels: > group > For DESeq2 versions < 1.3, if you plan on extracting results for > these factors, we recommend using betaPrior=FALSE as an argument > when calling DESeq(). > As currently implemented in version 1.2, the log2 fold changes can > vary if the base level is changed, when extracting results for a > factor with 3 or more levels. A solution will be implemented in > version 1.3 which allows for the use of a beta prior and symmetric > log2 fold change estimates regardless of the selection of base level. Mike On Fri, Nov 22, 2013 at 10:20 AM, Michael Love <michaelisaiahlove@gmail.com>wrote: > hi Hanneke, > > Never mind about the example, I can reproduce this and will look into it. > > thanks, > > Mike > > > On Fri, Nov 22, 2013 at 10:01 AM, Michael Love < > michaelisaiahlove@gmail.com> wrote: > >> hi Hanneke, >> >> That looks like a bug. Could you email me your data and a reproducible >> example so I can look into this? >> >> Can you also try rerunning DESeq() with betaPrior=FALSE? >> >> thanks, >> >> Mike >> >> >> On Fri, Nov 22, 2013 at 9:48 AM, Hanneke van Deutekom < >> hwmvandeutekom@gmail.com> wrote: >> >>> Hi, >>> >>> >>> My experiment consist of control and diseased samples; looking like this: >>> >>> >>> colData<-DataFrame(condition=factor(c("A","A","B","B","C","C","D", "D","E","E","F","F","H","H","I","I")), >>> >>> type=factor(c("plasma","plasma","plasma","plasma","plasma","plasma ","plasma","plasma","tissue","tissue","tissue","tissue","tissue","tiss ue","tissue","tissue")), >>> >>> diseased=factor(c("no","no","no","no","yes","yes","yes","yes","no" ,"no","no","no","yes","yes","yes","yes"))) >>> >>> To search for differential expression I compare >>> AvsB, AvsC, BvsD and BvsD >>> then >>> EvsF, EvsH, FvsI and HvsI, >>> >>> So I use the results(dds,contrast=c("condition","H","I")) option in >>> DESeq2. >>> >>> And then, these are the results I do not get; >>> look at this example: >>> >>> geneid comparison log2foldchange padj countA1 countA2 countC1 countC2 >>> (counts are the normalized values) >>> xxx AC 2.40588491829641 0.281844992240883 0 0 0 0 >>> yyy AB 3.98976903585264 0.0455960023534269 0 0 0 0 >>> or .. a log2fold change has been found between A and C of 2.4 and almost >>> 4 >>> in AvsB, while all of these samples are. >>> >>> I noticed that in all cases where the counts and the normalized counts of >>> the two conditions that I tested were all zero I do get a fold change, >>> even >>> up to (2log) 5. Often I see a padj of "NA". So I was satisfied with that, >>> as pseudocounts could result in some kind of fold change. Leaving those >>> with padj "NA" out would be good then. >>> But as the example above, I sometimes get a padj, which is even >>> significant >>> in the AB example above. This propably has something to do with all the >>> other values (A1, A2, B1, B2,....,I1,I2); >>> yyy 0 0 0 0 0 0 0 11 4897697 4167458 >>> 4337956 4139125 3731919 9711412 4043504 3796036 (i.e. >>> massively present in tissue samples but not in blood, because of the >>> discrepancies between blood and tissue I am not comparing them directly) >>> >>> I would like to understand why there is a significance here between A and >>> B?? >>> Or should I 'just' split my data in tissue and plasma?? But then still. I >>> will find the AC example above, as B and D have counts in that example, >>> however small numbers (E-I are also small numbers) >>> >>> >>> Hanneke >>> >>> >>> >>> >>> >>> > sessionInfo() >>> R version 3.0.2 (2013-09-25) >>> Platform: x86_64-pc-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils datasets methods >>> [8] base >>> >>> other attached packages: >>> [1] DESeq2_1.2.2 RcppArmadillo_0.3.910.0 Rcpp_0.10.4 >>> [4] GenomicRanges_1.14.3 XVector_0.2.0 IRanges_1.20.4 >>> [7] BiocGenerics_0.8.0 >>> >>> loaded via a namespace (and not attached): >>> [1] annotate_1.40.0 AnnotationDbi_1.24.0 Biobase_2.22.0 >>> [4] DBI_0.2-7 genefilter_1.44.0 grid_3.0.2 >>> [7] lattice_0.20-23 locfit_1.5-9.1 RColorBrewer_1.0-5 >>> [10] RSQLite_0.11.4 splines_3.0.2 stats4_3.0.2 >>> [13] survival_2.37-4 XML_3.98-1.1 xtable_1.7-1 >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> > [[alternative HTML version deleted]]

ADD REPLY • link 11.2 years ago Michael Love 43k

Login before adding your answer.