DESeq2 - working with partial replicates

0

Entering edit mode

Avinash S ▴ 40

@avinash-s-6145

Last seen 7.9 years ago

United States

Hello, I'm trying to use DESeq2 for differential expression analysis of RNA- seq data containing Control (CR) and two Treatments (HR and SR). One of the biological replicate for HR treatment failed leaving me with 2 replicates for CR and SR and 1 for HR. DESeq document described "Working partially without replicates" however, that is not in the DESeq2 documentation. I want to check if I could use the code below to analyze data with no biological replicate for one of the treatments. Also, I appreciate if someone could weigh in on using edgeR with partial replicates (by estimating dispersion from house-keeping genes) as an option. Thank you for your time, Avinash Here is the code I used and my sessionInfo(). countsTablePop <- read.delim(filetxt, row.names=1 ) countTable <-as.data.frame(countsTablePop) colData <- DataFrame(condition = factor(c("CR", "CR", "HR", "SR", "SR"))) dds <- DESeqDataSetFromMatrix(countData = countTable, colData = colData, design = ~ condition) colData(dds)$condition <- factor(colData(dds)$condition, levels=c("CR","HR","SR")) design(dds) dds <- DESeq(dds) res <- results(dds) res <- res[order(res$padj),] ##### > sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-redhat-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] grid splines parallel stats graphics grDevices utils datasets methods base other attached packages: [1] VennDiagram_1.6.5 vsn_3.28.0 gplots_2.12.1 RColorBrewer_1.0-5 DESeq2_1.0.19 [6] RcppArmadillo_0.4.000.2 Rcpp_0.10.6 GenomicRanges_1.13.35 XVector_0.1.0 IRanges_1.19.19 [11] lattice_0.20-23 locfit_1.5-9.1 BiocInstaller_1.10.4 limma_3.17.21 topGO_2.12.0 [16] SparseM_1.03 GO.db_2.9.0 RSQLite_0.11.4 DBI_0.2-7 AnnotationDbi_1.23.18 [21] Biobase_2.21.6 BiocGenerics_0.7.3 graph_1.38.3 plyr_1.8 reshape2_1.2.2 [26] ggplot2_0.9.3.1 loaded via a namespace (and not attached): [1] affy_1.38.1 affyio_1.28.0 annotate_1.38.0 bitops_1.0-6 caTools_1.16 colorspace_1.2-4 [7] dichromat_2.0-0 digest_0.6.4 gdata_2.13.2 genefilter_1.42.0 gtable_0.1.2 gtools_3.2.1 [13] KernSmooth_2.23-10 labeling_0.2 MASS_7.3-29 munsell_0.4.2 preprocessCore_1.22.0 proto_0.3-10 [19] scales_0.2.3 stats4_3.0.2 stringr_0.6.2 survival_2.37-7 tools_3.0.2 XML_3.98-1.1 [25] xtable_1.7-1 zlibbioc_1.7.0 [[alternative HTML version deleted]]

GO edgeR DESeq2 GO edgeR DESeq2 • 3.3k views

ADD COMMENT • link updated 11.2 years ago by Michael Love 43k • written 11.2 years ago by Avinash S ▴ 40

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 4 days ago

United States

Hi Avinash, You can use the DESeq() function as you have and it will take care of the dispersion estimation automatically. Also, I see you are using v1.0, which is some months behind now. We recommend, if possible, users always upgrading to the most recent release. Mike On Jan 28, 2014 9:56 PM, "Avinash S" <avins.s@gmail.com> wrote: > Hello, > > I'm trying to use DESeq2 for differential expression analysis of RNA-seq > data containing Control (CR) and two Treatments (HR and SR). One of the > biological replicate for HR treatment failed leaving me with 2 replicates > for CR and SR and 1 for HR. > > DESeq document described "Working partially without replicates" however, > that is not in the DESeq2 documentation. > > I want to check if I could use the code below to analyze data with no > biological replicate for one of the treatments. > > Also, I appreciate if someone could weigh in on using edgeR with partial > replicates (by estimating dispersion from house-keeping genes) as an > option. > > Thank you for your time, > Avinash > > Here is the code I used and my sessionInfo(). > > countsTablePop <- read.delim(filetxt, row.names=1 ) > countTable <-as.data.frame(countsTablePop) > colData <- DataFrame(condition = factor(c("CR", "CR", "HR", "SR", "SR"))) > dds <- DESeqDataSetFromMatrix(countData = countTable, colData = colData, > design = ~ condition) > colData(dds)$condition <- factor(colData(dds)$condition, > levels=c("CR","HR","SR")) > design(dds) > dds <- DESeq(dds) > res <- results(dds) > res <- res[order(res$padj),] > > ##### > > > sessionInfo() > R version 3.0.2 (2013-09-25) > Platform: x86_64-redhat-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > LC_MONETARY=en_US.UTF-8 > [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C > LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] grid splines parallel stats graphics grDevices utils > datasets methods base > > other attached packages: > [1] VennDiagram_1.6.5 vsn_3.28.0 gplots_2.12.1 > RColorBrewer_1.0-5 DESeq2_1.0.19 > [6] RcppArmadillo_0.4.000.2 Rcpp_0.10.6 GenomicRanges_1.13.35 > XVector_0.1.0 IRanges_1.19.19 > [11] lattice_0.20-23 locfit_1.5-9.1 BiocInstaller_1.10.4 > limma_3.17.21 topGO_2.12.0 > [16] SparseM_1.03 GO.db_2.9.0 RSQLite_0.11.4 > DBI_0.2-7 AnnotationDbi_1.23.18 > [21] Biobase_2.21.6 BiocGenerics_0.7.3 graph_1.38.3 > plyr_1.8 reshape2_1.2.2 > [26] ggplot2_0.9.3.1 > > loaded via a namespace (and not attached): > [1] affy_1.38.1 affyio_1.28.0 annotate_1.38.0 > bitops_1.0-6 caTools_1.16 colorspace_1.2-4 > [7] dichromat_2.0-0 digest_0.6.4 gdata_2.13.2 > genefilter_1.42.0 gtable_0.1.2 gtools_3.2.1 > [13] KernSmooth_2.23-10 labeling_0.2 MASS_7.3-29 > munsell_0.4.2 preprocessCore_1.22.0 proto_0.3-10 > [19] scales_0.2.3 stats4_3.0.2 stringr_0.6.2 > survival_2.37-7 tools_3.0.2 XML_3.98-1.1 > [25] xtable_1.7-1 zlibbioc_1.7.0 > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 11.2 years ago Michael Love 43k

0

Entering edit mode

Thanks a lot Mike. Did not realize that I was using an older version. I will upgrade to V1.2.9. -Avinash On 28 January 2014 21:18, Michael Love <michaelisaiahlove@gmail.com> wrote: > Hi Avinash, > > You can use the DESeq() function as you have and it will take care of the > dispersion estimation automatically. > > Also, I see you are using v1.0, which is some months behind now. We > recommend, if possible, users always upgrading to the most recent release. > > Mike > On Jan 28, 2014 9:56 PM, "Avinash S" <avins.s@gmail.com> wrote: > >> Hello, >> >> I'm trying to use DESeq2 for differential expression analysis of RNA-seq >> data containing Control (CR) and two Treatments (HR and SR). One of the >> biological replicate for HR treatment failed leaving me with 2 replicates >> for CR and SR and 1 for HR. >> >> DESeq document described "Working partially without replicates" however, >> that is not in the DESeq2 documentation. >> >> I want to check if I could use the code below to analyze data with no >> biological replicate for one of the treatments. >> >> Also, I appreciate if someone could weigh in on using edgeR with partial >> replicates (by estimating dispersion from house-keeping genes) as an >> option. >> >> Thank you for your time, >> Avinash >> >> Here is the code I used and my sessionInfo(). >> >> countsTablePop <- read.delim(filetxt, row.names=1 ) >> countTable <-as.data.frame(countsTablePop) >> colData <- DataFrame(condition = factor(c("CR", "CR", "HR", "SR", "SR"))) >> dds <- DESeqDataSetFromMatrix(countData = countTable, colData = colData, >> design = ~ condition) >> colData(dds)$condition <- factor(colData(dds)$condition, >> levels=c("CR","HR","SR")) >> design(dds) >> dds <- DESeq(dds) >> res <- results(dds) >> res <- res[order(res$padj),] >> >> ##### >> >> > sessionInfo() >> R version 3.0.2 (2013-09-25) >> Platform: x86_64-redhat-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> LC_MONETARY=en_US.UTF-8 >> [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C >> LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] grid splines parallel stats graphics grDevices utils >> datasets methods base >> >> other attached packages: >> [1] VennDiagram_1.6.5 vsn_3.28.0 gplots_2.12.1 >> RColorBrewer_1.0-5 DESeq2_1.0.19 >> [6] RcppArmadillo_0.4.000.2 Rcpp_0.10.6 GenomicRanges_1.13.35 >> XVector_0.1.0 IRanges_1.19.19 >> [11] lattice_0.20-23 locfit_1.5-9.1 BiocInstaller_1.10.4 >> limma_3.17.21 topGO_2.12.0 >> [16] SparseM_1.03 GO.db_2.9.0 RSQLite_0.11.4 >> DBI_0.2-7 AnnotationDbi_1.23.18 >> [21] Biobase_2.21.6 BiocGenerics_0.7.3 graph_1.38.3 >> plyr_1.8 reshape2_1.2.2 >> [26] ggplot2_0.9.3.1 >> >> loaded via a namespace (and not attached): >> [1] affy_1.38.1 affyio_1.28.0 annotate_1.38.0 >> bitops_1.0-6 caTools_1.16 colorspace_1.2-4 >> [7] dichromat_2.0-0 digest_0.6.4 gdata_2.13.2 >> genefilter_1.42.0 gtable_0.1.2 gtools_3.2.1 >> [13] KernSmooth_2.23-10 labeling_0.2 MASS_7.3-29 >> munsell_0.4.2 preprocessCore_1.22.0 proto_0.3-10 >> [19] scales_0.2.3 stats4_3.0.2 stringr_0.6.2 >> survival_2.37-7 tools_3.0.2 XML_3.98-1.1 >> [25] xtable_1.7-1 zlibbioc_1.7.0 >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > [[alternative HTML version deleted]]

ADD REPLY • link 11.2 years ago Avinash S ▴ 40

0

Entering edit mode

Hi, Dear Michael: Sorry to directly jump in a quick question, since I saw you respond to some DESeq questions yesterday and I know you are the author of DESeq2 package. I run into some RNAseq data issue with edgeR package (see my post in BioC yesterday), since the data I used are not integer counts of genes, but RSEM processed raw_count from TCGA RNA-seq data, in fact, are not integers, although TCGA named it as raw_count (I showed some piece of such a file from TCGA as below). file name: "unc.edu.1b4bb160-191b-4796-8759-339b11fe386d.1096727.rsem. genes.results" its content looks below: gene_id raw_count scaled_estimate transcript_id 1 ?|100130426 0.00 0.000000e+00 uc011lsn.1 2 ?|100133144 4.67 1.794813e-07 uc010unu.1,uc010uoa.1 3 ?|100134869 15.33 4.271899e-07 uc002bgz.2,uc002bic.2 4 ?|10357 218.79 1.933490e-05 uc010zzl.1 5 ?|10431 1255.00 5.033411e-05 uc001jiu.2,uc010qhg.1 6 ?|136542 0.00 0.000000e+00 uc011krn.1 I used the raw_count of this file for edgeR analysis and run into some warning messages which indicating potential issues, and had been pointed out by Dr. Smyth as you may have seen in my post yesterday. Reading the user guide from both DESeq and DESeq2, both packages mentioned the input was expected to be count data in a form of a matrix of integer values, not even (rounded) normalized counts, which would lead to nonsensical results. The TCGA data above apparently are not true raw counts but some kind of RSEM processed raw_counts. Although both RSEM authors and TCGA data staff all suggested the raw_count can be used for edgeR and DESeq (or DESeq2) analysis, because of the issues I run into in edgeR. I wonder what would be your opinion on this data for analysis in DESeq2? Any advice would be appreciated very much! Thanks and best Ming Yi NIH Maryland,USA [[alternative HTML version deleted]]

ADD REPLY • link 11.2 years ago Ming ▴ 380

0

Entering edit mode

hi Ming,

We do not recommend using rounded estimated values of read counts with DESeq2 (although I found an email from myself to the list one year ago contradicting this in the case that someone had no access to the raw data). Counts of reads which are proportionally assigned to genes and then rounded can be a bad fit for distributions like the Negative Binomial (and Poisson). For example, this procedure could generate values arising from a distribution that has variance less than the mean. As a rule, and to help avoid erroneous results, users should produce a matrix containing integer counts of reads uniquely aligned to features.

Update (12/13/15): After investigation into the RSEM method and performance, I've come around and recommend the option of using rounded estimated gene-level counts from RSEM as input to DESeq2. My concern above was from a misunderstanding of how RSEM works.

Mike

ADD REPLY • link 11.2 years ago • updated 9.3 years ago Michael Love 43k

0

Entering edit mode

Hi, Michael: Thanks so much for the response and advice. We actually found files with raw counts from TCGA RNAseq v1 dataset (we were discussing about V2 RNAseq data). Unfortunately, the V1 dataset only has a portion of samples in V2 dataset, and V2 only has those RSEM estimated counts. we currently do not have the controlled access permission for fastq and bam files. We kind of stuck there. Thanks a lot for your advice and help! best Ming From: michaelisaiahlove@gmail.com Date: Wed, 29 Jan 2014 13:01:40 -0500 Subject: Re: DESeq2 - input data questions To: yi02@hotmail.com CC: bioconductor@r-project.org hi Ming, We do not recommend using rounded estimated values of read counts with DESeq2 (although I found an email from myself to the list one year ago contradicting this in the case that someone had no access to the raw data)â. Counts of reads which are proportionally assigned to genes and then rounded can be a bad fit for distributions like the Negative Binomial (and Poisson). For example, this procedure could generate values arising from a distribution that has variance less than the mean. As a rule, and to help avoid erroneous results, users should produce a matrix containing integer counts of reads uniquely aligned to features. Mike On Wed, Jan 29, 2014 at 11:16 AM, Ming Yi <yi02@hotmail.com> wrote: Hi, Dear Michael: Sorry to directly jump in a quick question, since I saw you respond to some DESeq questions yesterday and I know you are the author of DESeq2 package. I run into some RNAseq data issue with edgeR package (see my post in BioC yesterday), since the data I used are not integer counts of genes, but RSEM processed raw_count from TCGA RNA-seq data, in fact, are not integers, although TCGA named it as raw_count (I showed some piece of such a file from TCGA as below). file name: "unc.edu.1b4bb160-191b-4796-8759-339b11fe386d.1096727.rsem. genes.results" its content looks below: gene_id raw_count scaled_estimate transcript_id 1 ?|100130426 0.00 0.000000e+00 uc011lsn.1 2 ?|100133144 4.67 1.794813e-07 uc010unu.1,uc010uoa.1 3 ?|100134869 15.33 4.271899e-07 uc002bgz.2,uc002bic.2 4 ?|10357 218.79 1.933490e-05 uc010zzl.1 5 ?|10431 1255.00 5.033411e-05 uc001jiu.2,uc010qhg.1 6 ?|136542 0.00 0.000000e+00 uc011krn.1 I used the raw_count of this file for edgeR analysis and run into some warning messages which indicating potential issues, and had been pointed out by Dr. Smyth as you may have seen in my post yesterday. Reading the user guide from both DESeq and DESeq2, both packages mentioned the input was expected to be count data in a form of a matrix of integer values, not even (rounded) normalized counts, which would lead to nonsensical results. The TCGA data above apparently are not true raw counts but some kind of RSEM processed raw_counts. Although both RSEM authors and TCGA data staff all suggested the raw_count can be used for edgeR and DESeq (or DESeq2) analysis, because of the issues I run into in edgeR. I wonder what would be your opinion on this data for analysis in DESeq2? Any advice would be appreciated very much! Thanks and best Ming Yi NIH Maryland,USA On Wed, Jan 29, 2014 at 11:16 AM, Ming Yi <yi02@hotmail.com> wrote: Hi, Dear Michael: Sorry to directly jump in a quick question, since I saw you respond to some DESeq questions yesterday and I know you are the author of DESeq2 package. I run into some RNAseq data issue with edgeR package (see my post in BioC yesterday), since the data I used are not integer counts of genes, but RSEM processed raw_count from TCGA RNA-seq data, in fact, are not integers, although TCGA named it as raw_count (I showed some piece of such a file from TCGA as below). file name: "unc.edu.1b4bb160-191b-4796-8759-339b11fe386d.1096727.rsem. genes.results" its content looks below: gene_id raw_count scaled_estimate transcript_id 1 ?|100130426 0.00 0.000000e+00 uc011lsn.1 2 ?|100133144 4.67 1.794813e-07 uc010unu.1,uc010uoa.1 3 ?|100134869 15.33 4.271899e-07 uc002bgz.2,uc002bic.2 4 ?|10357 218.79 1.933490e-05 uc010zzl.1 5 ?|10431 1255.00 5.033411e-05 uc001jiu.2,uc010qhg.1 6 ?|136542 0.00 0.000000e+00 uc011krn.1 I used the raw_count of this file for edgeR analysis and run into some warning messages which indicating potential issues, and had been pointed out by Dr. Smyth as you may have seen in my post yesterday. Reading the user guide from both DESeq and DESeq2, both packages mentioned the input was expected to be count data in a form of a matrix of integer values, not even (rounded) normalized counts, which would lead to nonsensical results. The TCGA data above apparently are not true raw counts but some kind of RSEM processed raw_counts. Although both RSEM authors and TCGA data staff all suggested the raw_count can be used for edgeR and DESeq (or DESeq2) analysis, because of the issues I run into in edgeR. I wonder what would be your opinion on this data for analysis in DESeq2? Any advice would be appreciated very much! Thanks and best Ming Yi NIH Maryland,USA [[alternative HTML version deleted]]

ADD REPLY • link 11.2 years ago Ming ▴ 380

Login before adding your answer.