DESeq analysis
1
0
Entering edit mode
@fatemehsadat-seyednasrollah-5367
Last seen 10.3 years ago
Hi, First thanks a lot for your answer. Actually I have used a subset of a public data from Bowtie(the Montgomery)(http://bowtie-bio.sourceforge.net/recount/) and below are the reduced codes of my work both from edgeR and DEseq. I wanted to know have I done something wrong to obtain very different answers ( 85 from DESeq and 407 from edgeR) or it is natural to have this hude difference and it is related to the algorithms? edgeR: > g1 <- read.delim ("count1.txt", row.names = 1) > head(g1) NA06994M NA07051M NA07347M NA07357M NA07000F NA07037F NA07346F ENSG00000000003 0 0 0 0 1 0 0 ENSG00000000005 0 0 0 0 0 0 0 ENSG00000000419 10 24 19 20 19 8 14 ENSG00000000457 17 15 13 18 21 18 21 ENSG00000000460 2 3 5 2 4 6 8 ENSG00000000938 20 4 35 16 10 17 19 NA10847F ENSG00000000003 0 ENSG00000000005 0 ENSG00000000419 6 ENSG00000000457 15 ENSG00000000460 2 ENSG00000000938 9 > group <- factor(rep(c("Male", "Female"), each= 4)) > dge <- DGEList(counts = g1 , group = group ) Calculating library sizes from column totals. > dge <- calcNormFactors(dge) > dge <- estimateCommonDisp(dge) > sqrt (dge$common.dispersion) [1] 0.3858996 > test <- exactTest(dge) > head(test$table) logFC logCPM PValue ENSG00000000003 -2.3441897 -3.042057 1.0000000 ENSG00000000005 0.0000000 -Inf 1.0000000 ENSG00000000419 0.5777309 3.850993 0.2791539 ENSG00000000457 -0.3054489 4.080866 0.5592668 ENSG00000000460 -0.7792622 1.966865 0.3274528 ENSG00000000938 0.3909100 3.997866 0.4269672 > sum (test$table$PValue <0.01) [1] 407 DESeq: > g1 <- read.table("count1.txt", header = TRUE, row.names = 1) > conds <- factor(rep(c("Male", "Female"), each= 4)) > dataPack <- data.frame(row.names = colnames(g1), condition =rep( c("Male", "Female"), each= 4)) > dataPack condition NA06994M Male NA07051M Male NA07347M Male NA07357M Male NA07000F Female NA07037F Female NA07346F Female NA10847F Female > cds <- newCountDataSet(g1, conds) > head(cds) CountDataSet (storageMode: environment) assayData: 1 features, 8 samples element names: counts protocolData: none phenoData sampleNames: NA06994M NA07051M ... NA10847F (8 total) varLabels: sizeFactor condition varMetadata: labelDescription featureData: none experimentData: use 'experimentData(object)' Annotation: > head(counts(cds) + ) NA06994M NA07051M NA07347M NA07357M NA07000F NA07037F NA07346F ENSG00000000003 0 0 0 0 1 0 0 ENSG00000000005 0 0 0 0 0 0 0 ENSG00000000419 10 24 19 20 19 8 14 ENSG00000000457 17 15 13 18 21 18 21 ENSG00000000460 2 3 5 2 4 6 8 ENSG00000000938 20 4 35 16 10 17 19 NA10847F ENSG00000000003 0 ENSG00000000005 0 ENSG00000000419 6 ENSG00000000457 15 ENSG00000000460 2 ENSG00000000938 9 > cds <- estimateSizeFactors(cds) > sizeFactors(cds) NA06994M NA07051M NA07347M NA07357M NA07000F NA07037F NA07346F NA10847F 0.8599841 1.1102643 1.0869086 1.1157556 1.1056726 1.0666049 0.9152017 0.9402086 > head(counts(cds, normalized= TRUE)) > cds <- estimateDispersions(cds) > result <- nbinomTest(cds, "Male", "Female") > nrow(subset(result, result$pval <0.01)) [1] 85 Again thank you so much With Best Regards, Narges_ #################################################################### Dear Narges thank you for the feedback. Your second question is easy: use the idiom res1 <- subset(res, padj<0.1) instead, this will avoid the creation of rows full of NA whenever res$padj is NA. Alternatively res[order(res$padj)[1:n], ] with 'n' your favourite lucky number might be useful. Have a look at the R-intro manual for more on subsetting of arrays and dataframes in R. Your first question: can you show us the data for the genes where you know that they are differentially expressed? Perhaps then it might become more apparent why DESeq / nbinomtest did not agree. Also, what does the dispersion plot for cds look like? (This is the plot produced by plotDispEsts in the vignette). Best wishes Wolfgang ##################################################################### narges [guest] scripsit 06/26/2012 06:17 PM: > > Hi all > > I am doing some RNA seq analysis with DESeq. I have applied the nbinomTest to my dataset which I know have many differentially expressed genes but the first problem is that the result values for "padj"column is almost NA and sometimes 1. and when I want to have a splice from my fata frame the result is not meaningful for me. > > -- output of sessionInfo(): > > res <- nbinomTest(cds, "Male", "Female") > >> head(res) > id baseMean baseMeanA baseMeanB foldChange log2FoldChange > 1 ENSG00000000003 0.1130534 0.000000 0.2261067 Inf Inf > 2 ENSG00000000005 0.0000000 0.000000 0.0000000 NaN NaN > 3 ENSG00000000419 14.3767155 17.162610 11.5908205 0.6753530 -0.5662863 > 4 ENSG00000000457 17.0174761 15.342800 18.6921526 1.2183013 0.2848710 > 5 ENSG00000000460 3.9414822 2.855099 5.0278659 1.7610131 0.8164056 > 6 ENSG00000000938 16.0894945 18.350117 13.8288718 0.7536122 -0.4081058 > pval padj > 1 0.9959638 1 > 2 NA NA > 3 0.3208560 1 > 4 0.5942512 1 > 5 0.4840607 1 > 6 0.5409953 1 > > >> res1 <- res[res$padj<0.1,] >> head(res1) > id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj > NA <na> NA NA NA NA NA NA NA > NA.1 <na> NA NA NA NA NA NA NA > NA.2 <na> NA NA NA NA NA NA NA > NA.3 <na> NA NA NA NA NA NA NA > NA.4 <na> NA NA NA NA NA NA NA > NA.5 <na> NA NA NA NA NA NA NA > > my first question is that why although I know there are some differentially expressed genes in the my data, all the padj values are NA or 1 and the second question is this "NA.1" , "NA.2", ..... which are emerged as the first column of object "res1"instead of name of genes > > Thank you so much > Regards > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Best wishes Wolfgang Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber
edgeR DESeq edgeR DESeq • 1.5k views
ADD COMMENT
0
Entering edit mode
@wolfgang-huber-3550
Last seen 4 months ago
EMBL European Molecular Biology Laborat…
Dear Fatemehsadat Seyednasrollah this type of differences in the results may reflect the different approaches to estimating dispersion of the two packages and is not completely uncommon. I'd recommend to study the genes where the results differ in more detail, i.e. look at their data and possibly their biological roles. If your samples are 1000 Genomes cell lines, I would expect large variability between them, and indeed not that much signal in a 4 against 4 comparison. Best wishes Wolfgang Jul/3/12 1:56 PM, Fatemehsadat Seyednasrollah scripsit:: > Hi, > First thanks a lot for your answer. > Actually I have used a subset of a public data from Bowtie(the Montgomery)(http://bowtie-bio.sourceforge.net/recount/) > > and below are the reduced > codes of my work both from edgeR and DEseq. I wanted to know have I done > something wrong to obtain > very different answers ( 85 from DESeq and 407 from edgeR) or it is natural > to have this hude difference > and it is related to the algorithms? > > edgeR: >> g1 <- read.delim ("count1.txt", row.names = 1) >> head(g1) > NA06994M NA07051M NA07347M NA07357M NA07000F NA07037F NA07346F > ENSG00000000003 0 0 0 0 1 0 0 > ENSG00000000005 0 0 0 0 0 0 0 > ENSG00000000419 10 24 19 20 19 8 14 > ENSG00000000457 17 15 13 18 21 18 21 > ENSG00000000460 2 3 5 2 4 6 8 > ENSG00000000938 20 4 35 16 10 17 19 > NA10847F > ENSG00000000003 0 > ENSG00000000005 0 > ENSG00000000419 6 > ENSG00000000457 15 > ENSG00000000460 2 > ENSG00000000938 9 >> group <- factor(rep(c("Male", "Female"), each= 4)) >> dge <- DGEList(counts = g1 , group = group ) > Calculating library sizes from column totals. >> dge <- calcNormFactors(dge) >> dge <- estimateCommonDisp(dge) >> sqrt (dge$common.dispersion) > [1] 0.3858996 >> test <- exactTest(dge) >> head(test$table) > logFC logCPM PValue > ENSG00000000003 -2.3441897 -3.042057 1.0000000 > ENSG00000000005 0.0000000 -Inf 1.0000000 > ENSG00000000419 0.5777309 3.850993 0.2791539 > ENSG00000000457 -0.3054489 4.080866 0.5592668 > ENSG00000000460 -0.7792622 1.966865 0.3274528 > ENSG00000000938 0.3909100 3.997866 0.4269672 >> sum (test$table$PValue <0.01) > [1] 407 > > > DESeq: > >> g1 <- read.table("count1.txt", header = TRUE, row.names = 1) >> conds <- factor(rep(c("Male", "Female"), each= 4)) >> dataPack <- data.frame(row.names = colnames(g1), condition =rep( c("Male", "Female"), each= 4)) >> dataPack > condition > NA06994M Male > NA07051M Male > NA07347M Male > NA07357M Male > NA07000F Female > NA07037F Female > NA07346F Female > NA10847F Female >> cds <- newCountDataSet(g1, conds) >> head(cds) > CountDataSet (storageMode: environment) > assayData: 1 features, 8 samples > element names: counts > protocolData: none > phenoData > sampleNames: NA06994M NA07051M ... NA10847F (8 total) > varLabels: sizeFactor condition > varMetadata: labelDescription > featureData: none > experimentData: use 'experimentData(object)' > Annotation: >> head(counts(cds) > + ) > NA06994M NA07051M NA07347M NA07357M NA07000F NA07037F NA07346F > ENSG00000000003 0 0 0 0 1 0 0 > ENSG00000000005 0 0 0 0 0 0 0 > ENSG00000000419 10 24 19 20 19 8 14 > ENSG00000000457 17 15 13 18 21 18 21 > ENSG00000000460 2 3 5 2 4 6 8 > ENSG00000000938 20 4 35 16 10 17 19 > NA10847F > ENSG00000000003 0 > ENSG00000000005 0 > ENSG00000000419 6 > ENSG00000000457 15 > ENSG00000000460 2 > ENSG00000000938 9 >> cds <- estimateSizeFactors(cds) >> sizeFactors(cds) > NA06994M NA07051M NA07347M NA07357M NA07000F NA07037F NA07346F NA10847F > 0.8599841 1.1102643 1.0869086 1.1157556 1.1056726 1.0666049 0.9152017 0.9402086 >> head(counts(cds, normalized= TRUE)) > >> cds <- estimateDispersions(cds) >> result <- nbinomTest(cds, "Male", "Female") >> nrow(subset(result, result$pval <0.01)) > [1] 85 > > Again thank you so much > With Best Regards, > Narges_ > > #################################################################### > > Dear Narges > > thank you for the feedback. Your second question is easy: use the idiom > res1 <- subset(res, padj<0.1) > instead, this will avoid the creation of rows full of NA whenever > res$padj is NA. Alternatively > res[order(res$padj)[1:n], ] > with 'n' your favourite lucky number might be useful. Have a look at the > R-intro manual for more on subsetting of arrays and dataframes in R. > > Your first question: can you show us the data for the genes where you > know that they are differentially expressed? Perhaps then it might > become more apparent why DESeq / nbinomtest did not agree. Also, what > does the dispersion plot for cds look like? (This is the plot produced > by plotDispEsts in the vignette). > > Best wishes > Wolfgang > ##################################################################### > > narges [guest] scripsit 06/26/2012 06:17 PM: >> >> Hi all >> >> I am doing some RNA seq analysis with DESeq. I have applied the nbinomTest to my dataset which I know have many differentially expressed genes but the first problem is that the result values for "padj"column is almost NA and sometimes 1. and when I want to have a splice from my fata frame the result is not meaningful for me. >> >> -- output of sessionInfo(): >> >> res <- nbinomTest(cds, "Male", "Female") >> >>> head(res) >> id baseMean baseMeanA baseMeanB foldChange log2FoldChange >> 1 ENSG00000000003 0.1130534 0.000000 0.2261067 Inf Inf >> 2 ENSG00000000005 0.0000000 0.000000 0.0000000 NaN NaN >> 3 ENSG00000000419 14.3767155 17.162610 11.5908205 0.6753530 -0.5662863 >> 4 ENSG00000000457 17.0174761 15.342800 18.6921526 1.2183013 0.2848710 >> 5 ENSG00000000460 3.9414822 2.855099 5.0278659 1.7610131 0.8164056 >> 6 ENSG00000000938 16.0894945 18.350117 13.8288718 0.7536122 -0.4081058 >> pval padj >> 1 0.9959638 1 >> 2 NA NA >> 3 0.3208560 1 >> 4 0.5942512 1 >> 5 0.4840607 1 >> 6 0.5409953 1 >> >> >>> res1 <- res[res$padj<0.1,] >>> head(res1) >> id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj >> NA <na> NA NA NA NA NA NA NA >> NA.1 <na> NA NA NA NA NA NA NA >> NA.2 <na> NA NA NA NA NA NA NA >> NA.3 <na> NA NA NA NA NA NA NA >> NA.4 <na> NA NA NA NA NA NA NA >> NA.5 <na> NA NA NA NA NA NA NA >> >> my first question is that why although I know there are some differentially expressed genes in the my data, all the padj values are NA or 1 and the second question is this "NA.1" , "NA.2", ..... which are emerged as the first column of object "res1"instead of name of genes >> >> Thank you so much >> Regards >> >> -- >> Sent via the guest posting facility at bioconductor.org. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- Best wishes Wolfgang Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber
ADD COMMENT

Login before adding your answer.

Traffic: 679 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6