Hi everyone, I am confused by my DESeq2 results and PCA plot. I have two RNA-seq samples (each with three biological replicates) that are very close to each other on the PCA plot. However, there are still many differentially expressed genes (padj < 0.05) between them. I assumed the two samples should be very similar because they are close on the PCA plot, so how can I still get so many differentially expressed genes between them?
Is this normal? If so, can anyone explain why? Thanks in advance.
You need to provide more information to get some useful advice:
- your column data: as.data.frame(colData(dds))
- your design
- the code you used
- the output of sessionInfo()
- a picture of the PCA plot*
*see "How do I put images into my posts?": https://support.bioconductor.org/info/faq/
Thanks Michael.
Here is more information about my analysis:
as.data.frame(colData(ddsTC))
          genotype sizeFactor
rep1_A_16        A  0.5108994
rep2_A_16        A  1.8407776
rep3_A_16        A  0.8506794
rep1_B_16        B  1.0531460
rep2_B_16        B  0.4112253
rep3_B_16        B  1.0702545
rep2_C_16        C  1.2199964
rep3_C_16        C  1.2810071
rep1_D_16        D  1.0954071
rep2_D_16        D  1.1814218
rep3_D_16        D  1.4053602
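# Sample table: one row per column of the count matrix (genotype C has only two replicates)
LRTDesign <- data.frame(row.names = colnames(R_data), genotype = c("A", "A", "A", "B", "B", "B", "C", "C", "D", "D", "D"))
R_data_matrix <- data.matrix(R_data)
head(R_data_matrix)
ddsTC <- DESeqDataSetFromMatrix(countData = R_data_matrix, colData = LRTDesign, design = ~ genotype)
# Likelihood ratio test of the genotype term against an intercept-only model
ddsTC <- DESeq(ddsTC, test = "LRT", reduced = ~ 1)
resultsNames(ddsTC)
# Pairwise D vs A comparison, with Wald test p-values instead of the LRT p-values
D_vs_A_16 <- results(ddsTC, name = "genotype_D_vs_A", test = "Wald")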
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] ggplot2_1.0.1 DESeq2_1.8.1
[3] RcppArmadillo_0.5.200.1.0 Rcpp_0.11.6
[5] GenomicRanges_1.20.5 GenomeInfoDb_1.4.1
[7] IRanges_2.2.4 S4Vectors_0.6.0
[9] BiocGenerics_0.14.0
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-2 futile.logger_1.4.1 plyr_1.8.3
[4] XVector_0.8.0 futile.options_1.0.0 tools_3.2.0
[7] rpart_4.1-9 digest_0.6.8 RSQLite_1.0.0
[10] annotate_1.46.0 gtable_0.1.2 lattice_0.20-31
[13] DBI_0.3.1 proto_0.3-10 gridExtra_0.9.1
[16] genefilter_1.50.0 stringr_1.0.0 cluster_2.0.1
[19] locfit_1.5-9.1 nnet_7.3-9 grid_3.2.0
[22] Biobase_2.28.0 AnnotationDbi_1.30.1 XML_3.98-1.2
[25] survival_2.38-1 BiocParallel_1.2.4 foreign_0.8-63
[28] latticeExtra_0.6-26 Formula_1.2-1 geneplotter_1.46.0
[31] reshape2_1.4.1 lambda.r_1.1.7 magrittr_1.5
[34] scales_0.2.5 Hmisc_3.16-0 MASS_7.3-40
[37] splines_3.2.0 xtable_1.7-4 colorspace_1.2-6
[40] stringi_0.4-1 acepack_1.3-3.3 munsell_0.4.2
I don't understand why there are more differentially expressed genes between genotypes A and D than between genotypes A and C, even though A and D are closer to each other on the PCA plot. Thanks a lot.
A minor clarification in terminology: generally a sample refers to a single replicate, not a group of replicates. Anyway, if you could show us the PCA plot somehow, it would be much easier to see what you're trying to describe.
Thanks for your reply.
I did try to show the PCA plot, but I don't know how to insert an image here.
What I am trying to describe is that I ran a differential expression analysis (DESeq2) on two data points, A and B, each with three biological replicates, and got a lot of differentially expressed genes. However, on the PCA plot, data points A and B are very close to each other. I assumed gene expression in A and B should be very similar, since they are close on the PCA plot. Why do I still get a lot of differentially expressed genes?
Thanks.
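One way to see how both observations can hold at once is a small simulation. The sketch below does not use the data from this thread: the gene counts, shift sizes, and noise level are made-up assumptions, and a per-gene t-test on log-scale values stands in for DESeq2's Wald test. It runs PCA on only the 500 most variable genes, mirroring the ntop = 500 default of DESeq2's plotPCA (assuming that is how the plot was made). Genotypes B and C get large shifts in a few hundred genes, while D gets a modest shift in many genes.

```r
set.seed(1)
n_genes <- 2000
groups  <- factor(rep(c("A", "B", "C", "D"), each = 3))

# Simulated log-scale expression: per-gene baseline plus small replicate noise
# (all numbers here are illustrative assumptions, not estimates from real data)
baseline <- rnorm(n_genes, mean = 8, sd = 2)
expr <- matrix(rnorm(n_genes * length(groups), sd = 0.1), nrow = n_genes) + baseline
colnames(expr) <- paste0(groups, "_rep", rep(1:3, times = 4))

# Large shifts in few genes for B and C, a modest shift in many genes for D
expr[1:200,    groups == "B"] <- expr[1:200,    groups == "B"] + 4
expr[201:300,  groups == "C"] <- expr[201:300,  groups == "C"] + 4
expr[301:1100, groups == "D"] <- expr[301:1100, groups == "D"] + 0.8

# PCA on the 500 most variable genes only
gene_var <- apply(expr, 1, var)
top <- order(gene_var, decreasing = TRUE)[1:500]
pca <- prcomp(t(expr[top, ]))

# Group centroids on PC1/PC2 and the distance between two groups
centroids <- rowsum(pca$x[, 1:2], groups) / as.vector(table(groups))
pc_dist <- function(g1, g2) sqrt(sum((centroids[g1, ] - centroids[g2, ])^2))

# Number of genes with BH-adjusted p < 0.05 (t-test as a stand-in for DESeq2)
count_de <- function(g1, g2) {
  pvals <- apply(expr, 1, function(y) {
    t.test(y[groups == g1], y[groups == g2], var.equal = TRUE)$p.value
  })
  sum(p.adjust(pvals, method = "BH") < 0.05)
}

dist_AD <- pc_dist("A", "D")
dist_AC <- pc_dist("A", "C")
n_AD <- count_de("A", "D")
n_AC <- count_de("A", "C")

cat("PC1/PC2 distance A-D:", round(dist_AD, 2), " A-C:", round(dist_AC, 2), "\n")
cat("Genes with padj < 0.05, A vs D:", n_AD, " A vs C:", n_AC, "\n")
```

Under these assumptions A and D land close together on PC1/PC2, because those axes are driven by the large B and C effects, yet A vs D gives more significant genes than A vs C. Distance on the first two components reflects the largest sources of variance across all samples, while the per-gene test only asks whether a shift is large relative to replicate-to-replicate variability.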