Principal component plot of the RNA-seq samples and DESeq2 result
Emma ▴ 10

Hi everyone, I am confused by my DESeq2 result and PCA plot. I have two RNA-seq samples (each with three biological replicates) that are very close to each other on the PCA plot. However, there are still a lot of differentially expressed genes (padj < 0.05) between these two samples. I assumed the two samples would be very similar because they are close to each other on the PCA plot, so how can I still get so many differentially expressed genes between them?

Is this normal? If so, can anyone give me an explanation? Thanks in advance.




You need to provide more information to get some useful advice: 

- your column data
- your design
- the code you used
- the output of sessionInfo()
- a picture of the PCA plot*

*see "How do I put images into my posts?":

Thanks Michael.

Here is more information about my analysis:
          genotype sizeFactor
rep1_A_16        A  0.5108994
rep2_A_16        A  1.8407776
rep3_A_16        A  0.8506794
rep1_B_16        B  1.0531460
rep2_B_16        B  0.4112253
rep3_B_16        B  1.0702545
rep2_C_16        C  1.2199964
rep3_C_16        C  1.2810071
rep1_D_16        D  1.0954071
rep2_D_16        D  1.1814218
rep3_D_16        D  1.4053602

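library(DESeq2)

# Sample table: one row per column of the count table R_data
LRTDesign <- data.frame(row.names = colnames(R_data),
                        genotype = c("A", "A", "A", "B", "B", "B", "C", "C", "D", "D", "D"))

R_data_matrix <- data.matrix(R_data)
ddsTC <- DESeqDataSetFromMatrix(countData = R_data_matrix, colData = LRTDesign, design = ~ genotype)
# Likelihood ratio test of the genotype term against an intercept-only model
ddsTC <- DESeq(ddsTC, test = "LRT", reduced = ~ 1)
# Wald test for the D vs A comparison
D_vs_A_16 <- results(ddsTC, name = "genotype_D_vs_A", test = "Wald")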

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)


[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252  
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                         
[5] LC_TIME=English_United States.1252   


attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base    


other attached packages:
[1] ggplot2_1.0.1             DESeq2_1.8.1            
[3] RcppArmadillo_0. Rcpp_0.11.6             
[5] GenomicRanges_1.20.5      GenomeInfoDb_1.4.1      
[7] IRanges_2.2.4             S4Vectors_0.6.0         
[9] BiocGenerics_0.14.0     


loaded via a namespace (and not attached):
 [1] RColorBrewer_1.1-2   futile.logger_1.4.1  plyr_1.8.3         
 [4] XVector_0.8.0        futile.options_1.0.0 tools_3.2.0        
 [7] rpart_4.1-9          digest_0.6.8         RSQLite_1.0.0      
[10] annotate_1.46.0      gtable_0.1.2         lattice_0.20-31    
[13] DBI_0.3.1            proto_0.3-10         gridExtra_0.9.1    
[16] genefilter_1.50.0    stringr_1.0.0        cluster_2.0.1      
[19] locfit_1.5-9.1       nnet_7.3-9           grid_3.2.0         
[22] Biobase_2.28.0       AnnotationDbi_1.30.1 XML_3.98-1.2       
[25] survival_2.38-1      BiocParallel_1.2.4   foreign_0.8-63     
[28] latticeExtra_0.6-26  Formula_1.2-1        geneplotter_1.46.0 
[31] reshape2_1.4.1       lambda.r_1.1.7       magrittr_1.5       
[34] scales_0.2.5         Hmisc_3.16-0         MASS_7.3-40        
[37] splines_3.2.0        xtable_1.7-4         colorspace_1.2-6   
[40] stringi_0.4-1        acepack_1.3-3.3      munsell_0.4.2  

I don't understand why there are more differentially expressed genes between genotypes A and D than between genotypes A and C, even though A and D are closer on the PCA plot. Thanks a lot.



A minor clarification in terminology: generally a sample refers to a single replicate, not a group of replicates. Anyway, if you could show us the PCA plot somehow, it would be much easier to see what you're trying to describe.

Thanks for your reply. 

I did try to show the PCA plot, but I don't know how to insert an image here.

What I am trying to describe is that I ran a differential expression analysis (DESeq2) on two data points, A and B, each with three biological replicates, and got a lot of differentially expressed genes. However, on the PCA plot, data points A and B are very close to each other. I assumed gene expression in A and B would be very similar, since they are close on the PCA plot. Why do I still get a lot of differentially expressed genes?



Short answer: the PCA plot you show covers multiple time points. You should subset to the time point of interest to make the PCA more comparable to your DE results within that time point.
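
A minimal sketch of that subsetting step, assuming the full data set is stored in a DESeqDataSet with a `time` column in its column data. The object `dds_all`, the `time` column, the second time point "8", and the simulated counts from `makeExampleDESeqDataSet()` are stand-ins for illustration, not your actual data:

```r
library(DESeq2)

# Hypothetical stand-in for the real data: 12 samples,
# 2 time points x 2 genotypes x 3 replicates, with simulated counts.
set.seed(1)
dds_all <- makeExampleDESeqDataSet(n = 1000, m = 12)
dds_all$time     <- factor(rep(c("8", "16"), each = 6))
dds_all$genotype <- factor(rep(rep(c("A", "D"), each = 3), times = 2))

# Keep only the samples from the time point of interest.
dds_16 <- dds_all[, dds_all$time == "16"]
dds_16$genotype <- droplevels(dds_16$genotype)

# Transform the counts, then draw the PCA for this subset only.
vsd_16 <- varianceStabilizingTransformation(dds_16, blind = TRUE)
plotPCA(vsd_16, intgroup = "genotype")
```

With real data, the same subsetting would be applied to the object that holds all time points before the transformation is run, so the principal components are computed from the time point 16 samples alone.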

The longer explanation is that a PCA plot is a two-dimensional summary of the distances between samples, which requires discarding some information (the original data lives in the space of all genes, and here it is reduced to only two dimensions). When you include all time points, the two most important dimensions for distinguishing the samples describe differences between time points. Other dimensions, such as the one made up of genes that distinguish genotypes at your time point of interest, are not shown. If you subset to your time point of interest, I would guess you would see a more representative picture.
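
The effect can be reproduced with a small simulation in base R. This is only a sketch under made-up assumptions: three time points, two genotypes, log-scale values drawn from normal distributions, `prcomp()` in place of `plotPCA()`, and per-gene t-tests standing in for DESeq2. All names and effect sizes are hypothetical.

```r
set.seed(42)
n_genes <- 2000

# Hypothetical layout: 3 time points x 2 genotypes x 3 replicates = 18 samples.
samples <- expand.grid(rep = 1:3,
                       genotype = c("A", "D"),
                       time = c("t8", "t16", "t24"),
                       stringsAsFactors = FALSE)

# Simulated log-scale effects (all values are made up for illustration).
effect_t16 <- rnorm(n_genes, sd = 2)                   # t16 vs t8, every gene
effect_t24 <- rnorm(n_genes, sd = 2)                   # t24 vs t8, every gene
effect_D   <- c(rep(1.5, 200), rep(0, n_genes - 200))  # D vs A, 200 genes only

# Expression matrix with genes in rows and samples in columns.
expr <- sapply(seq_len(nrow(samples)), function(i) {
  effect_t16 * (samples$time[i] == "t16") +
    effect_t24 * (samples$time[i] == "t24") +
    effect_D * (samples$genotype[i] == "D") +
    rnorm(n_genes, sd = 0.2)
})

# PCA on all samples: PC1 and PC2 are taken up by the time effect.
pca_all <- prcomp(t(expr))
var_all <- pca_all$sdev^2 / sum(pca_all$sdev^2)

# PCA on time point 16 only: genotype becomes the main axis.
at_16  <- samples$time == "t16"
pca_16 <- prcomp(t(expr[, at_16]))
pc1_16 <- split(pca_16$x[, 1], samples$genotype[at_16])

# Per-gene t-test, D vs A at time point 16, with BH adjustment.
is_D <- samples$genotype[at_16] == "D"
pval <- apply(expr[, at_16], 1, function(g) {
  t.test(g[is_D], g[!is_D], var.equal = TRUE)$p.value
})
n_de <- sum(p.adjust(pval, method = "BH") < 0.05)

round(var_all[1:3], 3)  # share of variance per component, all samples
pc1_16                  # PC1 scores by genotype at time point 16
n_de                    # genes with adjusted p < 0.05
```

In this toy setup the first two components of the all-sample PCA carry almost all of the variance and reflect time, so A and D sit close together on that plot even though many genes differ between them at time point 16. In the subset PCA the genotypes separate along PC1.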

Thanks, Michael, for your quick reply. I forgot to mention that what confuses me most is the result at time point 16.

I will try to subset my dataset according to time point and make the PCA plot.

Thanks again for your helpful comment.




