Question

Principal component plot of the RNAseq samples and DEseq2 result

1

Emma ▴ 10

@emma-8582

Last seen 9.7 years ago

United States

Hi everyone, I am confused by my DESeq2 results and PCA plot. I have two RNA-seq samples (each with three biological replicates) that are very close to each other on the PCA plot. However, there are still a lot of differentially expressed genes (padj < 0.05) between them. I assumed these two samples should be very similar because they are close on the PCA plot, so how can I still get so many differentially expressed genes between the two?

Is this normal? If so, can anyone give me an explanation? Thanks in advance.

deseq2 rnaseq • 4.0k views

ADD COMMENT • link updated 9.7 years ago by Michael Love 43k • written 9.7 years ago by Emma ▴ 10

1

You need to provide more information to get some useful advice:

- your column data: as.data.frame(colData(dds))
- your design
- the code you used
- the output of sessionInfo()
- a picture of the PCA plot*

*see "How do I put images into my posts?": https://support.bioconductor.org/info/faq/

ADD REPLY • link 9.7 years ago Michael Love 43k

0

Thanks Michael.

Here is more information about my analysis:

as.data.frame(colData(ddsTC))
          genotype sizeFactor
rep1_A_16        A  0.5108994
rep2_A_16        A  1.8407776
rep3_A_16        A  0.8506794
rep1_B_16        B  1.0531460
rep2_B_16        B  0.4112253
rep3_B_16        B  1.0702545
rep2_C_16        C  1.2199964
rep3_C_16        C  1.2810071
rep1_D_16        D  1.0954071
rep2_D_16        D  1.1814218
rep3_D_16        D  1.4053602
# Sample table: one row per column of the count matrix
LRTDesign <- data.frame(row.names = colnames(R_data),
                        genotype = c("A", "A", "A", "B", "B", "B", "C", "C", "D", "D", "D"))

R_data_matrix <- data.matrix(R_data)
head(R_data_matrix)
ddsTC <- DESeqDataSetFromMatrix(countData = R_data_matrix, colData = LRTDesign, design = ~ genotype)
# Likelihood ratio test of the genotype term against an intercept-only model
ddsTC <- DESeq(ddsTC, test = "LRT", reduced = ~ 1)
resultsNames(ddsTC)
# Wald test for the pairwise D vs A comparison
D_vs_A_16 <- results(ddsTC, name = "genotype_D_vs_A", test = "Wald")

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] ggplot2_1.0.1             DESeq2_1.8.1
[3] RcppArmadillo_0.5.200.1.0 Rcpp_0.11.6
[5] GenomicRanges_1.20.5      GenomeInfoDb_1.4.1
[7] IRanges_2.2.4             S4Vectors_0.6.0
[9] BiocGenerics_0.14.0

loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-2   futile.logger_1.4.1 plyr_1.8.3
[4] XVector_0.8.0        futile.options_1.0.0 tools_3.2.0
[7] rpart_4.1-9          digest_0.6.8         RSQLite_1.0.0
[10] annotate_1.46.0      gtable_0.1.2         lattice_0.20-31
[13] DBI_0.3.1            proto_0.3-10         gridExtra_0.9.1
[16] genefilter_1.50.0    stringr_1.0.0        cluster_2.0.1
[19] locfit_1.5-9.1       nnet_7.3-9           grid_3.2.0
[22] Biobase_2.28.0       AnnotationDbi_1.30.1 XML_3.98-1.2
[25] survival_2.38-1      BiocParallel_1.2.4   foreign_0.8-63
[28] latticeExtra_0.6-26 Formula_1.2-1        geneplotter_1.46.0
[31] reshape2_1.4.1       lambda.r_1.1.7       magrittr_1.5
[34] scales_0.2.5         Hmisc_3.16-0         MASS_7.3-40
[37] splines_3.2.0        xtable_1.7-4         colorspace_1.2-6
[40] stringi_0.4-1        acepack_1.3-3.3      munsell_0.4.2

I don't understand why there are more differentially expressed genes between genotypes A and D than between genotypes A and C, even though A and D are closer on the PCA plot. Thanks a lot.

ADD REPLY • link updated 9.7 years ago by Michael Love 43k • written 9.7 years ago by Emma ▴ 10

0

A minor clarification in terminology: generally a sample refers to a single replicate, not a group of replicates. Anyway, if you could show us the PCA plot somehow, it would be much easier to see what you're trying to describe.

ADD REPLY • link 9.7 years ago Ryan C. Thompson ★ 7.9k

0

Thanks for your reply.

I did try to show the PCA plot, but I don't know how to insert an image here.

What I am trying to describe is that I ran a differential expression analysis (DESeq2) on two data points, A and B, each with three biological replicates, and got a lot of differentially expressed genes. However, on the PCA plot, A and B are very close to each other. I assumed gene expression in A and B should be very similar, since they are close on the PCA plot. Why do I still get a lot of differentially expressed genes?

Thanks.

ADD REPLY • link 9.7 years ago Emma ▴ 10

score 3 · Answer 1 · 2015-08-12

Short answer: the PCA plot you show covers multiple time points. Subset to the time point of interest so that the PCA is comparable to your DE results within that time point.
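As a rough sketch of what subsetting could look like: the object `ddsAll` and its `time` column below are hypothetical stand-ins (simulated with `makeExampleDESeqDataSet`, not the poster's data), and the time point labels are made up. The idea is only to subset the columns before transforming and calling `plotPCA`.

```r
library(DESeq2)

# Hypothetical stand-in for the full data set: simulated counts with
# made-up 'genotype' and 'time' columns (NOT the data from this thread).
set.seed(1)
ddsAll <- makeExampleDESeqDataSet(n = 1000, m = 12)
colData(ddsAll)$genotype <- factor(rep(c("A", "D"), each = 3, times = 2))
colData(ddsAll)$time <- factor(rep(c("8", "16"), each = 6))

# Keep only the samples from the time point used in the DE comparison.
dds16 <- ddsAll[, ddsAll$time == "16"]

# Transform the subset on its own, then draw the PCA colored by genotype.
rld16 <- rlog(dds16, blind = TRUE)
plotPCA(rld16, intgroup = "genotype")
```

Because the top-variance genes are chosen from the subset only, the first two components are no longer dominated by differences between time points.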

The longer explanation is that a PCA plot is a two-dimensional summary of the distances between samples, which necessarily discards some information (the original data live in the space of all genes, and here we project them onto only two dimensions). When you include all time points, the two most important dimensions for distinguishing the samples describe differences between time points. Other dimensions, such as those driven by genes that distinguish genotypes at your time point of interest, are not shown. If you subset to your time point of interest, I would guess you would see a more representative picture.
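To make this concrete, here is a small base R simulation. Everything in it is an assumption for illustration: the design (two genotypes, three time points, three replicates), the effect sizes, and the helper `genotype_separation` are all made up. It builds log-scale expression values with a large time effect on every gene and a smaller genotype effect on 200 genes, then compares how well PC1 separates the genotypes with and without subsetting.

```r
set.seed(42)
n_genes <- 2000

# Hypothetical design: 2 genotypes x 3 time points x 3 replicates = 18 samples.
design <- expand.grid(rep = 1:3,
                      genotype = factor(c("A", "D")),
                      time = factor(c("t1", "t2", "t3")))

# Baseline noise on the log scale.
expr <- matrix(rnorm(n_genes * nrow(design), sd = 0.3), nrow = n_genes)

# Large time effect: each gene gets its own level at each time point.
time_effect <- matrix(rnorm(n_genes * 3, sd = 2), nrow = n_genes)
expr <- expr + time_effect[, as.integer(design$time)]

# Smaller genotype effect: 200 genes are shifted by 1 log unit in genotype D.
is_D <- design$genotype == "D"
expr[1:200, is_D] <- expr[1:200, is_D] + 1

# PCA on all samples versus PCA on a single time point (samples in rows).
pca_all <- prcomp(t(expr))
keep <- design$time == "t3"
pca_sub <- prcomp(t(expr[, keep]))

# Hypothetical helper: distance between genotype group means along PC1.
genotype_separation <- function(scores, genotype) {
  unname(abs(diff(tapply(scores, genotype, mean))))
}

sep_all <- genotype_separation(pca_all$x[, 1], design$genotype)
sep_sub <- genotype_separation(pca_sub$x[, 1], design$genotype[keep])
c(all_time_points = sep_all, single_time_point = sep_sub)
```

In this setup the 200 shifted genes are real differences at every time point, yet the leading components of the full PCA are taken up by time, so the genotypes only separate clearly once the PCA is restricted to one time point.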