Hi,
To see the batch effects I would like to make a PCA plot. I have readcounts data for my data. I'm showing simulated data here.
Geneid |
SampleC |
SampleD |
SampleE |
SampleF |
SampleG |
ENSG00000230021.9 |
25 |
30 |
27 |
9 |
8 |
ENSG00000235146.2 |
0 |
0 |
0 |
0 |
0 |
ENSG00000225972.1 |
7 |
3016 |
11 |
5 |
14 |
ENSG00000225630.1 |
113 |
194 |
76 |
91 |
47 |
ENSG00000237973.1 |
1037 |
9767 |
1160 |
886 |
1321 |
ENSG00000278791.1 |
1 |
0 |
5 |
0 |
0 |
ENSG00000229344.1 |
121 |
129 |
97 |
37 |
112 |
After filtering step I did normalization
y <- calcNormFactors(y,method = "TMM")
I tried making a PCA plot like below, but didn't work.
er <- y$counts
As I see genes as rownames
rownames(er) <- NULL
type <- read.csv("samples.csv")
Samples Type
SampleC Disease
SampleD Normal
SampleE Disease
SampleF Disease
SampleG Normal
library(ggbiplot)
er.pca <- prcomp(er, scale. = TRUE)
ggbiplot(er.pca, obs.scale = 1, var.scale = 1,
groups= type, varname.abbrev = FALSE,var.axes = FALSE,pc.biplot =TRUE,circle = TRUE) +
scale_color_discrete(name = '') +
theme(legend.direction = 'horizontal', legend.position = 'top')
This is the Error I see:
Error in `$<-.data.frame`(`*tmp*`, "groups", value = list(Samples = 1:46, : replacement has 46 rows, data has 27906
Thank you. but MDS plot is different from PCA right? And with the counts data how can I get the normalized values?
Classical MDS and PCA are quite closely related. I can't recall all the theory, but look:
The coordinates should be the same other than a change in sign, which doesn't impact the dimensionality reduction anyway. So, from a conceptual perspective, it doesn't really matter which one you use. The choice is more practical as MDS only needs a distance matrix while PCA requires the full coordinates. In some applications, you only get a distance matrix (comparative studies where an absolute measurement is not possible, e.g., wine tasting), which is where you need to run
cmdscale
.In
plotMDS
, edgeR constructs a distance matrix by computing the average absolute log-fold change across the top 500 genes with the largest log-fold changes between every pair of samples. We use a top set to improve the resolution of any differences between samples. However, the top set changes across pairs of samples, meaning that we can't easily convert this approach to an expression matrix (with the same genes for all samples) for use in PCA. If you setgene.selection="common"
inplotMDS
, then you'd basically be doing a PCA with the top 500 most highly variable genes.As for getting the normalized expression values, try
cpm(y, log=TRUE, prior.count=5)
. This computes log-CPMs for use in dimensionality reduction and clustering. A log-transform provides some measure of variance stabilization; otherwise, highly expressed genes with large absolute differences in the counts between samples would just end up driving the spread of samples in the plot, even if they have low log-fold changes and are largely uninteresting. (Or in other words, the difference between the log-CPMs is the log-fold change, which is generally much more interesting than a difference on the raw count scale.) Theprior.count=5
downweights large log-differences when the counts are low by shrinking the log-fold change to zero.Thank you very much Aaron. In my code above for the ggbiplot, for variable groups I gave type. Is that right?
I don't use ggpbiplot, or any ggplot for that matter. But I notice that you've run
prcomp
oner
. This is wrong on two counts - one, you should be using log-CPMs as I mentioned above, and two, you should transpose the matrix so that the samples are the rows. Right now, you're doing a PCA on the genes, resulting in >20000 points on the plot. This is unlikely to be what you intended iftype
is a per-sample variable.Coincidentally, there is now PCAtools: https://bioconductor.org/packages/release/bioc/html/PCAtools.html