Principle component analysis plot with gene expression data
1
0
Entering edit mode
Biologist ▴ 120
@biologist-9801
Last seen 4.8 years ago

Hi,

To see the batch effects I would like to make a PCA plot. I have readcounts data for my data. I'm showing simulated data here.

Geneid

SampleC

SampleD

SampleE

SampleF

SampleG

ENSG00000230021.9

25

30

27

9

8

ENSG00000235146.2

0

0

0

0

0

ENSG00000225972.1

7

3016

11

5

14

ENSG00000225630.1

113

194

76

91

47

ENSG00000237973.1

1037

9767

1160

886

1321

ENSG00000278791.1

1

0

5

0

0

ENSG00000229344.1

121

129

97

37

112

After filtering step I did normalization 

y <- calcNormFactors(y,method = "TMM")

I tried making a PCA plot like below, but didn't work.

er <- y$counts

As I see genes as rownames 

rownames(er) <- NULL

type <- read.csv("samples.csv") 

Samples          Type

SampleC       Disease

SampleD        Normal

SampleE       Disease

SampleF         Disease

SampleG       Normal

library(ggbiplot)

er.pca <- prcomp(er, scale. = TRUE)
ggbiplot(er.pca, obs.scale = 1, var.scale = 1,
         groups= type, varname.abbrev = FALSE,var.axes = FALSE,pc.biplot =TRUE,circle = TRUE) +
  scale_color_discrete(name = '') +
  theme(legend.direction = 'horizontal', legend.position = 'top')

This is the Error I see:

Error in `$<-.data.frame`(`*tmp*`, "groups", value = list(Samples = 1:46,  : 
  replacement has 46 rows, data has 27906
r pca rnaseq edger ggbiplot • 5.3k views
ADD COMMENT
2
Entering edit mode
@james-w-macdonald-5106
Last seen 7 hours ago
United States

You don't want to do a PCA plot on counts. And do note that calcNormFactors doesn't do anything to the counts anyway. It just calculates offsets for the linear model. If you are using edgeR, you could do

plotMDS(y, pch = 16, labels = as.numeric(type$Type))

Which will show any batch effects. You could also use the batches to color the plotting symbols, which would probably make it easier to see a batch effect if it's subtle.

ADD COMMENT
0
Entering edit mode

Thank you. but MDS plot is different from PCA right? And with the counts data how can I get the normalized values?

ADD REPLY
1
Entering edit mode

Classical MDS and PCA are quite closely related. I can't recall all the theory, but look:

set.seed(100)
my.exprs <- matrix(rnorm(1000), ncol=10)
d <- dist(t(my.exprs)) # sample-sample distance matrix
cmdscale(d, k=2) # Function for MDS in R.
prcomp(t(my.exprs), rank.=2)$x # Function for PCA in R.

The coordinates should be the same other than a change in sign, which doesn't impact the dimensionality reduction anyway. So, from a conceptual perspective, it doesn't really matter which one you use. The choice is more practical as MDS only needs a distance matrix while PCA requires the full coordinates. In some applications, you only get a distance matrix (comparative studies where an absolute measurement is not possible, e.g., wine tasting), which is where you need to run cmdscale.

In plotMDS, edgeR constructs a distance matrix by computing the average absolute log-fold change across the top 500 genes with the largest log-fold changes between every pair of samples. We use a top set to improve the resolution of any differences between samples. However, the top set changes across pairs of samples, meaning that we can't easily convert this approach to an expression matrix (with the same genes for all samples) for use in PCA. If you set gene.selection="common" in plotMDS, then you'd basically be doing a PCA with the top 500 most highly variable genes.

As for getting the normalized expression values, try cpm(y, log=TRUE, prior.count=5). This computes log-CPMs for use in dimensionality reduction and clustering. A log-transform provides some measure of variance stabilization; otherwise, highly expressed genes with large absolute differences in the counts between samples would just end up driving the spread of samples in the plot, even if they have low log-fold changes and are largely uninteresting. (Or in other words, the difference between the log-CPMs is the log-fold change, which is generally much more interesting than a difference on the raw count scale.) The prior.count=5 downweights large log-differences when the counts are low by shrinking the log-fold change to zero.

ADD REPLY
0
Entering edit mode

Thank you very much Aaron. In my code above for the ggbiplot, for variable groups I gave type. Is that right?

ADD REPLY
0
Entering edit mode

I don't use ggpbiplot, or any ggplot for that matter. But I notice that you've run prcomp on er. This is wrong on two counts - one, you should be using log-CPMs as I mentioned above, and two, you should transpose the matrix so that the samples are the rows. Right now, you're doing a PCA on the genes, resulting in >20000 points on the plot. This is unlikely to be what you intended if type is a per-sample variable.

ADD REPLY
0
Entering edit mode
ADD REPLY

Login before adding your answer.

Traffic: 412 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6