PCA for QC should include subset of top variable genes or full set of genes in RNAseq data?
1
0
Entering edit mode
amnahsiddiqa ▴ 10
@amnahsiddiqa-15854
Last seen 17 months ago
United States

PCA for QC in RNA seq data is to detect sample level outliers and batch effects etc.

I am well aware that plotPCA of DESeq2 uses topn genes for doing PCA.

My naive question from whole community is that do we need to use all genes or top highly variable genes for doing PCA for QC of RNA seq data or any highthrougput omics data.

Now, I have around 200 samples from paired ened rna seq data and

# Read count data
ft <- "./data/rsem_counts.tsv"
readCount <- read.table(file = ft, header = TRUE, row.names = 1, stringsAsFactors = FALSE, check.names = FALSE)
head(colnames(readCount))

# Read metadata
metadata_inputfile <- "./data/metadata_counts.tsv"
metadata <- read.table(file = metadata_inputfile, header = TRUE, stringsAsFactors = FALSE, check.names = FALSE)
row.names(metadata)=metadata$filenames
metadata$filenames==colnames(readCount)

metadata$visit<-as.factor(metadata$visit)
metadata$response_status<-as.factor(metadata$response_status)


dds <- DESeqDataSetFromMatrix(readCount, 
                              colData=metadata, 
                              ~visit+response_status)
# Check the available columns in colData(dds)
colnames(colData(dds))
colData(dds)
dds 

vsd <- vst(dds, blind=FALSE)

## PCA for topn genes by default that DESEQ is using n=500
DESeq2::plotPCA(vsd,intgroup="Run", n=1000)
DESeq2::plotPCA(vsd,intgroup="Run", n=10000)

?plotPCA

Now looking at top 500 genes there is a strong some sort of factor (probably hidden batch effect?) but nothing in pca with 10000 genes(or whatever number of filtered genes i had after removing lowly expressed genes).

10k genes pca : https://drive.google.com/file/d/1LVKCJ-OJPmucjsfRa5mw-zc4s1w3NUw4/view?usp=sharing

1k genes pca : https://drive.google.com/file/d/1oU5yNxKPLCu8dEjkwEtisxBJtJYv0dZR/view?usp=sharing

I am very confused that I should go for sva here or not. any help in what ways probably i can track or move forward to check this sort of effect is real or not is appreciated.

ANd also what is teh best practice to do pca for QC at sample l;evel? by using topn genes or all gens?

Thanks much.

BatchQC RNASeqData sva BatchEffect • 1.9k views
ADD COMMENT
2
Entering edit mode
ATpoint ★ 4.6k
@atpoint-13662
Last seen 7 hours ago
Germany

You use top genes because batch effects affect a subset of genes, not all genes. If all genes were affected pretty much equally then we could simply normalize them away with a scaling factor. Using top genes is motivated by the fact that relevant batch effects are causing strong variability, hence you diagnose with the most variable genes.

ADD COMMENT
0
Entering edit mode

Thanks AT- probably it will be helpful for other people like me to know or have a reference for what could be the factors and why only subset of genes could be affected during sequencing. If you have any pointers I will appreciate that too.

ADD REPLY
0
Entering edit mode

That is the definition of batch effects, unwanted technical variation. Again, if all genes were affected equally it would be removed by standard normalization and you would not diagnose it.

ADD REPLY

Login before adding your answer.

Traffic: 542 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6