PCA for QC in RNA seq data is to detect sample level outliers and batch effects etc.
I am well aware that plotPCA of DESeq2 uses topn genes for doing PCA.
My naive question from whole community is that do we need to use all genes or top highly variable genes for doing PCA for QC of RNA seq data or any highthrougput omics data.
Now, I have around 200 samples from paired ened rna seq data and
# Read count data
ft <- "./data/rsem_counts.tsv"
readCount <- read.table(file = ft, header = TRUE, row.names = 1, stringsAsFactors = FALSE, check.names = FALSE)
head(colnames(readCount))
# Read metadata
metadata_inputfile <- "./data/metadata_counts.tsv"
metadata <- read.table(file = metadata_inputfile, header = TRUE, stringsAsFactors = FALSE, check.names = FALSE)
row.names(metadata)=metadata$filenames
metadata$filenames==colnames(readCount)
metadata$visit<-as.factor(metadata$visit)
metadata$response_status<-as.factor(metadata$response_status)
dds <- DESeqDataSetFromMatrix(readCount,
colData=metadata,
~visit+response_status)
# Check the available columns in colData(dds)
colnames(colData(dds))
colData(dds)
dds
vsd <- vst(dds, blind=FALSE)
## PCA for topn genes by default that DESEQ is using n=500
DESeq2::plotPCA(vsd,intgroup="Run", n=1000)
DESeq2::plotPCA(vsd,intgroup="Run", n=10000)
?plotPCA
Now looking at top 500 genes there is a strong some sort of factor (probably hidden batch effect?) but nothing in pca with 10000 genes(or whatever number of filtered genes i had after removing lowly expressed genes).
10k genes pca : https://drive.google.com/file/d/1LVKCJ-OJPmucjsfRa5mw-zc4s1w3NUw4/view?usp=sharing
1k genes pca : https://drive.google.com/file/d/1oU5yNxKPLCu8dEjkwEtisxBJtJYv0dZR/view?usp=sharing
I am very confused that I should go for sva here or not. any help in what ways probably i can track or move forward to check this sort of effect is real or not is appreciated.
ANd also what is teh best practice to do pca for QC at sample l;evel? by using topn genes or all gens?
Thanks much.
Thanks AT- probably it will be helpful for other people like me to know or have a reference for what could be the factors and why only subset of genes could be affected during sequencing. If you have any pointers I will appreciate that too.
That is the definition of batch effects, unwanted technical variation. Again, if all genes were affected equally it would be removed by standard normalization and you would not diagnose it.