Hi all, I have RNA-Seq raw count data and I want to take rlog (regularized log transform) to run surrogate variable analysis using R package.
FYI, here is my code:
>source("https://bioconductor.org/biocLite.R")
>biocLite("DESeq")
>library(DESeq2)
>dat <- read.table("TestData_09_30_2016.txt",header=T)
>row.names(dat)<-dat[,1]
>dat<-dat[,-1]
>dat <- as.matrix(dat)
>head(dat)
>(condition <- factor(c(rep("KO1", 1), rep("KO2", 1), rep("KO3", 1), rep("KO4", 1), rep("KO5", 1), rep("KO6", 1), rep("Control1", 1), rep("Control2", 1))))
>(coldata <- data.frame(row.names=colnames(dat), condition))
>dds <- DESeqDataSetFromMatrix(countData=dat, colData=coldata, design=~condition)
#run the DESeq pipeline
>dds <- DESeq(dds)
#delete all the data that have total of 0 counts
>dds<-dds[rowSums(counts(dds))>1,]
#get regularized log transformation
>rld <- rlogTransformation(dds)
This is my raw count data (without the data that have total of 0 counts) head(dat):
KO1 KO2 KO3 KO4 KO5 KO6 Control1 Control2
ENSG00000000003 770 520 650 364 1165 440 1116 1002
ENSG00000000005 23 17 9 15 30 18 31 16
ENSG00000000419 1086 473 800 425 1091 659 1187 1192
ENSG00000000457 401 231 374 156 583 312 457 485
ENSG00000000460 619 364 604 268 795 428 883 684
ENSG00000000938 0 1 0 2 0 3 0 0
And this is my head(assay(rld)):
KO1 KO2 KO3 KO4 KO5 KO6 Control1
ENSG00000000003 9.38864787 9.55798078 9.33509314 9.38546651 9.5167140 9.19017804 9.63294632
ENSG00000000005 4.26857690 4.32402555 4.14925099 4.32936408 4.2673566 4.28471196 4.31229727
ENSG00000000419 9.78054476 9.55463023 9.59997856 9.60364472 9.5358608 9.62371048 9.76759852
ENSG00000000457 8.42800759 8.45243834 8.45965033 8.27816421 8.5124251 8.49092122 8.45058666
ENSG00000000460 9.05770197 9.10173211 9.13116762 8.98096336 9.0382444 9.02339178 9.28249708
ENSG00000000938 -0.08566075 -0.06565163 -0.08529477 -0.04374719 -0.0864924 -0.03977232 -0.08599017
Control2
ENSG00000000003 9.48105503
ENSG00000000005 4.17943696
ENSG00000000419 9.71825081
ENSG00000000457 8.45220418
ENSG00000000460 9.00489782
ENSG00000000938 -0.08616627
So my question is:
1) What does an output look like when I run SVA? Do I get a list of genes without the genes that are removed? Or do I get the list of genes that are subjected to removal?
2) My final goal is to see differentially expressed genes after running SVA. When I just use dds for DESeq and run results(dds), there were so many repeated adjusted p-values, which did not give me enough information to determine differentially expressed genes. There were many suggestions in previous posts, but my professor wants me to use SVA then determine DE. What would be the simplest way to do this?
Thanks!