differential expression analysis with big dataset and >500 surrogate variables
aec ▴ 90
@aec-9409

Dear all,

I have more than 500 RNA-seq samples and need to compare cases vs. controls. I first ran SVA to remove unknown variation and found >500 surrogate variables. Is it good practice to perform an LRT with DESeq2, with full model = ~case+SV1+SV2+...+SVn and reduced model = ~case, to decide how many surrogate variables I should add without overfitting? The idea would be to first add SV1 to the full model, then SV1+SV2, then SV1+SV2+SV3, and so on, and to stop when the number of differentially expressed genes decreases relative to the previous model.
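For concreteness, one step of the comparison described above might look like the following minimal sketch. It runs on simulated data from `makeExampleDESeqDataSet()`, and the `SV1` column is a random placeholder rather than a real surrogate variable, so both are assumptions made purely for illustration. Note that with this pair of formulas the LRT asks, gene by gene, whether `SV1` explains variation beyond `case`; it does not test the case effect itself.

```r
library(DESeq2)

# Simulated stand-in data (assumption: not the poster's dataset)
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 500, m = 12)
dds$case <- dds$condition          # two-level factor playing the role of `case`
dds$SV1  <- rnorm(ncol(dds))       # hypothetical placeholder for a surrogate variable

# Full model includes SV1; reduced model drops it
design(dds) <- ~ case + SV1
dds <- DESeq(dds, test = "LRT", reduced = ~ case)

# LRT p-values: genes for which adding SV1 improves the fit
res <- results(dds)
n_genes_sv1 <- sum(res$padj < 0.1, na.rm = TRUE)
n_genes_sv1
```

Repeating this with `~ case + SV1 + SV2` against `~ case + SV1`, and so on, would give the stepwise procedure sketched in the question.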

 

SVA deseq2 differential expression LRT • 1.5k views

I think something went wrong with your estimation of the SVs. Can you post all your code and the output of sessionInfo()?

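library(DESeq2)
library(sva)

# Normalized counts, keeping genes with mean normalized count > 1
dds <- estimateSizeFactors(dds)
dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]

# Full model (with case) and null model (intercept only)
mod <- model.matrix(~case, colData(dds))
mod0 <- model.matrix(~1, colData(dds))

# Estimate the number of surrogate variables with the "leek" method
n.sv <- num.sv(dat,mod,method="leek")
n.sv

[1] 589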

 


What do you get with the default method "be"?

n.sv <- num.sv(dat,mod)
n.sv
[1] 1

 


I'll wait to see Jeff's answer, but this seems to be an issue.

I typically use a small number of SVs. Even with hundreds of samples, I usually find that 1-10 SVs or RUV factors are sufficient to capture technical variance.
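As a rough illustration of using a small, fixed number of SVs, here is a minimal sketch that estimates two surrogate variables with `svaseq()` and adds them to the DESeq2 design. The data are simulated with `makeExampleDESeqDataSet()`, and the choice of `n_sv <- 2` is an assumption made only for the example, not a recommendation for the dataset in the question.

```r
library(DESeq2)
library(sva)

# Simulated stand-in data (assumption: not the poster's dataset)
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12)
dds$case <- dds$condition

# Same preprocessing as in the thread: normalized counts, filter low-count genes
dds <- estimateSizeFactors(dds)
dat <- counts(dds, normalized = TRUE)
dat <- dat[rowMeans(dat) > 1, ]

mod  <- model.matrix(~ case, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))

# Fix the number of surrogate variables instead of taking it from num.sv()
n_sv  <- 2
svseq <- svaseq(dat, mod, mod0, n.sv = n_sv)

# Add the SVs as covariates, with the variable of interest last in the design
dds$SV1 <- svseq$sv[, 1]
dds$SV2 <- svseq$sv[, 2]
design(dds) <- ~ SV1 + SV2 + case

dds <- DESeq(dds)
res <- results(dds)   # case effect, adjusted for SV1 and SV2
```

Placing `case` last means `results()` reports the case comparison by default, with the surrogate variables acting as adjustment covariates.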


Thanks, Michael.
