Differential expression analysis with a big dataset and >500 surrogate variables

Dear all,

I have more than 500 RNA-seq samples and need to compare cases vs controls. I first ran SVA to remove unknown variation and found >500 surrogate variables. Is it good practice to perform an LRT with DESeq2, where the full model is ~case+SV1+SV2+...+SVn and the reduced model is ~case, to decide how many surrogate variables I should add in order to avoid overfitting? The idea would be to first add SV1 to the full model, then SV1+SV2, then SV1+SV2+SV3 and so on, and to stop once the number of differentially expressed genes decreases with respect to the previous model.

SVA deseq2 differential expression LRT

aec, 7.0 years ago

I think something went wrong with your estimation of SVs. Can you post all your code and sessionInfo()?

Michael Love, 7.0 years ago

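library(DESeq2)
library(sva)

# normalized counts, keeping genes with mean normalized count > 1
dds <- estimateSizeFactors(dds)
dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]

# full model (with case) and null model (intercept only)
mod <- model.matrix(~case, colData(dds))
mod0 <- model.matrix(~1, colData(dds))

# estimate the number of surrogate variables with the "leek" method
n.sv <- num.sv(dat,mod,method="leek")
n.sv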

[1] 589

aec, 7.0 years ago

What do you get with the default method "be"?

Michael Love, 7.0 years ago

n.sv <- num.sv(dat,mod)
n.sv
[1] 1

aec, 7.0 years ago

I'll wait to see Jeff's answer, but this seems to be an issue.

I typically use a small number of SVs. Even with hundreds of samples, I usually find that 1-10 SVs or RUV factors are sufficient to capture technical variance.
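
As a rough illustration of working with a small, fixed number of SVs, here is a minimal sketch rather than a definitive recipe. It assumes simulated data from makeExampleDESeqDataSet() in place of the real dataset, so `condition` stands in for `case`, and the choice of 2 SVs is arbitrary and only for illustration. The SVs are estimated with svaseq() and then added to the DESeq2 design ahead of the variable of interest.

```r
library(DESeq2)
library(sva)

set.seed(1)

# Simulated stand-in for the real data (assumption): 1000 genes, 12 samples.
# The `condition` column plays the role of `case` in the thread.
dds <- makeExampleDESeqDataSet(n = 1000, m = 12)

# Same preprocessing as in the posted code: normalized counts,
# keeping genes with mean normalized count > 1
dds <- estimateSizeFactors(dds)
dat <- counts(dds, normalized = TRUE)
dat <- dat[rowMeans(dat) > 1, ]

# Full model (with the variable of interest) and null model (intercept only)
mod  <- model.matrix(~ condition, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))

# Fix the number of SVs by hand instead of taking it from num.sv();
# 2 is an arbitrary value chosen for illustration
n.sv <- 2
svseq <- svaseq(dat, mod, mod0, n.sv = n.sv)

# Add the SVs as covariates, with the variable of interest last in the design
ddssva <- dds
ddssva$SV1 <- svseq$sv[, 1]
ddssva$SV2 <- svseq$sv[, 2]
design(ddssva) <- ~ SV1 + SV2 + condition

# Fit the model; results() reports the last variable in the design by default
ddssva <- DESeq(ddssva)
res <- results(ddssva)
```

With real data, plotting the SVs against known batch or quality variables is one way to judge whether the chosen number is capturing technical variation.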