Question

Remove unwanted variation/confounder due to library size with sva

0

Entering edit mode

Jasmin • 0

@6462d400

Last seen 2.7 years ago

Netherlands

Hi,

Library size is a source of unwanted variation in my data (dataset of healthy vs disease samples) because I can see in the pca plot that samples are separated to some degree by library size. Can sva remove the effect that library size has on my data or is another package better suited for this? To be clear: I'm not trying to remove batch effects.

Thanks!

sva • 2.3k views

ADD COMMENT • link updated 2.7 years ago by James W. MacDonald 68k • written 2.8 years ago by Jasmin • 0

score 1 · Answer 1 · 2022-07-20

1

Entering edit mode

ATpoint ★ 4.8k

@atpoint-13662

Last seen 6 hours ago

Germany

Just asking to be sure, how did you diagnose that library size is a confounder? Did you run a PCA based on values where library size has been normalized for, like vst() from DESeq2? If so then svaseq might be used here. Check if it identifies the library size as a significant surrogate variable. The manual covers a basic workflow. If you need details then please add code and plots.

ADD COMMENT • link 2.8 years ago ATpoint ★ 4.8k

0

Entering edit mode

Hi thanks for answering! This is the code that I want to use:

dds<-DESeqDataSetFromMatrix(df,colData =coldata, design = ~lib.size+sex+Patient_group)
dds <- estimateSizeFactors(dds)
dat  <- counts(dds, normalized = TRUE)
idx  <- rowMeans(dat) > 1
dat  <- dat[idx, ]
mod  <- model.matrix(~lib.size+sex+Patient_group, colData(dds))
mod0 <- model.matrix(~lib.size, colData(dds))
svseq <- svaseq(dat, mod, mod0)
ddssva <- dds
design(ddssva) <- ~sex + lib.size + Patient_group
design(ddssva)
rld <- vst(ddssva, blind=FALSE)
mat <- assay(rld)
mm <- model.matrix(~sex+Patient_group. colData(rld))
mat <- limma::removeBatchEffect(mat, design=mm, batch=lib.size)
assay(rld) <- mat
plotPCA(rld,"Patient.group")

However, I'm unsure about the mod0 line which should only include the "adjustment" variables, so that would be libsize I think? I'm pretty sure I'm making a mistake with the svseq line because I don't want to add any surrogate variables, just remove lib.size as a confounder, but I don't know how to write this down. I'd really appreciate your help!

ADD REPLY • link 2.8 years ago • updated 2.7 years ago Jasmin • 0

0

Entering edit mode

The plots aren't really the problem, it's just the code that is giving me great difficulties. Thanks again!

ADD REPLY • link 2.8 years ago Jasmin • 0

1

Entering edit mode

You are generating surrogate variables, but you never use them. Also mod0 should include all the adjustment variables, which includes sex (which is almost always included as a nuisance variable rather than a parameter of interest). Also, if you are fitting lib.size as part of your model, then the surrogate variables won't adjust for library size unless there are differences in variability due to library size (which is not a completely unexpected result, particularly if you have widely divergent library sizes).

Also also, you don't want to use lib.size as a batch argument in removeBatchEffect. The expectation for a batch is that it is a factor, whereas lib.size is continuous. One of the first things that happens to the batch argument is conversion to factor. If you use a continuous variable you will end up with N factor levels (where N is the number of samples), which is not what you want to do. The covariates argument is what you use for continuous variables. And if you want to adjust for the surrogate variables you have to cbind the lib.size and svseq$sv together and pass that to removeBatchEffect.

ADD REPLY • link 2.8 years ago James W. MacDonald 68k

0

Entering edit mode

Thanks for the help! I have now changed libsize to a continous variable (it was a dichotomous variable: high/low at first). I should use the pre-normalized counts for this right? My code is now as follows:

coldata$libsize<-colSums(df)
dds<-DESeqDataSetFromMatrix(df,colData =coldata, design = ~lib.size+Patient_group) # I have removed sex as variable
dds <- estimateSizeFactors(dds)
dat  <- counts(dds, normalized = TRUE)
idx  <- rowMeans(dat) > 1
dat  <- dat[idx, ]
mod  <- model.matrix(~lib.size+Patient_group, colData(dds))
mod0 <- model.matrix(~lib.size, colData(dds))
svseq <- svaseq(dat, mod, mod0) #this gives me 14 significant SVs
ddssva <- dds
ddssva$SV1<- svseq$sv[,1]
ddssva$SV2<- svseq$sv[,2]
ddssva$SV3<- svseq$sv[,3]
ddssva$SV4<-svseq$sv[,4]
ddssva$SV5<-svseq$sv[,5]
ddssva$SV6<-svseq$sv[,6]
ddssva$SV7<-svseq$sv[,7]
ddssva$SV8<-svseq$sv[,8]
ddssva$SV9<-svseq$sv[,9]
ddssva$SV10<-svseq$sv[,10]
ddssva$SV11<-svseq$sv[,11]
ddssva$SV12<-svseq$sv[,12]
ddssva$SV13<-svseq$sv[,13]
ddssva$SV14<-svseq$sv[,14]
design(ddssva) <- ~SV1+SV2+SV3+SV4+SV5+SV6+SV7+SV8+SV9+SV10+SV11+SV12+SV13+SV14+lib.size+Patient_group
design(ddssva)
rld <- vst(ddssva, blind=FALSE)
mat <- assay(rld)
mm <- model.matrix(~Patient.group,colData(rld))
svs_and_libsize_cbind<-cbind(svseq$sv,coldata$libsize)
mat <- limma::removeBatchEffect(mat, design=mm, covariates=svs_and_libsize_cbind)
assay(rld) <- mat   
plotPCA(rld,"Patient_group")

This gives me the following plot: enter image description here

The seperation isn't bad but PC values are low. When I do not remove SV, the PC values increase but separation worsens: enter image description here

So would it be fair to conclude that batch effect is more important than biological effects for my dataset?

ADD REPLY • link 2.7 years ago Jasmin • 0

0

Entering edit mode

Do you have batches? If you have known batches, people usually directly account for them as part of the linear model rather than doing something indirect like computing surrogate variables.

ADD REPLY • link 2.7 years ago James W. MacDonald 68k

0

Entering edit mode

I actually don't have known batches (I'm using data from another researcher from some time ago who does not have this information anymore). Do you think that my code lines look ok in general? On another note: the sva code line

svseq <- svaseq(dat, mod, mod0)

Gives me the following error: Error in solve.default(t(mod) %*% mod) : system is computationally singular: reciprocal condition number = 1.32097e-16 I have found that you can get this by two covariates (patient group and libsize in my case, I presume) being highly colinear or because the number of cases is approximately as large as the number of covariates. These are both not true for my dataset I think (case number is 174, covariate number is 2). Do you have any idea if something else could cause this error?

ADD REPLY • link 2.7 years ago Jasmin • 0

0

Entering edit mode

And would it be possible to remove confounders using ruvseq and limma's remove batch effect function?