Question

Batch effects between controls

llo ▴ 10

@llo-13602

Hello, I am working with data that I downloaded from the SRA database. I am only using samples from the same stage, library preparation, and species. However, when I run a PCA on the data, the samples do not cluster together well, even though I am using the same reference genome and annotation. How would I correct for this batch effect? I have tried RUVSeq's upper-quartile normalization, but it does not change anything. I have not yet tried using "negative control genes" or housekeeping genes.

I also have single-end and paired-end data. How do I correct for the batch effects between the two? Thank you.

batch-effect ruvseq • 2.7k views

ADD COMMENT • link updated 7.4 years ago by James W. MacDonald 68k • written 7.4 years ago by llo ▴ 10

score 1 · Answer 1 · 2017-10-31

davide risso ▴ 980

@davide-risso-5075

University of Padova

Hi Ilo,

It is expected that upper-quartile normalization will not handle batch effects as it is only a global scaling normalization and is not related to the RUV method.
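To make the distinction concrete, here is a minimal Python sketch of what a global scaling normalization does. It is not the RUVSeq or EDASeq implementation; the function name `upper_quartile_scale`, the choice to ignore zero counts, and the toy counts are all assumptions made for illustration. Each sample is divided by a single factor, so a shift that affects only some genes in one batch is still there after scaling.

```python
from statistics import quantiles


def upper_quartile_scale(counts):
    """Scale each sample by its upper quartile (illustrative sketch only).

    counts: dict mapping sample name -> list of counts, genes in the same order.
    Returns (factors, scaled), where every sample gets exactly one factor.
    """
    upper = {}
    for sample, values in counts.items():
        nonzero = [v for v in values if v > 0]  # assumption: zeros are ignored
        upper[sample] = quantiles(nonzero, n=4, method="inclusive")[2]
    mean_upper = sum(upper.values()) / len(upper)
    factors = {s: q / mean_upper for s, q in upper.items()}
    scaled = {s: [v / factors[s] for v in values] for s, values in counts.items()}
    return factors, scaled


# Toy data (hypothetical): s2 is sequenced twice as deep as s1, and the
# first gene additionally carries a gene-specific shift in s2.
toy_counts = {
    "s1": [10, 20, 30, 40, 0],
    "s2": [60, 40, 60, 80, 0],
}
factors, scaled = upper_quartile_scale(toy_counts)
print(factors)
print(scaled)
```

In this toy case the depth difference is removed for the unaffected genes, while the first gene still differs between the two samples, which is the behaviour described above.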

I suggest that you read the RUVSeq vignette carefully if you want to use RUV to try and adjust for batch effects. An alternative approach would be to use the sva package. It's a good idea to read both vignettes and see if these methods can help.

You said you want to use the RUV method, but you haven't tried using negative controls. That is the main point of RUV: using negative controls to estimate the batch effects. So you cannot use RUVSeq without negative control genes. Again, the vignette is pretty clear on how to use the RUVSeq package, so I suggest that you start from there.
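For intuition only, the idea of estimating unwanted variation from negative controls can be sketched in a few lines of Python. This is not the RUVSeq code or its API: the function name `ruv_one_factor_sketch`, the restriction to a single factor, the use of power iteration instead of a full SVD, and the toy values are all assumptions. The sketch takes log-scale expression, estimates one factor from the control genes (which are assumed to carry no biology of interest), and regresses that factor out of every gene.

```python
import math
import random


def ruv_one_factor_sketch(log_expr, control_genes, n_iter=100):
    """Estimate one unwanted factor from control genes and remove it.

    log_expr: dict mapping gene name -> list of log-expression values,
              one per sample, samples in the same order for every gene.
    control_genes: names of genes assumed to be unaffected by the biology.
    Returns (w, corrected): w is a unit-length factor over samples.
    """
    n = len(next(iter(log_expr.values())))
    # Center every gene across samples.
    centered = {}
    for gene, values in log_expr.items():
        mean = sum(values) / n
        centered[gene] = [v - mean for v in values]
    controls = [centered[g] for g in control_genes]
    # Sample-by-sample cross-product matrix built from control genes only.
    m = [[sum(z[i] * z[j] for z in controls) for j in range(n)] for i in range(n)]
    # Power iteration for the leading direction of variation among controls.
    rng = random.Random(0)
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(n_iter):
        mw = [sum(m[i][j] * w[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in mw))
        if norm == 0.0:
            raise ValueError("control genes show no variation across samples")
        w = [x / norm for x in mw]
    # Regress the factor out of every gene (w has unit length).
    corrected = {}
    for gene, values in log_expr.items():
        alpha = sum(wi * zi for wi, zi in zip(w, centered[gene]))
        corrected[gene] = [v - alpha * wi for v, wi in zip(values, w)]
    return w, corrected


# Toy data (hypothetical): samples 1-2 are batch A, samples 3-4 are batch B;
# samples 2 and 4 are treated. Controls differ only by batch.
toy = {
    "ctl1": [5.0, 5.0, 6.0, 6.0],
    "ctl2": [3.0, 3.0, 5.0, 5.0],
    "geneA": [2.0, 4.0, 3.0, 5.0],
}
w, corrected = ruv_one_factor_sketch(toy, ["ctl1", "ctl2"])
print(corrected["geneA"])
```

If the chosen controls are in fact affected by the biology of interest, the same procedure will remove real signal, so the choice of control genes matters more than the arithmetic.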

It may also be useful to read the RUVSeq and svaseq papers, as they make clear the difference between adjusting for sequencing depth (what upper-quartile does) and removing batch effects.

Best,
Davide


Thank you for your reply. I will try using negative control genes, but the vignette does not seem to show how to use specific genes, only spike-ins. I have a list of potential control genes but no spike-ins. Do you know how I can use a list of gene names as negative control genes?

ADD REPLY • link 7.4 years ago llo ▴ 10

I'm not sure I understand your question. The same way you specify the names of the spike ins, you can specify the names of the endogenous genes that you want to use as negative controls. Section 2.4 of the vignette uses endogenous genes as negative controls.

ADD REPLY • link 7.4 years ago davide risso ▴ 980

Thank you, I completely missed Section 2.4. It does exactly what I need.

ADD REPLY • link 7.4 years ago llo ▴ 10

score 1 · Answer 2 · 2017-10-31

A PCA plot simply shows you the largest differences between samples, so 'not aligning well' can mean more than one thing. For example, it may be that there is lots of technical variability that is obscuring the biological differences between your samples. But this is a matter of degree!

If you have really large changes between samples for a lot of genes, but larger technical variability due to batches or whatever, then the technical variability can obscure the biological variability (which usually shows up in higher principal components). In this case, using something like RUVSeq or svaseq from the sva package can help control for the unwanted technical variability.

However, if you have consistent but real differences between samples in just a few genes, then the 'normal' variability that one might expect is often predominant in a PCA plot. This (IMO) doesn't necessarily mean you have to do something to 'fix' the data. With any adjustment to the data you always run the risk that you may be capturing some of your real biological variability with a surrogate variable, and thereby reducing your ability to see the real changes that exist.

My point is that there is no free lunch here. Any adjustment you make to fix perceived faults in your data may well erase real signal. So I usually try to figure out if I really do have a problem, and if I can identify the source of the problem first.

As to correcting for single-end (SE) and paired-end (PE) data: if they were run in separate batches, then you would simply fit a batch effect in your model. (You seem to imply that these data were all run together, although I am inferring that from you saying 'the same stage, library preparation, and species', which may not mean what I think.) But it is pretty uncommon in my experience for samples to be run using the same library preparation, but sequenced differently.
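As a rough illustration of what fitting a batch effect in the model means, here is a small Python sketch that builds a design matrix with an intercept, a condition column, and a batch column. The function name `design_matrix`, the sample annotation, and the labels `control`, `treated`, `SE`, and `PE` are hypothetical; the coding uses the alphabetically first level as the reference, similar in spirit to treatment contrasts, but it is not the output of any particular package.

```python
def design_matrix(samples):
    """Build an intercept + condition + batch design (illustrative sketch).

    samples: list of dicts, each with a 'condition' and a 'batch' key.
    The alphabetically first level of each factor is the reference level.
    Returns (column_names, rows).
    """
    conditions = sorted({s["condition"] for s in samples})
    batches = sorted({s["batch"] for s in samples})
    columns = (
        ["intercept"]
        + ["condition_" + c for c in conditions[1:]]
        + ["batch_" + b for b in batches[1:]]
    )
    rows = []
    for s in samples:
        row = [1]
        row += [int(s["condition"] == c) for c in conditions[1:]]
        row += [int(s["batch"] == b) for b in batches[1:]]
        rows.append(row)
    return columns, rows


# Hypothetical annotation: two conditions, each sequenced both SE and PE.
annotation = [
    {"condition": "control", "batch": "SE"},
    {"condition": "control", "batch": "PE"},
    {"condition": "treated", "batch": "SE"},
    {"condition": "treated", "batch": "PE"},
]
columns, rows = design_matrix(annotation)
print(columns)
for row in rows:
    print(row)
```

This only works when batch and condition are not completely confounded. If every treated sample were SE and every control PE, the two indicator columns would be identical and the batch effect could not be separated from the effect of interest.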

Perhaps this is just a compilation of a bunch of different samples from different labs? If that is the case, you really shouldn't just be piling them all into one analysis. You would be better off doing separate analyses and then using something like the GeneMeta package to do a meta-analysis.
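To give a feel for what combining separate analyses looks like, here is a textbook inverse-variance (fixed-effect) combination of per-study effect sizes for one gene, written in Python. It does not reproduce GeneMeta's functions or its exact model; the function name `fixed_effect_meta` and the example numbers are assumptions for illustration only.

```python
import math


def fixed_effect_meta(effects, variances):
    """Combine per-study effect sizes by inverse-variance weighting.

    effects: effect estimate for one gene from each study (e.g. a log fold change).
    variances: the sampling variance of each of those estimates.
    Returns (pooled_effect, standard_error, z_score).
    """
    if len(effects) != len(variances) or not effects:
        raise ValueError("need one variance per effect, and at least one study")
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total
    se = math.sqrt(1.0 / total)
    return pooled, se, pooled / se


# Hypothetical estimates for one gene from three separately analysed studies.
pooled, se, z = fixed_effect_meta([1.2, 0.8, 1.0], [0.10, 0.20, 0.15])
print(pooled, se, z)
```

Each study is analysed on its own terms first, so lab-to-lab differences never enter a single model; only the per-study estimates and their uncertainties are combined.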