Hi all,
I have ~1000 RNAseq samples from 100 donors and am using edgeR to analyse them. Tissue from each of the 100 donors was treated with one of 9 different chemicals (A, B, C, ...) or left untreated (control).
Unfortunately, for technical reasons, I had to remove a few control samples (only 3) because of a low number of reads (<5e6). This means that I have some "unpaired" samples, i.e. donors for whom I have treated samples but no control. Is it correct that I can still formulate a design like the one below if I am just interested in the average difference between, for example, treatment5 and the control?
#my design (for simplicity all batch effects are ignored)
design = ~ patient.id + treatment
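For reference, a minimal sketch of how this design could be set up and fitted in edgeR; the object names (counts, meta) and the coefficient name are placeholders, not taken from the actual data:

```r
library(edgeR)

# assumed inputs: 'counts' is a gene x sample matrix, 'meta' a data.frame with
# columns patient.id (factor, 100 levels) and treatment (factor with "control"
# as reference level plus the 9 chemicals)
meta$treatment <- relevel(factor(meta$treatment), ref = "control")

design <- model.matrix(~ patient.id + treatment, data = meta)

dge <- DGEList(counts = counts, samples = meta)
dge <- dge[filterByExpr(dge, design), , keep.lib.sizes = FALSE]
dge <- calcNormFactors(dge)

fit <- glmQLFit(dge, design)
# the coefficient name depends on the factor level names,
# e.g. "treatmenttreatment5" if the level is called "treatment5"
res <- glmQLFTest(fit, coef = "treatmenttreatment5")
topTags(res)
```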
A related question is the following. Because I have so many samples, I am using sva to adjust for batch effects. Is it correct that, after using this approach, I can remove patient.id from my design, because the identified surrogate variables account for patient-specific effects?
#my design after SVA
design = ~ SV1 + SV2 + SV3 + SV4 + SV5 + ... + SV12 + treatment
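A minimal sketch of how the surrogate variables could be estimated with svaseq and appended to the design (whether dropping patient.id is advisable is discussed in the replies below); the dge and meta objects are the same placeholders as above:

```r
library(sva)
library(edgeR)

# full model: variable of interest; null model: intercept only
mod  <- model.matrix(~ treatment, data = meta)
mod0 <- model.matrix(~ 1, data = meta)

# svaseq() works on a non-negative expression matrix (here CPM from the
# filtered DGEList) and log-transforms it internally
sv <- svaseq(cpm(dge), mod = mod, mod0 = mod0)

# append the estimated surrogate variables to the design
design_sva <- cbind(mod, sv$sv)
colnames(design_sva) <- c(colnames(mod), paste0("SV", seq_len(sv$n.sv)))
```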
Any input is very much appreciated!
Cheers
Thank you very much for your input! I picked the 5M read threshold because this is what I generally read on the internet (for example ECSEQ or Illumina; ENCODE even recommends 30 million reads [see RNA Sequence Experiment Design: Replication and sequencing depth]).
The reason I am using SVA is the following: I know that there are several technical variables (and also biological ones, such as age, sex, or ethnicity) that influence expression but are not of interest to me. The problem is that I do not have information about age or ethnicity for all my participants (because it is a big dataset), so I cannot model their effects directly. Would you agree that in this case SVA is the right approach?
The first thing to do with low-depth samples is to resequence them. If that is not possible, check in exploratory analyses such as PCA or MDS whether there is data-driven evidence that the low depth is a problem and drives outlier behaviour. If so, you can still use limma with arrayWeights or voomWithQualityWeights to downweight rather than remove the samples. Removing them should be the last option to consider.
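A minimal sketch of the downweighting approach, assuming the same dge and design objects as in the sketches above (the coefficient name is again a placeholder depending on your factor levels):

```r
library(limma)
library(edgeR)

# estimate per-sample quality weights while transforming the counts;
# low-quality (e.g. low-depth) samples get downweighted rather than removed
v   <- voomWithQualityWeights(dge, design, plot = TRUE)
fit <- lmFit(v, design)
fit <- eBayes(fit)
topTable(fit, coef = "treatmenttreatment5")
```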
If you are blocking on subject, none of the things you mention should have any effect, and they couldn't be fit anyway (after blocking on subject). In other words, algebraically you are fitting a paired t-test, where for each subject you compute the difference between a treatment and that person's control. Both the treatment and control samples carry information about the sex, age, ethnicity, and any other phenotype inherent to that person, so by subtracting one from the other (control from treatment) you effectively cancel out any subject-specific variation.
And if you randomized correctly (keeping each subject within a given batch), then you are also canceling out the batch effect. If you didn't randomize correctly, then that is a tragedy.
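A small toy example (not from the original post) illustrating the collinearity point: any covariate that is constant within a subject, such as sex, contributes columns that are already spanned by the donor indicators, so the design gains no new information.

```r
# 4 donors, 2 samples each; sex is constant within donor
donor <- factor(rep(1:4, each = 2))
sex   <- factor(rep(c("F", "M"), each = 4))
trt   <- factor(rep(c("control", "treated"), times = 4))

X1 <- model.matrix(~ donor + trt)
X2 <- model.matrix(~ donor + sex + trt)

ncol(X2) > qr(X2)$rank      # TRUE: the sex column is redundant (design not full rank)
qr(X1)$rank == qr(X2)$rank  # TRUE: adding sex adds no estimable information
```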
Also, there is a difference between recommending a certain number of reads (30M is excessive IMO) and saying 'Remove any samples with <X million reads'. People recommend all sorts of things that aren't necessarily achievable in practice.
Thank you very much, this makes a lot of sense! We actually randomized "correctly" (i.e. batch and subject are confounded, not batch and condition). Just one last thing I am unsure about. Even if I do not remove any samples because of low counts, some conditions are not paired (sequencing failed and we simply do not have them in our metadata). For these cases, can I still keep the "single" samples and just use the design
~donor + condition
, or will this mess up the "paired t-test" since some samples are unpaired? There are previous questions on incomplete pairing in DESeq2, and the unpaired samples are not considered, so they don't add to any inference.
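For reference, a quick way to inspect the pairing structure in the sample table; the column names donor and condition and the "control" level name are assumptions about the metadata:

```r
# samples per donor x condition; zeros reveal incomplete pairs
tab <- with(meta, table(donor, condition))
tab

# donors with no control sample at all
rownames(tab)[tab[, "control"] == 0]
```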