Hello all,
This is generic question, not really dependent on code. I tag DESeq2 because that is what I am using and it would be ideal to reach a solution using it, but the main question is about independent filtering. I have a dataset with two treatments and 4 biological replicates each. This is an amplicon sequencing experiment where all amplicons are the same size, and the total number of amplicons is ~200K. Similarly to a CRISPR, within each biological replicate I have internal replicates, which are amplicons with different DNA sequence that I would expect to give the same phenotype. So there are 3 of these internal replicates per amplicon, and therefore 3 internal reps x 66K amplicon groups = 200K amplicons in total. The problem I am facing is a very strong dispersion between the biological replicates of the treated samples (the control samples have a good correlation, similar to other datasets I am working with). I have looked at the internal replicates, and seen that there is a lot of variation between them as well. So I am assuming that the experimental system is very noisy. A comparison of data between independent experiments suggests that I have many false positives.
I have been thinking about the independent filtering approach described in DESeq2 and the corresponding filtering paper, so I would like your opinion about this:
- would it be OK to discard the amplicons with high variation between internal replicates, and repeat the analysis with the ones with low variation? I understand that doing this would be independent from the null hypothesis (OK according to the paper), but not correlated to the alternative (not OK according to the paper).
- is there a way to include this internal replication in the DESeq2 model? If not, could you give me a pointer to literature to approach this? The difference between my experiment and a CRISPR one is that in CRISPR the guides target different parts of a gene, so they are different. In my case, the DNA sequence of the internal replicates is different, but they all encode for the same peptide, so other than differences between codons I would expect low variability in the phenotype for the internal reps within each biological rep.
Thanks a lot, Ruben
Hello Wolfgang, Thank you for the comments (and thank you also Michael!). First I have to mention that I come from a biology background, which is great because it gives me the understanding of the biology and the lab-technical implications of what we are doing. However my data analysis background I am in the process of building, which makes me wary about being creative. I am looking forward to having the tools, like I have in molecular biology, that allow me to control that the results I get are likely to be real and not an artifact of wrong assumptions. I have the exploratory analysis, and I can say there is no good reason to discard any biological rep. My PCA showed a potential outlier biorep, but I realised that most of the variation came from a range of amplicons with low counts. Once I filtered them out, the outlier was no more. I confirmed that by looking at replicate-to-replicate heat scatterplots with Spearman correlation, and comparing each rep against the log mean of all reps from the same treatment in unfiltered data.
I like the idea about the ranking, and I have those pos/neg controls in place. Perhaps you could elaborate a little bit more on "the 2nd- or 3rd-strongest effect among your replicates"? And what would be the dataset to work with, an rlog transformation of the normalised counts? Something I tried was to get a local z-score (log2FoldChange/lfcSE) for each amplicon in my dataset, filter the amplicons with good signal, filter again those with low coefficient of variation between the internal replicates, and then rank by median log2FoldChange. Visually, they make sense on the MA plots: I get a denser cloud of amplicons around the log2FoldChange=0, and they look like potential true DE amplicons. But to see that I am looking just at the population of amplicons with low variation between internal replicates, and discarding the rest of the data. From Michael's answer I see that it's not recommended to run DESeq2 on the filtered dataset, but does it work for ranking on the lines you mention?
Very nice, that comment about hammer an nail. That's what I want to build, a good toolbox! I find it especially rewarding to become a bridge that can communicate biologists with statisticians.
Best, Ruben