Hello,
I'm dealing with data sets which come from patient data and I need some advice for the formula design to do a gene expression analysis. I don't want to compare the data sets between each other!
The data sets consist of two groups, sample which have a disease and samples which don't have it. It looks like this:
Sample | Status |
---|---|
1 | healthy |
2 | healthy |
3 | healthy |
4 | healthy |
5 | healthy |
6 | sick |
7 | sick |
8 | sick |
9 | sick |
10 | sick |
What I would do know is the following design
dds <- DESeqDataSetFromMatrix(countData = Jcounts, colData = coldata, design = ~ Status)
But I'm quite sure this is not the right way to go, because I'm dealing with human samples and they do not only differ by Status. They have other variances. A PCA confirms this as well. There is not real clustering between healthy and sick. The samples are spread all other the coordinate system and sick and healther cluster within each other. What I do see is a separation between samples on the x-axis. There is a gap between the samples and I have a cluster of healthy/sick samples, then the gap and again healthy/sick cluster. I assume it might be a batch effect. So here are my question.
1. Is it worth to account for unknown batches? Like it is described here: http://www.bioconductor.org/help/workflows/rnaseqGene/#batch
I'm not a big fan trying to see something which might not be there which is why I have my doubts about such procedures.
2. What kind of design can I do? And can I somehow account for samples differences? A design like Sample + Status doesn't work. Or should I do likelihood ratio test with DESeq2?
I have additional information for some of the data sets, like age of the patients, bmi index, disease status but I'm not sure how to work these into a design formula. For example how would I build I factor for Age (where would be the group separation age 25 or 40 or 60...)
One data set is from amplified samples and when I run DESeq2 with the design which I mentioned above, I'll get 1 differentially expressed gene at a FDR of 0.1. What I see is that a lot of genes get filtered due to independent filtering:
out of 52719 with nonzero total read count adjusted p-value < 0.1 LFC > 0 (up) : 1, 0.0019% LFC < 0 (down) : 0, 0% outliers [1] : 0, 0% low counts [2] : 49725, 94% (mean count < 1012.9)
which seems quite high. If I turn independentFiltering off I get 0 genes. What I'd like to know is why the threshold is so high.
Thx
Mathias
Yes, Ryan is correct. And in fact, we are planning to tweak the filtering process this devel cycle: in the case where there are only e.g. single digit gains in #(adj p < alpha), to avoid aggressive filtering (which would give here 0 adjusted p-values < alpha).
hey, thx for the answer. any comment on question number 2? Or is there nothing I can do than just compare sick vs healthy
I wouldn't go exploring phenotypic covariates to include to reduce the variance, because this kind of a procedure isn't really safe from a statistical point of view. The main conclusion is that, if there are differences, you need more sample size to detect them, or you need to account for technical variation (either with batch correction or improvements to protocol).
Alternatively, if you do EDA and figure out that, for example, there are age-based differences, you could add a factor variable of age categories to the model, to control for this. But I would caution against looking at all possible covariate (unless you do a final analysis on new samples), because you will end up overfitting to these 10 samples.
Hey Michael,
thx for the answer and I have to apology for the sample overview. It was an example. Just to be precise, I have about 40 samples per experiment and half of them are from healthy patients, the others from sick patients.
But what do you mean with EDA?
It is just if you look at a PCA of patient data and you don't see a clustering at all between the two conditions, it makes you doubt that the disease has some bigger impact and will be hard to trace with a gene expression analysis.
Yes, by EDA I mean, for example, PCA to look for major differences between samples. By changing the color/shape of points, you can see the contribution across genes of different covariates to differences in counts.
Hi Michael,
I understand what you mean. this is what the pca looks like at the moment: http://www.biotec.tu-dresden.de/sharing/t/upload41ovSC/pca.pdf Right now I have only the information sick and healthy and I'm waiting to get more information. I don't think I can't do more at the moment. The only thing I could try to find out with sva would be the gap on the x-axis between the 8 samples on the right and the rest.
If apply sva, I only followed rnaseqGene workflow on bioconductor, then the pca looks like this:
http://www.biotec.tu-dresden.de/sharing/t/uploadwjRMyu/pca.sva.pdf
If I interpret this correctly, then SV1 shows the batch effect.
thanks
mathias
The Principal Component Analysis is important for EDA and quality control - but I wouldn't necessarily give up on the analysis if the disease and controls don't show a clear separation in a PCA plot. It's feasible that the difference only affects a small number of genes (which wouldn't dominate the PCA). Looking for DE genes in a supervised manner (i.e. DESeq2) could find them. And as said above, you might (but no guarantee) gain power by including batch covariates, ideally explicitly known ones, perhaps also ones fit from, e.g., PCA.
I'm sorry to ask this but what do you mean with covariates. Is it for example if I divide my samples in two groups. One consists of the 8 sample on left-side of the first principle component (see pdf), the other group is the rest and I add it to the design like
Thx
Mathias
hi Mat,
You shouldn't come up with covariates by eye.
By covariates, we mean getting the actual batch information from the lab (what day the samples were processed, etc), this would be considered a technical covariate, or biological covariate like sex, age, etc.
Or you can use the sva software to estimate the batches in a statistically principled way. See the svaseq() function in the sva package. We have an example of incorporating sva surrogate variables here:
http://master.bioconductor.org/help/workflows/rnaseqGene/#batch
Yes okay, I wouldn't that. But let's say I know the data of sample and I add this as a covariate, then it may still not be visible in a pca?
I followed your guideline and applied sva (see http://www.biotec.tu-dresden.de/sharing/t/uploadwjRMyu/pca.sva.pdf ). The problem is, that I don't really understand the method. I'll have to read up on that. For example how many surrogate variables n.sv variable). It's not really explained in the help page in R either.
There is also one other problem with another data set. This time I'll have about 15,000 DEGs and I know have to filter on lfcThreshold and/or altHypothesis. But before I ask about guidance in this case, I'll have to read the vignette again and check for related bioconductor posts.
thx again!
Yes it's true. Basically the two categories don't really differ. I'm just wondering what else I could do when dealing with data from patients. thx mathias