Hi, I have 1500 Affymetrix HTA 2.0 human microarray CEL files. These files are a combination of treatment, placebo, and healthy controls at 4 different time points (weeks). They are not paired samples. Hence I started out doing the following:
1- Perform batch QC using Affymetrix Power Tools.
2- Perform sample-wise QC.
3- Batch processing of arrays and normalization using rma:
eset_norm <- oligo::rma(raw_data, target = "core")
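(For context, raw_data here would have been read in along these lines; the path is a placeholder.)
library(oligo)
# Read one batch of CEL files; oligo detects the HTA 2.0 platform from the headers:
cel_files <- list.celfiles("path/to/batch", full.names = TRUE)
raw_data <- read.celfiles(cel_files)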
4- Combine all the batches of normalized esets and annotate using pd.hta.2.0:
eset_norm_anno <- annotateEset(eset_norm, pd.hta.2.0, columns = c("PROBEID", "ENTREZID", "SYMBOL", "GENENAME"), multivals = "list")
5- Filter probe IDs down to those related to main gene expression:
eset_norm_anno_main <- getMainProbes(eset_norm_anno)
6- Filtered probe IDs using a manual threshold in order to remove all the low-expression probes across samples.
7- Filtered probe IDs that were not annotated, i.e. those that had a gene symbol/ID of NA. Ended up with 34886 probe IDs (started with 70k probes after core RMA normalization). A sketch of these two filtering steps is shown after the design output below.
8- Set up the design matrix. I would like to look at a list of differentially expressed genes based on treatment and time effects. In order to figure out which genes are differentially expressed per dosage at a given time point, I set up the following design matrix, and developed a contrast matrix that will compare DrugA.Dose1.WEEK0 with PLACEBO.WEEK0, and so forth:
design = model.matrix(~0+groups)
colnames(design)
#[1] "DrugA.Dose1.WEEK0" "DrugA.Dose1.WEEK6" "DrugA.Dose1.WEEK1" "DrugA.Dose1.WEEK8"
#[5] "DrugA.Dose1.WEEK2" "DrugA.Dose2.WEEK0" "DrugA.Dose2.WEEK6" "DrugA.Dose2.WEEK1"
#[9] "DrugA.Dose2.WEEK8" "DrugA.Dose2.WEEK2" "CONTROL.WEEK0.HEALTHY" "PLACEBO.WEEK0"
#[13] "PLACEBO.WEEK6" "PLACEBO.WEEK1" "PLACEBO.WEEK8" "PLACEBO.WEEK2"
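For reference, steps 6 and 7 were along these lines; this is a sketch only, and the expression cutoff and the use of the SYMBOL column are illustrative assumptions:
library(Biobase)
# Step 6 (sketch): keep probes whose median log2 expression clears a manual cutoff.
keep_expr <- rowMedians(exprs(eset_norm_anno_main)) > 4   # cutoff of 4 is assumed
# Step 7 (sketch): keep probes that have a gene annotation (SYMBOL not NA).
keep_anno <- !is.na(fData(eset_norm_anno_main)$SYMBOL)
eset_norm_anno_main_final <- eset_norm_anno_main[keep_expr & keep_anno, ]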
9- Set up the analysis using limma:
contrast = makeContrasts(Dose1week0 = DrugA.Dose1.WEEK0 - PLACEBO.WEEK0, levels = design)
data.fit = lmFit(eset_norm_anno_main_final, design)
data.fit.con = contrasts.fit(fit = data.fit, contrasts = contrast)
data.fit.eb = eBayes(fit = data.fit.con)
results <- decideTests(data.fit.eb, adjust.method = "BH", lfc = 1)
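The full per-contrast table then comes from topTable, e.g.:
top <- topTable(data.fit.eb, coef = "Dose1week0", number = Inf, adjust.method = "BH")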
Can you please tell me if the design and my process are right? My ultimate list is a set of probe IDs mapped to gene IDs, with the associated statistics and p-values. Do I have to aggregate the results by gene symbol to understand which genes have been regulated, or are the transcript-level results fine to present and compare?
A quick check will help me very much.
Thanks, RV
Hi James,
Thank you! I have a couple more follow-up questions and it would be great to know your thoughts.
That is right, I did not incorporate all the contrasts. I loop over a list of contrasts and calculate contrasts.fit and eBayes. As far as looking at genes differentially expressed between doses, I should be doing something like this, right?
contrast = makeContrasts(Dose1vDose2_week0 = (DrugA.Dose1.WEEK0 - PLACEBO.WEEK0) - (DrugA.Dose2.WEEK0 - PLACEBO.WEEK0), levels = design)
I do not think it is a straight-up comparison like DrugA.Dose1.WEEK0 - DrugA.Dose2.WEEK0.
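(As an aside on the looping: several contrasts can be defined in a single makeContrasts call and fitted together; a sketch, with only a few of the comparisons shown:)
contrast_all = makeContrasts(
    Dose1week0 = DrugA.Dose1.WEEK0 - PLACEBO.WEEK0,
    Dose2week0 = DrugA.Dose2.WEEK0 - PLACEBO.WEEK0,
    Dose1week6 = DrugA.Dose1.WEEK6 - PLACEBO.WEEK6,
    levels = design)
# contrasts.fit() and eBayes() then handle all columns at once, and
# topTable(..., coef = "Dose2week0") pulls out a single comparison.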
Secondly, I am merely trying to filter the list of genes that are significant, based on adjusted p-value <= 0.05 and lfc >= 2. Since decideTests already has a default p-value cutoff of 0.05, I included the lfc. From your comment, I need not use it, but should later filter the result I get from topTable separately. Am I right in understanding that?
Finally, I did want to normalize it all together. I used a Linux cluster with 250GB RAM, and also tried one of the X1 EC2 instances with 1TB RAM, but I was getting NaNs when I tried analyzing the entire data set. Hence I randomly divided the samples into 7 buckets, making sure I picked equal numbers from each label, and normalized those. I did two more iterations of the same thing, and I plotted PCA to check that there were no significant batch effects based on the processing dates of these samples. Thank you for pointing that out.
Appreciate your help!
The contrast
Dose1vDose2_week0 = (DrugA.Dose1.WEEK0 - PLACEBO.WEEK0) - (DrugA.Dose2.WEEK0 - PLACEBO.WEEK0)
reduces to
DrugA.Dose1.WEEK0 - DrugA.Dose2.WEEK0
because you are subtracting the same thing in both parentheses. Remember that you are just doing (linear) algebra here, and if you can reduce the formula to something simpler, well, it's already the simpler thing.
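You can check this directly, since makeContrasts returns the numeric contrast vector; a quick sketch:
c1 <- makeContrasts((DrugA.Dose1.WEEK0 - PLACEBO.WEEK0) - (DrugA.Dose2.WEEK0 - PLACEBO.WEEK0), levels = design)
c2 <- makeContrasts(DrugA.Dose1.WEEK0 - DrugA.Dose2.WEEK0, levels = design)
all.equal(unname(c1), unname(c2))  # TRUE: the two contrast vectors are identical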
For your second question, you don't want to filter using the lfc argument, like ever. Or by hand. Use the treat function instead, and use something smaller than 1 as the fold change cutoff. That's the short story.
The long story is that there are a couple of reasons for this. The first reason has to do with what a p-value really is. The p-value actually has very little to do with your data. You use your data to estimate the differences between the two groups and to get an estimate of the variance, and then generate the t-statistic. You then use the sample variance to define the null distribution, which is the set of t-statistics you would expect to see, given the number of samples and the estimated variance, under the assumption that there are no differences between the two groups.
If you have a p-value of 0.05, then you expect to see a t-statistic that large even when there are no differences between two groups of that size, about 5% of the time. So you have either observed a pretty rare occurrence, or there really are differences between the two groups. But this is a probabilistic argument - you don't know the truth - where you (should) say that you have some evidence that the groups are different, and you can say exactly what that statement is based on.
If you add in an additional fold change criterion, then you can't really interpret the p-value the same way any longer, because it has now been conditioned on the sample fold change. It's also sort of weird, because the p-value is based on a population estimate, and the sample fold change is based on a subset of that population. So that's problem #1.
Given the definition of the p-value (the expected proportion of statistics, under the null, that exceed a certain cutoff), if you do, say, 10,000 comparisons and none are really different, you expect about 500 to have a p < 0.05, because that is the definition of the p-value. So you probably want to adjust for the fact that you have made so many simultaneous comparisons, in order to 'get rid of' all the false positives. One way to do that is to use the FDR, which adjusts the p-value in such a way that you can then make statements about a set of results, rather than one at a time. So if you select genes based on an FDR < 0.05, the interpretation is that you expect the maximum percentage of false positives in that set to be about 5%. But if you add in an extra fold change criterion, then the FDR is no longer interpretable either, because you have removed some of the genes that were in that set, so the maximum false positive rate is somewhat less than 5%, but actually unknowable.
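To make the first point concrete, here is a quick sketch: 10,000 two-group comparisons on pure noise, so every 'hit' is a false positive.
set.seed(1)
# 10,000 t-tests, two groups of 5 samples, no true differences anywhere:
pvals <- replicate(10000, t.test(rnorm(5), rnorm(5))$p.value)
sum(pvals < 0.05)                    # roughly 500 false positives
sum(p.adjust(pvals, "BH") < 0.05)    # essentially none survive FDR adjustment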
Using the treat function fixes all these issues, because it modifies the underlying test from 'how often should I expect to see a t-statistic as large or larger than my observed statistic if there really are no differences between the two groups?' to 'how often should I expect to see a t-statistic as large or larger than my observed statistic if the difference between the two groups is less than some constant?' By incorporating the fold change into your test, the p-values remain interpretable, as do the FDRs.
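In code, that looks something like the following; the cutoff log2(1.2) is just one example of 'something smaller than 1':
# Test against a minimum log2-fold change instead of filtering afterwards:
fit_treat <- treat(data.fit.con, lfc = log2(1.2))
top_treat <- topTreat(fit_treat, coef = "Dose1week0", number = Inf)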
To follow up on James' answer:
It's actually worse than this. To give an extreme case:
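Here is a minimal sketch of such a simulation; all parameters (gene counts, standard deviations, effect size, cutoffs) are chosen purely for illustration.
library(limma)
set.seed(100)
ngenes <- 10000
group <- gl(2, 10)                  # two groups of 10 samples
design <- model.matrix(~group)
# 200 genuinely DE genes: precisely measured (sd = 0.3) with a modest true
# log2-fold change of 0.8; 2000 highly variable but non-DE genes (sd = 3);
# the remaining 7800 are quiet nulls.
truth <- rep(c("de", "noisy.null", "quiet.null"), c(200, 2000, 7800))
sds <- c(de = 0.3, noisy.null = 3, quiet.null = 0.3)[truth]
y <- matrix(rnorm(ngenes * 20, sd = sds), nrow = ngenes)
y[truth == "de", group == 2] <- y[truth == "de", group == 2] + 0.8
fit <- eBayes(lmFit(y, design))
tab <- topTable(fit, coef = 2, number = Inf, sort.by = "none")
# Observed proportion of false positives with the FDR threshold alone:
sig <- tab$adj.P.Val <= 0.05
mean(truth[sig] != "de")
# Observed proportion of false positives after additionally requiring
# |logFC| >= 1: the filter discards most true positives (their true effect
# is only 0.8) but keeps the highly variable false positives, whose
# log-fold changes are large purely by chance.
kept <- sig & abs(tab$logFC) >= 1
mean(truth[kept] != "de")           # typically much higher than above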
So you can see that the actual FDR after filtering on the log-fold change is higher than the specified FDR threshold! This is because, in the presence of many highly variable but non-DE genes, the log-fold change filter actually enriches for false positives. Highly variable genes are more likely to achieve large log-fold changes by chance - a fact that is considered during calculation of the p-value but not if you apply a filter to the log-fold change estimate.
Thanks Aaron and James for the comprehensive answer. I really appreciate your help.