Dear Bioconductor community,
based on a current project aiming to decipher prognostic biomarkers that driving the response to chemotherapy in breast cancer, we are trying to analyze external public datasets to validate in house findings; on this premise, on an rnaseq gene expression dataset of neo-adjuvant treated patients with multiple time points:
T1 is the “pre-treatment” timepoint, T2 after two weeks on treatment, T3 is the “middle” timepoint of chemotherapy, whereas T4 refers to the relative timepoint when the surgical resection performed;
Our main goal is to find any differences between T1 and the available time-points that refer to after chemotherapy to identify interesting markers; as also to emphasize perhaps on specific comparisons-the main putative issue is based on the availability of different time points on the available patients:
head(dd,8)
# A tibble: 8 x 2
SampleID Biopsy_Time
<chr> <chr>
1 Patient_11_T1 biopsy time: T1
2 Patient_11_T2 biopsy time: T2
3 Patient_11_T3 biopsy time: T3
4 Patient_11_T4 biopsy time: T4
5 Patient_15_T1 biopsy time: T1
6 Patient_15_T2 biopsy time: T2
7 Patient_15_T3 biopsy time: T3
8 Patient_15_T4 biopsy time: T4
table(dd$Biopsy_Time)
biopsy time: T1 biopsy time: T2 biopsy time: T3 biopsy time: T4
21 9 19 20
As you see from above, not all time points are covered uniformly;
6 patients have complete 4 timepoints (T1-T2-T3-T4)
13 patients have 3 available timepoints: 2 have T1-T2-T4, and the rest T1-T3-T4
Thus, my major questions are the following:
1) If I consider only the 6 patients will all timepoints available, I could perform something like a "paired" design? For example if SampleID only denotes the patient number, and the Biopsy time the available treatment timepoints/conditions, with T1 as the reference level then:
design <- model.matrix(~SampleID+Biopsy_Time)
fit <- estimateDisp(y,design)
fit2 <- glmQLFit(fit, design)
fit3 <- glmQLFTest(fit2)
and also for example by setting coef=2:4
within glmQLFTest
I can perform an ANOVA-like comparison that can detect DEs between any of the chemotherapy timepoints versus treatment naïve samples?
2) In addition, is there a way to incorporate also the information of the additional samples that lack any of the aforementioned levels? or the relative NA values in specific levels such as T2 would hamper any relative analysis?
Thank you in advance,
Efstathios
Dear James,
thank you very much for your comment; please excuse me for my naïve additional question, but is there an example in the edgeR user's guide or generally how to implement ? As I have never used this specific pipeline, if I have understood well from the relative function information, this resembles the original duplicateCorrelation ? However, this is not intended when comparing both within and between samples? Or based on my experimental design, if I use all the samples, it takes into account NA values for specific levels, thus including the between sample comparison?
Moreover, if my understanding is correct, should I implement the following approach? :
Do not remove any samples with not complete timepoints, correct? Or I have to explicitly include only patients with complete timepoints, as to create the variable below for the blocking on subject?
Then from a DGElist directly from creating an edgeR object, with or without normalization:
And then implement something like the following?
Best regards,
Efstathios
The
voomLmFit
function will automatically fit a linear mixed model accounting for within-subject correlations if you block on patient, so you don't need to exclude anybody. That's why I suggested it. I normally wouldn't use a quantile normalization, but instead just use the scale normalization thatvoom
already does. You can use an ANOVA-like approach, but in general I tend to make individual t-tests which IIRC are more powerful than the F-test.Dear James,
thank you for your additional clarification for the intended use of
voomLmFit
; Just to summarize three final points based on your explanation, as far asvoomLmFit
encapsulates both voom, duplicateCorrelation and lmFit functions:If I prior use TMM normalization with
calcNormFactors
, additional scale normalization withvoomLmFit
is not necessary, right?For the blocking variable that indicates the patient that each sample/condition relates: a simple factor, with numbers indicating each patient would suffice right? something like c(1,1,1,1,2,2,3,3,3,3,4,4,4,etc) even in the under-representation of specific levels as mentioned?
For non-specific intensity filtering: I can directly use
filterByExpr
, with the above design matrix with the different time points? that isdesign1 <- model.matrix(~0 +condition)
If you use TMM, it doesn't normalize anything. It generates estimated library sizes in a way that is intended to reduce compositional bias. Then when you compute CPM (which
voomLmFit
does for you), you divide the counts by the estimated library sizes to scale normalize your data.Yes. If you don't have complete observations the only thing you can do is fit a linear mixed model. This has been discussed many many times on this support site.
Yes. Is there a reason you think that won't work?
Dear James, thanks for the feedback and confirmation;
For point 1, I was mixing different things, apologies for the not appropriate explanation as I was indeed confused with the actual downstream edgeR analysis without the voom-limma pipeline;
For point 2 I was just a bit uncertain, as the previous time that I have used duplicateCorrelation with limma and microarray data I did not had any NA values, but I wanted to perform both within and between sample comparisons-but the final outcome is the same concerning the interpretation;
Finally, for point 3 sorry for not providing one additional important point-it might be for several analysis reasons that we might not use timepoint2 for the DEG downstream analysis-but even with this, I think as it is the sample group with the smaller number, the filtering approach would work just fine-
Overall, thanks a gazillion for your help and suggestions !!!
You say you have NA values, but I am not sure that is actually what you mean. It's one thing to have a data set like
In which case if you include Progesterone in your linear model day 2 for subject 1 will be dropped automatically (this is what Gordon was telling you). It's completely different to have no data for a given time point for a given subject. Those are technically NA, for all the observations, but in general I wouldn't normally say I have NA values. Instead I would say I have missing data for a given timepoint, which is what I believe you have. I mean nobody would do
Because that's not a thing. You just wouldn't have a third row, because you didn't measure subject 1 on day 3.
But regardless, if you have missing data for any subject, for any thing you are planning to fit in your model, you will drop certain timepoints. In which case the timepoints are no longer orthogonal to your other measures of interest and you can't simply block on subject with a fixed effect. Which is why you have to use
duplicateCorrelation
and fit a linear mixed model, which can handle that situation.