Question

Time series DE gene analysis with edgeR where specific timepoints have NA levels across different patients

0

Entering edit mode

svlachavas ▴ 840

@svlachavas-7225

Last seen 10 days ago

Germany/Heidelberg/German Cancer Resear…

Dear Bioconductor community,

based on a current project aiming to decipher prognostic biomarkers that driving the response to chemotherapy in breast cancer, we are trying to analyze external public datasets to validate in house findings; on this premise, on an rnaseq gene expression dataset of neo-adjuvant treated patients with multiple time points:

T1 is the “pre-treatment” timepoint, T2 after two weeks on treatment, T3 is the “middle” timepoint of chemotherapy, whereas T4 refers to the relative timepoint when the surgical resection performed;

Our main goal is to find any differences between T1 and the available time-points that refer to after chemotherapy to identify interesting markers; as also to emphasize perhaps on specific comparisons-the main putative issue is based on the availability of different time points on the available patients:

head(dd,8)
# A tibble: 8 x 2
  SampleID      Biopsy_Time    
  <chr>         <chr>          
1 Patient_11_T1 biopsy time: T1
2 Patient_11_T2 biopsy time: T2
3 Patient_11_T3 biopsy time: T3
4 Patient_11_T4 biopsy time: T4
5 Patient_15_T1 biopsy time: T1
6 Patient_15_T2 biopsy time: T2
7 Patient_15_T3 biopsy time: T3
8 Patient_15_T4 biopsy time: T4

table(dd$Biopsy_Time)

biopsy time: T1 biopsy time: T2 biopsy time: T3 biopsy time: T4 
             21               9              19              20

As you see from above, not all time points are covered uniformly;

6 patients have complete 4 timepoints (T1-T2-T3-T4)

13 patients have 3 available timepoints: 2 have T1-T2-T4, and the rest T1-T3-T4

Thus, my major questions are the following:

1) If I consider only the 6 patients will all timepoints available, I could perform something like a "paired" design? For example if SampleID only denotes the patient number, and the Biopsy time the available treatment timepoints/conditions, with T1 as the reference level then:

design <- model.matrix(~SampleID+Biopsy_Time)
fit <- estimateDisp(y,design)
fit2 <- glmQLFit(fit, design)
fit3 <- glmQLFTest(fit2)

and also for example by setting coef=2:4 within glmQLFTest I can perform an ANOVA-like comparison that can detect DEs between any of the chemotherapy timepoints versus treatment naïve samples?

2) In addition, is there a way to incorporate also the information of the additional samples that lack any of the aforementioned levels? or the relative NA values in specific levels such as T2 would hamper any relative analysis?

Thank you in advance,

Efstathios

edgeR RNASeq DifferentialExpression TimeCourse • 2.1k views

ADD COMMENT • link updated 3.7 years ago by James W. MacDonald 68k • written 3.7 years ago by svlachavas ▴ 840

score 1 · Answer 1 · 2021-07-26

1

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 9 hours ago

United States

I would use voomLmFit and block on subject.

ADD COMMENT • link 3.7 years ago James W. MacDonald 68k

0

Entering edit mode

Dear James,

thank you very much for your comment; please excuse me for my naïve additional question, but is there an example in the edgeR user's guide or generally how to implement ? As I have never used this specific pipeline, if I have understood well from the relative function information, this resembles the original duplicateCorrelation ? However, this is not intended when comparing both within and between samples? Or based on my experimental design, if I use all the samples, it takes into account NA values for specific levels, thus including the between sample comparison?

Moreover, if my understanding is correct, should I implement the following approach? :

Do not remove any samples with not complete timepoints, correct? Or I have to explicitly include only patients with complete timepoints, as to create the variable below for the blocking on subject?
Then from a DGElist directly from creating an edgeR object, with or without normalization:

y <- DGEList(counts=gene.mat)

And then implement something like the following?

condition <- factor(dd$Biopsy_Time, levels=c.....))
pairs <- factor(SampleID)

design1 <- model.matrix(~0 +condition)
colnames(design1) <- levels(condition)

voom.fit <- voomLmFit(counts=y, design=design1, block=pairs, normalize.method="quantile")

# and then with makeContrasts, contrasts.fit and eBayes as with the usual limma analysis?

And finally, similarly for an ANOVA like approach, one could follow the additional:

top2_merged <- topTable(fit3, coef=2:4, number=nrow(fit3), adjust.method="fdr", sort.by="none")

Best regards,

Efstathios

ADD REPLY • link 3.7 years ago svlachavas ▴ 840

1

Entering edit mode

The voomLmFit function will automatically fit a linear mixed model accounting for within-subject correlations if you block on patient, so you don't need to exclude anybody. That's why I suggested it. I normally wouldn't use a quantile normalization, but instead just use the scale normalization that voom already does. You can use an ANOVA-like approach, but in general I tend to make individual t-tests which IIRC are more powerful than the F-test.

ADD REPLY • link 3.7 years ago James W. MacDonald 68k

0

Entering edit mode

Dear James,

thank you for your additional clarification for the intended use of voomLmFit; Just to summarize three final points based on your explanation, as far as voomLmFit encapsulates both voom, duplicateCorrelation and lmFit functions:

If I prior use TMM normalization with calcNormFactors, additional scale normalization with voomLmFit is not necessary, right?
For the blocking variable that indicates the patient that each sample/condition relates: a simple factor, with numbers indicating each patient would suffice right? something like c(1,1,1,1,2,2,3,3,3,3,4,4,4,etc) even in the under-representation of specific levels as mentioned?
For non-specific intensity filtering: I can directly use filterByExpr, with the above design matrix with the different time points? that is design1 <- model.matrix(~0 +condition)

ADD REPLY • link 3.7 years ago svlachavas ▴ 840

1

Entering edit mode

If you use TMM, it doesn't normalize anything. It generates estimated library sizes in a way that is intended to reduce compositional bias. Then when you compute CPM (which voomLmFit does for you), you divide the counts by the estimated library sizes to scale normalize your data.
Yes. If you don't have complete observations the only thing you can do is fit a linear mixed model. This has been discussed many many times on this support site.
Yes. Is there a reason you think that won't work?

ADD REPLY • link 3.7 years ago James W. MacDonald 68k

0

Entering edit mode

Dear James, thanks for the feedback and confirmation;

For point 1, I was mixing different things, apologies for the not appropriate explanation as I was indeed confused with the actual downstream edgeR analysis without the voom-limma pipeline;

For point 2 I was just a bit uncertain, as the previous time that I have used duplicateCorrelation with limma and microarray data I did not had any NA values, but I wanted to perform both within and between sample comparisons-but the final outcome is the same concerning the interpretation;

Finally, for point 3 sorry for not providing one additional important point-it might be for several analysis reasons that we might not use timepoint2 for the DEG downstream analysis-but even with this, I think as it is the sample group with the smaller number, the filtering approach would work just fine-

Overall, thanks a gazillion for your help and suggestions !!!

ADD REPLY • link 3.7 years ago svlachavas ▴ 840

1

Entering edit mode

You say you have NA values, but I am not sure that is actually what you mean. It's one thing to have a data set like

Subject   Time   Weight   Progesterone
1         1 day  165 lb    234 ng/ml
1         2 day  165 lb    NA

In which case if you include Progesterone in your linear model day 2 for subject 1 will be dropped automatically (this is what Gordon was telling you). It's completely different to have no data for a given time point for a given subject. Those are technically NA, for all the observations, but in general I wouldn't normally say I have NA values. Instead I would say I have missing data for a given timepoint, which is what I believe you have. I mean nobody would do

Subject   Time   Weight   Progesterone
1         1 day  165 lb    234 ng/ml
1         2 day  165 lb    NA
1         3 day    NA      NA

Because that's not a thing. You just wouldn't have a third row, because you didn't measure subject 1 on day 3.

But regardless, if you have missing data for any subject, for any thing you are planning to fit in your model, you will drop certain timepoints. In which case the timepoints are no longer orthogonal to your other measures of interest and you can't simply block on subject with a fixed effect. Which is why you have to use duplicateCorrelation and fit a linear mixed model, which can handle that situation.

ADD REPLY • link 3.7 years ago James W. MacDonald 68k

score 1 · Answer 2 · 2021-07-26

1

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 2 hours ago

WEHI, Melbourne, Australia

is there a way to incorporate also the information of the additional samples that lack any of the aforementioned levels?

NA values are not permitted in design matrices. That's a general rule for linear modelling in R, not just in edgeR. You must either create a new category/level to assign to these samples or else omit them.

ADD COMMENT • link 3.7 years ago Gordon Smyth 52k

0

Entering edit mode

Dear Gordon, thank you very much for your valuable comment-probably James suggestion is the only way to include all the samples with voomLmFit without creating new levels or excluding related samples

ADD REPLY • link 3.7 years ago svlachavas ▴ 840