Question

limma technical repeat correction

0

Entering edit mode

john • 0

@cb255338

Last seen 9 weeks ago

United Kingdom

Hello, Would someone please help me verify if i'm on the right track in correcting for technical replicates using limma and also if there is a better approach to use (duplicateCorrelation() vs lmer()) for my protein expression data.

I've got 8 disease patients per group (DiseaseA, DiseaseB). There are 2 biological samples(_1,_2) with 4 technical replicates each (rep1-4).

Quite confused on what to include in the block when handling technical replicates - is it the column - patientID or replicates_1 or replicates_2 below. Please note I have manually created replicate_2 column


> sample.sheet
                  group   patientID replicates_1 replicates_2
sample_1    Disease_A   Disease_A_1 1   1
sample_2    Disease_A   Disease_A_1 2   2
sample_3    Disease_A   Disease_A_1 3   3
sample_4    Disease_A   Disease_A_1 4   4
sample_5    Disease_A   Disease_A_2 1   1
sample_6    Disease_A   Disease_A_2 2   2
sample_7    Disease_A   Disease_A_2 3   3
sample_8    Disease_A   Disease_A_2 4   4
sample_9    Disease_B   Disease_B_1 1   5
sample_10   Disease_B   Disease_B_1 2   6
sample_11   Disease_B   Disease_B_1 3   7
sample_12   Disease_B   Disease_B_1 4   8
sample_13   Disease_B   Disease_B_2 1   5
sample_14   Disease_B   Disease_B_2 2   6
sample_15   Disease_B   Disease_B_2 3   7
sample_16   Disease_B   Disease_B_2 4   8
sample_17   Healthy Healthy_1   1   9
sample_18   Healthy Healthy_1   2   10
sample_19   Healthy Healthy_1   3   11
sample_20   Healthy Healthy_1   4   12
sample_21   Healthy Healthy_2   1   9
sample_22   Healthy Healthy_2   2   10
sample_23   Healthy Healthy_2   3   11
sample_24   Healthy Healthy_2   4   12

group <- factor(sample.sheet$group);
patient <- factor(sample.sheet$patientID); **# unsure whether to use patientID or replicates_1 or replicates_2**

#> table(sample.sheet$group)
#Disease_A Disease_B   Healthy
#        8         8         8

#> table(sample.sheet$patientID)
#Disease_A_1 Disease_A_2 Disease_B_1 Disease_B_2   Healthy_1   Healthy_2
#          4           4           4           4           4           4

# Create a design matrix for your comparison (e.g., control vs treatment)
design <- model.matrix(~0 + group);
colnames(design) <-  gsub('group', '', colnames(design));

# Use duplicateCorrelation to estimate within-patient correlation
corfit <- duplicateCorrelation(exp.data.subset, design, block = patient);

fit <- lmFit(
    exp.data.subset,
    design,
    block = patient,
    correlation = corfit$consensus.correlation
    );

# create contrast matrix
contrast.matrix <- makeContrasts(
    'Disease_AvsHealthy' = Disease_A-Healthy,
    'Disease_BvsHealthy' = Disease_B-Healthy,
    'Disease_A-HealthyvsDisease_B-Healthy' = (Disease_A-Healthy) - (Disease_B-Healthy),
    levels = design
);

fit2 <- contrasts.fit(fit, contrast.matrix);
fit2 <- eBayes(fit2, trend = TRUE); **# robust = TRUE, trend = TRUE? what is more appropriate for proteomics **

limma lmer duplicateCorrelation technicalrepeats • 1.9k views

ADD COMMENT • link 4 months ago • updated 3 months ago john • 0

score 0 · Answer 1 · 2024-10-25

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 20 minutes ago

WEHI, Melbourne, Australia

Technical samples should be averaged. Then use duplicateCorrelation() with block=PatientID.

ADD COMMENT • link 4 months ago • updated 3 months ago Gordon Smyth 52k

0

Entering edit mode

Hi Gordon, Thank you very much for the prompt response.

If i understand correctly, wouldnt this remove the need to use duplicateCorrelation(). Since averaging condenses these replicates into a single value per biological replicate, would that mean we no longer need to account for correlations between them? Thanks for clarifying!

         sampleID     group   patientID
1  sample_1  Disease_A Disease_A_1
5  sample_5  Disease_A Disease_A_2
9  sample_9  Disease_B Disease_B_1
13 sample_13  Disease_B Disease_B_2
17 sample_17  Healthy   Healthy_1
21 sample_21  Healthy   Healthy_2

ADD REPLY • link 4 months ago • updated 3 months ago john • 0

0

Entering edit mode

Gordon Smyth Thank you again for your response, and apologies as I think I might still be confused by your suggestion "The technical samples should be averaged. Then use duplicateCorrelation() with block=PatientID." I'd prefer to retain the technical replicates instead of averaging them, as I'm aiming to capture the nuances in each set (there seems to be quite a lot of variation between 2 of my biological replicates in one condition on the PCA plot).

Am I correct to assume my initial approach of using duplicateCorrelation() with block = patient is appropriate in this case (i.e., allowing the function to calculate the consensus correlation of technical replicates)? Or is there another setup you'd recommend to account for both biological and technical repeats effectively?

Thanks once more for your insights!

ADD REPLY • link 3 months ago john • 0

0

Entering edit mode

I am not familiar with the scientific and procedural details of your experiment, so I cannot tell you exactly how to analyse it. In particular, it is not clear to me what the replicates (replicates_1 and replicates_2) represent. You have called them "technical replicates", but I suspect that they might not be. If they really were technical replicates, then you wouldn't be wanting to capture their nuances.

I can only give you general principles and limma syntax. Repeated measurements on the same patients would be handled by duplicateCorrelation with block=PatientID. Technical replicates, by definition, do not contain any biological signal. If you really did have technical replicates, then they would be averaged. Repeated measurements are not technical replicates.

Your experiment appears to have some sort of blocking/pairing structure associated with replicate. Different mass spec runs perhaps? Different samples from each patient? I don't know.

ADD REPLY • link 3 months ago Gordon Smyth 52k

0

Entering edit mode

Gordon Smyth Thank you for getting back to me and apologies for the confusion

replicates_1 column indicates for a patient (cell line) (e.g.: Disease_A_1 the 4 repeated measurements labelled 1, 2, 3, 4). Please disregard the replicates_2 column as I manually created this to clarify the correct application of duplicateCorrelation().
To provide context, we have 3 groups : Healthy, Disease_A and Disease_B. Each group includes 2 different cell lines. On the day of experiment, each cell line was plated in 4 separate culture dishes. The samples were collected and processed individually in the lab. In the end we have 8 samples per group which were analysed on somascan proteomics platform in a single run and we are using their provided normalised file as our starting point of analysis. I've attached the PCA plot below for clarity. Please let me know what you think.

ADD REPLY • link 3 months ago john • 0

1

Entering edit mode

Thanks for the further explanation. It seems you should simply use block=PatientID, so your original analysis is correct.

The replicate_1 column does not really mean anything. In reality, every sample is a new culture. The column would only have a meaning if you were using the same four culture dishes over and over again, or something like that.

However, the PCA plot strongly suggests that there is little or no significant DE in the data, regardless of what statistical test you might use.

I will also add that running lmer on this data, as you mention at the start of your question, would be exactly equivalent to averaging the replicates for each patientID and then running ordinary anova.

ADD REPLY • link 3 months ago Gordon Smyth 52k

0

Entering edit mode

Hi Gordon Smyth, Thank you very much for the clarification, including the details about lmer() and for confirming that the original analysis is correct.

Interestingly, despite the PCA result, I'm still identifying a small number of biologically relevant differential genes in DiseaseA and DiseaseB with |logFC| > 1 and Q < 0.05, alongside with meaningful pathways.

For this dataset, I was also recommended to consider a |logFC| threshold in the range of 0.2-0.5.

Thanks again for your help!

ADD REPLY • link 3 months ago john • 0

1

Entering edit mode

I discourage the use of logFC cutoffs, especially larger cutoffs. There is no scientific or statistical justification for arbitrary combinations of logFC and FDR cutoffs in limma. See https://www.biostars.org/p/9603855/ or DecideTest vs topTAGs in edgeR.

Also suggest you try plotMDS(exp.data) instead of the PCA plot function you are currently using.

ADD REPLY • link 3 months ago Gordon Smyth 52k

0

Entering edit mode

Thank you very much for the clarification and recommendations. I will prioritise my hits with FDR ranking and check plotMDS too.

ADD REPLY • link 3 months ago john • 0

0

Entering edit mode

Hi Gordon Smyth,

I've been reflecting on whether limma's assumptions are fully compatible with SomaScan proteomics data, and I'd greatly appreciate your insights on the following concerns:

Variance Borrowing Across Genes
Non-Uniform Variance (S-shaped Dilution Curves)

the SomaScan team followed an lmer and emmeans approach. Using lmer resulted in no FDR-significant genes in our data, unlike limma. Interestingly, limma and lmer identified distinct sets of genes, although both approaches revealed overlapping pathways. To address these differences, I'm considering a consensus approach for experimental validation.

Here's what I've tried in limma so far, and I'd like your feedback on whether these are appropriate:

VSN Normalization with Limma

exp.data.subset <- normalizeVSN(exp.data.subset);
group <- factor(metadata.subset$group);
patient <- factor(metadata.subset$patientID);
# Create a design matrix for your comparison 
design <- model.matrix(~0 + group);
colnames(design) <-  gsub('group', "", colnames(design));
# Use duplicateCorrelation to estimate within-patient correlation
corfit <- duplicateCorrelation(exp.data.subset, design, block = patient);
fit <- lmFit(
 exp.data.subset,
 design,
 block = patient,
 correlation = corfit$consensus.correlation
 );

Double Voom Normalization

y <- DGEList(counts = exp.data.subset);
group <- factor(metadata.subset$group);
patient <- factor(metadata.subset$patientID);
# Create a design matrix for your comparison 
design <- model.matrix(~0 + group);
colnames(design) <-  gsub('group', "", colnames(design));
# normalise using double voom with TMM (Section 18.1.9 from LimmaUserGuide.pdf)
y <- calcNormFactors(y);
v <- voom(y, design, plot = TRUE);
# Use duplicateCorrelation to estimate within-patient correlation
corfit <- duplicateCorrelation(v, design, block = patient);
# call voom again
v <- voom(y, design, plot = TRUE, block = patient, correlation = corfit$consensus.correlation);
corfit <- duplicateCorrelation(v, design, block = patient);
fit <- lmFit(
   v,
   design,
   block = patient,
   correlation = corfit$consensus.correlation
   );

ADD REPLY • link 3 months ago john • 0

0

Entering edit mode

I have never analysed SomaScan data, so I cannot make any specific recommendations about how to analyse such data. I have however analysed data from Olink, which is SomaScan's main competitor, and have found that the use of limma and edgeR improves enormously on the statistical analysis recommended by Olink itself. I have also read a journal article by the SomaScan team comparing their data to Olink's. My experience over the past 25 years is that commercial companies and biomedical engineers have quite conservative views of how data should be analysed, and they don't take into account the statistical advances of the last two decades.

Variance Borrowing Across Genes

limma does not assume equal variances between bins. Olink data is also conducted in bins, and it caused no problems for limma's empirical Bayes procedure.

I doubt there are substantial differences in variance between bins that is not accounted by the mean-variance trend but, if there were, it could be handled by precision weights.

Non-Uniform Variance (S-shaped Dilution Curves)

limma models the mean-variance trend by setting trend=TRUE in the eBayes() step. This is one of the features that makes limma so much more powerful that running individual protein analyses with lm or lmer.

the SomaScan team followed an lmer and emmeans approach

I don't think they appreciate the importance of using empirical Bayes for high-throughput data.

VSN Normalization with Limma

There is no need for variance stabilization. As already mentioned, limma has no trouble handling the mean-variance relationship.

Double Voom Normalization

voom() is only for counts, but you have not mentioned anything in your question about having count data.

You have to choose normalization methods that are appropriate for the scale of the data. If you have counts, then you shouldn't use normalizeVSN() (or anything like it). If you don't have counts, you can't use voom(). normalizeVSN() etc is only for data on the unlogged scale whereas lmFit() is only for data after logging. The nature of the data you have from SomaScan is unclear to me.

ADD REPLY • link 3 months ago Gordon Smyth 52k

0

Entering edit mode

Hi Gordon Smyth, Thank you very much for your response. The normalization process involves multiple steps and they provide a normalised file FILENAME.{hybNorm}.{medNormInt}.{plateScale.medNormSMP}.adat in Relative Flourescence Units (added references).

I've noticed some publications applying normalizeVSN, which only seemed to yield slightly more differentially expressed results in my analysis. For the normalised dataset provided, I applied log2(exp.data + 1) for limma analysis, but for approaches like vsn or voom, I reverted to unlogged data before analysis.

The readout in relative fluorescent units (RFU). So, from what I understand, using normalizeVSN isn't strictly necessary since trend=TRUE in limma accounts for the unequal variances, and we should not use voom here

References

ADD REPLY • link 3 months ago john • 0