Edited the original question to make it a little clearer:
I have a gene expression data set of 165 samples. Each sample is annotated with the subject, the time point at which it was collected (a given number of days after a given drug dose), the sample source, and whether it was CD33/34 enriched or not.
A subset of this actual data looks like this:
I concatenated time point, sample source and sample type as the grouping factor:
design <- model.matrix(~0+TimePoint+Source+Sample.Type)
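Note that the formula above is additive in the three annotations, whereas the text describes a single concatenated grouping factor. A minimal sketch of the concatenated version, assuming a sample sheet called `sampleInfo` with the three columns named as in the formula (the toy rows, including the `BM` source and the `CD33_34neg` type, are invented for illustration):

```r
# Toy sample sheet; all values are invented for illustration only.
sampleInfo <- data.frame(
  TimePoint   = c("Dose1Day1", "Dose1Day1", "Dose1Day8", "Dose1Day8"),
  Source      = c("PBMC", "PBMC", "PBMC", "BM"),
  Sample.Type = c("CD33_34pos", "CD33_34neg", "CD33_34pos", "CD33_34pos")
)

# Concatenate the three annotations into one grouping factor.
sampleInfo$Group <- factor(paste(sampleInfo$TimePoint,
                                 sampleInfo$Source,
                                 sampleInfo$Sample.Type,
                                 sep = "."))

# One column per group, no intercept; drop the "Group" prefix from the names
# so that the columns can be referred to directly when forming contrasts.
design <- model.matrix(~0 + Group, data = sampleInfo)
colnames(design) <- levels(sampleInfo$Group)
```

With this parameterisation each coefficient corresponds to one time point / source / sample type combination, so any pair of groups can be compared by name.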
My question arises because not all subjects have samples at the same time points, or even from the same sample source. For example, I want to compare CD33/34+ PBMC samples collected at Dose1Day8 with those collected at Dose1Day1. There are 20 CD33/34+ PBMC samples from Dose1Day1 but only 7 from Dose1Day8.
Furthermore, only 5 subjects are present in both time points.
Do I look only at those 5 subjects (making the comparison balanced, with 10 samples in total)? If so, do I have to filter the samples first instead of fitting all the data in one GLM?
Or do I look at all the available samples within each group and form a contrast something like:
Group <- factor(sampleInfoAll$Group)
design <- model.matrix(~0+Group)
my.contrasts <- makeContrasts(Dose1Day8.PBMC.CD33_34pos_vs_Dose1Day1.PBMC.CD33_34pos = Dose1Day8.PBMC.CD33_34pos - Dose1Day1.PBMC.CD33_34pos, levels = design)
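For comparison, the first option (restricting to subjects sampled at both time points) could be sketched as below. This is only an illustration in base R: the `Subject` column name and the toy rows are assumptions, and the blocked formula `~ Subject + TimePoint` is one common way to express a paired comparison, not necessarily the right choice for this study.

```r
# Toy sample sheet for one source / sample type; values are invented.
sampleInfo <- data.frame(
  Subject   = c("S1", "S2", "S3", "S1", "S2", "S4"),
  TimePoint = c("Dose1Day1", "Dose1Day1", "Dose1Day1",
                "Dose1Day8", "Dose1Day8", "Dose1Day8")
)

# Subjects that have a sample at both time points.
day1   <- sampleInfo$Subject[sampleInfo$TimePoint == "Dose1Day1"]
day8   <- sampleInfo$Subject[sampleInfo$TimePoint == "Dose1Day8"]
paired <- intersect(day1, day8)

# Keep only the paired subjects and rebuild the factors on the subset.
pairedInfo <- sampleInfo[sampleInfo$Subject %in% paired, ]
pairedInfo$Subject   <- factor(pairedInfo$Subject)
pairedInfo$TimePoint <- factor(pairedInfo$TimePoint,
                               levels = c("Dose1Day1", "Dose1Day8"))

# Blocked design: subject effects plus a Day8-vs-Day1 effect within subject.
pairedDesign <- model.matrix(~ Subject + TimePoint, data = pairedInfo)
```

Here the last column of `pairedDesign` (`TimePointDose1Day8`) would carry the within-subject Day8 versus Day1 effect, at the cost of discarding the unpaired samples.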
Can I check, have you shown the whole dataset or is the table above just a subset of the data? If this is just a subset, how many subjects do you have in total?
Thank you for checking! No, this is only a subset of the (simplified) data. The actual data set has 165 samples, each annotated with the subject, the time point at which it was collected (a given number of days after a given drug dose), the sample source, and whether it was CD33/34 enriched or not.
Just one important point regarding your code snippet. You never need to account for sample sizes, by dividing by 7 or dividing by 20, when forming contrasts in edgeR. edgeR always accounts for sample sizes correctly by itself. If you want to compare Day8 to Day1 you just use Dose1Day8.PBMC.CD33_34pos - Dose1Day1.PBMC.CD33_34pos.
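A small base R analogy may help show why no division by group size is needed. It uses a Poisson GLM from `stats` rather than edgeR's negative binomial model, and the counts are simulated, so treat it as a sketch of the principle rather than of edgeR itself. With a group-means design, each coefficient is already the log of that group's mean, however many samples the group contains.

```r
set.seed(1)

# Simulated counts for a single gene: 20 samples at Day1 and 7 at Day8
# (group sizes taken from the question; the counts themselves are invented).
group  <- factor(rep(c("Day1", "Day8"), times = c(20, 7)))
counts <- c(rpois(20, lambda = 50), rpois(7, lambda = 100))

# Group-means parameterisation: one coefficient per group, no intercept.
fit <- glm(counts ~ 0 + group, family = poisson())

# A plain +1 / -1 contrast gives the log fold change of Day8 over Day1.
contrast <- c(groupDay1 = -1, groupDay8 = 1)
logFC    <- sum(contrast * coef(fit)[names(contrast)])

# The same quantity computed directly from the two group means.
manual <- log(mean(counts[group == "Day8"]) / mean(counts[group == "Day1"]))
```

`logFC` and `manual` agree up to numerical tolerance even though the groups are unbalanced; the differing sample sizes affect the precision of the estimate, not the form of the contrast.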
Thank you very much! That's right, I was thinking about group... Thank you so much for pointing it out.