Question

nested modeling for unequal groups with edgeR

Emmanouela Repapi ▴ 20

@emmanouela-repapi-6515

United Kingdom

Hello,

I have a fairly complicated experimental design and would like some help/feedback on constructing the model matrix. The data come from an experiment with two groups of mice (young/old). Cells from each mouse were sorted with one of two markers (sort1/sort2), and each sort yields positive and negative cells. The problem is that the mice are nested within both the sorts and the age groups, and that the groups contain different numbers of mice (3, 4 or 5). To explain a bit better, my sample table looks like this:

	sort	cell	age	mouse	mouse_nest
sample1	sort1	positive	young	1	1
sample2	sort1	negative	young	1	1
sample3	sort1	positive	young	2	2
sample4	sort1	negative	young	2	2
sample5	sort1	positive	young	3	3
sample6	sort1	negative	young	3	3
sample7	sort1	positive	young	4	4
sample8	sort1	negative	young	4	4
sample9	sort1	positive	young	5	5
sample10	sort1	negative	young	5	5
sample11	sort1	positive	old	6	1
sample12	sort1	negative	old	6	1
sample13	sort1	positive	old	7	2
sample14	sort1	negative	old	7	2
sample15	sort1	positive	old	8	3
sample16	sort1	negative	old	8	3
sample17	sort1	positive	old	9	4
sample18	sort1	negative	old	9	4
sample19	sort2	positive	young	10	1
sample20	sort2	negative	young	10	1
sample21	sort2	positive	young	11	2
sample22	sort2	negative	young	11	2
sample23	sort2	positive	young	12	3
sample24	sort2	negative	young	12	3
sample25	sort2	positive	old	13	1
sample26	sort2	negative	old	13	1
sample27	sort2	positive	old	14	2
sample28	sort2	negative	old	14	2
sample29	sort2	positive	old	15	3
sample30	sort2	negative	old	15	3

Initially I thought of splitting the data in two (sort1 and sort2 groups) and then using a nested design within that:

  design <- model.matrix( ~ cell + age +  age:cell + age:mouse_nest)

which works for sort2 but not for sort1, because the young and old groups contain different numbers of mice (5 vs 4). As far as I understand, the way to resolve this is either to remove a pair of samples so that I have 4 mice in each group, or to remove the age:mouse_nest term. However, neither solution sounds great to me, because a) I don't like removing samples and b) there seem to be differences between the mice. How do people choose which is best? By looking at the dispersion estimates? Are there any other ways to resolve this?
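The rank problem described above can be reproduced with a minimal base-R sketch. It rebuilds only the sort1 rows of the sample table; the object names (`sort1`, `empty`, `design_full`) are hypothetical. Because no old mouse has `mouse_nest` 5, the `age:mouse_nest` term produces a column containing only zeros. Dropping that empty column is shown here as one possible workaround under these assumptions, not as a recommendation made in this thread.

```r
# Rebuild the sort1 part of the sample table (18 samples, 9 mice).
sort1 <- data.frame(
  cell       = factor(rep(c("positive", "negative"), times = 9)),
  age        = factor(rep(c("young", "old"), times = c(10, 8))),
  mouse_nest = factor(c(rep(1:5, each = 2), rep(1:4, each = 2)))
)

design <- model.matrix(~ cell + age + age:cell + age:mouse_nest, data = sort1)

# Columns that are zero for every sample correspond to age/mouse
# combinations that do not exist in the data.
empty <- colnames(design)[colSums(design != 0) == 0]
print(empty)

# The design is rank deficient: fewer independent columns than columns.
cat("columns:", ncol(design), " rank:", qr(design)$rank, "\n")

# Dropping the empty column(s) leaves a full-rank design in this sketch.
design_full <- design[, colSums(design != 0) > 0, drop = FALSE]
cat("columns:", ncol(design_full), " rank:", qr(design_full)$rank, "\n")
```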

Also, I would like to compare the positive cells of one marker (sort1) with the positive cells of the other marker (sort2), so I would like to put all the samples together. But then the nesting problem becomes even greater because of the differences in group sizes. Is the best way to just put these samples together (sort1+ve vs sort2+ve) for young and old and forget about the nesting altogether, using a design matrix like the one below:

  design <- model.matrix( ~ age + sort + age:sort)

(or the equivalent form of combining them into one factor and using contrasts)

Thank you in advance for all your help!

Best wishes,

Emma

edger r edger de • 1.3k views

updated 8.8 years ago by Ryan C. Thompson • written 8.8 years ago by Emmanouela Repapi

score 2 · Accepted Answer · 2016-06-14

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Icahn School of Medicine at Mount Sinai…

The only factor that is nested inside another factor is mouse, so I think your best bet is to use limma-voom with duplicateCorrelation. Assuming you want the complete 3-way interaction between age, sort, and cell, I would create a group variable and use that in the design:

library(limma)
group <- interaction(sort, cell, age, sep=".", drop=TRUE)
design <- model.matrix(~0 + group)

and then proceed with the analysis as described here[1], using mouse as the block argument to duplicateCorrelation. If you don't want the 3-way interaction, use whatever design you like involving those three variables, but leave out mouse, since that is handled as a random effect by duplicateCorrelation.
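The workflow described above might look roughly like the following sketch. The sample table is rebuilt from the question, but the count matrix is simulated purely so that the code runs; the object names and the two contrasts are hypothetical. Running voom and duplicateCorrelation twice is an assumption about the procedure in the linked answer, so check that answer before relying on it.

```r
library(limma)

# Sample table from the question (30 samples, 15 mice).
samples <- data.frame(
  sort  = rep(c("sort1", "sort2"), times = c(18, 12)),
  cell  = rep(c("positive", "negative"), times = 15),
  age   = rep(c("young", "old", "young", "old"), times = c(10, 8, 6, 6)),
  mouse = factor(rep(1:15, each = 2))
)

# One coefficient per sort/cell/age combination.
group  <- interaction(samples$sort, samples$cell, samples$age,
                      sep = ".", drop = TRUE)
design <- model.matrix(~0 + group)
colnames(design) <- levels(group)

# Simulated counts, only so the sketch is runnable (not real data).
set.seed(1)
n_genes <- 200
counts <- matrix(rnbinom(n_genes * nrow(samples), mu = 100, size = 5),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("sample", seq_len(nrow(samples)))))

# Estimate the within-mouse correlation, then redo voom using it.
v      <- voom(counts, design)
corfit <- duplicateCorrelation(v, design, block = samples$mouse)
v      <- voom(counts, design, block = samples$mouse,
               correlation = corfit$consensus.correlation)
corfit <- duplicateCorrelation(v, design, block = samples$mouse)

# Mouse is handled as a random effect via block/correlation.
fit <- lmFit(v, design, block = samples$mouse,
             correlation = corfit$consensus.correlation)

# Hypothetical contrasts of interest between group coefficients.
contr <- makeContrasts(
  sort1_pos_vs_neg_young   = sort1.positive.young - sort1.negative.young,
  pos_sort1_vs_sort2_young = sort1.positive.young - sort2.positive.young,
  levels = design
)
fit2 <- eBayes(contrasts.fit(fit, contr))
print(topTable(fit2, coef = "pos_sort1_vs_sort2_young", number = 5))
```

With real data, `counts` would be replaced by the filtered and normalized counts (for example a `DGEList`), and the simulated results above carry no biological meaning.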

I don't think the unequal group sizes are an issue in this design, as long as you are properly modelling all the variables involved.
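As a quick check of that point, the following base-R sketch rebuilds the sample table from the question and confirms that the ~0+group design has one column per group and is full rank despite the unequal group sizes. The object names are hypothetical.

```r
# Sample table from the question (30 samples).
samples <- data.frame(
  sort = rep(c("sort1", "sort2"), times = c(18, 12)),
  cell = rep(c("positive", "negative"), times = 15),
  age  = rep(c("young", "old", "young", "old"), times = c(10, 8, 6, 6))
)

group  <- interaction(samples$sort, samples$cell, samples$age,
                      sep = ".", drop = TRUE)
design <- model.matrix(~0 + group)
colnames(design) <- levels(group)

# Number of samples per group: unequal, but every group is populated.
group_sizes <- colSums(design)
print(group_sizes)

# Full rank: as many independent columns as coefficients.
cat("columns:", ncol(design), " rank:", qr(design)$rank, "\n")
```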

[1]: A: using duplicateCorrelation with limma+voom for RNA-seq data

written 8.8 years ago by Ryan C. Thompson

Thank you for your answer Ryan! In my mind the cell factor is also nested within the sort because the positive cells are specific for the sort in question. There shouldn't be a great difference in the negative cells of the two sorts because they are just the remaining cells from either sort. Although in theory if you are taking out different things from the same pools of cells then you are left with different groups of cells, I wouldn't expect significant differences.

I think you are right in using limma for this analysis. Assuming that cell is also nested, something like this :

 design <- model.matrix( ~ sort + sort:cell + age + age:cell + sort:age + sort:age:cell)

would look unreasonably complicated to interpret properly, so I guess the best way is to keep the main effect of cell even if not much comes out of it. Correct me if I am wrong.

Many thanks,

Emma

written 8.8 years ago by Emmanouela Repapi

If you think that the negatives for both sorts should be equivalent, you can represent this by creating a single factor with 3 levels: "negative", "positive1", "positive2". By using this factor in place of sort and cell, you'll be comparing both positive groups to a common baseline consisting of all the negative samples from both sorts.
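A minimal base-R sketch of that 3-level factor, using the sample table from the question. The name `celltype` and the way it is derived from `sort` and `cell` are assumptions made for illustration.

```r
# Sample table from the question (30 samples).
samples <- data.frame(
  sort = rep(c("sort1", "sort2"), times = c(18, 12)),
  cell = rep(c("positive", "negative"), times = 15),
  age  = rep(c("young", "old", "young", "old"), times = c(10, 8, 6, 6))
)

# Pool the negatives of both sorts; keep the positives separate.
celltype <- ifelse(samples$cell == "negative", "negative",
                   ifelse(samples$sort == "sort1", "positive1", "positive2"))
celltype <- factor(celltype, levels = c("negative", "positive1", "positive2"))
print(table(celltype))

# Combine with age into a single group factor, as for the full design.
group  <- interaction(celltype, samples$age, sep = ".", drop = TRUE)
design <- model.matrix(~0 + group)
colnames(design) <- levels(group)
print(colSums(design))
```

This gives six coefficients (three cell types in each age group) instead of eight, since the negative samples from sort1 and sort2 now share one coefficient per age group.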

Either way, my recommended way to construct a design matrix for any interaction model is still to combine all the interacting factors into a single "group" variable as demonstrated above and then use a design of ~0+group, giving you a coefficient for each unique combination of factor levels, and then to construct contrasts between your groups of interest.

written 8.8 years ago by Ryan C. Thompson

With regard to your suggested design above, be aware that as long as you have sort:age:cell in the design, including or excluding any of the previous terms will only result in a different parametrization of the same design. So the design that you suggested is just a more complicated version of my suggested ~0+group.
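The equivalence can be checked numerically with a short base-R sketch: two designs are reparametrizations of each other when they span the same column space, which means stacking them side by side does not increase the rank. The object names are hypothetical. Note that `model.matrix` may emit more columns than the rank for the longer formula, because it omits some main effects; the rank comparison does not depend on the column count.

```r
# Sample table from the question (30 samples).
samples <- data.frame(
  sort = factor(rep(c("sort1", "sort2"), times = c(18, 12))),
  cell = factor(rep(c("positive", "negative"), times = 15)),
  age  = factor(rep(c("young", "old", "young", "old"), times = c(10, 8, 6, 6)))
)

# The longer formula suggested in the reply above.
design_long <- model.matrix(
  ~ sort + sort:cell + age + age:cell + sort:age + sort:age:cell,
  data = samples
)

# The single-factor parametrization.
group <- interaction(samples$sort, samples$cell, samples$age,
                     sep = ".", drop = TRUE)
design_group <- model.matrix(~0 + group)

r_long  <- qr(design_long)$rank
r_group <- qr(design_group)$rank
r_both  <- qr(cbind(design_long, design_group))$rank

# Equal ranks, and no gain from combining, imply the same column space.
cat("columns (long):", ncol(design_long), " rank (long):", r_long, "\n")
cat("rank (group):", r_group, " rank (combined):", r_both, "\n")
```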

(This only applies to factor variables, though. I think the situation with numeric/continuous variables is a bit different.)

written 8.8 years ago by Ryan C. Thompson