Question

Selecting the right package for complex designs

bensimon.ariel • 0

Dear all,

I am pretty much a novice, so any advice would be appreciated. After reading through the manuals of the three packages (limma, edgeR and lme4), I became more confused about what they have in common and how they differ. I am trying to select the "right" method to analyze data with the following structure. To give an example of the data, I wrote a quick table with some numbers (0-100). I actually have two biological replicates per table cell, but for simplicity I show one value.

treatment A (in time)  none  1 hour  6 hour  none  1 hour  6 hour  none  1 hour  6 hour
treatment drug         a     a       a       b     b       b       c     c       c
protein A              10    95      15      10    15      15      10    95      65
protein B              80    80      15      80    78      79      85    83      85

In this toy example, protein A goes up after 1 hour and back down after 6 hours under drug a (let us assume that this is the "normal" pattern); protein A doesn't go up under drug b; and protein A goes up and stays high at 6 hours under drug c. Meanwhile, protein B goes down only at 6 hours under drug a (let us assume that no change, as under drug b or c, is the "normal" pattern).

My biological question is to find the drugs (a, b or c) for which the pattern is not the "normal" pattern. So for protein A I would "get" drugs b and c, and for protein B I would "get" drug a. In practice I have many more drugs, and I would be looking to test for which drugs (and which proteins) there is an interaction between treatment A and the drug treatment.

As far as I can understand, all three packages can be used for such factorial designs, but I am not sure which to use.

Your suggestions are welcome.

Kind regards, AB

limma edger lme4 • 1.2k views

ADD COMMENT • link updated 8.2 years ago by Aaron Lun ★ 28k • written 8.2 years ago by bensimon.ariel • 0

score 0 · Answer 1 · 2017-02-13

Aaron Lun ★ 28k

The parametrization of your design matrix is simple. Just combine the time and drug factors into a single grouping factor, and use a one-way layout, i.e., each time-drug combination forms a separate group. This is the most flexible parametrization while still retaining residual degrees of freedom for dispersion estimation. With this model, you can do all the comparisons between/within drugs and time points that you want.
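As a rough illustration of the one-way layout described above, here is a minimal Python sketch that builds the combined grouping factor and a no-intercept design matrix. The level names (`a.none`, `a.1h`, ...) and the replicate count of two are assumptions taken from the toy example in the question. In R, this is roughly what `model.matrix(~0 + group)` would produce for a combined `group` factor.

```python
from itertools import product

# Assumed factor levels, taken from the toy table in the question.
drugs = ["a", "b", "c"]
times = ["none", "1h", "6h"]
replicates = 2  # the question mentions two biological replicates per condition

# One entry per sample: every drug-time combination, repeated per replicate.
samples = [(d, t) for d, t in product(drugs, times) for _ in range(replicates)]

# Combine drug and time into a single grouping factor.
group = [f"{d}.{t}" for d, t in samples]
levels = [f"{d}.{t}" for d, t in product(drugs, times)]

# One-way layout without an intercept: one indicator column per group.
design = [[1 if g == lvl else 0 for lvl in levels] for g in group]

# Residual degrees of freedom left over for dispersion/variance estimation.
residual_df = len(samples) - len(levels)

print(levels)
print(residual_df)
```

With 18 samples and 9 groups, 9 residual degrees of freedom remain, which is why replicates are needed for this parametrization to work.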

The more difficult problem is that your null hypothesis is not well defined, at least not for drug A. It seems like you want to detect genes that do not follow the "normal pattern". But what exactly is this? If you want to do hypothesis testing, you need to precisely and quantitatively define how the expression pattern behaves under the null. If expression "normally" goes up after an hour, by how much does it go up? Similarly, if it goes down after 6 hours, by how much? If you cannot describe this, the choice of package is the least of your concerns.

Drugs B and C are easier, where the null hypothesis is that there is no change with time. In this case, you can just do an ANODEV where all groups for a particular drug have the same expression under the null. If you reject, then that gene is not behaving "normally".
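To make the "no change with time" null concrete, here is a small sketch of the contrasts it implies on the one-way layout. The level names and the helper functions `contrast` and `no_change_contrasts` are hypothetical, introduced only for illustration. Testing both contrasts for a drug jointly corresponds to the null that all three time groups for that drug have the same expression.

```python
from itertools import product

# Assumed group levels for the one-way layout (drug.time).
drugs = ["a", "b", "c"]
times = ["none", "1h", "6h"]
levels = [f"{d}.{t}" for d, t in product(drugs, times)]


def contrast(plus, minus):
    """Contrast vector over the group levels for the comparison plus - minus."""
    vec = [0] * len(levels)
    vec[levels.index(plus)] += 1
    vec[levels.index(minus)] -= 1
    return vec


def no_change_contrasts(drug):
    """Two contrasts that are both zero when a drug shows no change over time."""
    return [
        contrast(f"{drug}.1h", f"{drug}.none"),
        contrast(f"{drug}.6h", f"{drug}.none"),
    ]


for row in no_change_contrasts("b"):
    print(row)
```

Rejecting the joint null for drug b or c would then flag that protein as not behaving "normally" under that drug.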

ADD COMMENT • link 8.2 years ago Aaron Lun ★ 28k

Thank you for your comments. Indeed, I forgot to say it explicitly. For protein A under treatment A only, I would see something like 10, 95, 15, meaning that drug a has no effect on protein A. I am therefore looking to formulate the model correctly, such that drug c at 6 hours would be the significant result for protein A, and drug a at 6 hours for protein B. If no treatment is applied, protein A would always remain at 10 and protein B would always remain at 80. I hope this answers your question.

ADD REPLY • link 8.2 years ago bensimon.ariel • 0

This is not precise enough. Ignore all other drugs besides drug A. What exactly is your null hypothesis for this drug? From what you're writing, the null hypothesis is that after 1 hour, you expect to get a 9.5-fold increase in expression compared to the zero time point, and after 6 hours, this drops to an expected 1.5-fold increase in expression. Is this correct? Just saying "some increase after 1 hour followed by a drop at 6 hours" is too vague to construct a hypothesis test. For example, does the expression at 6 hours return to the expression at time zero under the null?

Besides, you say that "drug A has no effect on protein A". But, I would say a near 10-fold increase in expression of protein A after 1 hour of treatment with drug A is, in fact, a pretty strong effect. Why is this not interesting?

ADD REPLY • link 8.2 years ago Aaron Lun ★ 28k

Thank you. I shall clarify: the null hypothesis is that under treatment A alone, I would get the values 10, 95, 15. Drug a combined with treatment A therefore shows the same pattern. It is not "interesting" to me because I am not looking for protein A itself, but for the interactions: drug c, where there is a deviation at 6 h (10, 95, 65 rather than 10, 95, 15), or drug b, which deviates at 1 h (10, 15, 15 instead of 10, 95, 15).
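One way to read the comparison described here is as a difference of time effects between a drug and a reference. The sketch below assumes, purely for illustration, that drug a serves as the reference pattern and that the comparison is made on log2 fold changes relative to the untreated time point. It uses the toy values for protein A from the question and computes point estimates only; an actual test would need the replicates and a variance estimate.

```python
import math

# Toy values for protein A from the question: (none, 1 hour, 6 hours) per drug.
protein_a = {"a": [10, 95, 15], "b": [10, 15, 15], "c": [10, 95, 65]}


def log2_fc(profile):
    """Log2 fold change of each later time point versus the first (untreated)."""
    baseline = profile[0]
    return [math.log2(v / baseline) for v in profile[1:]]


def interaction(profile, reference):
    """Difference in time effects (1 hour, 6 hours) between a drug and the reference."""
    return [x - r for x, r in zip(log2_fc(profile), log2_fc(reference))]


reference = protein_a["a"]  # assumption: drug a stands in for the "normal" pattern
for drug in ("b", "c"):
    effects = interaction(protein_a[drug], reference)
    print(drug, [round(e, 2) for e in effects])
```

On these toy numbers, drug b differs from the reference only at 1 hour and drug c only at 6 hours, which matches the deviations described above.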

ADD REPLY • link 8.2 years ago bensimon.ariel • 0

Edited: Where are you getting values of 10, 95 and 15 from? You can't define the null hypothesis after you look at the data, you need to define it beforehand. In other words, a separate piece of data must be used to get these numbers - is this the case? And are these numbers the same for all proteins (I would find this hard to imagine)?

In any case, it is unwise to frame a null hypothesis in terms of absolute numbers. What happens if, upon treatment with drug A, gene X exhibits an expression pattern of 20, 190 and 30? Is this interesting or not?
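The point about absolute numbers can be checked with a little arithmetic. In the sketch below, the helper `log2_fold_changes` is hypothetical and the two profiles are the ones mentioned in this thread. A null framed in absolute values would treat them as different, while a null framed in fold changes relative to the zero time point would not.

```python
import math


def log2_fold_changes(profile):
    """Log2 fold change of each later time point versus the first time point."""
    baseline = profile[0]
    return [math.log2(v / baseline) for v in profile[1:]]


reference = [10, 95, 15]  # pattern quoted for protein A
scaled = [20, 190, 30]    # gene X from the comment above: every value doubled

ref_lfc = log2_fold_changes(reference)
scaled_lfc = log2_fold_changes(scaled)

# Same relative pattern (9.5-fold at 1 hour, 1.5-fold at 6 hours), different absolute values.
same_pattern = all(math.isclose(r, s) for r, s in zip(ref_lfc, scaled_lfc))
print(same_pattern)
```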

ADD REPLY • link 8.2 years ago Aaron Lun ★ 28k