Question

2 cells X 2 conditions X patients

0

Entering edit mode

dickson.russel ▴ 10

@dicksonrussel-14158

Last seen 6.4 years ago

Hi,

I've a small question concerning differential expression analysis regarding 12 samples divided into 2 cell lines (cell 1 & 2), two conditions: tumor (T) and control (C):

Cell	Condition	Patient number (p)
Cell1	T	1
Cell1	T	2
Cell1	T	4
Cell2	T	1
Cell2	T	2
Cell2	T	4
Cell1	C	3
Cell1	C	5
Cell1	C	6
Cell2	C	3
Cell2	C	5
Cell2	C	6

In order to capture the differences among Cell1 and Cell2 (considering Cell2 and condition C as reference levels), I used the following model:

design = ~ Cell+Condition+Cell:Condition

but the DEGs (significant genes) I am getting here via interaction term (Cell:Condition) is not considering patient. Therefore, please help me to design a model which can perform a differential expression analysis by doing the following:

(i) Cell1/Cell2: T samples vs. Cell1/Cell2 C samples including patient information.

I like to obtain DEGs between Cell 1 and Cell2 cell lines considering (i) comparison.

Thank you,

deseq2 • 1.3k views

ADD COMMENT • link updated 7.3 years ago by Michael Love 43k • written 7.3 years ago by dickson.russel ▴ 10

score 1 · Answer 1 · 2017-11-07

1

Entering edit mode

Michael Love 43k

@mikelove

Last seen 16 hours ago

United States

There’s a section of the vignette which discusses this topic, see controlling for individual level differences in the section on “model matrix not full rank”

ADD COMMENT • link 7.3 years ago Michael Love 43k

1

Entering edit mode

Here's that section:

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#group-specific-condition-effects-individuals-nested-within-groups

Note that the terminology is different, your "condition" is called "group" in the vignette, and your "cell" is called "condition" in the vignette. Otherwise you should be able to apply the recommendation straight away to your data.

ADD REPLY • link 7.3 years ago Michael Love 43k

0

Entering edit mode

Hi Michael:

Thank you for your suggestion and the link. I have applied the model as suggested in vignettes by creating an additional column i.e. (ind.n)

ds <- DESeqDataSetFromMatrix(countData=counts, colData=samples, design=~cell+cell:ind.n+cell:cnd) here cnd=condition

ds$cell <- relevel(ds$cell, "Cell2")
ds$cnd <- relevel(ds$cnd, "C")

ds=DESeq(ds)

This model has returned the following resultsNames:

cell_X_vs_Y, cellX.ind.np2, cellY.ind.np2, cellX.ind.np3, cellY.ind.np3, cellX.cndT, cellY.cndT

Consider X as cell1 and Y as cell2.

But this model is not giving the results for patient1 samples (T vs. C). How can I get this?

To obtain the differences among two cell lines considering patients information, do I need to combine all the significant genes obtained by results name ending with .np2 & .np3. Is this correct?

Thank you!

ADD REPLY • link 7.3 years ago dickson.russel ▴ 10

1

Entering edit mode

Note that you're not following the example exactly. It should be (using your terms) condition:individual, not cell:individual. This is because it is across condition (C or T) that the individuals are nested.

Furthermore, once you use the condition:individual, note that the model controls for patient differences, but it's letting you estimate the Y vs X effect for C and for T, or to contrast that effect between T and C. But the differences seen in individual patients are used as replication to evaluate the significance. You will not get Y vs X coefficients for each patient. You would need more replication to do a statistical analysis at that level.

ADD REPLY • link 7.3 years ago Michael Love 43k

0

Entering edit mode

Michael - thank you for your reply.

I want the exact interaction what you have mentioned in your response i.e. Y vs X effect for C and for T, or to contrast that effect between T and C, specially contrast between T(Y/X) and C(Y/X).

Can you please check my the below approach, if I am doing correct?

Revised EDF (experimental design file):

Cell	Patient	cnd	ind.n (new column)
X	p3	C	p1
Y	p3	C	p1
X	p5	C	p2
Y	p5	C	p2
X	p6	C	p3
Y	p6	C	p3
X	p1	T	p1
Y	p1	T	p1
X	p2	T	p2
Y	p2	T	p2
X	p4	T	p3
Y	p4	T	p3

As per your suggestion, I am using the following design:

##################################################################################

ds<- DESeqDataSetFromMatrix(countData=counts, colData=edf, design = ~cnd+cnd:ind.n + cnd:cell)

ds$cnd <- relevel(ds$cnd, "C")
ds$cell <- relevel(ds$cell, "X")

ds = DESeq(ds)

resultsNames(ds)

[1] "Intercept" "cnd_T_vs_C" "cndC.ind.np2" "cndT.ind.np2" "cndC.ind.np3" "cndT.ind.np3" "cndC.cellX" "cndT.cellX"

res.model1 <- results(ds, contrast=list("cndT.cellX", "cndC.cellX")

#######################################################################

is this the correct design to fetch DEGs for the contrast between T(Y/X) and C(Y/X)?

Thank you

ADD REPLY • link 7.3 years ago dickson.russel ▴ 10

1

Entering edit mode

If you relevel() like in the code above, then the resultsNames() should have given you "cndC.cellY" instead of "cndC.cellX". Can you try that again and see if you get "cndC.cellY". I'm guessing you may have added in the relevel() code later but the copy/paste of resultsNames() was from before?

But aside from this issue, if you use

results(dds, name="cndC.cellY")
results(dds, name="cndT.cellY")
results(dds, contrast=list("cndT.cellY", "cndC.cellY"))

You will get 1) the Y vs X effect in C, 2) the Y vs X effect in T, and 3) the difference between the Y vs X effect in T compared to C, all while controlling for baseline patient differences.

ADD REPLY • link 7.3 years ago Michael Love 43k

0

Entering edit mode

Michael - thank you very much for your diligent feedback and detailed explanation which is very much appreciated.

Regarding relevel(), you correctly guessed that I copied/pasted before adding relevel(), therefore, the interested names are as you mentioned i.e. cndC.cellY and cndT.cellY.

Again, thank you very much!

ADD REPLY • link 7.3 years ago dickson.russel ▴ 10

0

Entering edit mode

Hi Michael:

After getting DEGs using contrast = list("cndT.cellY", "cndC.cellY"), I have notified some unusual observations :

1. the fold change (or logfc) doesn't match with the expression pattern of several genes for two different groups. For example, say gene-x has down patterns in group X but more up expression in Y, the expected fold should be up whereas contrast is giving opposite results (negative fold) and vice verse.

2. With FDR <5%, there is no signal gene found significant. But at pv<0.05, I found ~600 genes. Upon comparing the top significant genes (p-value) using their relative expression values in three patients for a given groups [C or T], genes are not at all having similar patterns see below table:

X and Y are two cell lines

C and T are control and Tumor

Gene	X/Y(p1) : C	X/Y(p2):C	X/Y(p3):C	X/Y(p1):T	X/Y(p2):T	X/Y(p3):T	pv
1	up	down	up	down	up	down	significant
2	down	down	up	up	up	down	significant
3	up	up	down	down	down	down	significant

3. When I performed a t-test on the relative expression values, I got ~2000 genes with pv<0.05 with perfect patterns among patients of two different groups (up up up in C, down down down in T or vice versa), but not with DESeq2. Is there any bug in my code mentioned in my previous messages?

Looking forward to learn your suggestions.

Thank you

ADD REPLY • link 7.3 years ago dickson.russel ▴ 10

0

Entering edit mode

I see you have "X/Y" in the tables above, although if you read my explanation above, the coefficients specify a log fold change for Y/X. Is this the confusion? You determine what the reference level is, which is the denominator of the fold change. Earlier you said the reference (denominator) should be X. It can be whatever you want, but you should pick one and go with it for the whole analysis.

You shouldn't be looking at p-values, only adjusted p-values.

You can expect 5% of a null dataset to have a p-value less than 0.05. So looking at the number of small p-values isn't necessarily interesting, and hence statisticians have come up with the methods for controlling FDR, and an adjusted p-value to produce such sets.

A t-test on estimated fold changes ignores the data distribution, and does not share parameter estimates across genes. It is very much not recommended by me. It's not surprising to me that you get different results with a method which is totally different.

ADD REPLY • link 7.3 years ago Michael Love 43k

0

Entering edit mode

Thank you for your reply, Michael.

Regarding X/Y or Y/X both fold-change values are not matching with the relative expression values. Could it be because of higher FDR values?

I do agree with your comment on FDR cut-off.

ADD REPLY • link 7.3 years ago dickson.russel ▴ 10

0

Entering edit mode

I guess I would double check how you are double-checking the coefficients, and walk through the 3 tables I described above. The contrast is simply the difference between LFCs for table 1 and 2, so there should be no surprises when you get to table 3.

ADD REPLY • link 7.3 years ago Michael Love 43k