Question

How to do multivariate glmFit analysis of RNA-seq with edgeR

0

Entering edit mode

mousheng xu ▴ 10

@mousheng-xu-2280

Last seen 7.3 years ago

In edgeR, given the following data table:

GENE EXPRESSION DISEASE A B
A1BG  3       1   0   0
A1BG 0       1   0   0
...
ZZZ3   6       0   0   1
ZZZ3   7       0   0   1

A: concentration of compound A (continuous var)

B: concentration of compound B (continuous var)

DISEASE: 0/1 whether it's a disease or normal sample

If what I care is to do the following "glm":

EXPRESSION ~ intercept + DISEASE + A + B

Then,

1) How should I define "group" in edgeR? DISEASE can be grouped, but A or B cannot because they are continuous

2) coef = 3?

3) Should I use contrast of c(0, 1, 1, 1)?

In effect, I don't know much about "coef = x" or why one needs to "group" or specify "contrast". If we can assume EXPRESSION or log(EXPRESSION) is normally distributed, then in R we can simply do

glm(EXPRESSION ~ DISEASE + A + B)

Don't know why edgeR is so (unnecessarily?) complicated.

It would be much appreciated if someone could post the actual edgeR code, or just enlighten me with some teaching.

This is what I have come to so far:

-----------------------------------------------------------
d<-read.table("myfile.txt", sep="\t", header=T);
sd<-subset(d, GENE=="A1BG");
attach(sd)
dsn<-model.matrix(~DISEASE+A+B, data=sd);
fit<-glmFit(EXPRESSION, dsn, coef=c(2,3,4))

-----------------------------------------------------------

And I got the following error:

-----------------------------------------------------------

> fit<-glmFit(EXPRESSION, dsn, coef=c(2,3,4))

Error in glmFit.default(EXPRESSION, dsn, coef = c(2, 3, 4)) :

No dispersion values provided.

-----------------------------------------------------------

coef=2 yielded the same error.

Thanks a lot in advance.

limma edger rnaseq • 1.6k views

ADD COMMENT • link 7.9 years ago mousheng xu ▴ 10

score 1 · Answer 1 · 2017-03-29

1

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 1 hour ago

United States

Those expression values aren't counts, so you shouldn't be using edgeR in the first place. And the A and B covariates aren't continuous, so far as I can tell. And you don't pass raw matrices to edgeR, which you would already know if you had read the edgeR User's Guide. Or the glmFit help page.

From what you have shown it doesn't look like you have even tried to understand what you are doing. You will find that it's hard to get people to help you when it seems you are unwilling to help yourself. You should at the very least read the edgeR User's Guide, which is quite comprehensive.

ADD COMMENT • link 7.9 years ago James W. MacDonald 68k

0

Entering edit mode

Hi James, thanks a lot for the reply.

The expression values are expected count produced by RSEM, and edgeR can handle non-integer "raw" count like this. What you have seen about A & B are part of the values they take. In effect, A & B both can be 0, 0.1, 0.5, 1, and these should be considered as continuous.

I have been able to use edgeR to process such expected count with 2 groups of samples (e.g. case ~ control) with replicates, and everything seems to work fine. However this time the compounds come into play, and I want to use edgeR to model more input variables.

I did read the user's manual over and over. However, guess my statistics background is not strong enough to understand all relevant parts yet.

ADD REPLY • link 7.9 years ago mousheng xu ▴ 10

1

Entering edit mode

I am not sure this has to do with your statistics background. There are like a bazillion different examples in the edgeR User's Guide, and none of them are remotely similar to what you have done. You need to read your data into a DGEList, then construct a design matrix, then fit the model, etc.

If you have read the manual over and over, why aren't you emulating what you have read? The idea is to give examples that people could try to modify to fit their needs.

Plus, you have an expected count from RSEM of -2? I am not that familiar with RSEM, so I could easily be mistaken, but I cannot imagine that RSEM would really return a negative expected count.

If your statistics background isn't strong enough to understand everything, then you have two choices. Either get to learning so it is strong enough, or find somebody local who already understands this stuff.

ADD REPLY • link 7.9 years ago James W. MacDonald 68k

0

Entering edit mode

Good catch on the -2. It's my fault -- it's actually log10(expression + 0.01) when expression = 0. 0.01 is the pseudo-count. I pasted the wrong data. However, the problems remain the same. I've made corrections to the original post.

Meanwhile, dude, I really appreciate your help and sorry for the errors. But, you seem to be quite mad, which does not seem to be necessary. Forum is a place for people to ask questions, even stupid questions. If everybody knows everything, then why people ask questions, right?

Thanks for pointing out DGEList. That is helpful, too.

ADD REPLY • link 7.9 years ago mousheng xu ▴ 10

0

Entering edit mode

At a minimum, we expect people to follow the documentation for whatever package they're using - see http://www.bioconductor.org/help/support/posting-guide/. If you don't, then any questions you might have will simply be answered with "read the user's guide". In this respect, James is already doing quite a bit more than he needs to. Just looking at your code suggests you didn't even read the Quick Start section (1.2) at the front of the edgeR user's guide; this would indicate that you needed to run DGEList and estimateDisp prior to calling glmFit.