In edgeR, given the following data table:
GENE EXPRESSION DISEASE A B
A1BG 3 1 0 0
A1BG 0 1 0 0
...
ZZZ3 6 0 0 1
ZZZ3 7 0 0 1
A: concentration of compound A (continuous var)
B: concentration of compound B (continuous var)
DISEASE: 0/1 whether it's a disease or normal sample
If what I care is to do the following "glm":
EXPRESSION ~ intercept + DISEASE + A + B
Then,
1) How should I define "group" in edgeR? DISEASE can be grouped, but A or B cannot because they are continuous
2) coef = 3?
3) Should I use contrast of c(0, 1, 1, 1)?
In effect, I don't know much about "coef = x" or why one needs to "group" or specify "contrast". If we can assume EXPRESSION or log(EXPRESSION) is normally distributed, then in R we can simply do
glm(EXPRESSION ~ DISEASE + A + B)
Don't know why edgeR is so (unnecessarily?) complicated.
It would be much appreciated if someone could post the actual edgeR code, or just enlighten me with some teaching.
This is what I have come to so far:
-----------------------------------------------------------
d<-read.table("myfile.txt", sep="\t", header=T);
sd<-subset(d, GENE=="A1BG");
attach(sd)
dsn<-model.matrix(~DISEASE+A+B, data=sd);
fit<-glmFit(EXPRESSION, dsn, coef=c(2,3,4))
-----------------------------------------------------------
And I got the following error:
-----------------------------------------------------------
> fit<-glmFit(EXPRESSION, dsn, coef=c(2,3,4))
Error in glmFit.default(EXPRESSION, dsn, coef = c(2, 3, 4)) :
No dispersion values provided.
-----------------------------------------------------------
coef=2 yielded the same error.
Thanks a lot in advance.
Hi James, thanks a lot for the reply.
The expression values are expected count produced by RSEM, and edgeR can handle non-integer "raw" count like this. What you have seen about A & B are part of the values they take. In effect, A & B both can be 0, 0.1, 0.5, 1, and these should be considered as continuous.
I have been able to use edgeR to process such expected count with 2 groups of samples (e.g. case ~ control) with replicates, and everything seems to work fine. However this time the compounds come into play, and I want to use edgeR to model more input variables.
I did read the user's manual over and over. However, guess my statistics background is not strong enough to understand all relevant parts yet.
I am not sure this has to do with your statistics background. There are like a bazillion different examples in the edgeR User's Guide, and none of them are remotely similar to what you have done. You need to read your data into a DGEList, then construct a design matrix, then fit the model, etc.
If you have read the manual over and over, why aren't you emulating what you have read? The idea is to give examples that people could try to modify to fit their needs.
Plus, you have an expected count from RSEM of -2? I am not that familiar with RSEM, so I could easily be mistaken, but I cannot imagine that RSEM would really return a negative expected count.
If your statistics background isn't strong enough to understand everything, then you have two choices. Either get to learning so it is strong enough, or find somebody local who already understands this stuff.
Good catch on the -2. It's my fault -- it's actually log10(expression + 0.01) when expression = 0. 0.01 is the pseudo-count. I pasted the wrong data. However, the problems remain the same. I've made corrections to the original post.
Meanwhile, dude, I really appreciate your help and sorry for the errors. But, you seem to be quite mad, which does not seem to be necessary. Forum is a place for people to ask questions, even stupid questions. If everybody knows everything, then why people ask questions, right?
Thanks for pointing out DGEList. That is helpful, too.
At a minimum, we expect people to follow the documentation for whatever package they're using - see http://www.bioconductor.org/help/support/posting-guide/. If you don't, then any questions you might have will simply be answered with "read the user's guide". In this respect, James is already doing quite a bit more than he needs to. Just looking at your code suggests you didn't even read the Quick Start section (1.2) at the front of the edgeR user's guide; this would indicate that you needed to run
DGEList
andestimateDisp
prior to callingglmFit
.Thanks for the instructions, Aaron.