Hi,
I have multiple rna-seq samples with treated and untreated with a drug. Something which looks like:
tissue treatment gender patient1 treated female patient1 untreated female patient2 treated female patient2 untreated female patient3 treated female patient3 untreated female patient4 treated male patient4 untreated male patient5 treated male patient5 untreated male
I am interested to see the DE genes between treated and untreated considering the gender. There might be other factors that add noise to data such as the time of collection, age etc.
So far what I have done is the following. Is it the correct approach ? Which would be the best way to get rid of unknown noise and find DE genes ?
library(edgeR) rawdata <- read.delim("selected_raw_count_matrix.txt",header=T,row.names=1) y <- DGEList(counts=rawdata) y <- calcNormFactors(y) gender=c("female","female","female","female","female","female","male","male","male","male") treat=c("treated","untreated","treated","untreated","treated","untreated","treated","untreated","treated","untreated") design <- model.matrix(~gender+treat) y <- estimateGLMCommonDisp(y, design, verbose=TRUE) y <- estimateGLMTrendedDisp(y, design) y <- estimateGLMTagwiseDisp(y, design) fit <- glmFit(y, design) lrt <- glmLRT(fit) topTags(lrt)
Thanks for the answer. this is very useful. my goal is to look for effect of the drug on gene expression, while masking the heterogeneity introduced by sexes, age, and other unknown factors , a my data seems to have lot of variation etc. There is no clear separation of treated and untreated samples in MDS plot for both the designs, (~0 + gender + gender:treat) and model.matrix(~ 0 + patient + gender:treat). I am bit worried.
Well, even if the treated/untreated samples are all over the place on the MDS plot, the DE analysis will still be okay if the treatment effect is consistent between patients. This kind of within-patient similarity is hard to see on the plot, so you'll need to do the analysis itself to check. Any problems can then be usually diagnosed by looking at the expression profiles of the genes with the most variable dispersions, e.g., to see if any patients are acting abnormally.
Thanks. RUVSq seems to be a good alternative but I could not find a way to input the design factor for normalisation. They did not mention anything about multi factor design for normalisation. Have you used it for multi factor experiments ?
From what I understand, the
RUVg
method doesn't need a design matrix if you have known controls. You can try theRUVr
method, which accepts a design matrix and operates on the residuals of the fitted GLM.I disagree. It is more correct to adjust for patient differences as part of the linear model, than as part of the normalization. This is the same as for any paired analysis.
Sorry I realise this thread is ancient, but I'm wondering why it's ok to remove the last column from the design matrix? Also, why this design results in a matrix that is not full rank? Thanks