Question

Stuck on how to voom-limma on RSEM data from TCGA

0

Entering edit mode

Peipei • 0

@59a190f4

Last seen 2.7 years ago

Canada

Hi there, I'm a complete newbie to working with data in R for my own project that's not been cleaned up and streamlined for a university course, so wanted to ask if anyone has had experience working with or has thoughts on using data from the TCGA's RNA Seq V2 RSEM data for differential expression analysis through voom and limma

After scouring the forums, it seems that there's a general consensus to filter out any low counts and then run voom-limma

My question is whether the presence of a mutation in a tumour leads to significant up or downregulation of genes compared to those without the mutation. However, when I do this with the TCGA's RSEM data, I get that almost all of the genes in both my groups are significantly upregulated. Any thoughts or advice would be appreciated. This is bits and pieces of what I've been able to learn online through papers, forums, and youtube videos.

I can also supply more code to try to describe the problem better. Thanks so much for reading!

#mrna is a dataframe of 20531 genes x 230 samples
counts <- mrna
counts[is.na(counts)] <- 0
d0 <- DGEList(counts)

#mut is a variable of 'Yes' and 'No' regarding whether the patient harbours the mutation of interest or not
group <- as.factor(sample$mut)
d0$samples$group <- group

keep <- rowSums(counts) > 20
counts <- counts[keep,]

design <- model.matrix(~0+group)
colnames(design) <- gsub("group", "", colnames(design))

v <- voom(d0, design, plot=T)
vfit <- lmFit(v, design)
efit <- eBayes(vfit)

summary(decideTests(efit))
tfit <- treat(vfit, lfc=1)
dt <- decideTests(tfit)
summary(dt)
#the output of this is that both the groups with and without the mutation almost all the 20,300 genes that were included are upregulated

TCGA r RNASeqData voom limma • 3.5k views

ADD COMMENT • link 2.8 years ago Peipei • 0

score 2 · Accepted Answer · 2022-06-17

2

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

Your approach is fine but you've made a simple mistake. If you had simply omitted the 0+ from the model.matrix statement and not reset the column names then everything would have been fine. At the moment, you're not actually conducting a differential expression analysis at all.

You would be better off following the official limma documentation rather than bits and pieces from forums and youtube videos:

Try typing limmaUsersGuide() at the R prompt.
Or you could try the limma-voom workflow at https://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html
Or try the guide to creating design matrices https://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/designmatrices.html

ADD COMMENT • link 2.8 years ago Gordon Smyth 52k

0

Entering edit mode

Thanks so much for your response. I was following the limma-voom workflow link previously but now after reading the guide for creating design matrices, I'm seeing that I didn't make my wild type group the reference group and maybe that contributed to the problem

However, with that adjustment and removing the requirement for the intercept to be 0 and the renaming of the columns, I'm now getting that none of the genes are up or downregulated out of ~20,000 genes- is this likely? Or does my analysis just need more adjustment?

#mrna is a dataframe of 20531 genes x 230 samples 
#(row names are genes, col names are sample names)
counts <- mrna
counts[is.na(counts)] <- 0
d0 <- DGEList(counts)

group <- as.factor(sample$mut)
group <- relevel(group, 'WT') #reorder to make WT the reference
d0$samples$group <- group

keep <- rowSums(counts) > 20
counts <- counts[keep,]

design <- model.matrix(~group)
v <- voom(d0, design, plot=T)
vfit <- lmFit(v, design)
efit <- eBayes(vfit)
summary(decideTests(efit))

ADD REPLY • link 2.8 years ago Peipei • 0

0

Entering edit mode

After tinkering with this all day, some strange things that I've found are:

When I run View(decideTests(efit)), my TestResults matrix yields all 1s for the (Intercept) column and all 0s for the groupmut column- why would this happen if I specified d0$samples$group <- group earlier and my d0 object showed that I clearly delineated which sample fell into which group?
When I run topTable(efit), all my adj. P values are exactly the same at 0.3699211, but when I run topTable(efit, coef=1), I get different adj P values that are highly significant. I've read the documentation on coef within topTable, but am not sure in my case what the NULL coef and a coef of 1 means for my data.

Thanks for reading until this point and I thank you again in advance for your help.

ADD REPLY • link 2.8 years ago Peipei • 0

0

Entering edit mode

I was following the limma-voom workflow link

The workflow showed you how to form contrasts but you skipped that step.

I didn't make my wild type group the reference group and maybe that contributed to the problem

No it didn't. If you use 0+ for the design matrix then you must explicitly form contrasts. The question of which is the reference group is quite separate. The only difference that setting the reference group makes is to change the sign of the logFC values. The p-values will be identical either way.

I'm now getting that none of the genes are up or downregulated out of ~20,000 genes- is this likely?

Yes, it is very likely indeed that noisy human data will show no significant DE. TCGA in particular is notoriously noisy. It is one of the key aims of the limma software that it does not give significant DE results unless there is good reproducible evidence for differential expression.

Or does my analysis just need more adjustment?

Very likely. You've skipped out several important steps from the limma-voom workflow including filterByExpr to cut down on variability, library size normalization with calcNormFactors and data exploration with plotMDS. You almost certainly need sample quality weights. There are almost certainly batch effects or covariates that should be accounted for. Large human datasets require lots of quality asssessment and trouble-shooting.

counts[is.na(counts)] <- 0

There are no circumstances in which there should be any NA values in an RNA-seq dataset! This makes me suspicious that there is something wrong or non-standard with your pipeline.

some strange things that I've found are

Nothing at all strange about these things. The first coefficient of the fitted model is the intercept. It makes no sense to test whether the intercept is equal to zero, of course it won't be. Please type colnames(efit) and then compare to the documentation to understand what the two coefficients represent.

ADD REPLY • link 2.8 years ago Gordon Smyth 52k

0

Entering edit mode

Hi again, thank you so much for getting back to me so thoroughly. You're right that I didn't include contrasts and that was from my misunderstanding that I wasn't supposed to use them since I am comparing between just two groups.

Your point about the NA values is very helpful, I saw that one sample had 516 NAs and once I removed that, there were no longer any NA values for the rest of the dataset. I tried to more thoroughly follow along with the guidelines and also went through the limma User Guide and have come to the following:

counts <- mrna #mrna is a dataframe of 20531 genes x 229 samples (removed the one with NAs)
#(row names are genes, col names are sample names)

group <- as.factor(sample$mut)
group <- relevel(group, 'NS') 

d1 <- DGEList(counts, genes=row.names(counts), remove.zeros=T)
d1$samples$group <- group

design <- model.matrix(~0+group)

#filtering out lowly expressed genes
keep <- filterByExpr(d1, group)
d1 <- d1[keep,,keep.lib.sizes=FALSE]

d1 <- calcNormFactors(d1)

v <- voom(d1, design, plot=T)
vwts <- voomWithQualityWeights(d1, design=design) #trying to add sample quality weights
vfit2 <- lmFit(vwts, design)
contr <- makeContrasts(groupCS - groupNS, levels = colnames(coef(vfit2)))
vfit2 <- contrasts.fit(vfit2, contr)
vfit2 <- eBayes(vfit2)
summary(decideTests(vfit2))

The output I have from this final call is that still nothing is up or down regulated

My output when I try to explore the data with plot MDS (with NS being no mutation and CS being currently mutated) looking like this:

par(mfrow=c(1,2))
plotMDS(d1, labels=d1$samples$group, col=ifelse(d1$samples$group=="NS","blue","red"))
plotMDS(d1,labels=d1$samples$group,col=ifelse(d1$samples$group=="NS","blue","red"), gene.selection="common") 
#I've seen this use of 'common' for gene selection but unsure what it does and how it's able to 
#account for so much variation

enter image description here

So the way this data looks is that there is no natural clustering of my two groups, and perhaps indeed no DE genes would that be a correct interpretation? Would the next step to be to find another dataset to ask this question on or is there more that can be done with this data in terms of adjustments for an exploratory analysis? Especially since the % variance explained by the top two dimensions are both only 3% with the 'pairwise' gene.selection, it feels like I'm hitting a dead end.

Apologies if these questions are incredibly naive, thanks for your patience and reading through.

ADD REPLY • link 2.8 years ago Peipei • 0

0

Entering edit mode

The MDS plots show absolutely no suggestion of any separation between the groups, i.e., no suggestion of differential expression. I can't tell you want to do next as that's a research question rather than a software question. It is theoretically possible but very unlikely that there might a DE in the data if appropriate covariates or technical effects were accounted for, but I have no idea whether any such covariates exist.

To find out what gene.selection="common" does, type help(plotMDS).

ADD REPLY • link 2.8 years ago Gordon Smyth 52k

0

Entering edit mode

Thank you so much! One final question, I think it makes sense that nothing is significantly up or downregulated based on plotMDS, but why would in turn every single gene show up as upregulated with the same data when I give more weight to fold changes in:

logCPM <- cpm(d1, log=TRUE, prior.count=3)

fit2 <- lmFit(logCPM, design)
fit2 <- treat(fit, lfc=log2(1.2), trend=TRUE)
summary(decideTests(fit2))
top.table2<- topTreat(fit2, coef=ncol(design))

ADD REPLY • link 2.8 years ago Peipei • 0

0

Entering edit mode

You have gone back to the same mistake as in your original question. Let me repeat, you don't need to use 0+ to create the design matrix but, if you do, then you must explicitly form contrasts.

ADD REPLY • link 2.8 years ago Gordon Smyth 52k

0

Entering edit mode

Okay I see, that makes complete sense- thanks so much again for your patience and help! Will be going back to the drawing board to reassess next steps

ADD REPLY • link 2.8 years ago Peipei • 0