Question

ANOVA using edgeR

0

Entering edit mode

myprogramming2016 • 0

@myprogramming2016-9741

Last seen 7.9 years ago

Hello,

I would like to identify deferentially expressed genes.

I have 50 different individuals in three biological replicates (50 x 3=150) and they are not challenged for any condition (for e.g. control vs wild). I mean there is no treated vs non-treated.

I would like to perform an ANOVA. Could you please exemplify using some example code in edgeR?

Thanks in advance.

YK

edger • 4.5k views

ADD COMMENT • link 9.2 years ago myprogramming2016 • 0

Ryan C. Thompson · Answer 1 · 2016-02-17

3

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 2 hours ago

The city by the bay

Differential expression... between what? If you have no conditions, then the only contrast you can do is between individuals. This would only be sensible if you're looking for highly variable genes across individuals. Identifying DE genes between specific individuals seems pointless as this cannot be generalized to the population.

Anyway, to run an ANOVA-like comparison in edgeR, you should set up your design matrix like this:

individuals <- factor(rep(1:50, each=3))
design <- model.matrix(~individuals)

And then run estimateDisp, glmFit and glmLRT (or their quasi-likelihood equivalents). For glmLRT, you need to drop all non-intercept coefficients via the coef argument:

res <- glmLRT(fit, coef=2:ncol(design))

... which will test against the null hypothesis that all individuals have the same expression.

ADD COMMENT • link 9.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks Aaron.

I would like to look at the differential expression among 50 different individuals. Do you have any suggestion?

ADD REPLY • link updated 9.2 years ago by Ryan C. Thompson ★ 7.9k • written 9.2 years ago by myprogramming2016 • 0

0

Entering edit mode

Aaron already provided a suggestion for how to do it in his answer. If that suggestion didn't work, can you explain why it's not what you're looking for?

ADD REPLY • link 9.2 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Hi Aaron and Ryan,

I am getting an error while designing a model (design <- model.matrix(~individuals). This code is not working.

Could you please go through the following code?

Thanks for your help.

R code:

x<-read.delim("test.txt",header=T,sep="\t")
y<-DGEList(counts = x[,2:13]) #using only 12 individuals for test 
y$samples
#Filter lowly expressed genes
keep<-rowSums(cpm(y)>1)>=2
y<-y[keep,keep.lib.sizes=FALSE]
head(y$counts)
#individuals1<-factor(rep(1:12,each=3))
individuals<-factor(y$counts)
design <- model.matrix(~individuals) # it's not working
#design <- model.matrix(object = ~y$counts)
y<-estimateGLMCommonDisp(y, design)
y<-estimateGLMTrendedDisp(y,design)
y<-estimateGLMTagwiseDisp(y.design)
fit<-glmFit(y,design)
res <- glmLRT(fit, coef=2:ncol(design))

ADD REPLY • link updated 9.2 years ago by Gordon Smyth 52k • written 9.2 years ago by myprogramming2016 • 0

0

Entering edit mode

The line individuals<-factor(y$counts) makes no sense. I suggest you follow my original instructions. Even your subsetting is wrong; you've picked libraries 2 to 13, so you're either just using one replicate from each individual, or you're using all replicates from only 4 individuals. If you want to test it, then you should use all replicates from 12 individuals and use individuals <- factor(rep(1:12, each=3)).

ADD REPLY • link 9.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

Sorry Aaron! There was an error in subsetting. I have revised the code and its working fine now.

Thanks for your help.

ADD REPLY • link updated 9.2 years ago by Gordon Smyth 52k • written 9.2 years ago by myprogramming2016 • 0

0

Entering edit mode

Dear Aaron, I am getting an error while generating Smear (MA) plot in edgeR. Is there any way that I can generate a smear plot using all the individual and highlight differentially expressed genes? Thanks *I used following Codes:* individuals <- factor(rep(1:50, each=3)) design <- model.matrix(~individuals) #design <- model.matrix(~0+individuals) head(design) #estimateDisp y<-estimateGLMCommonDisp(y,design) y<-estimateGLMTrendedDisp(y,design) y<-estimateGLMTagwiseDisp(y,design) res <- glmLRT(fit, coef=2:ncol(design)) top<-topTags(res,n=15) tt = topTags(res, n=nrow(y)) head(tt$table) nc = cpm(y, normalized.lib.sizes=TRUE) rn = rownames(tt$table) rn deg = rn[tt$table$FDR < .05] *plotSmear(res, de.tags=deg)* Error in plotSmear(res, de.tags = deg) : table$logFC slot in DGELRT object is NULL. We cannot produce an MA (smear) plot if more than one coefficient from the GLM is being tested in the likelihood ratio test as this results in more one logFC value per gene---one for each coefficient. *plotSmear(res$table$logFC.individuals2, de.tags=deg)* Error in plotSmear(res$table$logFC.individuals2, de.tags = deg) : Currently only supports DGEList/DGELRT/DGEExact objects as the object argument. On Thu, Feb 18, 2016 at 6:36 PM, myprogramming2016 [bioc] < noreply@bioconductor.org> wrote: > Activity on a post you are following on support.bioconductor.org > > User myprogramming2016 <https: support.bioconductor.org="" u="" 9741=""/> wrote Comment: > ANOVA using edgeR <https: support.bioconductor.org="" p="" 78480="" #78560="">: > > Sorry Aaron! There was an error in subsetting. I have revised the code and > its working fine now. > > Thanks for your help. > > > > > > > > > > > > > > ------------------------------ > > Post tags: edger > > You may reply via email or visit > C: ANOVA using edgeR >

ADD REPLY • link 9.2 years ago myprogramming2016 • 0

0

Entering edit mode

No, there isn't. The error message is fairly self-explanatory; the ANOVA yields multiple log-fold changes for each gene, so how on earth would you fit that into a single plot?

ADD REPLY • link 9.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

Is there any way to show differentially expressed genes? Thanks On Sat, Feb 20, 2016 at 11:59 AM, Aaron Lun [bioc] <noreply@bioconductor.org> wrote: > Activity on a post you are following on support.bioconductor.org > > User Aaron Lun <https: support.bioconductor.org="" u="" 6732=""/> wrote Comment: > ANOVA using edgeR <https: support.bioconductor.org="" p="" 78480="" #78612="">: > > No, there isn't. The error message is fairly self-explanatory; the ANOVA > yields multiple log-fold changes for each gene, so how on earth would you > fit that into a single plot? > > ------------------------------ > > Post tags: edger > > You may reply via email or visit > C: ANOVA using edgeR >

ADD REPLY • link 9.2 years ago myprogramming2016 • 0

0

Entering edit mode

With an ANOVA? Not unless you want to look at 50 plots (one showing the log-fold change for each individual against the rest). Even then, it wouldn't help; the log-fold change may only be large in one or two individuals/plots, but you'd end up marking them as DE in all plots because the ANOVA doesn't specify between which individuals the DE occurs.

ADD REPLY • link 9.2 years ago Aaron Lun ★ 28k

score 0 · Answer 2 · 2016-02-20

0

Entering edit mode

myprogramming2016 • 0

@myprogramming2016-9741

Last seen 7.9 years ago

Thanks for the clarification.

Secondly, I would like to look at the genes which are highly variable across the individuals.

Does topTags() function can be used to get highly variable genes?

top<-topTags(res,n=15)

ADD COMMENT • link 9.2 years ago myprogramming2016 • 0

1

Entering edit mode

No. edgeR is built to identify DE genes, not highly variable genes. If you want to identify HVGs across individuals, I would suggest fitting an intercept-only model (i.e., all individuals treated as replicates, probably after summing all libraries for each individual together so you get one sample per individual) and then ranking the genes on their tagwise dispersions after running estimateDisp. This will give you an idea of the top HVGs; it doesn't really make sense to ask whether they're significantly highly variable, because what amount of variability would you expect under the null hypothesis anyway?

ADD REPLY • link 9.2 years ago Aaron Lun ★ 28k

score 0 · Answer 3 · 2016-02-23

0

Entering edit mode

myprogramming2016 • 0

@myprogramming2016-9741

Last seen 7.9 years ago

Thanks for your suggestion.

Is there any way to look at the genes expressed in each individual?

Thanks

ADD COMMENT • link 9.2 years ago myprogramming2016 • 0

1

Entering edit mode

Looking at "the genes expressed in each individual" doesn't make any sense. I'm sure that you'll find plenty of genes that are expressed, like actin or GAPDH or something equally boring. I can only presume you want to find genes that are unique to each individual, though even this seems pointless. Who cares about what happens in a particular individual, if you can't generalize it to/replicate the result in other individuals?

P.S. please use the "Add comment" field instead of the "Add answer" field to respond, otherwise the thread will be broken.

ADD REPLY • link 9.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

I forgot to mention you that I am dealing with the plant species.

Yeah, I would like to look at the genes that are unique to a particular individuals.

I don't know whether edgeR can perform this analysis.

Thanks.

ADD REPLY • link 9.2 years ago myprogramming2016 • 0

1

Entering edit mode

Hm... well, if you can re-use particular individuals for new experiments, perhaps individual-specific inferences might be useful. Still seems a bit iffy, though. Anyway, if you want to identify genes that are unique to an individual, there's a couple of things you can do. The least specific approach would be:

plant <- factor(rep(1:50, each=3))
design <- model.matrix(~0 + plant)
# ... edgeR stuff ...
con <- makeContrasts(plant1 - (plant2 + ... + plant50)/49, levels=design)
res <- glmLRT(fit, contrast=con)

... which would identify genes that are up/down-regulated in plant 1 compared to the average of all other plants. This should give you a fair number of DE genes, but many of these may not be specific to plant 1 in the strict sense - for example, if a gene is expressed in both plant 1 and 50, you might still detect it as being DE if all other plants have lower expression.

The most specific approach would be something like this:

con <- makeContrasts(plant1 - plant2, ..., plant1 - plant50, levels=design)
all.pvalues <- list()
all.signs <- list()
for (x in seq_len(ncol(con))) {
    res <- glmLRT(fit, contrast=con[,x])
    all.pvalues[[x]] <- res$table$PValue
    all.signs[[x]] <- sign(res$table$logFC)
}
final.pvalue <- do.call(pmax, all.pvalues)
final.fdr <- p.adjust(final.pvalue, method="BH")
all.signs <- do.call(cbind, all.signs)
same.sign <-  rowSums(all.signs[,1]!=all.signs)==0L
keep <- which(final.fdr <= 0.05 & same.sign)

... which will identify genes that are DE in the same direction between plant 1 and each other plant. This will probably give no genes, as it's very conservative - lack of evidence for DE between any two plants will cause the gene to be discarded.

A compromise might be to do an ANOVA:

res <- glmLRT(fit, contrast=con) # using the same contrast above.
current.fdr <- p.adjust(res$table$PValue, method="BH")
all.signs <- sign(res$table[,grep("logFC", colnames(res$table)),drop=FALSE])
same.sign <-  rowSums(all.signs[,1]!=all.signs)==0L
keep <- which(current.fdr <= 0.05 & same.sign)

... which will identify all genes that have any DE between plant 1 and any other plant, and where plant 1's expression is either higher or lower than that of all other plants.

I should add that you'll have to set up different contrasts for each plant. This can be done without re-writing the long makeContrasts call every time. For the first approach, you can just do:

con <- rep(-1/49, 50)
con[x] <- 1 # where 'x' is the plant of interest.

... while for the second and third approaches, you can do:

con <- -diag(50)
con[x,] <- 1
con <- con[,-x]

ADD REPLY • link 9.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks Aaron for your help.

ADD REPLY • link 9.2 years ago myprogramming2016 • 0

0

Entering edit mode

Hi Aaron,

Thanks for your help!. It really helps.

Do I need to change this code (con[x,] <- 1) and run for each plant separately. That means, I will have to run it 50 times.

Secondly, How do I get an estimate of how many genes expressed in each plants?
In the end, I would like to calculate common genes across all the plants.

ADD REPLY • link 9.2 years ago myprogramming2016 • 0

1

Entering edit mode

First question, yes.

Second question, that's generally not a sensible thing to ask, as it depends on what you consider "expressed". 1 read? 2 reads? 10 reads? How much of that is due to mapping errors, or leaky transcription, or contamination from something else? If you want to find non-DE genes, you can do an ANOVA like before, and just pick the genes with large p-values.

ADD REPLY • link 9.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

Hi, Do you have any script to prepare data matrix for edgeR? I have used featurecounts using Linux platform to get raw read counts.

Sorry! I did ask this question before, but couldn't resolve an issue.

Commands: for i in *.sam; do featureCounts -p -T 5 -t exon -g gene_id -a genes.gtf -o $i.counts.txt $i;done

Thanks,

YK

ADD REPLY • link 9.1 years ago myprogramming2016 • 0

0

Entering edit mode

This doesn't seem related to the ANOVA question. Please keep the discussion relevant to the original question, or start a new post. In any case, it seems that Wei has provided you with some useful links in his answer (https://support.bioconductor.org/p/78611/#78641). If it doesn't help, respond to his answer directly.

ADD REPLY • link 9.1 years ago Aaron Lun ★ 28k