Using linear models to find differential gene expression (NGS)

0

Entering edit mode

Johnny H ▴ 80

@johnny-h-3952

Last seen 9.8 years ago

United Kingdom

Hi. I have found some R/Bioconductor/Genominator code on the web (below) and it measures differential expression of RNA-seq short read data using a general linear model. Can someone explain some basic questions I have? 1) What is the reason for using 2 glm's for measuring differential expression? 2) In the function(y) there are two linear models ran; one with argument y ~ groups and the other with argument y ~ 1. Why do this? 3) What does the offset do? 4) Why use ANOVA; is to compare the linear models? 5) What can we say about results, if adjusted for multiple testing; how would you explain a significant result? 6) Would an adjusted p-value of <= 0.05 be significant? Basically, I don't know much about the statistics done below and any advice or pointers to good literature for this would be a great help. Thank you. > head(geneCountsUI) mut_1_f mut_2_f wt_1_f wt_2_f YAL069W 0 0 0 0 YBL049W 19 18 10 4 # Normalisation of RNA-seq lanes notZero <- which(rowSums(geneCountsUI) != 0) upper.quartiles <- apply(geneCountsUI[notZero, ], 2, function(x) quantile(x, 0.75)) uq.scaled <- upper.quartiles/sum(upper.quartiles) * sum(laneCounts) # Calculating differential expression groups <- factor(rep(c("mut", "wt"), times = c(2, 2))) pvalues <- apply(geneCountsUI[notZero, ], 1, function(y) { fit <- glm(y ~ groups, family = poisson(), offset = log(uq.scaled)) fit0 <- glm(y ~ 1, family = poisson(), offset = log(uq.scaled)) anova(fit0, fit, test = "Chisq")[2, 5] }) [[alternative HTML version deleted]]

• 7.8k views

ADD COMMENT • link updated 14.6 years ago by Thomas J Hardcastle ▴ 180 • written 14.6 years ago by Johnny H ▴ 80

1

Entering edit mode

Simon Anders ★ 3.8k

@simon-anders-3855

Last seen 4.7 years ago

Zentrum für Molekularbiologie, Universi…

Hi Johnny On 09/01/2010 02:25 PM, Johnny H wrote: > I have found some R/Bioconductor/Genominator code on the web (below) and it > measures differential expression of RNA-seq short read data using a general > linear model. The code you quote uses a GLM of the Poisson family to analyse RNA- Seq. Please note that this is not appropriate because it cannot and does not assess biological variation! As this might contradict what other people say here, I demonstrate: Assume you have two conditions ("A" and "B") with two biological replicates each. For this example, we further assume that the library sizes are, due to some magic, the same in all four samples, so that we don't have to worry about normalization. Let's say we have, for a given gene, these counts: > y <- c( A1=180, A2=200, B1=240, B2=260 ) > group <- factor( c( "A", "A", "B", "B" ) ) Let's see the within group mean and standard deviation: > tapply( y, group, mean ) A B 190 250 > tapply( y, group, sd ) A B 14.14214 14.14214 From condition "A" to "B", the counts raise from an average of 190 to an average of 250, with a within-group SD of below 15. This should be significant, and it is: > fit0 <- glm( y ~ 1, family=poisson() ) > fit1 <- glm( y ~ group, family=poisson() ) > anova( fit0, fit1, test="Chisq" ) Analysis of Deviance Table Model 1: y ~ 1 Model 2: y ~ group Resid. Df Resid. Dev Df Deviance P(>|Chi|) 1 3 18.2681 2 2 1.8533 1 16.415 5.089e-05 *** --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 Now, for comparison, look at this data: > y2 <- c( A1=100, A2=280, B1=150, B2=350 ) > group <- factor( c( "A", "A", "B", "B" ) ) The means are the same, but the SD is much larger: > tapply( y2, group, mean ) A B 190 250 > tapply( y2, group, sd ) A B 127.2792 141.4214 An increase by 60 (from 190 to 250) should not be significant, if the standard deviation within group is more than twice as much. But as a Poisson regression does not care about this, we get a very low p value: > fit0 <- glm( y2 ~ 1, family=poisson() ) > fit1 <- glm( y2 ~ group, family=poisson() ) > anova( fit0, fit1, test="Chisq" ) Analysis of Deviance Table Model 1: y2 ~ 1 Model 2: y2 ~ group Resid. Df Resid. Dev Df Deviance P(>|Chi|) 1 3 187.48 2 2 171.06 1 16.415 5.089e-05 *** --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 The proper thing to do is to use a family that allows for overdispersion, typically the negative binomial family, not the Poisson family. For a detailed discussion of this issue, see our preprint: http://precedings.nature.com/documents/4282/ For a solution of your problem that works for simple comparisons (one condition against another one), see the our "DESeq" package (or Robinson et al.'s "edgeR" package). If you need to do something more sophisticated, like a two-way anova: we are working on it. Simon +--- | Dr. Simon Anders, Dipl.-Phys. | European Molecular Biology Laboratory (EMBL), Heidelberg | office phone +49-6221-387-8632 | preferred (permanent) e-mail: sanders at fs.tum.de

ADD COMMENT • link 14.6 years ago Simon Anders ★ 3.8k

0

Entering edit mode

Dear Simon, Naomi and Kasper, Thank you very much for your detailed answers and suggestions. It is very helpful. Simon, it is very interesting how you describe the example. From what I understand; I should use a negative Binomial statistic (your DESeq or EdgeR) that accounts for biological variation (when comparing two conditions), which reduces false positive calls produced by the Poisson approach, which does not account for the biological variation. Thank you, Cheers, J On Wed, Sep 1, 2010 at 3:47 PM, Simon Anders <anders@embl.de> wrote: > Hi Johnny > > > On 09/01/2010 02:25 PM, Johnny H wrote: > >> I have found some R/Bioconductor/Genominator code on the web (below) and >> it >> measures differential expression of RNA-seq short read data using a >> general >> linear model. >> > > The code you quote uses a GLM of the Poisson family to analyse RNA- Seq. > Please note that this is not appropriate because it cannot and does not > assess biological variation! > > As this might contradict what other people say here, I demonstrate: > > Assume you have two conditions ("A" and "B") with two biological replicates > each. For this example, we further assume that the library sizes are, due to > some magic, the same in all four samples, so that we don't have to worry > about normalization. > > Let's say we have, for a given gene, these counts: > > > y <- c( A1=180, A2=200, B1=240, B2=260 ) > > group <- factor( c( "A", "A", "B", "B" ) ) > > Let's see the within group mean and standard deviation: > > > tapply( y, group, mean ) > A B > 190 250 > > tapply( y, group, sd ) > A B > 14.14214 14.14214 > > From condition "A" to "B", the counts raise from an average of 190 to an > average of 250, with a within-group SD of below 15. This should be > significant, and it is: > > > fit0 <- glm( y ~ 1, family=poisson() ) > > fit1 <- glm( y ~ group, family=poisson() ) > > > anova( fit0, fit1, test="Chisq" ) > Analysis of Deviance Table > > Model 1: y ~ 1 > Model 2: y ~ group > Resid. Df Resid. Dev Df Deviance P(>|Chi|) > 1 3 18.2681 > 2 2 1.8533 1 16.415 5.089e-05 *** > --- > Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 > > > Now, for comparison, look at this data: > > > y2 <- c( A1=100, A2=280, B1=150, B2=350 ) > > group <- factor( c( "A", "A", "B", "B" ) ) > > The means are the same, but the SD is much larger: > > > tapply( y2, group, mean ) > A B > 190 250 > > tapply( y2, group, sd ) > A B > 127.2792 141.4214 > > An increase by 60 (from 190 to 250) should not be significant, if the > standard deviation within group is more than twice as much. > > But as a Poisson regression does not care about this, we get a very low p > value: > > > fit0 <- glm( y2 ~ 1, family=poisson() ) > > fit1 <- glm( y2 ~ group, family=poisson() ) > > anova( fit0, fit1, test="Chisq" ) > Analysis of Deviance Table > > Model 1: y2 ~ 1 > Model 2: y2 ~ group > Resid. Df Resid. Dev Df Deviance P(>|Chi|) > 1 3 187.48 > 2 2 171.06 1 16.415 5.089e-05 *** > --- > Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 > > > The proper thing to do is to use a family that allows for overdispersion, > typically the negative binomial family, not the Poisson family. For a > detailed discussion of this issue, see our preprint: > http://precedings.nature.com/documents/4282/ > > For a solution of your problem that works for simple comparisons (one > condition against another one), see the our "DESeq" package (or Robinson et > al.'s "edgeR" package). > > If you need to do something more sophisticated, like a two-way anova: we > are working on it. > > Simon > > > +--- > | Dr. Simon Anders, Dipl.-Phys. > | European Molecular Biology Laboratory (EMBL), Heidelberg > | office phone +49-6221-387-8632 > | preferred (permanent) e-mail: sanders@fs.tum.de > [[alternative HTML version deleted]]

ADD REPLY • link 14.6 years ago Johnny H ▴ 80

0

Entering edit mode

Kasper Daniel Hansen ★ 6.5k

@kasper-daniel-hansen-2979

Last seen 21 months ago

United States

This code is from the paper Bullard et al. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics (2010) vol. 11 (1) pp. 94 which extensively discusses why to use a GLM and why to include an offset. Most of the questions you have below are essentially basic statistics questions and are not - in my opinion - very suitable for a discussing on an email list. I suggest you find a local statistician or someone who is familiar with GLMs in order to explain this (they don't need to know anything about RNA-Seq). But inline below there are a few quick answers. On Wed, Sep 1, 2010 at 8:25 AM, Johnny H <ukfriend22 at="" googlemail.com=""> wrote: > Hi. > I have found some R/Bioconductor/Genominator code on the web (below) and it > measures differential expression of RNA-seq short read data using a general > linear model. > > Can someone explain some basic questions I have? > > 1) What is the reason for using 2 glm's for measuring differential > expression? You (always) compare two models in statistical test (whether or not it is explicit). One model says there is a group specific effect and the other model says there is no difference between the two models. > 2) In the function(y) there are two linear models ran; one with argument y ~ > groups and the other with argument y ~ 1. Why do this? See 1. > 3) What does the offset do? It controls for the sequencing effort, more or less like dividing with the total number of mapped reads when computing RPKM. But if you just straight up divide, you don't have integer values anymore. > 4) Why use ANOVA; is to compare the linear models? In this case it computes a likelihood ratio test statistic between the two models and computes asymptotic p-values based on a Chisq approximation. > 5) What can we say about results, if adjusted for multiple testing; how > would you explain a significant result? Usually people want to control the FDR. You can do that using for example rawp2adj2 from the multtest package. You need the full vector of p-values. Note that it has been shown that this poisson test is terrible for biological replicates (or at least terrible in the sense of giving bad p-values). > 6) Would an adjusted p-value of <= 0.05 be significant? See 5), but that would generally be the aim of using adjusted p-values. However, adjustment procedures assume you have valid raw p-values before adjusting. > > Basically, I don't know much about the statistics done below and any advice > or pointers to good literature for this would be a great help. Thank you. > >> head(geneCountsUI) > ? ? ? ?mut_1_f mut_2_f wt_1_f wt_2_f > YAL069W ? ? ? 0 ? ? ? 0 ? ? ?0 ? ? ?0 > YBL049W ? ? ?19 ? ? ?18 ? ? 10 ? ? ?4 > > # Normalisation of RNA-seq lanes > notZero <- which(rowSums(geneCountsUI) != 0) > upper.quartiles <- apply(geneCountsUI[notZero, ], 2, function(x) quantile(x, > 0.75)) > uq.scaled <- upper.quartiles/sum(upper.quartiles) * sum(laneCounts) > > # Calculating differential expression > groups <- factor(rep(c("mut", "wt"), times = c(2, 2))) > > pvalues <- apply(geneCountsUI[notZero, ], 1, > ?function(y) { > ? ?fit <- glm(y ~ groups, family = poisson(), offset = log(uq.scaled)) > ? ?fit0 <- glm(y ~ 1, family = poisson(), offset = log(uq.scaled)) > ? ?anova(fit0, fit, test = "Chisq")[2, 5] > }) > > ? ? ? ?[[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD COMMENT • link 14.6 years ago Kasper Daniel Hansen ★ 6.5k

0

Entering edit mode

Thomas J Hardcastle ▴ 180

@thomas-j-hardcastle-3860

Last seen 7.5 years ago

United Kingdom

Hi Johnny, In addition to the previous answers, you might also like to take a look at the BMC Bioinformatics paper 'baySeq: Empirical Bayes methods for identifying differential expression in sequence count data' which compares the effectiveness of a number of methods of detecting differential expression in sequence data as well as introducing a novel method for such analyses. Tom ** <http: www.biomedcentral.com="" 1471-2105="" 11="" 422=""> -- Dr. Thomas J. Hardcastle Department of Plant Sciences University of Cambridge Downing Street Cambridge, CB2 3EA United Kingdom

ADD COMMENT • link 14.6 years ago Thomas J Hardcastle ▴ 180

Login before adding your answer.