Question

F test for two phenotypic and multiple phenotypic categories.

0

Entering edit mode

StephK ▴ 70

@stephk-5152

Last seen 8.0 years ago

Dear all,

I have some gene expression data sets. In some of the data sets, I am comparing two age categories, and in some I am comparing multiple age categories. For each dataset, I want to do a two-tailed F test between the age categories to check if the slope of the curve is different from zero, which would indicate an association between expression signal and age.

I've been trying to understand the method here: https://www.biostars.org/p/222041/. Dr. Devon Ryan is so kindly trying to help me. I'm really struggling and I just don't understand specifically where in the code I am meant to change; I didn't want to keep bothering Dr. Ryan, so I emailed Prof. Smyth who said to ask here. The problem is I can get the code to run, but I want to make sure I am using the software correctly.

If someone had the time to glance at the code below and edit the line that I should change, so that the code reads: I input the pData (different age categories; sometimes two, sometimes more than 2 age categories) and gene expression data, and the output is a two-tailed F test with adjusted F test P values that tells me if the slope of the curve differs from 0, indicating a relationship between expression and age.I'm also not sure if I can use the same script for two age categories and more than two age categories.

The code:

The gene expression data (Table.tabbed) looks same for both 2-age, and more-than-2-age cases:

Gene    Sample1    Sample2    Sample3    Sample4    Sample5    Sample6….
Gene1 5.2 5.05 5.01 4.96 5.31 4.834
Gene2 4.85 4.59 4.91 5.06 4.75 4.86
Gene3 7.68 7.30 7.58 7.64 7.44 7.69

and the pData file looks like this (Two-Age case):

Sample Age
Sample1    Age2.5
Sample2    Age2.5
Sample3    Age2.5
Sample4    Age20
Sample5    Age20
Sample6    Age20

pData file: (More than 2 age category case):

Sample    Age
Sample1    3.0
Sample2    3.0
Sample3    10.0
Sample4    10.0
Sample5    50.0
Sample6    50.0

###########

This is the code that I use for both the two-age, and more-than-two-age category data sets:

library(limma)
library(Biobase)

#Make an expression data set

expr <-as.matrix(read.table("Table",header=TRUE,sep="\t",row.names=1,as.is=TRUE))minimalSet <-ExpressionSet(assayData=expr)
pData <-read.table("pData",row.names=1,header=TRUE,sep="\t")
metadata <-data.frame(labelDescription=c("Age"),row.names=c("Age"))
phenoData <-new("AnnotatedDataFrame",data=pData,varMetadata=metadata)
exampleSet <-ExpressionSet(assayData=expr,phenoData=phenoData,annotation="Mouse430_2")

#Do limma

f <-factor(as.character(exampleSet$Age))
design <-model.matrix(~f) #I think this line is the first problem
fit = eBayes(lmFit(exampleSet,design))
topTable(fit) #And I think this line is the second problem

############

The output (2-age case):

Removing intercept from test coefficients
logFC AveExpr          t      P.Value    adj.P.Val
Gene1   -19.8 13.2 -19.6 9.3e-09 3.3e-05
Gene2    18.6 14.7  13.8 2.0e-07 3.6e-04
Gene3   13.0 12.4  11.0 1.4e-06 1.6e-03

The output (more than 2 age case):

Removing intercept from test coefficients
          fAge12.0  fAge23.0   fAge6.0   fAge9.0   AveExpr  F PVal AdjPVal
Gene1     5.3  10.08 2.4   3.2  2.0 5.7 3.5e-17 1.3e-13
Gene2     4.2  1.4   3.03  2.4  2.3 4.3 1.e-16  1.3e-12
Gene3     1.8  3.6   7.9   9.9  8.2 3.6 1.e-12  1.5e-11

If someone could tell me specifically what I should write/how I should edit the code so that I input the pData (different age categories; sometimes two, sometimes more than 2 age categories) and gene expression data, and the output is a two-tailed F test with adjusted F test P values that tells me if the slope of the curve differs from 0, indicating a relationship between expression and age. I have to stress, I can get the code to run, and I can get it to run if I modify certain things, it's just wanting to make sure that the code i'm running is actually doing what I want it to do. I appreciate it, thank you.

limma f test differential gene expression • 964 views

ADD COMMENT • link updated 8.0 years ago by Ryan C. Thompson ★ 7.9k • written 8.0 years ago by StephK ▴ 70

score 0 · Answer 1 · 2016-11-19

You say you want to test the slope, but you're treating the age as a categorical factor rather than a covariate. To make the code consistent with your text, you would do something like:

f <- exampleSet$Age # this should be numeric.
design <- model.matrix(~f)

... which yields a design matrix with an intercept and slope. The latter represents the log-fold change in expression with respect to a single unit of time (weeks, years, etc.). Running topTable will then perform a t-test against the null hypothesis that the slope is equal to zero, i.e., there is no effect of time on expression.

That being said, treating the age as a factor is not unreasonable if you have replicates at each age. If you run your original code:

f <-factor(exampleSet$Age)
design <-model.matrix(~f)

... this will yield a design matrix with an intercept (representing the average log-expression of the samples with the lowest age) and various coefficients representing the log-fold change of the other ages over the lowest. Running topTable will then perform a F-test against the null hypothesis that all log-fold changes are zero, i.e., there is no time effect on expression.

The difference between these two approaches is that the first (with the slope) assumes a linear effect of age, whereas the second does not. On the flip side, the second approach needs to use up more information to estimate the many coefficients, whereas the first only needs to estimate the intercept and slope, which can provide more power if the effect is truly linear. I generally use splines as this provides a compromise between these two extremes, but if you have enough replicates and/or you're interested in behaviour at specific ages, then you could go with the factor approach. In short, your code looks fine (assuming you've filtered and normalized elsewhere).

Of course, the choice between these two methods only makes a difference if you have more than two different ages. In your 2-age case, the above approaches should be identical in the p-values that are reported. The only difference should be in the size of the log-fold changes, and that should just be due to scaling. For example, if you have a log-fold change of 10 between age=1 and age=6 when you treat the age as a factor, this would result in a log-fold change of 10/(6-1)=2 when you treat the age as a covariate.