Question

topTable no t-test results

0

Entering edit mode

s.vander.sluis • 0

@svandersluis-22980

Last seen 4.9 years ago

Hi, I conducted a multi-level experiment with gene expression measures in multiple brain regions in 2 groups.

So for instance, my data for region 1 look like this (note that for this post, I renamed the rows: these are gene names in my data): (Note that I recoded missings so that I have at least 3 valid observations per group for every gene; otherwise, all observations for that gene are recoded to NA).

  Control.A Control.A.1 Control.A.2 Control.A.3    Case.A  Case.A.1  Case.A.2  Case.A.3
1        NA          NA          NA          NA       NA       NA       NA       NA
2        NA          NA          NA          NA       NA       NA       NA       NA
3        NA          NA          NA          NA       NA       NA       NA       NA
4        NA          NA          NA          NA       NA       NA       NA       NA
5  24.97941    25.50746    24.64801    25.49045 26.14554 25.77994 25.57589 25.84357
6  23.77846    23.86665    23.96921    23.60105 24.06159 23.44947 24.59412 24.40061

When I look at the topTable results for region 1, I see that there are sometimes values for logFC and AveExpr, but unaccompanied by test results.

For instance, my topTable results for the data shown above look like this:

      logFC  AveExpr        t    P.Value adj.P.Val         B
1        NA 25.06110       NA         NA        NA        NA
2        NA      NaN       NA         NA        NA        NA
3        NA 24.48188       NA         NA        NA        NA
4        NA 22.03522       NA         NA        NA        NA
5 0.6799025 25.47455 2.545508 0.01499063 0.1087153 -3.502281
6 0.3226035 24.04767       NA         NA        NA        NA

sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

I do not understand this. First, I do not understand how I can sometimes have AVeExpr values but no logFC values. Second, if I have logFC and/or AveExpr values, why do I not obtain test results?

Any ideas on what went wrong?

kind regards
Sophie

limma • 1.1k views

ADD COMMENT • link updated 4.9 years ago by Gordon Smyth 52k • written 4.9 years ago by s.vander.sluis • 0

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

It seems that your data has a huge number of NAs.

limma needs only one non-NA value to compute AveExpr but it needs at least two groups to have non-NA values before it can compute a logFC. Hence, if there are variable patterns of NA between genes, it is perfectly possible to get AveExpr without the other values.

On the other hand, I do not see how it is possible to get a logFC without a t-statistic. limma will usually give a t-statistic and p-value even for genes with no residual df (it does that by using the empirical Bayes prior). So I really would like to see the code that you used to generate the topTable.

Note that there is no need to recode any values to NA. limma will correctly use whatever non-NA values it finds and it is perfectly capable of using just 1 or 2 values in a group.

I don't know what technology you are using or what special issues there might be that lead to NAs but, in my experience, it is seldom necessary to have any NA values in a expression data set. If the data are from microarrays or RNA-seq, then NA values can almost always be avoided by appropriate processing of the data.

Correction

It is possible to get logFCs without t-statistics if you are using contrasts.fit, which I now see is happening in your code. I will elaborate on this in a separate answer.

ADD COMMENT • link 4.9 years ago Gordon Smyth 52k

0

Entering edit mode

Thanks for the swift response.

I recoded the missings because I wanted the group comparisons to be based on at least three data points per group. I will ask the person who gave me the data whether there are technical reasons for the number of missings (it is an exploratory study so maybe the expression of some genes was just too low in certain regions?).

However, I already checked; the missingness does not cause the strange output (if I leave all observed values in, I still see logFC and AveExpr without t-tests, and AVeExpr values for genes where I only see NA's in my raw data, like for genes 1, 3 and 4 in the above example).

Maybe the array file that I obtained has a different format than expected by limma? My raw data file looks like this:

Given K genes, measured in P regions, in 2 groups with 4 subjects each,

the file has K rows, and P x 2 x 4 columns.

i.e., each column has expression data of one person in a specific region.

I used the file in that format as array for the ExpressionSet. Is another format expected?

The code looks like this:

# "condition" refers to case/control (a factor) and "regionid" refers to a factor distinguishing 6 different brain regions
f <- factor(paste(ExprSet$condition, ExprSet$regionid, sep = "."))
design <- model.matrix(~0+f)
colnames(design) <- levels(f)

#Estimate correlation between measurements made on the same subject
# "subject" is a factor distinguishing the subjects per group
corfit <- duplicateCorrelation(ExprSet, design, block = ExprSet$subject)
corfit$consensus.correlation

#Fit model with inter-subject correlation
fit <- lmFit(ExprSet, design, block = ExprSet$subject, correlation = corfit$consensus)

cm <- makeContrasts(
  CasevsControlCBWM =   Case.A - Control.A,   
  CasevsControlCortex = Case.B - Control.B,
  CasevsControlFRWM =   Case.C - Control.C,
  CasevsControlPons =   Case.D - Control.D,
  CasevsControlSPCGM =  Case.E - Control.E,
  CasevsControlSPCWM =  Case.F - Control.F,
  levels = design)

fit2 <- contrasts.fit(fit, cm)
fit2 <- eBayes(fit2)

results <- decideTests(fit2)  
summary(results)
res.CBWM <- topTable(fit2, coef = "CasevsControlCBWM", adjust = "BH", sort = "none", n = Inf)

ADD REPLY • link 4.9 years ago s.vander.sluis • 0

0

Entering edit mode

I will ask the person who gave me the data whether there are technical reasons for the number of missings

Personally, I don't think there could ever be a good justification for this many NAs for microarray or RNA-seq data. (The situation is different with mass spectrometry data.) If you are a bioinformatician, and the data is microarray or RNA-seq, you could ask them to provide you with the raw data so you can redo the preprocessing yourself.

it is an exploratory study so maybe the expression of some genes was just too low in certain regions?

It is not correct to replace low values with NAs. This will introduce bias and will make all your results somewhat suspect.

I still see logFC and AveExpr without t-tests, and AVeExpr values for genes where I only see NA's in my raw data,

It is not possible to get AveExpr results when the data are all NA. If you want to make this claim, please provide a reproducible example than I can run to confirm the problem. Or alternatively you could email your data and code to me confidentially. It is possible to get logFC without a t-test when one or more of the groups are entirely missing.

like for genes 1, 3 and 4 in the above example.

You have shown only the first 8 expression values for these genes, but clearly your dataset is much larger. limma will compute an AveExpr value if there are any non-NA values in any group. The non-NA values do not need to be in the two groups being compared in the contrast.

I used the file in that format as array for the ExpressionSet. Is another format expected?

An ExpressionSet is perfectly ok.

ADD REPLY • link 4.9 years ago Gordon Smyth 52k

0

Entering edit mode

Thank you, I send you a personal email.

kind regards

Sophie

ADD REPLY • link 4.9 years ago s.vander.sluis • 0

score 1 · Accepted Answer · 2020-02-26

Thank you for sending me your data offline. I am guessing that this is proteomics mass spectrometry data, which would explain the NAs.

The results you are seeing are correct, but can be improved. I see that your experiment has 6 brain regions from case and control individuals, so there are 12 distinct groups in your experiment (6 regions x 2 conditions).

limma will return an AveExpr value if any of the 12 groups contain any non-NA values. I think you are assuming that the non-NA values have be in the two groups that are being compared, but they do not. If you read the ?topTable help page you will see that AveExpr is the average log-expression for that gene across all 48 samples in your experiment. You will get the same AveExpr values no matter which contrast you test.
limma will return a logFC value if the two groups being compared both contain non-NA values.
If you use contrasts.fit, then limma needs all 12 groups to contain at least one non-NA value before it can compute t-statistics for any of the contrasts. You might think that limma should recognize that only some of the groups are involved in each contrast, but this is invisible to limma because of the nature of multiplication in R. Computing 0*NA in R gives NA, so limma needs all the coefficients to be non-NA in order to get an non-NA contrast value.

You can avoid the use of contrasts.fit like this. The results will be same as for your design matrix except that you will get a t-statistic and p-value whenever you have a logFC:

keep <- (rowSums(!is.na(exprs(ExprSet))) >= 4)
ExprSet <- ExprSet[keep,]
design <- model.matrix(~regionid+regionid:condition,data=pData(ExprSet))
corfit <- duplicateCorrelation(ExprSet, design, block = ExprSet$subject)
corfit$consensus.correlation
fit <- lmFit(ExprSet, design, block = ExprSet$subject, correlation = corfit$consensus)
fit <- eBayes(fit)
res.CBWM <- topTable(fit,coef="regionidA:conditionVWM",n=Inf)