All genes display positive fold change in topTable results after limma
2
0
Entering edit mode
Pietro ▴ 40
@pietro-13029
Last seen 7 months ago
Italy

I am analyzing an Agilent-012391 Whole Human Genome Oligo Microarray G4112A dataset from RAW files downloaded from GEO. I am following the limma tutorial section 17.4.

When I run the tutorial with example dataset everything is fine, but when I run with data from GSE68744, all the genes result with a positive fold change, while the command summary(decideTests(fit)) says that I have ~800 up and ~1500 down DEG.

Here's my code

> x <- read.maimages(files = targets, source = 'agilent', green.only=TRUE, path = 'data/GSE68744/GSE68744_RAW/', other.columns="gIsWellAboveBG")
> y <- backgroundCorrect(x, method="normexp")
> y <- normalizeBetweenArrays(y, method="quantile")
> Control <- y$genes$ControlType== 1L 
> Control2 <- y$genes$ControlType== -1L 
> NoSymbol <- is.na(y$genes$GeneName) 
> MissingSymbol <- y$genes$GeneName == ''
> IsExpr <- rowSums(y$other$gIsWellAboveBG > 0) >= 74
> yfilt <- y[!Control & !Control2 & !NoSymbol & IsExpr & !MissingSymbol, ]
> yfilt$genes <- yfilt$genes[,c("ProbeName", "GeneName",  'Description')]
> Treatment <- targets[,"tissue type:ch1"]
> levels <- c("epithelium","stroma")
> Treatment <- factor(Treatment, levels=levels)
> colnames(yfilt$E) <- targets$geo_accession
> design <- model.matrix(~Treatment)
> fit <- lmFit(yfilt, design)
> fit <- eBayes(fit, trend=TRUE,r obust=TRUE)
> summary(decideTests(fit))
> topTable(fit, coef=2, n=Inf, adjust.method="BH")

Thanks

limma microarray foldchange DEG GEO • 1.7k views
ADD COMMENT
1
Entering edit mode

Can you copy/paste the actual output of summary(decideTests(fit)) as well as the output from the following command, please?

summary(topTable(fit, coef=2, n=Inf, adjust.method="BH")$logFC)
ADD REPLY
2
Entering edit mode
@gordon-smyth
Last seen 1 hour ago
WEHI, Melbourne, Australia

I think that the GSE68744 authors have made things rather difficult for you with their nonstandard design and data entry. I am also not a fan of Universal Reference -- direct comparison between epithelium and stroma on the same arrays would have been far more precise and economical. Anywhere, here is how I would read and analyze this data.

First, create a two-color targets frame as in Section 4.3 of the limma User's Guide. You need to create columns Cy3 and Cy5 recording the samples labelled green and red for each microarray:

> head(targets)
         geo     tissue tumor dye     title
1 GSM1680263 epithelium    12 Cy5 012E_rep1
2 GSM1680264 epithelium    12 Cy3 012E_rep2
3 GSM1680265     stroma    12 Cy5 012S_rep1
4 GSM1680266     stroma    12 Cy3 012S_rep2
5 GSM1680267 epithelium    22 Cy3 022E_rep1
6 GSM1680268 epithelium    22 Cy5 022E_rep2
> targets$Cy3 <- "reference"
> targets$Cy5 <- "reference"
> i <- targets$dye=="Cy3"
> targets$Cy3[i] <- targets$tissue[i]
> i <- targets$dye=="Cy5"
> targets$Cy5[i] <- targets$tissue[i]
> head(targets)
         geo     tissue tumor dye     title        Cy3        Cy5
1 GSM1680263 epithelium    12 Cy5 012E_rep1  reference epithelium
2 GSM1680264 epithelium    12 Cy3 012E_rep2 epithelium  reference
3 GSM1680265     stroma    12 Cy5 012S_rep1  reference     stroma
4 GSM1680266     stroma    12 Cy3 012S_rep2     stroma  reference
5 GSM1680267 epithelium    22 Cy3 022E_rep1 epithelium  reference
6 GSM1680268 epithelium    22 Cy5 022E_rep2  reference epithelium

Create the design matrix:

> design <- modelMatrix(targets,ref="stroma")
> design <- cbind(Dye=1,design)

Read and normalize the data:

> f <- dir(pattern="GSM*")
> f[1:4]
[1] "GSM1680263_US22502628_251239119555_S01_A01.txt.gz"
[2] "GSM1680264_US22502628_251239119556_S01_A01.txt.gz"
[3] "GSM1680265_US22502628_251239112288_S01_A01.txt.gz"
[4] "GSM1680266_US22502628_251239112309_S01_A01.txt.gz"
> RG <- read.maimages(f,source="agilent")
> RGb <- RG[RG$genes$ControlType==0,]
> RGb$genes$ControlType <- NULL
> RGb <- backgroundCorrect(RGb,method="normexp")
> MA <- normalizeWithinArrays(RGb,method="loess")

Now find genes DE between epithelium and stroma:

> fit <- lmFit(MA,design)
> fit <- eBayes(fit)
> options(digits=3)
> topTable(fit,coef="epithelium",n=30)[,c(5,8:12)]
           GeneName logFC AveExpr     t  P.Value adj.P.Val
14880         LAMA4 -2.49    9.80 -18.0 4.39e-39  1.83e-34
16953          NNMT -2.59   11.20 -17.6 3.90e-38  5.88e-34
26775   HOM-TES-103 -1.29   10.18 -17.6 4.23e-38  5.88e-34
36196         TGFBI -2.25   11.92 -17.3 3.00e-37  3.13e-33
28357           FYN -1.21    9.24 -17.2 4.99e-37  4.16e-33
38249      SERPINF1 -1.91   10.54 -17.0 1.21e-36  7.79e-33
22500         ITM2C -2.36   11.80 -17.0 1.31e-36  7.79e-33
29003         AEBP1 -2.41    9.14 -16.9 1.88e-36  9.80e-33
19700          TPM2 -2.07   10.61 -16.8 3.53e-36  1.63e-32
11159       MGC4083 -2.34   11.93 -16.6 1.60e-35  6.65e-32
37562         COTL1 -1.85   11.97 -16.0 5.04e-34  1.91e-30
30614         CHES1 -1.95   10.71 -15.6 6.17e-33  2.14e-29
27502          TPM2 -2.00    9.63 -15.4 1.93e-32  6.18e-29
17826      BC004295 -2.56   11.93 -15.3 3.23e-32  9.62e-29
22819    THC1496865 -2.76    7.24 -15.2 3.98e-32  1.09e-28
20585         COTL1 -1.71   12.18 -15.2 4.18e-32  1.09e-28
21173         GNG11 -2.36   11.08 -15.2 5.48e-32  1.34e-28
13897        ANTXR2 -1.62    9.86 -15.0 1.85e-31  4.28e-28
37720      BC032783 -2.60   11.27 -14.9 2.37e-31  5.20e-28
24240         IFI16 -2.27    9.42 -14.9 2.59e-31  5.40e-28
30020         T1A-2 -2.52    9.45 -14.9 3.06e-31  6.08e-28
21054      AF131762 -2.18    9.16 -14.8 4.70e-31  8.91e-28
42630      AK021980 -2.83    7.96 -14.6 2.41e-30  4.37e-27
5412           AIF1 -2.26    9.60 -14.5 2.96e-30  5.14e-27
4287      I_1152228 -2.05   10.20 -14.5 3.39e-30  5.66e-27
29726          HEPH -2.29    8.71 -14.5 4.39e-30  7.03e-27
25520 DKFZP566K1924 -2.16    8.88 -14.4 5.23e-30  8.07e-27
981        AU121101 -1.71    9.83 -14.3 1.23e-29  1.83e-26
27696        M36501 -2.27   12.34 -14.3 1.31e-29  1.88e-26
27158         INHBA -1.97    8.61 -14.2 1.59e-29  2.20e-26

There are 6000 up and 4582 down DE genes at FDR < 0.05:

> summary(decideTests(fit))
         Dye epithelium reference
Down    3390       4582     21351
NotSig 30846      31093      3797
Up      7439       6000     16527

Note that the top 30 DE genes are all down-regulated in epithelium relative to stroma but, overall, there are more genes up in epithelium. It would seem that the genes with largest fold changes are down-regulated. You can see the assymetry between up and down genes using suitable plots:

> volcanoplot(fit,coef="epithelium")
> plotMD(fit,coef="epithelium")
ADD COMMENT
0
Entering edit mode

Thanks, super useful

ADD REPLY
1
Entering edit mode
@gordon-smyth
Last seen 1 hour ago
WEHI, Melbourne, Australia

Given that you have 1500 significantly down-regulated genes, it is obviously false to claim that all the log fold changes are positive. Why are you not happy with 800 up and 1500 down DEG?

A more serious problem is that GSE68744 is a two-colour dataset with a dye-swap design, but your code is analysing it as if it was single-channel. Your analysis also ignores the paired nature of the samples as well as the technical replicates. There are actually only 36 tumour samples in this dataset even though there are 148 microarrays. Your analysis is treating it as 148 independent samples.

I have not looked at this dataset closely but the dye-swap design implies that the Cy3 (green) channel will be reference in about half the samples. Since you are only reading the geen channel, this would seem to imply that half of the data in your analysis is reference and the corresponding tumour samples have been discarded.

ADD COMMENT
0
Entering edit mode

Thanks for highlighting the errors. In this dataset there are 36 tumor epithelia and 36 matching tumor stroma. For each sample there are 2 "replicates", "rep1" holds Cy5 while "rep2" holds Cy3. Here an extract of the amended "targets" file.

enter image description here

I guess now that green.only argument has to be FALSE. Is this the only thing to change when importing the files?

> x <- read.maimages(files = targets, source = 'agilent', green.only=FALSE, path = 'data/GSE68744/GSE68744_RAW/', other.columns="gIsWellAboveBG")

Then I think the design of course would need to be different.

ADD REPLY

Login before adding your answer.

Traffic: 457 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6