Hi Ming,
These extra genes are the ones you miss if you treat A and B as the same group.
If you want to see whether they also differ between A and B, you should test them formally using the contrast A-B.
For your purpose, you can run the two tests separately and then look for the genes common to both.
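As an illustration only (this is not code from the thread), the two-test approach could be sketched in edgeR roughly as follows. The simulated counts, the group labels N1, N2, T1 and T2, and the 5% FDR cutoff are assumptions made for the example. `estimateDisp()` is the current edgeR call; older releases used `estimateGLMCommonDisp()`, `estimateGLMTrendedDisp()` and `estimateGLMTagwiseDisp()` instead.

```r
library(edgeR)  # also loads limma, which provides makeContrasts()

set.seed(1)
# Hypothetical design: two normal types and two tumor subtypes, 3 samples each
group <- factor(rep(c("N1", "N2", "T1", "T2"), each = 3))

# Simulated negative binomial counts (illustrative values only)
mu <- matrix(100, nrow = 500, ncol = 12)
mu[1:50, group %in% c("T1", "T2")] <- 400  # genes higher in tumors overall
mu[1:20, group == "T1"] <- 1600            # a subset also differs by subtype
counts <- matrix(rnbinom(length(mu), mu = mu, size = 10), nrow = 500,
                 dimnames = list(paste0("gene", 1:500), paste0("s", 1:12)))

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)

# One coefficient per group, no intercept
design <- model.matrix(~0 + group)
colnames(design) <- levels(group)

# Dispersions are estimated against this design matrix
y <- estimateDisp(y, design)
fit <- glmFit(y, design)

con <- makeContrasts(
  tumor_vs_normal = 0.5 * (T1 + T2) - 0.5 * (N1 + N2),
  T1_vs_T2        = T1 - T2,
  levels = design
)

# Test each contrast separately on the same fit
lrt_tn  <- glmLRT(fit, contrast = con[, "tumor_vs_normal"])
lrt_sub <- glmLRT(fit, contrast = con[, "T1_vs_T2"])

# Helper (hypothetical name): genes passing an FDR cutoff in one test
de_genes <- function(lrt, fdr = 0.05) {
  tab <- topTags(lrt, n = Inf, sort.by = "none")$table
  rownames(tab)[tab$FDR < fdr]
}

de_tn  <- de_genes(lrt_tn)
de_sub <- de_genes(lrt_sub)

# Genes significant in both tests
common <- intersect(de_tn, de_sub)
```

Both tests reuse the same fitted model, so the dispersions only need to be estimated once for this design.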
Regards,
Yunshun
------------------------------
Message: 4
Date: Wed, 12 Mar 2014 13:56:04 +0000
From: Ming Yi <yi02@hotmail.com>
To: "yuchen@wehi.EDU.AU" <yuchen@wehi.edu.au>,
"georg.otto@imm.ox.ac.uk" <georg.otto@imm.ox.ac.uk>
Cc: Bioconductor mailing list <bioconductor@r-project.org>
Subject: Re: [BioC] edgeR design matrix, one group vs average of other groups
Message-ID: <blu177-w35d8e2e73bcca1ff76b9f7dd760@phx.gbl>
Content-Type: text/plain
Hi, Yunshun and Georg:
I have a dataset with a very similar situation. It is a lung cancer dataset with tumors vs normals, and we hypothesize that the tumors fall into a few subtypes. I am looking for differences between the tumor subtypes (e.g., tumor type 1 vs tumor type 2), as well as tumor type 1 vs normal type 1, tumor type 2 vs normal type 2, etc., but I am also interested in the overall tumor vs normal contrast, since I want to assess everything under the same roof.

In the makeContrasts function, I set up tumor_vs_normal=0.5*(tumor type 1+tumor type 2)-0.5*(normal type 1+normal type 2), like Georg's first model. Alternatively, I could follow Georg's second model, ignore the tumor subtypes, and simply use tumor-normal. That is what is generally done in the field, since many people do not know much about the subtypes or want to study the tumor as a whole.

As you pointed out, the second model ignores the subtypes and does not account for the difference between A and B (in my case, the difference between type 1 and type 2). In other words, the variance between the two tumor subtypes is treated as "common" variance among the tumors overall. You also mentioned that this makes the LR statistics smaller than they should be, which results in fewer DE genes in the second case. So the first model, with tumor_vs_normal=0.5*(tumor type 1+tumor type 2)-0.5*(normal type 1+normal type 2), would yield more DEGs than the simple overall tumor-normal contrast, because it picks up genes that the second model gives up: as I understand it, some of the internal variance among the tumors is now attributed to the difference between the two subtypes.

Now my question is: what do these "extra" genes represent? Are they DEGs between tumor and normal overall that also reflect the difference between the subtypes?
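For concreteness, the two ways of defining the tumor vs normal comparison described above could be set up in edgeR roughly as in the sketch below. It is an illustration under stated assumptions, not code from this thread: the counts are simulated, the labels N1, N2, T1 and T2 and the helper `run_lrt()` are hypothetical, and the 5% FDR cutoff is arbitrary. The dispersions are re-estimated for each design matrix, as advised elsewhere in the thread.

```r
library(edgeR)  # also loads limma, which provides makeContrasts()

set.seed(2)
group <- factor(rep(c("N1", "N2", "T1", "T2"), each = 3))

# Simulated counts (illustrative values only)
mu <- matrix(100, nrow = 300, ncol = 12)
mu[1:30, group %in% c("T1", "T2")] <- 300  # tumor vs normal effect
mu[31:60, group == "T1"] <- 400            # effect in one subtype only
counts <- matrix(rnbinom(length(mu), mu = mu, size = 10), nrow = 300,
                 dimnames = list(paste0("gene", 1:300), paste0("s", 1:12)))

# Hypothetical helper: estimate dispersions for the given design,
# fit the GLM, and return the unsorted LRT table for one contrast
run_lrt <- function(counts, design, contrast) {
  y <- DGEList(counts = counts)
  y <- calcNormFactors(y)
  y <- estimateDisp(y, design)  # re-estimated for every design
  fit <- glmFit(y, design)
  topTags(glmLRT(fit, contrast = contrast), n = Inf, sort.by = "none")$table
}

# Model 1: subtypes kept separate, tumors and normals averaged in the contrast
design1 <- model.matrix(~0 + group)
colnames(design1) <- levels(group)
con1 <- makeContrasts(0.5 * (T1 + T2) - 0.5 * (N1 + N2), levels = design1)
res1 <- run_lrt(counts, design1, con1[, 1])

# Model 2: subtypes pooled into tumor and normal
status <- factor(ifelse(group %in% c("T1", "T2"), "tumor", "normal"))
design2 <- model.matrix(~0 + status)
colnames(design2) <- levels(status)
con2 <- makeContrasts(tumor - normal, levels = design2)
res2 <- run_lrt(counts, design2, con2[, 1])

# Number of genes called DE under each model
n_de <- c(subtype_model = sum(res1$FDR < 0.05),
          pooled_model  = sum(res2$FDR < 0.05))
```

Comparing `rownames(res1)[res1$FDR < 0.05]` with the corresponding set from `res2` shows which genes are called under only one of the two models.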
In fact, the reason I am asking is that I am looking for genes that separate the tumor subtypes well but also separate tumor from normal overall. My approach is to find the DEGs between the tumor subtypes (e.g., tumor type 1 vs tumor type 2) and the general DEGs from tumor_vs_normal=0.5*(tumor type 1+tumor type 2)-0.5*(normal type 1+normal type 2), and then take the genes common to the two sets. It turns out that the common genes do not seem to separate the two tumor subtypes and the overall tumor vs normal groups well at the same time.

One reason may be that the overall tumor vs normal difference is so strong that it overshadows the difference between the two subtypes. Also, the tumor_vs_normal DEGs I used come from the tumor_vs_normal=0.5*(tumor type 1+tumor type 2)-0.5*(normal type 1+normal type 2) setting rather than from the simple tumor-normal contrast in the GLM.

So, for my purpose of finding genes that separate the tumor subtypes well and also separate tumor from normal overall, what would be the best way to use these DEGs?

Sorry, this is a bit off the original topic, but it seems so relevant to how to set up the model when deriving DEGs for a variety of purposes that I thought it was worth posting here for discussion.
Thanks for your insightful opinion in advance!
best
Ming
> From: yuchen@wehi.EDU.AU
> To: georg.otto@imm.ox.ac.uk
> Date: Wed, 12 Mar 2014 15:37:36 +1100
> CC: bioconductor@r-project.org
> Subject: Re: [BioC] edgeR design matrix, one group vs average of other groups
>
> Dear Georg,
>
> I think the first model is more appropriate.
>
> In your second model, the deviance under the alternative is larger than it
> should be (since the difference between A and B is not accounted for).
> That makes the LR statistics smaller than they should be, which results in
> fewer DE genes.
>
> By the way, the dispersions have to be re-estimated if a different design
> matrix is used. I'm not sure whether you've done that, as I cannot tell
> from the given code.
>
> Regards,
>
> Yunshun Chen
>
> _____
>
> Dear Bioconductors,
>
> I am working on RNA-seq data with multiple experimental factors and I am
> trying to reproduce the edgeR manual, chapter 3.2.3, GLM approach.
>
>
> > design <- model.matrix(~0+group, data=y$samples)
> > colnames(design) <- levels(y$samples$group)
> > design
>          A B C
> sample.1 1 0 0
> sample.2 1 0 0
> sample.3 0 1 0
> sample.4 0 1 0
> sample.5 0 0 1
>
> > fit <- glmFit(y, design)
>
>
> I want to know which genes are differentially expressed in C compared to
> the other groups, so I chose to compare C to the average of A and B
>
> > lrt <- glmLRT(fit, contrast=c(-0.5,-0.5,1))
>
>
> Alternatively I could put A and B in a single group
>
> > design
>          A.B C
> sample.1   1 0
> sample.2   1 0
> sample.3   1 0
> sample.4   1 0
> sample.5   0 1
>
> > fit <- glmFit(y, design)
>
> and compare C to A.B
>
> > lrt <- glmLRT(fit, contrast=c(-1,1))
>
>
> When I try this with my own data, the first approach gives me many more
> differentially expressed genes than the second one, but the second gene
> set is a subset of the first one. I would be very grateful if somebody
> could explain to me what is the difference between the approaches, and
> which one is the more appropriate for my purpose (find genes specific
> for condition C)
>
> Best wishes,
>
> Georg
>
> > sessionInfo()
>
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] limma_3.18.13
>
> loaded via a namespace (and not attached):
> [1] compiler_3.0.1 tools_3.0.1
>