Dear Mark,
In principle, you can work with any valid design matrix, and there's
nothing wrong in principle with the one you have defined. However
edgeR,
as we've currently programmed it, is only able to do the likelihood
ratio
test on 5 df that I suggested if you define the design matrix as in my
suggested code:
model.matrix(~lifecycles)
in which each coefficient is compared back to the first lifecycle. In
this parametrization, the first coefficient is the level for 1G, the
others are 2G-1G, 3G-1G and so on.
Best wishes
Gordon
On Tue, 21 Jun 2011, Lawson, Mark (mjl3p) wrote:
> Thanks for you help on this Gordon.
>
> I am still playing around with the analysis but just to make sure I
am on the right track, I want to ensure that I created my design
correctly.
>
> lifecycles <- c("1G", "1G", "2G", "3G", "4G", "61G", "62G")
> lifecycles <- factor(lifecycles, levels=c("1G", "2G", "3G", "4G",
"61G", "62G"))
> design <- model.matrix(~0+lifecycles)
> colnames(design) <- levels(lifecycles)
>
> (Recall, the first lifecycle has two sets of data)
>
> ~ Mark
>
> On Jun 17, 2011, at 7:45 PM, Gordon K Smyth wrote:
>
>> Counts per million should have been:
>>
>> cpm <- 1e6 * t(t(y$counts)/y$samples$lib.size)
>>
>> Gordon
>>
>> On Fri, 17 Jun 2011, Gordon K Smyth wrote:
>>
>>> Dear Mark,
>>>
>>> I'd suggest that you filter out genes that fail to achieve a
reasonable count in any library. I can't give you a firm cutoff for
this, it depends on the sequencing depth, and the analysis should not
be sensitive to this anyway. In my lab, we have tended to keep genes
that achieve at least 1 count per thousand reads in at least a minimum
number of libraries. If y is a DGEList object, you might use
>>>
>>> cpm <- t(t(y$counts)/y$samples$lib.size)
>>> keep <- rowSums(cpm>1)>0
>>> y.filtered <- y[keep,]
>>>
>>> Since you have two replicates of one lifecycle, you could base
your estimate of variability (the biological coefficient of variation)
on these replicates.
>>>
>>> design <- model.matrix(~lifecycle)
>>> y.filtered <- estimateGLMCommonDisp(y.filtered,design)
>>>
>>> Perhaps also
>>>
>>> y.filtered <- estimateGLMTrendedDisp(y.filtered,design)
>>>
>>> Then
>>>
>>> fit <- glmFit(y.filtered,design)
>>>
>>> Then you could test for genes that change at all through the 6
lifecycles by
>>>
>>> lrt <- glmLRT(y.filtered,fit,coef=2:6)
>>> topTags(lrt)
>>>
>>> This will perform a likelihood ratio test for each gene on 5
degrees of freedom. Then you could examine the significant genes to
see what patterns they show across the lifecycles:
>>>
>>> is.sig <- p.adjust(lrt$table$p.value,method="BH") < 0.1
>>> library(limma)
>>> plotlines(lrt$coef[is.sig,],first.column.origin=TRUE)
>>>
>>> Best wishes
>>> Gordon
>>>
>>> ---------------------------------------------
>>> Professor Gordon K Smyth,
>>> Bioinformatics Division,
>>> Walter and Eliza Hall Institute of Medical Research,
>>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>>> Tel: (03) 9345 2326, Fax (03) 9347 0852,
>>> smyth at wehi.edu.au
>>>
http://www.wehi.edu.au
>>>
http://www.statsci.org/smyth
>>>
>>> On Tue, 14 Jun 2011, Lawson, Mark (mjl3p) wrote:
>>>
>>>> Hi Gordon
>>>> Thanks for getting back to me on this. I don't mind the transfer,
whatever leads to answers to my questions is fine by me.
>>>> I am sorry that I forgot to classify the lifecycles of the
samples. They represent 6 distinct lifecycles (the first measured
lifecycle has two samples). Will this affect the prior.n you
suggested?
>>>> ~ Mark
>>>> On Jun 10, 2011, at 9:14 PM, Gordon K Smyth wrote:
>>>>> Hi Mark,
>>>>> This is an analysis question rather than a devel suggestion, so
I've transferred it from Bioc-sig-seq to Bioconductor. Hope you don't
mind.
>>>>> If you have divided your samples into two groups, and you have 7
samples, then I recommend you set prior.n=4. See
>>>>>
https://stat.ethz.ch/pipermail/bioc-sig-
sequencing/2011-June/002042.html
>>>>> for an explanation.
>>>>> I can't answer your other questions without knowing whether you
have biological replicates in your experiment. You have 7 samples.
Do these samples all correspond to distinct lifecycle stages, or are
some stages observed more than once? How many stages are there?
>>>>> Best wishes
>>>>> Gordon
>>>>>> Date: Thu, 9 Jun 2011 18:19:42 +0000
>>>>>> From: "Lawson, Mark (mjl3p)" <mjl3p at="" virginia.edu="">
>>>>>> To: "bioc-sig-sequencing at r-project.org"
>>>>>> <bioc-sig-sequencing at="" r-project.org="">
>>>>>> Subject: [Bioc-sig-seq] Questions on doing EdgeR Analysis of
>>>>>> Timeseries Data
>>>>>> (I apologize if this message was sent more than once)
>>>>>> Dear Analysis Gurus
>>>>>> I am currently performing a gene expression analysis on a plant
parasite. I have mapped Illumina read counts for various stages in
this parasites lifecycle. Of interest for us in this analysis are
genes that are differentially expressed during these lifecycles. To
determine this, I have focused on two types of differential
expression: "peaks" and "cliffs." "Peaks" occur when a gene is
differentially expressed in one time sample (either higher or lower
than the remaining samples) and "cliffs" occur when a gene is
differentially expressed between two groups of sample (for instance
higher expression in the first three samples than the last three).
>>>>>> To determine these peaks and cliffs, I have been creating
groups in which the desired peak/cliff is "case" and the remaining
samples are "control." I then run common dispersion and/or tagwise
dispersion and extract those reads with an FDR of less than 0.1. So,
my questions:
>>>>>> 1.) How much filtering of data should I do? Right now I have a
fair amount of genes that are expressed in 0, 1, 2 etc. samples. It
seems logical that I would filter out genes that have no expression,
but at what level should it stop? Also, should there be different
filtering depending on the analysis (peak or cliff)?
>>>>>> 2.) When doing tagwise dispersion, what should I set my prior.n
to (I currently have 7 samples)? Does it depend on the type of
analysis?
>>>>>> 3.) Should I investigate using a more advanced glm based
analysis? Any advice on crafting a design for this?
>>>>>> 4.) Any other ideas on analyses to perform on a set of
timeseries data with EdgeR?
>>>>>> I greatly appreciate any help/advice and thank you in advance!
>>>>>> Mark J. Lawson, Ph.D.
>>>>>> Bioinformatics Research Scientist
>>>>>> Center for Public Health Genomics, UVA
>>>>>> mlawson at virginia.edu
>>>
>>
>>
______________________________________________________________________
>> The information in this email is confidential and intended solely
for the addressee.
>> You must not disclose, forward, print or use it without the
permission of the sender.
>>
______________________________________________________________________
>
> Mark J. Lawson, Ph.D.
> Bioinformatics Research Scientist
> Center for Public Health Genomics, UVA
> mlawson at virginia.edu
>
>
>
>
>
>
>
>
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:4}}