Alessandro,
That is a big question, 'how to do DE in NGS' can be a bit involved
for
a single question. So, here is a script! This is my current RNA seq
project, where I have 3 replicates, 2 tissues, and 2 genotypes, a
smaller
factorial design than what you use.
I have included the parts for setting up the DGEList, contrast/design
matrices, and calling deferentially expression with some extra
formatting
that I use to make tables to send to my lab mates. There is a section
called 'annotation' where I use biomaRt to get some gene descriptions,
which may or may not be available to you.
I used the method in 3.3.1 of edgeR vignette to do the grouping (each
treatment combintation as a group). My stats are not good enough for
me to
comment on if this is the best / worst/ equivlent to any other method.
I added some extra comments to make it easier to read.
As your problems become more specific you will be able to find answers
to
your questions in the bioconductor list serve archive, just fyi.
Hope this helps!
######################################################################
# SAMS edgeR SCRIPT
######################################################################
# Sample description, design, contrasts
######################################################################
#make design matrix - make targets matrix and give good row names.
Then
reorder the Genotype to include Columbia as ref
targets <- data.frame(Genotype = factor(rep(c("Columbia", "OPT3"),
each
=6), levels = c("Columbia", "OPT3")), Tissue =factor(rep(rep(c("Root",
"Shoot"), each =3), 2), levels = c("Root", "Shoot"))) #this is a
manual way
to make a 'targets'
rownames(targets) <- substring(rownames(colData(olap)), 1,3)
group <- factor(paste(targets$Genotype, targets$Tissue, sep = "."))
#assign
each contrast combination as a group
design <- model.matrix(~0+group) #make a design matrix, ~0 means no
intercept
colnames(design) <- levels(group) #pretty names
#Make Contrast matrix
Cont <- makeContrasts(
Roots = Columbia.Root - OPT3.Root,
Shoots = Columbia.Shoot - OPT3.Shoot,
DifDif = (Columbia.Root - OPT3.Root) - (Columbia.Shoot - OPT3.Shoot),
levels = design)
# make a DGEList of the read counts and supply the group to make
groups,
calc Normilzation factors
######################################################################
#
edger <- DGEList(assays(olap)$counts, group = group)
#assays(olap)$counts
extracts the count matrix
rownames(edger$samples) <- substring(rownames(edger$samples),1,3)
#make it
pretty (remove .bam extension)
keep <- rowSums(cpm(edger) > 2) >= 4 #a gene must have 2cpm at least
four
times in all the data to keep
edgerfilter <- edger[keep,] #FALSE - 8204, TRUE - 19212
edgerN <- calcNormFactors(edgerfilter) #calculate normalization
factors
#calculate dispersion
edgerN <- estimateCommonDisp(edgerN, verbose =TRUE)
edgerN <- estimateTagwiseDisp(edgerN, trend = "none")
#fit the model
fit <- glmFit(edgerN, design) #fit the model
cpmMat <- cpm(edgerN); colnames(cpmMat) <- substr(colnames(cpmMat),
1,3) #keeping
a normalized count matrix is good for heatmaps later
######################################################################
# Differential Expression
######################################################################
#The next three sections are basically the same thing just for each
contrast. Betters ways may exist (like a loop of some form?) but
this is
verbose and easy-ish to read.
# Root DE
######################################################################
#Conduct the likelihood ratio test (LRT). Call significance at FDR
moderated (benjamani and hochberg) at 0.05.
#make this logical (logical of if each gene is DE). Get the names of
these
genes.
#make a new data frame with a FDR column(same adjust as above). This
is
like the 'toptable' (from limma)
RootsLrt <- glmLRT(fit, contrast=Cont[,"Roots"]) #call DE genes
#decide testsDGE() looks at each gene in the contrast and labels it
zero if
it is not changing, -1/1 for -/+ fold change
#Note that you have to supply arguements to p.adjust (adjust method)
and a
p.value that is your alpha level
rdt <- decideTestsDGE(RootsLrt, adjust.method = "BH", p.value =
0.01);rownames(rdt) <- rownames(RootsLrt$table) #summary(rdt)
isDEr <- as.logical(rdt) #convert the -1,0,1 (down, nc, up) to TRUE
(DE)
FALSE (nc)
rDEnames <- rownames(RootsLrt$table[isDEr,]) #names of DE genes
#there may be a better way, but it works. just adding a FDR column
manually using p.adjust
Rtable <- data.frame(RootsLrt$table[isDEr,], FDR =
p.adjust(RootsLrt$table$PValue, "BH")[isDEr])
#organize the table by FDR
Rtable <- Rtable[order(Rtable$FDR),]
#get the gene names for each gene with logFC > 2 (either direction)
and an
FDR > 0.05, this is for heatmaps later
rDEnames2 <- rownames(Rtable[Rtable$FDR < 0.05& abs(Rtable$logFC) >
2,])
######################################################################
# Annotation
######################################################################
#Get annotation dbs for next sections
library(biomaRt)
athGene <- useMart("plants_mart_18",dataset="athaliana_eg_gene")
# Roots annotation
######################################################
RLrt <- data.frame(tair_locus = rownames(RootsLrt$table)[isDEr],
RootsLrt$table[isDEr,])
RALrt <- merge( RLrt, getBM(attributes = c("tair_locus",
"tair_symbol",
"transmembrane_domain","description"), filters = "tair_locus", values
=
rownames(RLrt), mart = athGene),by = "tair_locus")
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] grid parallel stats graphics grDevices utils
datasets
[8] methods base
other attached packages:
[1] biomaRt_2.16.0 RColorBrewer_1.0-5 gplots_2.11.0.1
[4] MASS_7.3-26 KernSmooth_2.23-10 caTools_1.14
[7] gdata_2.12.0.2 gtools_2.7.1 GenomicRanges_1.12.4
[10] IRanges_1.18.1 BiocGenerics_0.6.0 edgeR_3.2.3
[13] limma_3.16.5
loaded via a namespace (and not attached):
[1] Biostrings_2.28.0 bitops_1.0-5 RCurl_1.95-4.1
Rsamtools_1.12.3
[5] stats4_3.0.1 tcltk_3.0.1 tools_3.0.1 XML_3.96-1.1
[9] zlibbioc_1.6.0
On Thu, Jun 13, 2013 at 7:32 AM, Alessandro Botton [guest] <
guest@bioconductor.org> wrote:
>
> Hi all.
> This is my first post here. I wasn't sure of posting my question to
this
> mailing list, as my competence in statistics is very poor. So,
please,
> forgive me in advance for what I could say...
> A few words to describe my experimental design. I have RNAseq data
from 54
> samples. My experiment deals with apple fruits collected at two
different
> times (H=at harvest, and PH=after postharvest storage) from trees of
three
> different varieties (G, F, and P) grown under three different
agronomic
> conditions (L, M, and H). I have done 3 biological replicates. So 2
times x
> 3 varieties x 3 conditions x 3 replicates = 54.
> Based upon the suggestions of some colleagues of mine, I have
decided to
> start using EdgeR for the analysis of these data, as they told me
(without
> knowing how!!!) that this package implements multifactor-multilevel
> pipelines.
> My first objective is to achieve a list of differentially expressed
genes
> that are affected:
> 1) only by the genotype (variety)
> 2) only by the agronomic condition
> 3) only by storage
> or by an interaction of
> 4) genotype x agr. condition
> 5) genotype x storage
> 6) agr. condition x storage.
> I could not be interested by the triple interaction.
> I have read the EdgeR vignette and case studies. I have searched the
> internet but found just discussions in which the "statistical level"
was
> too high for me.
> Would you please give me some code examples to get these lists?
> The first problem may be the setting up of the design matrix... May
I use
> the same scheme of the paragraph 3.5 of the EdgeR vignette
(Comparisons
> Both Between and Within Subjects)?
> Thanks in advance to whom will reply and sorry for this "low level"
> question.
> Alessandro
>
>
>
> -- output of sessionInfo():
>
> R version 2.15.3 (2013-03-01)
> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>
> locale:
> [1] it_IT.UTF-8/it_IT.UTF-8/it_IT.UTF-8/C/it_IT.UTF-8/it_IT.UTF-8
>
> attached base packages:
> [1] grid stats graphics grDevices utils datasets
methods
> base
>
> other attached packages:
> [1] gplots_2.11.0 MASS_7.3-23 KernSmooth_2.23-8
caTools_1.14
> gdata_2.12.0 gtools_2.7.1 RColorBrewer_1.0-5
edgeR_3.0.8
> limma_3.14.4 DESeq_1.10.1 lattice_0.20-13
locfit_1.5-9
> [13] Biobase_2.18.0 BiocGenerics_0.4.0
>
> loaded via a namespace (and not attached):
> [1] annotate_1.36.0 AnnotationDbi_1.20.7 bitops_1.0-4.2
> DBI_0.2-5 genefilter_1.40.0 geneplotter_1.36.0
> IRanges_1.16.6 parallel_2.15.3 RSQLite_0.11.2
> splines_2.15.3 stats4_2.15.3
> [12] survival_2.37-2 tools_2.15.3 XML_3.96-1.1
> xtable_1.7-1
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@r-project.org
>
https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
>
http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Sam McInturf
[[alternative HTML version deleted]]