Entering edit mode
Federico Gaiti
▴
130
@federico-gaiti-6419
Last seen 10.4 years ago
Hi Mike,
I did the DGE with the mmultifactorial design combining stranded and
unstranded data.
Here it is:
Multifactorial (considering both stranded and unstranded data)
> head(CountTable)
ADULT ADULT1 ADULT2 ADULT3 JUV JUV1 JUV2 JUV3 COMP COMP1
COMP2 COMP3 PRECOMP PRECOMP1 PRECOMP2 PRECOMP3
asmbl_1 30 0 24 48 84 5 1 1 47 15
8 6 47 28 27 47
asmbl_10 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0
asmbl_100 0 0 0 0 0 4 0 0 0 1
2 3 2 5 5 7
asmbl_1000 0 8 0 7 5 5 19 7 14 4
4 7 9 4 12 1
asmbl_10000 11 75 0 73 103 12 112 51 65 43
17 57 56 18 23 63
asmbl_10001 0 1 0 0 3 11 6 4 4 4
11 1 5 0 3 0
dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,design
=~libType + condition)
colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOM
P","COMP","JUV","ADULT"))
design(dds)<-formula(~libType + condition )
dds<-DESeq(dds)
res<-results(dds)
resJUV_ADULT<-results(dds,contrast=c("condition","JUV","ADULT"))
resCOMP_JUV<-results(dds,contrast=c("condition","COMP","JUV"))
resPRECOMP_COMP<-results(dds,contrast=c("condition","PRECOMP","COMP"))
> resultsNames(dds)
[1] "Intercept" "libTypestranded"
[3] "libTypeunstranded" "conditionPRECOMP"
[5] "conditionCOMP" "conditionJUV"
[7] "conditionADULT"
Just to conlude, have I done anything wrong? Would you suggest any
further/different analysis?
Thanks for all the help
Federico
________________________________
From: Michael Love [michaelisaiahlove@gmail.com]
Sent: Friday, 28 February 2014 12:11 PM
To: Federico Gaiti
Cc: Steve Lianoglou; bioconductor@r-project.org
Subject: Re: [BioC] Low number of replicates DESeq
Then I would recommend the multifactorial design, as it's the best you
can do without stranded replicates. it will be underpowered for
transcripts which are mostly made up of those positions which Simon
described: "position which is covered by a regular gene on one strand
and by an overlapping antisense transcript on the other strand".
Because for such transcripts, only the single replicate for the
stranded experiments will contribute signal (if I am getting it
correct this time).
On Thu, Feb 27, 2014 at 7:36 PM, Federico Gaiti
<f.gaiti@uq.edu.au<mailto:f.gaiti@uq.edu.au>> wrote:
No worries at all.
It's all a good exercise for me. I'm learning a lot just from this
email exchange.
Back to the DGE and considering the situation, what would you
raccomend for the DGE on DESeq2?
Thanks again
Federico
________________________________
From: Michael Love
[michaelisaiahlove@gmail.com<mailto:michaelisaiahlove@gmail.com>]
Sent: Friday, 28 February 2014 10:27 AM
To: Federico Gaiti
Cc: Steve Lianoglou;
bioconductor@r-project.org<mailto:bioconductor@r-project.org>
Subject: RE: [BioC] Low number of replicates DESeq
Hi Federico,
Yes, Simon is right. Please ignore my previous email. Sorry for adding
to the confusion.
Mike
On Feb 27, 2014 5:51 PM, "Federico Gaiti"
<f.gaiti@uq.edu.au<mailto:f.gaiti@uq.edu.au>> wrote:
Hi Mike,
Thanks for reply. I see what you mean but I'm a bit confused about ht-
seq count now.
Please see also an open thread with Simon Anders where I'm discussing
this in details: http://seqanswers.com/forums/showthread.php?p=133959&
posted=1#post133959
Sorry for the crosspost I know, I just didn't know it's not a good
idea. It's actually one of the first question I post online so I'm
still learning how all this works.
I am investgating lncRNAs, which can be intronic, intergenic, can
overlap on the same strand of another gene or have anti-sense
orientation. That's what I meant with "I need to have anti-sense
transcription". I need the stranded data to account for this.
As for htseq-count I thought that depending on the library preparation
for stranded libraries I could select -s reverse or -s yes. And so to
be sure I did a quick test on the stranded libraries using
infer_experiment.py, it is indeed forward-reverse.
This is PairEnd Data
Fraction of reads explained by "1++,1--,2+-,2-+": 0.9189
Fraction of reads explained by "1+-,1-+,2++,2--": 0.0811
Fraction of reads explained by other combinations: 0.0000
1++,1,2+-,2-+
read1 mapped to + strand indicates parental gene on + strand
read1 mapped to - strand indicates parental gene on - strand
read2 mapped to + strand indicates parental gene on - strand
read2 mapped to - strand indicates parental gene on + strand
Based on this I ran TOPHAT with fr-secondstrand option and htseq-count
with -s yes
As Simon said in the other thread "if a read maps to a position which
is covered by a regular gene on one strand and by an overlapping
antisense transcript on the other strand, then this read will be
counted as ambiguous if you have set "stranded" to "no", because there
is no information to decide whether the read originated from the sense
of from the antisense transcript.
For "stranded=yes", however, the read will be counted for the feature
that is on the same strand as the read"
So why would my stranded experiment counted with -s yes capture only
sense trasncription? Shouldn't my stranded experiment counted with -s
yes capture both sense and anti-sense transcription based on where the
reads map?
Also, if selecting this option depends on the library prepration
protocol and not on the DGE design, shouldn't -s reverse be "wrong" in
my case?
Thanks for clarifications and help
Federico
________________________________
From: Michael Love
[michaelisaiahlove@gmail.com<mailto:michaelisaiahlove@gmail.com>]
Sent: Friday, 28 February 2014 2:23 AM
To: Federico Gaiti
Cc: Steve Lianoglou;
bioconductor@r-project.org<mailto:bioconductor@r-project.org>
Subject: Re: [BioC] Low number of replicates DESeq
hi Federico,
The question of design falls on what you are looking for. The
multifactorial design gets at differences which are consistent for
both stranded and unstranded experiments (though the unstranded has 3
times more samples, so contributes more to a gene's likelihood of
being detected DE here).
But to go back to an earlier point. You mentioned earlier: "I am
investigating long non-coding RNAs and so I need to have anti-sense
transcription quantification." Your current multifactorial analysis
is looking for consistent differences across developmental stages,
between sense transcription (your stranded experiment counted with -s
yes rather than -s reverse) and when you combine sense and anti-sense
transcription (your unstranded experiments counted with -s no). Anti-
sense transcription plays little role here if we assume that more
reads are coming from sense than anti-sense.
Note that if you are looking in particular for differences in anti-
sense transcription across developmental stages, you need to use the
-s reverse option to htseq-count, and peform biological replicates. I
don't see any way around requiring more replicates, as there are both
technical and biological sources of variation which will be different
in the stranded and unstranded experiments. Adding in the unstranded
data seems not so helpful, as you are mixing a small signal of
interest (anti-sense transcription) with most likely a lot more reads
coming from sense transcription.
You mentioned, "I tried to use the option -s reverse for the stranded
data and still got really low correlation." Wouldn't this makes sense,
because you are comparing anti-sense transcription to the unstranded
protocol which is likely capturing mostly sense transcription?
Mike
On Thu, Feb 27, 2014 at 3:36 AM, Federico Gaiti
<f.gaiti@uq.edu.au<mailto:f.gaiti@uq.edu.au>> wrote:
Hi Steve,
I carefully read the DESeq2 vignette (February 19, 2014) and then did
the DGE using two different models as you suggested and then performed
different contrasts.
Multifactorial
> head(CountTable)
ADULT ADULT1 ADULT2 ADULT3 JUV JUV1 JUV2 JUV3 COMP COMP1
COMP2 COMP3 PRECOMP PRECOMP1 PRECOMP2 PRECOMP3
asmbl_1 30 0 24 48 84 5 1 1 47 15
8 6 47 28 27 47
asmbl_10 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0
asmbl_100 0 0 0 0 0 4 0 0 0 1
2 3 2 5 5 7
asmbl_1000 0 8 0 7 5 5 19 7 14 4
4 7 9 4 12 1
asmbl_10000 11 75 0 73 103 12 112 51 65 43
17 57 56 18 23 63
asmbl_10001 0 1 0 0 3 11 6 4 4 4
11 1 5 0 3 0
dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,design
=~libType + condition)
colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOM
P","COMP","JUV","ADULT"))
design(dds)<-formula(~libType + condition )
dds<-DESeq(dds)
res<-results(dds)
resJUV_ADULT<-results(dds,contrast=c("condition","JUV","ADULT"))
resCOMP_JUV<-results(dds,contrast=c("condition","COMP","JUV"))
resPRECOMP_COMP<-results(dds,contrast=c("condition","PRECOMP","COMP"))
> resultsNames(dds)
[1] "Intercept" "libTypestranded"
[3] "libTypeunstranded" "conditionPRECOMP"
[5] "conditionCOMP" "conditionJUV"
[7] "conditionADULT"
ONE FACTOR
> head(CountTable)
ADULT1 ADULT2 ADULT3 JUV1 JUV2 JUV3 COMP1 COMP2 COMP3
PRECOMP1 PRECOMP2 PRECOMP3
asmbl_1 0 24 48 5 1 1 15 8 6
28 27 47
asmbl_10 0 0 0 0 0 0 0 0 0
0 0 0
asmbl_100 0 0 0 4 0 0 1 2 3
5 5 7
asmbl_1000 8 0 7 5 19 7 4 4 7
4 12 1
asmbl_10000 75 0 73 12 112 51 43 17 57
18 23 63
asmbl_10001 1 0 0 11 6 4 4 11 1
0 3 0
dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,design
=~condition)
colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOM
P","COMP","JUV","ADULT"))
design(dds)<-formula(~condition )
dds<-DESeq(dds)
res<-results(dds)
resJUV_ADULT<-results(dds,contrast=c("condition","JUV","ADULT"))
resCOMP_JUV<-results(dds,contrast=c("condition","COMP","JUV"))
resPRECOMP_COMP<-results(dds,contrast=c("condition","PRECOMP","COMP"))
> resultsNames(dds)
[1] "Intercept" "conditionPRECOMP"
[3] "conditionCOMP" "conditionJUV"
[5] "conditionADULT"
Here is the number of DE genes at a threshold of 0.05 (padj<0.05)
PreComp-Comp Comp-Juv Juv-Adult
Shared 1400 5541
5733
Multifactorial specific 98 1584
1304
One-factor specific 1436 1658 2480
As you can see considering *only* unstranded data in the analysis
detected more DE genes but they seem comparable (at least to me).
Any thougths on this? Should I rely on the multifactorial design?
Thanks for help
Fede
________________________________________
From:
mailinglist.honeypot@gmail.com<mailto:mailinglist.honeypot@gmail.com>
[mailinglist.honeypot@gmail.com<mailto:mailinglist.honeypot@gmail.com>
] on behalf of Steve Lianoglou
[lianoglou.steve@gene.com<mailto:lianoglou.steve@gene.com>]
Sent: Wednesday, 26 February 2014 7:03 PM
To: Federico Gaiti
Cc: bioconductor@r-project.org<mailto:bioconductor@r-project.org>
Subject: Re: [BioC] Low number of replicates DESeq
Hi,
On Wed, Feb 26, 2014 at 12:50 AM, Federico Gaiti
<f.gaiti@uq.edu.au<mailto:f.gaiti@uq.edu.au>> wrote:
> Hi Steve,
>
> thanks for the reply and sorry for all the code.
> I'm still a beginner in this field so I'm still learning how to
correctly formulate my questions/emails.
Yeah, no problem, just pointing these things out -- keep in mind that
it takes even experienced people time to wade through lots of code, so
it's best to keep things short and sweet (with sufficient detail, of
course ;-)
> I agree with you about the PCA plot analysis.
> Could you just explain better to me what you mean with " If this is
the case, then encoding the libType as a main effect in your model (as
you've done) should go a long ways in dealing with this issue for
you." ??
>
> So let's see if I got what you are saying.
> Are you suggesting I should try to do a DGE with the undtranded data
with "condition" as the only level and then compare it to the DGE
outcome using a multifactorial design?
>
> This would be the way I start the multifactorial analysis:
>
> dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,desi
gn=~condition + libType)
>> colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRE
COMP","COMP","JUV","ADULT"))
>> design(dds)<-formula(~libType + condition)
>
> Am I getting it right?
> If so, I'll go ahead and keep you posted about the outcome
Yes, you are getting it right -- I'd put the `condition` data on the
Design data.frame before you create the dds, but I'm not if it will
matter.
Just follow closely the example in the deseq2 vignette. Read the
entire vignette actually, so you understand how to get the particular
results you are after out of your objects (ie. what the things are
that you should pass into a call to `results` for instance).
You will be working with two models, say `dds1` which was built with
*only* the unstranded data and your design is ~ condition.
dds2 will be the model with the unstranded and stranded along with the
`~ libType + condition` design.
Once you have those, look at the output from `resultsNames(dds1)` and
`resultsNames(dds2)` and see that you compare the same results between
dds1 and dds2.
This should become more clear to you as you read the deseq2 vignette
(read it again if you think you already read it once) and when you
look with your data.
Note that the DESeq2 folks recently posted an early version of the
paper detaling the deseq2 method here:
http://biorxiv.org/content/early/2014/02/19/002832
Which would be helpful to read.
-steve
--
Steve Lianoglou
Computational Biologist
Genentech
[[alternative HTML version deleted]]
_______________________________________________
Bioconductor mailing list
Bioconductor@r-project.org<mailto:bioconductor@r-project.org>
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
[[alternative HTML version deleted]]