I have just started analyzing my transcriptome assembly data by
deseq2. I am finding that my two control samples do not coincide very
well. This may be affecting the number of differentially expressed
genes if I use the average of two controls to compare with the
treated. Is there a way I can minimize the effects of biological
variation on the analysis?
The overall counts are almost the same in all four files (~59-65
million, with an average of 62 million and SD of 3 million). C1 and C2
(untreated) are two different batches of experimental material and one
half of them are dosed (D1 and D2; treated). They are kind of paired
that way, although they are distinct set of individuals between C1 &
D1 or C2 & D2. We actually started with 3 reps but decided to pool the
2nd and 3rd reps to have sufficient RNA for follow up qPCR assays.
For some reason (could be biological differences between batches), the
counts in C2 are 3-5 fold high compared to C1 for only some
transcripts. There is a faintly similar trend between D2 and D1,
although the treatment has wiped off this difference in many of those
transcripts.
I am wondering if I can use the difference between the pairs (C1&D1
and C2&D2), at least some of the biological variability may be masked
allowing us to see the effects of the treatment more clearly on the
transcriptome.
I find from the FAQs that there is a way to do pair-wise analysis in
deseq2. I appreciate if you can elaborate the answer a little more.
I have also another question related to this study. We actually used
another closely related species for the study. Since this is small in
size, we had to pool all three reps into one. Transcripts are >75%
identical at the DNA sequence level and the gene expression changes
are similar between the species in response to the treatment. I am
thinking of combining this data as a third rep of the study for deseq2
analysis, instead of throwing out the data. I appreciate to have your
input about this step as well.
I am also attaching the code I used (thanks to Dave Wheeler).
Thank you very much for your help.
Subbaiah Chalivendra
-- output of sessionInfo():
R version 3.1.1 (2014-07-10)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] BiocGenerics_0.10.0 GenomeInfoDb_1.0.2 GenomicRanges_1.16.3
[4] IRanges_1.22.9 parallel_3.1.1 stats4_3.1.1
[7] XVector_0.4.0
--
Sent via the guest posting facility at bioconductor.org.
hi Subbaiah,
On Tue, Aug 12, 2014 at 10:33 AM, Subbaiah Chalivendra [guest]
<guest at="" bioconductor.org=""> wrote:
> I have just started analyzing my transcriptome assembly data by
deseq2. I am finding that my two control samples do not coincide very
well. This may be affecting the number of differentially expressed
genes if I use the average of two controls to compare with the
treated. Is there a way I can minimize the effects of biological
variation on the analysis?
>
> The overall counts are almost the same in all four files (~59-65
million, with an average of 62 million and SD of 3 million). C1 and C2
(untreated) are two different batches of experimental material and one
half of them are dosed (D1 and D2; treated). They are kind of paired
that way, although they are distinct set of individuals between C1 &
D1 or C2 & D2. We actually started with 3 reps but decided to pool the
2nd and 3rd reps to have sufficient RNA for follow up qPCR assays.
>
> For some reason (could be biological differences between batches),
the counts in C2 are 3-5 fold high compared to C1 for only some
transcripts. There is a faintly similar trend between D2 and D1,
although the treatment has wiped off this difference in many of those
transcripts.
>
> I am wondering if I can use the difference between the pairs (C1&D1
and C2&D2), at least some of the biological variability may be masked
allowing us to see the effects of the treatment more clearly on the
transcriptome.
>
> I find from the FAQs that there is a way to do pair-wise analysis in
deseq2. I appreciate if you can elaborate the answer a little more.
>
Do you have a more specific question? We have in the vignette:
"FAQ: Can I use DESeq2 to analyze paired samples? Yes, you should use
a multi-factor design which includes the sample information as a term
in the design formula. This will account for differences between the
samples while estimating the effect due to the condition. The
condition of interest should go at the end of the design formula. See
Section 1.5."
so you would have a design of ~ sample + treatment
where the column data should look like
sample treatment
1 C
2 C
1 D
2 D
> I have also another question related to this study. We actually used
another closely related species for the study. Since this is small in
size, we had to pool all three reps into one. Transcripts are >75%
identical at the DNA sequence level and the gene expression changes
are similar between the species in response to the treatment. I am
thinking of combining this data as a third rep of the study for deseq2
analysis, instead of throwing out the data. I appreciate to have your
input about this step as well.
>
I wouldn't recommend adding this sample from the different species,
because the differences in counts due to the different transcripts of
the other species would throw off the analysis (whether you align the
reads to the genome/transcriptome of that species or to the
genome/transcriptome of the species with the 4 samples you are already
using). Basically, the feature-level effects on counts would not be
expected to cancel out across samples.
Mike
> I am also attaching the code I used (thanks to Dave Wheeler).
>
> Thank you very much for your help.
>
> Subbaiah Chalivendra
>
> -- output of sessionInfo():
>
> R version 3.1.1 (2014-07-10)
> Platform: i386-w64-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252
> [2] LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> loaded via a namespace (and not attached):
> [1] BiocGenerics_0.10.0 GenomeInfoDb_1.0.2 GenomicRanges_1.16.3
> [4] IRanges_1.22.9 parallel_3.1.1 stats4_3.1.1
> [7] XVector_0.4.0
>
>
> --
> Sent via the guest posting facility at bioconductor.org.