Hi Marta-
Well spotted! I have been able to reproduce the behavior you describe,
and found the cause.
edgeR tends to report normalized counts-per-million (cpm), which is
good for comparing counts between experiments but uses values that are
potentially on a much different scale than for the original
experiment. Instead of scaling every experiment to one million reads,
DiffBind scales using a factor based on the library sizes of the
samples in the experiment. There was a discrepancy in how this scaling
factor is computed in dba.count for the global binding matrix, and how
it is computed for a specific contrast in dba.report. I will soon
check in a fix to make these both use the same scaling factor
(specifically, the mean library size, as has been used for the global
scores). In the next release, I'll consider adding an option to report
the cpm values as a global read score and/or for the count values in
dba.report.
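
To make the difference between the two scalings concrete, here is a
minimal sketch in Python with invented numbers. It is not DiffBind or
edgeR code, and it ignores TMM normalization factors and control
subtraction; the function names, library sizes, and read count are
assumptions chosen only for illustration.

```python
def cpm(count, lib_size):
    """Counts per million: scale a raw count to a library of one million reads."""
    return count * 1_000_000 / lib_size


def scale_to_mean_library(count, lib_size, all_lib_sizes):
    """Scale a raw count to the mean library size of the experiment.

    Hypothetical helper for illustration, not a DiffBind function.
    """
    mean_lib = sum(all_lib_sizes) / len(all_lib_sizes)
    return count * mean_lib / lib_size


# Invented library sizes (total reads) for four samples
lib_sizes = [20_000_000, 30_000_000, 25_000_000, 25_000_000]
raw_count = 500  # invented raw read count at one peak in the first sample

cpm_value = cpm(raw_count, lib_sizes[0])
scaled_value = scale_to_mean_library(raw_count, lib_sizes[0], lib_sizes)

print(cpm_value)     # 25.0
print(scaled_value)  # 625.0
```

Both values describe the same raw count. The cpm value is comparable
across experiments, while the value scaled to the mean library size
stays closer to the magnitude of the original counts.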
Note that it is still possible to get different normalized read
values for the same samples globally and within a specific contrast
report if the contrast does not include all the samples in the DBA
object, as the data are re-normalized separately for each contrast
using only the applicable samples.
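
As a rough illustration of why the subset of samples matters, the
sketch below (Python, invented numbers, not DiffBind code) scales the
same raw count once using all samples and once using only the samples
in a contrast. The sample names, library sizes, and helper function
are assumptions, and TMM factors and control subtraction are left out.

```python
def scale_to_mean_library(count, lib_size, lib_sizes_used):
    """Scale a raw count to the mean of the given library sizes (hypothetical helper)."""
    mean_lib = sum(lib_sizes_used) / len(lib_sizes_used)
    return count * mean_lib / lib_size


# Invented library sizes for six samples in three conditions
lib_sizes = {
    "A1": 20_000_000, "A2": 30_000_000,
    "B1": 25_000_000, "B2": 25_000_000,
    "C1": 40_000_000, "C2": 40_000_000,
}
raw_count = 500  # invented raw read count at one peak in sample A1

# Global score: every sample contributes to the mean library size
global_value = scale_to_mean_library(
    raw_count, lib_sizes["A1"], list(lib_sizes.values())
)

# Contrast A vs B: only the four samples in the contrast contribute
contrast_samples = ["A1", "A2", "B1", "B2"]
contrast_value = scale_to_mean_library(
    raw_count, lib_sizes["A1"], [lib_sizes[s] for s in contrast_samples]
)

print(global_value)    # 750.0
print(contrast_value)  # 625.0
```

The raw count is identical in both cases; only the set of samples
used to derive the scaling factor changes.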
And yes, if duplicate reads are removed in dba.count, they will not be
used in any subsequent analysis based on those counts.
Cheers-
Rory
----------------------------------------------------------------------------
Dr. Rory Stark
Principal Bioinformatics Analyst
Cancer Research UK Cambridge Institute
University of Cambridge
Robinson Way
Cambridge CB2 0RE
United Kingdom
+44 (0)1223 769 658
rory.stark@cruk.cam.ac.uk
----------------------------------------------------------------------------
From: Marta Byrska-Bishop <mbb5158@psu.edu>
Date: Tue, 11 Feb 2014 13:24:37 -0500
To: Rory Stark <rory.stark@cruk.cam.ac.uk>
Subject: questions related to DiffBind package
Hello,
I'm a graduate student in Ross Hardison's lab at Penn State
University. I've been using your DiffBind package in R for
differential binding analysis and I have a couple of questions for
you.
I'm running this analysis to compare the genome-wide binding patterns
of a wild-type and a mutated form of a certain transcription factor. We
have 2 replicates available for both of the TFs. I ran the
differential binding analysis using only the consensus peaks.
My question is why the read counts from individual samples that I get
from dba.report (bCounts = TRUE) are not identical to the ones I get
from saving a whole binding matrix after performing read counting
using dba.count. I understand that irrespective of the normalization
method chosen for dba.count, dba.analyze uses raw read counts and
performs normalization independently from dba.count. If for
differential binding analysis using dba.analyze I use the following
options: method = DBA_EDGER, bFullLibrarySize = FALSE, & bSubControl =
TRUE, are the read counts going to be normalized the same way as in
dba.count when using DBA_SCORE_TMM_MINUS_EFFECTIVE? I compared the
read counts I get from both dba.count and dba.report using the above
settings and they are very close, but not identical. Shouldn't they be
exactly the same?
Also, just to confirm, if in dba.count I choose the option
bRemoveDuplicates = TRUE, are the duplicate reads also going to be
filtered out in the differential binding analysis using dba.analyze?
I'd greatly appreciate any information regarding my questions.
Thank you very much for your time!
Marta
Marta Byrska-Bishop
PhD candidate
Hardison Lab || The Pennsylvania State University || Wartik 303 ||
University Park PA 16802 || lab: 814-863-3150