I'm now running ATAC-seq differential analysis by DiffBind with peaks from MACS2. I have 10 samples from batch 3 and 20 samples from batch 5. How could I eliminate batch effect and normalize them in DiffBind? Sorry, I'm quite new in this area. Thanks!
If you see a batch effect (eg in the PCA plots), the best way to handle them is to model them. We don't want to adjust the count matrix directly as that violates the assumption of the differential analyses in DESeq2 and edgeR.
To model the batch, use the block parameter in dba.contrast(). For example, if the batch numbers were stored in the Replicate field, you would say block=DBA_REPLICATE. Then when you run dba.analyze(), it will do the analysis two ways: as a single-factor analysis, and with the "blocking" factor representing the batch. To use the results from the blocked analysis, you need to specify method=DBA_DESEQ2_BLOCK (or method=DBA_EDGER_BLOCK) when calling dba.report() or any of the plotting functions.
Hopefully you have some replicates in the sense that multiple samples represent alternative experiments of the same condition.
The first step is to mark the batch using one of the metadata fields. You could use for example Condition, Treatment, or Replicate. Create a column in the sample sheet with one of these names, and for the sample in the first batch set the value to "1" and for the samples in the second batch set them to "2" (or "Batch1"and "Batch2" if you are not using the Replicate field).
For example, if the comparison you want to make is between two conditions, you would set the condition labels for each sample in the Condition column, and the corresponding batch number in the Replicate column. Then, after you have loaded the sample sheet using dba() and computed the read counts for the consensus set using dba.count(), you call dba.contrast() and dba.analyze():
In the contrast listing for myDBA, you should see values for two analyses: DB.DESeq2 and DB.DESEq2.block. The first is re[presents the results for a single-factor analysis, while the second reflects the additional modelling of the batch effect.
To use the results of the multi-factor analysis, you need to specify method=DBA_DESEQ2_BLOCK, eg:
You can set the name of the metadata field you are using to represent he batch in subsequent plots by setting: myDBA$config$replicate <- "Batch"
You can use the multi-factor analysis by default (don't need to specify the method= parameter every time) by setting: myDBA$config$AnalysisMethod <- DBA_DESEQ2_BLOCK
When I follow your suggestions, it shows the error: "Error in pv.DBAreport(pv=DBA, contrast = contrast, method = method: DESeq2 analysis has not been run for this contrast."
My codes:
ta <- dba(sampleSheet = "diff.csv")
ta <- dba.count(ta, summits=75)
ta = dba.contrast(ta, categories=DBA_CONDITION, minMembers = 2, block=DBA_REPLICATE)
ta <- dba.contrast(ta)
ta = dba.analyze(ta, method=DBA_DESEQ2_BLOCK)
ta.DB = dba.report(ta, contrast=1, method=DBA_DESEQ2_BLOCK, bCounts=TRUE)
It also shows the following output when I type "ta" after the second dba.contrast()
My example code had an error (I've fixed it above). The second call to dba.contrast() should be a call to dba.analyze(). So in your code you should remove the second call to dba.contrast() and remove the "method=DBA_DESEQ2_BLOCK" parameter from the call to dba.analyze():
ta <- dba(sampleSheet = "diff.csv")
ta <- dba.count(ta, summits=75)
ta <- dba.contrast(ta, categories=DBA_CONDITION, minMembers = 2,
block=DBA_REPLICATE)
ta <- dba.analyze(ta)
ta.DB <- dba.report(ta, contrast=1, method=DBA_DESEQ2_BLOCK, bCounts=TRUE)
If you email the DBA object (ta.DB) to me (or send me a link where I can download it), I can see what is going on. Email is on Bioconductor landing page for DiffBind.
It works when I tried in another version; however, there is another error now:
It shows the following error when I try to plot heatmap or PCA plot. "Error in pv.getPlotData(DBA, attributes = attributes, contrast = contrast, : Only one site to plot -- need at least 2!".
The PCA and correlation plots may be misleading in this case. Modelling the batch effect in the GLM does not change the read counts in any way. So while the model used to identify differentially bound sites takes the batch effect into account, when you plot these sites using the count matrix, they will still exhibit the batch effect.
Short of altering the read counts for the plots (eg by singular value decomposition, as is used in SVA), we can't account for the batch effect in the plots. As DiffBind uses differential analysis methods that rely on raw reads (DESeq2 and edgeR), it avoids such alterations to the count matrix.
Thanks for your help! In this case, how could I evaluate if there is still significant batch effect in my dataset after use the "BLOCK" arguments you mentioned? Thanks!
Having modelled the batch effect, the resulting statistics should be valid. In particular, I would expect that False Positive statistics to be valid. Yo may have more False Negatives formt he increased variance, but he sites that have low FDR should be good.
I have 8 samples from 2 different batches without replicates. How could I use the block argument?
Thanks!
Hopefully you have some replicates in the sense that multiple samples represent alternative experiments of the same condition.
The first step is to mark the batch using one of the metadata fields. You could use for example
Condition
,Treatment
, orReplicate
. Create a column in the sample sheet with one of these names, and for the sample in the first batch set the value to "1" and for the samples in the second batch set them to "2" (or "Batch1"and "Batch2" if you are not using theReplicate
field).For example, if the comparison you want to make is between two conditions, you would set the condition labels for each sample in the
Condition
column, and the corresponding batch number in theReplicate
column. Then, after you have loaded the sample sheet usingdba()
and computed the read counts for the consensus set usingdba.count()
, you calldba.contrast()
anddba.analyze(
):In the contrast listing for
myDBA
, you should see values for two analyses:DB.DESeq2
andDB.DESEq2.block
. The first is re[presents the results for a single-factor analysis, while the second reflects the additional modelling of the batch effect.To use the results of the multi-factor analysis, you need to specify
method=DBA_DESEQ2_BLOCK
, eg:Some other tips:
myDBA$config$replicate <- "Batch"
method=
parameter every time) by setting:myDBA$config$AnalysisMethod <- DBA_DESEQ2_BLOCK
-Rory
Hi Rory,
When I follow your suggestions, it shows the error: "Error in pv.DBAreport(pv=DBA, contrast = contrast, method = method: DESeq2 analysis has not been run for this contrast."
My codes:
It also shows the following output when I type "ta" after the second dba.contrast()
Group1 Members1 Group2 Member2
Control 3 Test 5
Simon
I have 2 batches. One is "batch 3" and another one is "batch 5", so I put numerical value 3 or 5 in the Replicate column.
Hi Simon-
My example code had an error (I've fixed it above). The second call to
dba.contrast()
should be a call todba.analyze()
. So in your code you should remove the second call todba.contrast()
and remove the "method=DBA_DESEQ2_BLOCK
" parameter from the call todba.analyze()
: