DiffBind consensus peak help
1
0
Entering edit mode
@reubenmcgregor88-13722
Last seen 3.8 years ago

 

Hi, I am trying to carry out some differential peak analysis on histone modification data (H3K27Ac data). 

I run into problems where one peak in particular is not being called as differentially expressed, where clearly in the data is should be.

In order to try and find out where the problem is I am trying to extract a bed file of the consensus peaks to make sure diffbind is identifying the peak as a consensus peak for the analysis. 

My data is very simple. two treatments (etc, vitd) and two donors, see sample sheet print off below:

# A tibble: 4 x 11
    SampleID Tissue Factor Condition Treatment Replicate               bamReads ControlID              bamControl
       <chr>  <chr>  <chr>     <chr>     <chr>     <int>                  <chr>     <chr>                   <chr>
1  k27aceth1    cd4  k27ac       eth         a         1  Bowtie_1_27Ac_Eth.bam  natinput Bowtie_1_NInput_Eth.bam
2  k27aceth2    cd4  k27ac       eth         a         2  Bowtie_2_27Ac_Eth.bam  natinput Bowtie_1_NInput_Eth.bam
3 k27acvitd1    cd4  k27ac      vitd         a         1 Bowtie_1_27Ac_VitD.bam  natinput Bowtie_1_NInput_Eth.bam
4 k27acvitd2    cd4  k27ac      vitd         a         2 Bowtie_2_27Ac_VitD.bam  natinput Bowtie_1_NInput_Eth.bam
# ... with 2 more variables: Peaks <chr>, PeakCaller <chr>

The code I used for differential peak analysis is as follows:

> samples <- dba(sampleSheet=diffbind_samples, peakCaller = "bed", peakFormat = "bed")

> samples_counts <- dba.count(samples)

> samples_counts <- dba.contrast(samples_counts, categories=DBA_CONDITION, minMembers=2)

> samples_counts <- dba.analyze(samples_counts)

and then exported the results as follows:

> samples_counts.DB <- dba.report(samples_counts, th=0.1, fold=0.585)

> write.table(samples_counts.DB, "samples_counts_0.1_0.586.txt", sep="\t", row.names = F)

So I guess my main question is how to retrieve the regions that where used for the analysis, I have tried the following:

> samples_consensus <- dba.peakset(samples, consensus=c(DBA_TISSUE,DBA_CONDITION), minOverlap=1)

> samples_consensus <- dba(samples_consensus, mask=samples_consensus$masks$Consensus,minOverlap=1)

> consensus_peaks <- dba.peakset(samples_consensus, bRetrieve=TRUE)

> write.table(consensus_peaks, "consensus_peaks.txt", sep="\t")

Below is an example of what the output gave me when visualised in IGV. The top four tracks of the top section are the bigwig of the bamfiles, then in the bottom section, the two blue bars are the peaks called of the red peaks in bigwig and just below it is the "consensus_peaks" defined above and as you can see even though the peaks are called in the original analysis, it is not idnentified as a consensus peak.

Here is the sessionInfo() output:

R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] readr_1.1.1                DiffBind_2.4.8             SummarizedExperiment_1.6.3
 [4] DelayedArray_0.2.7         matrixStats_0.52.2         Biobase_2.36.2            
 [7] GenomicRanges_1.28.4       GenomeInfoDb_1.12.2        IRanges_2.10.2            
[10] S4Vectors_0.14.3           BiocGenerics_0.22.0       

loaded via a namespace (and not attached):
  [1] Category_2.42.1          bitops_1.0-6             bit64_0.9-7              RColorBrewer_1.1-2      
  [5] httr_1.3.1               tools_3.4.1              backports_1.1.0          R6_2.2.2                
  [9] rpart_4.1-11             KernSmooth_2.23-15       Hmisc_4.0-3              DBI_0.7                 
 [13] lazyeval_0.2.0           colorspace_1.3-2         nnet_7.3-12              gridExtra_2.2.1         
 [17] DESeq2_1.16.1            bit_1.1-12               compiler_3.4.1           sendmailR_1.2-1         
 [21] graph_1.54.0             htmlTable_1.9            plotly_4.7.1             rtracklayer_1.36.4      
 [25] caTools_1.17.1           scales_0.5.0             checkmate_1.8.3          BatchJobs_1.6           
 [29] genefilter_1.58.1        RBGL_1.52.0              stringr_1.2.0            digest_0.6.12           
 [33] Rsamtools_1.28.0         foreign_0.8-69           AnnotationForge_1.18.1   XVector_0.16.0          
 [37] base64enc_0.1-3          pkgconfig_2.0.1          htmltools_0.3.6          limma_3.32.5            
 [41] htmlwidgets_0.9          rlang_0.1.2              RSQLite_2.0              BBmisc_1.11             
 [45] GOstats_2.42.0           bindr_0.1                hwriter_1.3.2            jsonlite_1.5            
 [49] BiocParallel_1.10.1      gtools_3.5.0             acepack_1.4.1            dplyr_0.7.2             
 [53] RCurl_1.95-4.8           magrittr_1.5             GO.db_3.4.1              GenomeInfoDbData_0.99.0 
 [57] Formula_1.2-2            Matrix_1.2-11            Rcpp_0.12.12             munsell_0.4.3           
 [61] stringi_1.1.5            edgeR_3.18.1             zlibbioc_1.22.0          fail_1.3                
 [65] gplots_3.0.1             plyr_1.8.4               grid_3.4.1               blob_1.1.0              
 [69] ggrepel_0.6.5            gdata_2.18.0             lattice_0.20-35          Biostrings_2.44.2       
 [73] splines_3.4.1            GenomicFeatures_1.28.4   annotate_1.54.0          hms_0.3                 
 [77] locfit_1.5-9.1           knitr_1.17               rjson_0.2.15             systemPipeR_1.10.2      
 [81] geneplotter_1.54.0       biomaRt_2.32.1           XML_3.98-1.9             glue_1.1.1              
 [85] ShortRead_1.34.0         latticeExtra_0.6-28      data.table_1.10.4        gtable_0.2.0            
 [89] purrr_0.2.3              tidyr_0.7.0              amap_0.8-14              assertthat_0.2.0        
 [93] ggplot2_2.2.1            xtable_1.8-2             survival_2.41-3          viridisLite_0.2.0       
 [97] pheatmap_1.0.8           tibble_1.3.4             GenomicAlignments_1.12.2 AnnotationDbi_1.38.2    
[101] memoise_1.1.0            bindrcpp_0.2             cluster_2.0.6            brew_1.0-6              
[105] GSEABase_1.38.1

diffbind r chipseq • 2.3k views
ADD COMMENT
0
Entering edit mode

Hi,

Could you report the output of sessionInfo() please?  Some older versions of DiffBind had a bug that under some cases caused peaks to be reported at incorrect locations.  (You should always include the output of sessionInfo() when you ask a question.)

 - Gord

ADD REPLY
0
Entering edit mode

Hi Gord,

Have added to original post, Sorry I will remember this in the future. Seems it is version 2.4.8?

-Reuben

ADD REPLY
1
Entering edit mode

Thanks, that looks current, so the old bug shouldn't be a problem.  Unfortunately I can't see anything obvious.  The default for dba.count is minOverlap=2, so presence of a peak in two samples should be sufficient.

Could you post or email your sample sheet?  And the results of printing the dba objects after each of the first 4 function calls (dba to dba.analyze).

To get a report, you use the dba.report function, which has a whole lot of parameters determining what you get back.  In this case, you will want at least:

  1. th=1 so all sites, even non-significant ones, are reported
  2. bCalled=TRUE to see which peaks were called (note you need to call dba.count with bCalledMasks=TRUE)

  3. bCounts=TRUE to see the counts for each peak
  4. DataType=DBA_DATA_FRAME (probably easiest to work with)

You may want to use the "file" parameter to get it all written to a file easily.

If I can't work out what's going on with all that, then I'll have to drag Rory into the discussion.  He knows the internals much better than I do.

ADD REPLY
0
Entering edit mode

Thanks for this, I learned a bit about the dab.report,

Here is my sample sheet:

SampleID Tissue Factor Condition Treatment Replicate bamReads ControlID bamControl Peaks PeakCaller
                       

1

k27aceth1

cd4

k27ac

eth

a

1

Bowtie_1_27Ac_Eth.bam

natinput

Bowtie_1_NInput_Eth.bam

27ac.eth.don1.peaks.styleregion.250bp.mindist5kb_diffbind.bed

bed

2

k27aceth2

cd4

k27ac

eth

a

2

Bowtie_2_27Ac_Eth.bam

natinput

Bowtie_1_NInput_Eth.bam

27ac.eth.don2.peaks.styleregion.250bp.mindist5kb.diffbind.bed

bed

3

k27acvitd1

cd4

k27ac

vitd

a

1

Bowtie_1_27Ac_VitD.bam

natinput

Bowtie_1_NInput_Eth.bam

27ac.vitD.don1.peaks.styleregion.250bp.mindist5kb_diffbind.bed

bed

4

k27acvitd2

cd4

k27ac

vitd

a

2

Bowtie_2_27Ac_VitD.bam

natinput

Bowtie_1_NInput_Eth.bam

27ac.vitD.don2.peaks.styleregion.250bp.mindist5kb.diffbind.bed

bed

After doing the suggested DBA.report, I can identify the peak in question and the report shows something like this;

Chr Start End Conc Conc_eth Conc_vitd Fold p-value FDR Called1 Called2 k27aceth1 k27aceth2 k27acvitd1 k27acvitd2
chr7 22653887 22655828 7.35 -0.21 8.35 -8.56 1.21E-14 8.52E-11 0 2 0.73 1 319.78 332.63

If I am interpreting this right the normalised values are -0,21 in etc and 8.35 in vitD which is a 8.56 fold increase from eth to vitd with an FDR of 8.52E-11. I am not quite sure how to interpret the Called1 and Called2 columns however? And the final k27aceth1 etc are the counts in each sample, which in this case show k27ac going from around 1 in both etc to over 300 in vitd treatment?

Let me know if this helps?

Again thank you for your help.

ADD REPLY
1
Entering edit mode

Called1 and Called2 report the number of samples in each group that had the peak identified in the original peak calling. In this example, this peak was not identified in either of the eth samples, but it was called in both of the vitd samples.

-R

ADD REPLY
0
Entering edit mode

This tells us that DiffBind is indeed finding the difference extremely significant.  I'll have to leave it to Rory to understand what's happening with the output.

ADD REPLY
0
Entering edit mode

Ok I have an update, when I switched around my sample sheet (I did this purely as I wanted to have the vitd samples as the positive fold change values) to have the vitd samples at the top, the peak in question is now in the  output of

> samples_counts.DB <- dba.report(samples_counts, th=0.1, fold=0.585)

> write.table(samples_counts.DB, "samples_counts_0.1_0.586.txt", sep="\t", row.names = F)

So the exact issue above seems to be resolved.

However, I have now spotted another issue which perhaps you or Rory can help me with?

I have histone, broad-peak data, however some of the peaks are actually rather narrow whilst some are broad. So what I can see happening now is that the broader peaks are being nicely identified as differentially expressed by diffbind, but the narrow peaks which seem highly up-regulated (though big fold changes) are not reaching FDR significance.

I wonder whether this is something you have come across before? I have tried playing around with "summits" function and "bFullLibrarySize" but not sure how I can best tackle this problem? To give you an idea here is the readout for one of the peaks in question:

Chr

Start

End

Conc

Conc_vitd

Conc_eth

Fold

p-value

FDR

Called1

Called2

k27acvitd1

k27acvitd2

k27aceth1

k27aceth2

chr2

102918383

102918857

7.02

7.79

5.23

2.55

0.00444

0.513

2

1

217.33

225.1

69.3

6

 

Let me know if you would like me to re-post as a separate question? 

ADD REPLY
0
Entering edit mode

Changing the order of lines in the sample sheet will change the order of the (auto-generated) comparisons (i.e. control vs treatment as opposed to treatment vs control) (please confirm or correct, Rory).  But if that's changing what shows up as significant, sounds like something's wrong.  Rory?  Any thoughts?

As to the new question, yes, a new question would be a good idea.

ADD REPLY
0
Entering edit mode
Rory Stark ★ 5.2k
@rory-stark-5741
Last seen 28 days ago
Cambridge, UK

If you could email me the DBA object "samples"

> samples <- dba(sampleSheet=diffbind_samples, peakCaller = "bed", peakFormat = "bed")

I can have a look at what is going on.

Cheers-

 

ADD COMMENT
0
Entering edit mode

Thanks Rory, I have emailed it now

ADD REPLY

Login before adding your answer.

Traffic: 569 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6