Question

DiffBind consensus peak help

0

Entering edit mode

reubenmcgregor88 • 0

@reubenmcgregor88-13722

Last seen 3.8 years ago

Hi, I am trying to carry out some differential peak analysis on histone modification data (H3K27Ac data).

I run into problems where one peak in particular is not being called as differentially expressed, where clearly in the data is should be.

In order to try and find out where the problem is I am trying to extract a bed file of the consensus peaks to make sure diffbind is identifying the peak as a consensus peak for the analysis.

My data is very simple. two treatments (etc, vitd) and two donors, see sample sheet print off below:

# A tibble: 4 x 11
SampleID Tissue Factor Condition Treatment Replicate bamReads ControlID bamControl
<chr> <chr> <chr> <chr> <chr> <int> <chr> <chr> <chr>
1 k27aceth1 cd4 k27ac eth a 1 Bowtie_1_27Ac_Eth.bam natinput Bowtie_1_NInput_Eth.bam
2 k27aceth2 cd4 k27ac eth a 2 Bowtie_2_27Ac_Eth.bam natinput Bowtie_1_NInput_Eth.bam
3 k27acvitd1 cd4 k27ac vitd a 1 Bowtie_1_27Ac_VitD.bam natinput Bowtie_1_NInput_Eth.bam
4 k27acvitd2 cd4 k27ac vitd a 2 Bowtie_2_27Ac_VitD.bam natinput Bowtie_1_NInput_Eth.bam
# ... with 2 more variables: Peaks <chr>, PeakCaller <chr>

The code I used for differential peak analysis is as follows:

> samples <- dba(sampleSheet=diffbind_samples, peakCaller = "bed", peakFormat = "bed")

> samples_counts <- dba.count(samples)

> samples_counts <- dba.contrast(samples_counts, categories=DBA_CONDITION, minMembers=2)

> samples_counts <- dba.analyze(samples_counts)

and then exported the results as follows:

> samples_counts.DB <- dba.report(samples_counts, th=0.1, fold=0.585)

> write.table(samples_counts.DB, "samples_counts_0.1_0.586.txt", sep="\t", row.names = F)

So I guess my main question is how to retrieve the regions that where used for the analysis, I have tried the following:

> samples_consensus <- dba.peakset(samples, consensus=c(DBA_TISSUE,DBA_CONDITION), minOverlap=1)

> samples_consensus <- dba(samples_consensus, mask=samples_consensus$masks$Consensus,minOverlap=1)

> consensus_peaks <- dba.peakset(samples_consensus, bRetrieve=TRUE)

> write.table(consensus_peaks, "consensus_peaks.txt", sep="\t")

Below is an example of what the output gave me when visualised in IGV. The top four tracks of the top section are the bigwig of the bamfiles, then in the bottom section, the two blue bars are the peaks called of the red peaks in bigwig and just below it is the "consensus_peaks" defined above and as you can see even though the peaks are called in the original analysis, it is not idnentified as a consensus peak.

Here is the sessionInfo() output:

R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] readr_1.1.1 DiffBind_2.4.8 SummarizedExperiment_1.6.3
[4] DelayedArray_0.2.7 matrixStats_0.52.2 Biobase_2.36.2
[7] GenomicRanges_1.28.4 GenomeInfoDb_1.12.2 IRanges_2.10.2
[10] S4Vectors_0.14.3 BiocGenerics_0.22.0

loaded via a namespace (and not attached):
[1] Category_2.42.1 bitops_1.0-6 bit64_0.9-7 RColorBrewer_1.1-2
[5] httr_1.3.1 tools_3.4.1 backports_1.1.0 R6_2.2.2
[9] rpart_4.1-11 KernSmooth_2.23-15 Hmisc_4.0-3 DBI_0.7
[13] lazyeval_0.2.0 colorspace_1.3-2 nnet_7.3-12 gridExtra_2.2.1
[17] DESeq2_1.16.1 bit_1.1-12 compiler_3.4.1 sendmailR_1.2-1
[21] graph_1.54.0 htmlTable_1.9 plotly_4.7.1 rtracklayer_1.36.4
[25] caTools_1.17.1 scales_0.5.0 checkmate_1.8.3 BatchJobs_1.6
[29] genefilter_1.58.1 RBGL_1.52.0 stringr_1.2.0 digest_0.6.12
[33] Rsamtools_1.28.0 foreign_0.8-69 AnnotationForge_1.18.1 XVector_0.16.0
[37] base64enc_0.1-3 pkgconfig_2.0.1 htmltools_0.3.6 limma_3.32.5
[41] htmlwidgets_0.9 rlang_0.1.2 RSQLite_2.0 BBmisc_1.11
[45] GOstats_2.42.0 bindr_0.1 hwriter_1.3.2 jsonlite_1.5
[49] BiocParallel_1.10.1 gtools_3.5.0 acepack_1.4.1 dplyr_0.7.2
[53] RCurl_1.95-4.8 magrittr_1.5 GO.db_3.4.1 GenomeInfoDbData_0.99.0
[57] Formula_1.2-2 Matrix_1.2-11 Rcpp_0.12.12 munsell_0.4.3
[61] stringi_1.1.5 edgeR_3.18.1 zlibbioc_1.22.0 fail_1.3
[65] gplots_3.0.1 plyr_1.8.4 grid_3.4.1 blob_1.1.0
[69] ggrepel_0.6.5 gdata_2.18.0 lattice_0.20-35 Biostrings_2.44.2
[73] splines_3.4.1 GenomicFeatures_1.28.4 annotate_1.54.0 hms_0.3
[77] locfit_1.5-9.1 knitr_1.17 rjson_0.2.15 systemPipeR_1.10.2
[81] geneplotter_1.54.0 biomaRt_2.32.1 XML_3.98-1.9 glue_1.1.1
[85] ShortRead_1.34.0 latticeExtra_0.6-28 data.table_1.10.4 gtable_0.2.0
[89] purrr_0.2.3 tidyr_0.7.0 amap_0.8-14 assertthat_0.2.0
[93] ggplot2_2.2.1 xtable_1.8-2 survival_2.41-3 viridisLite_0.2.0
[97] pheatmap_1.0.8 tibble_1.3.4 GenomicAlignments_1.12.2 AnnotationDbi_1.38.2
[101] memoise_1.1.0 bindrcpp_0.2 cluster_2.0.6 brew_1.0-6
[105] GSEABase_1.38.1

diffbind r chipseq • 2.3k views

ADD COMMENT • link updated 7.2 years ago by Rory Stark ★ 5.2k • written 7.2 years ago by reubenmcgregor88 • 0

0

Entering edit mode

Hi,

Could you report the output of sessionInfo() please? Some older versions of DiffBind had a bug that under some cases caused peaks to be reported at incorrect locations. (You should always include the output of sessionInfo() when you ask a question.)

- Gord

ADD REPLY • link 7.2 years ago gdbzork • 0

0

Entering edit mode

Hi Gord,

Have added to original post, Sorry I will remember this in the future. Seems it is version 2.4.8?

-Reuben

ADD REPLY • link 7.2 years ago reubenmcgregor88 • 0

1

Entering edit mode

Thanks, that looks current, so the old bug shouldn't be a problem. Unfortunately I can't see anything obvious. The default for dba.count is minOverlap=2, so presence of a peak in two samples should be sufficient.

Could you post or email your sample sheet? And the results of printing the dba objects after each of the first 4 function calls (dba to dba.analyze).

To get a report, you use the dba.report function, which has a whole lot of parameters determining what you get back. In this case, you will want at least:

th=1 so all sites, even non-significant ones, are reported
bCalled=TRUE to see which peaks were called (note you need to call dba.count with bCalledMasks=TRUE)
bCounts=TRUE to see the counts for each peak
DataType=DBA_DATA_FRAME (probably easiest to work with)

You may want to use the "file" parameter to get it all written to a file easily.

If I can't work out what's going on with all that, then I'll have to drag Rory into the discussion. He knows the internals much better than I do.

ADD REPLY • link 7.2 years ago Gord Brown ▴ 670

0

Entering edit mode

Thanks for this, I learned a bit about the dab.report,

Here is my sample sheet:

SampleID	Tissue	Factor	Condition	Treatment	Replicate	bamReads	ControlID	bamControl	Peaks	PeakCaller


1	k27aceth1	cd4	k27ac	eth	a	1	Bowtie_1_27Ac_Eth.bam	natinput	Bowtie_1_NInput_Eth.bam	27ac.eth.don1.peaks.styleregion.250bp.mindist5kb_diffbind.bed	bed
2	k27aceth2	cd4	k27ac	eth	a	2	Bowtie_2_27Ac_Eth.bam	natinput	Bowtie_1_NInput_Eth.bam	27ac.eth.don2.peaks.styleregion.250bp.mindist5kb.diffbind.bed	bed
3	k27acvitd1	cd4	k27ac	vitd	a	1	Bowtie_1_27Ac_VitD.bam	natinput	Bowtie_1_NInput_Eth.bam	27ac.vitD.don1.peaks.styleregion.250bp.mindist5kb_diffbind.bed	bed
4	k27acvitd2	cd4	k27ac	vitd	a	2	Bowtie_2_27Ac_VitD.bam	natinput	Bowtie_1_NInput_Eth.bam	27ac.vitD.don2.peaks.styleregion.250bp.mindist5kb.diffbind.bed	bed

After doing the suggested DBA.report, I can identify the peak in question and the report shows something like this;

Chr	Start	End	Conc	Conc_eth	Conc_vitd	Fold	p-value	FDR	Called1	Called2	k27aceth1	k27aceth2	k27acvitd1	k27acvitd2
chr7	22653887	22655828	7.35	-0.21	8.35	-8.56	1.21E-14	8.52E-11	0	2	0.73	1	319.78	332.63

If I am interpreting this right the normalised values are -0,21 in etc and 8.35 in vitD which is a 8.56 fold increase from eth to vitd with an FDR of 8.52E-11. I am not quite sure how to interpret the Called1 and Called2 columns however? And the final k27aceth1 etc are the counts in each sample, which in this case show k27ac going from around 1 in both etc to over 300 in vitd treatment?

Let me know if this helps?

Again thank you for your help.

ADD REPLY • link 7.2 years ago reubenmcgregor88 • 0

1

Entering edit mode

Called1 and Called2 report the number of samples in each group that had the peak identified in the original peak calling. In this example, this peak was not identified in either of the eth samples, but it was called in both of the vitd samples.

-R

ADD REPLY • link 7.2 years ago Rory Stark ★ 5.2k

0

Entering edit mode

This tells us that DiffBind is indeed finding the difference extremely significant. I'll have to leave it to Rory to understand what's happening with the output.

ADD REPLY • link 7.2 years ago Gord Brown ▴ 670

0

Entering edit mode

Ok I have an update, when I switched around my sample sheet (I did this purely as I wanted to have the vitd samples as the positive fold change values) to have the vitd samples at the top, the peak in question is now in the output of

> samples_counts.DB <- dba.report(samples_counts, th=0.1, fold=0.585)

> write.table(samples_counts.DB, "samples_counts_0.1_0.586.txt", sep="\t", row.names = F)

So the exact issue above seems to be resolved.

However, I have now spotted another issue which perhaps you or Rory can help me with?

I have histone, broad-peak data, however some of the peaks are actually rather narrow whilst some are broad. So what I can see happening now is that the broader peaks are being nicely identified as differentially expressed by diffbind, but the narrow peaks which seem highly up-regulated (though big fold changes) are not reaching FDR significance.

I wonder whether this is something you have come across before? I have tried playing around with "summits" function and "bFullLibrarySize" but not sure how I can best tackle this problem? To give you an idea here is the readout for one of the peaks in question:

Chr	Start	End	Conc	Conc_vitd	Conc_eth	Fold	p-value	FDR	Called1	Called2	k27acvitd1	k27acvitd2	k27aceth1	k27aceth2
chr2	102918383	102918857	7.02	7.79	5.23	2.55	0.00444	0.513	2	1	217.33	225.1	69.3	6

Let me know if you would like me to re-post as a separate question?

ADD REPLY • link 7.2 years ago reubenmcgregor88 • 0

0

Entering edit mode

Changing the order of lines in the sample sheet will change the order of the (auto-generated) comparisons (i.e. control vs treatment as opposed to treatment vs control) (please confirm or correct, Rory). But if that's changing what shows up as significant, sounds like something's wrong. Rory? Any thoughts?

As to the new question, yes, a new question would be a good idea.

ADD REPLY • link 7.2 years ago Gord Brown ▴ 670

score 0 · Answer 1 · 2017-09-06

0

Entering edit mode

Rory Stark ★ 5.2k

@rory-stark-5741

Last seen 4 weeks ago

Cambridge, UK

If you could email me the DBA object "samples"

> samples <- dba(sampleSheet=diffbind_samples, peakCaller = "bed", peakFormat = "bed")

I can have a look at what is going on.

Cheers-

ADD COMMENT • link 7.2 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Thanks Rory, I have emailed it now

ADD REPLY • link 7.2 years ago reubenmcgregor88 • 0